Episode: 1079
Title: HPR1079: Distributed Systems Podcast
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1079/hpr1079.mp3
Transcribed: 2025-10-17 18:39:03
---
Welcome to the Distributed Podcast. My name is Jonathan Oliver.
And I'm Rinat Abdullin. Today we're short a couple of co-hosts due to some scheduling conflicts,
but we really wanted to introduce a guest here. He is at LMAX; he's one of the co-architects,
I guess, of the LMAX Disruptor project. And we're very excited to talk about that. So with that,
we will ask Mike Barker to introduce himself. Yeah, hi, I'm Mike. I've been a
developer at LMAX for about three years. My current role is sort of head of the performance
and stability team. And yeah, I was quite closely involved in the original design
of the Disruptor, but there's also a large amount of infrastructure code that we built around it:
our own reliable messaging and our own event-sourced architecture, for which the
Disruptor forms a fairly key part. Now Mike, tell us a little bit about you, some of your
development background, you know, what platforms have you worked on, what programming languages,
you know, a lot of our audience is fairly uniform in terms of the programming languages that
they like to use. So we're curious to just get an outside take here and get some of your background.
Yep, okay. I'm mostly a Java guy. I probably picked up Java first when I was at university,
about 12 or 13 years ago. I've done a fair amount of other stuff as well: some pieces of
C and C++ and the odd bit of .NET here and there. But mostly a Java guy. I worked for Nortel initially;
that was my first job, doing network management systems. Then I spent a fair bit of time doing
just sort of consulting-type work and systems integration and that type of stuff, a little bit of
J2EE, mostly sort of core Java stuff. And then eventually I made my way to
LMAX and got a lot more involved in the high-performance and concurrency space.
Awesome. Now what platform are you currently working with? What's your development platform of
choice as far as operating system? We're pretty much Linux-based across the board. You know,
our development environments run on Fedora, our production environments run on
Red Hat Enterprise. It's all Java-based, a lot of core Java stuff, and we use the Resin
application server as our main app server for the web side of things.
Nice. Nice. And I'm just curious what your IDE is and stuff, because on the .NET side,
it's always, you know, Visual Studio, Windows, SQL Server.
Yeah. I'm an Eclipse guy. We have quite a few people that use IntelliJ, and it's always,
it's always interesting to have various discussions over which ones are best.
You know, for us being mostly .NET guys, anything that JetBrains puts out,
IntelliJ, PyCharm, ReSharper, anything, you know, their build server,
I've had nothing but praise for them, because they're just an incredible team.
Yeah, they do some pretty good work, I have to admit, even though I'm an Eclipse user.
Okay. Well, we can end the podcast now, then I guess.
So, well, I guess, yeah, first of all I'd like to know a little bit about the
LMAX Disruptor. Can you tell us about the origins of the name?
I've heard where it came from, but I just want to get it straight from the source.
Why Disruptor? What was the motivation behind that name?
It's got a funny story. Originally, when we looked at it, it was very similar to Phasers,
which are a library feature that was introduced, I think, as part of what fork/join is based on.
And it had a very, very similar pattern to that. So it was sort of following
the Star Trek analogy of phasers and disruptors. The other place the name came from is that
we wanted to be a little bit disruptive. We wanted to get people thinking about concurrency
in slightly different ways and thinking about computing in slightly different ways.
So, if you ever look at the website and things like that, we spent a lot of time
trying to explain how it works, and in a lot of detail, rather than just saying,
here's a great framework in a way you go, but trying to say, this is actually
how concurrent, how CPUs behave. Here's how various parts of concurrent systems work,
and this is how the Disruptor works as a result of that. So, it's not just about,
here's a great tool, but here's a way to improve your understanding of computers in general.
Nice. Okay. Well, let's, let's talk for a minute about, you know, what, what problem
were you trying to solve, and then we'll delve into some of the specifics of how it was
implemented and some of the design choices. But what was the issue that caused you to say,
we need to write... I mean, this is every developer right here: let's write a framework!
Yeah, yes, that's a good point. I mean, we probably have a little bit of
the not-invented-here syndrome where we work, but where it came from is we were looking at
reliable messaging and durable messaging for our platform. And we've got an event-sourced
architecture for our system. So, all of our core high-performance services run on the basis of,
they're purely asynchronous. They simply respond to inbound asynchronous events.
They have some sort of business logic generally sort of represented loosely like a state machine.
And as they receive events, business logic rules get executed and they emit resulting events.
So, the idea with event sourcing is that your business state inside of that service is always
recoverable by simply replaying a journal of those messages that are coming to your system.
Okay. So, now I need to pause you here for a second because the canonical event sourcing that
has been flying around, at least in the .NET space, and making its way out from there, is
that you have the, you know, domain aggregates and then you dispatch a command to this aggregate.
So, it's a command saying do something, and then once the domain aggregate decides to perform
that action, it publishes one or more events. And so, what you're saying is that events come in
and then events go out. Does that sound right? Yeah, I mean, it's largely the same thing.
I think you could interpret commands as a particular type of event. The core for us is that
it's a consistent audit stream of messages that are then persisted to some sort of journal.
The idea being that recovery is simply about replaying that journal. So, the key things for us that
when we look at our event sourced architecture, that our journal is very fast and our business logic
is deterministic. So, given some initial state and some set of events, we will always end up with
the same business logic state at the end. So, event sourcing for us was much more about how do we
recover from failure? Okay, so failure recovery got it. Okay. And go ahead. Yep. So, when we started
looking at durable messaging, one of the things we looked at when you start looking at event sourced
architectures is you start realizing that not only is the business state inside of a service deterministic,
but actually the set of output events which has left your system is also deterministic. So, it's just
simply a function of your initial state and a set of inbound events. So, it's possible to reproduce
all of your outbound events by replaying your journal as well. So, you start looking at that saying,
well, we could implement durable messaging by whenever we send a message out, some broker type
system will persist all of the outbound messages. And then some downstream system when it wants
to recover can go to that store and pull those messages back. But we felt that since we already
had the messages being written to a journal, we actually already had a store of those events.
We simply had to run it through the business logic in order to create the outbound events again.
So, the simplest solution we came up with was just to have a ring buffer when we send our
messages out. So, in memory, we keep a cache of those events that have been sent.
And in the case where we have some downstream system goes away and comes back again and wants to
late join into that stream, it can simply re-request the messages from that ring buffer.
And that's where the concept of the ring buffer at the center of the Disruptor came from.
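As a minimal sketch of the determinism being described (my illustration, not LMAX's code; all names are hypothetical), recovery is just a fold of the journaled events through the same business logic:

    import java.util.ArrayList;
    import java.util.List;

    interface Event {}

    interface BusinessLogic<S> {
        // Must be deterministic: same state plus same inbound event
        // always yields the same new state and the same outbound events.
        S apply(S state, Event in, List<Event> out);
    }

    final class Replay {
        // Rebuilds both the business state and the outbound event stream
        // by replaying the journal from some initial state.
        static <S> S replay(S initial, Iterable<Event> journal, BusinessLogic<S> logic) {
            List<Event> outbound = new ArrayList<>();
            S state = initial;
            for (Event e : journal) {
                state = logic.apply(state, e, outbound);
            }
            return state;
        }
    }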
How big is this ring buffer?
It gets pretty big. Some cases we've got over 20 million entries in the ring buffer.
We've gone pretty much for a pure in-memory system. So, performance being fairly crucial
to our environment, the only time we ever go to disk is to write to this very fast journal.
We try to keep everything else in memory, which brings in some really interesting design constraints.
So a lot of the design of some of our high-performance systems is about understanding the
problem domain really well, trying to understand what the working set is, and keeping that
working set such that it's not monotonically increasing, which is probably the technical term.
If you're in a situation where you're trying to keep everything in memory, you have to decide
very clearly what sort of data you no longer need, say when a transaction occurs.
A good example is something like a trade. So, in a financial system,
you receive orders from different sets of customers. And it's only when those orders match in terms
of price that a trade occurs. But once a trade has occurred and the exchange has sent that
trade out to some other remote system, the exchange no longer cares about it;
that trade data is effectively immutable and no longer relevant to the exchange itself.
So, the idea is we are only keeping the current set of orders inside of our exchange and not all of
the trades that have ever occurred. So that way we keep our working set fairly stable. It
doesn't increase over time. It'll probably increase with the number of customers, but that's simply
a capacity planning issue. At that point, you just shard your system and you have different
sets of servers for different sets of customers, or more generally something like by instrument, for example.
So, you might have one set of servers dealing with your FX instruments and other set of servers
dealing with your equities instruments, for example. Okay, my brain just exploded.
I'm not too much into the financial markets. I do understand them well enough, but not any level of
depth that you guys would go into. Okay, so I do have one technical question, then I want to get
back to the theory behind it, the disruptor. But what kind of RAM requirements then do you have
to have? Here's all this massive stream of events, a ring buffer of 20 million buckets.
What is it, 32, 64 gigs of RAM? What are you looking at on your servers?
We're actually not that high. I think we're running at about four gigs,
with a maximum limit of about eight. Wow. Our business logic doesn't actually take all that much
memory. You can probably fit all the business logic into less than a gig, probably even less than 500
megs. The ring buffers do take a lot of space. One of the nice things about the finance industry,
the business models having been so heavily refined, is that there's a fairly good understanding of
the key inbound and outbound events. You can actually cram them down to quite small sizes. Our
actual messages are pretty small, maybe 128 bytes. Twenty million 128-byte buckets is not too bad.
Just a quick question: how do you serialize messages? Custom serializers, or a marshalling
library or format? We do custom serialization. For a given message,
we just define a marshaller, and that gets given an output stream that it writes to. When a message comes
in, we just basically, we don't actually do any serialization out to the journal. When a message comes
in off the wire, what we're storing in the disruptor inside that ring buffer is the serialized form.
Our journaling becomes really, really simple. All we need to know is the size of the message and the
bytes. We just very quickly take the serialized form out of the ring buffer and stick it into a journal
file. The unmarshalling side happens inside the business logic, and that's just basically a class
that's associated with the business object or the business message and deserializes it into
a new object. Interesting. So you guys have your own custom serializer. Was that the
not-invented-here syndrome, or were there some performance requirements related to serialization?
Yeah, probably our most expensive thing in terms of CPU is actually message serialization.
We started initially with things like Java serialization. This was before some of the more
modern ones had turned up, like Protocol Buffers and Thrift. Although when we did some
benchmarks, we think our approach is actually a bit quicker, but it doesn't have some of the
nice backwards compatibility features. You're going to open-source that one too? Probably not. It's
very tightly wired into our business logic, unfortunately. For example, Java serialization is
terrible in terms of performance. A good example: it stores the entire fully qualified name
of the class inside the serialized form. When you're dealing with 128-byte messages,
the serialized form of the name of that class can actually be larger than
the data itself. It becomes very, very inefficient. How do you re-associate the data
back with the message without at least storing something that says here is the original class
that it was serialized from? We have an integer. You just have a little byte that's associated
with the message; it's metadata. Then we have a registry of all of our marshallers and
unmarshallers: if you see this number, grab this instance of the class and serialize or deserialize with it.
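A minimal sketch of that registry shape (my illustration; LMAX's actual code isn't public and these names are hypothetical):

    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.Map;

    interface Marshaller<T> {
        void marshal(T message, ByteBuffer out);
        T unmarshal(ByteBuffer in);
    }

    final class MarshallerRegistry {
        private final Map<Integer, Marshaller<?>> byTypeId = new HashMap<>();

        void register(int typeId, Marshaller<?> m) {
            byTypeId.put(typeId, m);
        }

        // The wire format carries this small type id as metadata, so the
        // per-message overhead is one number rather than a fully qualified
        // class name, as with built-in Java serialization.
        Marshaller<?> lookup(int typeId) {
            Marshaller<?> m = byTypeId.get(typeId);
            if (m == null) throw new IllegalArgumentException("unknown type id: " + typeId);
            return m;
        }
    }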
Nice. Okay. Let me get back to the original motivations. I mean, here's this. We haven't even
really gotten into too much about what the disruptor pattern does, but what was the motivation behind
writing it? Why did you guys write another library or a framework? Yeah. Once we started,
once we wanted to do this recoverable messaging using these ring buffers, we found that
the ring buffer was effectively a replacement for a queue. So when we send a message out,
it goes into a ring buffer and then a thread on the other side does the dispatch, and that thread
will handle any request to rewind and replay any messages that need to be sent as a result of
some sort of late join. So we basically had a structure that did a similar sort of thing to a queue.
So passing messages from one thread to another. And back before we started the disruptor project,
before we started doing the whole ring buffer stuff, we did a lot of research on queues:
how queues behave and how long it takes to push a message through a queue from one thread to another,
because we have quite a few of these. We probably have 10 or so hops on an entire end-to-end flow.
And so you're talking just the traditional producer/consumer pattern, where you've got a
producer here, sticks it into a queue structure, one or more consumers read off of that and then do
their work and push stuff onto another queue. Yeah, so we spent a lot of time researching that.
We looked at all of the standard queuing implementations that we could find,
and we did a lot of latency comparisons. And what was really interesting about that
is that the latency is actually incredibly high, if you're talking about a business transaction that's
sort of got a one-millisecond latency goal. Okay, that was the other question. Your SLA is one
millisecond, is that right? Yeah, that's our goal. Okay, so if you've got, say, 10 thread hops
via a queue, we found that some of the average latencies for some of these collections was 30, 40
microseconds. You start adding those up, 10 of those, that's 300 to 400 microseconds just
moving a message from one thread to another. And that seems extraordinarily high. So we just,
once we built this ring buffer structure, we benchmarked it. Now, the first versions were terribly,
terribly slow. But when we started digging into some of the reasons for that and some of the reasons
why the queuing implementations are really, really slow, we started to discover why, and the key
thing for us was that in any concurrent system, the immediate thing that will kill any form of
performance is contention. So it's any form of shared resource where you have multiple threads
trying to write to the same piece of data. Any time that happens,
you need to either use a lock or some sort of CAS operation. And once you start introducing multiple
threads doing this, the cost doesn't just double with the number of threads; it can skyrocket.
I mean, the difference between an uncontended and a contended operation is probably three or four orders
of magnitude. And just a quick little aside: you said the CAS operation, which is compare-and-
swap, you know, compare this value. And if it does match, swap, otherwise do something else.
Yeah, so that's the CPU-level instruction, which is available in .NET and Java through
various different APIs: the Atomic APIs in Java, and I believe it's the Interlocked APIs in
.NET. So even those under fairly heavy contention can start to slow down unexpectedly. So we came
up with the idea of what's called the single writer pattern. So whenever we're doing any form of
concurrency at LMAX, we always try and focus on only ever having a single thread writing to
any one piece of memory. The idea is a thread owns some piece of memory. It might be the business
logic thread owning the business logic data. The journaller will own the journal file, so nothing
else will ever try to write to that file at the same time. And once you take that approach,
you don't need any locks. You don't actually need any form of
contention control, either CAS or locks. The only thing you need to worry about at that point is
things like ordering and visibility. And that was the design approach that we took forward for later
designs of the disruptor. In fact, there's only one place throughout the disruptor code where you
have any form of contention. And that's when you want to try and have multiple threads writing into
the same ring buffer. And then you use a compare and swap. Yeah, and then we use a compare and swap.
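To make the contrast concrete, here is a minimal sketch (mine, not the Disruptor's actual internals) of claiming the next sequence with a contended CAS versus with the single writer pattern:

    import java.util.concurrent.atomic.AtomicLong;

    final class SequenceClaims {
        // Multi-producer: several threads write to one ring buffer, so the
        // cursor must be claimed with a compare-and-swap retry loop.
        private final AtomicLong cursor = new AtomicLong(-1);

        long claimMultiProducer() {
            long current, next;
            do {
                current = cursor.get();
                next = current + 1;
            } while (!cursor.compareAndSet(current, next)); // retries under contention
            return next;
        }

        // Single writer: only one thread ever touches this field, so a plain
        // increment is enough; the remaining concerns are ordering and
        // visibility when the sequence is published to readers.
        private long sequence = -1;

        long claimSingleWriter() {
            return ++sequence;
        }
    }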
Okay, that's easy enough. So then, I mean, you're talking about one millisecond of latency through
this whole thing. I mean, let's take a typical message coming in off the wire, walk me through
this message going through all of the parts of the system then, and then finally being
published out as a result on the other side. I mean, how you said 10 or so thread hops. I mean,
what is this? What does this typical workflow look like? So, end-to-end is based on FIX.
In the finance space, the FIX protocol is fairly common, and that's the one that we're measuring at
the moment for that one-millisecond goal. So we get an order request over FIX. That comes
into a socket-based FIX server. Did you say fax, like a paper fax? No, no, FIX: F-I-X. Okay, okay.
That'll be my accent. One millisecond with a fax! Anyway, yes, okay. Yep. So we get a FIX
message off the wire. That needs to be decoded into some form of business object by the FIX server.
It has to have a degree of authorization applied to that to ensure that it's being
sent from the right person for the right reasons and basic security requirements.
It's then turned into our internal wire format, and then it's placed into an
outbound disruptor to be pushed out over multicast. We use LBM,
Informatica's Ultra Messaging bus. So it goes over the Ultra Messaging bus into an exchange
server that'll take the message off the wire. It will place that into a disruptor.
We have four or so threads hanging off of that main disruptor. The first one
obviously has to journal the message off to disk for durability. So it takes that message and it
sticks it on... It writes it to disk, is that what you're saying? Yep, to a very fast
journal file. And I would imagine it does that like in a batch? So the messages that come in during
the next 500 milliseconds, write all those to disk. Does that sound right? Uh, slightly different.
The batching algorithm we've come up with and the one that's actually baked into the disruptor
is that you basically batch as the messages arrive. So imagine something like
disk I/O, which is a bit slow, probably slower than your network I/O.
The first message that turns up, you just write straight to disk. But you'll
get a natural pause due to the latency of writing to that disk. So when you come back,
you may have a backlog of messages, and you just take all those and write them out.
Okay, so the disk is just constantly writing as fast as it can. Yeah, and you just
get this sort of natural batching effect. But it's really quite good, because if
the throughput of the disk is lower than the throughput of the network, it allows you to catch up.
And in cases where you get big bursts of traffic, it alleviates some of the queuing effects
you can get with, you know, some other types of queuing systems.
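The open-source Disruptor exposes this natural batching through the endOfBatch flag on its EventHandler callback; a journaller along the lines described might look like this sketch (event and class names are illustrative):

    import com.lmax.disruptor.EventHandler;
    import java.io.BufferedOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    // Pre-allocated entry holding the serialized bytes plus their length.
    final class BytesEvent {
        final byte[] bytes = new byte[128]; // the 128-byte message size from the episode
        int length;
    }

    final class Journaller implements EventHandler<BytesEvent> {
        private final BufferedOutputStream out;

        Journaller(OutputStream journalFile) {
            this.out = new BufferedOutputStream(journalFile);
        }

        @Override
        public void onEvent(BytesEvent event, long sequence, boolean endOfBatch) throws IOException {
            out.write(event.bytes, 0, event.length); // serialized form, written as-is
            if (endOfBatch) {
                out.flush(); // one flush per natural batch: the backlog that
            }                // accumulated while the previous write was in flight
        }
    }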
Now I would imagine that you'd have quite a bit of CPU, it'd be fairly CPU intense because here
you are taking it off the wire, deserializing it into... or do you not even deserialize it to
write it to the disk? You just say, here it is, in raw byte form, write it to the disk.
Yeah, exactly. So the form it comes off the wire in, we just add a little bit of an extra
header to it and then just write it off to disk. There's another thread hanging off
that ring buffer that's sending over to an HA pair. So our core systems run as primary/
secondary. Oh, okay, master/slave. So if one goes down, the other one just picks up
where it left off. Yeah. So we're sending a stream of messages out to that server.
That's the exact same wire format that has been written to the journal as well. So it's all
you know, very, very consistent across the board. And so that message will arrive in a secondary,
going into, you know, almost a mirror of the disruptor that's happening in the primary.
And as soon as that message is seen by a thread, it'll just acknowledge back the sequence number
that it's seen. So once we have the message written off to disk and once we have the message
seen by the secondary, we then allow the business logic thread to continue. And that's where we do
the unmarshalling and then start processing the business logic and sending outbound events.
Okay, now I have a question. It sounds like slow consumers could be problematic for the
ring buffer. So in other words, here's this network hop, and the
acknowledgement takes a long time to come. Doesn't that cause the
ring buffer to back up if the consumer, one consumer hasn't acknowledged up to a certain point?
Does that make sense? Yes, yes. It's one of the reliability guarantees that we make though. We say
we won't start processing a piece of business logic until we know we've got it on disk and we know
we've got it in a second memory space. So we do have some catches for that sort of thing. If we
find that the secondary is for some reason down, or there's a very, very slow network connection,
after a period of time we'll stop waiting for acknowledgements from
the secondary. We'll switch into a mode which we call being at risk. So our operations team
know at that point that something has occurred on one of our systems and
we're running without HA support at that point in time. And is it their job to get HA back
online as quickly as possible? And the system I would imagine automatically detects when everything's
back online and then switches back to its safe mode. Yes, it's one of the, if you ever pick up
the disruptor code, you'll find that fairly key to it is sequence numbers. So we actually do
just about everything on the basis of sequence numbers. So there's like a global sequence.
So you're working within a strongly consistent environment where...
okay, so back up for a second. You mentioned that the incoming messages were ordered. Now I know
that coming off the wire, you know, transcontinental-type, from one exchange... I don't
know if you guys work with the New York Stock Exchange or NASDAQ or anything like that, or is it
just the London exchange that you're working with? We run... LMAX is actually our own exchange.
Oh, okay. But in any case, you know, it's coming out of the exchange.
Message ordering is not a guarantee of messaging. So where do the incoming messages
get assigned an order? We perform that role using the disruptor. So we get some
guarantees on ordering. Let me guess: everything is using the disruptor in your system? Like, the
disruptor is a pattern and you've applied it throughout the entire system. Yeah, pretty much. So all
of our messaging and a lot of our inbound message handling uses a disruptor pretty much across the
board. The ordering comes from... there are some ordering guarantees that we get from the
messaging bus itself. So if a thread sends a message on a topic, those messages will be ordered
from the same thread, if you like. So that means that if we've got something like an order placement
and an order cancel, if they've come off the same thread, they'll arrive in the same order.
But if it's off a different machine, you lose that guarantee. As long as they're on the same
topic. So that's basically the only order guarantee we get from messages being sent to us.
But basically we have two threads. We have two network buses and the messaging bus we use
has basically one thread per physical network. But they are writing into the same disruptor
and the disruptor forms the role of the sequencer at that point. So it's once they're in the disruptor,
that's now our global order for all of our messages. And for that point on, it's just a single
stream of events all the way through to replicating to our HA pair, replicating out to our
disaster recovery site, and being written to the journal. Okay, so now this global order: I mean, an integer seems
rather limited at this point, at least a 32-bit integer. I mean, is it like a 64-bit integer,
is this like 128-bit number that just keeps getting bigger and bigger? It's a 64-bit number.
And how close are you to the end of, to wrapping around a 64-bit number?
We estimate, I did a little bit of an estimate actually the other day. And at about 100 million
transactions per second, it'll take about 1400 years. Wow, how many transactions a second? The
1400 years was impressive, but how many transactions a second? That would be at 100 million transactions
per second. Okay, so if, but you guys aren't doing anywhere near that. No, not even close to that.
I was going to say that is a big, I mean, all the financial markets in the world don't do anything
like that, so. Yeah, so it's plenty. And by the time we get close to that, we probably have
native support for 128-bit. Oh, yeah, yeah, but you won't be in charge of it at that point,
so it's somebody else's problem. Yeah.
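As a rough back-of-envelope check (my arithmetic, not from the episode): a signed 64-bit sequence gives 2^63, roughly 9.2 x 10^18 values, and at 10^8 events per second that is about 9.2 x 10^10 seconds, on the order of 2,900 years; the 1,400-year figure quoted above is the same order of magnitude under slightly different assumptions.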
Okay, so I interrupted; we were talking about this message, and we'd figured out the ordering,
the global sequence number, a 64-bit integer. And then, you know, it's being written to disk.
There's an I/O thread. What happens if there are no messages? Is there just like a thread spin that you've got
in there where? The disruptor comes with a couple of different strategies for this. So one of the
things that we found is that once you start using the single writer principle,
you can separate how you wait for messages from the correctness guarantees.
So we have a number of different options. We have one where it will sleep, we have one where it
will use a lock and notification, a wait/notify-type approach, we have one where it calls
Thread.yield, and another one which is just a plain hard busy spin. So given
different platforms and different available hardware resources, that should determine which one
of those you use. A good example turned up on the mailing list the other day: somebody
had four threads, one disruptor producer and three consumers, all running on a
dual-core machine with hyper-threading, so that's four logical cores. And they were seeing some very,
very weird latency results running some of our performance tests. And the interesting thing is
they had selected, well, we had, it was in our performance test, we had selected the busy spin
strategy. And while you have four logical cores, all the threads should be able to spin
independently and quite easily. What you actually find is with hyper threading, if you've got a hard
spin, it will actually steal resources from the other hyper thread, because it will speculatively
execute that loop. So it's seeing that condition in that loop, and it will pre-fill
the instruction pipeline with a whole bunch of instructions, starving the other hyper thread
of resources. So if you try and use a busy spin on a system that's only just got enough hyper threads
to support all your threads, you start seeing weird latency numbers. So yielding works better
in that situation. So you're not a fan of hyper threading then, are you? It depends on what you're
doing. There are some neat things you can do with hyper-threading. There would be a way around this:
there's an instruction provided on x86, anyway, called PAUSE, which allows a busy-spinning thread
to tell the CPU to not speculatively execute the condition for that loop. Unfortunately,
it's not available in Java. So the one that works best in that situation is our yielding strategy.
You apply that, and all of a sudden the latency starts coming back down. But if you've got plenty of
physical cores, then the busy spin ones are fastest. And then you just have the standard locks and
stuff. If you're very, very starved of resources, say you're in a web server and you've got quite a
lot of publishing threads or quite a lot of disruptors running, then you might want to think about a
locking strategy or even maybe a sleeping strategy, depending on latency requirements.
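In the open-source Disruptor these options are pluggable WaitStrategy implementations; a construction sketch against the 3.x-era API (the sizes and the choice logic are illustrative, and BytesEvent is the sketch from earlier):

    import com.lmax.disruptor.BlockingWaitStrategy;
    import com.lmax.disruptor.BusySpinWaitStrategy;
    import com.lmax.disruptor.WaitStrategy;
    import com.lmax.disruptor.YieldingWaitStrategy;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.dsl.ProducerType;
    import com.lmax.disruptor.util.DaemonThreadFactory;

    final class WaitStrategyChoice {
        static Disruptor<BytesEvent> build(boolean sparePhysicalCores, boolean starvedOfCores) {
            WaitStrategy strategy =
                sparePhysicalCores ? new BusySpinWaitStrategy()  // fastest, burns a core
                : starvedOfCores   ? new BlockingWaitStrategy()  // lock and wait/notify
                                   : new YieldingWaitStrategy(); // Thread.yield, kinder to hyper-threads
            // SleepingWaitStrategy is the fourth option mentioned here.

            return new Disruptor<>(
                BytesEvent::new,              // pre-allocated entries via the event factory
                1 << 16,                      // ring size, must be a power of two
                DaemonThreadFactory.INSTANCE,
                ProducerType.SINGLE,          // single writer principle
                strategy);
        }
    }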
Are you able to run the disruptor then as a framework, of sorts, for a web server?
Yes, we're using it in our web server. So we get an HTTP
request come in, we do a bit of business logic, and then when we publish the message off to
our core services, the publishing component has actually got a disruptor underneath it that does
all the message dispatch. I believe that somebody's started an open source project to actually
write a full HTTP server using the disruptor to do all of the thread management.
Wow. Okay, so sorry, getting back to this: we've got the disk, the message is journaled to the
disk. That's really the starting point for where things begin to happen, right? But you said that
the disk write is done concurrently with other executions. So the business logic is actually
executed... you know, the message arrives off the wire, it's journaled and executed?
It's journaled and replicated. So when we did our initial design,
we focused a lot on the SEDA architecture. We have a number of stages. Originally it was sort of:
here's a journaling stage. Once we're done journaling, we'll do replication. Once we're done
replication, then we can do business logic. But one of the advantages of the disruptor is that
you can actually parallelize a lot of those sections if you want. So we see journaling and
replication as being sort of at the same level if you like. And so those two actions can happen
in parallel. And they've got different resource usage: one's using a network resource,
one's using a disk resource. So there's no contention there, which is great.
But the business logic needs to wait on those. So with the way you set up a disruptor,
you say: right, okay, I've got those two threads, the event handlers as we call them;
I want to take the minimum sequence of those two. And once that minimum sequence is available,
I can then start the business logic. So once it's been journaled or replicated?
Once it's done both. So we take the minimum of the two. Got it. So let's say you've pumped in
messages 5 through 10 into your system. You've managed to journal up to message 8, but you've
only managed to replicate up to message 7. The business logic will come around and say: right, okay,
the minimum of those two sequence numbers is 7, I can process messages 5 through 7 now, and it
progresses that way. So that way we have a sort of gating, but the gating is very, very lightweight.
It's just looking at sequence numbers.
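In the open-source Disruptor's DSL, this gating shows up as the classic diamond wiring; a minimal sketch (handler wiring only, reusing the BytesEvent sketch from earlier):

    import com.lmax.disruptor.EventHandler;
    import com.lmax.disruptor.dsl.Disruptor;

    final class DiamondWiring {
        // journaller and replicator consume the same ring buffer in parallel;
        // businessLogic is gated on the minimum of their two sequences.
        static void wire(Disruptor<BytesEvent> disruptor,
                         EventHandler<BytesEvent> journaller,
                         EventHandler<BytesEvent> replicator,
                         EventHandler<BytesEvent> businessLogic) {
            disruptor.handleEventsWith(journaller, replicator)
                     .then(businessLogic);
        }
    }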
And as far as the sequence numbers go: here are these consumers reading the sequence number,
and at least on a multi-core CPU, or even dual-socket CPUs, things like that, that number
may not be representative of the true state of the system. So is it just that the next time it
comes around, it gets the real value? You know, here it's reading through 7, but in reality 8 has just been
written, and so it just comes around the next time: oh, 8's available now. Is that what happens?
Exactly. So this is where the ordering and visibility constraints come in: you just want to
make sure that the sequences are always going to be increasing.
You will only ever see messages that have been completed. You won't
accidentally see a message, then have the number go backwards and need to do something tricky
to figure that out. So the way that it works is, it does any piece of
necessary work that needs to be done, and then it will update its sequence number. And that's where
the concurrency, the concurrent operations come in to ensure that we order those two operations.
So we make sure we order the writing into the ring buffer relative to the increment of the
sequence, so that the sequence doesn't increment until after that message is
actually there and available to be read. Okay. And then it executes the business logic,
at which point it produces one or more events which go into, let me guess, a ring buffer.
And then you have however many consumers on the other side. You know, the cool
thing about event sourcing here, at least in this particular case, is that if the machine
resets or something, a hard crash, whatever, and the messages never
make it out, it never happened. And so you just load it back up. How long does it take to load back up?
I guess you have the master/slave scenario, so it doesn't really matter.
But how long does it take to bring a system online,
if the machine reboots, whatever happens? That largely comes down to how well
implemented the business logic is. We did have a problem at one point where it took quite a
long time, in terms of hours, and we did some optimization work.
We're talking minutes normally. Okay, and is it just like using a snapshotting kind of
mechanism where you say, here's the state for this particular symbol or whatever in the financial
market for this symbol, here's the state at this point in time. So you just snapshot every
few hundred thousand events or something? We snapshot on a daily basis. So overnight, we have a
short downtime, around about five minutes at the end of the day, and we do a few operations;
one of them is just to take a global snapshot. So we take the entire system and say, here's the
memory state. So in a case of failure, it's just: grab that snapshot and grab the next load of
journals and replay them through. And in your case, like downtime in the middle of the night,
doesn't matter, does it? No. Okay, so then it's a perfect opportunity. If you're doing like a 24-7
global trading thing, a five-minute downtime would be catastrophic. Yeah, I mean, there's talk
of trying to move away from that. There are strategies that you can employ as well. We were thinking
about doing something. One of the things I've had in my head is having a tertiary. So you have primary
secondary tertiary, and then you can take the tertiary down and snapshot that and leave the
primary secondary up and running. So you'll just have a snapshot at a particular sequence number.
So this is the state at that point. Yes. And then if those, if the primary secondary go down,
they can grab the snapshot from the tertiary and say: oh, that sequence number is where it was
snapshotted? I'll just start from the next sequence on in my journals from that point. And then
you can run hot overnight as well. Now, when you bring everything back online, you're bringing
the state of the entire system back into memory, is that correct? So it's not like, I mean,
if this was a Facebook scale thing, I mean, keeping everything in memory, well, then again,
they do have lots and lots and lots of servers. But let's just imagine for a moment that they
only have a few dozen servers. Keeping the entire state of the entire system in memory would be
untenable. And so you would have to be working with a disk quite a bit more. I mean, have you ever
considered an architecture like that where let's, you know, here's the short lived business
entities that have, you know, they live for a couple of hours, maybe, and then they just die.
And so we don't need them in memory anymore. How would you do a system like that?
That's something we already do. And that's not something the disruptor itself solves there.
For us, that's just sort of a business modeling problem.
You look at the particular business entities and what their life cycle is.
Again, talking about the difference between orders and trades. So when a trade occurs,
we don't hang on to that. We send a message out saying, here's the trade. That goes out to our
system that reports trades out to a clearinghouse, and that persists it off to a database.
At which point you remove it from memory and it's gone and it's done. And it's only there for
reporting and analytical purposes. Yeah. And they can go through a different path, you know,
trying to do some of the analytics across a whole big series of trades.
Actually, databases are a really good tool for that. It fits quite nicely into a relational model.
So we just use a relational database for it. Okay. So then, talking about object life cycle here,
that also brings up an interesting question, at least in garbage-collected environments,
you know, the Java/.NET types. You're working with lots
and lots and lots of objects here. What does the garbage collection latency
look like? And how does that impact your system? It's our biggest cause of latency. It's something
we probably haven't paid as much attention to as we should have early on. There are a number of
strategies that you can employ. We're looking at various third-party JVM implementations that
have better garbage collectors than the one that comes out of the box as a potential short-term
solution. Longer term, I did some experiments, and actually one of the things the disruptor
does give you, over and above some of the standard queuing models, is it allows you to build
zero-garbage systems. So when you build up a disruptor, you supply it a factory to pre-build
the entries into the ring buffer. And that's something that we do. We pre-allocate a whole bunch
of byte arrays there. So when a message comes in, we're not actually allocating anything. We're
simply copying the message from the wire into the byte array inside of the disruptor.
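That pre-allocation is what the Disruptor's EventFactory hook is for; a sketch of publishing by copying into an existing slot (reusing the BytesEvent sketch from earlier; names are illustrative):

    import com.lmax.disruptor.EventFactory;
    import com.lmax.disruptor.RingBuffer;

    final class PreAllocation {
        // Called once per slot when the ring buffer is built, never on the
        // hot path; publishing later means copying into an existing entry.
        static final EventFactory<BytesEvent> FACTORY = BytesEvent::new;

        static void publish(RingBuffer<BytesEvent> ring, byte[] wire, int length) {
            long seq = ring.next();               // claim a slot
            try {
                BytesEvent e = ring.get(seq);
                System.arraycopy(wire, 0, e.bytes, 0, length); // no allocation
                e.length = length;
            } finally {
                ring.publish(seq);                // make it visible to consumers
            }
        }
    }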
Yeah, in the disruptor there, I think... I just had a little aha moment: the messages in
the buffer are pure bytes, is that right? In our case, yes.
Okay, because here I was thinking of, like, object references going through the entire system,
at least in one process space. Yeah, you can, I mean, you can do both. The disruptor
allows for either. But if you want to go for very, very low-garbage systems and very
latency-sensitive systems, we find the best approach is to use the disruptors for handling the serialized
forms of the messages. So in a typical system, if you look at it from our perspective, you see,
there's a disruptor for all the messages coming in, and that's storing the serialized form
that they arrived on the network in. And then there'll be a disruptor for messages going
back out, the outbound events, and they've actually got the serialized form to be written out onto
the network. So we do all the unmarshalling in the same thread as the business logic. That sort of
seems to be the easiest way to give you a very deterministic memory usage for entries into the
disruptor. Interesting. Okay, I'm just going through that. The only time you ever really
need to deserialize something is when you're doing the business behavior,
because it then needs to be in object form, and then finally you just serialize it back to the
byte stream. And then maybe, like, a translation going back over to the exchange or to
whatever, to interact with an external system. Yeah. So basically, if you look
at most of our disruptor usage, it's basically storing the serialized form. There are a couple of
techniques that we've been playing with in terms of actually reducing the garbage created on deserialization.
There's a couple of tools floating around out there; Javolution is an example of one.
Rather than creating an object from the serialized form via some sort of stream, you actually do
sort of an overlay-style behavior. So imagine you have an interface that defines
your business object: you can push a byte array or a byte buffer behind that, and you say,
okay, get me the ID. All it knows is the offset into that underlying byte buffer,
and it will just take the eight or so bytes that are needed, pull those out, convert those into
a long, and return them to you. So the only time you're actually
going in and creating any data as such is when you get down to the primitive level of
your object model. It's an interesting approach. I've prototyped it and it works very, very well.
You do have the constraint though from the developer perspective that a message only has a limited
life cycle. It's only valid for the lifetime of the request. So if you need to take it out and store
it somewhere, you need to construct an object to represent it. But if you model your problem well
enough, you know, they'll probably have a natural home.
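A minimal sketch of that overlay style (my illustration, not Javolution's actual API; the field layout is hypothetical):

    import java.nio.ByteBuffer;

    // Reads fields at fixed offsets straight out of the buffer; no object
    // graph is built, so deserialization allocates nothing.
    final class OrderFlyweight {
        private static final int ID_OFFSET = 0;     // hypothetical layout
        private static final int PRICE_OFFSET = 8;  // hypothetical layout

        private ByteBuffer buffer;
        private int base;

        // Re-point the flyweight at a message; valid only for the lifetime
        // of the underlying slot, per the constraint described here.
        void wrap(ByteBuffer buffer, int base) {
            this.buffer = buffer;
            this.base = base;
        }

        long id()    { return buffer.getLong(base + ID_OFFSET); }
        long price() { return buffer.getLong(base + PRICE_OFFSET); }
    }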
Now, all of these things are extreme optimizations that you found for your business domain. You know, have you
benchmarked this system in terms of number of messages per second per core? I mean, what does this
typically look like? Yeah, we've got quite a decent set of
performance tests. So we try and run as close as possible to our live load.
Our total throughputs are not as high as some exchanges, but the really interesting
perspective is when you actually look at burst traffic. You can get a lot of messages effectively
arrive in the same TCP packet. So while you may only be dealing with one to two thousand messages
per second, you can often run into situations where you have 30, 40, 50 messages turn up in the same
microsecond. So this is where a lot of the performance stuff comes in, especially around some of
the smart batching algorithms. So when you get this big burst of messages, you can deal with it as
efficiently as possible. So we've built performance tests that look at the sort of the bursting
behavior we're seeing in production and try and replicate that in our performance tests as well
and try and see if we can get sort of similar latency numbers in our performance testing environment
that we see in production then optimize from there. I guess the big thing here for you guys,
of course, is the latency. And then as far as messages per second, you can always just
scale this thing out to more machines to achieve higher numbers.
Yeah, I mean our throughput's fairly stable. For our throughput to jump a large amount,
there actually has to be some sort of business change. So either we're putting new instruments
into the system for people to trade on, or provisioning new market makers; those are the main things that will
drive additional volume. So our throughput numbers are fairly predictable. So normally with the way
we do performance testing is we don't say how many messages per second we can do; we normally say:
here's our current production flow, let's put a factor on
that, 2, 3, 10, whatever. We need to be able to do that; now let's make the latency acceptable.
So it's all about the latency. But even so, I mean in terms of messages per second, I'm just
curious is it, you know, 10,000, 20,000, 50, 100 million messages per second per core that you're
able to do? In terms of the whole system, end to end? Let's do just your business logic, and then
let's expand it out and do the whole system. The business logic, it's gone up and down.
Our initial implementation was able to do millions of messages per second. As we've added
features, that's come down; we've been doing optimizations to bring that back up. It's somewhere
between half a million and 2 million messages per second. That is so cool. The really interesting
thing is that most of that doesn't come from high-end optimizations. Most of it comes from, one,
being purely in memory, just going for a pure in-memory model, and being really, really careful with
your domain modeling. So really understanding the business problem, designing it really well,
and then standard computer science 101: the right data structures and the right algorithms can get you
to those sorts of levels. And a lot of people are really surprised by that, and that's on a single thread.
So we actually only have one business logic thread in our system, which surprises a lot of
people. People think, oh, modern software, you have to go very heavily concurrent. But we
actually find that you can do quite a lot on a single thread, especially if you're avoiding any of
the complexities of locking and CAS operations or any form of concurrency or concurrent structures.
I mean, it sounds like the disruptor pattern is just non-blocking on steroids.
Pretty much. Non-blocking, not just disk I/O; non-blocking at the CPU level, non-blocking memory,
I mean, you don't even need software transactional memory, STM, anything like that. It's just
non-blocking all the way through the entire system. That is so cool.
Yeah, so that was the basic idea of the design is just to move messages between threads as fast as
possible. Okay, so then from an end to end perspective, how long, what is your, you said your
goal was one millisecond end to end? Did I hear that right? Yeah, and I would imagine you're
achieving that. We're getting close to it. We're getting close to it. We probably can't officially
release our numbers, but our mean is fairly close to that goal and our two nines and four nines
stretch a bit further out, with the biggest constraint on that just being garbage collection.
So our biggest challenge at the moment to bring those numbers down is
either creating less garbage or finding a way to bring in a garbage collector that's better
than what we currently have. And when you say end to end, this is still traveling
between multiple machines; there are several network hops in between. Yeah, so there are four network hops.
So there's a hop from our FIX server that's doing the original message decode to the exchange.
There's the replication hop and acknowledgement back. And then there's the hop back to the FIX server
to send the acknowledgement of the order back out to the end user. So, four network hops there.
That is so cool. And all approaching one millisecond, that's incredible.
You know, because here we are, at least in my domain, you know, if it's half second,
that's probably pretty good, you know, per message. It's just no big deal. We can do better than that,
but hearing about this kind of volume is always interesting. Because at that point,
it drives your architecture a lot more profoundly than it does in, at least, my case,
because our latency requirements are so relaxed. It's like, oh, as long as it happens. I mean,
for us, it's the reliable messaging, not necessarily the latency, that really gets us.
Yeah, it does focus the mind. With a lot of the high-performance
stuff and the low-latency stuff, the most obvious thing is it's all about simplicity. It's: okay,
here's my flow end to end, what can I remove from it? The best optimization is the one where you
just don't do something. So especially talking to some of the other people in the industry,
largely it's about pulling out parts of the infrastructure that get in the way. A good example,
and one that comes up a lot especially on just the low latency networking part is kernel bypass.
So you can actually go faster if your message doesn't have to travel through the kernel.
So there's a couple of networking network card vendors, for example, that do a fully
in user space TCP stack. So their device driver just simply talks to a chunk of memory,
bit of memory mapped IO to grab the packets off the wire, and then it does all the TCP decoding
in user space to avoid all of the costs of the kernel in terms of doing all device management.
Now I do have to ask about the operating system choice then. Here, I've worked with Fedora,
I've worked with Debian Ubuntu, various flavors of Ubuntu, but I'm curious why the choice for
Fedora slash Red Hat Enterprise? It was just what we were using at the time, it seemed like the
best choice when we made it. We've been running the same platform production for quite a few years now,
and we've got a big hardware refresh coming. So we're going back to the drawing board a little bit,
looking at what's going to be the most appropriate solution. The biggest driver of that nowadays is:
what JVMs does it support? There are a number of other Java virtual machines on the market;
Azul has one, IBM has one that we're looking at. And what platforms do they support?
That is one of my key constraints. If it doesn't run on Ubuntu Server, then maybe we shouldn't really
be considering Ubuntu Server. Just about everything runs on Red Hat Enterprise. SUSE Enterprise
server seems to be another common one that people are willing to support. So that's probably the
biggest driver. Fedora is quite nice and the fact that you can stay very, very up-to-date with the
latest kernels. One of the things that we really want to do in production, we're still running a
fairly old kernel. We want to go to some of the newer kernels, to get some of the low-latency
preemptive scheduling changes that have come in more recently, which should give
us a performance boost. As far as Linux goes, I'm going to throw this out there. What kind of
performance would you expect out of the disruptor on a Windows-based system? Dozens of messages
per second perhaps? It's pretty good, actually. One of the interesting differences between
Windows and Linux for server-side messaging and things like that is that when you try to just get
messages passed backwards and forwards, we actually find that the Windows scheduler is better,
which is really surprising because people talk a lot about Linux and how good the scheduler is.
And actually, just doing a simple ping-pong-style test (we were testing just with a lock, in fact)
and measuring the overall runtime: if you run the same test without using thread affinity or anything
like that, the result on a Windows system is actually about four or five times better than on Linux.
One of the things you can also apply to that is you can actually set up thread affinity. When you start
setting thread affinity, everything comes back down to the same level and is quite a bit faster again.
So it's interesting that at times, in certain workloads, the Windows scheduler is quite a bit better.
So in your case, with thread affinity, what you're referring to is that here's this thread
and it owns this particular task. You never have to deal with swapping and context switching and
bringing it in and out of main memory and so forth. Is that correct?
You're just telling the kernel in the strongest possible way to say,
keep the thread on this core because it's going to do the same thing for quite some time and I
want its cache to stay hot, and I want it not to be moved about.
Do your CPUs have really good-sized level 1 and level 2 caches on them?
Yeah, we're using some of the Xeon chips and stuff like that, which have
much larger caches than the standard desktop chips.
The biggest cost in terms of context switching is not the cost of the OS context switch itself.
It's the fact that if you're on one core and your thread gets swapped out and brought back on
a different core, it's the cost of having to rebuild the cache. So it's the number of cache
misses you get by having your thread rescheduled quite a lot. And we found that with the
Windows scheduler, without any form of thread affinity, the threads tend to stay roughly on the
same core. With Linux, it seems to hop them about quite a bit and you start seeing quite a big cost
in terms of those cache misses coming through and that's where you see the additional performance
hit. You know, and here you talk about keeping everything in memory, you're not just talking about
like main RAM space; you're talking about keeping as much as you can in L1 and L2.
Yeah, well, it's trying to design your algorithms and trying to think about your data structures
in ways that you make life easy for the compiler and the CPU to keep those sorts of things hot.
A good example is that internally, inside the disruptor, we use arrays rather than linked lists,
for example. So as your threads are moving through the disruptor, picking up the events,
you start to get things like cache striding benefits. So it takes the whole array and it puts
it into the L2 cache if possible. The other thing a CPU can do is, if you're moving very
predictably through memory, if you're going in a linear sequence, and I believe if you get two
cache misses in a row or two page faults in a row (I can't remember exactly which),
it will go: ah, you're heading in that direction through memory, what I'll start doing is
pre-fetching those cache lines. So by the time you get there, it's already hot in
cache. Whereas with something like a linked list, because the pointers in the linked list can be
to anywhere in memory, there's no way the CPU can see any sort of predictable memory access profile,
so it can't do any sort of pre-fetching or any optimization of your algorithm.
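A miniature illustration of the point (my sketch; the comments restate the prefetching argument rather than measured results):

    import java.util.LinkedList;

    final class CacheFriendliness {
        // Array traversal: addresses advance by a fixed stride, so the hardware
        // prefetcher can pull upcoming cache lines in before they're needed.
        static long sum(long[] values) {
            long total = 0;
            for (int i = 0; i < values.length; i++) {
                total += values[i]; // sequential, predictable access
            }
            return total;
        }

        // Linked-list traversal: every node is a separate object at an arbitrary
        // address, so each step is a pointer chase the prefetcher can't predict.
        static long sum(LinkedList<Long> values) {
            long total = 0;
            for (long v : values) {
                total += v;
            }
            return total;
        }
    }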
Now, you've talked a lot about Java, and at least in the Java world here in the last year or
so there's been quite a bit of talk regarding Oracle and licensing and all these kinds of things,
even to the extent that Google, which has largely used Java, has considered moving off of it.
And I was curious if you guys had even any considerations along those lines, you know,
is Java an old language, would you upgrade to something newer, or is it just been working great
for you guys, and let's keep going with it? You know, it works for us, and it works well,
though I personally am not in that space. I see a lot of talk in the Java community about
newer programming languages; there's a lot of talk about Scala and Clojure and
Groovy, all those sorts of things. And I was originally, you know, a big fan of some of these
things, but more and more, especially working with people like Martin Thompson and my boss,
a guy called Dave Farley, the more I work with these sorts of guys, the more
I'm focused on what I'm doing and actually understanding the problem and modeling it.
The language almost sort of fades into the background. Often you're, you know, often you'll
come across a pro, you know, across a problem and say, oh, if I had Scala, it'd be really easy,
I'll just create a closure. But I've got a very awkward implementation in Java, but I find
actually starting to focus on actually what's the problem I'm trying to solve. How does this actually
work? What's the business model? The model starts to evolve, and then you write the Java code,
and it's actually still pretty simple. So I personally find that when the argument to
move to another programming language is, for example, that it's easier, it's often the case that
people haven't really put the effort into thinking about the business problem well enough to
model it really well in Java. So it's one of those things that works for me. I'm fairly
confident that Java, for businesses anyway who are making an investment in it,
will be fine. It'll be well maintained. They're throwing lots and lots of people and resources into
testing and development to keep the Java ecosystem going. I know they've done some damage to the
community, and I believe that they're actively trying to repair that. Yeah, with Oracle, you never know.
Yeah. Okay. Well, okay. So then another question I have is regarding the testing strategy,
because here you have this unfiltered byte stream coming in from your FX server. And, you know,
rogue packets or, you know, just malformed messages, things like that. How do you prevent that?
I could see a situation where you have this malformed request coming in, and everything's in
memory, and then all of a sudden your business logic throws an exception or something
like that. How do you prevent that? Or is that even a problem?
Up-front validation and a lot of automated testing is basically our approach. I know you could
use things like STM and stuff like that, so if you get a message in
that messes up your state, you simply roll back to the previous version and discard the whole
event. We tend not to do that because of the high memory allocation and garbage collection costs
of building a system like that. So we spend a lot of time doing a lot of automated testing.
To talk about our process for a little bit: the way we generally develop is very story-based.
So we have a new piece of functionality come through from the business, they want us to add.
We do some modeling, we try and understand the problem. And then we build out a set of automated
acceptance tests for that that run end to end. We then write those tests, put them in, they'll
fail initially, do the necessary work to get them to pass, and then commit the whole lot into
our source control system, which then deploys several copies of the whole system onto a fairly
large grid of machines and then just runs all of these tests. We've got about 4,000-odd,
and it takes around 35 to 40 minutes to run all of those tests.
But the interesting thing is that by doing a huge number of tests, especially on a single system,
we get a lot of the noise, if you like. So we don't really see too many problems with
corrupted packets. But the interesting one is, it finds a lot of concurrency bugs. We found
concurrency bugs in our messaging products and in MySQL, just as a result of the fact that every single time we do a commit
to our source control system, we're bringing the whole system up and running a huge barrage of
tests concurrently. So we have maybe 20 or 30 clients driving a single system at a time,
each running their own individual tests. And so at that point, it drives out all these concurrency
issues. And you said you use MySQL; is that the database you guys are using just for message
reporting and so forth? Yeah, storing trade histories, users' own order histories,
just the basic CRUD-style things like instrument metadata, account information,
and principal information, and things like all of our financial reporting; those sorts of things
we use the database for, because it's on a different cycle and it's got different
requirements, and also things like financial reporting just fall naturally into the
relational model. Now, as far as the database size, then, I mean, are you starting to run into
other little issues with the database where you have to start sharding the load because you have
so much going into it? Or are these tables getting extremely large, you know, terabytes in size?
What does your sharding strategy look like there? Actually, the volume of data
in our databases is a lot lower than we expected it to be. Early on, we did put a lot of engineering
effort into it, and now that we're actually in production, having put a lot of resource into the database,
it's barely being touched. So, you know, we're doing a bit of a refresh and a re-architecture
fairly soon; we're going to pull away some of that resource and make it available for other systems.
Our really heavy-volume I/O is actually all against generally very simple files. We tend not to
use databases for the stuff that we're trying to write 1,000, 2,000 times a second. We find
journal files tend to be a much better solution for that type of problem, because they're
often written sequentially and often read sequentially as well, so we very rarely need any of the
advanced indexing and searching features.
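A minimal sketch of the journal-file idea (assuming nothing about LMAX's actual format): events are appended to a single file strictly sequentially and forced to disk at batch boundaries.

    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class JournalDemo {
        public static void main(String[] args) throws Exception {
            try (FileChannel journal = FileChannel.open(Paths.get("events.journal"),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                    StandardOpenOption.APPEND)) {
                ByteBuffer event = ByteBuffer.wrap(
                        "order-accepted:42\n".getBytes(StandardCharsets.UTF_8));
                journal.write(event);  // strictly sequential append
                journal.force(false);  // fsync the data at a batch boundary
            }
        }
    }

And as far as these journal files go,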
what about data corruption and so forth, silent data corruption. What file system are you guys
running on? We're running ext3 at the moment, and that works pretty well. Most of the time we're
trying to run replicas. For us, the easiest way to get redundancy seems to be to do a lot of
the replication at the application level. So we've got primary and secondary, and we've also
got a DR system for the exchange; we're looking to sort of build that out. Same with our system
for storing historic market data: what we do is have a system in production and a system in DR,
and they're both making their own copies of the data, so you can do various checks to make sure
that you're not getting some sort of silent corruption or something broken in that space.
So at that point, do you guys
do any sort of hashing on the files every so often to make sure that there's
no corruption going on within them, or does the replication just take care of that
for you? We haven't done much hashing; it's something I think we should probably put in. But
a good example is that we do a lot of testing of the replication and journaling systems. So one of
the tests we do, after we've run our 4,000 acceptance tests, is we shut the system down,
take a snapshot, bring it back up, and then run a bunch of tests through to make sure the memory
state's still consistent. We then shut it down, delete the snapshots to force a journal replay,
and make sure again that the memory state's still consistent, running a few tests through it that way.
So we tend to focus a lot more on testing in the continuous integration space than we have on some
of the other things like taking hashes and doing reliability checks in production. But it's
something we're definitely looking into, and we'll probably do more of it in the future.
This is amazing. So I guess are there any other aspects of the disruptor that you wanted to
comment on and put out there? I know you guys have a number of different papers and there's an
InfoQ presentation on this, but are there other aspects that you wanted to just put out there to
help us understand a little bit more about it? I'm trying to think of what would be a good
thing to mention. I think the key thing with the disruptor, if you're wanting to look at it
to understand concurrency... I mean, the main thing for us, especially when it comes to
high-performance concurrency, is the single writer pattern. That's pretty much number one:
any way that you can avoid contention, do it, and the disruptor is an example of us doing that.
Probably the most interesting aspect in terms of performance
is the ability to reuse the slots in the ring buffer: to be able to have a choice between,
perhaps, the style of having newly constructed and immutable message events, and having
byte arrays that can be mutated and overwritten and reused in a way that's completely safe as well,
because we take care of the concurrency control there. Probably the third is the parallelization
capabilities: the idea that, like what we do with the journaller and the replication, you've
got two independent resources that are uncontended. Both of those event processors can read
from the ring buffer in a free, uncontended fashion and do things in parallel, and then
gating on these things to find out where they're up to is fairly simple, by simply looking at
the minimum of those two sequences.
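A rough sketch of that gating idea, with names loosely modeled on the Disruptor's Sequence class but not its actual code:

    import java.util.concurrent.atomic.AtomicLong;

    // Each event processor owns one of these; only that processor ever writes it,
    // which is the single writer pattern again.
    final class Sequence {
        private final AtomicLong value = new AtomicLong(-1);
        long get()       { return value.get(); }
        void set(long v) { value.lazySet(v); }  // single writer, so no CAS needed
    }

    final class Gating {
        // A producer may only reuse a ring buffer slot once every gated
        // consumer's sequence has moved past it.
        static long minimumOf(Sequence... gating) {
            long min = Long.MAX_VALUE;
            for (Sequence s : gating) min = Math.min(min, s.get());
            return min;
        }
    }

I have a question there, just to make sure I get this right.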
So you have one producer, two consumers. Consumer one needs to indicate that it's done
back to the producer and consumer two also needs to indicate or no, the producer doesn't even care,
does it, it just keeps going. There is an indication; everything is done off sequence numbers.
So if you pick up the disruptor code, there is a class called Sequence. And so
the disruptor itself has a sequence, which we call the cursor. And then every independent event
handler has its own sequence. So the way that we avoid any form of write contention is that
when a message is consumed by an event handler, rather than going back and telling the ring buffer
or the sequencer that it's consumed that message, it just increments its own counter saying,
here's where I'm up to. And that's a structure inside of the disruptor. So that way you don't
wrap around and get ahead of the slowest consumer. Yep. So with the disruptor itself, you pass to the
disruptor the sequences that you want to gate on; you say, these are the things to gate on, you know,
don't wrap around the ring buffer until you see that this particular sequence number has reached
a certain point. So we have the wrap protection built in as well, but you can even switch it off:
you can specify an empty set of sequences, so it doesn't even worry about that too much.
One of the other very, very nice benefits of doing it that way, by having the event handler
increment its sequence once it's done, is that you get some transactional behaviour that you
don't get with queues. So if you imagine, when you're trying to shut a system down cleanly at
the end of the day, one of the common things to do is to allow the system to quiesce: you turn
off the input taps and let all the messages drain out of the system, you wait until all your
queues are empty, and then you shut the system down, knowing you're in a known good state.
There is a slight corner case with that, in that the queue could be empty but the thread that's
taken the message from the queue may not have finished its business logic yet; it may be yet to
write a message out to a queue further down the line. By updating a sequence after all the work
is complete, all we have to do is just track and make sure all the sequence numbers are up to the
same value, and at that point we can be assured that the system is quiesced. So you shut down the
producer regardless and then just wait till the consumer indexes are up to that cursor level,
and then you can be assured both that the disruptor is empty and also that the running threads
have finished the piece of logic that they are working on.
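Continuing the hypothetical sketch from above, the quiesce check reduces to comparing every consumer's sequence against the cursor once the input taps are off (as a further method on the Gating class):

    // Illustrative only: after inputs are stopped, the system is quiesced once
    // every consumer has caught up to the cursor, meaning the ring buffer is
    // drained and no handler is mid-way through a piece of business logic.
    static boolean quiesced(Sequence cursor, Sequence... consumers) {
        long target = cursor.get();
        for (Sequence s : consumers) {
            if (s.get() < target) return false;  // still in-flight work somewhere
        }
        return true;
    }

This is all very fascinating. I mean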
the reading for it, as I was going through it, was just very revealing, to see the
insides of the CPU and stuff, because a lot of this stuff is so close to the metal that it
doesn't really matter, at least in a lot of what I've been working with. But in your case the CPU
architecture matters a lot. And I mean, I could see a pattern like this being adopted
straight into the JVM after a period of time. Yeah, I mean, we'd love that to happen. I know
a couple of guys from our company went over to JavaOne last year, and there were a lot of people
who were very enthusiastic about it: yeah, this should be in the standard libraries, it would be
really great. So it'll be interesting to see if that does happen. It'll take time, I think,
if it does. Java 9, Java 10 maybe. Yeah, probably; if anything, I mean, we're still
a fair way from that at the moment in Java. Maybe Java 9.
I know the 0MQ guys, I guess their hope is that it'll be adopted straight into the core,
into the Linux kernel. And so they have... go ahead. Yeah, interestingly, there actually
are other structures that are ring-buffer based floating around in device drivers and stuff.
The really amusing comment we had from one guy, I think it was actually a guy from Azul:
he was looking at our disruptor architecture, and as we were explaining it all to him, he goes,
that looks like a high-speed network card. That was really quite interesting, because the design
itself is not new at all; it's simply an application of a lot of old ideas and principles to a
new domain. And if you dig around, there are probably a lot of other places, generally in the
high-performance computing space, that have some of these designs already. So for me, you know,
I look at this architecture, and messaging is just the core piece that goes throughout; it
permeates the entire system. And that seems to be the thing with high performance, as opposed
to just HTTP and REST; everyone talks about how we can just use REST and do HTTP and your
system will be as scalable as the web, but in the end it all comes down to messaging.
Yeah, yeah, absolutely,
and probably one of the things that I find most important about messaging, and one of the things
we try to do, especially in the high-performance spaces of our architecture, is to
be absolutely strict with asynchronous behaviour. So avoid the request-reply style of messaging;
avoid having the case where you've got a thread sending a message and it has
to stop and wait for the response event. We try to be very, very strict about saying, right, we send
an event out, we're not going to wait for the response; it'll turn up sometime later, or maybe we
don't care about it, but we're not going to stop and block that thread until that occurs. We find
that's one of the biggest constraints to scalability, because often your piece of logic
is very, very short, and getting it dispatched onto a thread for the disruptor to push out
asynchronously, compared to the amount of time it has to wait for the message to go out on the
wire, across the network, into the service, back out and back to you, is orders of magnitude
different. So if you're talking about trying to scale a system up, in terms of the number of
threads and the amount of cores that you need available to process all your inbound requests:
if all of that logic processing is purely asynchronous, then actually it will scale a lot better,
because the amount of time your thread is tied up waiting for stuff shrinks incredibly.
It's one of these approaches that takes a little bit of time up front; it's a bit of a cost
initially, because you have to start thinking in terms of asynchronous models. Probably the state
machine model is one of the best that we've seen: you send an event out, you move your system into
a particular state, an event comes back with some sort of correlation ID, you find the appropriate
piece of business data, and then you just adjust its state again based on that event coming in.
Once you adopt it, you pay that sort of mental overhead, or that mental cost in terms of your
understanding, but once you've paid it, it's pretty much paid across the board. The cost of doing
synchronous request-reply type systems is that they seem very, very attractive and easy early on,
but they're ridiculously difficult to make scale well. That's one of the lessons we've learned:
we introduced in our infrastructure a sort of request-reply mechanism, and we regretted it in a sense.
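A minimal sketch of that asynchronous state-machine idea, with invented names (not LMAX's code): the sender never blocks; it records the pending request against a correlation ID and handles the reply as just another incoming event.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.LongConsumer;

    final class OrderStateMachine {
        enum State { NEW, AWAITING_CONFIRMATION, CONFIRMED }

        static final class Order { State state = State.NEW; }

        // Handlers run on a single thread, so a plain HashMap is safe here.
        private final Map<Long, Order> pending = new HashMap<>();
        private long nextCorrelationId = 0;

        long submit(Order order, LongConsumer publish) {
            long correlationId = nextCorrelationId++;
            order.state = State.AWAITING_CONFIRMATION;
            pending.put(correlationId, order);
            publish.accept(correlationId);  // fire and forget: no waiting for a reply
            return correlationId;
        }

        void onConfirmation(long correlationId) {  // the reply is just another event
            Order order = pending.remove(correlationId);
            if (order != null) order.state = State.CONFIRMED;
        }
    }

That's typically the case. Rinat, do you have any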
other questions for Mike about the disruptor pattern? Not really; I've been taking notes
throughout the entire talk, and it's like, okay, I don't have any questions, because my own
ring buffer is overloaded with information. Yeah, that happens from time to time.
When I was reading the papers on the disruptor, it was like, whoa, this is really good stuff,
you know, because non-blocking I/O wasn't a big deal, at least again for what I'm doing, but I
can see how taking that out of the picture, getting rid of the blocking I should say, at every
single level, can make it scale to crazy, ridiculous levels. And I would imagine that in most
programming languages you could achieve generally similar degrees of, I guess it depends on the
language implementation, but you know, high performance. I guess there is a .NET implementation
of the disruptor; they claim, you know, a few hundred thousand messages a second that they can
squeeze out of it, as compared to the Java one, where you guys are mentioning several million
per second. He's doing pretty well, actually. We've met the guy, actually,
Olivier, who's doing the .NET port. He's done a really good job of it as well; he stays really
current with what we're doing, I believe. I mean, it was a couple of factors different
at most; it's definitely less than an order of magnitude difference. I haven't seen his
performance numbers recently, but I know when he was first starting we were quoting
25 million and he was quoting 17 million, those sorts of differences. So it's staying very
current, and it's still providing, I think, a lot better performance than some of the internal
.NET queuing implementations. Yeah, you know, here we're excited now that with
.NET 4.0 we have a non-blocking, or I should say a concurrent, queue, where we don't have
to do our own locking and stuff; it does it all internally, but it's still just locks and
compare-and-swap operations under the hood. Yeah, I mean, it's interesting just to benchmark
these things. I think that's one of the things I find most interesting,
especially in the Java space: people tend to look at what people are telling them and then
don't do any tests. The number of times I've seen talks on concurrent frameworks and new language
features and new tools, they'll talk about how great the tool is and how well it works with
multi-core systems, but there won't be any numbers; there won't be any numbers that reflect how
much better it is. And that's one of the things we've tried to do from day one with the disruptor:
have performance testing sort of front and center, and say, you know, not claim anything unless
we've got numbers to back it up. And I think it not only makes the argument more convincing,
but it's actually pushed us to try harder and harder to actually make those numbers look better
and better. Well, now, with the disruptor, I mean, I would imagine that the pattern's fairly
well established and integrated into your architecture, and so do you spend a lot of time tweaking
the disruptor itself, or do you spend most of your time now working with the ancillary, and when
I say ancillary what I really mean is the primary systems, like the business logic and so forth?
It's more in the business logic. We have an infrastructure wrapper over that that does things
like the marshalling; it has a couple of layers, so it makes events look a bit like method calls:
you just pass a bunch of objects into a method call, and that takes care of the dispatch and the
messaging and the reliability and the re-eventing on the other side.
But that's been fairly stable for a while as well.
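A small sketch of that "events look like method calls" layer, using a JDK dynamic proxy with invented names (illustrative, not LMAX's infrastructure):

    import java.lang.reflect.Proxy;
    import java.util.Arrays;

    interface OrderService {
        void placeOrder(long accountId, long instrumentId, long price, long quantity);
    }

    public class EventPublisherDemo {
        public static void main(String[] args) {
            OrderService service = (OrderService) Proxy.newProxyInstance(
                    OrderService.class.getClassLoader(),
                    new Class<?>[] { OrderService.class },
                    (proxy, method, methodArgs) -> {
                        // Stand-in for the real marshalling and dispatch:
                        System.out.println("dispatch " + method.getName()
                                + Arrays.toString(methodArgs));
                        return null;  // strictly fire-and-forget, nothing to wait on
                    });
            service.placeOrder(1L, 42L, 10_150L, 10L);  // looks like a call, sent as an event
        }
    }

The biggest thing for us, especially from the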
response perspective now, is the garbage collection; that's what we're largely fighting. And then
for other things, we're putting a lot of effort into things like monitoring, stuff like that. So
there are little tweaks happening to the disruptor. We released 2.8 a little while ago: we ran into
a problem with the situation where you've got multiple threads writing into a single disruptor.
A lot of people using the disruptor were coming back and saying that on systems where
they've got lots of threads all quite heavily contending on the disruptor, they're seeing very,
very poor performance numbers. I was able to replicate those numbers, and so I had to
do a tweak recently to change the way that the multi-threaded claim strategy works, such that
we could deal with that contention in a fairly clean and efficient way.
And it's one of the interesting observations about contention: not only does it make things slow,
it makes things really, really complicated. If you look at the code for the multi-threaded claim
strategy versus the single-threaded claim strategy, it's just so much more complicated, because
you have to deal with these multiple things trying to write to the same piece of data,
and then you have to deal with the compare-and-swaps and atomicity at the CPU level and so forth.
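A rough sketch of why the multi-producer path is so much messier: claiming the next slot needs a compare-and-swap retry loop instead of a plain increment (illustrative only, not the Disruptor's actual claim strategy):

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.function.LongSupplier;

    final class MultiProducerClaim {
        private final AtomicLong cursor = new AtomicLong(-1);

        // minConsumerSeq supplies the slowest consumer's last processed sequence.
        long claimNext(int bufferSize, LongSupplier minConsumerSeq) {
            while (true) {
                long current = cursor.get();
                long next = current + 1;
                if (next - bufferSize > minConsumerSeq.getAsLong()) {
                    Thread.yield();  // buffer full: wait for consumers to advance
                    continue;
                }
                if (cursor.compareAndSet(current, next)) {
                    return next;     // this thread now owns slot `next`
                }
                // another producer won the race; loop and retry
            }
        }
    }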
Well, one of the questions I had, just to wrap things up: you talked about the master/slave a
little bit, and I was just curious to find out a little bit more about that. You know, primary
system failure, how does this work? I mean, is there just some kind of heartbeat where the slave
just picks up and says, oh, they left off at this point? I mean, if you're running the same code
on both systems and only one of them is actually producing out to the wire, the slave, I would
imagine, would just watch that and say, okay, now I know where it left off and where to start up.
We're still actually running it as a bit more of a warm standby, so it's a manual
switch-over operation at the moment. The way that it works internally is that the state is
exactly the same, so the primary and secondary look exactly the same. To the secondary,
it's just receiving the same stream of events; it happens to be over a slightly different channel,
but it's still going into the same instance of the disruptor, the same business logic, the same
journalling, and actually the same replication, and events are being sent out. All we've done on
the actual sending threads on the other side of the disruptor is switch off the actual send,
so it's basically doing no-op operations on the other side, and the state is replicated across
the two. Oh, so the dispatcher does the same business operations,
and then when it goes out to publish over the wire it just says no-op? Yeah, there's basically
a boolean switch: am I on? No? Then don't actually write to the disc, or don't actually write to
the network device. So the ring buffers that sit in front of the publishers, for messages going
back out on the wire, are still filled up with the same set of events, and this is where the
determinism becomes very, very important; making sure you have deterministic
business logic is very important, so in the case that we failed over to the secondary and
somebody tried to late join, we've again got the same exact buffer of messages for them to
replay if they need it. Granted, your latency is a lot higher during a switch-over
operation, but hopefully that's acceptable.
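A tiny sketch of that switch, with invented names: both nodes run the identical pipeline, and only the live node's publishers actually touch the disc or the wire.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.WritableByteChannel;

    // Illustrative only: the downstream ring buffers fill identically on both
    // nodes; the secondary just no-ops at the very last step.
    final class WirePublisher {
        private final WritableByteChannel channel;
        private volatile boolean live;  // flipped during a manual failover

        WirePublisher(WritableByteChannel channel, boolean live) {
            this.channel = channel;
            this.live = live;
        }

        void onEvent(ByteBuffer event) throws IOException {
            if (!live) {
                return;              // secondary: no-op, state is already in sync
            }
            channel.write(event);    // primary: publish for real
        }

        void promote() { live = true; }  // the secondary becomes the primary
    }

I mean, do you have, have you ever had to pull the switch, and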
you know, the primary's got some problems, and you had to shut it down and bring the secondary
up and have it take over? No, we haven't as yet; we haven't had an incident like that in production,
but I'm certain it'll happen one day. Have you run tests for that, just to make sure you
can? I mean, this is what Netflix does and stuff: they're using the Amazon cloud, and they say,
well, let's shut this system down and see what happens, and see if everything else, like our
failovers, works properly. Yeah, we'd like to, but our latency requirements are very, very tight,
so something like that would probably perturb the system a bit too much from a latency perspective.
But there are things that we'd like to do; we'd like to be able to hot deploy, for example, and
the only way to do that is to have some form of cut-over in the middle of the day. But the way
that we test this is, again, we have an automated system test which, you know, shuts one of the
systems down, promotes the other one, ensures that messages are still flying between the two,
then brings the other one back in, has it late join, switches back over between the two, and
ensures that the state's correct. So we're confident that it works. Have we done it in production?
Not yet. We do have to do DR failover tests in production, but we tend to do those offline, so
that we don't have to actually worry about it really affecting things. Yeah, well, this has been a great
conversation; I've really enjoyed a lot of the details that you've been giving here.
You know, I guess in many respects, if our listeners aren't as familiar with the disruptor
pattern, it almost seems like a prerequisite for this podcast to go and just look at what's
going on and then listen to this, because this is a big fire hose to listen to without any
background. Yeah, it's amazing: it's actually quite a simple piece of code if you have a chance
to look at it, probably only about two or three thousand lines of code in total, and it actually
feels very simple. But it's an abstraction over some quite deep concepts, and that's why we've
spent a lot of time writing about it and talking about it and doing presentations on it, because
there's actually a lot of learning there as well, and that's what we think is actually more valuable.
My personal opinion is this: I've seen a lot of people talking about concurrency in the industry,
and they use the phrase "our tools are woefully inadequate". I don't believe that personally; I
think our tools are pretty good. I think it's that, as developers, our understanding is woefully
inadequate. We don't spend enough time digging down, and the experts are not helping us
understand; they're building tools for us to use, but not actually increasing our own
understanding. And one of the important reasons for that is that, actually, at the CPU level,
all concurrency is the same: all concurrency is shared-memory concurrency, and there's a bunch
of instructions that help you with ordering and visibility. Every tool that you run into,
including the disruptor, and STM, actor models, all these sorts of things, are simply abstractions
over that concurrency model. But being an abstraction, there's a translation cost between that
abstraction and what's actually happening, and if you don't understand where that translation
cost occurs, then you're not going to know what's going to be most appropriate for your system.
So it's about understanding what's happening lower down: given a tool that I'm going to use,
what are the costs I'm going to be paying? Do those costs become fairly light in terms of the
type of application I'm building, or do they become very, very heavy?
Well, thank you very much. This
has, like I said, been excellent to listen to, and this seems like one of those where I want to
have a follow-up, but it's going to be a while, because wrapping my brain around all the
implications here is going to take a while. So thank you again. Okay, you're welcome, and if
you want to do a follow-up, no problem, just let me know when you'd like to do it; we'll schedule
it a year out or something like that. Yeah, yeah. I mean, probably just on closing: there's obviously the
disruptor page, it's on Google Code, and we'll make sure there are plenty of links up there.
But there's also quite a long series of blog posts written by a colleague of mine called Trisha Gee
about understanding the various components of the disruptor. A lot of people say that they've
read the paper and found it a bit dry and academic, but have read Trisha's posts and found
they're a lot more accessible and provide a lot more understanding a lot more quickly. And then
Trisha and I have also done a couple of talks about some of the internals of the disruptor, so
I'll make sure those are all available, linked to from the Google Code website. And we could put
those in the show notes as well. I've gone over most of that stuff; the paper I found quite
fascinating, just because it was such a different way of thinking about the problems from the
traditional models. In any case, this has been a great podcast. So, Rinat, any final closing thoughts?
Well, not really, aside from the fact that this really detailed and really nice explanation
of the disruptor seems to be almost like a dress rehearsal for building an event store.
Excellent. Okay, so that's it for today, and we'll see you next time.
You have been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever consider recording a podcast, then visit our website to find out how easy it really is.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon
Computer Club. HPR is funded by the Binary Revolution at binrev.com; all binrev projects are
crowd-sponsored by Lunar Pages. From shared hosting to custom private clouds, go to lunarpages.com
for all your hosting needs. Unless otherwise stated, today's show is released under a Creative
Commons Attribution-ShareAlike 3.0 license.