Building fault-tolerant, low-latency exchanges

This article is the first in a series where we’ll be sharing our experience of building fault-tolerant, highly-performant marketplaces.

Before we discuss how we build bespoke marketplaces at Adaptive, let’s consider an alternative: buying an off-the-shelf solution.

Building a bespoke solution has the obvious benefit that we can tailor it to match requirements. For example, in a marketplace, we might introduce a novel matching algorithm or order type that domain experts think will improve liquidity. These are the kinds of customisation that are impossible, or at least prohibitively expensive, to achieve with an off-the-shelf solution whilst retaining the underlying intellectual property.

On the other hand, building a bespoke solution is riskier. Building marketplaces typically requires a large team with in-depth knowledge both at the domain level and at the infrastructure level. An organisation may need to hire a team if it doesn’t have enough capacity in-house. Hiring great people takes time and effort. Factors like this can lengthen the time to market beyond what it takes to develop a solution from scratch. What if a competing marketplace innovates first? How many people are needed to maintain the solution after its initial delivery? Can those people be retained?

At Adaptive, we offer what we think is a compelling third option: a hybrid approach.

Between 2012 and 2020, together with various clients, we’ve built many trading systems, including exchanges and RFQ workflows, across different asset classes using similar architectures (read more about them in this blog post). In the most recent of these deliveries, we had a budget of 100 microseconds to respond to a NewOrderSingle FIX message, i.e., a “place order” instruction, with execution reports at the 99th percentile. Below is a birds-eye view of the key logical components in this system. We’ll revisit it later in more detail.

Over those five years, it became increasingly clear which parts of the architecture remained constant between projects. To offer better value to our clients and to accelerate their deliveries, we wanted to stop building these parts from scratch on each new project; therefore, back in 2017, Adaptive began to invest in a solution: Hydra Platform. Hydra Platform provides an opinionated architecture and building-blocks for the typical logical components that one would find in a trading system. It allows teams to focus on business logic and deliver value to the client straight away without having to worry about low-level details or cross-cutting concerns like how fault-tolerance or disaster recovery will work.

Hydra Platform has and continues to be a compelling option for our clients.

  • It supports bespoke functionality and customisation.
  • It de-risks project delivery.
    • Projects require smaller development teams.
    • Projects use a proven architecture where we’ve already established patterns for cross-cutting concerns like fault-tolerance.
    • Projects cost less up-front than designing and developing the underlying patterns and infrastructure from scratch.
  • It reduces the time to market.
    • From the very first day, project teams focus on the business problems they are solving, i.e., the essential complexity, rather than building the underlying infrastructure, i.e., the accidental complexity.
    • Smaller development teams require less effort to hire.
    • Projects can leverage building-blocks for the most common logical components in marketplaces (which makes Hydra Platform an excellent fit for building RFQ engines, exchanges, internalisation engines, etc.).

In the following sections, we’ll describe how we glue Hydra Platform components together while preserving fault-tolerance across the whole system. By the end, we’ll end up with a map of the architecture of projects built with Hydra Platform. In future articles, we will zoom into particular areas of this map, discuss how we’ve distilled our experience to build specific Hydra Platform components, and demonstrate how easy they are to use.

To find out more about the evolution of this architecture and why we’ve used it over more traditional architectures, please read our whitepaper on application-level consensus.

The clustered heart of the architecture

A clustered engine sits at the heart of marketplaces built on Hydra Platform.

Each node of the clustered engine contains several stateful, application-level modules. These might be reference data, risk management, and a matching engine, for example. A module processes a consensus-agreed sequence of commands using deterministic business logic to generate a sequence of events that is identical across all nodes.

Modules behave a bit like the idealised, pure function signature below, which is similar to the signature for a fold. On a single thread, modules process each command in the context of the current state to produce a sequence of events and the next state.

Over the last decade or so, similar, log-based patterns have emerged across the whole stack. For example, event-sourcing has become popular on the back-end, and Redux and Elm-based applications have become popular on the front-end. Based on our experience, we attribute this to two main factors.

  • Deterministic, single-threaded code is easy to understand; especially when paired with clearly defined inputs, outputs and state.
  • Time-travel debugging makes it easy to track down and fix bugs, and it is trivial to implement in systems where one can replay commands/events/actions.

We place our modules at the core of a hexagonal architecture to eliminate infrastructure-level concerns from our business logic. It makes testing straightforward and reduces lock-in to messaging infrastructure. We implement adapters for a few simple interfaces that our business logic exposes. These adapters plug the business logic into Hydra Platform so that it runs with fault-tolerance.

 

Divergence protection

Earlier, we described modules as processing “each command in the context of the current state to produce a sequence of events and the next state”. When clustered business logic is accidentally not deterministic, it may cause divergence of these outputs, i.e., the next state and the emitted events might differ across cluster nodes. Non-determinism occurs when a developer uses something that isn't the same across all nodes or isn’t the same if we replay the log, for example, system time.

To prevent divergence, we use two strategies in Hydra Platform. First, we use static-analysis tooling to nudge developers into writing deterministic code. Second, we deploy monitoring software to detect divergence at runtime.

 

High-performance replication, consensus and distribution

Hydra Platform uses a highly-efficient RAFT implementation, built on top of Aeron, to replicate, persist and sequence the commands that the FIX and web gateways send to the engine. It also uses Aeron to distribute and replay events to downstream components efficiently.

Performance and protocol-design gurus Martin Thompson and Todd Montgomery, of Disruptor and 29West fame respectively, built Aeron for low-latency messaging. To get the best out of it, we use zero-allocation, zero-copy messaging codecs similar to SBE, Flatbuffers and Cap’n Proto on top.

Let’s look at some numbers!

We recently ran a series of benchmarks to measure round-trip latency between different kinds of Hydra Platform components. We ran our benchmarks on:

  • “m5d.metal” EC2 instances,
  • bare metal with kernel-based networking, and
  • bare metal with kernel bypass.

In any system, latency measurements will change as the load upon it changes. Therefore, it is essential to consider the load a system was under when its latency was measured and not look at latency figures in isolation. We ran each benchmark under a load of 100,000 round-trips per second from a single client, where each message was 100 bytes in length.

Between non-clustered Hydra components, on two machines, our measurements showed that 99.99% of all round-trips, that is, two hops, were less than:

  • 200 microseconds on EC2,
  • 50 microseconds on bare metal, and
  • 23 microseconds on bare metal with kernel bypass.

Between a Hydra Platform engine clustered across three machines and its client on another, our measurements showed that 99.99% of all round-trips, including hops for consensus, were less than:

  • 175 microseconds on bare metal, and
  • 73 microseconds on bare metal with kernel bypass.

 

Limitations on the size of the clustered engine state

The programming model embraced by Hydra Platform limits the business logic inside our cluster to processing commands using only its (“snapshottable”) in-memory state. The finite nature of main memory imposes an upper limit on the size of this state. Therefore, we avoid storing unbounded collections, e.g., trade executions, inside it. Instead, we record events to disk using our event log.

As our business logic is on a single thread, expensive queries slow down the processing of latency-sensitive commands like cancelling an order. Wherever it is sensible, we avoid executing queries inside the clustered engine.

If a downstream component needs data, but we’re not willing to store it in the clustered engine or query it, where do we obtain such data? Let’s look at an example.

The admin gateway and web trading gateway both present live pages of tabular data based on user-defined queries using Hydra Platform’s LiveQuery. They source real-time data from the streaming tail of the event log. However, this streaming tail doesn’t give enough information to answer questions like, “what ten trades have the highest prices in the last 48 hours on BTC/USD?”. Answering questions like this requires historical data too.

While it is possible to replay two days of data in an exchange environment, we don’t recommend it! Consider the following questions.

  • From what position in our (unindexed) event log should we replay?
  • How expensive is it to ingest two days worth of data to compute the result set?
  • At what rate will we receive these queries?

 

Accumulating data in read-models

If replaying is not an option, where do we obtain historical data? The answer is that we continuously update a data-structure to answer our queries efficiently. In CQRS-speak, we call this data-structure a “read-model”.

In the case of historical trade execution data, the history service acts as our read-model. It maintains an indexed table inside a relational database. It converts each “trade executed” event that the engine appends to the event log into a row insert. We store the table in a denormalised form to avoid the computation of expensive joins when the database executes queries.

The history service exposes an API for other components to issue queries. Upon receiving a request, it converts the query into SQL and executes the query against its database. It then transforms the result set into messages that it sends directly back to the component that issued the request.

 

Read-model resilience and recovery

We run redundant instances of the history service in an active/active configuration to avoid introducing a single point of failure. Whenever a service writes a row out to its database, it writes a high-water mark alongside it that indicates how far into the event log the service has consumed. When the service restarts, it queries its database for this high-water mark and replays only the unprocessed events from the event log.

 

External connectivity

At the fringes of our initial component diagram, coloured in blue and red, are gateways that provide connectivity with external systems.

In line with the cluster, we tend to use the hexagonal pattern in our gateways too. Again, we implement adapters over the interfaces that the business logic exposes to plug it into Hydra Platform.

Gateways and clients vary between projects. Hydra Platform allows developers to build custom gateways and clients quickly using its building-blocks.

There are several gateways shown in the original diagram.

  • We built some gateways using the Hydra Platform FIX Gateway building-block.
    • The FIX Order Management (OM) Gateway allows customers to modify orders.
    • The FIX Market Data (MD) Gateway gives customers access to live updates for selected order-books.
    • The FIX Drop Copy Gateway integrates with customers’ back-office infrastructure to support reconciling trades.
  • We built other gateways using the Hydra Platform Web Gateway building-block.
    • The Web Trading Gateway provides similar functionality to the FIX OM and MD gateways but exposes a JSON API over WebSocket.
    • The Admin Gateway exposes an API and Web GUI for market operators to configure and observe the marketplace, e.g., they might use this functionality to list instruments or to see live orders.

 

Protocol transformation

Gateways typically transform one protocol into another, e.g., from FIX into an efficient internal protocol and vice-versa. We often split gateways into two parts: an inbound flow and an outbound flow. This split is similar to the division in CQRS.

The inbound flow receives messages from external sessions that might mutate the state of the system, e.g., a command to place an order. First, it validates these commands, e.g., checking that a place order command references a tradable instrument. Next, it transforms these commands. Externally, we may expose heavyweight types, e.g., strings to represent instrument identifiers. In contrast, internally, we might use more efficient representations, e.g., 64-bit integers, that we might need to look up in a map. Finally, it forwards the transformed commands to the clustered engine.

The outbound flow receives events from the clustered engine. It transforms these events into messages of the external protocol, e.g., FIX messages, and disseminates them to external sessions. It might also maintain a cache of data to service queries.

 

Keeping the clustered engine lean by farming out work to gateways

Gateways can be useful for offloading work from the clustered engine. For example, we built an exchange-like system where we had to apply a markup to prices before disseminating them. Each customer observed slightly different prices (with tailored markups). We put this calculation logic in gateway instances rather than inside the clustered engine, as it allowed us to scale (more) horizontally to match the number of concurrent customer connections.

 

In-memory read-models for joining data

Sometimes transformations in gateways, like price markup, require more information than is available in the individual events that we transform. For example, we might need price and customer tier to calculate a price with markup. Price information but not all customers’ tier information may be available on something akin to an “order placed” event. Therefore, we need to obtain and maintain, via the processing of events, this customer tier data separately to join together with our “order placed” events later. For this purpose, in addition to servicing queries, gateways sometimes maintain in-memory read-models.

 

Gateway resilience and recovery

Unlike the history service that is backed by a durable database, there is no high-water mark for an in-memory read-model on restart. All the state is gone. To seed our in-memory read-model at startup time, we “prime” it by either querying the clustered engine, if it is reasonably quick and a small amount of data, or by requesting the data from another durable read-model like the history service.

We usually run redundant instances of gateways in an active/active configuration. If we have a business requirement that only one gateway should accept connections at any time, e.g., to avoid differences in latency for modifying orders, we can run groups of gateway instances in an active/passive configuration. Hydra Platform will detect failures of active gateways and “promote” passive instances. It frees the application layer, and application development team, from dealing with the complexity around failover orchestration.

 

Conclusion

We’re at the end! We’ve covered all of the components in our initial diagram.

In the diagram below, we unfurl the inbound and outbound parts of our gateways, that we mentioned earlier, to represent our components slightly differently. It shows that, if we squint, we have a unidirectional data flow when we structure our gateways in this manner. This flow, when coupled with deterministic business logic, extends some of the benefits of the clustered engine that we mentioned earlier, such as time-travel debugging, to everything downstream of it.

Thank you for reading this article. We hope that you:

  • have picked up a high-level understanding of some of the components we use to build highly-performant, fault-tolerant trading systems; and
  • understand where Hydra Platform sits on the buy vs build spectrum.

In future articles, we’ll cover various aspects of Hydra Platform in more detail and show how simple its APIs are to use. In the meantime, you can also find out more in this blog series about Hydra Platform by our CTO.

If you would like more information about Aeron or Hydra Platform, please don’t hesitate to reach out to me on LinkedIn or click the button below.

 

Zachary Bray

Senior Software Engineer, Adaptive Financial Consulting Ltd

×

Contact us

By pressing "Send" I agree that I am happy to be contacted by Adaptive Financial Consulting. I can unsubscribe at any time. You can read more about our privacy policy here.