Operating Aeron and Hydra Platform based solutions
This article is the fourth in a series of four blog posts on Aeron and Hydra Platform:
- Part I: History of Aeron and Hydra Platform.
- Part II: 5 trading solutions we delivered on top of Aeron and Hydra Platform.
- Part III: Aeron and Hydra Platform in a trading platform.
- Part IV: Operating Aeron and Hydra Platform based solutions.
In the previous blog post I talked about the role Aeron and Hydra Platform play in trading platform architecture and we looked at some of the platform components we offer to accelerate and de-risk the development of trading systems for our clients.
Today we’ll look at the operational perspective and the tools we have in our toolbelt to operate such a platform.
The following diagram provides an overview of the modules deployed on one of the matching engine cluster nodes. In Hydra Platform, we package Aeron components (in green on the diagram) alongside Hydra Platform operational components (in dark blue) in a generic Docker container. Application developers can then overlay a Docker layer to add the application business logic, in this case the matching engine. Docker offers a convenient and standardised way to package and configure components. In terms of performance overhead, we use kernel bypass, so Docker's network I/O stack has no impact on latency.
What follows is an overview of the components included in our standard Hydra Platform Docker container:
Pre-configured Aeron modules including the Aeron driver, Archiver and consensus module (cluster). This allows developers to get up and running quickly with a deployed version of a service based on Aeron. A standardised and battle-tested deployment layout also reduces support effort, as each of our deployments is consistent across projects and clients.
Performance and monitoring
- Hydra Platform thread affinity profiles: this module allows us to set a performance profile on the container so the same container can be used in “desktop” mode for development or “nitro” mode for production where hot threads are pinned to cores for optimal performance.
- Hydra Platform monitoring probe: Aeron provides a low-level API (stats-monitor) to collect various Aeron-related metrics. Hydra Platform uses the same mechanism to expose its own metrics, and we provide a monitoring probe capable of collecting these stats efficiently, so they can be sent to an external monitoring system in a standard format. The probe also exposes the health status of the node. Bespoke business logic metrics can also be registered, and Hydra Platform will expose them via this probe.
- Hydra Platform latency monitoring: Hydra Platform collects component level latency metrics stored as histograms, and various latency statistics are exposed via the monitoring probe.
- Hydra Platform cluster authentication: Aeron Cluster provides hooks for plugging in an authentication protocol. Hydra Platform uses these hooks to provide an authentication module based on SCRAM (Salted Challenge Response Authentication Mechanism).
- Hydra Platform encryption: this module provides encryption for data in transit and at rest. It leverages IPsec for data in transit and file system encryption for data at rest.
- Hydra Platform data maintenance: Hydra Platform provides tools to automatically purge message logs, recovering disk space and enabling the system to run for long periods without maintenance downtime.
- Hydra Platform backup/restore: this Hydra Platform tooling allows operators to back up cluster recovery plans and restore them. This can be used to extract data from one environment and move it to another.
- Hydra Platform Disaster Recovery (DR): Hydra Platform provides a standalone component capable of connecting remotely to a cluster from a secondary data center. This component replicates a persistent stream asynchronously. In the event of a primary site failure, it is possible to seed a secondary cluster in the DR site using the replicated data. For cloud-based deployments, this approach allows our clients to significantly reduce hosting costs, as the DR site can be provisioned on demand instead of being active all the time. This alone can halve the hosting cost compared to an on-premises solution, where two sites are required all the time.
- Hydra Platform time travel: One of the great aspects of this architecture is its deterministic nature. If a business logic exception occurs, a developer or operator can “replay” the transactions with a debugger attached and get the exact same exception to occur deterministically. In most systems it generally takes more time to reproduce bugs than to actually fix them. With this design, reproducing a bug is generally a matter of replaying the log. We have built tooling in Hydra Platform to make this process straightforward, so any developer can extract a recovery plan from the cluster and replay it at will in a development environment. I cannot stress enough how powerful this is. This is a game changer compared to more traditional architectures where you rely on application logs to troubleshoot issues.
- Hydra Platform stream introspection: This tool works hand in hand with Hydra Platform codecs and is able to decode message streams and turn them into a human-readable format. This gives developers and operators a better understanding of the messages flowing into the system, for development or operational purposes.
- Hydra Platform divergence detection: The business logic hosted in an Aeron cluster node must be deterministic. Subtle non-deterministic bugs can make their way into the business logic and stay undetected for a while. This module monitors Aeron Cluster output streams and can detect divergence at the transaction level. If a follower node diverges, the module prevents the node from becoming a leader and raises an alert.
- Hydra Platform poison message: Since business logic is deterministic, unhandled exceptions in the cluster will take down the whole cluster (i.e., each node will fail at the same transaction). When such incidents happen, operators can use Hydra Platform inspection tooling to investigate the issue and apply remediations. Such an issue can generally be fixed by rolling out a patch to the code. Alternatively, if a workaround is available, the tooling allows operators to remove the poison message from the log before restarting the system.
- Hydra Platform snapshot modification: Hydra Platform repositories generate binary snapshots. This tool allows operators to turn a binary snapshot into a SQL Database. The database can be used to inspect snapshot data. If required, the data can be modified in the database and the tooling can rebuild a binary snapshot from the modified database.
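To make the time travel and poison message items above more concrete, here is a minimal sketch in Python of why a deterministic state machine fails at the same log position on every replay, and how skipping one message lets the rest of the log apply. It is an illustration only: `MatchingState`, `replay` and the log layout are hypothetical names made up for this example, and none of this reflects the actual Aeron or Hydra Platform APIs.

```python
from dataclasses import dataclass, field


@dataclass
class MatchingState:
    """Toy deterministic state machine (hypothetical, for illustration)."""
    balances: dict = field(default_factory=dict)

    def apply(self, command: dict) -> None:
        # The outcome depends only on the current state and the command,
        # never on wall-clock time, randomness or thread scheduling.
        account = command["account"]
        new_balance = self.balances.get(account, 0) + command["amount"]
        if new_balance < 0:
            raise ValueError(f"negative balance for {account}")
        self.balances[account] = new_balance


def replay(log, skip_positions=frozenset()):
    """Replay the log from an empty state.

    Returns (state, failed_position). failed_position is None when every
    applied command succeeded. Positions listed in skip_positions are not
    applied, which models removing a poison message before a restart.
    """
    state = MatchingState()
    for position, command in enumerate(log):
        if position in skip_positions:
            continue
        try:
            state.apply(command)
        except ValueError:
            return state, position
    return state, None


log = [
    {"account": "A", "amount": 100},
    {"account": "B", "amount": 50},
    {"account": "A", "amount": -250},  # poison message in this toy log
    {"account": "B", "amount": 25},
]

first = replay(log)    # fails at position 2
second = replay(log)   # fails at position 2 again, with identical state
patched = replay(log, skip_positions={2})  # the whole log now applies
```

Because both replays stop at the same position with the same state, a debugger attached to either run sees exactly what production saw. The same property explains why every node in a cluster fails on the same transaction.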
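The divergence detection item above can be sketched in a similar way. One possible approach, assumed here purely for illustration and not taken from Hydra Platform, is to keep a running digest of each node's output stream and compare the digests transaction by transaction. The function names and message payloads below are hypothetical.

```python
import hashlib


def output_digests(outputs):
    """Return the running SHA-256 digest after each output message."""
    running = hashlib.sha256()
    digests = []
    for message in outputs:
        # Length-prefix each message so that message boundaries
        # contribute to the digest as well as the bytes themselves.
        running.update(len(message).to_bytes(4, "big"))
        running.update(message)
        digests.append(running.hexdigest())
    return digests


def first_divergence(leader_outputs, follower_outputs):
    """Index of the first transaction whose running digest differs.

    Only the common prefix is compared, so a follower that is merely
    lagging behind the leader is not reported as diverged.
    Returns None when no divergence is found.
    """
    leader = output_digests(leader_outputs)
    follower = output_digests(follower_outputs)
    for index, (expected, actual) in enumerate(zip(leader, follower)):
        if expected != actual:
            return index
    return None


leader = [b"fill:1@100", b"fill:2@101", b"fill:3@102"]
healthy_follower = [b"fill:1@100", b"fill:2@101"]
diverged_follower = [b"fill:1@100", b"fill:2@101.0", b"fill:3@102"]

# A node would only be eligible for leadership while this stays None.
healthy = first_divergence(leader, healthy_follower)
diverged = first_divergence(leader, diverged_follower)
```

Using a running digest rather than a per-message hash means a single comparison at any position covers the entire history up to that point.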
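For the latency monitoring item above, the sketch below shows one simple way to store latencies as a histogram in fixed memory and read percentiles back. It assumes power-of-two buckets, which is a choice made for this example only. The source does not describe the histogram layout Hydra Platform uses, and `LatencyHistogram` is a hypothetical class.

```python
import math


class LatencyHistogram:
    """Fixed-memory histogram with power-of-two buckets (nanoseconds)."""

    def __init__(self, bucket_count=40):
        self.counts = [0] * bucket_count
        self.total = 0

    def record(self, latency_ns: int) -> None:
        # Bucket i holds values in [2**(i-1), 2**i); zero goes to bucket 0.
        # Values beyond the range are clamped into the last bucket.
        index = min(latency_ns.bit_length(), len(self.counts) - 1)
        self.counts[index] += 1
        self.total += 1

    def percentile_upper_bound(self, percentile: float) -> int:
        """Exclusive upper bound (ns) of the bucket holding the percentile."""
        if self.total == 0:
            raise ValueError("no samples recorded")
        target = max(1, math.ceil(self.total * percentile / 100.0))
        seen = 0
        for index, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                return 2 ** index
        return 2 ** (len(self.counts) - 1)


histogram = LatencyHistogram()
for _ in range(99):
    histogram.record(900)     # typical latency
histogram.record(70_000)      # one slow outlier

p50 = histogram.percentile_upper_bound(50)
p99 = histogram.percentile_upper_bound(99)
p100 = histogram.percentile_upper_bound(100)
```

Recording is a single array increment, which is why histograms of this kind suit hot paths. The trade-off is resolution: each answer is only accurate to within its bucket.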
As pointed out in this blog series, Aeron and Aeron Cluster provide a significant advantage for the financial community. These components have been continuously refined and enhanced during the past 6 years of development and deployed by a wide range of firms. Hydra Platform further accelerates and de-risks the development of trading platforms, providing a mature set of libraries and tools for developers and operators. Our clients are using this tooling to rapidly deliver market-defining trading venues and infrastructure today. Get in touch to see how Hydra Platform can help accelerate your delivery.
The series is also available for download as a white paper. In this paper, we provide an overview of Aeron & Aeron Cluster, and look at some of the projects we have developed using Aeron during the past 3 years. We then provide an overview of Hydra Platform which contains some of the building blocks we produced to complement Aeron and further accelerate the development of systems we build. While we do not go into deep technical details, this document is particularly suited for developers, architects and CTOs.
Co-founder and Chief Technology Officer,
Adaptive Financial Consulting