In our earlier posts, we talked about embracing asynchrony, as well as the benefits of viewing all requests as ones that result in streams of responses. One thing we haven’t talked about yet, but which was a major feature of Reactive Trader, was actually detecting system unavailability, as well as overall system health.

Heart-beating

Raising an event when a stream goes quiet – fails to yield an event within its agreed service level – is one thing. But we also need to make sure we don’t attempt to connect to back-end services that are currently unavailable. The push-based message-oriented middleware typically used to build reactive financial applications usually has its own concept of connectivity, which means that, as the developer, you can tell whether you are connected to the middleware itself. But the services that you build on top of it, such as authentication, various stores of reference data, or a calendar service, may not get first-class support from the middleware API with which to communicate their health. In this case it is necessary to introduce some sort of heart-beating, so that your client application can tell both when a back-end service becomes available during system initialisation (particularly critical during development, when you are starting both server and client on the same machine at the same time) and when it goes away.

Our sample application, Reactive Trader, was not implemented over any particular middleware and so does not have a middleware broker. Instead, it connects directly to a single back-end server using SignalR, optimistically trying to connect via WebSockets. We use the SignalR API’s concepts of connectivity when demonstrating system availability and recovery.

A simple approach to heart-beating is usually the best, and it’s important to be sympathetic to the underlying transport layer that you are using. When communicating over TCP, or over other reliable transport mechanisms, a late heartbeat has typically not been dropped, just delayed by congestion. A heartbeat every 5 – 10 seconds is reasonable, and waiting 1.5 to 2.5 times the expected heartbeat interval before alerting the user that a back-end service is unavailable works well.
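To make those numbers concrete, here is a minimal sketch of a heartbeat monitor in Python. It is not code from Reactive Trader: the class name, its fields and the choice to pass the current time in explicitly (which keeps it easy to test) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from one back-end service (hypothetical helper)."""

    interval_s: float = 5.0   # expected gap between heartbeats, in seconds
    tolerance: float = 2.0    # how many intervals to wait before giving up
    last_seen_s: Optional[float] = None  # None until the first heartbeat arrives

    def on_heartbeat(self, now_s: float) -> None:
        # Record the arrival time of the latest heartbeat.
        self.last_seen_s = now_s

    def is_available(self, now_s: float) -> bool:
        if self.last_seen_s is None:
            return False  # nothing heard yet, e.g. during start-up
        # A late heartbeat is tolerated until interval * tolerance has elapsed.
        return (now_s - self.last_seen_s) <= self.interval_s * self.tolerance
```

With a 5 second interval and a tolerance of 2, the service is only reported as unavailable once more than 10 seconds have passed without a heartbeat, which leaves room for one heartbeat to be delayed by congestion.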

A word of caution, however, when implementing your re-connection logic. A particularly problematic issue we commonly see when a back-end service becomes unavailable, or is perceived to be unavailable, to a large set of clients is the rush of clients connecting to it when it comes back. This causes a flood of inbound traffic, followed by a resulting (and much greater) flood of outbound data from server to clients. This can cause congestion on the internal LAN or DMZ network links, leading to dropped packets and TCP retransmissions, which may cause those all-important heartbeats to be missed. Your client API then once again believes the back-end server has been disconnected and tries to re-connect once more. We have seen and heard about more than a few trading systems being taken offline by a problem like this. Having individual clients wait a random delay of up to a significant period – at least as long as the heartbeat period itself – before reconnecting to a back-end service can reduce the vulnerability of your system design to this problem. This particular approach isn’t something we have implemented in Reactive Trader.
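As a rough illustration of the random delay described above (and, again, not something that exists in Reactive Trader), each client could pick its own wait before reconnecting. The function name, the `spread` parameter and its default of two heartbeat intervals are assumptions for this sketch.

```python
import random


def reconnect_delay(heartbeat_interval_s: float, spread: float = 2.0, rng=random) -> float:
    """Pick a random wait, in seconds, before one client tries to reconnect.

    The upper bound is `spread` heartbeat intervals. `spread` must be at
    least 1 so the window is never shorter than the heartbeat period itself.
    """
    if spread < 1.0:
        raise ValueError("spread must be at least one heartbeat interval")
    # Each client draws independently, so reconnections are smeared across
    # the window instead of arriving at the server all at once.
    return rng.uniform(0.0, heartbeat_interval_s * spread)
```

A client would wait for `reconnect_delay(5.0)` seconds, somewhere between 0 and 10, before attempting to connect again.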

System Health

While talking about heart-beating of system services, we also want to state that it is important not to introduce too much complexity when representing system components. A complex model of system health – for instance, one that reveals to the UI the complexity of the downstream system architecture, with its myriad databases, networks, buses, and servers, whenever any one of them fails – would be very difficult to test and verify across all the different combinations of failure states. One pattern that works well is to break your back-end API into a few large modules, each containing smaller services, to allow easier consumption. For example, in our sample application we would have three different modules – reference (or referential) data, streaming and blotter. These could come and go independently, and indeed may even run in production on separate machines – possibly even with multiple instances active at the same time, with different clients connected to different machines. This model would enable us to disable trading while keeping the blotter active, or vice versa, which is a simple decomposition of functionality that maps well to our business users’ understanding of the domain.

[Figure: reactive-trader-system-areas]

Here you can see the two main areas of functionality in Reactive Trader – streaming prices and the blotter. Both of these areas are dependent on the third back-end service – reference data. We can visually inform the user easily that some particular functionality is unavailable while leaving the other, functioning part of the system, running.
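One way to picture this coarse-grained model is as a small lookup from user-facing areas to the modules they need. The sketch below is an assumption-laden illustration in Python, not the Reactive Trader implementation: the area names, the `REQUIRES` table and the `enabled_areas` function are all hypothetical.

```python
from enum import Enum
from typing import Dict


class Status(Enum):
    AVAILABLE = "available"
    UNAVAILABLE = "unavailable"


# Each user-facing area lists the back-end modules it depends on.
# Both areas need reference data, as described above.
REQUIRES = {
    "pricing": ("reference", "streaming"),
    "blotter": ("reference", "blotter"),
}


def enabled_areas(module_status: Dict[str, Status]) -> Dict[str, bool]:
    """Return, per area, whether every module it needs is available."""
    return {
        area: all(
            # A module we have never heard from is treated as unavailable.
            module_status.get(module, Status.UNAVAILABLE) is Status.AVAILABLE
            for module in modules
        )
        for area, modules in REQUIRES.items()
    }
```

With only three modules there are just eight combinations of states to test, which is the point of keeping the health model this simple.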

Asynchronous Gates

At one point during our talk we had some fun in highlighting that with the right use of an asynchronous API, it is possible to greatly improve the traditional poll-sleep-retry loop that is so common when we want to prevent calls to systems that are unavailable.

There are quite a few things wrong with the poll-sleep-retry code we showed, and Rx allows us to implement a much more elegant solution, especially when implemented over a push-based message-oriented middleware.
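The code from the talk is not reproduced here. As a stand-in, this is a minimal Python sketch of the general poll-sleep-retry shape being criticised; the function and parameter names are hypothetical, and the `sleep` function is injectable only so the loop can be exercised without really waiting.

```python
import time
from typing import Callable


def call_when_available_polling(
    is_available: Callable[[], bool],
    action: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
    poll_interval_s: float = 2.0,
) -> None:
    """Block until the service reports itself available, then call it."""
    while not is_available():
        # The calling thread is stuck here. If the service comes back just
        # after the check, we still wait out the whole interval.
        sleep(poll_interval_s)
    action()
```

The loop ties up a thread, adds up to a full polling interval of latency, and has no natural way to be cancelled.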

In the Rx version, a query is set up on a stream of system status notifications. We don’t have to block our thread, and when a notification is yielded that matches our query – here, one where the status is equal to Available – we immediately (rather than waiting for up to 2 seconds!) make a call to the client service.
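The same idea can be sketched without any Rx library. The Python below uses a tiny hand-rolled `StatusStream` as a stand-in for a push-based stream of status notifications; that class and `call_when_available` are assumptions made for illustration, not the Rx API or the code from the talk. In Rx terms it corresponds roughly to filtering for the Available status and taking the first match.

```python
from typing import Callable, List


class StatusStream:
    """A minimal push-based stream of status strings (hypothetical stand-in)."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, on_next: Callable[[str], None]) -> Callable[[], None]:
        self._subscribers.append(on_next)
        # The returned function removes this subscriber again.
        return lambda: self._subscribers.remove(on_next)

    def publish(self, status: str) -> None:
        # Iterate over a copy so subscribers may unsubscribe while notified.
        for subscriber in list(self._subscribers):
            subscriber(status)


def call_when_available(stream: StatusStream, action: Callable[[], None]) -> None:
    """Run `action` once, on the first "Available" notification."""

    def on_next(status: str) -> None:
        if status == "Available":
            unsubscribe()  # one-shot gate: stop listening before calling
            action()

    unsubscribe = stream.subscribe(on_next)
```

Nothing blocks while the service is down, and the action runs as soon as the Available notification is pushed rather than on the next poll.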

A more sophisticated implementation would take into account the difference in the first call to the back-end system and subsequent calls made after a change in system availability. In the first, we presume that the backend has been available for some time and that we are the one just coming online and so do not need to introduce a random delay when making our network call. When making subsequent calls to the client service, after a change in system availability, we would introduce just such a delay.
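A rough sketch of that distinction, under the assumption that status notifications arrive as plain strings: the hypothetical `ReconnectPolicy` below returns no delay when the very first notification it sees is Available, and a random delay for any later transition into Available. The class, its method and the `max_delay_s` parameter are invented for this example.

```python
import random
from typing import Optional


class ReconnectPolicy:
    """Decides how long to wait before calling a service that became available."""

    def __init__(self, max_delay_s: float, rng=random) -> None:
        self._max_delay_s = max_delay_s
        self._rng = rng
        self._previous: Optional[str] = None  # last status seen, if any

    def on_status(self, status: str) -> Optional[float]:
        """Return a delay in seconds, or None if no call should be made."""
        previous, self._previous = self._previous, status
        if status != "Available" or previous == "Available":
            return None  # not a transition into Available
        if previous is None:
            # First notification: presume the back end was already up and
            # we are the one coming online, so call straight away.
            return 0.0
        # The service has just come back: spread clients out at random.
        return self._rng.uniform(0.0, self._max_delay_s)
```

A gate built on the status stream would then schedule its call after the returned delay instead of calling immediately.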