15th December 2013

Design principles for SBE, the ultra-low latency marshaling API

If you have not heard about Simple Binary Encoding (SBE) before, you should read this overview first.

The design of SBE applies some interesting techniques, combining them into a simple solution that is extremely hardware friendly.

Streaming

CPUs really like predictable memory access patterns: when they detect one they start pre-fetching data, loading it into the lower-level caches before it is actually required.

It is important to understand that when a CPU has to fetch data from a higher-level cache, or worse, from main memory, it is going to wait a while: orders of magnitude longer than it takes to execute a few instructions.

With SBE, messages are encoded and decoded in a specific order, sketched in the example after this list:

  • Within the fields section, fields are encoded in the order specified by the schema.
  • Then repeating groups, again in the order specified by the schema.
  • Finally, variable-length fields, in the order specified by the schema.
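
To make this concrete, here is a minimal sketch of schema-ordered encoding in Java. The Order flyweight, its fields and the fills group are hypothetical names for illustration, not taken from a real schema:

    // Hypothetical flyweight generated by SbeTool; all names are illustrative.
    final byte[] buffer = new byte[1024];
    final Order order = new Order();
    order.wrapForEncode(buffer, 0);

    // 1. Fixed-size fields first, in schema order
    order.orderId(72L);
    order.price(10050L);

    // 2. Repeating groups next, in schema order
    final Order.Fills fills = order.fillsCount(2);
    fills.next().quantity(100L);
    fills.next().quantity(50L);

    // 3. Variable-length fields last, in schema order
    final byte[] note = "urgent".getBytes();
    order.putClientNote(note, 0, note.length);

Decoding walks the exact same sequence: fixed fields, then groups, then variable-length data.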

WARNING: It is YOUR responsibility, when you code against a class generated by SbeTool (we call these classes flyweights), to ensure that you encode and decode in the order specified by the schema. Failing to do so could, at best, reduce performance and, at worst, return invalid data during decoding or corrupt the buffer during encoding.

Note that at the moment the API will let you encode and decode out of order, but we plan to improve this and throw errors when we detect an invalid sequence, at least in debug builds.

This is a constraint, a small one we believe, that helps simplify the flyweight design and makes it more hardware friendly.

No copy

Lots of serialization APIs use some form of DTO (data transfer object): when you deserialize a message from the wire you get a DTO, and you then map this DTO to some business entity.

SBE does not work this way: the flyweight writes directly to the underlying buffer during encoding and reads directly from the buffer during decoding.

Here is an example: let's say we have an order message defined in an SBE schema; after running SbeTool, we get an order flyweight. [gist id=7957984]
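
As a hedged sketch (reusing the hypothetical names from the earlier example), the call in question boils down to:

    // Sketch only: one setter call writes straight through to the buffer.
    order.wrapForEncode(buffer, 0);
    order.orderId(72L);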

Setting the order ID does not store 72 in the order flyweight: it encodes 72 into its byte representation (which depends on the orderId primitive type and on the endianness) and stores the bytes directly in the underlying buffer.

A flyweight behaves a bit like a stencil: you position it over a wall (the byte array) at the right place (the offset) and then you can paint (encode) very quickly!

No allocation

The order flyweight we talked about previously can be reused indefinitely to encode and decode different messages. This means you do not need to allocate additional flyweights at runtime.

When you decode a field of a primitive type, nothing is allocated; it is only a stack operation.

Additionally, when you decode an array field you do not get a newly allocated array back: you provide your own buffer (which you can reuse on your side) and the flyweight copies the data into it. Again, this allows your system to avoid allocation.
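
As a hedged sketch (hypothetical generated names again), a decoding path allocates nothing per message:

    // Flyweight and scratch buffer are allocated once, then reused per message.
    final Order order = new Order();
    final byte[] scratch = new byte[64];

    // blockLength and version would come from the message header.
    order.wrapForDecode(buffer, 0, blockLength, version);
    final long id = order.orderId();                       // primitive read: stack only
    final int len = order.getClientNote(scratch, 0, scratch.length); // copied into your buffer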

Why limit or prevent allocation? To limit or suppress GCs, which slow down your encoding and decoding operations and, more importantly, pause all threads during stop-the-world collections (i.e. slow down the whole system).

Note: the benchmark against Google Protocol Buffers shows that SBE is significantly faster, but there is a more subtle aspect: GPB allocates, so it will trigger GCs and slow down the overall system. This is another big advantage for SBE.

NOT thread safe

Message flyweights are not thread safe, by design. If you want to decode multiple order objects concurrently, you simply need to make sure each thread has its own order flyweight.

This means that flyweight methods never take a lock, perform CAS operations or insert memory fences, and they run at full speed independently of other threads on the system (no contention).
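
One simple way to achieve this (a sketch, not something SBE mandates) is to hold one flyweight per thread in a ThreadLocal:

    // Each thread lazily creates and keeps its own flyweight:
    // no sharing, no locks, no contention.
    private static final ThreadLocal<Order> DECODER = new ThreadLocal<Order>() {
        @Override
        protected Order initialValue() {
            return new Order();
        }
    };

    // In any decoding thread:
    final Order order = DECODER.get();
    order.wrapForDecode(buffer, offset, blockLength, version);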

Compiler friendly

The SBE codebase and the generated code contain small methods that are easy for the JIT to optimize. When you look at the assembly generated by HotSpot or the CLR, for instance for the encode and decode methods of the example program, you will see that most, if not all, calls to the car flyweight's methods get inlined.

Here is an example with the original code: [gist id=7957909]

In the non-optimized assembly we can see all the method calls: [gist id=7957916]

These calls get inlined and optimized away in an optimized build: [gist id=7957933]

Note: I noticed that the 64-bit CLR JIT does a very poor job of optimizing this exact same code compared to the 32-bit version. Hopefully RyuJIT will help…

We also have some branching logic in the generated Java code for endianness, but since the schema's endianness is known at compile time (it is defined in the schema) and the platform endianness is constant at runtime, HotSpot optimizes this nicely away.
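
Here is a hedged sketch of the shape of that branch (not the actual generated source). Once the helper is inlined at a call site where schemaOrder is a constant, HotSpot can prove one side of the comparison dead and remove the branch entirely:

    import java.nio.ByteOrder;

    // Illustrative helper: after inlining, both sides of the comparison
    // are constants, so the JIT eliminates the branch.
    static final ByteOrder NATIVE_ORDER = ByteOrder.nativeOrder();

    static int applySchemaOrder(final int rawBits, final ByteOrder schemaOrder) {
        if (NATIVE_ORDER != schemaOrder) {
            return Integer.reverseBytes(rawBits); // a single bswap on x86
        }
        return rawBits;
    }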

Fast array access

Reading integers of different sizes from a byte array in C++ is simple: apply an offset to your byte pointer, cast the pointer to the type you need and dereference it; job done. [gist id=7958191]

Life is not that simple in managed languages: with standard arrays, all accesses pay the cost of bounds checking. To work around this performance limitation, the Java implementation uses the Unsafe class, which basically performs pointer operations under the hood and gets inlined (resulting in the same assembly code as the C++). [gist id=7958273]
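
For illustration, here is a hedged sketch of that technique (not SBE's actual DirectBuffer source): obtain the Unsafe instance reflectively, then read at a raw offset inside a byte[], bypassing bounds checks:

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public final class UnsafeReads {
        private static final Unsafe UNSAFE;
        private static final long BYTE_ARRAY_OFFSET;

        static {
            try {
                final Field field = Unsafe.class.getDeclaredField("theUnsafe");
                field.setAccessible(true);
                UNSAFE = (Unsafe)field.get(null);
                BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class);
            } catch (final Exception ex) {
                throw new RuntimeException(ex);
            }
        }

        // Reads 8 bytes at 'index' with no bounds check: the caller is
        // responsible for staying within the array.
        public static long getLong(final byte[] buffer, final int index) {
            return UNSAFE.getLong(buffer, BYTE_ARRAY_OFFSET + index);
        }
    }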

For .NET we use unsafe code (encapsulated in DirectBuffer) and perform the same operations as the C++ code. [gist id=7958282]

Endianness

Endianness specifies the order in which the bytes of a primitive type are stored. Most hardware uses little endian, while network protocols historically use big endian.

The C++ implementation uses a macro to apply endianness, which compiles down to a single x86 instruction: bswap.

The Java implementation uses Integer.reverseBytes, which HotSpot likewise compiles down to a bswap.
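
For example (illustration only):

    // 72 stored big endian (0x00000048) becomes little endian (0x48000000):
    final int swapped = Integer.reverseBytes(0x00000048);
    // swapped == 0x48000000 : same four bytes, opposite order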

In .NET we have not found any intrinsic access to bswap… If you know of any BCL code which uses it under the hood (or would be optimized by the JIT), let me know!

Performance tip: if you use SBE to exchange data between two boxes (99% of the time that is the case), check the endianness of your hardware, and if it is the same on both boxes, use that endianness for the schema. SBE will have less work to do and will perform better.

Want to learn more about SBE?

Martin will soon publish a blog post about SBE on Mechanical Sympathy; if you are not subscribed yet, you should be!

We will also publish more details and news about SBE on this blog; you can subscribe to our RSS feed.

Follow SBE developers on Twitter:

  • Martin: 
  • Todd: 
  • Olivier (me): 

Let us know what you think.

If you tweet about SBE, please use the #SimpleBinaryEncoder tag; we will consider your tweets for retweeting.

 


Olivier Deheurles

Co-founder and Chief Technology Officer,
Adaptive Financial Consulting