Sub-microsecond P99 is the wrong goal for most matching engines. The right goal is P99.99 under saturation — how the engine behaves when the market is in crisis and the inbound message rate just tripled. Optimizing for the median tells you nothing; you care about the tail, and the tail is where engineering decisions stop being theoretical.
The thing that matters isn't what you think
When people benchmark matching engines, they report median order-to-trade latency. This number is almost useless operationally. A matching engine with a 1μs median and a 500ms P99.99 is strictly worse than one with a 4μs median and a 40μs P99.99. The second one can ship; the first one can't.
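To make the contrast concrete, here is a minimal nearest-rank percentile sketch (the function name and sample values are ours, purely illustrative): a single 500ms outlier among 100 one-microsecond samples leaves the median untouched while the tail percentile is the outlier itself.

```rust
// Nearest-rank percentile over raw latency samples (nanoseconds).
// Sorting per query is fine for offline analysis; a production engine
// would record into an HDR-style histogram instead.
fn percentile(samples: &mut [u64], p: f64) -> u64 {
    assert!(!samples.is_empty() && (0.0..=1.0).contains(&p));
    samples.sort_unstable();
    let idx = ((samples.len() as f64 - 1.0) * p).round() as usize;
    samples[idx]
}

fn main() {
    // A distribution with a healthy median and an ugly tail:
    // 99 orders at 1μs, one order at 500ms.
    let mut lat: Vec<u64> = vec![1_000; 99];
    lat.push(500_000_000);
    println!("p50    = {} ns", percentile(&mut lat, 0.50));
    println!("p99.99 = {} ns", percentile(&mut lat, 0.9999));
}
```

The median reports a healthy engine; the tail reports the truth.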
Here's where that tail latency actually comes from, in roughly descending order of impact:
- Allocator pauses on the hot path — heap allocation during matching is a time bomb
- Lock contention under burst load, even on mostly-read paths
- Kernel syscall overhead (read, write, epoll) when the message rate exceeds what batching can hide
- Cache misses on the order book data structures themselves
- Cross-NUMA memory access when the matching thread and the network interrupt aren't pinned to the same socket
Optimization work that doesn't address items 1–3 is mostly cosmetic.
The techniques that consistently work
These are the ones we reach for on every engine rewrite. In priority order:
Pre-allocated object pools. Never Box::new on the hot path. Order structs, price level nodes, and trade records come out of a pool sized to the expected peak message rate × longest hold time. If you run out of pool capacity, you're already in a bad state — fail loudly.
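A minimal sketch of the pattern (the OrderPool type and its fields are ours, not a library API): every slot is allocated once at startup, acquire and release are index pushes and pops, and exhaustion is an explicit condition the caller can fail loudly on.

```rust
// Hypothetical order struct; real ones carry more fields.
struct Order {
    id: u64,
    price: i64, // fixed-point ticks
    qty: u64,
}

// Fixed-capacity pool: all storage allocated up front, so acquire/release
// never touch the heap allocator on the hot path.
struct OrderPool {
    slots: Vec<Order>, // backing storage, allocated exactly once
    free: Vec<usize>,  // stack of free slot indices
}

impl OrderPool {
    fn with_capacity(cap: usize) -> Self {
        OrderPool {
            slots: (0..cap).map(|_| Order { id: 0, price: 0, qty: 0 }).collect(),
            free: (0..cap).rev().collect(),
        }
    }

    // Returns None when the pool is exhausted — treat that as a hard
    // fault and alert, never grow the pool silently under load.
    fn acquire(&mut self) -> Option<usize> {
        self.free.pop()
    }

    fn release(&mut self, idx: usize) {
        self.free.push(idx);
    }

    fn get_mut(&mut self, idx: usize) -> &mut Order {
        &mut self.slots[idx]
    }
}

fn main() {
    let mut pool = OrderPool::with_capacity(1024);
    let idx = pool.acquire().expect("pool exhausted");
    let order = pool.get_mut(idx);
    order.id = 42;
    order.price = 10_150;
    order.qty = 100;
    pool.release(idx);
}
```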
Lock-free single-producer-single-consumer queues for the network-to-engine handoff. A Vyukov-style SPSC queue with cache-line-aligned slots will outperform any mutex-based queue by two orders of magnitude on the tail.
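A simplified sketch of the idea follows — a Lamport-style SPSC ring with the head and tail indices on separate cache lines (the full Vyukov variant additionally caches the opposing index to cut coherence traffic, which we omit here). Capacity is a power of two so wrapping is a single mask.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

// Pad each index to its own cache line to avoid false sharing between
// the producer and consumer cores.
#[repr(align(64))]
struct CachePadded(AtomicUsize);

struct SpscQueue<T> {
    buf: Vec<UnsafeCell<Option<T>>>,
    mask: usize,
    head: CachePadded, // consumer position
    tail: CachePadded, // producer position
}

// Safe to share across exactly one producer and one consumer thread.
unsafe impl<T: Send> Sync for SpscQueue<T> {}

impl<T> SpscQueue<T> {
    fn new(cap_pow2: usize) -> Self {
        assert!(cap_pow2.is_power_of_two());
        SpscQueue {
            buf: (0..cap_pow2).map(|_| UnsafeCell::new(None)).collect(),
            mask: cap_pow2 - 1,
            head: CachePadded(AtomicUsize::new(0)),
            tail: CachePadded(AtomicUsize::new(0)),
        }
    }

    // Producer side: returns false when the ring is full.
    fn push(&self, v: T) -> bool {
        let tail = self.tail.0.load(Ordering::Relaxed);
        let head = self.head.0.load(Ordering::Acquire);
        if tail.wrapping_sub(head) == self.buf.len() {
            return false;
        }
        unsafe { *self.buf[tail & self.mask].get() = Some(v); }
        self.tail.0.store(tail.wrapping_add(1), Ordering::Release);
        true
    }

    // Consumer side: returns None when the ring is empty.
    fn pop(&self) -> Option<T> {
        let head = self.head.0.load(Ordering::Relaxed);
        let tail = self.tail.0.load(Ordering::Acquire);
        if head == tail {
            return None;
        }
        let v = unsafe { (*self.buf[head & self.mask].get()).take() };
        self.head.0.store(head.wrapping_add(1), Ordering::Release);
        v
    }
}

fn main() {
    use std::sync::Arc;
    let q = Arc::new(SpscQueue::new(1024));
    let prod = Arc::clone(&q);
    let t = std::thread::spawn(move || {
        for i in 0..10_000u64 {
            while !prod.push(i) {} // spin when full; real code would back off
        }
    });
    let (mut sum, mut received) = (0u64, 0u64);
    while received < 10_000 {
        if let Some(v) = q.pop() {
            sum += v;
            received += 1;
        }
    }
    t.join().unwrap();
    let expected: u64 = (0..10_000u64).sum();
    assert_eq!(sum, expected);
}
```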
io_uring or kernel bypass (DPDK) for the network ingress. Regular epoll + read is fine until you're above ~500K msg/s; after that you're paying per-syscall overhead on every burst. io_uring gets you to a few million msg/s on commodity hardware. DPDK takes you further, at significant operational cost.
Per-symbol sharding on multi-core. If your matching is symbol-independent (which it usually is), one hot core per symbol with CPU pinning and NUMA-aware allocation beats any work-stealing scheduler.
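The routing side of that design can be sketched in a few lines (the function is ours; the CPU pinning and NUMA-local allocation themselves are platform-specific — sched_setaffinity, numactl, or a crate like core_affinity — and are omitted here):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Deterministic symbol → shard routing. Every order for a given symbol
// lands on the same shard, so each shard's order books are touched by
// exactly one thread and need no locks. The hash is stable within a
// process run, which is all routing needs.
fn shard_for(symbol: &str, num_shards: usize) -> usize {
    let mut h = DefaultHasher::new();
    symbol.hash(&mut h);
    (h.finish() as usize) % num_shards
}

fn main() {
    let shards = 4;
    for sym in ["AAPL", "MSFT", "TSLA", "AAPL"] {
        println!("{sym} -> shard {}", shard_for(sym, shards));
    }
}
```

Each shard then gets its own SPSC queue and its own pinned core; the router is the only component that sees every symbol.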
Warmup before you measure. Cold caches on the first 10K messages will make every benchmark lie to you. Burn in the hot paths before the measurement window opens.
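A sketch of the discipline, with a hypothetical process function standing in for the real order-handling path: the warmup loop runs the same code as the measured loop, and only the second one is timed.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Stand-in for the real order-handling hot path (hypothetical).
fn process(x: u64) -> u64 {
    x.wrapping_mul(2654435761).rotate_left(13)
}

// Burn in the hot path before the timed window opens, so the first
// measured message doesn't pay for cold i-cache, d-cache, and branch
// predictor state. black_box keeps the compiler from deleting the work.
fn bench(warmup_msgs: u64, measured_msgs: u64) -> Duration {
    for i in 0..warmup_msgs {
        black_box(process(black_box(i))); // executed, never timed
    }
    let start = Instant::now();
    for i in 0..measured_msgs {
        black_box(process(black_box(i)));
    }
    start.elapsed()
}

fn main() {
    let elapsed = bench(100_000, 1_000_000);
    println!("measured window: {:?}", elapsed);
}
```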
What we don't do anymore
A few popular techniques we have consistently regretted:
- SIMD for price comparisons. The gains are real but the code complexity is brutal and the bugs are subtle. We only reach for SIMD if profiling says it's the top bottleneck, which it almost never is.
- Custom allocators. We used to ship a bump allocator for hot-path structures. The gains over a well-tuned object pool were marginal and the ergonomic cost was high. Object pools won.
- Hand-rolled atomic data structures. In 2019 this was necessary. In 2026 the crossbeam and parking_lot crates are so good that writing your own is almost always worse than the library version.
The unglamorous part
The single biggest latency win on our last engine rewrite wasn't any of the above. It was fixing an accidental heap allocation inside a Debug format string on an error path that fired about 1% of the time. P99.99 dropped from 2.1ms to 38μs after removing one format!() call.
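The pattern generalizes: error paths that format eagerly allocate on every failure, even when nobody reads the message. A before/after sketch (function and type names are ours):

```rust
// BEFORE: format! builds a heap-allocated String on every rejection —
// an allocation hiding on what is still the hot path.
fn check_before(price: i64) -> Result<(), String> {
    if price <= 0 {
        return Err(format!("rejected order: bad price {}", price));
    }
    Ok(())
}

// AFTER: a plain enum carries the data with no allocation; rendering the
// message happens later, off the hot path, only if someone looks at it.
#[derive(Debug, PartialEq)]
enum RejectReason {
    BadPrice(i64),
}

fn check_after(price: i64) -> Result<(), RejectReason> {
    if price <= 0 {
        return Err(RejectReason::BadPrice(price));
    }
    Ok(())
}

fn main() {
    assert!(check_before(-1).is_err());
    assert!(check_after(-1) == Err(RejectReason::BadPrice(-1)));
    assert!(check_after(5).is_ok());
}
```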
The lesson, which we relearn every engagement: profile first, theorize second. The hot path of a real system is rarely where your intuition says it is. Reach for perf, reach for flamegraph, reach for tracing with histogram exporters — and only start optimizing after the data tells you where to look.