Discussion (11 Comments) on HackerNews
I managed to hit a p50 round-trip time of 56.5 ns (for 32-byte payloads) and a throughput of ~13.2M round trips per second on a consumer laptop CPU (i7-12650H).
Here are the primary architectural choices that make this possible:
- Strict SPSC & No CAS: I went with a strict Single-Producer Single-Consumer topology. There are no compare-and-swap loops on the hot path. acquire_tx and acquire_rx are essentially just a load, a mask, and a branch using memory_order_acquire / release.
- Hardware Sympathy: Every control structure (message headers, atomic indices) is padded to 128-byte boundaries. False sharing between the producer and consumer cache lines is structurally impossible.
- Zero-Copy: The hot path is entirely in a memfd shared memory segment after an initial Unix Domain Socket handshake (SCM_RIGHTS).
- Hybrid Wait Strategy: The consumer spins for a bounded threshold using cpu_relax(), then falls back to a sleep via SYS_futex (Linux) or __ulock_wait (macOS) to prevent CPU starvation.
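To make the first two bullets concrete, here is a minimal sketch of a CAS-free SPSC ring with 128-byte padding. It is not the project's actual code: the names acquire_tx and acquire_rx come from the post, but their signatures here, along with commit_tx, release_rx, Slot, SpscRing and kPad, are assumptions for illustration. It also lives in ordinary process memory rather than a memfd segment and leaves out the wait strategy.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed padding size, following the 128-byte figure in the post.
constexpr std::size_t kPad = 128;

// One message slot, aligned so neighbouring slots never share a cache line.
struct alignas(kPad) Slot {
    std::array<unsigned char, 32> payload{};  // 32-byte payload, as in the benchmark
};

template <std::size_t Capacity>
class SpscRing {
    static_assert(Capacity != 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two so indices can be masked");

public:
    // Producer only: a load, a mask, and a branch. Returns nullptr when full.
    Slot* acquire_tx() noexcept {
        const std::uint64_t head = head_.load(std::memory_order_relaxed);
        const std::uint64_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) return nullptr;
        return &slots_[head & (Capacity - 1)];
    }

    // Producer only: publish the slot. The release store makes the payload
    // writes visible to a consumer that acquires head_.
    void commit_tx() noexcept {
        head_.store(head_.load(std::memory_order_relaxed) + 1,
                    std::memory_order_release);
    }

    // Consumer only: returns the oldest unread slot, or nullptr when empty.
    const Slot* acquire_rx() noexcept {
        const std::uint64_t tail = tail_.load(std::memory_order_relaxed);
        const std::uint64_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return nullptr;
        return &slots_[tail & (Capacity - 1)];
    }

    // Consumer only: hand the slot back to the producer.
    void release_rx() noexcept {
        tail_.store(tail_.load(std::memory_order_relaxed) + 1,
                    std::memory_order_release);
    }

private:
    // Each index sits on its own 128-byte boundary, so the producer's writes
    // to head_ never invalidate the line holding the consumer's tail_.
    alignas(kPad) std::atomic<std::uint64_t> head_{0};
    alignas(kPad) std::atomic<std::uint64_t> tail_{0};
    alignas(kPad) std::array<Slot, Capacity> slots_{};
};
```

Each index has exactly one writer, so plain load/store pairs are enough and no compare-and-swap is needed. That only holds while there is exactly one producer thread and one consumer thread.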
The core is C++23, and it exposes a C ABI so that other languages can bind to it.
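For readers who have not done this before, a C ABI over a C++ core usually means an opaque handle plus a few extern "C" functions that never let exceptions or C++ types cross the boundary. The sketch below shows that general pattern only. Every name in it (demo_channel, demo_channel_create and so on) is hypothetical, and the Channel class is a one-message stand-in, not the shared-memory channel from the post.

```cpp
#include <cstddef>
#include <cstring>
#include <new>

namespace {
// Stand-in for the C++ core: holds at most one message of up to 32 bytes.
class Channel {
public:
    bool write(const void* data, std::size_t len) noexcept {
        if (full_ || len > sizeof(buf_)) return false;
        std::memcpy(buf_, data, len);
        len_ = len;
        full_ = true;
        return true;
    }
    std::size_t read(void* out, std::size_t cap) noexcept {
        if (!full_ || cap < len_) return 0;
        std::memcpy(out, buf_, len_);
        full_ = false;
        return len_;
    }
private:
    unsigned char buf_[32]{};
    std::size_t len_ = 0;
    bool full_ = false;
};
}  // namespace

extern "C" {

// Opaque to callers: other languages only ever see a pointer.
typedef struct demo_channel demo_channel;

demo_channel* demo_channel_create(void) {
    // nothrow so that allocation failure becomes NULL, not an exception.
    return reinterpret_cast<demo_channel*>(new (std::nothrow) Channel());
}

void demo_channel_destroy(demo_channel* ch) {
    delete reinterpret_cast<Channel*>(ch);
}

// Returns 0 on success, -1 on bad arguments or when the channel is full.
int demo_channel_write(demo_channel* ch, const void* data, std::size_t len) {
    if (ch == nullptr || data == nullptr) return -1;
    return reinterpret_cast<Channel*>(ch)->write(data, len) ? 0 : -1;
}

// Returns the number of bytes copied, or 0 if nothing was read.
std::size_t demo_channel_read(demo_channel* ch, void* out, std::size_t cap) {
    if (ch == nullptr || out == nullptr) return 0;
    return reinterpret_cast<Channel*>(ch)->read(out, cap);
}

}  // extern "C"
```

Bindings in other languages would then declare these four functions through their own FFI mechanism and treat the handle as an opaque pointer.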
I am sharing this here for anyone building high-throughput polyglot architectures and dealing with cross-language ingestion bottlenecks.
Sure, the “hot path” is probably very fast for all, but what about the slow path?
I wouldn't be surprised if somebody develops a cross-language framework with this.