Discussion (11 Comments), originally on Hacker News

riyaneel • 3 days ago
I am the author of this library. The goal was to reach RAM-speed communication between independent processes (C++, Rust, Python, Go, Java, Node.js) without any serialization overhead or kernel involvement on the hot path.

I managed to hit a p50 round-trip time of 56.5 ns (for 32-byte payloads) and a throughput of ~13.2M RTT/sec on a standard CPU (i7-12650H).

Here are the primary architectural choices that make this possible:

- Strict SPSC & No CAS: I went with a strict Single-Producer Single-Consumer topology. There are no compare-and-swap loops on the hot path. acquire_tx and acquire_rx are essentially just a load, a mask, and a branch using memory_order_acquire / release.

- Hardware Sympathy: Every control structure (message headers, atomic indices) is padded to 128-byte boundaries. False sharing between the producer and consumer cache lines is structurally impossible.

- Zero-Copy: The hot path is entirely in a memfd shared memory segment after an initial Unix Domain Socket handshake (SCM_RIGHTS).

- Hybrid Wait Strategy: The consumer spins for a bounded threshold using cpu_relax(), then falls back to a sleep via SYS_futex (Linux) or __ulock_wait (macOS) to prevent CPU starvation.
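
The first two points above (no CAS, 128-byte padding) can be illustrated with a minimal single-process sketch. This is not Tachyon's code: the class name `SpscRing`, the methods `try_push` / `try_pop`, and the fixed `uint64_t` payload are hypothetical, chosen only to show a load, a mask, and a branch with acquire/release ordering. It assumes exactly one producer thread and one consumer thread.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical names throughout; this is an illustration, not the library's API.
constexpr std::size_t kPad = 128;  // padding size quoted in the post

template <std::size_t Capacity>
class SpscRing {
    static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two so a mask can replace modulo");

public:
    // Producer side only. Returns false when the ring is full.
    bool try_push(std::uint64_t value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);  // own index
        const std::size_t tail = tail_.load(std::memory_order_acquire);  // consumer's index
        assert(head - tail <= Capacity);
        if (head - tail == Capacity) return false;         // the branch
        slots_[head & (Capacity - 1)] = value;             // the mask
        head_.store(head + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Consumer side only. Returns std::nullopt when the ring is empty.
    std::optional<std::uint64_t> try_pop() {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);  // own index
        const std::size_t head = head_.load(std::memory_order_acquire);  // producer's index
        if (head == tail) return std::nullopt;
        const std::uint64_t value = slots_[tail & (Capacity - 1)];
        tail_.store(tail + 1, std::memory_order_release);  // hand the slot back
        return value;
    }

private:
    // Each index sits on its own 128-byte-aligned block, so the producer's
    // and consumer's indices never share a cache line.
    alignas(kPad) std::atomic<std::size_t> head_{0};
    alignas(kPad) std::atomic<std::size_t> tail_{0};
    alignas(kPad) std::array<std::uint64_t, Capacity> slots_{};
};
```

With `SpscRing<4>`, four pushes succeed, a fifth returns false until the consumer pops, and values come out in FIFO order. Neither side uses a read-modify-write instruction, because each index has exactly one writer.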

The core is C++23, and it exposes a C ABI to bind the other languages.
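
The memfd plus SCM_RIGHTS handshake from the Zero-Copy point can be sketched with plain Linux system calls. This is an assumption-laden illustration, not the library's handshake: the helper names (`create_segment`, `map_segment`, `send_fd`, `recv_fd`) and the segment name `"ring_sketch"` are hypothetical, and error handling is reduced to return codes. Linux with glibc 2.27 or newer is assumed for `memfd_create`.

```cpp
#include <cstddef>
#include <cstring>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

// Hypothetical helper: create an anonymous shared-memory file of `bytes` bytes.
inline int create_segment(std::size_t bytes) {
    int fd = memfd_create("ring_sketch", MFD_CLOEXEC);
    if (fd < 0) return -1;
    if (ftruncate(fd, static_cast<off_t>(bytes)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Hypothetical helper: map the segment read/write; returns nullptr on failure.
inline void* map_segment(int fd, std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? nullptr : p;
}

// Send one file descriptor over a connected Unix domain socket.
inline bool send_fd(int sock, int fd) {
    char byte = 'F';  // one data byte accompanies the control message
    iovec iov{&byte, 1};
    alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};
    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);
    cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;  // ask the kernel to duplicate fd into the peer
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1;
}

// Receive one file descriptor; returns -1 on failure.
inline int recv_fd(int sock) {
    char byte = 0;
    iovec iov{&byte, 1};
    alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};
    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);
    if (recvmsg(sock, &msg, 0) != 1) return -1;
    cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == nullptr || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS) {
        return -1;
    }
    int fd = -1;
    std::memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

One side would call `create_segment` and `send_fd`; the other calls `recv_fd` and `map_segment` on the descriptor it receives. After that, both mappings refer to the same pages and the socket is no longer needed for data.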

I am sharing this here for anyone building high-throughput polyglot architectures and dealing with cross-language ingestion bottlenecks.

zekrioca • about 3 hours ago
Why report p50 and not p95?
riyaneel • about 2 hours ago
Tail latency is reported as well: p99.9 is 122 ns.
BobbyTables2 • about 2 hours ago
Would be interesting to see performance comparisons between this and the alternatives considered like eventfd.

Sure, the “hot path” is probably very fast for all, but what about the slow path?

riyaneel • about 1 hour ago
eventfd always pays a syscall on both sides (~200-400ns) regardless of load. Tachyon's slow path only kicks in under genuine starvation: the consumer spins first, then calls FUTEX_WAIT, and the producer skips FUTEX_WAKE entirely if the consumer is still spinning. At sustainable rates the slow path never activates.
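
A spin-then-sleep scheme of this kind can be sketched for Linux as follows. It is a sketch under stated assumptions, not Tachyon's implementation: `HybridSignal`, `wait_for_change`, `publish`, `futex_call`, the `sleeping` flag, and the default spin limit are all hypothetical, and sequentially consistent atomics are used for simplicity where a tuned implementation would likely use weaker orderings plus explicit fences. It also assumes `std::atomic<std::uint32_t>` has the same layout as a plain 32-bit word, which holds on mainstream Linux toolchains.

```cpp
#include <atomic>
#include <cstdint>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical spin hint; the post calls its equivalent cpu_relax().
inline void cpu_relax() {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#endif
}

// Thin wrapper over the raw futex syscall (non-private, so the word may
// live in memory shared between processes).
inline long futex_call(std::atomic<std::uint32_t>* word, int op, std::uint32_t val) {
    return syscall(SYS_futex, reinterpret_cast<std::uint32_t*>(word), op, val,
                   nullptr, nullptr, 0);
}

struct HybridSignal {
    std::atomic<std::uint32_t> seq{0};       // bumped by the producer per message
    std::atomic<std::uint32_t> sleeping{0};  // 1 while the consumer may be parked

    // Consumer: block until seq differs from `seen`, then return the new value.
    std::uint32_t wait_for_change(std::uint32_t seen, int spin_limit = 4096) {
        // Fast path: bounded spin, no syscall.
        for (int i = 0; i < spin_limit; ++i) {
            const std::uint32_t now = seq.load(std::memory_order_acquire);
            if (now != seen) return now;
            cpu_relax();
        }
        // Slow path: announce the sleep first, then re-check before parking.
        sleeping.store(1, std::memory_order_seq_cst);
        std::uint32_t now = seq.load(std::memory_order_seq_cst);
        while (now == seen) {
            // Returns immediately if seq no longer equals `seen`.
            futex_call(&seq, FUTEX_WAIT, seen);
            now = seq.load(std::memory_order_seq_cst);
        }
        sleeping.store(0, std::memory_order_seq_cst);
        return now;
    }

    // Producer: publish one message. Returns true only if a wake syscall was
    // issued, which happens only when the consumer has announced a sleep.
    bool publish() {
        seq.fetch_add(1, std::memory_order_seq_cst);
        if (sleeping.load(std::memory_order_seq_cst) == 0) return false;
        futex_call(&seq, FUTEX_WAKE, 1);
        return true;
    }
};
```

When the consumer keeps up, `publish` never enters the kernel: it bumps `seq`, sees `sleeping == 0`, and returns. The consumer sets `sleeping` before its final check of `seq`, so a message published in that window is either seen by the check or triggers a wake.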
JSR_FDED • about 4 hours ago
What would need to change when the hardware changes?
riyaneel • about 2 hours ago
Nothing would need to change. The code follows general hardware principles (cache coherence, locality, ...) rather than software abstractions. That does not mean it targets one dedicated piece of hardware; it is designed for modern CPUs in general.
Fire-Dragon-DoL • about 2 hours ago
Wow, congrats!
riyaneel • about 1 hour ago
Thanks!
Fire-Dragon-DoL • about 1 hour ago
I will be discussing this at work on Monday and will let you know what they think.

I wouldn't be surprised if somebody develops a cross-language framework with this.

riyaneel • 33 minutes ago
Would love to hear the feedback.