Bypassing the kernel for 56ns cross-language IPC

riyaneel•3 days ago

I am the author of this library. The goal was to reach RAM-speed communication between independent processes (C++, Rust, Python, Go, Java, Node.js) without any serialization overhead or kernel involvement on the hot path.

I managed to hit a p50 round-trip time of 56.5 ns (for 32-byte payloads) and a throughput of ~13.2M RTT/sec on a standard CPU (i7-12650H).

Here are the primary architectural choices that make this possible:

- Strict SPSC & No CAS: I went with a strict Single-Producer Single-Consumer topology. There are no compare-and-swap loops on the hot path. acquire_tx and acquire_rx are essentially just a load, a mask, and a branch using memory_order_acquire / release.

- Hardware Sympathy: Every control structure (message headers, atomic indices) is padded to 128-byte boundaries. False sharing between the producer and consumer cache lines is structurally impossible.

- Zero-Copy: The hot path is entirely in a memfd shared memory segment after an initial Unix Domain Socket handshake (SCM_RIGHTS).

- Hybrid Wait Strategy: The consumer spins for a bounded threshold using cpu_relax(), then falls back to a sleep via SYS_futex (Linux) or __ulock_wait (macOS) to prevent CPU starvation.

The core is C++23, and it exposes a C ABI to bind the other languages.

I am sharing this here for anyone building high-throughput polyglot architectures and dealing with cross-language ingestion bottlenecks.

zekrioca•about 3 hours ago

Why report p50 and not p95?

riyaneel•about 1 hour ago

Tail latency is reported as well: p99.9 is 122ns.

BobbyTables2•about 2 hours ago

Would be interesting to see performance comparisons between this and the alternatives considered like eventfd.

Sure, the “hot path” is probably very fast for all, but what about the slow path?

riyaneel•about 1 hour ago

eventfd always pays a syscall on both sides (~200-400ns) regardless of load. Tachyon's slow path only kicks in under genuine starvation: the consumer spins first, then calls FUTEX_WAIT, and the producer skips FUTEX_WAKE entirely if the consumer is still spinning. At sustainable rates the slow path never activates.

JSR_FDED•about 3 hours ago

What would need to change when the hardware changes?

riyaneel•about 1 hour ago

Nothing. The code follows hardware principles (cache coherence, locality, ...) rather than software abstractions. That does not mean it targets one specific piece of hardware; it is designed for modern CPUs in general.

Fire-Dragon-DoL•about 2 hours ago

Wow, congrats!

riyaneel•about 1 hour ago

Thanks!

Fire-Dragon-DoL•44 minutes ago

I will be discussing this at work on Monday and will let you know what they think.

I wouldn't be surprised if somebody develops a cross-language framework with this.

riyaneel•30 minutes ago

Would love to hear the feedback
