ZH version is available. Content is displayed in original English for accuracy.
Advertisement
Advertisement
⚡ Community Insights
Discussion Sentiment
59% Positive
Analyzed from 4125 words in the discussion.
Trending Topics
#length#byte#bytes#encoding#https#bit#data#varint#bits#don

Discussion (82 Comments)Read Original on HackerNews
But testing proved that when you move to SIMD instructions, ULEB128 (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#ty...) or sentinel values (https://github.com/kstenerud/bonjson/blob/main/bonjson.md#lo...) win every time because of the parallelization opportunities.
The true irony is that even SIMD text parsing would outperform this! SIMD is that powerful.
I spent WAYYYYYYYY too much time exploring this...
On top of that, for the vast majority of performance/cost parameter spaces, you're better off both in developer ergonomics and speed/space slapping zstd across a flatter binary format, supposing no better tool fits your use case better. Especially if your messages aren't exceptionally tiny. You're not using them in a raw DB or doing raw bulk analysis on varints (else basically zero choices of parameters make varints win out), so you're transferring them somewhere and decoding them. That decoding step, even for highly optimized solutions like bijou64, is on par with (slightly better than, if you have an older datacenter link) your raw network. If you spend 1s on networking, you spend 1s on parsing. That's a bad tradeoff almost always, and that assumes a good varint solution.
Even when varints make sense for some set of perf/cost parameters, it's still only for developer ergonomics 99.9999% of the time. Even simple changes like operating on a sequence of values rather than a single scalar enable vastly better CPU/space tradeoffs, and being willing to craft a proper data layout usually offers huge gains on top of that.
It's interesting that you pick delta encoding (or, its natural extension, double-delta encoding often being valuable) for time-series databases as an example. That's an obvious case where you have a solution which is extremely cheap in storage/network/CPU. Varints suck comparatively, almost always.
Not to rip on them too much, especially since it's nice to have primitives available which let you not have to do hard thinking for literally every problem, but they're not amazing and not a great default.
It might be slightly more instructions than some other serial VL (variable-length) integer codec choices, but overall I don't think it's more difficult.
The very efficient SIMD VL codecs tend to stripe (separate) the control and data bits, so they're in a different design space anyway.
ULEB128 works in SIMD because there's only one dependent bit per byte, so you can speculatively decode and then correct later cheaply. Bijou requires you to check the first byte and then branch based on the value using all 8 bits in the decision matrix (to handle branches 0-247, 248, 249, 250, 251, 252, 253, 254, 255). This absolutely DESTROYS any parallelization opportunities.
Not to mention that non-canonical sized ints (3, 5, 6, 7) have abysmal performance compared to unaligned 2, 4, and 8 byte reads on modern processors.
Even though decoding the lengths must be serial (since's there's no unambiguous way to differentiate a tag and data byte), it's still doable within the wider SIMD registers, so there's some theoretical efficiency gain to be had (depending on the shape of the data).
On a general note, the continuation bit and prefix byte forms are equivalent, you just broadcast the prefix byte and compare against an increasing vector to convert it to a mask. Yeah, there's probably more fiddly SIMD if there are multiple prefixes in the register, but doable (it's just not byte-parallel, you eg. unroll the serial decode loop 8 times or whatever your maximum output byte width is, and mask out).
Simplified:
Can you explain this part a bit? I feel like intuitively (and therefore probably incorrectly) these should have the same difficulties.
The problem is linking: a compiler needs to emit code into independent translation units, which contain "missing" references to symbols in other translation units, without yet knowing where all the code will end up in the final executable. Since we don't know where the location of other code is yet, we don't know how big the number representing that location is yet, which means that we don't know how wide the variable length encoding of that number will be. If the width changes after linking, then we have to push around the surrounding code to make space for the wider integer. Unfortunately, this changes the location of all the surrounding code, so we have to recompute all the references!
The solution is to always emit un-linked var ints in the widest possible encoding (5 bytes for LEB128) that way when the references are patched during linking, no code is moved around. All integers can be converted to a non-canonical 5 byte form that is "wasteful" but its a worthwhile tradeoff because it solves this issue. Other integers that don't need to be linked can be packed in a smaller var int form to save space.
I'm wary of introducing these forced-canonical encodings by someone hyper focused on "efficiency" and "security" without reconsidering additional use cases.
EDIT: BUT, BER-TLV does permit overlong encodings. And I once found and reported a Yubikey 4 bug related to this. My source code comment for the workaround:
The downside is the encoding size. LEB128 quickly grows to 2 bytes, but stays at 2 bytes all the way to 2^14. This is important if you're using these numbers as tags/identifiers as we were in the multicodec [1] project, or for network message lengths. bijou64 only gives you 500 <= 2 byte numbers.
[1]: https://github.com/multiformats/multicodec
If you only want to encode uint64 numbers LEB128 could easily be tweaked to fit in 9 bytes in several ways:
- using the offset trick described in this article would remove non-unique encodings (0x80 0x00 would encode 128)
- never allowing encodings longer than 9 bytes would mean the MSB of any ninth byte would always be zero, so you could reuse that, and store 8 bits in any ninth byte, for a total of 7 bits in each of the first eight bytes plus 8 in the ninth = 64
Both tweaks would lose LEB128’s property that you can find where each number starts from any byte in the stream, but the encoding discussed here doesn’t have that property either.
This looks neat, but if encoding/decoding performance is important, payload size isn't and the integer is bounded, I would just put a fixed-size integer into the payload as-is.
LEB128 (and JSON for that matter) can encode integer values of arbitrary length. This doesn't, which may or may not be important but it's different.
I'll admit that I do not do any cryptographic work with my library and therefore canonical representations aren't a huge concern in my use-cases. I merely provide various configurable limits (max value length, max depth, max items per collection) in an effort to prevent infinitely long documents from hogging my tokenizers indefinitely.
https://news.ycombinator.com/item?id=44456073 - Corrected UTF-8 (2025-07-03, 54 comments)
This "corrected UTF-8" has other problems, but I thought it's interesting how the shifted-offset idea carries over.
Values 0-127 are a single byte, but if that first byte has the continuation bit set, not only does that indicate the next byte has 7 more bits to contribute, it also moves the base up to the next window.
10000000 00000000 is the only way to represent 128.
10000000 10000000 00000000 is the only way to represent 16512.
Does this encoding have a name?
Clearly not.
vu128: https://john-millikin.com/vu128-efficient-variable-length-in...
metric/imperial varint: https://dcreager.net/2021/03/a-better-varint/
vectorizing VByte: https://arxiv.org/abs/1503.07387
[0]: https://www.sqlite.org/src4/doc/1433690d7b/www/varint.wiki
> A variable-length integer or "varint" is a static Huffman encoding of 64-bit twos-complement integers that uses less space for small positive values. A varint is between 1 and 9 bytes in length. The varint consists of either zero or more bytes which have the high-order bit set followed by a single byte with the high-order bit clear, or nine bytes, whichever is shorter. The lower seven bits of each of the first eight bytes and all 8 bits of the ninth byte are used to reconstruct the 64-bit twos-complement integer. Varints are big-endian: bits taken from the earlier byte of the varint are more significant than bits taken from the later bytes.
from https://www.sqlite.org/fileformat2.html#varint
The one you linked for SQLite4 (abandoned project) is probably a better approach. I recall that the author has said that SQLite3's varint implementation is regretful.
The first is what they describe here: as an attack. It's like why would anyone ever overflow a buffer with shellcode.
The second is that they are implementing a spec that requires appending a varint length-prefixed field to a buffer but don't really care about the space optimization, don't know the field's length when they start appending it, and don't want to put the field into a second, temporary buffer or slide it down into place. https://github.com/FFmpeg/FFmpeg/blob/468a743af1653a08f47081... vs say my own code which does the slide: https://github.com/scottlamb/retina/blob/6972ac4261ce7bf5b58...
If you can choose a fixed number of bytes for the length prefix, you can skip that number, do the encoding and find out the length, and then come back and fill in the length-prefix after.
But you actually don't know how many bytes it will take without doing all of the work to know the payload length (since larger payloads take more bytes to represent the length).
If you allow overlong representation you can reserve a few bytes and sometimes it'll just be the effective no-op bytes. If you don't, you won't be able to.
It's uncommon but I've definitely seen it done (with media containers like Matroska, not actually LEB128) in extremely high-throughput systems that can't spare any cycles.
I happen to be guilty of a variant of this, where I don't bother emitting a 16-bit floating point number instead of a 32-bit one in my CBOR encoder even if it can be represented exactly. That one is laziness.
https://webassembly.github.io/spec/core/binary/values.html
Maybe a robust parser would benefit from a strict mode, disallowing over-wide encoding.
Either way, a properly written decoder (and it's like ten lines) should really not have any problems with it. I was agreeing with you.
Edit: to clarify, I was talking about the author's argument being strange, not yours.
An adveserial package can claim to have a 255 tagged integer but not actually have any followup, tricking the payload parser into an incorrect offset and reading straight off into followup memory.
It's a classic thing to check for when dealing with variable length strings or binary, but it may not cross the mind when it's hiding in the Bijou64_decode(*buff, *cr) function.
In a contrived example of a pbuf {length:int, payload:byte[1]}
LEB128 can trick you into reading the payload as part of the length, but then hopefully trigger a code check against invalid buffer read. (or one byte outside the struct if the payload is also malicious)
Binou64 can trick you to read 7 bytes into other memory, before any buffer size validation is done.
It's then not uncommon to log with a helpful; "buffer with length: 26624894573377(7 bytes of stolen data) is invalid", or just crash.
It's to the point that Bijou64_decode should perhaps take "end_adress" or "max_read" to catch this kind of attack.
(If you dont validate a malicious pbuf, you're in for a bad time regardless of integer format, but these int formats add their own way to trigger a buffer overrun despite a proper check.)
The upsides: the size of the integer is apparent upon reading the first byte, and every number has exactly one canonical representation. I wish C strings had been standardized around something similar, instead on null termination.
> ...adversarial input, which is rarely in the test suite.
This made my scratch my head. My tests for quite pedestrian APIs often contain adversarial input of obvious shapes. I though that for anything security-related (like the author's project) testing against adversarial input would be be a prominent part.
They might have a different definition of adversarial than you.
> My tests for quite pedestrian APIs often contain adversarial input of obvious shapes.
This doesn't seem like what I would call adversarial.
This seems like standard negative testing or boundary value analysis - which I would be shocked if they didn't do.
Forgetting to do the range check on the first_byte==255 case and just letting it do 64-bit wraparound is exactly as much of a plausible bug as missing range checks on LEB128. Any test suite with the goal of covering canonicality will trivially cover both properly; and a programmer that implements things by reading 7 words into the spec, saying "oh yeah I got this" and goes to implement what seems simple, will write a broken version of both.
Perhaps the biggest benefit is just not being associated with a format that tolerates non-canonicality in other places (though, if bijou64 gains traction, it'll only be a matter of time for wraparound-check-less versions to start appearing in places where the wraparound is fine); and I guess also it being less annoying to implement the canonicality check, though hopefully people writing security-sensitive software aren't ones to skip out on correctness checks due to annoyingness.
In a sense, bijou64 could perhaps even be more problematic - it invites not doing any range checks for the smaller inputs because they obviously don't need it, and so you can just forget to special-case the max length case; whereas LEB128 makes you already care about it at the first point it is actually LEB128.
(of course, the format does still have other benefits; enforced canonicality is just...not one of them)
I'm not saying SQLite's varint implementation is ideal for every application. It's just an implementation that is one of the most used implementations, if not the most (I'd bet it is by a large margin though). It just seemed like a missed opportunity to compare it with the implementation they landed on.
EDIT: Just wanted to add, thanks for sharing that link. Interesting!
For example you can use a denormalized zero to signal null.
You can still define a canonical encoding where denormalizations have specific meaning or signal an error.
Why not little endian like modern CPUs?
Given that the context up to this point had been representation of integers, I initially trip on this. :)
* It does not require canonicality, it allows multiple encodings of the same value. To make things even more fun, the QUIC spec requires shortest encoding in some uses but not others.
* It uses 2 bits rather than cutting out a range.
* It only encodes values up to 62 bits long.
So, some similarities but also some differences.
[1]: https://www.rfc-editor.org/rfc/rfc9000.html#name-variable-le...
1: https://kizu.dev/svg-linked-parameters-workaround/ 2: https://www.seaofclouds.com
> This causes problems for signed data if you ever want to do things like compression since you need to know the exact bytes that were signed.
If you are verifying a signature by taking some logical data structure, turning it into a byte string, and calling the verification primitive on those bytes, you likely have a design error. You should instead collect bytes, verify the signature, and then parse the bytes after verifying the signature. And remember to include enough context in those bytes so a different message signed for a different purpose by the same key doesn’t confuse you.
Here's two passing tests for software craftmanship:
1) it looks decades into the past 2) it looks decades into the future
And say you have it as part of some other data. If you want to be able to hash it by the raw memory bytes, many different ways to represent a number becomes a problem.
If you don't do this properly, you end up with things like: - SAML XSW attack due to XML signature wrapping - ASN.1 BER/DER signature forgery - Bitcoin transaction malleability attacks