Discussion (29 Comments)
As for using extended grapheme clusters, it sounds a little bit iffy—maybe possible to use correctly, maybe not, because they’re not stable over time. That style of thing has created some fascinating bugs, like (a few years ago) index corruption in PostgreSQL due to collation changes.
Unicode scalar values are technically-safe: you can’t introduce invalid Unicode. But you can definitely still end up with nonsense.
> We made emoji an atomic node type.
That avoids problems for emoji, but leaves the underlying hazard untouched. I imagine it could still theoretically occur with other text, probably CJK. But probably only theoretically.
> This splits by grapheme clusters rather than code units. No orphaned surrogates, no split emoji. It's what .slice() should have been doing all along, but of course UTF-16 predates emoji by decades.
I do not agree that slice() should operate on extended grapheme clusters. Don’t lump the grapheme cluster/scalar value split in with the sins of UTF-16 and its unreliable code point/code unit split.
UTF-16 was an unforced error (and I still can’t work out why it wasn’t obvious from the start that UCS-2 would never be enough). But the concept of multiple scalars contributing to one logical unit was always inevitable.
Surely certain people did know, but those people weren't in a position to do anything about it.
Specifically, there were surely people who knew that because historical Chinese place names, Japanese nicknames, and so on, were not included in the original "Unicode" (it wasn't called UCS-2 yet) it was insufficient for complete expression of Asian languages.
There were also many people who objected to Han unification, which is a different problem.
But all of these objections were discarded because of the overwhelming mandate for a fixed-width encoding. The original "Unicode" was conceived as a "16-bit" initiative. Its 16-bit-ness was an essential aspect of the design and the Unicode Consortium did what they had to do to fit all scripts and characters "in modern use" into 16 bits.
From the Wikipedia article on Han Unification[1]:
> Some of the controversy stems from the fact that the very decision of performing Han unification was made by the initial Unicode Consortium, which at the time was a consortium of North American companies and organizations (most of them in California), but included no East Asian government representatives. The initial design goal was to create a 16-bit standard, and Han unification was therefore a critical step for avoiding tens of thousands of character duplications.
[1] https://en.wikipedia.org/wiki/Han_unification
Yeah, I think that's fair. I didn't really think this through as I was writing it.
I'm not even so sure "ending up with nonsense" here is the worst outcome. It might be unavoidable with this approach and if that had been the only problem this bug might have been less memorable.
The real problem, which I didn't articulate or emphasize particularly well, was that these invalid surrogate pairs were getting passed into `encodeURIComponent` somewhere deep in the stack, and it was choking catastrophically on them. That was the "real" bug at the end of the day, but the invalid surrogate pairs and the way they were getting created along the way were a fun journey to untangle.
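For anyone who wants to reproduce the failure described above, here is a minimal sketch. It assumes the bad strings came from truncating by UTF-16 code unit with `.slice()`; the sample string and variable names are made up for illustration.

```javascript
const text = "hi \uD83E\uDD20"; // "hi " + U+1F920 stored as a surrogate pair
console.log(text.length); // 5 (code units, not characters)

// Truncating by code unit cuts the pair in half and leaves a lone high surrogate.
const truncated = text.slice(0, 4);
const lastUnit = truncated.charCodeAt(3);
console.log(lastUnit.toString(16)); // d83e

// encodeURIComponent cannot produce UTF-8 for a lone surrogate, so it throws.
let errorName = null;
try {
  encodeURIComponent(truncated);
} catch (err) {
  errorName = err.name;
}
console.log(errorName); // URIError
```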
- https://george.mand.is/invalid-surrogate-pairs/
I thought it was something that's easier to play with and feel than necessarily just read about.
Because invalid UTF-16 strings could show up in places within Windows, someone made a UTF-8 variant called "WTF-8", which allows unpaired surrogates to survive a round trip.
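A rough sketch of the encoding side, to show what "survive a round trip" means in bytes. `encodeWtf8` and `hex` are hypothetical helpers written for this example, not part of any library: a lone surrogate is written with the ordinary three-byte UTF-8 bit layout, while `TextEncoder` replaces it with U+FFFD and loses the original value.

```javascript
// Hypothetical helper: encode a JS string the WTF-8 way. Well-formed pairs
// become normal 4-byte UTF-8; lone surrogates get the 3-byte bit layout.
function encodeWtf8(str) {
  const bytes = [];
  // for...of walks code points: a valid pair arrives as one supplementary
  // code point, an unpaired surrogate arrives on its own.
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return Uint8Array.from(bytes);
}

// Hypothetical helper: format bytes as lowercase hex for display.
const hex = (bytes) => Array.from(bytes, (b) => b.toString(16)).join(" ");

console.log(hex(encodeWtf8("\uD83E\uDD20")));        // f0 9f a4 a0
console.log(hex(encodeWtf8("\uD83E")));              // ed a0 be
console.log(hex(new TextEncoder().encode("\uD83E"))); // ef bf bd
```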
I recently ported a program from Python to Rust and the original author used string regexes. Input and output document encoding mattered but the characters that needed to be matched were always lower ASCII. The Python program could have used binary regexes, but instead forced an input encoding (UTF-8) and made the user choose an output encoding. When the input comes from an unknown process or legacy data, however, you don’t always get the luxury of assuming the encoding. Switching to binary regexes and ignoring encoding altogether simplified logic, eliminated classes of errors, and made the program work in scenarios it couldn’t earlier. Getting rid of the last decoding/encoding code gave me so much relief, especially when all of the wacky encoding tests I had already written continued to work.
If I'm remembering correctly, we briefly explored a solution where we told Python "This is a UTF-16LE encoded string" so the count would match, but I think we learned/realized the endianness is actually dictated by the client's machine (Going from memory here). Ultimately we just changed the solution so the client was the source of truth about lengths and counts.
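To make the mismatch concrete, here is a small sketch of the three different "lengths" one string can have. The sample string is invented; the comparison to Python's `len()` assumes Python 3, where `len()` counts code points.

```javascript
const message = "caf\u00E9 \uD83E\uDD20"; // "café " + U+1F920

const codeUnits = message.length;       // UTF-16 code units, what JS reports
const codePoints = [...message].length; // code points, what Python 3 len() counts
const utf8Bytes = new TextEncoder().encode(message).length;

console.log(codeUnits);  // 7
console.log(codePoints); // 6
console.log(utf8Bytes);  // 10
```

Whichever side is treated as the source of truth, both ends have to agree on which of these units an offset or count refers to.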
These threads are surfacing all kinds of things I forgot about and didn't add in that blog post. Maybe I need to write another, haha.
Emoji are often a sequence of Unicode code points producing a single grapheme. Splitting in the middle of a grapheme will produce two valid strings, but with some funky half-baked emoji. So for a text editor it makes sense to split on grapheme boundaries.
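A short sketch of that with `Intl.Segmenter`, assuming a runtime that ships it. `sliceGraphemes` is a hypothetical helper name. As noted earlier in the thread, cluster rules can change between Unicode versions, so results may differ across runtimes for newer sequences.

```javascript
// U+1F468 ZWJ U+1F469 ZWJ U+1F467: a family emoji built from five code points.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(family)].map((s) => s.segment);

console.log(family.length);      // 8 code units
console.log([...family].length); // 5 code points
console.log(graphemes.length);   // 1 grapheme cluster

// Hypothetical helper: keep at most `count` grapheme clusters from the start.
function sliceGraphemes(str, count) {
  return [...segmenter.segment(str)]
    .slice(0, count)
    .map((s) => s.segment)
    .join("");
}

// Splitting by code point would cut the family apart; this keeps it whole.
console.log(sliceGraphemes("a" + family + "b", 2) === "a" + family); // true
```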
21-bit, actually. It was supposed to be 32-bit, but UTF-16 caps out at 21-bit, so they lopped eleven bits of potential from Unicode (and UTF-8, so no more six-byte encoding).
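The cap falls straight out of the surrogate arithmetic: each surrogate contributes 10 bits on top of an offset of 0x10000. A minimal sketch, where `fromSurrogates` is a hypothetical helper name:

```javascript
// Combine a high and a low surrogate into a code point (UTF-16 arithmetic).
function fromSurrogates(high, low) {
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

console.log(fromSurrogates(0xd83e, 0xdd20).toString(16)); // 1f920
// The largest high and low surrogates give the largest reachable code point.
console.log(fromSurrogates(0xdbff, 0xdfff).toString(16)); // 10ffff
```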
> at some point before Unicode
No, in the early days of Unicode.
> run length encodes
Um… what? RLE is a data compression thing, UTF-16 has nothing to do with it.
Although, conveniently this means that UTF-8 bytes 0xF8 through 0xFF are always nonsense so the third party Rust type `ColdString` uses leading bytes 0xF8 through 0xFF in its 8 bytes of representation to indicate "I am an inline UTF-8 string, but, the UTF-8 starts in the next byte with a total length of N bytes" where N = byte - 0xF8
This leaves the continuation marker bits alone so ColdString can use those in that front byte to indicate "I am not actually inline data, I'm a pointer, rotate me so these indicator bits are my LSB and zero them out to make me a 4 byte aligned pointer".
Which leaves all other 8 bytes values for the valid UTF-8 strings, which all begin with either ASCII or a byte between 0xC2 and 0xF4 inclusive.
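The byte ranges that make this trick possible can be sketched as a classifier for a single UTF-8 byte. This illustrates only the ranges from the UTF-8 definition; it is not ColdString's code, and `classifyUtf8Byte` is a hypothetical name.

```javascript
// Classify one byte by the role it can play in well-formed UTF-8.
function classifyUtf8Byte(byte) {
  if (byte <= 0x7f) return "ascii";
  if (byte <= 0xbf) return "continuation"; // 10xxxxxx, never starts a string
  if (byte <= 0xc1) return "invalid";      // could only start an overlong form
  if (byte <= 0xdf) return "lead-2";
  if (byte <= 0xef) return "lead-3";
  if (byte <= 0xf4) return "lead-4";
  return "invalid";                        // 0xF5..0xFF would exceed U+10FFFF
}

console.log(classifyUtf8Byte(0x41)); // ascii
console.log(classifyUtf8Byte(0xc2)); // lead-2
console.log(classifyUtf8Byte(0xf4)); // lead-4
console.log(classifyUtf8Byte(0xf8)); // invalid
```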
> 21-bit, actually
Less than that. https://en.wikipedia.org/wiki/Code_point#In_character_encodi...:
“The Unicode code space is divided into seventeen planes (the basic multilingual plane, and 16 supplementary planes), each with 65,536 (= 2¹⁶) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112”
That makes it log(1,114,112)/log(2) bits. That’s about 20.09.
(https://www.unicode.org/versions/Unicode17.0.0/ assigns 159,801 of them to characters)
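Both numbers are defensible, depending on what is being measured. A quick check of the arithmetic: the information content is about 20.09 bits, but a whole number of bits is needed to store the largest code point, and that number is 21.

```javascript
const codeSpace = 17 * 65536;      // 1,114,112 code points
const bits = Math.log2(codeSpace); // same as 16 + log2(17)

console.log(codeSpace);                     // 1114112
console.log(bits.toFixed(2));               // 20.09
console.log((0x10ffff).toString(2).length); // 21
```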
Was already bad enough that instead of bytes, we have to worry about code points. Now even that isn’t enough?
It would have been expensive, but all characters should have been fixed-size 64-bit values.
You're making the same mistake that numerous people made before you: thinking that it's as simple as using arrays of large enough numbers. First they thought that two bytes per symbol would be enough, then four. Spoiler alert: it wasn't. And eight won't work either.
It would have been a non-starter, and then we'd all be dealing with Shift-JIS, BIG5, and FSM knows how many different codepages to this day. UTF-8 is about as elegant as it gets, though Java and JS still managed to fuck that up too (they both encode every codepoint outside the BMP as surrogate pairs in UTF-8)
I can’t comment on Java, but JS I know reasonably well and I can’t think of any place it uses CESU-8.
Author went for Intl.Segmenter too: https://github.com/cheeaun/phanpy/issues/1491
[0] But everyone disagrees as to what indexing a string means, so you need to make an actual choice if you want anything involving indexing to match across languages.
Java did not get the memo. Since the char type is fixed at 16 bits, it uses surrogates to encode everything outside the BMP, regardless of the encoding.
It was really `encodeURIComponent` that didn't handle it gracefully.
If you just type this into the console (surrogate pair for cowboy smiley face emoji), you see it encodes it ("%F0%9F%A4%A0"):
encodeURIComponent("\uD83E\uDD20")
If you give it an invalid surrogate pair, it will throw an actual error:
encodeURIComponent("\uDD20\uD83E")
Before I'd looked that up I was going to say: "don't allow an invalid Unicode string to exist at all" feels like a separate, bigger problem to me than "handle it fine when one does get created". To the extent I can hand JavaScript an invalid combination of code units in a variety of other scenarios, returning a � felt fine.
e.g.
// valid
String.fromCodePoint(0xd83e, 0xdd20)
// invalid, but "�" is ... fine?
String.fromCodePoint(0xdd20, 0xd83e)
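One way to get the "� is fine" behaviour out of `encodeURIComponent` is to replace unpaired surrogates with U+FFFD before encoding. This is a sketch of that idea, not the fix from the original post, and `safeEncodeURIComponent` is a hypothetical name. Newer runtimes provide `String.prototype.toWellFormed()`, which does the same replacement; the regex keeps the example usable where that method is missing.

```javascript
// Hypothetical helper: swap each unpaired surrogate for U+FFFD, then encode.
function safeEncodeURIComponent(str) {
  // A high surrogate not followed by a low one, or a low surrogate not
  // preceded by a high one. No "u" flag, so the regex sees raw code units.
  const loneSurrogate =
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;
  return encodeURIComponent(str.replace(loneSurrogate, "\uFFFD"));
}

console.log(safeEncodeURIComponent("\uD83E\uDD20")); // %F0%9F%A4%A0
console.log(safeEncodeURIComponent("\uDD20\uD83E")); // %EF%BF%BD%EF%BF%BD
```

The trade-off is that the replacement is lossy: the original code units cannot be recovered from the encoded output.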