
Discussion (593 Comments)
Here's a way weirder example:
This is totally fine if x is just an int, but the volatile makes it UB. Why? 5.1.2.4.1 says any volatile access - including just reading it - is a side effect. 6.5.1.2 says that unsequenced side effects on the same scalar object (in this case, x) are UB. 6.5.3.3.8 tells us that the evaluations of function arguments are indeterminately sequenced w.r.t. each other.
So in common parlance, a "data race" is any concurrent accesses to the same object from different threads, at least one of which is a write. In C, we can have a data race on a single thread and without any writes!
> It barely scratches the surface.
I agree. The point of the post is not to enumerate and explain the implications of all 283 uses of the word "undefined" in the standard. Nor enumerate all the things that are undefined by omission.
The point of the post is to say it's not possible to avoid them. Or at least, no human since the invention of C in 1972 has.
And if it's not succeeded for 54 years, "try harder", or "just never make a mistake", is at least not the solution.
The (one!) exploitable flaw found by Mythos in OpenBSD was an impressive endorsement of the OpenBSD developers, and yet as the post says, I pointed it at the simplest of their code and found a heap of UB.
Now, is it exploitable that `find` also reads the uninitialized auto variable `status` (UB) from a `waitpid(&status)` before checking if `waitpid()` returned error? (not reported) I can't imagine an architecture or compiler where it would be, no.
FTA:
> The following is not an attempt at enumerating all the UB in the world. It’s merely making the case that UB is everywhere, and if nobody can do it right, how is it even fair to blame the programmer? My point is that ALL nontrivial C and C++ code has UB.
I presume you're referring to this code:
The only signal handler find installs is for SIGINFO, and it uses the SA_RESTART flag, so EINTR can be ruled out. The pid argument is definitely valid as you can't reach the above if it wasn't, and there's no other way for the child process to be reaped[1], so no ECHILD.
A check should probably be added in case the situation changes in the future, triggering spooky action at a distance, or were that code to be copy+pasted somewhere where the invariants didn't hold. But I think the current code in its current context is, strictly speaking, correct as-is.
[1] OpenBSD lacks the kernel features for such surprises that might theoretically be possible on Linux.
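A minimal sketch of the defensive check discussed above. This is not the actual `find` source: `wait_for_child` and `spawn_and_wait` are hypothetical helpers, and the only point illustrated is that `status` is inspected after `waitpid()` has reported success, never before.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the child's exit code, or -1 on any failure. */
int wait_for_child(pid_t pid) {
    int status = 0;  /* initialised, so no indeterminate value is ever read */
    assert(pid > 0);
    if (waitpid(pid, &status, 0) == -1)
        return -1;   /* waitpid failed: status carries no information */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Hypothetical demo: fork a child that exits with `code`, then reap it. */
int spawn_and_wait(int code) {
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code);  /* child process */
    return wait_for_child(pid);
}
```

The sketch does not retry on EINTR; whether that matters depends on the signal handlers installed, as the comment above points out.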
> And if it's not succeeded for 54 years, "try harder", or "just never make a mistake", is at least not the solution.
And I 100% agree. UB is way overused by these standards for how dangerous it is, and as a consequence using C (and C++) for anything nontrivial amounts to navigating a minefield.
It's fair to blame the programmer for the choice of programming in a language like this, if it was in fact their choice. As you've so eloquently put, choosing those languages is essentially equivalent to choosing UB, so starting a new project with one of them is 100% blameworthy when the UB is inevitably found.
What are you talking about? UB was coined only in the first C standard, in 1989. Prior to that there was no "If you do this, anything can happen". It was "If you do this, that will happen".
The reason for the hack is that very early C compilers just always spill, so you can write MMIO driver code by setting a pointer to point at the MMIO hardware and it actually works because every time you change x the CPU instruction performs a memory write.
Once C compilers got some basic optimisations that obvious "clever" trick stops working because the compiler can see that we're just modifying x over, and over and over, and so it doesn't spill x from a register and the driver doesn't work properly. C's "volatile" keyword is a hack saying "OK compiler, forget that optimisation" which was presumably a few minutes work to implement, whereas the correct fix, providing MMIO intrinsics in the associated library, was a lot of work.
Why should you want intrinsics here? Intrinsics let you actually spell out what's possible and what isn't. On some targets we can actually do a 1-byte 2-byte and 4-byte write, those are distinct operations and the hardware knows, so e.g. maybe some device expects a 4-byte RGBA write and so if you emit four 1-byte writes that's very confusing and maybe it doesn't work, don't do that. On some targets bit-level writes are available, you can say OK, MMIO write to bit 4 of address 0x1234 and it will write a single bit. If you only have volatile there's no way to know what happens or what it means.
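As a rough sketch of what a width-explicit interface might look like, assuming nothing about any real library: the helper names below are hypothetical, and they are implemented with `volatile` accesses only as a stand-in for real intrinsics, so the exact instruction emitted is still up to the compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: one function per access width, so the call site
   states which store the device should see. */
static inline void mmio_write8(volatile void *addr, uint8_t v) {
    *(volatile uint8_t *)addr = v;
}

static inline void mmio_write32(volatile void *addr, uint32_t v) {
    assert((uintptr_t)addr % 4u == 0);  /* a 32-bit register access must be aligned */
    *(volatile uint32_t *)addr = v;
}

static inline uint32_t mmio_read32(const volatile void *addr) {
    assert((uintptr_t)addr % 4u == 0);
    return *(const volatile uint32_t *)addr;
}
```

A driver would then write `mmio_write32(reg, rgba)` once rather than four byte stores, which makes the intended access width visible in the source.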
As a nit pick, I don't think this is correct use of "spill". Register spilling refers to when a compiler's code generator runs out of registers and needs to store variables in memory instead. In the MMIO case you are reading/writing via a pointer, so this is unrelated to registers and spilling behavior.
Volatile on a non pointer value is not for MMIO, though, that’s typically for concurrency like with interrupts.
The C and C++ languages would be very slow by modern standards if you insist that reading or writing via a pointer must result in immediate fetches or stores to memory.
> Volatile on a non pointer value is not for MMIO, though, that’s typically for concurrency like with interrupts.
You're holding it wrong. Perhaps you've been holding it wrong for so long and so confidently that you've distorted the world around you -- indeed on MSVC on x86 or x86-64 that actually happened -- but, you're still holding it wrong.
Source?
People used to use it for thread synchronization before proper memory barrier primitives (see https://mariadb.org/wp-content/uploads/2017/11/2017-11-Memor... ) were available. It was not entirely reliable for this purpose.
https://www.gnu.org/software/c-intro-and-ref/manual/html_nod...
You need to distinguish between a UB and a race, and I think that's something that discussions of UB miss. Take any C program and compile it. Then disassemble it. You end up with an Assembly program that doesn't have any UB, because Assembly doesn't have UB.
UB is a property of a source program, not the executable. It means that the spec for the language in which the source is written doesn't assign it any meaning. But the executable that's the result of compiling the program does have a meaning assigned to it by the machine's spec, as machine code doesn't have UB.
A race is a property of the behaviour of a program. So it's true to say that your C program has UB, but the executable won't actually have a race. Of course, a C compiler can compile a program with UB in any way it likes so it's possible it will introduce a race, but if it chooses to compile the program in a way that doesn't introduce another thread, then there won't be a race.
To be pedantic, old hardware like 6502 family chips (Commodore 64, Apple II, etc) had illegal instructions which were often used by programmers, but it was completely up to the chip to do whatever it wanted with those like with UB.
Intentionally, with an expected effect? I'd need a citation for that.
Well, sure, that's what volatile means - that the value may be changed by something else. If it's a global variable then the something else might be an interrupt or signal handler, not just another thread. If it's a pointer to something (i.e. read from a specific address) then that could be a hardware device register whose value is changing.
The concept of a volatile variable isn't the problem - any language that is going to support writing interrupt routines and memory mapped I/O needs to have some way of telling the compiler "don't optimize this out" since reading from the same hardware device register twice isn't like reading from the same memory location twice.
I think the problem here is more that not all of the interactions between language features and restrictions have been fully thought out. It's pretty stupid to be able to explicitly tell the language "this value can change at any time", and for it to still consider certain uses of that value as UB since it can change at any time! There should have been a carve out in the "unsequenced side effect" definitions for volatile variables.
As noted, there are almost 300 uses of the word undefined in the standard. Believing that it's possible to define all the necessary carve outs correctly and have the compiler implement them successfully is about as logical as believing UB is humanly avoidable in written code.
That said, your “common parlance” definition of “data race” is not the definition used by the C standard, so your last sentence is at best misleading in a discussion of standard C.
> The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
(Here “conflicting” and “happens before” are defined in the preceding text.)
However, this is not at all what UB means in C (or C++). The compiler is free to optimize away the entire block of code where this printf() sequence occurs, by the logic that it would be UB if the program were to ever reach it.
For example, the following program:
Can be optimized to always print "y is 8" by a perfectly standard compliant compiler.
I don’t see how. I was trying to explain why it’s reasonable for a volatile read to be a side effect, after which the C rule on unsequenced side effects applies, yielding UB as you say.
> An object that has volatile-qualified type may be modified in ways unknown to the implementation or have other unknown side effects. Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine, as described in 5.1.2.3. Furthermore, at every sequence point the value last stored in the object shall agree with that prescribed by the abstract machine, except as modified by the unknown factors mentioned previously.
A compliant compiler is only free to optimise away, where it can determine there are no side-effects. But volatile in 5.1.2.3 has:
> Accessing a volatile object, modifying an object, modifying a file, or calling a function that does any of those operations are all side effects.
Lots of people mistakenly think that C and C++ are "really flexible" because they let you do "what you want". The truth of the matter is that almost every fancy, powerful thing you think you can do is an absolute minefield of UB.
If you want to be standards correct, yes you have to know the standard well. True. And you can always slip, and learn another gotcha. Also true. But it's still extremely flexible.
Take signed integer overflow, for example. Making it UB might've made sense in the 1970s when PDP-1 owners would've started a fight over having to do an expensive check on every single addition. But it's 2026 now. Everyone settled on two's complement, and with speculative execution the check is basically free anyways. Leaving it UB serves no practical purpose, other than letting the compiler developer skip having to add a check for obscure weird legacy architectures. Literally all it does is serve as a footgun allowing over-eager optimizations to blow up your program.
Although often a source of bugs, C's low-level memory management is indeed a great source of flexibility with lots of useful applications. It's all the other weird little UB things which are the problem. As the article title already states: writing C means you are constantly making use of UB without even realizing it - and that's a problem.
Maybe this already exists, even? A stripped down version of C? A more advanced LLVM IR? I feel like this is a problem that could use a resolution, just maybe not with enough of a scale for anyone to bother, vs. learning C, assembly of given architecture, or one of the new and fancy compiled languages.
[0] https://github.com/project-everest/vale
[1] https://nickbenton.name/coqasm.pdf
Rust is a somewhat more thorough attempt to actually course-correct.
>unsequenced side effects on the same scalar object are UB
>6.5.3.3.8 tells us that the evaluations of function arguments are indeterminately sequenced w.r.t. each other.
Read 5.1.2.4.3:
"If A is not sequenced before or after B, then A and B are unsequenced."
"Evaluations A and B are indeterminately sequenced when A is sequenced either before or after B, but it is unspecified which."
With a footnote saying this:
"9)The executions of unsequenced evaluations can interleave. Indeterminately sequenced evaluations cannot interleave, but can be executed in any order."
I.e., the standard makes a distinction between "unsequenced" and "indeterminately sequenced". And with no mention of side effects on "indeterminately sequenced" evaluations being UB, it leads me to conclude that your example is not UB.
Well, yes; but when the C standard authors wrote like this, they surely had in mind "the reads could be in either order, therefore the output could display the polled values in either order". Not C++ nasal demons.
And yeah, being able to say "reading is a side effect" is important when for example you interact with certain memory-mapped devices.
Edit: thread=thread of execution. I’m not making a point about thread safety within a program.
I'm also not convinced (yet) that the example really is UB: I agree reading a volatile is "a side effect" in some sense, and GP cited a paragraph that says just that. But GP doesn't clearly quote that it's a side effect on the object (or how a side effect on an object is defined). Reading an object doesn't mutate it after all.
But whatever the language-lawyer answer is, the code is obviously broken, with an obvious fix, so I'm not so interested in what its semantics should be. Here is the fix:
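A minimal sketch of that kind of fix, assuming the contested call passed the same volatile object twice as `printf` arguments (the function name `print_twice` is hypothetical):

```c
#include <assert.h>
#include <stdio.h>

/* Contested form: printf("%d %d\n", *reg, *reg);
   Here each read is its own statement, so the two volatile accesses are
   sequenced one after the other before printf is called. */
int print_twice(volatile int *reg) {
    assert(reg != NULL);
    int first = *reg;
    int second = *reg;
    return printf("%d %d\n", first, second);
}
```

Reading into locals also makes the order of the two accesses explicit, which matters if the address is a device register.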
C could've specified something like "arguments are evaluated left-to-right" or "if two arguments have the same expression, the expression is [only evaluated once]/[always evaluated twice]". But it didn't, so the developer is left gingerly navigating a minefield every time they use volatile.
If you are using volatile you are reading from a device port mapped to that address.
Since C doesn't mandate in which order function arguments are evaluated, you don't know which argument will be read from port first.
How can that be anything but UB?
See my comment here - https://news.ycombinator.com/item?id=48205760
The lack of argument sequencing feels utterly petty however.
It is important to understand that this is a C level problem: if you have UB in your C program, then your C program is broken, i.e., it is formally invalid and wrong, because it is against the C language spec. UB is not on the HW, it has nothing to do with crashes or faults. That cast from void* to int* most likely corresponds to no code on the HW at all -- types are in C only, not on the HW, so a cast is a reinterpretation at C level -- and no HW will crash on that cast (because there is not even code for it). You may think that an integer value in a register must be fine, right? No, because it's not about pointers actually being integers in registers on your HW, but your C program is broken by definition if the cast pointer is unaligned.
> an unaligned pointer in itself is UB
Yup. Per the "Actually, it was UB even before that" section in the post.
> UB is not on the HW, it has nothing to do with crashes or faults
Yeah. I tried to convey this too, but I'm also addressing the people who say "but it's demonstrably fine", by giving examples. Because it's not.
It's perfectly reasonable to expect any load through `int*` to just load 4 bytes from memory, done and done. They get surprised that it is far from the whole story, and the result is UB.
Meanwhile, the actual computers we have been using for decades have no problems actually just loading 4 bytes through any arbitrary pointer with zero overhead. But no.
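One commonly used way to get "just load 4 bytes" without forming a misaligned `int*` is to copy through `memcpy`. A minimal sketch (the name `load_u32` is hypothetical; mainstream compilers typically turn the copy into a plain load where the target allows it, but that is an optimization, not a guarantee):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reads 4 bytes starting at p, whatever p's alignment. Uses memcpy instead
   of casting to uint32_t*, so no misaligned uint32_t pointer is ever formed. */
uint32_t load_u32(const unsigned char *p) {
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

The value comes back in host byte order, so parsing a wire format still needs an explicit endianness conversion on top.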
I'd clarify this with "They understand that all values are just bytes".
> Meanwhile, the actual computers we have been using for decades have no problems actually just loading 4 bytes through any arbitrary pointer with zero overhead.
It's partly the standard's fault here - rather than saying "We don't know how vendors will implement this, so we shall leave it as implementation-defined", they say "We don't know how vendors will implement this, so we will leave it as undefined".
A clear majority of the UB problems with C could be fixed if the standards committee slowly moved all UB into IB. It's not that there isn't any progress (signed two's-complement is coming, after all), it's that there is (I believe) much pushback from compiler authors (who dominate the standards) who don't want to make UB into IB.
Not if those 4 bytes span a cacheline boundary, that will most likely result in 1/2 throughput compared to loading values inside a single cacheline. And if it causes cache-misses it takes up twice the L2 or L3 bandwidth.
Even worse, if the int spans two pages, it will need two TLB lookups. If it's a hot variable and the only thing you use from those pages, it even uses up an additional TLB entry, that could otherwise be used for better perf elsewhere, etc.
And if you're on embedded (and many C programs are), Cortex-M CPUs either can't handle unaligned accesses (M0, M0+) or take 2-3 times as long (split the load into 2x2 byte or 1x2 + 2x1 byte)
And crucially until DR#260 https://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_260.htm this was a reasonable guess as to what the pointers are. Probably not a wise guess because it's not how your C compiler worked even then, but a reasonable guess if you didn't think too hard about this.
One way I like to think about this is that all C's types are just the machine integers wearing crap Halloween costumes. Groucho glasses for bool, maybe a Lincoln hat for char, float and double can be bright orange make-up and a long tie. But the pointers are different, because unlike the other types those have provenance.
5 == 5, 'Z' == 'Z', true == true, 1.5f == 1.5f, but whether two pointers are equivalent does not depend solely on their bit pattern in C.
Just creating the pointer, though, should not be UB, even though it apparently is. It should not even be IB.
PCs yes, but there are many other things C is compiled to for which this is not true.
Can someone point to where the standard states this?
This type of UB is fine and nobody really complains about hardware differences leading to bugs.
However, over time aggressive readings of UB evolved C into an implicit "Design by Contract" language where the constraints have become invisible. This creates a similar problem to RAII, where the implicit destructor calls are invisible.
When you dereference a pointer in C, the compiler adds an implicit non-nullable constraint to the function signature. When you pass in a possibly nullable pointer into the function, rather than seeing an error that there is no check or assertion, the compiler silently propagates the non-nullable constraint onto the pointer. When the compiler has proven the constraints to be invalid, it marks the function as unreachable. Calls to unreachable functions make the calling function unreachable as well.
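The pattern described above can be sketched as two hypothetical functions. Whether a given compiler actually removes the late check depends on the compiler and optimization level; the sketch only shows where the standard permits it.

```c
#include <stddef.h>

/* Dereference first, check later: from the read of *p onward the compiler
   may assume p != NULL, so it is allowed to drop the check below. */
int read_then_check(int *p) {
    int v = *p;
    if (p == NULL)
        return -1;
    return v;
}

/* Check first: the dereference is only reached with a non-null pointer. */
int check_then_read(int *p) {
    if (p == NULL)
        return -1;
    return *p;
}
```

Only `check_then_read` may be called with a null pointer; calling `read_then_check(NULL)` is undefined no matter what the later check looks like.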
You're conflating undefined behavior with implementation-defined behavior. If it was only to do with what we think of as normal variance between processors, then it would be easy to make it implementation-defined behavior instead.
The differentiating factor of undefined behavior is that there are no constraints on program behavior at that point, and it was introduced to handle cases where processor or compiler behavior cannot be meaningfully constrained. One key class is of course hardware traps: in the presence of compiler optimizations, it is effectively impossible to make any guarantees about program state at the time of a trap (Java tried, and most people agreed they failed); but even without optimizations, there are processors that cannot deliver a trap at a precise point of execution and thus will continue to execute instructions after a trapping instruction.
It's not only C-level is it. There's no (guarantee across architectures for) machine code for that either.
You can, and the results are machine specific, clearly defined and well-documented. Ancient ARM raises an exception, modern ARM and x86 can do it with a performance penalty. It's only the C or C++ layer that is allowed to translate the code into arbitrary garbage, not the CPU.
But then I would assume you are aware of unaligned pointers, and have a sane way to parse that data, rather than read individual parts of it from a raw pointer.
I am curious, what would be a legitimate reason for an unaligned pointer to int?
-Denial: "I know what signed overflow does on my machine."
-Anger: "This compiler is trash! why doesn't it just do what I say!?"
-Bargaining: "I'm submitting this proposal to wg14 to fix C..."
-Depression: "Can you rely on C code for anything?"
-Acceptance: "Just don't write UB."
Unaligned access? Packed structs. Compiler will magically generate the correct code, as if it had always known how to do it right all along! Because it has, in fact, always known how to do it right. It just didn't.
Strict aliasing? Union type punning. Literally documented to work in any compiler that matters, despite the holy C standard never saying so. Alternatively, just disable it straight up: -fno-strict-aliasing. Enjoy reinterpreting memory as you see fit. You might hit some sharp edges here and there but they sure as hell aren't gonna be coming from the compiler.
Overflow? Just make it defined: -fwrapv. Replace +, -, * with __builtin_*_overflow while you're at it, and you even get explicit error checking for free. Nice functional interface. Generates efficient code too.
The "acceptance" stage is really "nobody sane actually cares about the C standard". The standard is garbage, only the compilers matter. And it turns out that compilers have plenty of extremely useful functions that let you side step most if not all of this. People just don't use this because they want to write "portable" "standard" C. The real acceptance is to break out of that mindset.
Somehow I built an entire lisp interpreter in freestanding C that actually managed to pass UBSan just by following the above logic. I was actually surprised at first: I expected it to crash and burn, but it didn't. So if I can do it, then anyone can do it too.
A better way to think about UB is as a contract between developer and implementation, so that the implementations can more easily reason about the code. How would you optimize:
(x * 2) / 2
An optimizer can optimize this out for a signed integer, because it doesn't have to consider overflow, but with an unsigned integer it cannot. UB is a big reason why C is the most power efficient high level language.
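To make the difference concrete, here is a small sketch with hypothetical function names. The unsigned version has defined wraparound, so it genuinely differs from `x` for large inputs; the signed version may be folded to `x` because overflow is assumed not to happen.

```c
#include <limits.h>

/* Unsigned arithmetic wraps modulo 2^N, so the doubling can lose the top
   bit and the division cannot restore it: for x = UINT_MAX the result is
   UINT_MAX / 2, not x. */
unsigned roundtrip_unsigned(unsigned x) {
    return (x * 2u) / 2u;
}

/* Signed overflow is undefined, so a compiler may assume x * 2 never
   overflows and reduce the whole expression to x. */
int roundtrip_signed(int x) {
    return (x * 2) / 2;
}
```

Calling `roundtrip_signed` with a value above `INT_MAX / 2` is exactly the case the optimizer is allowed to ignore.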
I'd do the math myself and just write x.
I don't even use * for multiplication anymore, I use __builtin_mul_overflow and then check the result. Anyone who doesn't is gonna hit the overflow case one day, and they'll be lucky if their program isn't exploited because of it. I've been making an effort to use all the overflow checking builtins by default in most if not all cases. I've also been making Claude audit every single bare arithmetic operation in my projects. He's caught quite a few security issues already, and overflow checking dealt with them all.
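A minimal sketch of that style, assuming GCC or Clang (`__builtin_mul_overflow` is a compiler extension, and `checked_size` is a hypothetical helper name):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: computes count * elem_size for an allocation.
   Returns false instead of wrapping when the product does not fit. */
bool checked_size(size_t count, size_t elem_size, size_t *out) {
    return !__builtin_mul_overflow(count, elem_size, out);
}
```

A caller would test the return value before passing `*out` to `malloc`, so an oversized request fails cleanly instead of allocating a short buffer.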
This particular contract between developer and implementation is totally worthless and doing more harm than good. It encompasses regular everyday normal things like multiplication and addition. All things that our brains literally rely on in order to reason about the code. Can't even add numbers without the compiler screwing it up.
Programmers need to deal with overflow at all times. Can't calculate an offset without dealing with overflow. Can't calculate a size without dealing with overflow. It's simply everywhere in systems programming, which is what C was designed to do. The consequence of ignoring this is usually that your program gets mercilessly exploited.
All this for some efficiency gains. The cost/benefit analysis is way off here. Things should be correct, first and foremost. Then the compiler should give us the necessary sharp tools to make it fast, if needed. It shouldn't be making it fast at the cost of turning the entire language into a memetic vulnerability machine.
Packed structs are dangerous. You can do unaligned accesses through a packed type, but once you take the address of your misaligned int field, then you are back into UB territory. Very annoying in C++ when you try to pass a misaligned field through what happens to be generic code that takes a const reference, as it will trigger a compiler warning. Unary operator+ is your friend.
Gotta work with the structure directly by taking the address of the packed structure itself.
Taking the address of the field inside the structure essentially casts away the alignment information that was explicitly added to stop the compiler from screwing things up. So it should not be done.
Mercifully, both gcc and clang emit address-of-packed-member warnings if it's done. So the packed structures are effectively turning silently broken nonsense code into sensible warnings. Major win.
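A small sketch of the two access styles, assuming the GCC/Clang `packed` attribute; `struct record` and both function names are hypothetical.

```c
#include <stdint.h>

/* GCC/Clang extension; with packing, `value` sits at offset 1. */
struct __attribute__((packed)) record {
    uint8_t  tag;
    uint32_t value;
};

/* Access goes through the packed struct, so the compiler knows the field
   may be misaligned and emits a suitable load. */
uint32_t record_value(const struct record *r) {
    return r->value;
}

/* To hand the field to code expecting a uint32_t*, copy it to an aligned
   local first instead of passing &r->value. */
uint32_t record_value_plus_one(const struct record *r) {
    uint32_t local = r->value;
    const uint32_t *p = &local;
    return *p + 1u;
}
```

Writing `&r->value` directly is the form that the address-of-packed-member warning is meant to catch.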
It can be left as implementation defined, which means that the compiler can't simply do arbitrary things, it needs to document what it would do.
Take, for example, signed-integer overflow: currently a compiler can simply refuse to emit the code in one spot while emitting it in another spot in the same compilation unit! Making it IB means that the compiler vendor will be forced to define what happens when a signed-integer overflows, rather than just saying, as they do now, "you cannot do that, and if you do we can ignore it, correct it, replace it or simply travel back in time and corrupt your program".
> Somehow I built an entire lisp interpreter in freestanding C that actually managed to pass UBSan just by following the above logic. I was actually surprised at first: I expected it to crash and burn, but it didn't. So if I can do it, then anyone can do it too.
Same here; I built a few non-trivial things that passed the first attempt at tooling (valgrind, UBsan with tests, fuzzing, etc) with no UB issues found.
So we have the next best thing: builtins and flags. So long as those cover all the undefined behavior there is, we can live with it. Compiler gets to be "conformant" and we get to do useful things without the compiler folding the code into itself and inside out.
It does say so, actually, since C99 TC3 (DR 283).
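For reference, the union idiom under discussion looks roughly like this. It is a sketch in C only (C++ treats the same code differently), `float_bits` is a hypothetical name, and the bit patterns mentioned in the note below assume IEEE 754 single precision.

```c
#include <stdint.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes 32-bit float");

/* Reads the object representation of a float through a different union
   member than the one last written. */
uint32_t float_bits(float f) {
    union { float f; uint32_t u; } pun;
    pun.f = f;
    return pun.u;
}
```

On an IEEE 754 target, `float_bits(1.0f)` yields `0x3F800000`.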
Something that bothers me is the Venn diagram of people that think abstraction is slow and error prone and people that only write portable C.
How many C implementations do you actually need to compile against? I don't think I've seen more than 3 outside Unix software from the 90s. Using non portable extensions is in fact totally doable for your application and you should probably do it, and just duplicate/triplicate code where you have to. It's not that hard to write and not hard to read.
Nowadays most are indeed clang and GCC forks, or MSVC.
> -Acceptance: "Just don't write UB."
The point of my article is that this is not possible. This cannot be our end state, as long as humans are the ones writing the code. No human can avoid writing UB in C/C++.
Ok, let's try it. I pointed GPT 5.5 at the smallest part of cosmopolitan that I could find in two seconds, net/finger. 299 lines.
describesyn.c:66: q + 13 constructs a pointer that can point well beyond the array plus one element.
C23 6.5.6p9:
> If the pointer operand and the result do not point to elements of the same array object or one past the last element of the array object, the behavior is undefined
Now… you may be trolling, but I do feel like this disproves your assertion. Not you, not me, not Theo de Raadt, can avoid UB.
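One way to stay inside the rule quoted from C23 6.5.6p9 is to compare remaining length instead of forming the far pointer. A sketch with a hypothetical helper, not taken from the cosmopolitan source:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if at least n elements remain in [q, end).
   Compares lengths, so q + n is never computed when it would be out of range.
   Requires q and end to point into (or one past) the same array, q <= end. */
bool has_remaining(const char *q, const char *end, size_t n) {
    return (size_t)(end - q) >= n;
}
```

Code that wants to look at `q + 13` would call `has_remaining(q, end, 13)` first and only then do the pointer arithmetic.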
> the compiler generating code that checks for pointer overflow.
Do you need to check for that specifically? What pointer are you constructing that is not either pointing at a valid object correctly aligned (not UB), or exactly one past the element of an array?
Do you mean for the latter, in case you have an array that ends on the maximum expressible pointer address?
I'm a bit unclear on what you mean by "pointer overflow". From mentioning 56 bit address spaces I'm guessing you mean like the pointer wrapped, not what I pointed to in cosmopolitan, above?
Ok, to be clear that it's not just that one type, if you forgive that one:
net/http/base32.c:64: read sc[0] even if sl=0. I assume this is never called with sl=0, so could be fine.
net/http/ssh.c:355: pointer address underflow? Should that be `e - lp`?
net/http/ssh.c:209/229: double destroy of key. can this code path have non-null members, meaning double free? Looks like it, since line 207 does the parsing and checks that parse worked.
net/http/ssh.c:123: uses memset, which assumes that it sets member variable pointers to NULL (per my post, depending on that means depending on UB), and later these pointers are given to free(), so that's UB.
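For the `memset` point in particular, a sketch of the alternative: let the language zero-initialise the struct, which gives pointer members a null pointer value regardless of how NULL is represented. `struct key` and `key_init` are hypothetical stand-ins, not the code being reviewed.

```c
#include <stddef.h>

/* Hypothetical struct standing in for a parsed key with owned buffers. */
struct key {
    char *name;
    unsigned char *blob;
    size_t blob_len;
};

/* Assigning a zero-initialised struct sets pointer members to null pointers
   by the language rules, independent of the bit pattern of NULL. */
void key_init(struct key *k) {
    *k = (struct key){0};
}
```

After `key_init`, passing `k->name` or `k->blob` to `free()` is well defined, because `free(NULL)` is a no-op.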
I won't look deeper into net/http, but presenting just the possibly incorrect remaining comments from jippity:
Does this depend on the project, or part of a project? I'm wondering how far that scales, I don't know how labor intensive it is -- maybe you can just look at the output and see that nothing funny is happening?
Just switch to a saner language.
And before I get attacked for being a Rust shill, I meant Java :P
The bar is so low it's floating near the center of the Earth.
If all you want is C but less insane then the obvious answer here is Zig.
If all somebody wants is a turnkey replacement for the C/C++ ecosystem, then there is nothing like that in the world that I’m aware of.
And where's the fun in that?
Undefined behaviour means "we couldn’t settle on a best default trade-off with fine-tuning as a given option so we left everyone in the unknown".
Because the last time I looked it appeared to need some godawful slow bytecode interpreter that took up thousands of kilobytes of RAM.
https://www.graalvm.org/latest/reference-manual/native-image...
In the past you could use e.g. Excelsior JET.
Did you last look at Java 1.2 in 1998? Because since then there has been a compiler that produces very efficient profile-guided-optimized code and does tricks like de-virtualization, which is not possible for a static compiler that supports multiple compilation units (like C++).
Really, there was a time in history when HotSpot-compiled JVM bytecode was faster than everything that gcc could produce for comparable tasks. Yes, that gap has reversed again, as both gcc and clang have become much more clever, but the gap is still not very wide.
Or you just don't skip the introductory pages, which tell you what the language philosophy of C is, and why there is UB. Yes, UB can be a struggle, but the first four steps are entirely unnecessary. It means that you do not actually understand the core concepts of the very same language you are using, which is kinda stupid.
When that started happening people became alarmed (oMG UB iS TeH BAD!) and since some old UB machines still had industry support (of organisations that actually participated in ISO meetings instead of arguing online) there was never any movement on defining de-facto usage as de-jure and the alarmist position became the default.
Personally I think the industry would've benefited from a Boring C (as described by DJB) push by people that would've created a public parallel "de-jure" standard that would've had a chance to be adopted by compiler creators.
I guess I am too young, and also too much a purist, because I start from the impression of what the language is, not what the implementations happen to do.
> Personally I think the industry would've benefited from a Boring C (as described by DJB) push by people that would've created a public parallel "de-jure" standard that would've had a chance to be adopted by compiler creators.
-O0
http://archive.adaic.com/standards/83lrm/html/lrm-11-01.html
"STORAGE_ERROR This exception is raised in any of the following situations: (...) or during the execution of a subprogram call, if storage is not sufficient."
It's not safe though because throwing an exception, panicking, etc, is still a denial of service. It's just more deterministic than silently overwriting the heap instead. If the program is critical then you need to be able to statically prove the full size of the stack, which you can do with C and C++ with the right tools and restrictions.
The Ada language specification says the Ada programmer can expect any Ada compiler when used in fully compliant mode to properly raise STORAGE_ERROR when a stack overflow occurs.
Only the Ada compiler writer has to deal with this, not every single programmer on every single program and platform (the UB behaviour of some languages).
In the case of GCC/GNAT the compiler manual provides insight on how to be in compliant mode per target regarding stack overflow, and what the limitations are, if any. You have tools to monitor and analyze your Ada code in this respect too.
If a segfault leads to some other state you do not deem "safe", such as a single program gating access to a valuable asset with a default fail state of "allow", you just have a fundamental design flaw in your system. The safety problem is you or your AI agent, not the segfault.
First, you can define what happens when stack space is exceeded. Second not all programs need an arbitrary amount of stack space, some only need a constant amount that can be calculated ahead of time. (And some languages don't use a stack at all in their implementations.)
Your language could also offer tools to probe how much stack space you have left, and make guarantees based on that. Or they could let you install some handlers for what to do when you run out of stack space.
How to think of this properly is that when you have UB, you are no longer under the auspices of a language standard. Things may work fine for a time, indefinitely even. But what happens instead is you unknowingly become subject to whimsies of your toolchain (swap/upgrade compilers), architecture, or runtime (libc version differences).
You end up building a foundation on quicksand. That's the danger of UB.
Tbh, already the first example (unaligned pointer access) is bogus and the C standard should be fixed (in the end the list of UB in the C standard is entirely "made up" and should be adapted to modern hardware, a lot of UB was important 30 years ago to allow optimizations on ancient CPUs, but a lot of those hardware restrictions are long gone).
In the end it's the CPU and not the compiler which decides whether an unaligned access is a problem or not. On most modern CPUs unaligned load/stores are no problem at all (not even a performance penalty unless you straddle a cache line). There's no point in restricting the entire C standard because of the behaviour of a few esoteric CPUs that are stuck in the past.
PS: we also need to stop with the "what if there is a CPU that..." discussions. The C standard should follow the current hardware, and not care about 40 year old CPUs or theoretical future CPU architectures. If esoteric CPUs need to be supported, compilers can do that with non-standard extensions.
For most C software on x86_64, UB is "fine" with very strong bunny ears. But it is preferable for one to, shall we say, write UB intentionally rather than accidentally and unknowingly. Having an awareness of all the minefields lends for more respect for the dangers of C code, it makes one question literally everything, and that would hopefully result in more correct code, more often.
On that note, on some RISC-V cores unaligned access can turn a single load into hundreds of instructions.
I think the problem is just that C is under specified for what we expect a language to provide in the modern age. It is still a great language, but the edges are sharp.
However I do agree that just saying "the behaviour is undefined" is an unhelpful cop-out. They could easily say something like "non-atomic misaligned accesses either succeed or trap" or something like that.
> In the end it's the CPU and not the compiler which decides whether an unaligned access is a problem or not.
Not just the CPU - memory decides as well. MMIO devices often don't support misaligned accesses.
An honest discussion would be something more like 'dereferencing pointers can lead to UB on invalid pointers. Here are N examples of that. Maybe avoid using pointers. Maybe consider how other languages avoid pointers. Maybe these shouldn't be UB and instead some other class of error.' And then even more honest discussion would present the upsides of having pointers and the upsides of having these errors be UB.
Instead, the article (and your comment) take this valid operation and presents it as invalid. Imagine you're a new programmer, you are just starting to wrap your head around pointers and you stumble across this article. You see the first example and it looks exactly what you would expect a dereference to look like. But the article claims it's wrong, and now you're confused. So you dig into the article more closely and are exposed to all these terms like UB, alignment, type coercion etc and come away more confused and scared and disinclined to understand pointers. This is classic FUD. This is a technique to manipulate, not educate.
Pointers have pros and cons. UB has pros and cons. Let's try to educate people about them.
UB simply means the operation you are intending to perform has no defined semantic under the ISO C specification. That is all. Understand what this means but do not read further into it. It is easy to read further into this as you have and many do, and come to incorrect conclusions, and think this MUST result in incorrect behaviour, but this is not the claim. The claim is rather that once you write UB, you are no longer writing C the language with a defined spec, and that any manner of degrees of freedom (architecture, toolchain, etc) can now cause your code that was once behaving correctly to now behave incorrectly. That is the danger.
> That is a valid operation. Now if that pointer isn't valid (and being unaligned is one of many reasons it could be invalid) then calling the function with that invalid pointer will be UB.
This is incorrect. The moment you express this in source code, it is already UB with respect to the C abstract machine.
6.3.2.3. 755 If the resulting pointer is not correctly aligned for the pointed-to type, the behavior is undefined.
https://c0x.shape-of-code.com/6.3.2.3.html
The important distinction is to KNOW this is still UB; whether the operation yields the expected behaviour on your platform and architecture is completely a separate question.
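To make the quoted rule concrete, here is a minimal sketch of reading a 32-bit value from a possibly unaligned address without ever forming a misaligned `uint32_t *`. The helper name `load_u32` is made up for illustration; the value is read in the host's native byte order.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a uint32_t from an address of any alignment.
 *
 * The tempting version, *(const uint32_t *)p, is undefined per 6.3.2.3
 * whenever p is not suitably aligned for uint32_t, even on CPUs that
 * tolerate the access. Copying the bytes is defined for every alignment,
 * and compilers commonly turn the memcpy into a single load where the
 * target allows it. */
uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Whether the cast version "works" on a given machine is the separate question mentioned above; the `memcpy` version does not depend on the answer.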
The reason this is of utmost importance is that the C compiler operates on the C abstract machine.
If you violate language invariants, the compiler can--keyword can--emit WRONG code and it will be CORRECT to do so because C unfortunately allows it to. When this happens it's silent and deadly and it's a pain to debug. The point of all this seeming language lawyering is not FUD, it is genuine frustration with these footguns of the language that we are trying to share with others. Understanding UB correctly really is what separates those that know C and those that "know" C.
Things will work and then they won't. This can be fine for most cases but not fine for others. If you use C in 2026 you need to understand this.
> come away more confused and scared
This is the correct take. One ought to be more confused and scared after learning about UB; the language simply leaves things under-specified and it is up to the developer to understand they are engaging in UB.
Once UB is acknowledged, one ought to impress upon themselves that the software they build is dependent ever more on the whims of their particular compiler (clang/gcc), compiler flags (optimizations), architecture, and runtime environment.
https://lists.isocpp.org/std-proposals/2020/05/1322.php
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p28...
In worse scenarios, your programme will silently continue with garbage, or format your hard disk or give attackers the key to the kingdom.
Another commenter suggested using LLMs, but I disagree. Having clangd emit warning squiggles for unchecked operations (like signed addition) would be a good start.
Dead code elimination is essential for performance, especially when using templates (this is basically what enables the fabled "zero cost abstraction" because complex template code may generate a lot of 'inactive' code which needs to be removed by the optimizer).
The actual issue is that the compiler is free to eliminate code paths after UB, but that's also not trivial to fix (and some optimizations are actually enabled by manually injecting UB, like `__builtin_unreachable()`, which can make a measurable difference in the right places).
Note that the compiler can also emit code paths before UB, as UB is a property of the whole program, not just of a single statement.
before.
It's reasonable that such people would also be interested in design aspects of languages, and UB in C is in that field. Though I would argue that a lot of it was originally accommodating old CPU architectures without compromising performance too badly, and about as much a "design choice" as wheels being round...
There were a few high-profile "scandals" around GCC 3.2 (IIRC) because the compiler finally started much more aggressively using UB in optimizations, which was a reason that lots of people stayed on GCC 2.95 for a very long time. GCC 3.2 came out in 2002.
I have lots of my code running day-in, day-out on literally hundreds of millions of machines. The approach to "getting it working" is exactly OP's.
I'll admit to being pretty defensive and anal in checking values and return-codes (more so than most, I suspect), and I'm a firm believer in KISS principles in software engineering ("solving hard problems with complicated code is easy, solving them with simple, understandable algorithms is the hard bit") but generally there's no real difference in approach to the code I write to work on my workstation, and the code I write to work in the field.
Every company keeps harping on about safety and being exposed (being in the news): so the narrative against 'unsafe' is up the wazoo.
The new world is basically a bunch of city dwellers who haven't seen raw nature and you show them a lawn mower, they freak out. Blades that spin?!?!?! Madness!!
Can't talk about C without CVE.
If you think C is the problem, you'll come to the eventual conclusion that humans are the problems, and greed. Don't hate the player, hate the game etc.
C was invented so you don't have to write assembly. It wasn't invented to expose devices to billions of other devices.
The real answer is that proponents of languages like C seem to completely disregard the dangers/difficulty of hitting/difficulty of fixing UB. Proponents of languages like Rust overstate it instead. Pointless wars/drama is fun to read and gets clicks.
Understanding three important concepts properly in C allows one to easily identify what can/cannot result in UB viz. 1) Expressions 2) Statements 3) Sequence Points and "Single Update Rule". It is not that hard at all.
I wrote about it here with links to further reading provided - https://news.ycombinator.com/item?id=48144734
Although I haven’t noticed a spike the last 6 months, just a slowly increasing realization that C isn’t fit for humans and should go the way of asbestos: Don’t use it for anything new, and remove it where it already exists, unless doing so would be too expensive or disruptive.
Personally I like C because you should have a good idea of what it's going to do. Other languages feel like a black box, and I start having to fight them far too often. But I say that as a hacker of low level stuff, not as someone who's paid and working on higher level stuff, so that is probably a niche view.
2. You don't really appreciate the issue. Signed integer overflow is undefined. If you check for that overflow after the fact the compiler can, and demonstrably has pretended that the overflow can't happen and optimised away your overflow check.
You may not even come across that failure mode to know to 'fix' it. And good luck finding the issue unless you know about UB and what the compiler can and will do in such situations.
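A minimal sketch of the difference between checking after the fact and checking before. The name `checked_add` is hypothetical; GCC and Clang also provide `__builtin_add_overflow` as an extension for the same job.

```c
#include <limits.h>
#include <stdbool.h>

/* Broken pattern, shown only as a comment:
 *
 *     int sum = a + b;              // overflow here is already UB
 *     if (a > 0 && b > 0 && sum < 0)
 *         handle_overflow();        // may be optimised away
 *
 * Since signed overflow "cannot happen", the compiler may treat the
 * test as always false. */

/* Hypothetical helper: decide whether a + b fits in an int *before*
 * adding, so no overflowing expression is ever evaluated.
 * Returns true and stores the sum in *out on success. */
bool checked_add(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return false;
    *out = a + b;
    return true;
}
```

The subtractions in the guard cannot overflow themselves: `INT_MAX - b` is only evaluated for positive `b`, and `INT_MIN - b` only for negative `b`.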
After switching to Rust five years ago I agree with all the Rust hipsters as far as disliking those languages go.
I just don't talk about it a lot. If every Rust person I know that was a C/C++ developer before was as outspoken about what they think of the latter, you'd see that these people are a majority.
We're just old hands who like to use stuff that works. And most of us don't get attached to code or languages.
It's also difficult to admit to yourself that you were never in command of a language as far as UB/other footguns go, as much as you thought. Or ever, for your entire career. For me that self-realization about C/C++ (enabled by Rust) was a turning point.
Lately you can read about the dichotomy re. AI use.
I.e. developers who define themselves through what they build/ideas are embracing LLMs, for what they can do.
I.e.: I am what I build.
Whereas developers for whom software engineering is a craft that defines them hate them openly.
I.e.: I am how I build.
Now this seems to suggest to me that maybe Rust developers who openly hate C/C++ squarely belong to the latter group whereas the silent ones belong to the former. It's builders vs programmers. Just different world views.
Also you can not dislike something and still not speak about it. Because you decided to not care.
Just like TypeScript can't get rid of JavaScript WATs.
I trust your historical C usage was more productive than that..
It’s not. All that matters is what C compilers actually do and what real C programs expect.
This is a good thing. It creates a culture where the two sides meet each other where they’re at
The other obvious issue with the overall perspective is that C and C++ are being thrown together directly as if somehow they’re nearly the same language, but they are really very far apart nowadays.
Most C++ today will be immediately obvious and not accidentally mixed up with C.
So I see your counter points are all "so just don't do that, then".
And the point of my post is that this particular "just don't do that, then" has never been achieved by humans.
If there's no example of a program without these bugs in a language, then I do think it's fair to blame the language. A knife with 16 blades and no handle.
> Expecting C to handle "address zero" in physical memory in ways that conflict with NULL in source code denotes a complete lack of understanding of what a program is.
Like the post says, it's rare that programmers actually want a pointer to memory address zero. But in my experience most programmers who even encounter that have this "complete lack of understanding", as you put it.
As I stated:
> The following is not an attempt at enumerating all the UB in the world. It’s merely making the case that UB is everywhere, and if nobody can do it right, how is it even fair to blame the programmer? My point is that ALL nontrivial C/C++ code has UB.
It's about that point, not about how to avoid it. Because you can't.
Signed vs unsigned chars, and the accompanying extension rules, have already bitten me switching between x86/ARM compilers. Confused the hell out of me when I was just starting out with C.
If you're going to interpret C as in "C on amd64, running on Linux 7.0 on an Arrow Lake Intel processor" then yes, you can get away with a lot of UB. That mitigates the problem but doesn't make it go away.
That word is carrying a lot of weight here. Compilers are unbelievably complex these days, and it's impossible for any one human to fully understand the entire compilation process, including the effects of any arbitrary combination of compiler flags.
Any assumptions you have about what the compiler does in the face of UB will collapse on the next patch release of that compiler, or the moment somebody changes the compiler flags, or the moment somebody tries to compile the code for a slightly different OS, not to mention architecture.
There is no other way to understand what C compilers do than reading the standard.
The article suggests using LLMs to identify and fix UB. However as per the above, I think the issue is that we need more expert humans.
LLM generated code will eventually contain UB.
EDIT: added "eventually"
https://gist.github.com/Earnestly/7c903f481ff9d29a3dd1
But don't misunderstand the goal of that: C and C++ will never get rid of UB. The result of dereferencing an invalid pointer is UB, will always remain UB, and really cannot be anything other than UB.
And it also signals that you actually do want to improve, just a little bit of boy scout rule goes a long way.
> The article suggests using LLMs to identify and fix UB. However as per the above, I think the issue is that we need more expert humans.
Yup. But the point of the article is that even expert humans cannot do this alone. And as I wrote, LLM+junior won't suffice either. We need LLM+senior experts.
And it's a problem that we have way more existing UB than expert capacity.
Now, will LLMs and experts both miss UB in some cases? Of course. There's no 100% solution. But LLMs, I claim, will find orders of magnitude more, with low false positive, than any expert. Even if these expert humans (like in the OpenBSD case for the two bugs I found, one of which was UB) are given more than three decades to do it.
I didn't even use the best model, complex code target, or time. I just wanted to choose a target that has a high chance of having very good experts already having audited it.
Yes.
Even in languages other than C (i.e. you will get behaviour that nothing in the input specified).
When LLMs generate code, all languages have UB.
UB means literally no restrictions. So if your standard says 'you have to crash with an error message' that's already no longer UB.
Sure. For crashes. But when you instruct an LLM to do something, the output is probabilistic, so you may get behaviour that is unexpected and/or unwanted.
Like storing security tokens in code. Or nuking the production database.
Part of the reason for all the UB in OpenBSD is that UBSan doesn't run on that platform. When I ported OpenBSD's httpd to Linux, I found that UBSan tripped before the server even came up because the config flag parsing shifts into the MSB of a signed integer.
I tried to contribute back a patch (just make the flag bitfield unsigned), but it was ignored. I think if UBSan ran natively on OpenBSD, then there would be a lot more of these patches, and the maintainers would have to take an official stance on whether they think these bugs matter.
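For readers who haven't seen this class of bug, a sketch of the pattern and the unsigned fix. The function names are invented for illustration and are not OpenBSD httpd's actual definitions.

```c
#include <stdint.h>

/* The problematic form, shown only as a comment:
 *
 *     int flags = 1 << 31;
 *
 * Where int is 32 bits, 1 is a signed int and 2^31 is not representable
 * in it, so the shift is undefined. This is what UBSan reports. */

/* Hypothetical flag helpers using an unsigned 32-bit bitfield.
 * Shifting an unsigned 1 into bit 31 is fully defined.
 * n must be in the range 0..31. */
uint32_t flag_bit(unsigned n)
{
    return UINT32_C(1) << n;
}

uint32_t set_flag(uint32_t flags, unsigned n)
{
    return flags | flag_bit(n);
}

int has_flag(uint32_t flags, unsigned n)
{
    return (flags & flag_bit(n)) != 0;
}
```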
"Explicit casts only" worked fine in Modula-2, which doesn't have as many scalar types.
Doesn't matter though because you aren't writing standards conforming C. You're writing whatever dialect your compilers support, and that's probably (modulo bugs) much better behaved than the spec suggests.
Or you're writing C++ and way more exposed to the adversarial-and-benevolent compiler experience.
The type aliasing rules are the only ones that routinely cause me much annoyance in C and there's always a workaround, whether if it's the launder intrinsic used to implement C++, the may_alias attribute or in extremis dropping into asm. So they're a nuisance not a blocker.
edit: I'm not sure it's even undefined in C.
The part about hardware is wrong BTW. In all the cases about null pointers and out-of-bounds access and integer overflow and whatnot, the hardware semantics are clearly defined, and the assembler code does exactly what is written. The way modern compilers act on your code makes C less safe than assembler in that sense.
> The part about hardware is wrong BTW
Could you be more specific? I think by "wrong" you may mean "not actually relevant to UB", and you're right about that. If that's what you mean then that part is not for you. It's for the "but it's demonstrably fine" crowd.
> the hardware semantics are clearly defined
Yup. The article means to dive from the C abstract machine to illustrate how your defined intentions (in your head), written as UB C, get translated into defined hardware behavior that you did not intend.
I'm not saying the CPU has UB, and I wonder what part made you think I did.
That's what I mean by a game of telephone. The UB parts get interpreted as real instructions by the hardware, and it will definitely do those things. But what are those things? It's not the things you intended, and any "common sense" reading of the C code is irrelevant, because the C representation of your intentions was UB.
> If the function is defined with a type that is not compatible with the type (of the expression) pointed to by the expression that denotes the called function, the behavior is undefined.
Compatible types requires integrating texts from several different paragraphs, but the general notion is "identical type, in a frontend sense", not "same ABI." This means that "const void *" and "void *" are not compatible types, much less "void *" and "struct foo *".
- casting an object pointer to or from void*
- casting an object pointer to or from char*
You're not doing either of those things. A function pointer is not an object pointer (the standard does not guarantee that the two kinds of pointer even have the same size/representation, and in fact on some esoteric hardware they don't), and even if it were, you aren't casting to or from void* or char*. So it's UB for two separate reasons.
You can cast between pointer types freely so long as they can be representable in one another (some casts are undefined because the address would be unaligned in the target pointer type, and there's actually no guarantee that pointers to objects and pointers to functions have the same representation).
Strict aliasing rules don't kick in at pointer type casting, but rather kick in at lvalue access--when you dereference a pointer, in other words--and you've also given the list of strict aliasing rules completely incorrectly.
I guess enumerating all the possibilities just... doesn't look right? It would make the standard too long and complex?
(1) you can cast between any pointer types (no UB - assuming they're aligned), but accessing memory through a wrongly-typed pointer is UB
(2) the only exception is char*, which allows you a "byte view of memory"
(3) calling a function through a pointer requires the parameter pointer types to be compatible, and none of these are: int*, struct foo *, void*, char*
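Points (1) and (2) can be sketched as follows. The function names are made up, and `float_bits` assumes `float` and `uint32_t` have the same size, which is common but not guaranteed.

```c
#include <stdint.h>
#include <string.h>

/* The cast itself is allowed, but the access through it is not:
 *
 *     uint32_t bits = *(uint32_t *)&f;   // UB: float read as uint32_t
 *
 * Copying the object representation avoids the wrongly-typed access.
 * Assumption: sizeof(float) == sizeof(uint32_t). */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* The "byte view" exception: any object may be inspected through a
 * pointer to a character type. Returns the byte at offset i, which
 * must be less than the size of the object. */
unsigned char byte_at(const void *obj, size_t i)
{
    return ((const unsigned char *)obj)[i];
}
```

The bit pattern `float_bits` returns depends on the platform's floating-point format; the function only guarantees that reading it is defined.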
I mean, you have to go out of your way and use a cast to get the UB in the first example.
For the `isxdigit` implementation, using a parameter to index into an array without a length check is pretty suspect already. I don't think any of my code actually indexes an array without checking the length in some way.
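A sketch of what a bounded table lookup can look like, assuming the example in question indexes a 256-entry table. `is_hex_digit` is a made-up name and the code assumes 8-bit bytes. The hazard is less a missing length check than plain `char` being signed on some platforms, so a byte like 0xE9 becomes a negative index.

```c
#include <stdbool.h>

/* Hypothetical table-based hex-digit test. Assumes CHAR_BIT == 8, so
 * every unsigned char value is a valid index into a 256-entry table. */
bool is_hex_digit(char c)
{
    static const unsigned char table[256] = {
        ['0'] = 1, ['1'] = 1, ['2'] = 1, ['3'] = 1, ['4'] = 1,
        ['5'] = 1, ['6'] = 1, ['7'] = 1, ['8'] = 1, ['9'] = 1,
        ['a'] = 1, ['b'] = 1, ['c'] = 1, ['d'] = 1, ['e'] = 1, ['f'] = 1,
        ['A'] = 1, ['B'] = 1, ['C'] = 1, ['D'] = 1, ['E'] = 1, ['F'] = 1,
    };
    /* Converting to unsigned char maps a negative char into 128..255
     * instead of indexing before the start of the array. */
    return table[(unsigned char)c] != 0;
}
```

The same conversion to `unsigned char` is what the `<ctype.h>` functions expect from their callers.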
For the float -> int conversion, converting a float to an int without picking a conversion does not make sense in the first place - math.h has rounding and ceiling functions.
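Worth noting that rounding first does not remove the UB: in C11 (6.3.1.4) a float-to-int conversion is undefined whenever the truncated value does not fit in the target type. A range check has to come first. A minimal sketch, with the made-up name `float_to_int`, assuming `INT_MIN` is a negated power of two (true on two's complement targets):

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: convert f to int, truncating toward zero, only
 * when the result is representable. Returns false for out-of-range
 * values, infinities and NaN. */
bool float_to_int(float f, int *out)
{
    /* (float)INT_MAX would round up to 2^31 for a 32-bit int, so the
     * upper bound is written as the exactly representable -INT_MIN and
     * compared with a strict <. */
    const float lo = (float)INT_MIN;
    const float hi = -(float)INT_MIN;

    if (!(f >= lo && f < hi))   /* NaN fails both comparisons */
        return false;
    *out = (int)f;
    return true;
}
```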
> For all you know the compiler has no internal way to even express your intention here.
I'm human, not a compiler, and even I cannot tell what the intention is behind trying to call NULL as a function. What exactly is expected to happen?
> Because the argument needs to be a pointer, and the NULL macro may be misinterpreted as an integer zero.
I don't think this is true for C. The NULL macro is defined to be a pointer in the C standard, AFAIK. Just because comparisons with zero are allowed, does not imply that the standard implicitly promotes NULL to `int`.
I think only the final one is of note (the 24-bit shift assigned to a uint64_t).
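Assuming that final example has the usual byte-assembly shape (a guess at its form, not a quote from the article), the trap is that the shift happens in `int` before the result is ever widened. A sketch with the made-up name `be32_to_u64`:

```c
#include <stdint.h>

/* The problematic form, shown only as a comment:
 *
 *     uint64_t v = b[0] << 24;
 *
 * b[0] is a uint8_t, so it is promoted to int before the shift. With a
 * 32-bit int and b[0] >= 0x80 the result does not fit in int: UB. The
 * assignment to uint64_t comes too late to help. */

/* Hypothetical helper: assemble four big-endian bytes. Each byte is
 * widened to uint64_t first, so every shift is on an unsigned type. */
uint64_t be32_to_u64(const uint8_t b[4])
{
    return ((uint64_t)b[0] << 24) |
           ((uint64_t)b[1] << 16) |
           ((uint64_t)b[2] << 8)  |
            (uint64_t)b[3];
}
```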
Probably confusion with C++ where NULL is 0 which is a special case that can be implicitly cast to both integers and pointers, unlike non-zero constants. C doesn't need this because it doesn't require explicit casts from void pointers to others.
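For what it's worth, the C standard lets an implementation define `NULL` as either a plain integer constant `0` or `(void *)0`, and the difference only matters where no prototype forces a conversion, i.e. in variadic arguments. A sketch with a made-up variadic function `count_args`:

```c
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical variadic function: counts its string arguments up to a
 * null-pointer sentinel. Every variadic argument is read as char *. */
size_t count_args(const char *first, ...)
{
    va_list ap;
    size_t n = 0;

    va_start(ap, first);
    for (const char *s = first; s != NULL; s = va_arg(ap, char *))
        n++;
    va_end(ap);
    return n;
}
```

A call should end with `(char *)NULL` rather than a bare `NULL`: if `NULL` expands to `0`, an `int` is passed where `va_arg` reads a `char *`, which is undefined and can really break where pointers are wider than `int`.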
First let me state the case for C. It’s meant to be used as a systems language that’s as close to assembly as possible while remaining portable (compared to assembly). As such it’s the first high-level language developed for any new processor.
Given the above predicate: Isn’t everything described in the article as it should be?
Add too much to the language and it becomes less possible to implement on new architectures, right? Because the undefined behavior lets implementors stand up new compilers fairly quickly.
For less undefined behavior isn’t it better to use languages that have that in their DNA? D, Zig, Go, Java, etc?
I think the real trick question is "as it should be for whom?".
Reading the comments I think people underestimate the complex interaction between:
- engineers that design hardware (they don't care much about the compiler, except when it has to fix their mistakes)
- engineers that do the compiler (they have to struggle with all quirks of the new architecture and all of the complaints of the users)
- users of the new system (hardware + compiler) that just want to take their 100k lines of code (libraries) and just use it on the new system with better performance (as that's what the hardware people promised!)
- users working on one architecture all their lives
For the compiler people, yes, probably most of what is described is as it should be. For the users (that care about performance and not making porting efforts), probably no.
Now, even when I was doing compiler work we had a hard time explaining to our users why we couldn't do some things they wanted (while also improving performance and not changing code that was already written), so explaining that on the internet seems to me a lost battle.
I am sure there are things that can be improved, and standards evolve. But the problem is very complex given the sheer amount of code written and the strange architectures out there.
Although many newer languages are safer (with the exclusion of Rust, primarily by being slower) the same kinds of issues that are there in C are there in these languages, their effects are just harder to see.
People complain about C as though they know how to fix it.
It's slower than Fortran and, depending on the platform, cobol. It's a bigger minefield than any language that came after it barring C++.
The only real advantage I can ascribe to C is that it's actually still being used after all these years, and it mostly works similarly on most hardware, like a Java for people who enjoy the casino.
Fixing C without breaking existing C code is pretty much impossible. You can start by defining warnings for UB, but then you will break any of the more trivial examples in the article. You can also start by simply killing off weird platforms (force a specific amount of bits for instance, screw the weird 16 bit char chips). Making casts explicit would probably fix a lot of problems too, though you'd need better syntax for that.
There is no fixing C without changing what C really is.
Brainfuck is "simple" by any other definition as well, but that's not a useful quality.
That doesn't mean that C is a safer language than Swift, or a less-capable language than Swift. But in terms of "easy to understand along the happy-path", it's a lot easier to get going in C.
Swift, for example, bakes a whole load of CS-degree-level ideas and concepts into the basic language with its optionals, unwrapping, type-inference, async/await, existential types, ... ... ... . C doesn't do any of that. There are (many!) more footguns in C, but the language is less complex as a result.
Brainfuck is not at all simple, from that point of view. This is a valid Brainfuck program:
>+++++++++[<++++++++>-]<.>+++++++[<++++>-]<+.+++++++..+++.[-]>++++++++[<++++>-]<. >+++++++++++[<+++++>-]<.>++++++++[<+++>-]<.+++.------.--------.[-]>++++++++[<++++ >-]<+.[-]++++++++++.
This is the equivalent C program
  #include <stdio.h>
  int main() { printf("Hello world!\n"); }
One of these is far simpler than the other.
[edit: changed to make the examples do the same thing]
The brainfuck example is "simpler": Only 8 kinds of tokens! Not really useful, though.
The cognitive load of _actually delivering software_ written in C is immensely greater than doing so with Swift, or Rust, or Python, or Java, even Zig, despite all of those leveraging much heavier machinery in order to deliver a friendlier abstract model for you to program against.
The tragedy of C is that, in addition to only delivering very baseline abstraction tools, it also adds its own set of seemingly arbitrary rules and requirements that come from nowhere but the C standard. Fictitious limitations to suit a bygone era. The abstract model of C is fine in some places, but definitely not fine in other places, and my hypothesis is that most UB in practice comes from a mismatch between programmer intuitions and C's idiosyncrasies.
I'm not an expert in either language but my anecdotal experience disagrees with this - writing Zig has been far simpler and less error-prone than writing C.
> When programming in C, to avoid unexpected pitfalls, one must be acutely aware of a whole slew of implicit behaviors (some of which are implementation-defined or even undefined).
Is "nontrivial" defined?
How would one identify "nontrivial" C code?
Is there an objective measure (defined)?
Or is it a matter of personal opinion that could vary from person to person (undefined)?
Good open source ones:
Frama-C
IKOS (from NASA)
Doesn't catch all of it.
When comparing signed and unsigned integers of same size the signed one will be converted to unsigned. In a reasonably configured project compiler will warn about it.
In case of integers smaller than int, promotion to int happens first.
In case of signed and unsigned integers of different size, the smaller one will be converted to bigger one.
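The first two rules can be sketched like this. Function names are made up, and the conversion the compiler would otherwise perform implicitly is written out as a cast so the effect is visible.

```c
#include <stdbool.h>

/* Same-size comparison: in `a < b` with int a and unsigned b, a is
 * converted to unsigned. The cast spells out that implicit conversion.
 * For a == -1 the converted value is UINT_MAX, so the result is false
 * for every b. */
bool naive_less(int a, unsigned b)
{
    return (unsigned)a < b;
}

/* Hypothetical fix: handle negative values before converting. */
bool safe_less(int a, unsigned b)
{
    return a < 0 || (unsigned)a < b;
}

/* Types smaller than int: both operands are promoted to int first, so
 * this comparison is mathematically correct with no extra work. */
bool small_less(signed char a, unsigned char b)
{
    return a < b;
}
```

So `naive_less(-1, 1u)` is false while `small_less(-1, 1)` is true, which is the sort of inconsistency the warning mentioned above exists to flag.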
By 'bounded', this obviously ignores the security consequences of e.g. buffer overflows, but just because UB can be exploited doesn't mean it's appropriate for e.g. the compiler to exploit it too, that clearly violates the intent of this paragraph.
Aren't "unpredictable results" and "no requirements" contrary to the idea that the behavior would be "somewhat bounded"?
I don't think you could sincerely argue that this definition intends to allow the compiler to totally rewrite your code because of one guaranteed UB detected on line 5, just that it would be good to print a diagnostic if it can be detected, and if not to do what's "characteristic of the environment". Does that make sense?
Bounding UB would be a nice idea, or at least prohibiting time-traveling UB (and there is an effort in that direction). But properly specifying it is actually hard.
Reading adversarially is what people do who are looking for ways that something can be abused, from an offensive or defensive position.
Personally I am tired of the entire topic.
I noticed that. Those are 100% consistent & implied by the parts of the standard I quoted that you are ignoring, though.
What you're doing is:
- Arguing that those phrases describe the totality of the implications, rather than mere examples, without providing anything to base this method of argumentation on.
- Completely ignoring the other phrases I quoted, which (taken at face value) contradict your reading.
- Claiming that anyone who disagrees is being insincere(?) and reading the standard uncharitably.
- Not even attempting to support this line of reasoning through other arguments.
So you're not only asking people to read contradictions into the standard, but also insinuating that people who don't are not arguing in good faith. That... honestly isn't a winning strategy.
Note that I'm not even saying your conclusion regarding their intent is necessarily wrong. I'm just saying your argument is bad. And that there is a difference between what the rules are and what some people believe their authors intended them to be.
If I wanted to argue your position, I would look for other parts of the standard where they do what you're claiming. That is, where the literal meaning of the wording would be crazy, and which would clearly contradict what everyone believes the authors of the standard intended it to mean. Then you would at least have some basis for extrapolating that line of reasoning to this paragraph. At that point you might at least get an acknowledgment from the other side that the standard is unclear and/or has a defect, even if they didn't agree with your take on what requirements it imposes as-written.
> I don't think you could sincerely argue that this definition intends to allow the compiler to totally rewrite your code because of one guaranteed UB detected on line 5,
I'm not sure if you're exaggerating ("totally"?), being sloppy, or misunderstanding, or if you actually mean this literally, but I already don't believe it does that, and I have never seen any compiler interpret it that way either. Sorry, but you're going to have to be more precise and pedantic here so you actually have something realistic to argue against. Right now it looks like you have an impression of UB that doesn't match reality.
I touched on this in the "it's not about optimizations" section. It's not that the compiler is out to get you. It's that you told it to do something it cannot express.
It's like if you slipped in a word in French, and not being programmed for French, it misheard the word as a false friend in English. The compiler had no way to represent the French word in its parse tree.
So no, it's not overly legalistic. Like if the compiler knows that this hardware can do unaligned memory access, but not atomic unaligned access, should it check for alignment in std::atomic<int> ptr but not in int ptr? Probably not, right?
I've (fruitlessly) had this discussion on HN before - super-aggressive optimisations for diminishing rewards are the norm in modern compilers.
In old C compilers, dereferencing NULL was reliable: the code that dereferenced NULL was always emitted. Now, dereferencing NULL is not reliable, because the compiler may remove it and the program may fail in ways not anticipated (i.e., no access is attempted to memory location 0).
The compiler authors are on the standard committee, and they tend to push for more cases of UB being added rather than removing what UB there is right now (for example, by replacing it with Implementation Defined Behaviour).
https://nickyreinert.de/2023/2023-05-16-nerd-enzyklop%C3%A4d...
C is horrible for trying to write a portable user-mode program in 2026. There are lots of better options.
C is great for writing low-level system code where you need to optimize performance down to the last cycle. The fact that it does not abstract away the hardware is super important for some use cases. A classic example is all of the platform-specific flavors of memcpy in the Linux kernel that are C/assembly hybrids hand-optimized for the SIMD pipelines of some CPUs.
C is a tool, Rust is a tool, Java is a tool, Python is a tool. Use the right tool for the job ¯\_(ツ)_/¯.
Sigh. s/sizeof(int)/_Alignof(int)/.
There are good reasons for an implementation to have sizeof(int) = _Alignof(int) and not a mere multiple of it, but if you are going to discuss subtle points and UB, just stick to the language guarantees.
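A minimal sketch of what sticking to the language guarantee looks like. The helper name is hypothetical, and the pointer-to-integer conversion is implementation-defined, so this assumes the usual flat address mapping.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if p is suitably aligned for an int.
   The check uses _Alignof(int) rather than sizeof(int), because the
   language only guarantees that the size is a multiple of the alignment.
   Assumes converting a pointer to uintptr_t yields its address. */
bool is_aligned_for_int(const void *p) {
    return (uintptr_t)p % _Alignof(int) == 0;
}
```

On a target where sizeof(int) is 4 but _Alignof(int) is 2, a check written with sizeof would reject pointers that are perfectly fine.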
> But let’s say you have a modern machine, where NULL is a pointer to address zero, and you actually have an object there.
You don't program in C on such a machine. Or maybe memory is virtualized, and it does not matter that your object lives at physical address zero, as long as you can map a non-zero virtual address to it.
> So how do you print an uid_t?
> It’s not rare for the denominator to come from untrusted input.
It's not rare for the array index to come from untrusted input.
It's not rare for the supposedly valid UTF-8 string to come from untrusted input.
...
Why single out division? This problem affects every partially defined operation. In the case of division at least, everyone learned in school that thou shalt not divide by zero. Adding two untrusted integers and forgetting that signed overflow is UB, not defined as a modulo? Your average programmer is much less likely to see that coming.
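For what it's worth, the guard looks the same for each of these partially defined operations. A small sketch in C, with hypothetical helper names:

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: divides num by den and stores the quotient in *out.
   Returns false for the two inputs where int division is undefined. */
bool checked_div(int num, int den, int *out) {
    if (den == 0) return false;                    /* division by zero */
    if (num == INT_MIN && den == -1) return false; /* quotient not representable */
    *out = num / den;
    return true;
}

/* Hypothetical helper: bounds-checked read from an array of len elements. */
bool checked_get(const int *arr, size_t len, size_t idx, int *out) {
    if (idx >= len) return false; /* index out of range */
    *out = arr[idx];
    return true;
}
```

The INT_MIN / -1 case is the one people tend to forget: the denominator is not zero, but the result does not fit in an int.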
Please. Convert your operands to wide enough types before the operation. Convert your results back to narrow enough types to compensate for integer promotion to wider types than you would have liked. Do that consistently, and you're good.
Here:
You might try to argue that uint8_t is not necessarily char. While it is true that implementations of C can exist where CHAR_BIT > 8, those do not have uint8_t defined (as per spec), so if you have uint8_t, then it is "unsigned char", which makes this cast perfectly safe and defined as far as I can tell. Of course CHAR_BIT is required to be >= 8, so if it is not >8, it is exactly 8. (In any case, whether uint8_t is literally a typedef of unsigned char is implementation-defined and not actually relevant to whether the cast itself is valid -- it is.)
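As an illustration of the kind of cast being discussed (this is not the snippet from the article), here is a sketch that walks an object's bytes through a uint8_t pointer. It assumes, as on mainstream implementations, that uint8_t is a typedef for unsigned char. The helper name is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: XORs together every byte of an object's
   representation. Reading an object through a character-type pointer is
   permitted; this assumes uint8_t is a typedef for unsigned char. */
uint8_t xor_bytes(const void *obj, size_t size) {
    const uint8_t *bytes = (const uint8_t *)obj;
    uint8_t acc = 0;
    for (size_t i = 0; i < size; i++) {
        acc ^= bytes[i];
    }
    return acc;
}
```

Byte pointers have an alignment requirement of 1, so the alignment clause of 6.3.2.3p7 cannot be violated by this direction of cast.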
Of course, this exchange just demonstrates the larger point, that even a world-class expert in low level programming can easily make mistakes in spotting potential UB.
A "world-class expert in low level programming" knows that unaligned memory accesses are no problem anymore on most modern CPUs, and that this particular UB in the C standard is bogus and needs to be fixed ;)
Pointer casts changing pointer bit sequences are common on weird platforms (e.g., some TI DSPs, PIC, and aarch64+PAC). And it is valid as per spec. Pointer assignment is not required to be the same as memcpy-ing the pointer into a pointer to another type.
You misunderstood the spec. No promises are made that that cast copies the pointer bit for bit (and thus creates an invalid pointer). Therefore, your objection to invalid pointers is null and void. :)
6.3.2.3 paragraph 7: A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned[footnote 68]) for the referenced type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer. When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object.
This is a subsection of section 6.3, which describes conversions, including both implicit conversions and conversions from a cast operation. This language is not saying anything about bit representations or dereferencing.
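Whichever reading of the cast you prefer, the usual way to sidestep the question is to never form the misaligned pointer at all and copy the bytes instead. A minimal sketch, with a hypothetical helper name:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reads a uint32_t in native byte order from any
   byte offset. No uint32_t pointer to the source is ever created, so the
   alignment clause in 6.3.2.3p7 never comes into play. */
uint32_t load_u32(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Mainstream compilers generally turn this memcpy into a single load on targets that allow unaligned access, though that is an optimization and not something the standard promises.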
I happen to be wearing my undefined behavior shirt at the moment, which lends me an extra layer of authority. I'm at RustWeek in Utrecht, and it's one of my favorite shirts to wear at Rust conferences. But let's say for the sake of argument that you are right and I am indeed misunderstanding the spec. Then the logical conclusion is that it's very difficult for even experienced programmers to agree on basic interpretations of what is and what isn't UB in C.
> A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned[footnote 71]) for the referenced type, the behavior is undefined.
C23 6.3.2.3p7.
Great way to demonstrate the point of the article.
On recent Intel and 64-bit ARM processors, data alignment does not make processing a lot faster. It is a micro-optimization. Data alignment for speed is a myth. // https://lemire.me/blog/2012/05/31/data-alignment-for-speed-m...
(while in the olden days, a program may crash on unaligned access, esp on RISC)
I don’t see what spec part would prohibit that cast from validly compiling to
The spec only guarantees a round trip through char* for pointers that are properly aligned for the type. This doesn’t break that.

```
int isxdigit(int c) {
  if (c == EOF) {
    return false;
  }
  return some_array[c];
}
```
If you write code like this, then everything in programming is UB.
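For context, the `<ctype.h>` functions are only defined for arguments that are EOF or representable as unsigned char, so a table lookup like the one above can be indexed with a negative value when the caller passes a plain char. The usual caller-side precaution is a cast. A sketch with a hypothetical helper name:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts the hex digits in a NUL-terminated string. */
size_t count_hex_digits(const char *s) {
    size_t n = 0;
    for (; *s != '\0'; s++) {
        /* Convert to unsigned char first so that a negative plain char
           is never passed to isxdigit. */
        if (isxdigit((unsigned char)*s)) {
            n++;
        }
    }
    return n;
}
```

Without the cast, a byte such as 0xFF in a string becomes -1 on platforms where char is signed, which collides with the common value of EOF, and other high bytes become negative values that the function is not required to handle.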
Additionally, some (most?) UB is intentionally UB so that optimisers are free to do fancy tricks assuming that certain cases will never happen. Indeed, this is required for high performance. If they do happen, again, it can lead to unexpected behaviour.
PS: Most languages that don't have a specification declare their primary implementation to be specification-as-code. Rust is an example of that, and it does still have UB: the cases that the compiler assumes will not happen.
edit: for example I'm typing this into Safari which means probably every key press and event is going through JSC JIT compiled functions—which have, structurally and necessarily and intentionally, COMPLETELY undefined behavior according to the spec—and yet it miraculously works, perfectly, because the spec doesn't really matter
I stopped reading there. If you have decades of experience in C/C++ and don't know what that means (and that it's arch specific), I'll assume those decades were mostly the same year over and over.
C/C++ are horrible languages, but they deserve better opponents than that.
Unaligned pointer accesses are UB because different systems handle it differently. This 'should' be to allow the program to be portable by doing what the system normally does.
Instead it's been hijacked by compiler writers, with the logic that "X is UB, therefore can't happen, therefore can be optimised away."
int c = abs(a) + abs(b); if (a > c) // overflow
Is UB because some system might do overflow differently. In practice every system wraps around.
That should be a valid check, instead it gets optimised away because it 'can't' happen.
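A version of that check that stays inside defined behavior has to test before the addition, not after it. A minimal sketch, with a hypothetical helper name:

```c
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: stores |a| + |b| in *out.
   Returns false if the result is not representable as an int. */
bool abs_sum(int a, int b, int *out) {
    if (a == INT_MIN || b == INT_MIN) {
        return false; /* abs(INT_MIN) is itself not representable */
    }
    int x = abs(a);
    int y = abs(b);
    if (x > INT_MAX - y) {
        return false; /* x + y would overflow */
    }
    *out = x + y;
    return true;
}
```

Because no overflowing operation is ever executed, the compiler has nothing it is allowed to assume away. GCC and Clang also offer -fwrapv, which defines signed overflow as wrapping, for code that really wants the after-the-fact style of check.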
C gives you enough rope to hang yourself. The compiler writers don't trust you to use the rope properly.
the only people complaining about being able to do awful things are people that do awful things
- unless you are trying to sink it in mercury. then it floats
- unless it is an uranium bar
- go sink uranium bars in mercury yourself
What a contradiction. Strong evidence that standard-driven programming language development is much worse than implementation-driven development. Standards should be used for data types and external interfaces/protocols, not programming languages.
Shrug.
(I hope casting fear is not UB)
I say this as an experienced C developer.
I'm sure that's UB in C
In C++ just use <reinterpret_cast>
Far from being just "C with classes", modern C++ is very different than C. The language is huge and complex, for sure, but nobody is forced to use all of it.
No HN comment can possibly cover all the use cases of C++, but in general, unless you have a very good reason not to, prefer:
- eschewing boomer loops in favor of ranges
- using RAII with smart pointers
- move semantics
- using STL containers instead of raw arrays
- borrowing using spans and string views
These things go a long way towards, shall we say, "safe-ish" code without UB. It is not memory-safe enforced at the language level, like Rust, but the upshot is you never need to deal with the Rust community :^)
>After all, C/C++ is not a memory safe language.
Will fix it.
In the context of UB discussion, the arguments apply equally to C and C++.
How would you write that?
I entirely agree with all your points that C and C++ are completely different languages at this point. And yet I wanted to write this post about something that is true for both.
In the end, everything comes down to culture war.
However, that's obviously not the point? Ignoring the idea that people can/should just "git gud" and write perfect code in a language with lots of old traps, you can't control how everyone else writes their code, even on your own team once it gets big enough. And there will always be junior devs stumbling into the bear traps of c/c++ (even if the rest of the codebase is all modern c++). So no matter how many great new features get added to C++, until (never) they start taking away the bad ones, the danger inherent to writing in that language doesn't go away.
Also, safe != non-UB. TFA isn't so much about memory safety anyway.
As far as stdlib usage is concerned: that's just your opinion. The stdlib has a lot of footguns and terrible design decisions too, e.g. std::vector pulling in 20k lines of code into each compilation unit is simply bizarre.
Also:
> - eschewing boomer loops in favor of ranges
Those "boomer loops" compile infinitely faster than the new ranges stuff (and they are arguably more readable too): https://aras-p.info/blog/2018/12/28/Modern-C-Lamentations/
> - borrowing using spans and string views
Those are just as unsafe as raw pointers. It's not really "borrowing" when the referenced data can disappear while the "borrow" is active.
So Linus was right? But for a second reason too:
C++ is a horrible language. It’s made more horrible by the fact that a lot of substandard programmers use it, to the point where it’s much, much easier to generate total and utter crap with it. Quite frankly, even if the choice of C were to do _nothing_ but keep the C++ programmers out, that in itself would be a huge reason to use C.
That is, accepting C++ code from programmers who use C++ could be a SOX violation ;-)
(e.g. just compiling with address sanitizer and using static analyzers catch pretty much all of the 'trivial' memory corruption issues).
Everything else is a waste of time!
The problem lies with compilers, not with the language and its specification, or with the creators of the C programming language.
Anyone can write a compiler that transforms all undefined behaviors (UB) into defined behaviors (DB). And your compiler will be used by people, including me.
OTOH one could argue that creating truly portable programs is not possible since a programming language is a leaky abstraction - different machines have different endianness, different alignment requirements, different amounts of memory, etc. One could argue therefore that the language should not make any assumptions about the alignment restrictions, or lack of them, on the machine you are compiling for. Just document that "manually created" pointers may be unaligned and have machine-dependent behavior. A nice compiler could still generate a warning or error if you create a pointer that doesn't meet the alignment requirements of the target you are compiling for.
C/C++'s provision of type casts reflects that the language has made the design decision to not restrict the user, and let them step outside the bounds of any guarantees the language provides if they want to. Unions are also a form of type cast.
completely agree!