C array types are weird

ssigna11•3 days ago•131 comments•Article on anselmschueler.com

Discussion (131 Comments), original on Hacker News

Animats•1 day ago

The real lack is that C doesn't have slices. Slices can do most of what pointers into arrays can do, with sane semantics. Slices were invented surprisingly late. They were implementable in the 1970s, but didn't really show up until the 1990s. Now that we have slices, the demand for pointers into the middle of an array has much decreased.

I had a go at retrofitting C with slices over a decade ago.[1] Too much political hassle.

[1] https://www.animats.com/papers/languages/safearraysforc43.pd...

abnercoimbre•1 day ago

Meaning it died at committee?

jcranmer•1 day ago

From what I can see in the WG14 document log [1], it never made it to the committee in the first place.

[1] https://www.open-std.org/jtc1/sc22/wg14/www/wg14_document_lo...

Animats•1 day ago

Never pushed it that far. Mentioning safety in a C or C++ context was viewed very negatively back then.

alexey-salmin•1 day ago

Interestingly the article doesn't mention two-dimensional arrays, and they're curious because they bring a certain asymmetry with them. They always tripped me up the most in C, because I otherwise find the language very "symmetrical". It often feels like in the design of this language the beauty of expressing certain things took priority over readability or safety, which I admire in a way. But somehow not in the case of two-dimensional arrays.

If you see a[i][j] it could mean two completely different things:

1) "a" is a contiguous chunk of memory of N*M bytes, so it behaves as char*; a[i][j] == *(a + i*M + j)

2) "a" is an array of char* pointers that point to N completely distinct memory chunks of size M, so it behaves as char**; a[i][j] == *(*(a + i) + j)

With flat arrays the difference between an array as a variable and a pointer to the first element is literally negligible because you won't even see the difference in the assembly. This is why the automatic decay-to-pointer makes a lot of sense.

But that breaks completely with multiple dimensions. You definitely see the difference in the assembly because the memory layout is so different.

quietbritishjim•1 day ago

> If you see a[i][j] it could mean two completely different things:

> 1) ... a[i][j] == *((char*)a + i*M + j) // I added the char* cast to make it correct

> 2) ... a[i][j] == *(*(a + i) + j)

You may already understand this but: even in case (1), you still have

   a[i][j] = *(*(a + i) + j)

(It has to - that's what operator[] means in C.)

It's just that, in this case, `a + i` is applying pointer arithmetic to char[M]* so it adds M * i bytes to a's address.

This is similar to how `a + i`, if a is int32_t*, will give you an address 4 * i bytes bigger than a.

Really the confusing part of this is that *(a + i), which is an array value i.e. has type char[M], decays to char* when you add an integer to it (or dereference it). This is a pretty crazy hack really. Imagine if, in C++, you could do this

   std::vector<int> v = {1, 2, 3};
   int* x = v + 1;   // equivalent to &v[1]

Yuck.

quietbritishjim•about 21 hours ago

Too late to edit but I wrote pointer to char[M] as char[M]* when, of course(!), it should be written as char(*)[M].
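
A minimal sketch of the pointer arithmetic described above, assuming illustrative dimensions `N` and `M` and hypothetical helper names (`row_step_bytes`, `get`) that do not appear in the thread. It shows that stepping a `char (*)[M]` by one moves `M` bytes, and that `a[i][j]` is just `*(*(a + i) + j)`:

```c
#include <assert.h>
#include <stddef.h>

enum { N = 3, M = 5 };  /* illustrative dimensions, not from the thread */

/* Byte distance between a and a + 1 when a has type char (*)[M]. */
static size_t row_step_bytes(char (*a)[M])
{
    return (size_t)((char *)(a + 1) - (char *)a);
}

/* a[i][j] spelled out the way the subscript operator is defined. */
static char get(char (*a)[M], size_t i, size_t j)
{
    /* *(a + i) has type char[M] and decays to char * before "+ j". */
    return *(*(a + i) + j);
}
```

Passing a `char grid[N][M]` to either function works because the array decays to `char (*)[M]` at the call site.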

uecker•1 day ago

"breaks completely"

I would rather say it works nicely in auto-generating the complex indexing operation for n-dimensional arrays, which makes it a lot more convenient and less error-prone to write such code. The compiler may also flatten a loop.

The array-of-pointers hack used previously to simulate 2D arrays, using an array of pointers to arrays, should not be used outside of special algorithms, as it is error-prone and slow.

fooker•1 day ago

> The compiler may also flatten a loop.

http://c2.com/cgi/wiki?SufficientlySmartCompiler

In practice, C compilers are still notoriously bad at loop optimizations.

Polyhedral optimizations provided some hope, but no compiler managed to adopt it in production.

uecker•about 24 hours ago

Maybe, but also irrelevant to the discussion, because whether you write mat[b * A + a] by hand or write mat[b][a] and let the compiler frontend expand it makes no difference to the optimizer.
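
A small sketch of the two spellings being compared, assuming hypothetical sizes `ROWS` and `COLS` and made-up function names. Both loops visit the same elements in the same order; the second simply lets the type carry the `b * COLS + a` computation:

```c
#include <stddef.h>

enum { ROWS = 4, COLS = 6 };  /* hypothetical sizes */

/* Hand-written flat indexing over a buffer of ROWS * COLS ints. */
static long sum_flat(int *mat)
{
    long total = 0;
    for (size_t b = 0; b < ROWS; b++)
        for (size_t a = 0; a < COLS; a++)
            total += mat[b * COLS + a];
    return total;
}

/* Same traversal; the offset b * COLS + a is derived from the array type. */
static long sum_2d(int (*mat)[COLS])
{
    long total = 0;
    for (size_t b = 0; b < ROWS; b++)
        for (size_t a = 0; a < COLS; a++)
            total += mat[b][a];
    return total;
}
```

Whether a given compiler emits identical machine code for the two is not something this sketch can promise; it only shows that the source-level index computation is the same.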

ossopite•1 day ago

As I recall, C# supports this in a completely sensible way by distinguishing a[i,j] and a[i][j]. If I understand right, in C, a[i][j] means what C# would spell a[i,j], which does seem rather surprising and inconsistent

mananaysiempre•1 day ago

Not quite. As GP mentions, a[i][j] might mean either, depending on what the type of a is:

(a) If the type of a is “array of length N of pointer to (say) char” (declaration: char *a[N]), then a[i][j] means the jth char in the contiguous block pointed to by the ith pointer. In C#, this is what you get with an array of arrays.

(b) If the type of a is “array of length N of array of length M of char” (declaration: char a[N][M] — sic!), then a[i][j] means the jth element of the ith element, aka the (i*M+j)th char in the single contiguous memory block. In C#, this is what you get with a two-dimensional array.

The way this happens is a bit subtle:

(a) The value a, of type “array of size N of pointer to char”, first decays into “pointer to pointer to char”, then a[i] retrieves the ith “pointer to char” starting from it as a base, then in turn a[i][j] retrieves the jth “char” starting from that as a base.

(b) The value a, of type “array of length N of array of length M of char”, first decays into “pointer to array of length M of char” (sic!), then a[i] retrieves the ith “array of length M of char” starting from it as a base, which then decays into “pointer to char”, then a[i][j] retrieves the jth “char” starting from that as a base.

NB: There are no implicit references here, unlike in C#; in part (b), a is an N*M-byte chunk of memory and a[i] is an M-byte piece of it.
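
The two cases can be put side by side in a short sketch. The sizes `ROWS` and `COLS` (standing in for N and M) and the function names are illustrative assumptions. Note that the function bodies are textually identical; only the parameter type decides what `a[i][j]` does:

```c
#include <stddef.h>

enum { ROWS = 2, COLS = 4 };  /* illustrative sizes, standing in for N and M */

/* Case (b): the parameter adjusts to char (*)[COLS]; one contiguous block. */
static char read_block(char a[ROWS][COLS], size_t i, size_t j)
{
    return a[i][j];   /* address is base + i * COLS + j */
}

/* Case (a): the parameter adjusts to char **; each row is its own buffer. */
static char read_rows(char *a[ROWS], size_t i, size_t j)
{
    return a[i][j];   /* loads the pointer a[i] first, then indexes it */
}
```

With `char block[ROWS][COLS]` and `char *rows[ROWS]` as the arguments, `sizeof block` is `ROWS * COLS` bytes while `sizeof rows` is `ROWS * sizeof(char *)`.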

simiones•1 day ago

In C, a[i][j] can mean either a[i,j] or a[i][j], depending on the type of a.

1718627440•about 22 hours ago

For 1), you can just write (&a[i])[j] .

fooker•1 day ago

And just in case you have not come across this, C++ allows you to overload all the relevant operators here: [], *, ->

So, you really can't tell what's going on behind the scenes.

I wanted to pull my hair out seeing some 'enterprise' code use

  state[i] = foo;

for some kind of logging where i was the severity level. There were even instances of state[i++], where the severity was incremental. I hope someone has rewritten that codebase with AI by now.

BeetleB•about 21 hours ago

So you would be equally critical of overloading [] for maps?

Sorry, hard for me to relate, as I've overloaded [] (in, say, Python) to make life easy on everyone. People loved it.

I hope you're aware that there is a long standing debate on whether overloading operators is good/bad, and it comes down to personal preference?

fooker•about 16 hours ago

> So you would be equally critical of overloading [] for maps?

No, I'm not sure how you got that impression. Overloading is great.

It's also confusing when it does something completely different from what you intuitively expect.

ActorNightly•1 day ago

I mean, just like with 1 dimensional arrays, it depends on the context.

Array memory is on the stack. The size of that array is actually not known at run time, it's only known at compile time, where any reference to that length gets resolved by the compiler.

If your 2d array sits on the stack, then inferring memory layout is pretty easy. If you are dealing with pointer that was passed to a function, then you can't assume anything about data size or limits, which is why many functions that take pointers take a size parameter as well.

alexey-salmin•1 day ago

> If your 2d array sits on the stack, then inferring memory layout is pretty easy. If you are dealing with pointer that was passed to a function, then you can't assume anything about data size or limits, which is why many functions that take pointers take a size parameter as well.

Right, but 2d arrays come into this picture with their own quirks again. You're not just passing the size as the parameter, you can pass it as a "special" parameter that influences how the compiler will interpret other parameters. E.g. in C99 you can do this:

    void process(size_t x, size_t y, int a[][y]);

Here "y" plays the critical role because it will be used to compute offsets in the a[i][j] expression. For 1d arrays this doesn't happen.

Of course it's still generalizable as "all but the outermost dimensions should be known" and for 1d array the outermost dimension is the only dimension. Still, this whole thing always felt a bit odd to me.
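
A minimal sketch of such a function with a variably modified parameter, assuming a compiler that supports them (required in C99, optional in C11). The name `sum_grid` is made up for illustration:

```c
#include <stddef.h>

/* y must be declared before a so that it can appear in a's type. */
static long sum_grid(size_t x, size_t y, int a[][y])
{
    long total = 0;
    for (size_t i = 0; i < x; i++)
        for (size_t j = 0; j < y; j++)
            total += a[i][j];   /* offset is i * y + j, with y read at run time */
    return total;
}
```

A caller with `int g[2][3]` would write `sum_grid(2, 3, g)`; passing a `y` that does not match the real row length is undefined behavior, so the parameter is a promise rather than a check.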

uecker•1 day ago

Well, you give the explanation yourself: The size for the outermost array is not always needed, and then C allows it to be omitted.

But my recommendation is to always give the size and then everything is regular and the compiler can use the information for warnings.

simiones•1 day ago

> Array memory is on the stack.

Array memory can sit on either the stack or the heap.

> The size of that array is actually not known at run time, it's only known at compile time, where any reference to that length gets resolved by the compiler.

This is also a bit misleading, in two ways. First, it's not clear what you mean by "size" here - the size of the memory block(s), or the shape of the array?

Second, many people think that the C runtime doesn't know the amount of memory allocated to an array, but this is actually false. It's just the C abstract model that for some reason chose to not expose this information - but the size is actually always stored and accessible, and this is virtually mandated by the standard: otherwise, `free(arr)` couldn't realistically work, it would have to be `free(arr, size)`. This is one of the weirdest inefficiencies of C, in fact - it requires you to store the size of arrays twice - once in user code, and another time in the internal logic of the allocator.

Edit: and as a fun extra, C++ not only inherited this mistake from C, but reproduced it again, meaning that a C++ array allocated with new[] actually stores the size twice, at least with typical implementations - once in the C++ runtime and again in the allocator - and still requires the user-space code to store it a third time. This is because `delete[]` needs to call the destructors of all of the elements of the array, regardless of where and how the array was allocated, so the number of array elements needs to be stored alongside the object itself.

zajio1am•about 21 hours ago

> Second, many people think that the C runtime doesn't know the amount of memory allocated to an array, but this is actually false. It's just the C abstract model that for some reason chose to not expose this information.

There are some counterpoints:

1) Conceptually, allocated memory block and data structure / array in it are not related. You can allocate memory block and then subdivide it to multiple different structures / arrays. You can implement sub-allocators.

2) Heap allocator does not need to store exact length of allocated object. For example, it could have several fixed-length slab allocators for smaller objects, select matching one during malloc() and use address range to find slab during free().

3) Array can be also on the stack (VLA or alloca()).

4) Arrays can be also on memory allocated outside of C library allocator (e.g. mmap()).

ActorNightly•1 day ago

>Array memory can sit on either the stack or the heap.

No, if we are using the definition of an array that is like int c[] = ..., that is always going to be on the stack. Heap continuous memory =/= array. You can use the [] operator to access it like an array, but fundamentally, as far as structures in C language are concerned, those 2 are different, because they get treated by compiler differently.

>but the size is actually always stored and accessible, and this is virtually mandated by the standard: otherwise, `free(arr)` couldn't realistically work,

That would only be true if each element in the array was a char.

The dynamic data structure stores total amount of memory allocated by address, it has no info about the size of the element, so it can't infer the actual number of items at runtime. You could write your own malloc that does this, but generally, that is left to the user for flexibility. For example, a really good practice in C coding that basically solves any double free is a mempool that allocates all the memory up front. That way, you never really even have to call free, and the memory you allocate can be partitioned any way you choose dynamically.
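
One way to read the mempool idea is as a bump allocator over a fixed buffer. The sketch below is an assumption about what is meant, not the commenter's code; `Pool`, `pool_alloc`, `pool_reset` and `POOL_BYTES` are hypothetical names, and it relies on C11 for `_Alignas` and `max_align_t`:

```c
#include <stddef.h>

enum { POOL_BYTES = 1024 };  /* hypothetical capacity, reserved up front */

typedef struct {
    _Alignas(max_align_t) unsigned char bytes[POOL_BYTES];
    size_t used;             /* bytes handed out so far */
} Pool;

/* Hand out size bytes, rounded up so every block stays suitably aligned.
   Returns NULL when the pool cannot satisfy the request. */
static void *pool_alloc(Pool *pool, size_t size)
{
    const size_t align = _Alignof(max_align_t);
    size_t rounded = (size + align - 1) / align * align;
    if (rounded < size || rounded > POOL_BYTES - pool->used)
        return NULL;
    void *block = pool->bytes + pool->used;
    pool->used += rounded;
    return block;
}

/* There is no per-block free; everything is released at once. */
static void pool_reset(Pool *pool)
{
    pool->used = 0;
}
```

Because individual blocks are never freed, a double free cannot occur by construction; the trade-off is that memory is only reclaimed when the whole pool is reset.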

uecker•3 days ago

In practice, the [static n] notation can give you useful warnings and bounds checking.

https://godbolt.org/z/PzcjW4zKK

And while the (*array_ptr)[3] notation takes a moment to get used to, it is very logical. If you have a pointer to an array, you dereference it first and then index into it. Again, useful for bounds checking: https://godbolt.org/z/ao1so9KP7
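
For readers who cannot open the links, here is a minimal sketch of both notations. It is not the code behind the godbolt URLs, and the function names are made up. Which warnings you get for a too-short or null argument depends on the compiler and its flags:

```c
#include <stddef.h>

/* [static 3]: the caller promises a non-null pointer to at least 3 ints. */
static int sum3(const int values[static 3])
{
    return values[0] + values[1] + values[2];
}

/* Pointer to a whole array of 3 ints: dereference first, then index. */
static int last_of(int (*array_ptr)[3])
{
    return (*array_ptr)[2];
}
```

Given `int v[3]`, the calls are `sum3(v)` and `last_of(&v)`. The second signature keeps the length in the type, so passing the address of an `int[2]` is a type mismatch rather than a silent decay.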

keyle•1 day ago

I know of these notations but I don't see many people using [static n].

Not sure why, maybe it doesn't feel like C anymore, maybe it feels hacky?

Typically if you're passed an array you'd want to get more anyway, so you'd get passed a struct. Not sure.

uecker•1 day ago

I don't know. I see people increasingly make use of it. The problem was that in the past compilers ignored this completely, so there was simply no point. Nowadays GCC uses it for warning (the length for bounds and "static" for nonnull), so it starts to become useful.

kazinator•1 day ago

The parentheses in (*parray)[i] would be unnecessary if dereferencing used postfix notation.

  Current:       All postfix

  *ptr[3]        ptr[3]*   // indexed access, then deref

  (*ptr)[3]      ptr*[3]   // deref, then indexed access

geocar•about 22 hours ago

Dereferencing does have a postfix notation, so you can try it (sort of):

    #define $ [0]

then you can say ptr $[0] or ptr[0]$ and see if it's really better...

dnautics•1 day ago

What is **int[3][5]

kazinator•1 day ago

In C declaration syntax, there is a "stem" called declaration specifiers consisting of specifiers and qualifiers. That's where int can appear. After that, there is a declarator. In some cases, multiple declarators separated by a comma, which share the same "stem".

  int a, b, *c; // one stem consisting of "int", three declarators.

The * is declarator syntax for deriving a pointer type. It never appears such that a type specifier would come after it somewhere to the right.

Some languages have extended the C declaration syntax such that the type derivators can be moved from the declarator part to the "stem". For instance, as an alternative to:

  int a[10];

you can write

  int[10] a;

This is how we could get

   **int[3]

as a declarator stem indicating an array of 3 pointers to pointers to int. But it's not in C.

arcfour•about 24 hours ago

The work of the mythical four star programmer? https://wiki.c2.com/?ThreeStarProgrammer

ori_b•1 day ago

A syntax error. You need a variable name, not a type name, in the middle.

ori_b•1 day ago

And if you want 'int **arr[a][b]', it's a value that when you say 'x = **arr[m][n]', will evaluate to an int and assign it to x. Postfix has higher precedence than prefix.
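
A short sketch of that reading, with hypothetical constant dimensions `A` and `B` in place of `a` and `b` and a made-up function name. Postfix `[m][n]` binds first and yields an `int **`; the two prefix stars then peel off both pointer levels:

```c
enum { A = 2, B = 3 };  /* hypothetical dimensions */

/* arr is an array of A arrays of B pointers to pointer to int. */
static int read_cell(int **arr[A][B], int m, int n)
{
    /* Parsed as *(*(arr[m][n])): index twice, then dereference twice. */
    return **arr[m][n];
}
```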

fusslo•1 day ago

or a rejected PR

thrance•1 day ago

A pointer to a pointer to a pointer to a pointer of integers.

kazinator•1 day ago

There is a history to it; in one of the predecessor languages, like B, Ritchie actually had arrays that had a hidden pointer to their start. The "array to pointer decay" was actually a real operation that loaded an address from memory, and it was possible to twiddle the bits to relocate an array. One problem with it was no way to initialize such a pointer field that would allow an array to live in dynamically allocated storage (no constructors in the language).

So in short, the bad design (array values produce pointers) was informed by conceptual compatibility with an earlier design in which that was literally happening.

xenadu02•1 day ago

Not just this: it is important to remember that there was no "aha!" moment where C was created whole-cloth by writing the first compiler in B and then cross-compiling.

The language B was evolved in-place by adding new features, then editing the compiler source to make use of those new features, then repeating. They simply started calling it "New B". At some point the language had evolved sufficiently that they decided to call it C.

The semantics of arrays were inherited from B and simply never changed. Part of me suspects this was also because it was seen as "clever" at the time. Look ma, we let arrays turn into pointers! Isn't that clever?

When you look at pre-ANSI C function prototypes you wonder "where are the parameter types?" because there are none. The compiler didn't bother to check. Part of that was perhaps for implementation reasons but a big part of that was the feeling or culture inherited from B: in that language you just had words of memory. You were free to interpret any word of memory as any data type you liked. So duh of course it is up to you to decide how many parameters your function received and of what type. If the caller supplied a different number or different types? Don't do that.

If you are coming from that sort of world, clever tricks like arrays decaying to pointers or automatically converting between data types and sizes seem perfectly natural. Anything C offers above and beyond that is an improvement from B after all.

ux266478•about 20 hours ago

> Part of me suspects this was also because it was seen as "clever" at the time. Look ma, we let arrays turn into pointers! Isn't that clever?

It was intentional and functional. The idea was basically a primitive kind of polymorphism, which allowed for functions intended to act on arrays to accept any size of an array to be passed in. It was redundant with pointer arithmetic, but allowed for communication of intent without accidentally incurring a semantic unit of meaning. There's an interview where Ritchie talked about this.

Pascal's biggest misstep was that it went the complete opposite route, where pointer arithmetic was disallowed and arrays did not decay. It also lacked any kind of polymorphism, and one of the biggest ergonomic pain points ends up being that if your problem domain has non-uniform array sizes, you're in for a lot of annoying rewriting.

> When you look at pre-ANSI C function prototypes you wonder "where are the parameter types?" because there are none.

Actually pre-ANSI C technically didn't have function prototypes, ANSI C introduced them and it got them from C-with-classes. It did have function declarations though (which aren't the same thing)

Pedantics aside,

    f(a, b) { return a + b; }

This is fully typed, the parameters and return type default to int.

Fun fact:

    int f();

Does not declare a function with no parameters, but it does declare a function with an unknown number of parameters of unknown types. An empty parameter list in C is:

    int f(void);

xenadu02•about 15 hours ago

> Actually pre-ANSI C technically didn't have function prototypes,

Thanks, I completely forgot they weren't called prototypes originally.

im3w1l•about 22 hours ago

Those decisions also make a lot of sense from the C-as-macro-assembler point of view (passing parameters puts values in the places defined by the calling convention, and taking parameters pulls them out) that has of course gradually faded over the years, being replaced by a rigorously defined (and undefined) abstract machine.

fooker•1 day ago

C array types are weird because C doesn't really need arrays. It's not what C was about.

But if you designed a language in the era where Fortran, THE array language, reigned supreme, nobody would use your language. The mindshare Fortran had is difficult to convey now, half a century later.

Think of it like making a chatbot today and not mentioning AI or LLMs, that's what making a language without arrays would have felt like in 1970.

alexey-salmin•1 day ago

People who do HPC in C actually wish C had proper arrays like Fortran. If your function takes two pointers as inputs instead of two arrays they can alias the same memory and in fact they may alias any other pointer of the same type. Writing into one of them invalidates all the values you have in registers so you have to load them again.

The "restrict" keyword was invented to solve this but it still has weaker semantics than original Fortran arrays. It can still solve a big share of problems, but it never got proper adoption and never even made it into C++.
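
A minimal sketch of how `restrict` is typically used, with a made-up function name. The qualifier is a promise by the caller that `dst` and `src` do not overlap, which lets the compiler keep loaded values in registers across the stores. Breaking the promise is undefined behavior:

```c
#include <stddef.h>

/* Caller guarantees that dst[0..n) and src[0..n) do not overlap. */
static void scale_into(size_t n, float *restrict dst,
                       const float *restrict src, float k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];   /* a store to dst cannot change any src[i] */
}
```

Nothing checks the promise at compile time or run time, which is part of why it is weaker than a language-level no-alias rule.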

xmcqdpt2•about 23 hours ago

Sometimes you have to use C but really they should be doing HPC in fortran. It has C FFI, it can compile to static programs and to dynamic libraries, it has C-like performance, etc.

(It's not as portable as C though, and the compilers have more bugs.)

genewitch•1 day ago

i learned FORTRAN in an accelerated tech program in 1996-ish in high school.

i used fortran recently to see how "slow" python is, i did matrix multiplies by hand in .c, and .py. Now i didn't write the fortran, the AI did, but i remember enough that i verified what it did was sane, also the other two i wrote did agree with results.

  fortran 1   unit of time
  C       1.7 unit of time
  python  2.2 unit of time

for the same matmuls.

anyhow, 1996-ish. crazy.

fooker•about 16 hours ago

Fortran compilers will happily tile your loops.

C compilers don't really do this.

flohofwoe•1 day ago

> C array types are weird because C doesn't really need arrays. It's not what C was about.

I would phrase that differently: "The main feature of arrays (performing the `base + index * size` address computation) is already provided by the C pointer type via the `ptr[N]` syntax sugar, so having a separate array type might have felt redundant at the time".

I think having "proper" array types in a language (where the type carries both the array item type and the comptime length) only really makes sense when there's also a slice type (e.g. a runtime ptr/length pair). And I guess at any point during C's development this was too big a language change for the committee to swallow.
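
Without language support, a ptr/length pair can only be approximated in user code. The sketch below is one such approximation; `IntSlice`, `SLICE_OF`, `slice_sub` and `slice_get` are hypothetical names, and the bounds checks are ordinary `assert` calls rather than anything the compiler enforces:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical slice type: a runtime pointer/length pair. */
typedef struct {
    int   *ptr;
    size_t len;
} IntSlice;

/* Only valid on a real array, where sizeof still knows the length. */
#define SLICE_OF(arr) ((IntSlice){ (arr), sizeof (arr) / sizeof (arr)[0] })

/* Sub-slice covering [start, end) of s. */
static IntSlice slice_sub(IntSlice s, size_t start, size_t end)
{
    assert(start <= end && end <= s.len);
    return (IntSlice){ s.ptr + start, end - start };
}

/* Checked element read. */
static int slice_get(IntSlice s, size_t i)
{
    assert(i < s.len);
    return s.ptr[i];
}
```

Applying `SLICE_OF` to a pointer instead of an array compiles but yields a wrong length, which is exactly the gap a built-in slice type would close.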

saidnooneever•1 day ago

this exactly. if you need arrays or sequences of objects/memory items u can trivially implement them :/. why does it need to be embedded in the language?? people want the language to do all the programming for them. I suppose this is why they like LLMs too..

they should pay programmers less. get rid of all these moneygrabs

danborn26•about 22 hours ago

The way C handles array decay to pointers always trips up beginners, but it's exactly what makes passing data around so lightweight. Good writeup on a classic quirk.

fredrikholm•about 21 hours ago

Agreed, I even find it surprisingly ergonomic. Thinking of data as offsets into memory is unusual coming from almost every other language, but once you grok it it's actually quite nice.

I love C more than I should.

the__alchemist•1 day ago

This is one of the things that I feel is an inappropriate abstraction that is around for historical reasons. When I do FFI to call C from rust, I usually wrap the generated API (Which is pointer based) into rust's &[] array syntax. Arrays/lists/Vecs etc in most non-C languages feel like an abstraction over a collection of items; I feel like C's exposing the pointer directly is taking a low-level memory/MMIO operation and inserting it into business logic. Conceptually, I like to keep them separate; pointers for writing drivers, accessing registers, writing to flash memory etc. Arrays/lists/vecs for higher level operations on collections.

Tangent: I have a pet theory that part of Zig's raison d'etre is to fix some of the problems with C, while accommodating its pointer-based data structures, and the resulting patterns.

vulcan01•1 day ago

This talk – "Programming without pointers" – by Andrew Kelley may be interesting to you.

https://www.hytradboi.com/2025/05c72e39-c07e-41bc-ac40-85e83...

throwaway27448•1 day ago

Learning to program with pointers is enormously useful. It's simply bad software engineering to not use typing to enforce constraints on access to pointers (or addresses, or however you'd like to term them)

doyougnu•1 day ago

IIRC that talk is about using indices (u32) to represent data in an array. That is orthogonal to representing that information in the type system, since you can just type the index.

codethief•1 day ago

Interesting talk, thanks for sharing!

mlmonkey•1 day ago

It still cracks me up that 3[x] and x[3] mean the same thing in C.
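
A tiny sketch of why that holds, using made-up function names: the subscript operator is defined as `*(x + 3)`, and pointer-plus-integer addition commutes, so all three functions read the same element:

```c
/* Three spellings of the same access to element 3. */
static int index_normal(const int *x)      { return x[3]; }
static int index_swapped(const int *x)     { return 3[x]; }
static int index_spelled_out(const int *x) { return *(x + 3); }
```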

jason_s•1 day ago

yeah that's what I thought the article was going to be about.

RobotToaster•1 day ago

It's still weird to me that you can declare an array with the register keyword.

Then it (understandably) becomes UB to attempt to get the pointer.

(It also probably isn't stored in a register, since the keyword is just asking the compiler nicely.)

flohofwoe•1 day ago

The meaning of the 'register' keyword has changed over time to just "it's illegal to take the address of this item":

https://www.godbolt.org/z/TKq9rWzP1

Don't know what's the idea behind not allowing to take the address of a value though.

wasmperson•about 22 hours ago

It matters in single-pass compilers. You can't allocate a variable in a register if its address is ever taken, but by the time a single-pass compiler knows that information it has already spit out all of the assembly for the function.

RobotToaster•1 day ago

A register isn't in external memory, so isn't addressable as such. That part makes sense since if the compiler actually follows your suggestion it can't be addressed.

Thinking about it, storing arrays in registers would possibly make sense on systems like the 8051 where you actually have a bunch of general purpose register banks, but those don't exist in x86.

randomNumber7•1 day ago

It was always only a suggestion to the compiler, to hold this variable in a register.

Compilers got so good at optimization that there is little point using it.

If a variable is held in a register you can't access it with a pointer. So if your intention is it should be in a register you can't take the address.

tardedmeme•1 day ago

It once told the compiler to hold the value in a register because the compiler wasn't very smart at all.

danborn26•about 24 hours ago

C's array decay into pointers still catches me off guard sometimes. It is definitely one of those quirks you just have to memorize.

moktonar•1 day ago

int x[n] and int *x are very different things when it comes to defining memory layout tho. In one case you end up with n int sized slots of memory, in the second with one register sized slot. That makes all the difference when defining structs for example.

jason_s•1 day ago

From the title, I thought they were going to point out that `a[2]` and `2[a]` have identical meaning in C.

K0nserv•about 17 hours ago

But remember, C is simple.

fatty_patty89•1 day ago

there's no array type in c

colejohnson66•1 day ago

Yes it does. It just decays to a pointer at the slightest touch.

Mikhail_Edoshin•1 day ago

There are differences. E.g. va_xxx functionality may be implemented either with a pointer or an array. The difference becomes visible if you try to pass a va_list to another variadic function and then extract it later with va_arg. About half of compilers will happily do that, and another half will refuse to compile the naive version. (There's a more sophisticated proper way.)

https://stackoverflow.com/questions/79897621

ocreat•1 day ago

There is a big difference between:

struct A { int size; char data[]; };

struct B { int size; char *data; };
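
A sketch of what that difference looks like in practice. The helper names (`make_a`, `make_b`, `free_b`) are assumptions added for illustration. With the flexible array member, header and payload share one allocation; with the pointer, the payload lives in a second one:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct A { int size; char data[]; };   /* bytes live inside the same block */
struct B { int size; char *data; };    /* bytes live wherever data points */

/* One allocation holds header and payload. Returns NULL on failure. */
static struct A *make_a(const char *text)
{
    size_t n = strlen(text) + 1;
    struct A *a = malloc(sizeof *a + n);
    if (a == NULL)
        return NULL;
    a->size = (int)n;
    memcpy(a->data, text, n);
    return a;
}

/* Two allocations: the struct, then the buffer it points to. */
static struct B *make_b(const char *text)
{
    size_t n = strlen(text) + 1;
    struct B *b = malloc(sizeof *b);
    if (b == NULL)
        return NULL;
    b->data = malloc(n);
    if (b->data == NULL) {
        free(b);
        return NULL;
    }
    b->size = (int)n;
    memcpy(b->data, text, n);
    return b;
}

/* struct B needs both allocations released; struct A needs a single free(). */
static void free_b(struct B *b)
{
    if (b != NULL) {
        free(b->data);
        free(b);
    }
}
```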

IncreasePosts•1 day ago

Paging walter bright

WalterBright•1 day ago

At your service! D fixed it, and I'm sorry C users have suffered as the array-to-pointer decay blasted their kingdom. Fixing it in C is easy and should be the #1 priority.

aa-jv•1 day ago

At this point, C's #1 priority is not breaking things in existing implementations, though, Walter...

WalterBright•about 17 hours ago

My proposal does not break existing code.

https://www.digitalmars.com/articles/C-biggest-mistake.html

glouwbug•1 day ago

C's biggest mistake.

But in other news most don't know that a[3] == 3[a]

glouwbug•1 day ago

https://www.digitalmars.com/articles/C-biggest-mistake.html

parlortricks•1 day ago

I didn't understand why a[3] == 3[a], but i found this stackoverflow that explains it.

https://stackoverflow.com/a/16163840

In C a[i] is converted to *(a+i) internally. i[a] is converted to *(i+a). Array names also act as pointers in c. so (a+i) or (i+a) give an address (using pointer arithmetic) that is dereferenced using *

throwaway27448•1 day ago

Even more irrelevant than the array type