
Discussion (171 Comments)
There was a particular game that was superslow when this tech was applied. Original game loading took around 15-20 seconds, whereas once the tech was applied it took easily 3-5 min, even with all data already downloaded.
When I started digging into it, I realized the reason was that the game was using something like fread(buf, 1, 65536, file) instead of fread(buf, 65536, 1, file). Back in the day, that basically expanded to 65k reads of 1 byte each, for a file of several MB. Each fread translated to 65k calls to the Windows ReadFile API. Since my code was hooking the ReadFile call, and my hook was heavier than ReadFile itself, game loading felt really slow. Unusable. It would not have been fun for players.
The easy fix was to swap the arguments for certain calls. The long fix required an internal cache to account for these cases, so that the hooked ReadFile was faster when the data was already on disk.
Funny thing is that as we started rolling out the tech and applying it to more and more games, we realized lots of games did this. We went for the cache fix and games ended up loading faster than before. Honestly, games could have loaded all the data in a couple of seconds by just swapping the args. I'm guessing developers did this on purpose so that games seemed like they were loading a lot of stuff, although you never know.
To be fair, Excel would erase the places it wanted to write to white up to 9 times before it drew any black pixels. We made that very fast! We didn't tell them :-)
At the time, 24-bit framebuffers were so slow that, before we built graphics acceleration hardware, people would switch back to 8-bit to get stuff done. Making 24-bit/true colour your daily driver was a big step forward.
PS – I am looking through the NuBus cards that I have... did you work for SuperMac or RasterOps?
I did the architectural design for the SuperMac cards. I figured out what needed to be accelerated, dropping code into people's machines to see where the cycles were going. Others did the physical design for the first 2 cards; I did the design of the chip in the Thunder and later cards (designed the data paths and state machines and a full simulation, someone else actually laid the gates).
If your card has a SQD01 on it, it's my work. It peaks at 1.5Gb/s on solid fills.
One of the other bugs (the Quark/ATM one) also came about because the programmers were worried about writing over stuff that hadn't been completely erased. The Quark guys wrote a string with 2 spaces at the end through a box that masked the end of the string. The ATM font renderer saw it couldn't fit the text, so it split it in half and tried again, so it drew N/2, N/4, N/8, ... strings. It spent all its time in the 68k's multiply instructions figuring out how wide the strings (and substrings) were; our fancy 24-bit character rendering hardware was an afterthought.
I feel like I'm having a stroke trying to read this, what does it mean??
8-bit pseudo color, so the color palette switched with every focus-follows-mouse window boundary crossing. 16-bit direct color with banding but no more palette psychedelia.
This was equal parts to make it faster and to allow for higher framebuffer resolutions with limited VRAM.
> We considered creating a plugin that fixed all these things, it would have been hard to maintain, in the end we travelled around to the people who made these apps and talked them through their problems
Since talking to developers is no longer an option, I actually do write "Such-and-such Tune-up" extensions that patch applications dynamically to make them run better (or at all) in Advanced Mac Substitute, or even Mac OS itself.
I also worked on the original A/UX port for the Mac II. Some hardware (like the IWM) required tiny buzzy loops. We ran into one bug where using the floppy caused ADB to freeze, but only on the release machines, not the prototypes all our engineers had. It turned out there was hardware that made access to the VIA faster by pulling the clock in for 1 cycle. If you sat in too tight a loop reading the timer in the VIA to measure a sector time for the IWM, it upped the output clock from the VIA to the ADB chip and overclocked it ....
Was it a workaround for things that didn’t fully complete on one iteration, so the devs kept hammering away at it until it worked?
Not every bug results in the program doing the wrong thing, they often just make the program do the right thing very slowly.
And nobody notices, since it still produces the right result.
If the stream is buffered, then all operations, including fread, are supposed to go through the buffer.
All three of these should issue buffer-sized reads to the operating system:
1. A loop which calls getc(stream) 65536 times.
2. fread(buf, 1, 65536, stream)
3. fread(buf, 65536, 1, stream)
The more direct behavior of fread should only kick in if the stream is configured as unbuffered.
I would say that the way low-level reads are issued to the host operating system is a "visible effect" of the program, so I suspect this may actually be a matter of conformance. I.e. it's not okay to issue those reads however the stream library wants as long as the data is read.
Edit: removed incorrect information.
See the original post and discussion for the whole story:
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times... https://news.ycombinator.com/item?id=26296339
What software did that that badly? If the code asks for (up to) 65,536 single byte items, why would you split that into 65,536 calls?
Also, that change changes behavior. The old call could read anything from zero to 65,536 bytes; the new one can only report reading zero or 65,536 bytes.
(Reading the source of a few implementations, I think most implementations will fill the output buffer with partial objects if the input doesn’t supply an integral number of them, but the return value of fread cannot signal that to the caller)
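The difference in return values can be checked directly. Below is a minimal sketch in C, assuming a temporary file from tmpfile() and hypothetical helper names (make_temp_file, read_as_bytes, read_as_one_block) that are not from the thread. It only exercises the documented return value of fread, not how many OS reads any particular C runtime issues.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: creates a temporary file holding `len` bytes,
   rewound to the start. Returns NULL on failure. */
static FILE *make_temp_file(size_t len) {
    FILE *f = tmpfile();
    if (!f) return NULL;
    for (size_t i = 0; i < len; i++) {
        if (fputc('x', f) == EOF) { fclose(f); return NULL; }
    }
    rewind(f);
    return f;
}

/* fread(buf, 1, cap, f): element size 1, so the result is a byte count. */
size_t read_as_bytes(size_t file_len, size_t cap) {
    char buf[64];
    assert(cap <= sizeof buf);
    FILE *f = make_temp_file(file_len);
    if (!f) return (size_t)-1;
    size_t n = fread(buf, 1, cap, f);
    fclose(f);
    return n;
}

/* fread(buf, cap, 1, f): one element of `cap` bytes, so the result is
   0 or 1 and a short file is indistinguishable from an empty one. */
size_t read_as_one_block(size_t file_len, size_t cap) {
    char buf[64];
    assert(cap <= sizeof buf);
    FILE *f = make_temp_file(file_len);
    if (!f) return (size_t)-1;
    size_t n = fread(buf, cap, 1, f);
    fclose(f);
    return n;
}
```

With a 10-byte file and a 16-byte request, read_as_bytes(10, 16) reports 10 while read_as_one_block(10, 16) reports 0, which is the behavior change described above.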
> For each object, size calls are made to the fgetc function and the results stored, in the order read, in an array of unsigned char exactly overlaying the object
(wording unchanged since C99)
If the file is unbuffered, depending on how the implementation handles buffering, and how it interprets the standard, then perhaps it does end up hitting a path where there's 1 ReadFile call per byte...
I don't know how most implementations get around this. Presumably it's valid to interpret "calls are made" as "behaving as if calls are made", meaning fread can copy data out of the FILE's buffer directly, or make calls directly to whatever routine fgetc defers to, rather than calling fgetc N times literally. Looks like glibc's fread does this.
>> For each object, size calls are made to the fgetc function and the results stored, in the order read, in an array of unsigned char exactly overlaying the object
Aha! That phrase led me to https://man7.org/linux/man-pages/man3/fread.3p.html. I consulted https://man7.org/linux/man-pages/man3/fread.3.html and https://man.openbsd.org/fread.3. Neither mentions that.
Now, I checked https://cplusplus.com/reference/cstdio/fread/. It doesn’t mention it, either.
⇒ this appears to be POSIX-specific.
Finally, if somebody implements fread as “For each object, size calls are made to the fgetc function”, it doesn’t matter whether you ask for 1 object of size 65,536 or 65,536 objects of size 1; both would call fgetc 65,536 times.
I had to convince people with benchmarks regularly that, yes, you could write the handful of lines to do proper user-space buffering and trivially run rings around any code that did extra context switches, because a lot of people didn't realise the cost difference between system calls and calling their own functions.
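A rough sketch of that "handful of lines", in C. Everything here is illustrative: slow_read is a made-up stand-in for an expensive OS call (read, ReadFile), backed by memory so the number of calls can be counted, and the 4096-byte buffer size is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an expensive OS-level read. */
typedef struct {
    const unsigned char *data;
    size_t len, pos;
    unsigned long calls;            /* how often the "system call" ran */
} slow_source;

static size_t slow_read(slow_source *s, void *dst, size_t n) {
    s->calls++;
    size_t left = s->len - s->pos;
    if (n > left) n = left;
    memcpy(dst, s->data + s->pos, n);
    s->pos += n;
    return n;
}

/* User-space buffer in front of slow_read. */
typedef struct {
    slow_source *src;
    unsigned char buf[4096];
    size_t have, used;
} buffered_reader;

/* Returns the next byte, or -1 at end of input. Refills only when empty. */
static int reader_getc(buffered_reader *r) {
    if (r->used == r->have) {
        r->have = slow_read(r->src, r->buf, sizeof r->buf);
        r->used = 0;
        if (r->have == 0) return -1;
    }
    return r->buf[r->used++];
}

static unsigned char input[65536];  /* zero-filled test data */

/* Reads `total` bytes one slow_read call at a time. */
unsigned long calls_unbuffered(size_t total) {
    assert(total <= sizeof input);
    slow_source src = { input, total, 0, 0 };
    unsigned char byte;
    for (size_t i = 0; i < total; i++)
        slow_read(&src, &byte, 1);
    return src.calls;
}

/* Reads `total` bytes one at a time, but through the buffer. */
unsigned long calls_buffered(size_t total) {
    assert(total <= sizeof input);
    slow_source src = { input, total, 0, 0 };
    buffered_reader r;
    r.src = &src;
    r.have = 0;
    r.used = 0;
    for (size_t i = 0; i < total; i++)
        (void)reader_getc(&r);
    return src.calls;
}
```

For 65536 bytes, the unbuffered path makes 65536 calls to slow_read and the buffered path makes 16. The caller's code looks the same either way; only the number of trips to the expensive layer changes.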
This included, by the way, the MySQL client library at one point, which would do small reads for length fields all the time instead of larger non-blocking reads into a buffer.
But I think the parent comment's point is that the issue is in the implementation of fread itself in the standard library. It's perfectly reasonable for an application to pass it 1, 65536 (i.e. one byte, up to 65536 times) and expect it not to issue 65536 separate OS calls.
Is this just the case of Windows having a bad stdlib fread implementation 15 years ago or is my thinking here actually wrong?
The C runtime authors did (presumably Microsoft, if it's MSVCRT).
He's hooking into ReadFile, a layer below the stdlib. By the time it reaches the hook, it's already split.
For example, I run TortoiseGit, which has a caching feature that is supposed to make it faster at showing what to commit. Disabling it increases the number of items I can delete per second in Windows Explorer from about 1000 to about 3000, while not making TortoiseGit operations meaningfully slower (that I can tell).
This is a Dev Drive [0] on my machine, it would probably be slower on my C: drive which has full Windows Defender real time file scanning.
[0]: https://learn.microsoft.com/windows/dev-drive/
But in this case if the code was calling fread 65536 times in a loop and getting 64KiB each time it wouldn't be good either!
Sounds like the parent comment had to fix this with the internal cache thing to speed up the small freads. I think they meant the easy fix would have been swapping the args in the original / caller code.
Edit: mort96: So did you check the return value or not?
I really hope that was not the case, and I'd rather put it down to incompetence or to dealing with obscure legacy problems, but the gamer in me gets enraged at the thought that someone would artificially increase loading times.
That's because the OS does the same thing too. It's the right fix. When I implemented something similar, we implemented caching right away.
Which is the obvious reason you'd pass an element size of 1: you want to know how many bytes were read.
https://silentsblog.com/2025/04/23/gta-san-andreas-win11-24h...
The optimizer was allowed, but not obligated, to transform that into: x = a
However, in this case, b was sometimes 0. And if so, the unoptimized version computed: x = a / 0 * 0 = Inf * 0 = NaN
So badness ensued if that particular path didn't get optimized, which could happen under various circumstances. We had to add some code to ensure that transformation always happened on that game.
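The arithmetic can be reproduced in a few lines of C. This is only a sketch: it assumes IEEE 754 floats, and it assumes the original expression was x = a / b * b, which is what the Inf * 0 step above suggests (the original snippet is not preserved here). The function names are made up for illustration.

```c
#include <assert.h>

/* The expression as written. The volatile intermediate keeps the
   compiler from simplifying a / b * b to a at build time. */
float unoptimized(float a, float b) {
    volatile float q = a / b;
    return q * b;
}

/* What the optimizer was allowed, but not obligated, to produce. */
float optimized(float a, float b) {
    (void)b;
    return a;
}

/* NaN is the only float value that compares unequal to itself. */
int result_is_nan(float x) {
    return x != x;
}
```

With a = 1 and b = 0, unoptimized() yields NaN while optimized() yields 1, so the result depends on whether the transformation happened.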
- deciding to inform the game developer & wait for reply vs not waiting for reply vs just fixing it yourself without informing the developer; and
- if informed: developer actually fixing it vs only saying they would fix it vs no reply whatsoever (not counting automated "thank you for your inquiry" replies, in cases where you don't already have more direct channels to the dev than email)
I've always kind of wondered this because in a way, it's kind of weird that it's fixed for them, at least for new releases / games actively being developed.
(Full disclosure: I'm a game developer myself, with a very high interest in engine plumbing & dev [including graphics], though finding a job for the latter is easier said than done.)
I have very little understanding of how allocation works at the OS level, but I'm surprised there are no wrappers like dgVoodoo or dxWrapper specifically for these kinds of issues. There are quite a bunch of old Windows games (Need for Speed 1-4 for a start) that refuse to run on modern OSes due to rather... bold memory management strategies.
[1] - https://www.joelonsoftware.com/2000/05/24/strategy-letter-ii...
I also remember a media player being called out by name in the code for doing invalid operations, needing a work around and code to detect it was running just to function.
It's the life of a (game) developer...
An Nvidia employee once told me that one of the easiest ways to squeeze out a few extra frames on your old machine is to rename the game executable to hl2.exe.
And of course, browser engines also do the same things for certain websites:
https://github.com/WebKit/WebKit/blob/main/Source/WebCore/pa...
https://github.com/WebKit/WebKit/blob/main/Source/WebCore/pa...
What it should do is ensure some things not relevant to Half-Life 2 were not done, thus getting better performance for this game in particular, but there is no guarantee that same optimizations work for other applications or games, so one should not expect an overall improvement.
Unless they are doing some silly things like dropping quality, but that's the "everything else the same" point.
If not, why not have this enabled as default behavior instead?
> What it should do is ensure some things not relevant to Half-Life 2 were not done, thus getting better performance for this game in particular, but there is no guarantee that same optimizations work for other applications or games, so one should not expect an overall improvement.
I can't quite parse this. Yes, there is no guarantee that the optimizations will work for another game, which is precisely why you can expect an improvement with hl2. With non-hl2, you may get an improvement, you may not, and you may get incorrect behavior.
Everything else is not the same, but hl2 doesn't use the stuff that's different.
This seems genuinely unbelievable. Does anyone have a technical explanation for this?
then driver "optimizes" behavior, sometimes dishonestly (reducing precision), sometimes honestly (working around game engine stupidity)
Windows 95 patched a bug in SimCity just to get it to work.
Actually, the standard way of allocating 64 kB of memory on the stack is to just assume you can do it, subtract 64k from the stack pointer, and hope for the best.
Most stack allocations in the wild are not checked.
In order to protect against this, the compiler inserts some dummy reads or writes as needed to ensure every page is touched in order from bottom to top. This ensures the guard page is hit before the application has a chance to write to memory beyond it.
Here's an example: https://godbolt.org/z/oTbzTczM6
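For readers who don't want to read the assembly, the probing idea can be sketched in plain C. This is a simulation only: it walks a heap buffer rather than the real stack, assumes a 4096-byte page, and assumes a downward-growing stack, so the walk starts at the high end (nearest the stack pointer) and moves toward lower addresses. The names probe_region, probes_for and PROBE_STRIDE are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { PROBE_STRIDE = 4096 };   /* assumed page size */

/* Touches one byte per page, from the high end of the region down to
   its base, so no page is skipped on the way. Returns the probe count. */
size_t probe_region(volatile unsigned char *region, size_t size) {
    size_t probes = 0;
    size_t off = size;
    while (off > 0) {
        off = (off > PROBE_STRIDE) ? off - PROBE_STRIDE : 0;
        (void)region[off];      /* the dummy read: a volatile access */
        probes++;
    }
    return probes;
}

/* Convenience wrapper that probes a freshly allocated region. */
size_t probes_for(size_t size) {
    unsigned char *region = calloc(size ? size : 1, 1);
    assert(region != NULL);
    size_t n = probe_region(region, size);
    free(region);
    return n;
}
```

A 64 kB allocation needs 16 probes under these assumptions. On a real stack each probe would fault in the next page (or hit the guard page) before any write could land beyond it.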
I agree it would be stupid for a compiler to even support such a flag, but those were the 1980s/90s.
https://www.shlomifish.org/humour/by-others/funroll-loops/Ge...
https://www.reddit.com/r/cpp/comments/1i36ahd/is_this_an_msv...
That is, until I checked the program I used for testing (which I didn't write), and found the following code:
With the original allocator, this worked fine, since the deallocation didn't touch the memory. My allocator, however, overwrote the field during the deallocation with bookkeeping stuff, which meant the returned value was not what the programmer intended and after a short while the program crashed.
Unlike TFA, I had the luxury of just fixing the test program.
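The test program's code is not preserved above, so the following is a hypothetical reconstruction of the pattern, not the original. It uses a toy fixed-size pool (so that reading a released block stays within memory the program owns) and two release policies: one that leaves the payload alone, and one that reuses it for bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

enum { POOL_BLOCKS = 4, FREE_MARK = -1 };

typedef struct { int value; } record;

static record pool[POOL_BLOCKS];
static int in_use[POOL_BLOCKS];

/* Toy allocator: hands out the first unused block, or NULL if full. */
record *toy_alloc(void) {
    for (int i = 0; i < POOL_BLOCKS; i++) {
        if (!in_use[i]) { in_use[i] = 1; return &pool[i]; }
    }
    return NULL;
}

/* Like the original allocator: marks the block free, payload untouched. */
void release_untouched(record *r) {
    in_use[r - pool] = 0;
}

/* Like the replacement: reuses the payload for its own bookkeeping. */
void release_overwriting(record *r) {
    in_use[r - pool] = 0;
    r->value = FREE_MARK;
}

/* The bug pattern: the field is read after the block was released. */
int read_after_release(void (*release)(record *)) {
    record *r = toy_alloc();
    assert(r != NULL);
    r->value = 42;
    release(r);
    return r->value;
}
```

read_after_release(release_untouched) returns 42, so the bug stays hidden; read_after_release(release_overwriting) returns the bookkeeping value instead. With a real heap allocator the same read is undefined behavior, which is why it only surfaced after the allocator changed.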
However, as someone who looks at instruction traces a lot, I could probably write one on why Linux kernel code sucks too. One of my current pet peeves is the way Linux walks bitmasks of CPU bits, which is a reasonably common operation. Due to a chain of unfortunate changes and decisions, it currently needs 16+ instructions to find the next bit, something for which the x86 instruction set has a single instruction. Of course that is so big that it is even outlined, adding even more overhead.
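For comparison, here is a minimal user-space sketch of "find the next set bit" on a 64-bit mask. It is not the kernel's code. It assumes GCC or Clang, whose __builtin_ctzll typically compiles to a single bit-scan instruction on x86; the function names are made up.

```c
#include <assert.h>

/* Returns the index of the lowest set bit at or above `start`,
   or -1 if there is none. */
int next_set_bit(unsigned long long mask, int start) {
    if (start < 0 || start >= 64) return -1;
    mask &= ~0ULL << start;             /* drop bits below `start` */
    return mask ? __builtin_ctzll(mask) : -1;
}

/* Walks every set bit, lowest first, and returns how many were seen. */
int count_by_walking(unsigned long long mask) {
    int seen = 0;
    for (int bit = next_set_bit(mask, 0); bit >= 0;
         bit = next_set_bit(mask, bit + 1)) {
        seen++;
    }
    return seen;
}
```

For the mask 0x29 (bits 0, 3 and 5), the walk visits 0, then 3, then 5, then stops.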
With more and more code being written with AI (which has notoriously inefficient solutions to simple problems), I expect this issue to become more prevalent. I just hope we optimize at the source of the problem (AI and humans using it) and not on platforms (compiler and engine/kernel heuristics)
I do old school embedded, the amount of desktop bloat is insane. Any function I really need to refactor, I can reduce size and improve performance. And there are better engineers out there that are more efficient than me.
In this case we're talking about a tight initialization loop with probably a single instruction in the body. The HW optimizations necessary to make a loop like this perform equally to the unrolled form are so rudimentary that they're taken for granted on basically any CPU, even 30 years ago. Seriously, we're talking about optimizations I made in an "intro to Verilog" class as an undergrad, and I'm not even a HW engineer.
It also depends how often this code is being hit. Does the code run once while the program loads? Nobody will notice a 2 microsecond improvement in loading times. Does the code run in a timing-sensitive hot path, like a game loop or a GUI rendering thread? Well now optimization matters. But again, consider the HW argument above.
Also remember that, back then, storage wasn't cheap. 256K of code is 18% of a 1.44MB floppy, and 35% of a 720K floppy.
Agreed.
Ah, yes. Microsoft's!
It means the fix was applied to run during the emulation loop execution, not that the fix was found and applied while the emulation loop was running.
Which would have made it an emulation code escape.
solidity sweating profusely
But there wasn't any similar programmatic debugging aid for detecting uninitialized stack memory.
Going further down the rabbit hole, I discovered the _chkstk function.
The MS C compiler would emit a call to _chkstk on function entry to ensure that stack memory had been paged in. But further reading noted that _chkstk was only emitted if the function allocated a lot of stack memory. And there was source code! MS included the assembly language source code for _chkstk in the CRT source code, installed with the compiler.
I needed _chkstk to be emitted for every function, not only for functions that allocated >= 4KB of stack variables.
Curses, foiled again.
Then, while perusing the list of compiler command line switches, I see "/Ge".
Ahhhhh! The grey storm clouds parted and the sun's rays shone down on me in their warmth.
I had all the pieces I needed to fill uninitialized stack memory with a non-zero canary value so I could make detection of uninitialized stack variables more reliable.
_stkfil was born
Modifying _chkstk was easy. I needed to write to every byte of stack in a stack page instead of reading only 4 bytes and skipping to the next page of stack.
While I was mucking in the bowels of modifying _chkstk, I added a 4-byte global variable to hold my canary value. Let the app override what value to use.
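The canary idea can be illustrated without touching _chkstk at all. The sketch below is a simulation in portable C, not the _stkfil code: an ordinary buffer stands in for a stack page, 0xDEADBEEF is an arbitrary canary choice, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fills the region with the canary, one 32-bit word at a time. */
void fill_region(unsigned char *region, size_t size, uint32_t canary) {
    for (size_t off = 0; off + sizeof canary <= size; off += sizeof canary)
        memcpy(region + off, &canary, sizeof canary);
}

/* Counts the words that still hold the canary, i.e. were never written. */
size_t untouched_words(const unsigned char *region, size_t size,
                       uint32_t canary) {
    size_t count = 0;
    for (size_t off = 0; off + sizeof canary <= size; off += sizeof canary) {
        if (memcmp(region + off, &canary, sizeof canary) == 0)
            count++;
    }
    return count;
}

/* Simulates a function that initializes only its first `words_written`
   words of a 4096-byte "stack page", then reports what was left alone. */
size_t demo_untouched(size_t words_written) {
    static unsigned char region[4096];
    const uint32_t canary = 0xDEADBEEFu;
    const uint32_t zero = 0;
    assert(words_written <= sizeof region / sizeof zero);
    fill_region(region, sizeof region, canary);
    for (size_t i = 0; i < words_written; i++)
        memcpy(region + i * sizeof zero, &zero, sizeof zero);
    return untouched_words(region, sizeof region, canary);
}
```

A non-zero canary matters because stale stack memory is often zero by accident, which makes uninitialized variables look initialized. Any word still equal to the canary when it is read was never written.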
In debug builds, _stkfil helped find a couple of bugs, but soon all the stray uninited stack vars were gone and the code was forgotten.
Then I read about InitAll in https://www.microsoft.com/en-us/msrc/blog/2020/05/solving-un...