

Discussion (45 Comments)

ameliaquining•about 3 hours ago
This post comes uncomfortably close to plagiarizing https://thebuild.com/blog/2026/04/23/preempt_none-is-dead-yo..., which it cites as a source; almost all the technical explanation is in there and some of the wording is extremely similar. Compare, e.g., "What Linux 7.0 actually changed" in Pettus's post to "What Is Preemption?" in this one. I think this link should have been to Pettus's post instead.
teivah•about 1 hour ago
I used that post as a source, yes, and that's stated explicitly, but it's not the only one. One section in particular is similar since we both present the different preemption modes. However, the audiences are different. thebuild.com has an audience of PostgreSQL enthusiasts (if not experts) and I don't. So a significant part of my post was about explaining things from first principles (what's a page, a TLB, a spinlock, etc.). I explain way more "basic" things and he goes beyond me in terms of how to cope with the problem. I don't think the posts are close.
jdonaldson•about 2 hours ago
That post comes uncomfortably close to how Opus writes this kind of prose. It's a good idea to acknowledge all stakeholders.
fulafel•about 1 hour ago
> PREEMPT_NONE: The kernel almost never interrupts a running thread

This seems confused. These are options for preemptibility of the kernel, which is a relatively modern feature. Userspace could always be preempted, and these options do not change anything there. The kernel must in any case frequently interrupt threads and processes to implement preemptive multitasking, which Linux has of course had since the beginning.

Read more eg at https://lwn.net/Articles/944686/ or help texts at https://github.com/torvalds/linux/blob/master/kernel/Kconfig...

ozgrakkurt•about 2 hours ago
It is a crime that postgres isn't able to allocate with 1GB huge pages by changing a config parameter in 2026

Also a crime that people are still running databases with 4kb pages.

To put it in perspective, this means you will have more than 30 million pages on a server with 128GB RAM. As an example, if there are 16 bytes of metadata per memory page, the metadata itself would take more than half a gigabyte.

dezgeg•about 2 hours ago
Even worse, the actual struct page on Linux is 64 bytes, so 4x your example
bonzini•about 2 hours ago
There is 64 bytes of metadata per memory page indeed.
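The arithmetic in this subthread can be checked with a quick sketch (assuming classic 4 KiB base pages and the 64-byte `struct page` size mentioned above):

```python
# Rough estimate of per-page metadata overhead on a 128 GiB server.
RAM_BYTES = 128 * 1024**3          # 128 GiB of physical memory
PAGE_SIZE = 4 * 1024               # classic 4 KiB base page
STRUCT_PAGE_BYTES = 64             # struct page size on Linux, per the thread

num_pages = RAM_BYTES // PAGE_SIZE
overhead = num_pages * STRUCT_PAGE_BYTES

print(f"pages:    {num_pages:,}")                   # 33,554,432 pages
print(f"metadata: {overhead / 1024**3:.1f} GiB")    # 2.0 GiB
```

So with 64-byte metadata the overhead is 2 GiB, i.e. 4x the half-gigabyte figure from the 16-byte example.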
andrewstuart•about 1 hour ago
Sensible defaults would be nice.
buster•about 3 hours ago
I'd rather like to know if any real-world usage broke before concluding that an edge-case synthetic benchmark is worth changing the kernel (back, or wherever), given that the change that broke the benchmark supposedly had real-world benefits.

Since we will never know, it might be a good idea to feature-gate the change: change the default and let users decide to change it back. That might give some feedback on the LKML or elsewhere to decide whether the change is worthwhile.

nijave•about 3 hours ago
"synthetic benchmark" is doing some heavy lifting here. Pgbench just runs a bunch of SQL statements against a real Postgres instance.

It's very close to a real world simulation of a production workload

buster•about 2 hours ago
I am not questioning the benchmark. But the benchmark is NOT measuring a real-world application in a real-world setting. Anyway, I am merely wondering IF there is a company out there affected at all. I understand that this was only measured on a Graviton 4 setup under very heavy load, without huge pages.

For example, this issue aside, I'd rather split such a workload into multiple smaller instances, naturally. Because the impact of a crash on this single node, heavy load, many cores, many clients scenario would be huge.

MBCook•about 3 hours ago
This only happened under a very odd configuration. Yeah it wasn’t great but it was not the normal case.

The headline implies it broke PG everywhere. It didn’t.

selckin•about 4 hours ago
fabian2k•about 1 hour ago
That regression is maybe most useful as a reminder to people to configure huge pages for PostgreSQL. That's the one recommended basic performance tuning that is just annoying enough to set up that I suspect many people with smaller DBs will skip it.

Though I actually don't know how large shared buffers has to be for huge pages to make a noticeable difference.
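For the setup itself, a rough sizing sketch (assuming the common 2 MiB huge page size on x86_64; the `slack` padding factor is a guess, since PostgreSQL's shared memory segment is somewhat larger than shared_buffers alone):

```python
import math

def nr_hugepages_for(shared_buffers_bytes, huge_page_size=2 * 1024**2, slack=1.1):
    """Rough vm.nr_hugepages value for a given shared_buffers.

    The ~10% slack is an assumption to cover PostgreSQL's extra shared
    memory; check HugePages_Free in /proc/meminfo after startup and
    adjust. Pair this with huge_pages = on in postgresql.conf.
    """
    return math.ceil(shared_buffers_bytes * slack / huge_page_size)

# e.g. shared_buffers = 8GB
print(nr_hugepages_for(8 * 1024**3))  # 4506 huge pages of 2 MiB
```

The resulting number goes into `vm.nr_hugepages` via sysctl.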

ahartmetz•about 2 hours ago
PREEMPT_LAZY triggering on page faults seems like a bad idea in light of this. It is probably not a good idea to suspend processes right when they get unexpectedly bogged down. The logic makes a little more sense for syscalls that are expected to take long compared to a scheduling quantum (a few milliseconds). But page faults are mostly invisible and unplannable.

It only took a few decades for Linux to get a good CPU scheduler and good I/O schedulers, too. I don't get how such an important area can be so bad for so long. But then, bad scheduling is everywhere. I find it to be a pretty fun area to work in, but, judging by how much it is less than half-assed in much existing software, most developers seem to hate dealing with it?

AlienRobot•about 1 hour ago
One thing I miss from using Windows is that the desktop didn't just freeze completely if you ran out of RAM.

At first I thought that maybe Linux doesn't have ways to give priority to the desktop environment (a.k.a. "graphical shell") which is why running out of RAM means your cursor starts lagging, clicking on things stops working, etc.

But maybe Linux is just bad at that in general and a single process eating too much RAM can simply bring the whole system to a halt as it tries to move and compress RAM to a pagefile on an HDD (not SSD).

Every time it happens to me I just find it incredible. Here I am with a PC with multiple cores, multiple processors, and a single process eating all the RAM can bottleneck ALL of them at once? Am I misunderstanding something? Shouldn't it, ideally, work in such a way that as long as one processor is free, the system can process mouse input and render the cursor and do all the desktop stuff no matter what I/O is happening in the background?

Since it's Linux maybe it's just my DE/distro (Cinnamon/Mint). Maybe it does allocations under the assumption there will always be a few free bytes in RAM available, so it halts if RAM runs out while some other DE wouldn't. But even then you'd think there would be a way to just reserve "premium" memory for critical processes so they never become unresponsive.

I wonder if other people have the same experience as me. This part of Linux just always felt fundamentally poor for me.

bobmcnamara•about 2 hours ago
Userspace spinlocks seem like a bad idea too.

What if it was on a VM and the core holding the lock got descheduled from the hypervisor?
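A minimal sketch of the pattern being criticized (a hypothetical Python stand-in; real userspace spinlocks use an atomic test-and-set or compare-and-swap in C). The failure mode described above is that `acquire` burns CPU for the holder's entire stolen time slice:

```python
import threading

class SpinLock:
    """Naive userspace spinlock sketch: busy-wait until the flag is free.

    If the holder's vCPU is descheduled by the hypervisor, every waiter
    spins uselessly: there is no syscall, so the kernel has no idea
    anyone is blocked and cannot boost or run the lock holder.
    """
    def __init__(self):
        self._flag = threading.Lock()  # stand-in for an atomic flag

    def acquire(self):
        while not self._flag.acquire(blocking=False):
            pass  # spin; a real implementation might issue a CPU "pause" hint

    def release(self):
        self._flag.release()

# Usage: two threads bump a shared counter under the spinlock.
lock, counter = SpinLock(), [0]

def bump():
    for _ in range(10_000):
        lock.acquire()
        counter[0] += 1
        lock.release()

threads = [threading.Thread(target=bump) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])  # 20000
```

A futex-style lock avoids the problem by sleeping in the kernel after a short spin, which is exactly the information a pure userspace spinlock withholds.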

nijave•about 3 hours ago
Right on the heels of 6.19 breaking tcmalloc and Mongo
matharmin•about 2 hours ago
Yup - interesting to see so much written about Postgres having a performance regression on Linux 7.0, in a scenario that affects almost no-one in practice. Meanwhile MongoDB refuses to run at all on Linux 7.0 due to some issue with tcmalloc.

https://jira.mongodb.org/browse/SERVER-121885

duskwuff•32 minutes ago
The underlying tcmalloc issue is interesting - the library was relying on an implementation detail of the rseq kernel API which was never guaranteed, and which already generated warnings in previous versions.

https://lore.kernel.org/all/20260126204745.GP171111@noisy.pr...

jeltz•about 2 hours ago
Moderators should change this headline because it is nowhere near true. It only regressed performance on some incorrect configurations.
ApolloFortyNine•about 2 hours ago
I can't help but think of the classic XKCD example of breaking a user's workflow [1].

Doing research, though, a spinlock actually doesn't seem as unusual a hack as it would first seem. Do drivers and the like not have similar issues because they don't trigger page faults, I guess?

[1] https://xkcd.com/1172/

baq•about 3 hours ago
TLDR of the LKML thread: 120GB RAM postgres with hugepages=off, lock contention went from terrible to abysmal. Nothing to see here except that Amazon for whatever reason runs DB tests with huge pages disabled. (Hope I'm not paying for RDS and Auroras like that in production!)
Twirrim•about 3 hours ago
Huge pages have had a spotty history, which led to people being paranoid about them, and no doubt a whole bunch of folks just disable them "because that's what we've always done". They have been stable and reliable for quite a while now; I would really hope folks could move away from that perspective.
jeltz•about 2 hours ago
Are you sure you are not thinking of transparent huge pages? They have a spotty history but you are supposed to run big PostgreSQL instances with huge pages, not transparent huge pages.
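The distinction is visible in `/proc/meminfo`: explicit huge pages show up under `HugePages_Total`/`HugePages_Free`, while transparent huge pages show up as `AnonHugePages`. A small sketch pulling those counters out (the field names are real; the sample text below is made up for illustration):

```python
def hugepage_summary(meminfo_text):
    """Pick the huge-page-related counters out of /proc/meminfo text."""
    wanted = ("HugePages_Total", "HugePages_Free", "AnonHugePages", "Hugepagesize")
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            fields[key] = int(rest.split()[0])  # first token; sizes are in kB
    return fields

# Illustrative sample; on a real box, read open("/proc/meminfo").read()
sample = """\
MemTotal:       131072000 kB
AnonHugePages:    524288 kB
HugePages_Total:    4506
HugePages_Free:      120
Hugepagesize:       2048 kB
"""
print(hugepage_summary(sample))
```

Nonzero `AnonHugePages` with zero `HugePages_Total` means THP is doing the work, not an explicit huge page reservation.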
nijave•about 3 hours ago
I tested it once about 2 years ago on Azure VM and got a nice 10-15% perf boost on pgbench (I want to say at least 64GB shared mem)
lstodd•about 2 hours ago
I remember when support for them first appeared and you had to LD_PRELOAD a shim (IIRC) to make Postgres actually use them. We jumped on it, enabled them immediately, and got a pretty significant boost, around 15-20%, yes.

That was, idk, 2008-9-ish? I don't know what spotty history you are talking about; if you have multigigabyte address spaces floating around on a machine, it's stupid not to use hugepages.

nijave•about 3 hours ago
In fairness, AWS could (and almost certainly is) using their own kernel build that does who-knows-what
mplanchard•about 3 hours ago
Also was only on ARM, wasn’t it?
dist-epoch•about 3 hours ago
Many people have desktops with 128 GB RAM. Should they enable hugepages? I've never heard this recommendation for a desktop.
nijave•about 3 hours ago
Huge pages are good when a single process is reserving a giant block of memory, which I think isn't that common.

You might have transparent huge pages on by default depending on the distro

dataflow•about 3 hours ago
An X% performance regression is basically a (100 - X)% feature breakage, so whatever that implies in terms of breaking userspace...
PunchyHamster•about 3 hours ago
Seems Linus needs to yell at someone again.

Especially with containers around you might very well hit the case of running new kernel but older version of PostgreSQL with no code mitigation for the problem

nobleach•about 3 hours ago
I get that folks love a good Linus rant. But as someone who's been on the receiving end of that style of "feedback", nothing can be more humiliating or demotivating. Certainly there are contributors making "rookie mistakes". There are folks unwilling to ingest the entire context of what was tried back in 2.0.36, 2.2, 2.4... etc. And perhaps it's wise to simply stay away until you're completely certain you've got the chops to contribute. More than half the folks who enjoy that sort of abuse don't have those chops.

I can defend someone who is unwilling to yield on quality. After all, this truly is his baby. Issuing scathing rebukes to well-intentioned contributors is like slapping my kid when he brings me the wrong type of screwdriver.

ecshafer•about 3 hours ago
I don't think a Linus rant ever hit anyone that was a rookie, they are always AFAIK against people "who should know better". Veteran developers, with multiple commits merged.
colechristensen•about 3 hours ago
If you're at the level of delivering to Linus, I'm sorry but humiliation and demotivation are earned.

You don't talk like this to junior or even senior engineers, but you do reach a level at which gently telling isn't necessary.

If you don't like it go fork Linux and try being the nice benevolent dictator and we'll applaud your success.

slackfan•about 2 hours ago
Code quality does not care about your feelings.
themafia•about 2 hours ago
> scathing rebukes

Would you be able to point one out?

> to well-intentioned contributors

This is a system used and relied upon by billions of people around the world. Your intentions, while good, are not material to the problem. Put another way we have an endless supply of people with "good intentions" but we don't enjoy the same largess of people with "good skills."

vogelke•2 minutes ago
https://lwn.net/Articles/343828/ describes Alan Cox trying to fix the TTY layer, being trashed by Linus, and removing himself from the maintainer page.
bonzini•about 2 hours ago
Nope, there was and will be no yelling.