

Discussion (106 Comments) | Read Original on HackerNews

robertwt7about 1 hour ago
GPT 5.5 combined with Codex is really good. I actually have no doubt whenever I ask questions, plan, or implement code with it. With Opus 4.7, I have to keep double-checking because it doesn't follow the CLAUDE.md instructions, it hallucinates a lot, and by default it makes things up when it can't find the answer to something. It's crazy how quickly people were saying OpenAI was left behind last year when they declared code red, and look at where we are now.
wincyabout 3 hours ago
Just tried it out for a prod issue I was experiencing. Claude never does this sort of thing: I had it write an update statement after doing some troubleshooting, and I said "okay let's write this in a transaction with a rollback", and GPT-5.5 gave me the old "okay,

BEGIN TRAN;

-- put the query here

commit;

I feel like I haven't had to prod a model to actually do what I told it to in a while, so that was a shock. I guess it does use fewer tokens that way; it's just annoying, when I'm paying for the "cutting edge" model, to have it be lazy on me like that.

This was in Cursor; the model popped up in the model selector, so I tried it out.
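For reference, the transaction-with-rollback pattern the commenter asked for looks roughly like this. A minimal Python/sqlite3 sketch, where the `orders` table and the expected row count are hypothetical stand-ins for the real prod query:

```python
import sqlite3

# Hypothetical stand-in for the real prod database and query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO orders (status) VALUES (?)",
                 [("pending",), ("pending",)])
conn.commit()

# Run the fix inside an explicit transaction: verify the affected row
# count before committing, and roll back if anything looks wrong.
try:
    cur = conn.execute(
        "UPDATE orders SET status = 'shipped' WHERE status = 'pending'")
    if cur.rowcount != 2:  # sanity check before making it permanent
        raise RuntimeError(f"expected 2 rows, touched {cur.rowcount}")
    conn.commit()
except Exception:
    conn.rollback()
    raise
```

The point of the complaint: the model emitted the `BEGIN TRAN; ... commit;` skeleton without the verification step or the rollback path, which is the part that was actually asked for.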

XCSmeabout 3 hours ago
I feel like the last 2-3 generations of models (after gpt-5.3-codex) didn't really improve much; they just changed things around and made different tradeoffs.
pixel_poppingabout 3 hours ago
I disagree; it improved enormously, especially at staying consistent on long tasks. I have a task that has been running for 32 days (400M+ tokens) via Codex, and that's only been possible since gpt-5.4.
ericpauleyabout 3 hours ago
Has that task accomplished anything yet?
lowdudeabout 2 hours ago
That's actually crazy. What kind of task is that? Is it a recurring task, like some analysis, or is it coding-related?
r_leeabout 2 hours ago
...what? what kind of a task are you running?
endymi0nabout 2 hours ago
OpenAI is the first company that has reached a level of intelligence so high, the model has finally become smart enough to make YOU do all the work. Emergent behavior in action.

All joking aside, OpenAI's oddly specific singular focus on "intelligence per token" (also in the benchmarks), which literally no one else pushes so hard, eerily reminds me of Apple's MacBook anorexia era pre-M1: one metric to chase at the cost of literally anything else. GPT-5.3+ are some of the smartest models out there and could be a pleasure to work with, if they weren't lazy bastards to the point of being completely infuriating.

syspecabout 3 hours ago
Can't tell if the above is good or bad.
hbnabout 2 hours ago
GPT-5.5 shatters benchmarks for amount of faith it puts in the user.
guilamuabout 3 hours ago
Just tested it on my homemade WordPress+GravityForms benchmark, and it's one of the worst models on the leaderboard performance-wise and the worst value-wise: https://github.com/guilamu/llms-wordpress-plugin-benchmark

I know it's only a single benchmark, but I don't understand how it can be so bad...

goldenarmabout 2 hours ago
gemma4-e4b is 50% better than gemma4-26b in your benchmark, something's wrong
guilamuabout 2 hours ago
Yes, those two models were tested on my own PC (local inference on my own CPU/GPU), so something may be bugged in my setup. gemma4-26b should be far better than gemma4-e4b.
embedding-shapeabout 2 hours ago
Sounds like maybe a worse quantization was used on the bigger model? Quantization matters a lot for quality; basically anything below Q8 is borderline unusable. If it isn't specified in the benchmark already, it probably should be.
ac29about 3 hours ago
Your benchmark has Opus 4.7 performing significantly worse than Sonnet 4.6. Even if true on your benchmark, that is not representative of the overall performance of the models.
guilamuabout 2 hours ago
Yes, Opus 4.7 fast (no reasoning) did a worse job than Sonnet 4.6 high (with reasoning), according to the Gemini 3.1 Pro evaluation.
ac29about 2 hours ago
Your table doesn't indicate reasoning vs non-reasoning, or reasoning level
mosselmanabout 2 hours ago
You even traveled in time to deliver us this benchmark.

I really like this benchmark. Have you evaluated the judge somehow? I'd love to set up my own similar benchmark.

guilamuabout 2 hours ago
Haha, just fixed the date!

I haven't evaluated the judge. You have everything needed in the repo to do so, though, so be my guest. It took me a bit of time to put all this together, and I won't have much more time to dedicate to it for a couple of weeks.

BTW, if you explore the repo, sorry for all the French files...

DrProticabout 2 hours ago
Seems like a benchmark for how good a model is at vibe coding.

Your prompt is extremely slim yet you score it on a bunch of features.

guilamuabout 2 hours ago
Yes, the prompt is slim by design. I might be wrong, but the point was to see what the model can do "on its own".

The eval prompt is quite extensive: https://github.com/guilamu/llms-wordpress-plugin-benchmark/b...

DrProticabout 2 hours ago
That's the thing: not everyone wants or values the model based on that. But I guess it works for you, and that benchmark captures it.

I personally develop with a very detailed spec, and I want nothing more and nothing less than the spec.

I found 5.4/5.5 much better at following a spec, while Opus makes some things up. That aligns with your benchmark, but it makes 5.4/5.5 better for me while worse for you.

jvidalv25 minutes ago
Am I the only one who feels that OpenAI has bots/commenters on payroll in all this kind of news, downplaying Claude and stating how much better Codex is?

There's too much of it and too many of them, and some of their takes don't fly if you use Claude daily.

zerof1labout 2 hours ago
I don't see any meaningful performance improvements in those paid models anymore.

They all roughly produce junior developer-level code, continue to have mental breakdowns in their “thinking” stage, occasionally hallucinate things, delete pieces of code/docs they don’t understand or don’t like, use 1.5 times the necessary words to explain things when generating docs and so on.

I'm now testing "avoid sycophancy, keep details short and focus on the facts" in my AGENTS.md files.

podnamiabout 2 hours ago
This is snark. Since when has a junior-level dev managed to debug and deploy, say, a CloudFormation stack, and follow up with notes, in under 3 minutes?
gjsman-1000about 1 hour ago
Heard this analogy elsewhere, but worth repeating:

AI is like having the greatest developer who ever lived, but she is always on 4 beers.

nathan-helloabout 1 hour ago
Personifying AI is incredibly cringe, no matter how weird your comparison is.
Topfiabout 1 hour ago
Pricing by context length:

Input: $5/M tokens at <=272K, $10/M tokens above 272K.

Output: $30/M tokens at <=272K, $45/M tokens above 272K.

Cache read: $0.50/M tokens at <=272K, $1/M tokens above 272K.

Significantly more expensive than Opus 4.7 beyond 272K, and at least in my tasks I haven't seen the model be that much more token-efficient, certainly not to such a degree that it'd compensate for this difference. GPT-5.4 had a solid context window at 400K with reliable compaction; both appear somewhat regressed, though it's still too early to truly say whether compaction is less reliable. Also, I've found frontend output still skews towards that one very distinct, easily noticeable, card-laden, bluesy-hued, overindulged template that made me skeptical of Horizon Alpha/Beta pre-GPT-5 release. It ended up doing amazingly at the time on task adherence, which made it very useful for me outside that one major deficit. The fact that GPT-5.5 is still so restricted in that area is weird, considering it's supposed to be an entirely new foundation.
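Taken at face value, the tiers quoted above make per-request cost easy to estimate. A sketch using only the numbers in this comment, and assuming (as with other tiered APIs) that the higher rate applies to the whole request once the prompt exceeds 272K tokens:

```python
def request_cost(input_tokens, output_tokens, cached_tokens=0):
    """Estimate USD cost of one GPT-5.5 request from the quoted tiers.

    Assumption: the higher tier applies to the entire request once the
    prompt exceeds 272K tokens; OpenAI's actual billing may differ.
    """
    long_ctx = input_tokens > 272_000
    rate_in = 10 if long_ctx else 5        # $/M fresh input tokens
    rate_out = 45 if long_ctx else 30      # $/M output tokens
    rate_cache = 1 if long_ctx else 0.50   # $/M cache-read tokens
    fresh = input_tokens - cached_tokens
    return (fresh * rate_in
            + cached_tokens * rate_cache
            + output_tokens * rate_out) / 1e6

# e.g. a 100K-token prompt (half cached) with a 5K-token answer:
# request_cost(100_000, 5_000, cached_tokens=50_000) -> 0.425 USD
```

The jump past 272K is what the comment is pointing at: the same request shape costs roughly 1.5-2x more once it crosses the tier boundary.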

sigmoid10about 4 hours ago
Huh. Yesterday they said:

>API deployments require different safeguards and we are working closely with partners and customers on the safety and security requirements for serving it at scale.

And now this. I guess one day counts as "very soon." But I wonder what that meant for these safeguards and security requirements.

FINDarksideabout 4 hours ago
When stuff is delayed due to "safeguards" it just means they don't think they have the compute to release it right now.
simonwabout 4 hours ago
I wonder if the fact that GPT-5.5 was already available in their Codex-specific API which they had explicitly told people they were allowed to use for other purposes - https://simonwillison.net/2026/Apr/23/gpt-5-5/#the-openclaw-... - accelerated this release!
embedding-shapeabout 4 hours ago
The same person who has repeatedly lied about safety is still running the company, so I'm not sure why anyone would expect any different from them moving forward. A previous example:

> In 2023, the company was preparing to release its GPT-4 Turbo model. As Sutskever details in the memos, Altman apparently told Murati that the model didn’t need safety approval, citing the company’s general counsel, Jason Kwon. But when she asked Kwon, over Slack, he replied, “ugh . . . confused where sam got that impression.”

There are lots of cases where Altman has not been entirely forthcoming about how important (or not) safety is for OpenAI. https://www.newyorker.com/magazine/2026/04/13/sam-altman-may... (https://archive.is/a2vqW)

neosatabout 3 hours ago
Enterprise user here and still seeing only 5.4. Yesterday's announcement said that it will take a few hours to roll out to everybody. OpenAI needs better GTM to set the right expectations.
neosatabout 3 hours ago
Just refreshed and see 5.5 now - yay! Love the speedy resolution ;) Thanks folks, I'll complain faster next time....
ftononabout 3 hours ago
Looks like the default config in the chat is instant 5.3; it only uses 5.5 on the thinking variant.
bnm04about 2 hours ago
They moved a few months ago to have separate instant and thinking models. 5.3 is the latest instant, and 5.5 is a reasoning model.
czkabout 3 hours ago
API page lists the knowledge cutoff as Dec 01, 2025 but when prompting the model it says June 2024.

   Knowledge cutoff: 2024-06
   Current date: 2026-04-24

   You are an AI assistant accessed via an API.
BeetleBabout 3 hours ago
I don't know why this keeps coming up. Asking the model has always been the least reliable way to learn its cutoff date (and indeed, it may well have been trained on sites with comments like these!).

Just ask it about an event that happened shortly before Dec 1, 2025. Sporting event, preferably.

czkabout 3 hours ago
The model obviously knows things after the reported date, but it's curious that it reports that date so consistently.

Could be they do it intentionally to encourage more tool calls/searches, or for tuning reasons.

htrpabout 3 hours ago
Can you really believe things that the model says? (A lot of prior model api pages say knowledge cutoffs of June 2024, maybe the model picks that up?)
czkabout 3 hours ago
You can't, but it's pretty reproducible across the API, Codex, and other agents, so I just thought it was odd. The full text it gives:

   Knowledge cutoff: 2024-06
   Current date: 2026-04-24

   You are an AI assistant accessed via an API.

   # Desired oververbosity for the final answer (not analysis): 5
   An oververbosity of 1 means the model should respond using only the minimal content necessary to satisfy the request, using concise phrasing and avoiding extra detail or explanation."
   An oververbosity of 10 means the model should provide maximally detailed, thorough responses with context, explanations, and possibly multiple examples."
   The desired oververbosity should be treated only as a *default*. Defer to any user or developer requirements regarding response length, if present.
swyxabout 3 hours ago
Can you test it on, say, who won the 2024 US election?
ghurtadoabout 3 hours ago
I can't really think of a less reliable test for anything than a random guess on something that had about 50/50 odds to begin with.

Easiest Turing test ever...

himata4113about 3 hours ago
ask it 10 times.
WarmWashabout 3 hours ago
Usually the labs do some kind of post training on major events so the model isn't totally lost.

A better test is something like "what is the latest version of NumPy?"

bakugoabout 3 hours ago
That sort of test isn't super reliable either, in my experience.

You're probably better off asking something like "what are the most notable changes in version X of NumPy?" and repeating until you find the version at which it says "I don't know" or hallucinates.

czkabout 3 hours ago
with thinking off and tools disabled:

  Donald Trump won the 2024 U.S. presidential election.
bakugoabout 3 hours ago
Models don't know what their cutoff dates are unless told via a system prompt.

The proper way to figure out the real cutoff date is to ask the model about things that did not exist or did not happen before the date in question.

A few quick tests suggest 5.5's general knowledge cutoff is still around early 2025.
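That probing approach can be mechanized as a bisection over dated facts. In the sketch below, `model_knows` is a hypothetical oracle standing in for a real API call that asks about an event from a given month and checks for a correct answer rather than a refusal or hallucination:

```python
from datetime import date

def estimate_cutoff(model_knows, start=date(2023, 1, 1), end=date(2026, 4, 1)):
    """Bisect over months to find the latest month the model still 'knows'.

    Assumes `model_knows` is monotone (true up to the cutoff, false after)
    and that the model knows the `start` month.
    """
    # Build the list of candidate months from start to end.
    months, d = [], start
    while d <= end:
        months.append(d)
        d = date(d.year + (d.month == 12), d.month % 12 + 1, 1)
    # Binary search for the last month where the oracle returns True.
    lo, hi = 0, len(months) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if model_knows(months[mid]):
            lo = mid
        else:
            hi = mid - 1
    return months[lo]
```

With a stub oracle whose true cutoff is March 2025, `estimate_cutoff(lambda d: d <= date(2025, 3, 1))` converges on that month in about six probes instead of forty.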

czkabout 3 hours ago
I wonder if they put an older cutoff date into the prompt intentionally, so that when asked about more current events it leans towards tool calls / web searches, or for tuning reasons.
ssl-3about 2 hours ago
I wonder if the cutoff date is the result of so many people posting about the date over time and poisoning the data. "Dead cutoff date theory," perhaps.

Whatever it is, the cutoff date reporting discrepancy isn't new. Back when Musk was making headlines about buying/not buying Twitter, I was able to find recent-ish related news that was published well after the bot's stated cutoff date.

ChatGPT was not yet browsing/searching/using the web at that point. That tool didn't come for another year or so.

MallocVoidstarabout 3 hours ago
OpenAI does tell the model the current date via API, so it's odd for them not to also tell the model its cutoff
socoabout 3 hours ago
Stupid question: wouldn't it then search the web for that event?
bakugoabout 3 hours ago
If you have web search enabled, sure. But if you're testing on the API, you can just not enable it.
QuadrupleAabout 2 hours ago
Exactly double the cost of GPT-5.4: $5 per MTok input, $0.50 cached, $30 output.

All the AI players definitely seem to be trying to claw more money out of their users at the moment.

languid-photicabout 2 hours ago
It's 2x/token, but for default reasoning we've found GPT-5.5 uses fewer tokens overall, so net cheaper on median. [1]

(Note, that stops being true at higher reasoning levels, where our observed total cost goes up ~2-3x.)

[1] https://x.com/voratiq/status/2047737190323769488?s=20

guilamuabout 2 hours ago
https://openrouter.ai/openai/gpt-5.5-pro

$30/$180 on OpenRouter. Did I miss something?

languid-photicabout 2 hours ago
I think that's Pro. Regular 5.5 is 2x regular 5.4.
throw03172019about 4 hours ago
Faster than anticipated because of Deepseek release?
XCSmeabout 3 hours ago
Doubt it, DeepSeek v4 is quite underwhelming.
swyxabout 3 hours ago
More like they wanted to release it yesterday but had some last-minute flags they wanted to hold off for.
Jhonwilsonabout 3 hours ago
ok not bad
m3kw9about 3 hours ago
Maybe but no one serious is using deepseek
XCSmeabout 2 hours ago
GPT-5.5 is close to Opus 4.7, but at 7x the cost[0]...

Either Opus 4.7 miscounts reasoning tokens, or it's A LOT more efficient than GPT-5.5.

I thought they made GPT-5.5 more token-efficient than 5.4, but it uses 2x the reasoning tokens.

[0]: https://aibenchy.com/compare/openai-gpt-5-5-medium/openai-gp...

redsaberabout 4 hours ago
Not available for GitHub Copilot Pro (only in Pro+, Business, and Enterprise). I'm really feeling now that the era of subsidized AI is over.
sunaookamiabout 3 hours ago
With a 7.5x multiplier, and even that is a promo! Microsoft is insane! https://github.blog/changelog/2026-04-24-gpt-5-5-is-generall...
skeledrewabout 3 hours ago
This is where the emigration to Chinese providers begins.
_pdp_about 3 hours ago
A very expensive model for API usage. Fine in codex I think.
pants2about 4 hours ago
Is anyone here actually using pro models through the API? I'd be very curious what the use-case is.
chadashabout 4 hours ago
Yes. High value work where cost (mostly) doesn't matter. For example, if I need to look over a legal doc for possible mistakes (part of a workflow i have), it doesn't matter (in my case) whether it costs $0.01 or $10.00, since it's a somewhat infrequent event. So i'll pay $9.99 more, even if the model is only slightly better.
bogtogabout 3 hours ago
I'm surprised I never heard people talking about using -Pro variants, even though their rates ($125-175/M?) aren't drastically larger than old Opus ($75/M), which people seemed to use
freedombenabout 4 hours ago
Indeed, even just Terms of Service and Privacy Policy work. Infrequent enough that cost isn't an issue, but model quality absolutely is
ComputerGuruabout 4 hours ago
Yes? The same reason you would use it via the tooling.
gigatexalabout 3 hours ago
what's the real world comparison to opus 4.7 fellow coders?
Sembianceabout 1 hour ago
I gave 4.6, 4.7 and GPT 5.5 the same prompt and task to reverse engineer a collection of sample vector files from an obscure Amiga CAD program and create a detailed txt specification and a python converter that converts to SVG and produce a report so I can visually verify.

4.6 did very well: 90% perfect on the first try, and it got to 100% with just a few follow-ups. 4.7 failed horribly: it first produced garbage output and claimed it was done, admitted as much when called out, worked at it a lot longer, and then IT GAVE UP. GPT 5.5 codex was shockingly good: it achieved 90% perfect on the first try in about a fourth of the time, and got to 100% faster and with fewer follow-ups.

I’m impressed.

pillefitzabout 3 hours ago
Please consider the ethical aspects of giving money to OpenAI versus alternatives.
Jhonwilsonabout 3 hours ago
that is great news
refulgentisabout 2 hours ago
I'm absolutely stunned by what I've seen from 5.5. I thought it'd be a nothingburger and ~= Opus.

Gave it two very long-running problems I haven't had the courage to work on in the last 2.5 years, solved each within an hour.

- An incremental streaming JSON decoder that can optionally take a list of keys to stop decoding after. 1800 LOC and about 30 minutes later, my local-first app's first sync time is 0.8s instead of 75s when there's 1.5 GB of data locally.

- Flutter Web can compile to WASM and then render via Skia WASM. I've been getting odd crashes during rapid animation for months. In an hour, it got Skia WASM checked out, building locally, a Flutter test script, and root caused the issue to text shadows and font glyphs (technically, not solved yet, I want to get to the point we have Skia / Flutter patch(es))

If you told me a week ago that an LLM could do either of these, without heavy guidance, I'd be stunned. And I regularly push them to limits, ex. one of Opus' last projects was a tolerant JSON decoder, and it ended up being 8% faster than the one built-in to Dart/Flutter, which has plenty of love and attention. (we're cheating a little, that's why it's faster. TL;DR: LLMs will emit control characters in JSON and that's fine for me, treating them as fine means file edit error rates go from ~2% to 0%)

I just wish it was cheaper, but, don't we all...
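The first project, an incremental decoder that stops once a given set of top-level keys has been seen, can be sketched in a few dozen lines. This is a toy reconstruction under stated assumptions (a top-level JSON object, string keys), not the commenter's 1800-LOC implementation:

```python
import json
import re

_SKIP = re.compile(r"[\s{,]*")   # object opener, commas, whitespace
_COLON = re.compile(r"\s*:\s*")

class StreamingKeyDecoder:
    """Incrementally pull top-level key/value pairs out of a JSON object
    arriving in chunks, stopping early once all `stop_keys` are seen.

    Toy sketch: ignores the ambiguity of a bare number sitting at the
    very end of the buffer (it might still grow more digits).
    """

    def __init__(self, stop_keys=()):
        self.buf = ""
        self.pos = 0              # everything before pos is fully consumed
        self.stop = set(stop_keys)
        self.found = {}
        self.done = False
        self._dec = json.JSONDecoder()

    def feed(self, chunk):
        if self.done:
            return self.found
        self.buf += chunk
        while True:
            i = _SKIP.match(self.buf, self.pos).end()
            if i >= len(self.buf) or self.buf[i] == "}":
                break                     # end of input so far, or object closed
            try:
                key, j = self._dec.raw_decode(self.buf, i)
                colon = _COLON.match(self.buf, j)
                if colon is None:
                    break                 # the ':' hasn't arrived yet
                val, k = self._dec.raw_decode(self.buf, colon.end())
            except json.JSONDecodeError:
                break                     # partial key or value: wait for more
            self.found[key] = val
            self.pos = k
            if self.stop and self.stop <= self.found.keys():
                self.done = True          # stop decoding the rest of the stream
                break
        return self.found
```

A decoder like this lets a sync path stop parsing as soon as the few keys it needs have arrived, which is plausibly where the 75s-to-0.8s win comes from when the payload is 1.5 GB.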

rvnxabout 4 hours ago
These safeguards are a very bad habit. The "safety" filters are counter-productive and can even be dangerous.

Where I live, for example, a lot of doctors use ChatGPT both to search for diagnoses and to communicate with non-English-speaking patients.

The same applies to you, when you want to learn about a disease, real-world threats, statistics, self-defense techniques, etc.

Otherwise it's like blocking Wikipedia on the grounds that the knowledge could be used for harmful things, or that reading it might change your mind.

Freedom to read about things is good.

NicuCalceaabout 3 hours ago
> a lot of doctors are using ChatGPT both to search diagnosis and communicate with non-English speaking patients

I think that's the problem. Who's going to claim responsibility when ChatGPT hallucinates or mistranslates a patient's diagnosis and they die? For OpenAI, this would at best be a PR nightmare, so that's why they have safeguards.

rvnxabout 2 hours ago
Adults bear responsibility for choices about their own lives. In fact, the more educated they are, the better choices they can make.

A doctor who gets refused by ChatGPT doesn't stop needing to communicate with the patient; they fall back to a worse option (Google Translate, a family member interpreting, guessing). Refusal isn't safety, it's liability-shifting dressed up as safety.

If there's no doctor, no interpreter, no pharmacist, just a person with a sick kid and a phone, then "refuse and redirect to a professional" is advice from a world that doesn't exist for them. The refusal doesn't send them to a better option; there is no better option, and that's the situation for a large majority of people on this planet.

The road to hell is paved with good intentions, but open education and unlimited access to knowledge are very good.

It doesn't change human nature either: bad people stay bad, good people stay good.

About PR, they're optimizing for not being the named defendant in a lawsuit or the subject of a bad news cycle, it's self-interest wearing benevolence as a costume.

This is because harms from answering are punishable (bad PR; unhappy advertisers, investors, politicians/dictators, lobbies, the army, etc.), while harms from refusing are invisible and unpunished.

landl0rd9 minutes ago
It is not "liability shifting", you are asking them to take this liability on. Status quo matters in how we frame this. It is entirely reasonable for them to decline because settlements when things go wrong in medicine can be astronomically, back-breakingly large.

Until the court system holds to that without reservation, and sympathetic juries can no longer be persuaded to issue a zillion-dollar settlement against Evil AI Co, they are not going to change this policy. It's all just CYA. Most of the stupid company policies you ever encounter come down to CYA. The rest are mostly laziness.

This won't happen, so instead it will be left to specialized firms that understand the industry well and so won't get sued to death, which is pretty normal for highly regulated industries. The price of state legibility is that things lag a few years. I'm not even saying this is good, but it's the trade-off we have chosen as a society. We could offer a "right to try"-style legal shield for deploying new technologies where they provide serious benefits if we wanted to change this.

NicuCalceaabout 1 hour ago
> A doctor who gets refused by ChatGPT doesn't stop needing to communicate with the patient; they fall back to a worse option

I think AI proves the contrary. There are plenty of examples of things getting worse because of technological advancement, particularly AI: software quality, writing, and online discourse have all suffered over the last few years, and misinformation has grown. I truly believe the internet is a worse place than it was 5 years ago, and I can't imagine bringing that to medicine would work out differently.

The medical system shouldn't rely on falling back to crappy workarounds, it should aspire to build the best system it reasonably can.

hellohello2about 3 hours ago
The doctor would be responsible.

If I had a choice between a doctor that used AI and one that didn't, I would much prefer the one that did...

NicuCalceaabout 3 hours ago
The doctor would be responsible for the accuracy of their translation tool, something they can't verify but you expect them to use?
timedudeabout 3 hours ago
Yup, deliberately making the model dumber.