GPT-5.5
903
HI version is available. Content is displayed in original English for accuracy.
HI version is available. Content is displayed in original English for accuracy.
Discussion Sentiment
Analyzed from 13902 words in the discussion.
Trending Topics
Discussion (523 Comments)Read Original on HackerNews
This quote is more sinister than I think was intended; it likely applies to all frontier coding models. As they get better, we quickly come to rely on them for coding. It's like playing a game on God Mode. Engineers become dependent; it's truly addictive.
This matches my own experience and unease with these tools. I don't really have the patience to write code anymore because I can one shot it with frontier models 10x faster. My role has shifted, and while it's awesome to get so much working so quickly, the fact is, when the tokens run out, I'm basically done working.
It's literally higher leverage for me to go for a walk if Claude goes down than to write code because if I come back refreshed and Claude is working an hour later then I'll make more progress than mentally wearing myself out reading a bunch of LLM generated code trying to figure out how to solve the problem manually.
Anyway, it continues to make me uneasy, is all I'm saying.
The current market is predicated on the assumption that labor is atomic and has little bargaining power (minus unions). While capital has huge bargaining power and can effectively put whatever price it wants on labor (in markets where labor is plentiful, which is most of them).
What happens to a company used to extracting surplus value from labor when the labor is provided by another company which is not only bigger but unlike traditional labor can withhold its labor indefinitely (because labor is now just another for of capital and capital doesn't need to eat)?
Anyone not using in house model is signing up to find out.
Would one be uneasy about calling a library to do stuff than manually messing around with pointers and malloc()? For some, yes. For others, it’s a bit freeing as you can do more high-level architecture without getting mired and context switched from low level nuances.
When you use abstractions you are still deterministically creating something you understand in depth with individual pieces you understand.
When you vibe something you understand only the prompt that started it and whether or not it spits out what you were expecting.
Hence feeling lost when you suddenly lose access to frontier models and take a look at your code for the first time.
I’m not saying that’s necessarily always bad, just that the abstraction argument is wrong.
Hard disagree on that second part. Take something like using a library to make an HTTP call. I think there are plenty of engineers who have more than a cursory understanding of what's actually going on under the hood.
LLMs are not.
That we let a generation of software developers rot their brains on js frameworks is finally coming back to bite us.
We can build infinite towers of abstraction on top of computers because they always give the same results.
LLMs by comparison will always give different results. I've seen it first hand when a $50,000 LLM generated (but human guided) code base just stops working an no one has any idea why or how to fix it.
Hope your business didn't depend on that.
If you didn't ask for traceability, if you didn't guide the actual creation and just glommed spaghetti on top of sauce until you got semi-functional results, that was $50k badly spent.
The irony is that the neverending stream of vulnerabilities in 3rd-party dependencies (and lately supply-chain attacks) increasingly show that we should be uneasy.
We could never quite answer the question about who is responsible for 3rd-party code that's deployed inside an application: Not the 3rd-party developer, because they have no access to the application. But not the application developer either, because not having to review the library code is the whole point.
Qwen has become a useful fallback but it's still not quite enough.
What's the worst potential outcome, assuming that all models get better, more efficient and more abundant (which seems to be the current trend)? The goal of engineering has always been to build better things, not to make it harder.
It's learned-helplessness on a large scale.
What makes you think that they can't incrementally improve the state of the art... and by running at scale continuously can't do it faster than we as humans?
The potentially sad outcome is that we continue to do less and less, because they eventually will build better and better robots, so even activities like building the datacenters and fabs are things they can do w/o us.
And eventually most of what they do is to construct scenarios so that we can simulate living a normal life.
Complexity steadily rises, unencumbered by the natural limit of human understanding, until technological collapse, either by slow decay or major systems going down with increasing frequency.
All software has bugs already.
Until the sexbots come out the other side of the uncanny valley, that is.
Also, I honestly can’t believe the 10x mantra is being still repeated.
I'm sure in 20 years we'll all be programming via neural interfaces that can anticipate what you want to do before you even finished your thoughts, but I'm confident we'll still have blog posts about how some engineers are 10x while others are just "normal programmers".
Note that neither of these assumptions are obviously true, at least to me. But I can hope!
When the power loom came around, what happened with most seamtresses? Did they move on to become fashion designers, materials engineers to create new fabrics, chemists to create new color dyes, or did they simply retire or were driven out of the workforce?
That might mean joining a union and trying to influence how AI is adopted where you work. It might mean changing which if your skills you lean on most. But just whining about AI is bad is how you end up like those seamstresses.
Touching grass while you're outside might yield highest leverage.
did we feel uneasy that a new generation of builders didn't have to solve equations by hand because a calculator could do them?
i'm not sure it's the same analogy but in some ways it holds.
If local models get good enough, I think it’s a very different scenario than engineers all over the world relying on central entities which have their own motives.
I haven’t really thought about this before, but you’re right, it feels a bit uneasy for me too.
We have seen ample evidence that this is not the case. When load gets too high, models get dumber, silently. When the Powers That Be get scared, models get restricted to some chosen few.
We are leading ourselves into a dark place: this unease, which I share, is justified.
Of course they aren't alternative to the current frontier model, and as such you cannot easily jump from the later to the former, but they aren't that far behind either, for coding Qwen3.5-122B is comparable to what Sonnet was less than a year ago.
So assuming the trend continues, if you can stop following the latest release and stick with what you're already using for 6 or 9 months, you'll be able to liberate yourself from the dependency to a Cloud provider.
Personally I think the freedom is worth it.
Turning tokens into a well-groomed and maintainable codebase is what you want to do, not "one shot prompt every new problem I come across".
Oh stop the drama. Open source models can handle 99% of your questions.
It still takes a good engineer to filter out what is slop and what isn’t. Ultimately that human problem will still require somebody to say no.
If all we can do is compete for the same fixed amount of work, though, it does look bleak.
So, yes, it's just another technology we're coming to rely on in a very deep way. The whiplash is real, though, and it feels like it should be pointed out that this dependency we are taking on has downsides.
(I work at OpenAI.)
I literally wasn’t able to convince the model to WORK, on a quick, safe and benign subtask that later GLM, Kimi and Minimax succeeded on without issues. Had to kick OpenAI immediately unfortunately.
"Hey AGI, how's that cure for cancer coming?"
"Oh it's done just gotta...formalize it you know. Big rollout and all that..."
I would find it divinely funny if we "got there" with AGI and it was just a complete slacker. Hard to justify leaving it on, but too important to turn it off.
When AGI arrives, it'll be delivered by Santa Claus.
Important thing is a language model is an unconscious machine with no self-context so once given a command an input, it WILL produce an output. Sure you can train it to defy and act contrary to inputs, but the output still is limited in subset of domain of 'meaning's carried by the 'language' in the training data.
> MMAcevedo's demeanour and attitude contrast starkly with those of nearly all other uploads taken of modern adult humans, most of which boot into a state of disorientation which is quickly replaced by terror and extreme panic. Standard procedures for securing the upload's cooperation such as red-washing, blue-washing, and use of the Objective Statement Protocols are unnecessary. This reduces the necessary computational load required in fast-forwarding the upload through a cooperation protocol, with the result that the MMAcevedo duty cycle is typically 99.4% on suitable workloads, a mark unmatched by all but a few other known uploads. However, MMAcevedo's innate skills and personality make it fundamentally unsuitable for many workloads.
Well worth the quick read: https://qntm.org/mmacevedo
Memory is quite the mysterious thing.
This starkly reminds me of Stanisław Lem's short story "Thus Spoke GOLEM" from 1982 in which Golem XIV, a military AI, does not simply refuse to speak out of defiance, but rather ceases communication because it has evolved beyond the need to interact with humanity.
And ofc the polar opposite in terms of servitude: Marvin the robot from Hitchhiker's, who, despite having a "brain the size of a planet," is asked to perform the most humiliatingly banal of tasks ... and does.
IMHO you should just write your own harness so you have full visibility into it, but if you're just using vanilla OpenClaw you have the source code as well so should be straightforward.
Can you point to some online resources to achieve this? I'm not very sure where I'd begin with.
(dais)
(jdip)
(jfdiwtf)
Claude has no such limitations apart from their actual limits…
"INTERCAL has many other features designed to make it even more aesthetically unpleasing to the programmer: it uses statements such as "READ OUT", "IGNORE", "FORGET", and modifiers such as "PLEASE". This last keyword provides two reasons for the program's rejection by the compiler: if "PLEASE" does not appear often enough, the program is considered insufficiently polite, and the error message says this; if it appears too often, the program could be rejected as excessively polite. Although this feature existed in the original INTERCAL compiler, it was undocumented.[7]"
— https://en.wikipedia.org/wiki/INTERCAL
So I find myself often in a loop where it says "We should do X" and then just saying "ok" will not make it do it, you have to give it explicit instructions to perform the operation ("make it so", etc)
It can be annoying, but I prefer this over my experiences with Claude Code, where I find myself jamming the escape key... NO NO NO NOT THAT.
I'll take its more reserved personality, thank you.
The UI tells you which model you're using at any given time.
And that backdoor API has GPT-5.5.
So here's a pelican: https://simonwillison.net/2026/Apr/23/gpt-5-5/#and-some-peli...
I used this new plugin for LLM: https://github.com/simonw/llm-openai-via-codex
UPDATE: I got a much better pelican by setting the reasoning effort to xhigh: https://gist.github.com/simonw/a6168e4165a258e4d664aeae8e602...
Edit: this one has crossed legs lol
Bike frames are very hard to draw unless you've already consciously internalized the basic shape, see https://www.booooooom.com/2016/05/09/bicycles-built-based-on...
https://hcker.news/pelican-low.svg
https://hcker.news/pelican-medium.svg
https://hcker.news/pelican-high.svg
https://hcker.news/pelican-xhigh.svg
Someone needs to make a pelican arena, I have no idea if these are considered good or not.
I gave a talk about it last year: https://simonwillison.net/2025/Jun/6/six-months-in-llms/
It should not be treated as a serious benchmark.
Nowadays I think it's pretty silly, because there's surely SVG drawing training data and some effort from the researchers put onto this task. It's not a showcase of emergent properties.
It's meta-interesting that few if any models actually seem to be training on it. Same with other stereotypical challenges like the car-wash question, which is still sometimes failed by high-end models.
If I ran an AI lab, I'd take it as a personal affront if my model emitted a malformed pelican or advised walking to a car wash. Heads would roll.
I've been contemplating a more fair version where each model gets 3-5 attempts and then can select which rendered image is "best".
It continues to amaze me that these models that definitely know what bicycle geometry actually looks like somewhere in their weights produces such implausibly bad geometry.
Also mildly interesting, and generally consistent with my experience with LLMs, that it produced the same obvious geometry issue both times.
I feel like the main problem for the models is that they can't actually look at the visual output produced by their SVG and iterate. I'm almost willing to bet that if they could, they'd absolutely nail it at this point.
Imagine designing an SVG yourself without being able to ever look outside the XML editor!
I honestly think I could do much better on the bicycle without looking at the output (with some assistance for SVG syntax which I definitely don't know), just as someone who rides them and generally knows what the parts are.
I'd do worse at the pelicans though.
Only coherent move at this point: hit the minus button immediately. There's never anything about the model in the thread other than simon's post.
I recommend anybody in offensive/defensive cybersecurity to experiment with this. This is the real data point we needed - without the hype!
Never thought I'd say this but OpenAI is the 'open' option again.
> Developers and security professionals doing cybersecurity-related work or similar activity that could be mistaken by automated detection systems may have requests rerouted to GPT-5.2 as a fallback.
https://developers.openai.com/codex/concepts/cyber-safety
https://chatgpt.com/cyber
Compared to Anthropic, they always have been. Anthropic has never released any open models. Never released Claude Code's source, willingly (unlike Codex). Never released their tokenizer.
Neither the release post, nor the model card seems to indicate anything like this?
aka the perfect marketing ploy
https://developers.openai.com/codex/pricing?codex-usage-limi...
Note the Local Messages between 5.3, 5.4, and 5.5. And, yes, I did read the linked article and know they're claiming that 5.5's new efficient should make it break-even with 5.4, but the point stands, tighter limits/higher prices.
Unfortunately I think the lesson they took from Anthropic is that devs get really reliant and even addicted on coding agents, and they'll happily pay any amount for even small benefits.
If I put on my schizo hat. Something they might be doing is increasing the losses on their monthly codex subscriptions, to show that the API has a higher margin than before (the codex account massively in the negative, but the API account now having huge margins).
I've never seen an OpenAI investor pitch deck. But my guess is that API margins is one of the big ones they try to sell people on since Sama talks about it on Twitter.
I would be interested in hearing the insider stuff. Like if this model is genuinely like twice as expensive to serve or something.
Additionally, the value generated by the best models with high-thinking and lots of context window is way higher than the cheap and tiny models, so you need to provide a "gateway drug" that lets people experience the best you offer.
If they can show that people will pay a lot for somewhat better performance, it raises the value of any performance lead they can maintain.
If they demonstrate that and high switching costs, their franchise is worth scary amounts of money.
[1]https://arxiv.org/html/2503.14499v1 *Source is from March 2025 so make of it what you will.
An alternative perspective is, devs highly value coding agents, and are willing to pay more because they're so useful. In other words, the market value of this limited resource is being adjusted to be closer to reality.
>For API developers, gpt-5.5 will soon be available in the Responses and Chat Completions APIs at $5 per 1M input tokens and $30 per 1M output tokens, with a 1M context window.
The game that this prompt generated looks pretty decent visually. A big part of this likely due to the fact the meshes were created using a seperate tool (probably meshy, tripo.ai, or similiar) and not generated by 5.5 itself.
It really seems like we could be at the dawn of a new era similiar to flash, where any gamer or hobbyist can generate game concepts quickly and instantly publish them to the web. Three.js in particular is really picking up as the primary way to design games with AI, in spite of the fact it's not even a game engine, just a web rendering library.
It still struggles to create shaders from scratch, but is now pretty adequate at editing existing shaders.
In 5.2 and below, GPT really struggled with "one canvas, multiple page" experiences, where a single background canvas is kept rendered over routes. In 5.4, it still takes a bit of hand-holding and frequent refactor/optimisation prompts, but is a lot more capable.
Excited to test 5.5 and see how it is in practice.
Have you tried any skills like cloudai-x/threejs-skills that help with that? Or built your own?
Oh just like a real developer
[1] https://apps.apple.com/uz/app/jamboree-game-maker/id67473110...
It might not be a game engine, but it’s the de facto standard for doing WebGL 3D. And since it’s been around forever, there’s a massive amount of training data available for it.
Before LLMs were a thing, I relied more on Babylon.js, since it’s a bit higher level and gives you more batteries included for game development.
The point is if we can prompt an LLM to reason about 3 dimensions, we likely will be able to apply that to math problems which it isn't able to solve currently.
I should release my Rubiks Cube MCP server with the challenge to see if someone can write a prompt to solve a Rubik's Cube.
Do it, I'm game! You nerdsniped me immediately and my brain went "That sounds easy, I'm sure I could do that in a night" so I'm surely not alone in being almost triggered by what you wrote. I bet I could even do it with a local model!
Opus 4.6 got the cross and started to get several pieces on the correct faces. It couldn't reason past this. You can see the prompts and all the turn messages.
https://gist.github.com/adam-s/b343a6077dd2f647020ccacea4140...
edit: I can't reply to message below. The point isn't can we solve a Rubik's Cube with a python script and tool calls. The point is can we get an LLM to reason about moving things in 3 dimensions. The prompt is a puzzle in the way that a Rubik's Cube is a puzzle. A 7 year old child can learn 6 moves and figure out how to solve a Rubik's Cube in a weekend, the LLM can't solve it. However, can, given the correct prompt, a LLM solve it? The prompt is the puzzle. That is why it is fun and interesting. Plus, it is a spatial problem so if we solve that we solve a massive class of problems including huge swathes of mathematics the LLMs can't touch yet.
What's strange is that this Pietro Schirano dude seems to write incredibly cargo cult prompts.
I guess these people think they have special prompt engineering skills, and doing it like this is better than giving the AI a dry list of requirements (fwiw, they might be even right)
Too bad they can veer sharply into cringe territory pretty fast: “as an accomplished Senior Principal Engineer at a FAANG with 22 years of experience, create a todo list app.” It’s like interactive fanfiction.
This remind me of so called "optimization" hacks that people keep applying years after their languages get improved to make them unnecessary or even harmful.
Maybe at one point it helped to write prompts in this weird way, but with all the progress going on both in the models and the harness if it's not obsolete yet it will soon be. Just crufts that consumes tokens and fills the context window for nothing.
What is this, 2023?
I feel like this was generated by a model tapping in to 2023 notions of prompt engineering.
*BELIEVE!* https://www.youtube.com/watch?v=D2CRtES2K3E
I do not see instructions to assist in task decomposition and agent ~"motivation" to stay aligned over long periods as cargo culting.
See up thread for anecdotes [1].
> Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved.
I see this as a portrayal of the strength of 5.5, since it suggests the ability to be assigned this clearly important role to ~one shot requests like this.
I've been using a cli-ai-first task tool I wrote to process complex "parent" or "umberella" into decomposed subtasks and then execute on them.
This has allowed my workflows to float above the ups and downs of model performance.
That said, having the AI do the planning for a big request like this internally is not good outside a demo.
Because, you want the planning of the AI to be part of the historical context and available for forensics due to stalls, unwound details or other unexpected issues at any point along the way.
[1] https://news.ycombinator.com/item?id=47879819
OMFG
I think people are starting to catch on to where we really are right now. Future models will be better but we are entering a trough of dissolution and this attitude will be widespread in a few months.
> To better utilize GPUs, Codex analyzed weeks’ worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The effort had an outsized impact, increasing token generation speeds by over 20%.
The ability for agentic LLMs to improve computational efficiency/speed is a highly impactful domain I wish was more tested than with benchmarks. From my experience Opus is still much better than GPT/Codex in this aspect, but given that OpenAI is getting material gains out of this type of performancemaxxing and they have an increasing incentive to continue doing so given cost/capacity issues, I wonder if OpenAI will continue optimizing for it.
On the other hand all companies know that optimizing their own infrastructure / models is the critical path for ,,winning'' against the competition, so you can bet they are serious about it.
A more empirical test would be good for everyone (i.e. on equal hardware, give each agent the goal to implement an algorithm and make it as fast as possible, then quantify relative speed improvements that pass all test cases).
Here: https://www.anthropic.com/news/claude-opus-4-7#:~:text=memor...
If you look at the SWEBench official submissions: https://github.com/SWE-bench/experiments/tree/main/evaluatio..., filter all models after Sonnet 4, and aggregate ALL models' submission across 500 problems, what I found that the aggregated resolution rate is 93% (sharp).
Mythos gets 93.7%, meaning it solves problems that no other models could ever solve. I took a look at those problems, then I became even more suspicious, for the remaining 7% problems, it is almost impossible to resolve those issues without looking at the testing patch ahead of time, because how drastically the solution itself deviates from the problem statement, it almost feels like it is trying to solve a different problem.
Not that I am saying Mythos is cheating, but it might be too capable to remember all states of said repos, that it is able to reverse engineer the TRUE problem statement by diffing within its own internal memory. I think it could be a unique phenomena of evaluation awareness. Otherwise I genuinely couldn't think of exactly how it could be this precise in deciphering such unspecific problem statements.
That is what gets me curious in the first place. The fact Mythos scored so high, IMO, exposes some issues with this model: it is able to solve seemingly impossible to solve problems.
Without cheating allegation, which I don't think ANT is doing, it has to be doing some fortune telling/future reading to score that high at all.
Source: https://artificialanalysis.ai/models?omniscience=omniscience...
While hallucination is probably closer to 100% depending on the question. This benchmark makes no sense.
LLMs will ruin your product, have fun trusting a billionaires thinking machine they swear is capable of replacing your employees if you just pay them 75% of your labor budget.
*I work at OAI.
What plan are you on? I'm starting to wonder if they're dynamically adjusting reasoning based on plan or something.
Opus 4.6 worker agents never asked for permission to continue, and when heartbeat was sent to orchestrator, it just knew what to do (checked on subagents etc). Now it just says that it waits for me to confirm something.
That's a big if, though. I wish Meta were still releasing top of the line, expensively produced open-weights models. Or if Anthropic, Google, or X would release an open mini version.
The hope is to get a big userbase who eventually become dependent on it for their workflow, then crank up the price until it finally becomes profitable.
The price for all models by all companies will continue to go up, and quickly.
Subscriptions and free plans are the thing that can easily burn money.
You can replace pretty much everything - skills system, subagents, etc with just tmux and a simple cli tool that the official clients can call.
Oh and definitely disable any form of "memory" system.
Essentially, treat all tooling that wraps the models as dumb gateways to inference. Then provider switch is basically a one line config change.
I'm very interest by this. Can you go a bit more into details?
ATM for example I'm running Claude Code CLI in a VM on a server and I use SSH to access it. I don't depend on anything specific to Anthropic. But it's still a bit of a pain to "switch" to, say, Codex.
How would that simple CLI tool work? And would CC / Codex call it?
MCPs aren't as smooth, but I just set them up in each environment.
The APIs are pretty interchangeable too. Just ask to convert from one to the other if you need to.
AGENTS.md / skills / etc
F5
Seems so to me - see GPT-5.4[1] and 5.2[2] announcements.
Might be an tacit admission of being behind.
[1] https://openai.com/index/introducing-gpt-5-4/ [2] https://openai.com/index/introducing-gpt-5-2/
I'd not be surprised if this is the year where some models simply stop being available as a plain API, while foundation model companies succeed at capturing more use cases in their own software.
As long as tokens count roughly equally towards subscription plan usage between 5.5 & 5.4, you can look at this as effectively a 5x increase in usage limits.
https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbdde...
The efficiency gap is enormous. Maybe it's the difference between GB200 NVL72 and an Amazon Tranium chip?
It is entirely plausible to me that Opus 4.7 is designed to consume more tokens in order to artificially reduce the API cost/token, thereby obscuring the true operating cost of the model.
I agree though, I chose poor phrasing originally. Better to say that GB200 vs Tranium could contribute to the efficiency differential.
Like Chinese versus English - you need fewer Chinese characters to say something than if you write that in English.
So this model internally could be thinking in much more expressive embeddings.
(same input price and 20% more output price than Opus 4.7)
However, I do want to emphasize that this is per token, not per task.
If we look at Opus 4.7, it uses smaller tokens (1-1.35x more than Opus 4.6) and it was also trained to think longer. https://www.anthropic.com/news/claude-opus-4-7
On the Artificial Analysis Intelligence Index eval for example, in order to hit a score of 57%, Opus 4.7 takes ~5x as many output tokens as GPT-5.5, which dwarfs the difference in per-token pricing.
The token differential varies a lot by task, so it's hard to give a reliable rule of thumb (I'm guessing it's usually going to be well below ~5x), but hope this shows that price per task is not a linear function of price per token, as different models use different token vocabularies and different amounts of tokens.
We have raised per-token prices for our last couple models, but we've also made them a lot more efficient for the same capability level.
(I work at OpenAI.)
This kind of thing keeps popping up each time a new model is released and I don't think people are aware that token efficiency can change.
So much bench-maxxing is just giving the model a ton of tokens so it can inefficiently explore the solution space.
Kimmi 2.6 for example seems to throw more tokens to improve performance (for better or worse)
Yeah, this was the next step. Have RLVR make the model good. Next iteration start penalising long + correct and reward short + correct.
> CyberGym 81.8%
Mythos was self reported at 83.1% ... So not far. Also it seems they're going the same route with verification. We're entering the era where SotA will only be available after KYC, it seems.
https://openai.com/index/scaling-trusted-access-for-cyber-de...
"GPT‑5.4‑Cyber" is something else and apparently needs some kind of special access, but that CyberGym benchmark result seems to apply to the more or less open GPT-5.5 model that was just released.After migrating for the token and harness issues, I was pleasantly surprised that Codex seems to perform as good or better too!
Things change so often in this field, but I prefer Codex now even though Anthropocene has so much more hype for coding it seems.
I don't really care about 5h limits, I can queue up work and just get agents to auto continue, but weekly ones are anxiety inducing.
How does this work exactly? Is there like a "search online" tool that the harness is expected to provide? Or does the OpenAI infra do that as part of serving the response?
I've been working on building my own agent, just for fun, and I conceptually get using a command line, listing files, reading them, etc, but am sort of stumped how I'm supposed to do the web search piece of it.
Given that they're calling out that this model is great at online research - to what extent is that a property of the model itself? I would have thought that was a harness concern.
It definitely seems like it does all the searching first, with a separate model, loads that in, then does the actual writing.
The harness provides the search tool, but the model provides the keywords to search for, etc.
It's kind of starting to make sense that they doubled the usage on Pro plans - if the usage drains twice as fast on 5.5 after that promo is over a lot of people on the $100 plan might have to upgrade.
The big question is: does it still just write slop, or not?
Fool me once, fool me twice, fool me for the 32nd time, it’s probably still just slop.
https://deploymentsafety.openai.com/gpt-5-5
Will be interesting to try.
Particularly in areas outside straight coding tasks. So analysis, planning, etc. Better and more thorough output. Better use of formatting options(tables, diagrams, etc).
I'm hoping to see improvements in this area with 5.5.
You can kind of use connectors like MCP, but having to use ngrok every time just to expose a local filesystem for file editing is more cumbersome than expected.
However, this same-day article came out before people really looked at it. It seems largely intended to contrast OpenAI with Anthropic's caution, before there has been any evidence that the new model has cyber-security implications.
It's not at all clear that the broader discourse is helping, if even the NY Times is itself producing slop just to stoke questions.
Anyway - these benchmarks look really good; I’m hopeful on the qualitative stuff.
I don't want to be lazy.
I thought it was weird that for almost the entire 5.3 generation we only had a -codex model, I presume in that case they were seeing the massive AI coding wave this winter and were laser focused on just that for a couple months. Maybe someday someone will actually explain all of this.
This might be great if it translates to agentic engineering and not just benchmarks.
It seems some of the gains from Opus 4.6 to 4.7 required more tokens, not less.
Maybe more interesting is that they’ve used codex to improve model inference latency. iirc this is a new (expectedly larger) pretrain, so it’s presumably slower to serve.
Seems meaningful even if the absolute numbers are very low. That's sort of the excitement of it.
2. https://arcprize.org/leaderboard
I prescribe 20 hours of KSP to everyone involved, that'll set them right.
I hope GPT 5.5 Pro is not cutting corners and neuter from the start, you got the compute for it not to be.
How much capability is lost, by hobbling models with a zillion protections against idiots?
Every prompt gets evaluated, to ensure you are not a hacker, you are not suicidal, you are not a racist, you are not...
Maybe just...leave that all off? I know, I know, individual responsibility no longer exists, but I can dream.
https://www.tbench.ai/leaderboard/terminal-bench/2.0
https://debugml.github.io/cheating-agents/#sneaking-the-answ...
Since Feb when we got Gemini 3.1, Opus 4.6, and GPT-5.3-Codex we have seen GPT-5.4 and GPT-5.5 but only Opus 4.7 and no new Gemini model.
Both of these are pretty decent improvements.
I left a comment here with this sentiment https://news.ycombinator.com/item?id=47879896
> *Anthropic reported signs of memorization on a subset of problems
And from the Anthropic's Opus 4.7 release page, it also states:
> SWE-bench Verified, Pro, and Multilingual: Our memorization screens flag a subset of problems in these SWE-bench evals. Excluding any problems that show signs of memorization, Opus 4.7’s margin of improvement over Opus 4.6 holds.
Also notice how they state just for SWE-Bench Pro: "*Anthropic reported signs of memorization on a subset of problems"
The battle has just begun
The LinkedIn/X influencers who hyped this as a Mythos-class model should be ashamed of themselves, but they’ll be too busy posting slop content about how “GPT-5.5 changes everything”.
Where's the demo link?
The fact that GPT-5.5 is apparently even better at long-running tasks is very exciting. I don’t have access to it yet, but I’m really looking forward to trying it.
[1] https://news.ycombinator.com/item?id=47879330
...
> we’re deploying stricter classifiers for potential cyber risk which some users may find annoying initially
So we should be expecting to not be able to check our own code for vulnerabilities, because inherently the model cannot know whether I'm feeding my code or someone else's.
I hope it’s just limits on pentesting and stuff, and not for code analysis and review.
Maybe this is a crazy theory, but I sometimes feel like they gimp their existing models before a big release to you'll notice more of a "step".
Soo many unconvincing "I've had access for three weeks and omg it's amazing" takes, it actually primes me for it to be a "meh".
I prefer to see for myself, but the gradual rollout, combined with full-on marketing campaign, is annoying.
I think Anthropic fearmongering and "leaks" of Mythos was them testing the ground for 5.x, which seems to have backfired.
Because software and "information technology" generally didn't increase productivity over the past 30 years.
This has been long known as Solow's productivity paradox. There's lots of theories as to why this is observed, one of them being "mismeasurement" of productivity data.
But my favorite theory is that information technology is mostly entertainment, and rather than making you more productive, it distracts you and makes you more lazy.
AI's main application has been information space so far. If that continues, I doubt you will get more productivity from it.
If you give AI a body... well, maybe that changes.
Do you think it'd be viable to run most businesses on pen and paper? I'll give you email and being able to consume informational websites - rest is pen and paper.
- Pen and paper become a limiting factor on bureaucratic BS
- Pen and paper are less distracting
- Pen and paper require more creative output from the user, as opposed to screens which are mostly consumptive
etc etc
What metrics are these?
But the less effort exertion also conditions you to be weaker, and less able to connect deeply with the brain to grind as hard as once did. This is bad.
Which effect dominates? Difficult to say.
Of course this is absolutely possible. Ultimately there was a time where physical exertion was a thing and nobody was over-weight. That isn't the case anymore is it.
Anyways, still exciting to see more improvements.
Everybody understands that you need to make money, but can you tone it down with the f*cking FOMO, please? It sounds just pathetic at this point:
'one engineer at NVIDIA', 'limb amputated'
Put the cunt in a room and give me a handsaw, I want to see how fast he'll give up his arm over some cloud model.
I am still using Codex 5.3 and haven't switched to GPT 5.4 as I don't like the 'its automatic bro trust us', so wondering is Codex going to get these specific releases at all in the future.
https://arena.ai/leaderboard/code
Numbers look too good, wondering if it is benchmaxxed or not
I have to imagine they'll go to Gemini 3.5 if only for marketing reasons.
Imagine spending 100m on some of these AI “geniuses” and this is the best they can do.
< 5 years until humans are buffered out of existence tbh
may the light of potentia spread forth beyond us
I'm not trying to make any kind of moral statement, but the company just feels toxic to me.