Discussion (635 Comments), original on Hacker News
We know they serve the model on TPU 8i, which we have plenty of hard specs for (so we know the key constraints: total memory and bandwidth and compute flops). We can also set a ceiling on the compute complexity and memory demand of the model based on knowing they will be at least as efficient as what is disclosed in the Deepseek V4 Technical Report.
We can also assume that the model was explicitly built to run efficiently in a RadixAttention style batched serving scenario on a single TPU 8i (so no tensor parallelism, etc. to avoid unnecessary overheads... Google explicitly designed the 8th-generation inference architecture to eliminate the need for tensor sharding on mid-sized models).
We know Google intends to serve this model at a floor speed of around 280 tok/s too.
Putting all these pieces together, we can confidently say this model is ~250-300B total, and 10-16B active parameters. Likely mostly FP4 with FP8 where it matters most.
I do model serving optimization work. This is napkin math.
Edit: There's one factor I under-rated in my initial estimate... TurboQuant. This is a compute to KV memory use tradeoff. It's plausible that with TurboQuant at a quality-neutral setting they've gotten the model up to 400B with similar economics. This is a variable affecting concurrency, and the way they decided total model size was likely based on what they see for the average user's average KV cache depth in real-world usage.
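To make the napkin math above a bit more concrete, here is a rough sketch of one piece of it: how many bytes of weights per second a single unbatched decode stream would have to read at the 280 tok/s floor, for 10-16B active parameters stored in FP4. The function name is made up for illustration, and the calculation deliberately ignores KV cache reads, batching, and the fact that a batched MoE touches more experts per step. It does not use any real TPU 8i figures.

```python
def weight_read_bandwidth(active_params: float, bytes_per_param: float,
                          tokens_per_s: float) -> float:
    """Bytes/s of weight reads for one unbatched decode stream.

    Each generated token reads every active parameter once, so the
    requirement is simply params * bytes * tokens/s.
    """
    return active_params * bytes_per_param * tokens_per_s


FP4_BYTES = 0.5      # 4 bits per weight
TARGET_TOK_S = 280   # floor speed quoted in the comment

low = weight_read_bandwidth(10e9, FP4_BYTES, TARGET_TOK_S)
high = weight_read_bandwidth(16e9, FP4_BYTES, TARGET_TOK_S)

print(f"Weight reads per stream: {low / 1e12:.2f} to {high / 1e12:.2f} TB/s")
```

In a real serving setup the weight reads are shared across every sequence in the batch, so this is closer to a per-step cost than a per-user cost.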
300B models at least fit in a single maxed out Mac Studio or a small stack of DGX Sparks or AMD Strix Halo boxes.
For comparison, DeepSeek V4 Flash is all the rage now for small efficient models. It's very good for its size but far from the performance of the latest GPT Pro and Opus models. The vanilla variant has 284B parameters. It fits on both 256GB and 512GB Mac Studios and hits about 20-30 tokens/second.
The implication of all this here is that you could have a (somewhat sluggish) Opus in a small box at home. At least once competing models and hardware to run them are available (high end Mac Studios have been discontinued).
Something tells me that this means that Google's performance numbers here are inflated.
[1]https://arxiv.org/pdf/2604.24827
That wouldn't surprise me at all actually, models like Qwen3.6-35B are comparable to frontier level models from a year ago and I wouldn't be surprised if we had self-hostable open weight models matching Opus 4.7 in a year. Assuming that Google has a one-year lead over Chinese labs isn't far-fetched given how many more resources they have compared to their Chinese competitors.
It's not yet obvious to me that you can achieve the breakthrough performance of say Opus 4.1/4.5 in a number of parameters you can swing at home.
I run 2.54 BPW 397B Qwen 3.5 GGUF on a 128G mac studio at 20 tokens/second generation and 200 tokens/second processing. I'm not suggesting it matches the performance of the full BF16 model, but I did run some benchmarks locally and the results were pretty good:
- MMLU: 87.96%
- GPQA diamond: 86.36%
- IfEval: 91.13%
- GSM8k: 92.57%
So I think we have been at the "frontier capabilities at home" for a few months now.
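As a quick sanity check on why a 397B model at 2.54 bits per weight lands on a 128G machine, the weight footprint can be sketched as parameters times bits divided by eight. This only counts the weights; KV cache, activations and OS overhead are not included, and whether "128G" means decimal GB or binary GiB is left open.

```python
def weights_gb(total_params: float, bits_per_weight: float) -> float:
    """Size of the quantized weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


qwen_gb = weights_gb(397e9, 2.54)        # figures quoted in the comment
qwen_gib = qwen_gb * 1e9 / 2**30         # same size in binary GiB

print(f"397B at 2.54 BPW: {qwen_gb:.1f} GB ({qwen_gib:.1f} GiB) of weights")
```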
If these Gemini 3.5 numbers are accurate, then I'd wager GPT 5.5 and Opus 4.7 are a lot smaller than people have speculated, too. It's not that frontier labs can't create a 5T+ parameter model, but they don't have the data to optimize a model of that size.
Gemini 3.5 Flash is really smart in one-shot coding reasoning, btw. Near the frontier. But it doesn't do so well in long horizon agentic tasks with arbitrary tool availability. This is a common theme with Google models, and the opposite of what we see with Chinese models (start dumb, iterate consistently toward a smart solution).
Data at https://gertlabs.com/rankings
Mythos is an exception that's larger.
> It's not that frontier labs can't create a 5T+ parameter model, but they don't have the data to optimize a model of that size.
They have plenty of data. They use very large amounts of verifiable synthetic data (lots of it in coding and math) to cover the gap.
Also the frontier labs are paying people to do tasks, tracking the trajectories and training on that. Most of the optimization is in RL based on these trajectories.
Even if he knew, why would anyone expect Elon not to lie about anything?
> They have plenty of data.
I don't think data is the problem either, but compute is: if you want to train your 5T params model like modern small models are being trained (with a thousand times more training tokens than params), that's an enormous training run.
Some interesting notes:
- Training a small model with large model output resulted in LESS improvement than distilling a less smart model onto the same small architecture [0]. We are starting to hit intelligence density limits in small models (<30B models may be nearing saturation now)
- good RL environments incidentally also make for good benchmarking
[0] https://arxiv.org/html/2502.12143v1
So where does humanity cap out? The statement more or less implies that there's a ceiling of our ability to train models which might be below what LLMs are capable of (e.g. not AGI but how good coding agents they might ever become, for example).
xAI paying Cursor to train models with their data tells us that having an agent tool like Claude Code is important for quality data acquisition. That's why they recently shipped Grok Build.
I think we will see insane SOTA models from xai in the next few months.
And serving is not training. For distilling you need to train the big models to have something to be distilled.
I mitigate it by creating dense planning docs for everything and executing iteratively.
Lots of time wasted on procedure, unfortunately.
MTP - https://blog.google/innovation-and-ai/technology/developers-...
MLA - https://machinelearningmastery.com/a-gentle-introduction-to-...
CSA - https://deepseek.ai/blog/deepseek-v4-compressed-attention
are these pre-generated in a different tool with plain unicode and then just copy-pasted, or is it a built-in feature of hn?
I think it's pure economics. Flash models are OP for the price, which leads to too much demand, and Google cannot serve it. The price is likely set high to reduce load and hey, if it still makes money, just keep the margin.
It's not a rumor - there are many public announcements about $B deals around compute for other AI companies.
It seems to be a huge overshoot, vide Hy3 model, which this model claims to be 2.4T, while it is 295B.
With the Pro variant being around 600B - 800B
My testing is comparing its performance / output to other models in the same size range, so not as scientific as yours.
Not a great bicycle though, it forgot the bar between the pedals and the back wheel and weirdly tangled the other bars.
Expensive too - that pelican cost 13 cents: https://www.llm-prices.com/#it=11&ot=14403&sel=gemini-3.5-fl...
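The 13 cent figure can be reproduced with a small sketch, assuming the `it=11` and `ot=14403` parameters in the link are the input and output token counts (my reading of the URL, not something the comment states) and using the $1.50/$9.00 per million token price for Gemini 3.5 Flash quoted elsewhere in this thread.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request given per-million-token input and output prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Assumed token counts from the link; prices as quoted in the thread.
cost = request_cost_usd(11, 14_403, 1.50, 9.00)
print(f"${cost:.4f}")
```

Almost all of the cost comes from output tokens here, which is why thinking-heavy models can be pricier per request than their list price suggests.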
Truly: Nothing better than AI tools to brave the challenges and requirements of modern life. "Claude, ride the hype train" is the decisive prompt you need.
edit: fixed human hallucination
I ask because:
Insofar as the original pelican test is zero-shot, it effectively serves as a way to test for the presence of a kind of "visual imagination" component within the layers of the model, that the model would internally "paint" an SVG [or PostScript, etc] encoding of an image onto, to then extract effective features from, analyze for fitness as a solution to a stated request, etc.
But if you're trying to do a multi-shot pelican, then just feeding back in the SVG produced in the previous attempt, really doesn't correspond to any interesting human capability. Humans can't take an SVG of a pelican and iteratively improve upon it just based on our imagined version of how that SVG renders, either! Rather, a human, given the pelican, would simply load the pelican SVG in a browser; look at the browser's rendering of the pelican; note the things wrong with that rendering; and then edit the SVG to hopefully fix those flaws (and repeat.)
I imagine current (multi-modal and/or computer-use) LLMs would actually be very good at such an "iterative rendered pelican" test.
(not that I am in any sense pro AI, but it's just a weird lack of intellectual rigor)
And I am saying that if you take one of these SVGs and ask an LLM to look for flaws, it rarely spots those obvious flaws and instead suggests adding a sunset and fish in the bird's mouth.
When I ask for a pelican on a bike, I want the Platonic ideal of a pelican on a bike, not a vision of an alternative reality in which pelicans created bikes. Though, thinking about it again, maybe I should.
https://www.gianlucagimini.it/portfolio-item/velocipedia/
> most ended up drawing something that was pretty far off from a regular men's bicycle
Not really a criticism but an interesting point that you would never expect a human to make that mistake even in a bad drawing.
That's not to say I don't spend my days raging at it... a lot... but it's not that bad. It does tend to ignore completion criteria but it doesn't obviously degrade when being nudged like some models do.
It really is a lot some of the time. And its chain of thought is hilarious a lot of the time.
https://en.wikipedia.org/wiki/Vaporwave
wtf
`<!-- Gold Rim -->`
WTF??
Last time I tried, ChatGPT's image generator got the best result.
I noticed the "Synthwave" aesthetic, which has been enjoying quite some success for quite some time now, has found its way into AI models (even when it's not in the user's query). It's not the first time I've seen the sun at sunset with color bands etc. in AI-generated pictures. Don't know why it's now catching on in AI too.
https://en.wikipedia.org/wiki/Synthwave
Hence the comments here about the 90s, Sonny Crockett's white Ferrari Testarossa in Miami, etc.
To be honest as a kid from the 80s and a teenager from the 90s who grew up with that aesthetic in posters, on VHS tape covers, magazine covers, etc. I do love that style and I love that it made a comeback and that that comeback somehow stayed.
So it's as relevant and baked-in to today as actual 80s synth-culture was in 2000.
Gemini 2.5 flash: $0.30/$2.50
Gemini 3.0 flash preview: $0.50/$3.00
Gemini 3.5 flash: $1.50/$9.00
Interesting pricing direction. I don't think we have ever seen a 3x price increase in the immediate next same-sized model (and lol @ 3 only ever getting a preview).
3.5 flash costs similar to Gemini 2.5 pro which was $1.25/$10
Gemini 2.5 flash (27 score): $172 (1.0x)
Gemini 2.5 pro (35 score): $649 (3.8x)
Gemini 3.0 Flash (46 score): $278 (1.6x)
Gemini 3.5 Flash (55 score): $1,552 (9.0x or 2.4x compared to 2.5 pro)
This is a massive price increase... 5.6x compared to Gemini 3.0 Flash
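The multipliers in the list above can be recomputed directly from the quoted score and dollar figures. The sketch below does that, and also adds a dollars-per-score-point column, which is my own derived metric and not something the commenter reported. The scores and costs come from whatever benchmark run the commenter is quoting.

```python
# (score, cost in USD) exactly as quoted in the comment above
runs = {
    "Gemini 2.5 flash": (27, 172),
    "Gemini 2.5 pro": (35, 649),
    "Gemini 3.0 Flash": (46, 278),
    "Gemini 3.5 Flash": (55, 1552),
}

baseline = runs["Gemini 2.5 flash"][1]
multiples = {name: cost / baseline for name, (_, cost) in runs.items()}
usd_per_point = {name: cost / score for name, (score, cost) in runs.items()}

flash_35_cost = runs["Gemini 3.5 Flash"][1]
flash_35_vs_30 = flash_35_cost / runs["Gemini 3.0 Flash"][1]
flash_35_vs_25_pro = flash_35_cost / runs["Gemini 2.5 pro"][1]

for name in runs:
    print(f"{name}: {multiples[name]:.1f}x baseline, "
          f"${usd_per_point[name]:.2f} per score point")
print(f"3.5 Flash vs 3.0 Flash: {flash_35_vs_30:.1f}x")
print(f"3.5 Flash vs 2.5 pro: {flash_35_vs_25_pro:.1f}x")
```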
People really can't wait to be the next Zynga
Or maybe they think because their benchmarks are good they can ramp up the prices. Seems like they don't have the market share to justify a move like that yet to me.
My guess: it's the price at which they make more money than if they rent the TPUs to other companies.
The Gemini team has had trouble securing enough TPUs for their users' needs. They struggle with load and their rate limits are really bad. Maybe at a higher price, they have a better chance at getting more TPUs assigned?
Just because you are vertically integrated doesn't mean you get to discount one business unit's products to the other. Doing so discounts the opportunity cost you pay and is just bad accounting.
You have free local models for most tasks, $20 subscriptions for near-frontier intelligence, and API per token costs for frontier intelligence.
Flash seems to be targeting the near-frontier category.
Open-source model inference providers (who do not have to bear the cost of training) seem able to do it at much lower prices.
https://www.together.ai/pricing
https://fireworks.ai/pricing#serverless-pricing (scroll down to headline models)
Of course, it's possible that they are burning through investor cash as well, and apples-to-apples comparisons are not possible because AFAIK Google does not mention the size/paramcount for 3.5 Flash.
But if the prevailing wisdom is true, I think it's actually encouraging. It suggests that OpenAI and Anthropic could perhaps, if they need to, achieve profitability if they slow down model development and focus on tooling etc. instead. If true that's probably good news for everybody w.r.t. preventing a bursting of this economic bubble.
...my opinions here are of course, conjecture built on top of conjecture....
I think you're right that releasing models at a slower cadence would bring down costs to some degree, but it's not clear how much. All of these companies could significantly reduce their opex but at the risk of falling behind in terms of being at the frontier.
The economic value increases non-linearly as models get more intelligent: being 10% more capable unlocks way more than 10% in downstream value.
That's trouble because the non-linear component means at some point their margins will stop being primarily defined by the cost of compute, and start being dominated by how intelligent the model is.
At that point you can expect compute prices to skyrocket and free capacity to plummet, so even if you have a model that's "good enough", you can't afford to deploy it at scale.
(and in terms of timing, I think they're all well under the curve for pricing by economic value. Everyone is talking about Uber spending millions on tokens, but how much payroll did they pay while devs scrolled their phones and waited for CC to do their job?)
Is it? More capability, more demand, higher price. Seems relatively uninteresting. The naming structure complicates it: 3.5 Flash is less comparable to 3.0 Flash than it is to 3.0 Pro.
More generally, $/token + naming scheme comparisons are just confusing: I am not looking for a wordy idiot and I doubt most people are (at least not with what I would consider worthwhile business ambitions). In fact wordy idiots are fairly costly, because we have to consider the large amounts of cheap garbage that they are producing, and if you price your own time somewhat competitively then fairly quickly that's the bigger lever.
Even if we don't consider the last part: How do we price the better model, that can one shot a task without having to go back and forth and spending more tokens or having to fix more bugs later? It is definitely worth something and I think it's quite undervalued right now. What seems to be missing is a better measurement of capability per token. I don't know what that could look like. Maybe something like how we try and measure inflation, some basket of tasks (which then ends up being part of the training data so idk).
Qwen 3.6 hit hard in the self-hosting space. It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.
The Prism Bonsai ternary model crams a tremendous amount of capability into 1.75GB.
And, DeepSeek V4 is crazy good for the price. They're charging flash model prices for their top-tier Pro model, which is competitive with the frontier of a few months ago.
The winners in the AI war will be the companies that figure out how to run them efficiently, not the ones that eke out a couple percent better performance on a benchmark while spending ten times as much on inference (though the capability has to be there, I think we're seeing that capability alone isn't a strong moat... there's enough competent competition to ensure there's always at least a few options even at the very frontier of capability).
You can lower that to at least 24GB. I've been running Qwen 3.5 and 3.6 with codex on a 7900 XTX and the long horizon tasks it can handle successfully have been blowing my mind. I would seriously choose running my current local setup over (the SOTA models + ecosystem) of a year ago just based on how productive I can be.
DeepSeek V4 Pro likewise is insanely good for the price. I simply point it at large codebases, go get a cup of coffee or browse Hacker News, and then it's done useful work. This was simply not possible with other models without hitting budget problems.
This is what you get for relying on the generosity of billionaires. Keep offshoring your thinking ability to a machine and let me know how competitive you are. Hint: you won't be. There's nothing special about being able to use an LLM.
Or if you prefer smaller ones, Qwen3.6-35B-A3B, https://huggingface.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF
Of course not
And you don't need to
https://ai.google.dev/gemini-api/docs/models/gemini-3.5-flas...
3.1 flash lite isn't quite as good as 3 flash preview (which is the most incredible cheap model... I really love it), but 3.1 is half the price and the insane speed opens up different use cases.
For comparison, Opus models are $5/$25
Since Gemini 3.5 Flash is raising the price to $1.50/$9.00, it's priced between Haiku and Sonnet. If it outperforms Sonnet, it remains a good value, I guess. Though DeepSeek V4 Flash is much cheaper than all of them, and seemingly competitive.
Of course, if I manage to reach my limits every week on my Claude $200 sub, opus 4.7 is probably priced closer to flash!
Outside of coding, claude models are pretty meh. GPT and Gemini are the workhorses of science/math/finance.
I use it _a lot_ and it's very capable if you just plan correctly. I actually almost exclusively use 3.1 flash lite and 2.5 flash lite (even cheaper) and we have 99.5% accuracy in what we do.
That said, I think we'll see the lite/flash models and the pro models diverge more price wise. The pro models will become more and more expensive.
Fwiw it's beating Claude Sonnet in most benchmarking (benchmaxxing?), yet they've priced it almost half off on a per token basis.
Question is are you going to persuade anyone with this argument?
Are there many devs at Google who legit prefer Gemini over Claude and Codex? Would love to hear about that.
A few weeks ago, Steve Yegge claimed he'd heard that Google employees are banned from using Claude & Codex.
https://x.com/Steve_Yegge/status/2046260541912707471
A number of Googlers replied to say that was totally false, including Demis Hassabis, but they were all on the DeepMind team.
https://x.com/demishassabis/status/2043867486320222333
This person here claims they left Google because of the ban, and because the ban applied outside of Google work as well:
https://x.com/mihaimaruseac/status/2046272726881693960
I think false (or hasn't filtered to everyone lol)
I will definitely not be updating to this new model, and I think once 2.5 Flash is deprecated I'll have to re-architect so Gemini is only used for web grounding requests. This is an insane price increase.
I think that they might have reached the latency sweetspot where voice applications become more natural. Natural speech is <100 tokens per second (after STT), so $9 for a million tokens takes you to roughly 3 hours of speech. That's totally competitive compared to human costs.
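The "roughly 3 hours" figure follows from dividing a million tokens by the speech rate. A minimal sketch, taking the commenter's 100 tokens per second as an upper bound on speech rate and counting only the $9.00 per million output tokens (input tokens and STT costs are ignored):

```python
def speech_hours_per_million_tokens(tokens_per_second: float) -> float:
    """Hours of continuous speech that one million tokens covers."""
    return 1_000_000 / tokens_per_second / 3600


hours = speech_hours_per_million_tokens(100)  # upper-bound speech rate
cost_per_hour = 9.00 / hours                  # output price per million tokens

print(f"{hours:.2f} hours per million tokens, about ${cost_per_hour:.2f} per hour")
```

Since real speech is slower than 100 tokens per second, a million tokens would last longer than this and the hourly cost would be lower.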
Inference alone is certainly profitable. I'm running models at home that are comparable to the performance of paid models a year or so ago for free. Even for much larger models the costs around inference serving are clearly manageable.
Training is where the costs are, but I'm increasingly convinced those too could have costs dramatically reduced if necessary. Chinese companies like Moonshot.ai are doing fantastic work training frontier models for a fraction of the cost we're seeing from Anthropic/OpenAI.
This isn't like Uber or Doordash where the economics fundamentally don't make sense (referring to the early days of these services where rates were very cheap).
It's a compelling story that "current AI is unsustainable", but it doesn't pan out in practice for a multitude of reasons (not the least of which is that we can always fall back to what models did last year for basically free).
Profitable maybe, in terms of having low costs, but why pay Google or whoever when you can do it yourself for cheaper/"free"?
The small models are useful for small things like summarizing text or search but not much else.
Even Anthropic, which does not own any hardware, still has a big margin providing Claude models.
3.1 Pro did NOT find them. 3.5 flash did. Plus one I hadn't thought of that may or may not work (which it also pointed out).
I'm pretty impressed.
Empty Slot (new Pro as Mythos competitor?)
Old Pro -> now Flash
Old Flash -> now Flash Lite
Old Flash Lite -> now Gemma (and not served by Google)
I say "almost" because the situation is more fluid and unstable than a normal naming change. If Apple were to do this with laptops, maybe it'd be like, Air gets better and pricier and becomes Pro-level model, Neo same way becomes Air-level model, etc. But Apple's too design oriented to do something like that. Google, well...
This change has made me decide to move to a multi-provider situation like through OpenRouter for consumer-facing LLM api in a service I'm building. I just can't trust Google to not constantly rearrange everything under our feet. Doesn't mean I won't use Gemini, but it clearly means I need to have others in the mix ready to go. In fact I used to use lots of Flash Lite, which is now Gemma territory, and I can't get that served by Google anymore and don't want to run my own hardware.
But in any case, I'd compare this "Flash" model with previous "Pro" on all metrics. It's kinda like if in clothes a Small suddenly became what was a Large, or at Starbucks a Grande became the new de facto Venti. And only for the new! drinks.
And if we think this way, it's possible that prices are actually falling?
> which is now Gemma territory, and I can't get that served by Google anymore
Gemma is served by Google. They're serving Gemma 4 26B A4B at $0.15/$0.60.
https://console.cloud.google.com/agent-platform/publishers/g...
https://cloud.google.com/gemini-enterprise-agent-platform/ge...
and far cheaper than comparable models, Gemini Pro is cheaper than Claude Sonnet (Anthropic still gets to charge a brand premium)
I mean, the benchmarks for Gemini 3.5 Flash are very strong, but at those prices it has to be. I guess the time of subsidized tokens from the big guys is slowly coming to an end.
Maybe I'll look at Opus again, but it was just slower, much more expensive and, worst of all, wasn't listening to your instructions.
Not the most intelligent but perfect balance of cheap, fast and not-too-dumb.
https://gistpreview.github.io/?5c9858fd2057e678b55d563d9bff0...
3.5 Flash: Thinking High - 7280 tokens
https://gistpreview.github.io/?1cab3d70064349d08cf5952cdc165...
3.1 Pro - 28,258 tokens
https://gistpreview.github.io/?6bf3da2f80487608b9525bce53018...
3.1 took 3 minutes of thinking to generate, but it was the only one that got animated movement.
https://gistpreview.github.io/?3496285c5dac5ba10ebbc0b201a1a...
Gemini 2.5 Pro - 5,325 tokens:
https://gistpreview.github.io/?cc5e0fefeaaffecd228c16c95e736...
Gemini 2.5 Flash - 7,556 tokens:
https://gistpreview.github.io/?263d6058fe526a62b8f270f0620ec...
Gemma 4 31B IT - 3,261 tokens via AI Studio:
https://gistpreview.github.io/?858a42b96af864859a3b89508619d...
Gemma 4 26B A4B IT - 4,034 tokens via AI Studio:
https://gistpreview.github.io/?4adb7703897e0c6b583f9de928e4a...
https://gistpreview.github.io/?da742884e5e830ce71ee4db877519...
OFC this is just for fun, but nevertheless gave me working code on first try.
https://claude.ai/public/artifacts/128ebe5a-add7-406a-9bce-6...
8112 tokens @ 52.97 TPS, 0.85s TTFT
https://gistpreview.github.io/?7bdefff99aca89d1bc12405323bd4...
Full session: https://gist.github.com/abtinf/7bdefff99aca89d1bc12405323bd4...
Generated with LM Studio on a Macbook Pro M2 Max
https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6...
This one works:
https://www.svgviewer.dev/s/04ipQgsU
https://gistpreview.github.io/?557f979c82701862bc26d24f10399...
> Create animated SVG of a frog on a boat rowing through jungle river. Single page self contained HTML page with SVG. Use the Brave Browser to verifty that the image is indeed animated and looks like a proper rowing frog; iterate until you are satisfied with it.
It was able to discover and fix an animation bug, but the result is still far from perfect: https://gistpreview.github.io/?029df86d03bfe8f87df1e4d9ed2f6...
The benchmarks used don't really give a full story
[1] https://github.com/htdt/godogen
[2] https://drive.google.com/file/d/1ozZmWcSwieZQG0muYjbj7Xjhhlz...
Actual results for models, one shot:
Gemini 3.5 Flash - Three Little Pigs - 9,050 tokens:
https://gistpreview.github.io/?ed9faa53604035005cae86c63c766...
Gemini 3.1 Pro - Three Little Pigs - 24,272 tokens:
https://gistpreview.github.io/?f506bbfd9b4459c8cd55d89605af8...
Gemini 3 Flash - Three Little Pigs - 5,350 tokens:
https://gistpreview.github.io/?f58eff069cf916031c97d560b0e35...
Gemma 4 31B IT - Three Little Pigs - 5,494 tokens:
https://gistpreview.github.io/?a3aa75abbe8fd7818b73f6fa55ee6...
Gemma 4 26B A4B IT - Three Little Pigs - 6,375 tokens:
https://gistpreview.github.io/?1e631caebeb54f9f0cd6d0e3d4d5e...
I don't know if what the doctor said is some kind of idiomatic expression, but appears to be the opposite of sound medical advice. :)
There's still fun stuff, though. I stumbled upon this bit of insanity just yesterday: https://tykenn.itch.io/trees-hate-you. It would have fit in fabulously with the old Flash sites.
Not sure, I'm not versed in game dev. So maybe my point about creation tools is moot.
However, 3D content always seems very samey to me, in a way that cartoons and regular animation don't. So the rest of my comment should still express what I mean.
---
Flash had a WYSIWYG editor aimed at media creators who treat programming at best as an afterthought.
Flash was mostly about ease of tweening and extremely flexible vector graphics engine combined with an intuitive creation tool.
So the "Flash vs HTML/JS/SVG/CSS..." debate is not just about technical capabilities of the medium.
Of course there are many fun web apps in the browser, or as native apps, too. But Flash attracted all kinds of slightly nerdy people with cultural things to say, not just web devs with a lot of free time.
What "HTML5"/browser web technology doesn't offer is this intuitive, visual creation pipeline, and this kind of speaks for itself!
Also, I think the Flash "creator's" age is not separable from its time: using Flash wasn't trivial either.
There were just more people with interesting ideas, free time, and a wholistic talent for expressing their humor and ideas, combined with the curiosity and skill to learn using Flash (of course only as a licensed copy purchased from Macromedia).
People like this today are probably more often hyper-optimizing social media creators, and/or not terminally online.
In other words: I don't think the typical Newgrounds creator would have taken the time and effort to translate a stickman collage, meme, or other idea into a web app / animation.
---
And to add even more preaching: I think that "creating" things using AI produces exactly the opposite effect: feed it an original idea, and the result will be a regression to the mean.
The whole "friendslop" genre is what replaced flash games.
In the html5 camp the features appeared one by one and the tooling is still fragmented.
What happened between flash dying and html5 having a complete toolset is that interest died.
And then I moved to the bay area and noticed there was a road called Page Mill Rd. in Palo Alto and sort of laughed for a bit. Surprised Adobe didn't release a tool called Sandhill.
[*] to be fair, most WYSIWYG page builder tools of the era spat out some sort of crappy subset of HTML, so not trying to say pagemill was the only offender.
Flash, ah, ah, saviour of the universe. Flash, ah, ah, he'll save every one of us!
Every time I have heard the word flash for goodness knows how many years.
From the talk on the Gemini subreddit it's severely lower than before. I'm likely canceling my AI Pro.
The update also broke the app for me. Editing a message crashes the app every time. I'm on a Pixel lol
- The model is approx 3.3x the cost.
- The model is realistically almost 5x the cost due to token usage.
- Google has TPUs to run this on (yet the cost).
- Google has a lot more security and backup cash compared to all other AI companies, likely even combined (yet the cost).
We can continue moving the goal posts, but it seems we're at a bit of a wall. Costs are increasing, intelligence is improving, but the cost is rising drastically.
You'd think Google of all companies in the mix would be able to sustain lower costs with how integrated they are with TPU, Deepmind and effectively unlimited budget.
Checked my 5 hour quota, it was 0%, got this for multiple attempts:
I'm getting more image requests than usual, so I can't create that for you right now. Please try again later.
or
Can you ask me again later? I'm being asked to create more images than usual, so I can't do that for you right now.
Went back and found they took 34% of my quota for the privilege of repeating that same error.
I think the "Usage Limits" screen is new so who knows how long they've been counting errors against our quota. I guess I should be grateful it's now visible.
API price for gemini-3.5-flash is 3x gemini-3-flash-preview so they might be throttling it 3x sooner. They should either drop API prices or not advertise AI Pro as supporting Antigravity.
https://ai.google.dev/gemini-api/docs/pricing#gemini-3.5-fla...
It performs worse than 3.1 Flash Lite Preview (22/25), is slower (367s vs 142s) and is more expensive (75c vs 2c).
It is outperformed by Gemma4 26B-A4B in every way(!)
https://sql-benchmark.nicklothian.com/?highlight=google_gemi...
(Switch to the cost vs performance chart to see how far this is off the Pareto frontier)
I have a SQL agent and my tests with 3.5 are resulting in hitting query budget limits that have never been hit before. On average, to answer the same question, 3.5 is spending 10x more on SQL queries vs gemini-3-flash-preview.
The query patterns can be extremely degenerate too. E.g. the agent will hit the semantic layer tool to pull the schema, then run `SELECT * FROM table LIMIT 1`, which hits the query budget limit and fails.
I've only really been looking this morning, so I need to do a full eval, but the initial results match what your benchmark shows.
---
Side note: your benchmark has an issue. On Q1 medium the model returned gross margin of 0.127 instead of 12.7 (%), and the benchmark failed it. The failures on Q9 and Q21 are the same (I didn't check other questions). Nowhere in the prompt did you specify you wanted the values converted to percentage points and rounded.
If you asked me to write that SQL with that prompt, unless you were throwing it directly into a visualization I would format it the same way gemini-flash did. If I were pulling into a spreadsheet or vis tool this format is preferable because it's easier to format in a client application.
The other failures like Q21 incorrectly averaging the list price are correct failures.
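One possible way for a grader to avoid failing answers like 0.127 vs 12.7 is to accept a value if it matches either as a fraction or as percentage points. The helper below is a hypothetical sketch of that idea, not the benchmark's actual checker, and the 1% relative tolerance is an arbitrary choice. Stating the expected unit in the prompt would be the stricter alternative.

```python
import math


def same_ratio(expected: float, actual: float, rel_tol: float = 1e-2) -> bool:
    """True if `actual` matches `expected` directly or after a x100 or /100 rescale."""
    candidates = (actual, actual * 100, actual / 100)
    return any(math.isclose(expected, c, rel_tol=rel_tol) for c in candidates)


print(same_ratio(12.7, 0.127))  # fraction where percentage points were expected
print(same_ratio(12.7, 12.7))   # already in percentage points
print(same_ratio(12.7, 0.5))    # genuinely wrong value
```

The trade-off is that a lenient check like this can also let through answers that are wrong by exactly a factor of 100.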
Latest update: May 2026
I have a very bad feeling about this lag.
With strong tool use, it maybe doesn't even matter that the models are using older data. They can search for updated information. Though most models currently don't, without a little nudge in that direction.
Also, I believe the Qwen 3 series are all based on the same base model, with just fine-tuning/post-training to improve them on various metrics. Maybe everything in the Gemini 3 series is the same, and maybe they're concurrently training the Gemini 4 base model with updated knowledge as we speak.
This actually really does matter. Otherwise, the model simply won't know about your product and will always suggest only a few market leaders.
Searching for information on the Internet became a jungle a decade ago, and to be visible you have to pay Google for sunlight. Now, we risk falling into real darkness, until some paid model eventually emerges. This might be the reason Google is fine with training data from 2024. If the top spot is reserved for whoever pays anyway, why bother?
Taking into account the sometimes blind belief that 'LLMs know everything', the outcome could be very costly, especially for technologies and businesses unfortunate enough to emerge after 2025.
So maybe there's just not much openly available and new content worth training on that wasn't available prior to 2025.
If anything, this model being trained up to 2025 is a positive sign that the "circular LLM training" problem hasn't (yet) become unmanageable.
The year-long delay is probably just due to how long it takes to test/refine a cutting-edge model. It's surely possible to train one faster, but Google wouldn't want to release a new model unless it's going to top the usual benchmarks.
still the cutoff is very much concerning and inconvenient
Well, looking at OpenAI / Google / Anthropic we see crazy cost increases, such that it might invalidate your unit economics.
Cheering for Chinese models!
And I guess Gemini 3.5 pro will have the pricing increment, too. 12 x 5 = 60?
It seems like google does want us to use Chinese models.
They are more willing to wait though, so Chinese models are pretty attractive right now.
Right question: What exactly is Google's plan for the long term pricing of these models, and are we all going to be priced out in a year?
I'm finding it very bad at instruction following vs 3.1. It calls tools it is told it shouldn't, and it loves calling tools. There's a pretty strong bias towards its training vs system prompt instructions.
Google's release notes say to reduce unnecessary tool calls by reducing thinking, but that feels like it should be orthogonal to me.
It definitely has improved a few logic things, like in data visualizations it's better at labelling data, but it's much worse at preparing data out of the box.
On tool use. Gave it interactive design assignment on Antigravity 2. Failed miserably until I asked to use playwright for testing. And boy did it go with it. Tested hell out of visuals, nailed the solution.
On following instructions. Asked Gemini Flash 3.5 to summarize a YouTube video (Google I/O developer keynote), a task that would previously be trivial (I use it often), but it kept hallucinating points and referencing I/O dev keynote blog posts from several years ago. Multiple attempts, same result even on repeat requests. Almost insistent on the validity of the information provided, ignoring questions about whether it had such a capability.
(Typed on a 2023 macbook perfectly capable of running the Chinese open weight models.)
Raw intelligence is high for a flash model. But Google's problem has always been productization and tool use, whereas raw intelligence is always competitive. It does not look like they solved that with this release -- in fact, their tool use delta (the improvement in scores when given arbitrary tools and a harness) has actually regressed from some previous models.
Data at https://gertlabs.com/rankings
https://x.com/arena/status/2056793180998361233
6x the price of 3.1 flash lite
Cost per task is a more productive measure, but obviously a more difficult one to benchmark.
Compare to the GPT-5.5 announcement: https://openai.com/index/introducing-gpt-5-5/
I confirmed this by running a bunch of prompts through Gemini 3.5 Flash without doing anything special to configure caching and noting that it comes back with a "cachedContentTokenCount" on many of the responses.
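A check like the one described above can be scripted against the raw JSON responses. This is a sketch only: `cachedContentTokenCount` is the field named in the comment, while the `usageMetadata` / `promptTokenCount` layout is my assumption about the response shape, `cached_fraction` is a hypothetical helper, and the sample payload is made up.

```python
def cached_fraction(response: dict) -> float:
    """Share of prompt tokens reported as served from cache (0.0 if not reported)."""
    usage = response.get("usageMetadata", {})
    prompt_tokens = usage.get("promptTokenCount", 0)
    cached_tokens = usage.get("cachedContentTokenCount", 0)
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


# Made-up response payload, for illustration only.
sample = {
    "usageMetadata": {
        "promptTokenCount": 4000,
        "cachedContentTokenCount": 3000,
        "candidatesTokenCount": 250,
    }
}
print(f"{cached_fraction(sample):.0%} of prompt tokens were served from cache")
```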
The "storage price" quoted is for an optional Gemini feature that most people don't care about: https://ai.google.dev/gemini-api/docs/caching#explicit-cachi...
Google: we don't need the Chinese to distill our models, we can do it ourselves
https://storage.googleapis.com/gweb-uniblog-publish-prod/ori...
They continue to focus on smaller models while openai and anthropic are increasing compute requirements for their SOTA models.
Can you link to a source?
They are just refining their current models while they finish training the next generation.
They will all come out at about the same time. Anthropic, OpenAI, Google, xAI
Hold on, I think this claim needs some hard data. Here you go gentlemen:
https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5...
For intelligence/size only OpenAI and Anthropic are the frontier. Google has more compute so it can compensate for that with size of the models...
Nobody really knows the answer to which one is more optimal
* Large model trained on a large amount of data across multiple domains, that doesn't need any extra content to answer questions.
* Smaller model that is smart enough to go fetch extra relevant content, and then operate on essentially "reformatting" the context into an answer.
It takes on average 2.84s for Gemini 3.5 Flash to give an answer, compared to 33s for GPT 5.5 [0].
Also the max/slowest test is answered in under 7s, whereas GPT 5.4 takes more than 5 minutes...
[0]: https://aibenchy.com/compare/google-gemini-3-5-flash-low/ope...
More often than not, people are using images in responses that go awry. Which is fair, the models are sold as multi-modal, but image analysis is still at GPT-4.0 text-analysis levels.
Also knowledge cutoff issues, where people forget the models exist months to a year or more in the past.
Claude also believes it knows how AWS' KMS works, quite confidently, while getting things wrong. I have a separate "this is how KMS replication actually works" file just to deal with its misconceptions.
For Gemini, I typically use it to query information from large corpuses, but it often web searches and hallucinates instead of reading the actual corpus. On a book series, it will hallucinate chapters and events which, while reasonable and plausible, do not exist. "Go look at the files and see if your reference is correct" shows that it's not correct, and it's a mandatory step. That doesn't prevent hallucination, though; it only makes sure you catch it after the fact, just like a method in a class that doesn't exist gets found out by the compiler. The LLM still hallucinated it.
I was trying to understand a game I've been playing, The Last Spell. I asked it for a tier list of omens -- which ones the community considers most important. At least a few of the names it posts are hallucinated ("omen of the sun" does not exist, and the omens that give extra gold are "savings," "fortune," and "great wealth").
Obviously not a critical use case but issues like this do keep me on my toes regarding whether the thing is working at all. I should ask 3.5 flash to do the same job. (I did try and it once again hallucinated the omen names and some of the effects.)
The fix is easy enough though, a line in my global AGENTS.md instructing agents to search/ask for documentation before working on API integrations.
```
Build a Nango sync that stores Figma projects.
Integration ID: figma
Connection ID for dry run: my-figma-connection
Frequency: every hour
Metadata: team_id
Records: Project with id, name, last_modified
API reference: https://www.figma.com/developers/api#projects-endpoints
```
Note: You do need a Nango account and the Nango Skill installed before it could work.
Two of the three strip titles are hallucinated and two of the three strips are bad examples. Haley is mute in strip 403 and does nothing. Strip 578 is the start of the arc that shows the behavior Gemini is talking about, but has things going wrong so it's not a good example either.
Claude picks a good strip but also hallucinates the strip title: https://claude.ai/share/56be379d-c3da-443e-b60f-2d33c374eba8
...my chats are all pretty long and involve personal conversations, or I've deleted them. It's a lot to ask for someone to post receipts. The number of complaints is enough data.
No matter how big the model is there will be edge cases where it has no data or is out of date. In these cases it just makes stuff up. You can detect it yourself by looking for words like usually or often when it states facts, e.g. "the mall often has a Starbucks." I asked it about a Genshin Impact character released in June 2025 and it consistently interpreted the name (Aino) as my player character because Aino wasn't in its data.
Honestly I'm surprised you haven't encountered it if you're using it more than casually. Pro is much better but not perfect.
Also, asking for prompts that reliably produce hallucinations is kind of a hard ask. It's inconsistent. One day the LLM I work with quotes verbatim from the PCIe spec and it's super helpful. The next day it gives me wrong information and when I ask it what section of the spec that information comes from it just makes up a section number.
And when I say all the time, I mean it, and this is for Opus 4.7 Adaptive.
I often have to say, please do searches and cite sources, as if it doesn't it will confidently give me wrong or outdated information.
If you're often asking questions about a topic that's not in your specialist knowledge you won't notice them.
but for research it makes shit up all the time, I asked GPT5.5 to make me a build for Rogue Trader and not only did it use out of date info, it made up a bunch of skills that were NEVER in the game. I attribute that to there not being enough online information in the wikis or whatever but I wish it would just say "I don't know" instead of hallucinating but I know that's not how the tech works.
https://g.co/gemini/share/33e7a589a161
Coding, however, is solved like magic. Easier to add tests, to be fair.
AI psychosis would be the problem people talk about more, not just outright agreement but subtle ways of making you feel confident in your ideas. "yes, buy that domain name buy these other ones for defensibility"
(the domain name is dumb and completely unmarketable)
Source: https://developers.googleblog.com/an-important-update-transi...
It's not possible to uptrain on preview releases and it did not get that much love for a while.
> Gemini 3.5 Flash's pricing shifts the Pareto frontier in Text. 8 models from GoogleDeepMind dominate the Text Arena Pareto curve where only 4 labs are represented for top performance in their price tiers.
https://x.com/arena/status/2056793180998361233
Artificial Analysis's "Cost to run" model (aka num_tokens_used * price_per_token) is much better, but even that is likely problematic since it's not clear whether running a bunch of benchmarks maps cleanly to real-world token use.
I did not expect such a huge (3x) price increase from 3.0 Flash and I bet many people will not just blindly upgrade as the value proposition is widely different.
One interesting point to note is that Google marked the model as Stable in contrast to nearly everything else being perpetually set as Preview.
[0] https://artificialanalysis.ai/models/gemini-3-5-flash
[1] https://artificialanalysis.ai/models/gemini-3-1-pro-preview
How many people complain that we have too much low quality AI output for humans to read, let alone evaluate vs. how many people are complaining that they want higher quality, more trustworthy output?
3.1 has 57M output tokens from Intelligence Index, 3.5 Flash has 73M, so not a lot more, and 3.5 is a bit cheaper, I don't get how 3.5 can be 74% more expensive.
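A quick sanity check on the comment above, using only the two output-token counts it quotes (57M and 73M) and the claimed 74% cost gap. This is a sketch that assumes cost is simply output tokens times a per-token output price, ignoring input tokens and caching; `breakeven_price_ratio` is a hypothetical helper name.

```python
def breakeven_price_ratio(tokens_a: float, tokens_b: float, cost_ratio: float = 1.0) -> float:
    """Per-token price ratio (B over A) at which B's output cost is cost_ratio times A's."""
    # cost_b / cost_a = (tokens_b * price_b) / (tokens_a * price_a) = cost_ratio
    return cost_ratio * tokens_a / tokens_b


same_cost = breakeven_price_ratio(57, 73)            # 3.5 Flash costs the same as 3.1
claimed_gap = breakeven_price_ratio(57, 73, 1.74)    # 3.5 Flash costs 74% more than 3.1
print(f"equal cost at a price ratio of {same_cost:.2f}")
print(f"74% more expensive at a price ratio of {claimed_gap:.2f}")
```

Under those assumptions, 3.5 Flash would need a per-token output price roughly 1.36x that of 3.1 for the 74% gap to come from output tokens alone, so the gap presumably reflects something this simple model leaves out.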
That's everything I needed to know.
Does that mean this model is production ready?
[0] https://news.ycombinator.com/item?id=47076484
At least it read the authors of the article to me.
I wish we would push more towards testing code. Agentic AI excels when it's engaged.
What they did do in the keynote was spend a lot of time talking about their distribution advantage, and how they can own the consumer in search. But not a lot that will benefit partners or developers.
Basically, they released something broadly competitive with Sonnet 4.6, a new Omni model that seems interesting but unclear yet. They have completely ceded the frontier to OpenAI / Anthropic, and are saying "look for pro next month".
The best release since nano banana pro from Google has been Gemma.
Gemini Pro 3.1 for agentic coding is still clumsy. It chews a lot, has a harder time with tools and interacting with the codebase. I haven't tried any 3.5 version, yet, though. The benchmarks look promising.
I'll note I like the Google models' prose better than any others at the moment, though. Even the small open models (Gemma 4 family) have excellent prose, relatively speaking, that doesn't stink of the LLMisms that I find so annoying about OpenAI (especially) and Anthropic models. So, I'll probably start using Gemini for writing API docs, even if all code is Claude.
My default AGENTS.md/CLAUDE.md/etc. is a few sentences from Strunk and White, to try to make all the models not suck at writing. It helps keep the models brief, but it doesn't actually make models with shitty prose have good prose. The relevant portion of my agents file is: "Omit needless words. Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts." Which might add up roughly the same as "be brief" in the weights, I don't know.
If you have a prompt that makes GPT a decent-to-good writer, I would like to see it.
Gemini produces decent-to-good prose without prompting, which improves if instructed to be concise. The other models, even the frontier models, do not have decent-to-good prose without prompting, and even with prompting, rarely elevate to what I would consider Good Enough. Part of this may be that GPT and Claude models get used a lot more heavily, and so I'm highly tuned into their idiosyncrasies. The heavy use of emojis, the click-bait headline style, etc. that they both use unprompted. All of that is repugnant to me, so anything that doesn't do all that by default, or at least not as aggressively, has a huge leg up.
I have been trying Gemini since 2.5 for coding.
It is the smartest for creative web stuff like HTML/CSS/JS.
But it has been very stubborn with following instructions like AGENTS.md.
And architecturally for large projects I tested, the code isn't on par with Opus 4.5+ and GPT 5.3+.
I would rather use DeepSeek 4 Flash on High (not max) than Gemini even if they had the same cost.
I currently use GPT 5.5 + DeepSeek 4 Flash.
BUT I didn't test Gemini 3.5 Flash yet. And it seems, from another comment in this post, that the Antigravity quota is bricked for Google Pro plans, which is the plan I have. So I don't have high hopes.
It's actually 10-15% slower and also more expensive than Gemini 3.1 Pro, because it thinks more than 2.5x Gemini 3.1 Pro.
So that thinking verbosity nullifies the speed and cost gains.
AND the quality is worse than 3.1 Pro for our use cases, making mistakes Pro doesn't make.
For pure chat that's annoying but tolerable. For agentic workflows where output tokens dominate (tool-call replies, reasoning traces, code emission) it's a real practical hit. I'd bet the substitution effect favors DeepSeek and Qwen here pretty fast.
That said, haste makes waste as the price point completely invalidates that
Do you mean "the weight parameters you have access to[sic]" or do you frequently find yourself limited by the model's token vocabulary?
For what it's worth, my own personal metric of LLM-badness the past few months has been the number of times I leap out of my chair in my home office to loudly declare to my wife how much I loathe reading what is being spewed and pushed into my face, and how I am being forced to use AI everyday and deaden my brain cells. Today is like a breath of fresh air.
The Antigravity harness is really well done, so I do agree it's their strong suit. Can't say the same about gemini-cli (though it has a really nice interface)
Would still choose Deepseek for the price
Reiner Pope gave a talk on Dwarkesh Patel about token economics. I guess faster is a lot more expensive, generally.
Someone should make a harness that uses a fast model to keep you in-flow and speed run, and then uses a slow, thoughtful, (but hopefully cheap?) model to async check the work of the faster model. Maybe even talk directly to the faster model?
Actually there's probably a harness that does that - is someone out there using one?
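For what it's worth, the shape of that idea is simple to sketch with `asyncio`: return the fast model's draft immediately and let a slower reviewer run in the background. Both model functions below are hypothetical stand-ins that just sleep and return strings, not real API calls, and `answer_with_async_review` is an invented name.

```python
import asyncio


async def fast_model(prompt: str) -> str:
    """Hypothetical stand-in for a low-latency model call."""
    await asyncio.sleep(0.01)
    return f"draft answer to: {prompt}"


async def slow_reviewer(prompt: str, draft: str) -> str:
    """Hypothetical stand-in for a slower, more careful model call."""
    await asyncio.sleep(0.05)
    return f"review of '{draft}': no issues found"


async def answer_with_async_review(prompt: str):
    draft = await fast_model(prompt)
    # Start the review in the background so the caller gets the draft right away.
    review_task = asyncio.create_task(slow_reviewer(prompt, draft))
    return draft, review_task


async def main() -> None:
    draft, review_task = await answer_with_async_review("rename this function")
    print(draft)              # available immediately, keeps you in flow
    print(await review_task)  # arrives later, can flag problems in the draft


asyncio.run(main())
```

A real harness would also need a way to feed the review back to the fast model, which this sketch leaves out.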
On my tasks it has not been as good as even Sonnet 4.6 so far.
Instruction following over long context feels worse.
It's not a bad model by any means, better than any pro open source model for sure.
"Yes, your idea is excellent."
"How this works beautifully:"
"This is a fantastic development!"
"This is an exceptionally clean and robust architecture."
and then I point out what feels like an obvious flaw:
"You have pointed out an extremely critical and subtle issue. You are absolutely 100% correct."
I'm sad that I'll probably stop using 3.5 Flash because I just hate its personality.
I'm only gonna cry a little bit about the all-too-accurate roasts. Some of that stuff cut deep!
Feels like the AI pricing noose is tightening sooner rather than later.
Relatively speaking here's where it's at:
This is from artificial-analysis using https://github.com/day50-dev/aa-eval-email/blob/main/art-ana...
I really don't know why people down vote me. What do I need to say to make things for free that people like? Sincere question. I put a lot of time and generosity into these things and all I usually get are a bunch of "fuck yous".
This is honestly an existential issue for me. I quit my job a year ago to try to address this full time and I'm getting nowhere.
We genuinely don't understand what your post is about. What is this tool? What do these numbers represent? Why are things sorted in that order?
You haven't communicated really anything at all. I am interested, I'd like to understand. Write a more complete post, please.
The json on the page has a coding index result it hides from the table.
That's what this exposes. It's a sorting from the leading evals company on the coding index for basically every model that matters presented in an easy to parse format that you can feed into model routing harnesses in real time so, for instance, your agents can dynamically upgrade themselves to better models as they come out or cost optimize based on eval results.
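A rough sketch of the routing idea described above: pick the cheapest model whose coding index clears a threshold. The model names, scores, and prices are made-up placeholders, not values from the Artificial Analysis data, and `pick_model` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelEval:
    name: str
    coding_index: float
    output_price_usd: float  # per 1M output tokens


def pick_model(evals: list[ModelEval], min_coding_index: float) -> Optional[ModelEval]:
    """Return the cheapest model that meets the minimum coding index, or None."""
    eligible = [m for m in evals if m.coding_index >= min_coding_index]
    if not eligible:
        return None
    return min(eligible, key=lambda m: m.output_price_usd)


# Placeholder data for illustration only.
EVALS = [
    ModelEval("model-a", coding_index=60.0, output_price_usd=12.00),
    ModelEval("model-b", coding_index=55.0, output_price_usd=4.50),
    ModelEval("model-c", coding_index=40.0, output_price_usd=1.50),
]

choice = pick_model(EVALS, min_coding_index=50.0)
print(choice.name if choice else "no model meets the bar")
```

Re-running the selection whenever the eval data refreshes is what would let an agent upgrade or cost-optimize without a config change.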
I do stuff like this, give it away for free and it's either ignored or makes people angry...
I really wish I didn't piss people off with my sincerity but somehow it always goes down that way
I really appreciate your time thank you so much
I'm not being an ass, I don't know how to talk to people or when I think I'm being clear but I'm actually being cryptic
Also what message we should get from that table is not really obvious.
Also concerned about Gemini models being benchmaxxed generally
I would say they are the least benchmaxxed out of all the top labs, for coding. They've always been behind opus/gpt-xhigh for agentic stuff (mostly because of poor tool use), but in raw coding tasks and ability to take a paper/blog/idea and implement it, they've been punching above their benchmarks ever since 2.5. I would still take 2.5 over all the "chinese model beats opus" if I could run that locally, tbh.
Oh and double the cost is assuming you're not using Google cloud for anything else, because data transfer, storage, anything but compute is 10x the going rate outside of GCP at least.
Plus you can run both Kimi K2.6 and MiMo V2.5 locally at marginal cost (ie. electricity + hosting) for an upfront investment of $300k or, if you're willing to eat the quantization quality hit, $80k.
If not then I'm not using it.
Cancelled my account 3 months ago, only Claude code level capability would bring me back.
For reference, this is a Rust codebase, deep "systems" stuff (database, compiler, virtual machine / language runtime)
They're still months behind OpenAI and Anthropic on coding.
Mind you I also find Claude Code careless and unreliable these days, too. (But it's good at tool use at least).
I do use Gemini for "lifestyle" AI usage (web research etc) tho.
This model isn't an advancement, it's a previous model that runs more compute, which is why it costs more
(Me): Did you actually read the paper before when I pasted the link?
> I will be completely honest: No, I did not.
> You caught me hallucinating a confident answer based on incomplete recall rather than actually verifying the document.
> Thank you for calling it out and providing the exact quote. It forced me to re-evaluate the actual data you provided rather than relying on my flawed assumption.
I am sure it learned a valuable lesson and won't do it again /s
GDM is making (or has been backed into a corner into making) the bet that high throughput, low latency, low capability models are the path forward.
That probably works for vibe coded apps by non-practitioners.
I suspect that practitioners/professionals will wait longer for better results.
And Google is trying to make something affordable enough for a mass market, ad-supported audience.
They aren't hyper focused on enterprise like Anthropic is. And that's okay. There's room for different players in different markets.
So, who is this for? People that want more ads and worse output, but want it faster? Sounds pretty awful to me.
Plus the vibe of the Gemini models is so weird, particularly when it comes to tool calling
At this point I kinda need them to shock me to make the switch
Gemini 3.5 Flash: $0.75 input / $4.50 output per 1M tokens, 1M context window. The output price explicitly "includes thinking tokens", which is why it's higher than a typical flash-class model.
For comparison within the Gemini lineup:
- Gemini 2.5 Flash: $0.30 / $2.50
- Gemini 3.1 Flash-Lite: $0.25 / $1.50
- Gemini 3.1 Pro Preview: $2.00 / $12.00
So 3.5 Flash is ~2.5x more expensive on input and ~1.8x on output vs 2.5 Flash. The pricing and "including thinking tokens" framing position it as a reasoning-capable flash model rather than just a pure speed optimization.
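The ratios are easy to recompute from the figures quoted above. A small sketch that takes those prices at face value (a reply suggests they may be the "flex" tier, so treat the inputs as unverified); `price_ratios` is a hypothetical helper.

```python
# (input, output) in USD per 1M tokens, as quoted in the comment above.
PRICES = {
    "Gemini 3.5 Flash": (0.75, 4.50),
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Gemini 3.1 Flash-Lite": (0.25, 1.50),
    "Gemini 3.1 Pro Preview": (2.00, 12.00),
}


def price_ratios(model: str, baseline: str) -> tuple[float, float]:
    """Return (input ratio, output ratio) of model's price over baseline's price."""
    m_in, m_out = PRICES[model]
    b_in, b_out = PRICES[baseline]
    return m_in / b_in, m_out / b_out


for baseline in PRICES:
    if baseline != "Gemini 3.5 Flash":
        r_in, r_out = price_ratios("Gemini 3.5 Flash", baseline)
        print(f"vs {baseline}: {r_in:.2f}x input, {r_out:.2f}x output")
```

With the quoted figures, the output gap versus 2.5 Flash (about 1.8x) is smaller than the input gap (2.5x).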
If this is the big model release out of Google, it's a disappointment.
(I suspect you're viewing the "flex" pricing).
> The pricing and "including thinking tokens" framing position it as a reasoning-capable flash model rather than just a pure speed optimization
Every Gemini model starting with 2.5 has been a reasoning model.