DeepSeek v4
2041
DE version is available. Content is displayed in original English for accuracy.
Discussion (1555 comments). Read the original on HackerNews.
I have a collection of novel probability and statistics problems at the masters and PhD level with varying degrees of feasibility. My test suite involves running these problems through the model first (often with about 2-6 papers for context) and then requesting a rigorous proof as a followup. Since the problems are pretty tough, there is no quantitative measure of performance here; I'm just judging based on how useful the output is toward outlining a solution that would hopefully become publishable.
Just prior to this model, Gemini led the pack, with GPT-5 as a close second. No other model came anywhere near these two (no, not even Claude). Gemini would sometimes have incredible insight for some of the harder problems (insightful guesses on relevant procedures are often most useful in research), but both of them tend to struggle with outlining a concrete proof in a single followup prompt. This DeepSeek V4 Pro with max thinking does remarkably well here. I'm not seeing the same level of insights in the first response as Gemini (closer to GPT-5), but it often gets much better in the followup, and the proofs can be _very_ impressive; nearly complete in several cases.
Given that both Gemini and DeepSeek also seem to lead on token performance, I'm guessing that might play a role in their capacity for these types of problems. It's probably more a matter of just how far they can get in a sensible computational budget.
Despite what the benchmarks seem to show, this feels like a huge step up for open-weight models. Bravo to the DeepSeek team!
https://huggingface.co/deepseek-ai/DeepSeek-Math-V2 https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-671B
This is when it happened for anyone interested: https://binaryverseai.com/deepseek-math-v2-benchmarks-review...
My largest models
Same with GPT-5: Latest 5.5, prior 5.4, or actually the original 5 (.0)?
You can't talk about model performance without specifying the exact model.
Summary: Opus 4.6 forms the baseline all three are trying to beat. DeepSeek V4-Pro roughly matches it across the board, Kimi K2.6 edges it on agentic/coding benchmarks, and Opus 4.7 surpasses it on nearly everything except web search.
DeepSeek V4-Pro Max shines in competitive coding benchmarks. However, it trails both Opus models on software engineering. Kimi K2.6 is remarkably competitive as an open-weight model. Its main weakness is in pure reasoning (GPQA, HMMT) where it trails Opus.
Speculation: The DeepSeek team wanted to come out with a model that surpassed proprietary ones. However, OpenAI dropped 5.4 and 5.5 and Anthropic released Opus 4.6 and 4.7. So they chose to just release V4 and iterate on it.
Basis for speculation? (i) The original reported timeline for the model was February. (ii) Their Hugging Face model card starts with "We present a preview version of DeepSeek-V4 series". (iii) V4 isn't multimodal yet (unlike the others) and their technical report states "We are also working on incorporating multimodal capabilities to our models."
But if you prompt it well - give it the reasoning behind why you're asking it to do something - it pulls far ahead.
I will however say that Claude's work and design are really great, up until I blow its limit.
Just ran a couple of them through GPT 5.5, but this is a single attempt, so take any of this with a grain of salt. I'm on the Plus tier with memory off so each chat should have no memory of any other attempt (same goes for other models too).
It seems to be getting more of the impressive insights that Gemini got and doing so much faster, but I'm having a really hard time getting it to spit out a proper lengthy proof in a single prompt, as it loves its "summaries". For the random matrix theory problems, it also doesn't seem to adhere to the notation used in the documents I give it, which is a bit weird. My general impression at the moment is that it is probably on par with Gemini for the important stuff, and both are a bit better than DeepSeek.
I can't stress how much better these three models are than everything else though (at least in my type of math problems). Claude can't get anything nontrivial on any of the problems within ten (!!) minutes of thinking, so I have to shut it off before I run into usage limits. I have colleagues who love using Claude for tiny lemmas and things, so your mileage may vary, but it seems pretty bad at the hard stuff. Kimi and GLM are so vague as to be useless.
Have them do multiplication or other complicated arithmetic. You say that isn't difficult. Then why do they burn 200k tokens in 20 minutes without converging? I did a deep exploration to help myself understand here [0].
[0] https://adamsohn.com/reliably-incorrect/
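The failure mode explored in [0] is easy to probe yourself: exact integer arithmetic is free for a computer, while a model predicting digits token by token has to get every position right. A minimal check harness (the specific numbers below are arbitrary examples, not from the linked post):

```python
# Exact check of a model's long-multiplication claim; Python ints are
# arbitrary precision, so the comparison itself is trivially exact.
def check_product(a: int, b: int, claimed: int) -> bool:
    """True iff the claimed product is exactly a * b."""
    return a * b == claimed

# A 20-digit example of the kind that burns tokens without converging.
a = 73_912_640_158_227_390_481
b = 55_004_912_336_178_205_663
print(check_product(a, b, a * b))      # True by construction
print(check_product(a, b, a * b + 1))  # False: off by one
```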
- One problem on using quantum mechanics and C*-algebra techniques for non-Markovian stochastic processes. The interchange between the physics and probability languages often trips the models up, so pretty much everything tends to fail here.
- Three problems in random matrix theory and free probability; these require strong combinatorial skills and a good understanding of novel definitions, requiring multiple papers for context.
- One problem in saddle-point approximation; I've just recently put together a manuscript for this one with a masters student, so it isn't trivial either, but does not require as much insight.
- One problem pertaining to bounds on integral probability metrics for time-series modelling.
I'd be very curious to know how any LLMs fare. I completely understand if you don't want to continue the discussion because of anonymity reasons.
https://api-docs.deepseek.com/guides/thinking_mode
No BS, just a concise description of exactly what I need to write my own agent.
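For flavor, a request using the documented thinking mode might look like the sketch below. The endpoint follows the usual OpenAI-compatible shape, but treat the model name and the exact `thinking` field as assumptions rather than the guide's verbatim schema:

```python
import json

# Hypothetical payload for DeepSeek's chat endpoint with thinking enabled.
# Field names are assumptions loosely based on the linked thinking_mode guide.
payload = {
    "model": "deepseek-chat",
    "messages": [{"role": "user", "content": "Outline a proof sketch."}],
    "thinking": {"type": "enabled"},  # assumption: shape per the guide
}
body = json.dumps(payload)
# POST `body` to https://api.deepseek.com/chat/completions
# with an "Authorization: Bearer <key>" header.
print(body)
```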
Western models are being optimized to be used as an interchangeable product. Chinese models are being optimized to be built upon.
But so much investment in their platforms, not just their APIs?
Now that you’re winning, others start cloning your API to siphon your users.
Now that you’re losing, you start cloning the current winner, who is probably a clone of your clone.
Highly competitive markets tend to normalize, because lock-in is a cost you can’t charge and remain competitive. The customer holds power here, not the supplier.
That's also why everyone is trying to build into the less competitive spaces, where they could potentially build a moat: tooling, certs, specialized training data, etc.
They are developing their moats with the platform tooling around it right now though. Look at Anthropic with Routines and OpenAI with Agents. Drop that capability in to a business with loose controls and suddenly you have a very sticky product with high switching costs. Meanwhile if you stick with purely the ‘chat’ use cases, even Cowork and scheduled tasks, you maintain portability.
Example: the second sentence on the first page says “softwares” but “software” is a mass noun that cannot be pluralized.
Example: the third page about tokens has some zipped code to “calculate the token usage for your intput/output” and obviously “intput” should be “input” but misspelled.
As a company that produces LLMs, they could have even used their own LLM to edit their documentation to fix grammar issues, and yet they did not.
Maybe I’m just extra sensitive to grammar and spelling issues but this kind of lack of attention to detail is a huge subconscious turnoff. I had to fight my urge to close the tab.
I read OpenAI or Anthropic's documentation nowadays and it's just so full of useless junk and self-congratulation that makes it a miserable experience to go through. It's a real shame because OpenAI used to write stellar documentation and publish really lucid papers just few years ago.
> they could have even used their own LLM to edit their documentation to fix grammar issues
In my experience companies who do this rarely stop at using LLMs to fix grammar issues. It becomes full on LLM speak quite fast, especially if there isn’t a native English speaker in the room who can discern what’s good and bad writing.
I constantly see and hear this mistake from actual humans too.
It's fairly ironic that your own comment contains run-on sentences, speculative claims and phrasing peculiarities like "could have even" instead of "could even have". Perhaps you are less sensitive to this than you think!
It's strange that you criticise "could have even" when it is a phrasing clearly being used for emphasis. "Could even have" makes no clearer sense in context.
No irony detected.
Pretty cool. I think they're the first to guarantee determinism with a fixed seed or at temperature 0. Google came close but never guaranteed it, AFAIK. DeepSeek show their roots: it may not strictly be a SotA model, but there's a ton of low-level optimization nobody else pays attention to.
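Worth noting why such a guarantee is hard: floating-point addition is not associative, so batching and kernel scheduling can change results even at temperature 0. A two-line demonstration:

```python
# Floating-point addition is not associative, which is why run-to-run
# determinism requires pinning reduction order in the serving kernels,
# not just setting temperature to 0 or fixing the sampler seed.
lhs = (1e20 + -1e20) + 1.0  # huge terms cancel first, the 1.0 survives
rhs = 1e20 + (-1e20 + 1.0)  # the 1.0 is absorbed by the huge term and lost
print(lhs, rhs)  # 1.0 0.0
```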
"Limited by the capacity of high-end computational resources, the current throughput of the Pro model remains constrained. We expect its pricing to decrease significantly once the Ascend 950 has been deployed into production."
https://api-docs.deepseek.com/zh-cn/news/news260424#api-%E8%...
https://api-docs.deepseek.com/zh-cn/news/news260424#api-%E8%...
This is the first figure of the section that the above links point to (https://api-docs.deepseek.com/zh-cn/img/v4-spec.png).
And I can read Chinese.
Early takeaways from this release: DeepSeek V4 Flash is the model to pay attention to here. It's cheap, effective, and REALLY fast.
The Pro model is slow, not much better in coding reasoning so far when it works, and honestly too unreliable and rate limited to be of much use, currently. Hopefully that improves as new providers host the model. Flash is working fine, and is currently performing competitively with recent releases, but only on agentic workflows. Check back in 24 hours for full combined scoring with tool use and long context for both models.
Many of the frontier Chinese AI labs have released near-frontier models that are just a little bit behind Opus 4.6 in terms of speed, tool use ability, or long context handling. Open weights are winning the AI race, led by China. Crazy couple weeks of releases.
Mimo V2.5 Pro by Xiaomi (not open weights) is actually the best performer of the latest string of Chinese releases in our combined, comprehensive benchmarks, despite getting less attention. Kimi K2.6 is the most interesting open weights release, still. DeepSeek is not the leader in the space anymore.
An interesting pattern with the latest string of Chinese releases is the much better agentic boost (the models are not as smart out of the box, but their ability to iterate in a loop with tools makes up most of the difference). DeepSeek V4 Flash exemplifies this -- not a smart model on the first try, but it makes up for it over the course of a session.
Others are purely subjective, like LMArena, which really only measures the personality and style preferences of the masses at this point, because frontier LLM technical answers are too hard for the average person to judge.
Then there are some interesting one-off benchmarks, but they lack enough rigor, breadth, and samples to draw larger conclusions from.
So we designed our benchmark with 3 goals: objective measurements (individual submissions not dependent on a human or LLM judge), no known correct answer (so simulations can scale to much higher levels of intelligence), and enough variety over important aspects of intelligence. We do this by running multiple models in cooperative/competitive environments with very complex action spaces and objective scoring, where model performance is relative and affected by the actions of other participants.
And yeah, there are some interesting results when you have a more objective benchmark. It should raise eyebrows when every single sub-release of every company's model is better across the board than its predecessor -- that isn't reality.
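Relative scoring of this kind can be as simple as an Elo-style update after each head-to-head round. A minimal sketch (my illustration, not necessarily the benchmark's actual formula):

```python
# Elo-style relative scoring: each model's rating moves against the
# field, so no ground-truth answer key or judge is needed per round.
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a: 1.0 if A wins the round, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models; A wins one round.
print(elo_update(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)
```

Because the update is zero-sum, inflation across the field is visible immediately, which is exactly the property that makes "everything improves across the board" claims testable.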
but the fact that you cite your brief as your main argument is funny - you don't even have any inherently subjective numbers to justify what you believe, you only have "I don't believe".
If pure speed is most important for your use case, GPT-5.3 Chat is the fastest model we've tested and it's still reasonably smart. Not meant for agentic tool usage / long context, though.
So it might be more useful for business applications or non-engineering usage where you don't need exceptional intelligence, but it's useful to get fast, cheap responses.
I’d like somebody to explain to me how the endless comments of "bleeding edge labs are subsidizing the inference at an insane rate" make sense in light of a humongous model like v4 pro being $4 per 1M. I’d bet even the subscriptions are profitable, much less the API prices.
edit: $1.74/M input $3.48/M output on OpenRouter
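At those OpenRouter rates, even a token-heavy agentic day stays in the tens of dollars; the traffic split below is an assumed illustration, not measured usage:

```python
# Back-of-envelope cost for one heavy agentic-coding day at the quoted
# OpenRouter prices ($1.74 per 1M input tokens, $3.48 per 1M output).
in_tokens, out_tokens = 8_000_000, 1_000_000  # assumed daily traffic
cost = in_tokens / 1e6 * 1.74 + out_tokens / 1e6 * 3.48
print(round(cost, 2))  # 17.4
```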
In 2023, the depreciation schedule for H100s was 2 years, but they are still oversubscribed and generating significant income.
CoreWeave has upped its depreciation for GPUs to 6 years(!) now, which seems more realistic.
https://www.silicondata.com/blog/h100-rental-price-over-time
Aka: everyone who uses Nvidia isn't selling at cost, because Nvidia is so expensive.
We therefore cannot just look at inference costs directly, training is part of the pitch. Without the promises of continuous improvement and chasing the elusive AGI, money for investments for inference evaporates.
In China you need to appease state goals. In the US you need to appease investor goals.
China will keep funding them regardless of their income, because the goal is (ostensibly) a state AGI/ASI. In the US, the goal is an ROI which may or may not come with AGI/ASI.
They are different economies with different goals. We can look at past Chinese national projects and see that they are fine with burning $50 to get [social goal] that's worth $5.
But seriously, it just stems from the fact some people want AI to go away. If you set your conclusion first, you can very easily derive any premise. AI must go away -> AI must be a bad business -> AI must be losing money.
There are still major unanswered questions here. For instance, all of the incremental data capacity build out is going to businesses that have totally unknown LT unit economics and that today are burning obscene amounts of cash.
At some point (from the very beginning till ~2025Q4), Claude Code's usage limit was so generous that you could get roughly $10~20 (API-price-equivalent) worth of usage out of a $20/mo Pro plan each day (2 * 5h windows) - and for good reason, because LLM agentic coding is extremely token-heavy; people simply wouldn't return to Claude Code a second time if the provided usage wasn't generous or if every prompt cost them $1. And then Codex started trying to poach Claude Code users by offering even greater limits and constantly resetting everyone's limit in recent months. The API price would have to be 30x operating cost to make this not a subsidy. That would be an extraordinary claim.
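Taking the top of that quoted range, the arithmetic behind the subsidy claim works out like this:

```python
# ~$20/day of API-equivalent usage delivered on a $20/month plan.
daily_value, monthly_price = 20.0, 20.0
delivered = daily_value * 30       # API-equivalent value per month
ratio = delivered / monthly_price  # margin API prices would need for this
                                   # to not be a subsidy
print(delivered, ratio)  # 600.0 30.0
```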
eg:
Token prices are significantly subsidized and anyone that does any serious work with AI can tell you this.
https://news.ycombinator.com/item?id=47684887
(the claims don't make any sense, but they are widely held)
I think I understand the major reasons for this meme, but I find it really worrying; there were lots of incorrect ‘it’s a bubble’ conversations here in 2012-2015, but I don’t think they had the pervasive nature and “obvious” conclusion that a whole generation of engineering talent should just, you know, leave.
Meanwhile I am hearing rational economic modeling from the companies selling inference; Jensen, (a polished promoter, I grant you) says it really well — token value is increasing radically, in that new models -> better quality, and therefore revenues and utilization are increasing, and therefore contrary to the popular financial and techbro modeling of 2023, things like A100s still cost quite a lot whether hourly or to purchase. (!) Basically the economic value is so strong that it has actually radically extended the life of hardware.
I just hate to imagine like half of the world’s (or US’s) engineering talent quitting, spending ten years afraid, or wrongly convinced of some ‘inevitable’ market outcome. Feels like it will be bad for people’s personal lives, and bad for progress simultaneously.
I'm still playing with the new Qwen3.6 35B and impressed, now DeepSeek v4 drops; with both base and instruction-tuned weights? There goes my weekend :P
One answer - Chinese Communist Party. They are being subsidized by the state.
Also, note that there's zero CUDA dependency. It runs entirely on Huawei chips. In other words, the Chinese ecosystem has delivered a complete AI stack. Like it or not, that's big news. But what's there not to like when monopolies break down?
That is a huge claim to make with no evidence.
I researched what you said, and I have found no statement to that effect in their paper[0], on huggingface[1], twitter[2], WeChat[3], or in their news release[4].
They only mention as a footnote in only the Chinese version of their news release that they plan to reduce inference costs with the Ascend 950 supernode when it releases[5]. The only mention of Huawei in their paper is that they validated a technique to lower interconnect bandwidth on Ascend NPUs and Nvidia GPUs[6].
[0] https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main...
[1] https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
[2] https://xcancel.com/deepseek_ai/status/2047516922263285776
[3] https://mp.weixin.qq.com/s/8bxXqS2R8Fx5-1TLDBiEDg
[4] https://api-docs.deepseek.com/news/news260424
[5] https://api-docs.deepseek.com/zh-cn/img/v4-price.png
[6] Page 16
And while I'm here, I want to note that I feel there's a big misunderstanding of what is and isn't demonstrated by DeepSeek. So far as I can tell, the major (and important!) innovation is reproducing near-frontier capabilities at a fraction of the cost. But it may be that iterating forward at the frontier is the costly thing, a cost borne by Western companies, and that nuance seems to get lost with DeepSeek. Which is not to say that non-Western companies aren't sometimes capable of jumping into the lead (Kimi has been super impressive), but if GPT/Claude/etc "only" lead at the frontier with more expensive models, that's still a moat.
https://finance.yahoo.com/sectors/technology/articles/deepse...
Only mention of Huawei in that article (as of now).
DeepSeek’s next AI model delayed by attempt to use Chinese chips
https://www.ft.com/content/eb984646-6320-4bfe-a78d-a1da2274b...
“Due to constraints in high-end compute capacity, the current service capacity for Pro is very limited. After the 950 supernodes are launched at scale in the second half of this year, the price of Pro is expected to be reduced significantly.”
https://x.com/jukan05/status/2047516566149816627
That HN is quick to upvote an unsubstantiated comment (the grandparent one, because it aligns with the anti-US bias?) and downvote a fact-finding one doesn't bode too well for the community as a whole. I have seen how political ideology colors everything in my home country (Malaysia), and the decline of the country is palpable; I don't expect to find such a thing here. We are supposed to be dispassionate and rational, right?
Render to Jesus what's due to him; ditto for Caesar.
It's also more or less the same move that they've been using pretty much since the WTO entry: take on foreign manufacturing, copy the products, sell knockoffs as their own, build new products on top of that knowledge.
Jensen came across as incredibly defensive and intentionally close-minded, shows that even billionaires suffer from "a man can't understand something if his paycheck depends on him not understanding it."
Your assertion is silly: did Tesla selling electric cars into China stop them from developing their own industry? They were going to develop their domestic industry regardless.
We simply don't know the counterfactual, if they had unlimited access to Nvidia chips, how far ahead would their models be?
That's alright. It delays them at least.
China is not perfect but a bit of competition is healthy and needed
We need to accept that being too close to America is harming us and start funding projects to protect our assets, e.g. talent leaking out to American entities.
https://scrupulouspessimism.substack.com/p/america-means-the...
China’s governments actions are on a completely different level - for example:
“””
Since 2014, the government of the People's Republic of China has committed a series of ongoing human rights abuses against Uyghurs and other Turkic Muslim minorities in Xinjiang which has often been characterized as persecution or as genocide.
“”” https://en.wikipedia.org/wiki/Persecution_of_Uyghurs_in_Chin...
https://www.amnesty.org/en/location/asia-and-the-pacific/eas...
Yes Trump is clearly trying Totalitarianism in America, but it is orders of magnitude different from what is happening in China.
That should be at least comparable to (if not worse than) what China is doing.
China is repressing the Uyghurs and threatening Taiwan. I don't agree with these actions, but is it really "orders of magnitude" worse than the destruction the US facilitates in the Middle East?
With Trump they are now openly hostile to European democracies, and ICE is doing its best at repression within the US.
The next decade is going to look very different with America Alone.
With all that goes on, it has changed. Recently I sat on a plane near some Americans discussing their holidays here, and I noticed I felt contempt. Sitting there with insane privilege as their government torches the world.
Individuals remain individuals, and one really ought not to be prejudiced. However, the lack of resistance I see in the "land of the free" as their "democratic" institutions collapse just makes me believe they never cared at all. In France, cars are torched if the pension age is raised. In America, the rise of fascism apparently doesn't matter to them.
https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28no...
> now poorer than every state in America
You've confused the mean with the median. GDP Per Capita is not a measure of how well-off the people in a country are.
American states have a lot more income inequality than the UK does, which (due to positive "non-parametric skewness", I think) pulls their GDP Per Capita upwards.
Yeah, me too. All that pesky saving the world stuff that we do on the regular is so exhausting sometimes.
None of those have brought me a feeling of being part of saving someone.
Yes, even compared to this low price point.
As before, the headline news with DeepSeek isn't in the benchmarks, but that they're competitive there while being gut-churningly cheap for the Western AI industry.
Already do on EVs.
We are "only" allowed Claude and MS Copilot for security reasons and cost reasons.
"open source" keeps being redefined by people with wealth and power to restrict our computing rights.
eventually it's just gonna be "proprietary Microsoft code that runs on Microsoft servers, but you can see a portion of the results"
The report only talks about validating the "fine-grained EP scheme" on Huawei hardware.
However, there are so many factors involved beyond your control that it would not be a viable option compared to other possible security attacks.
It's like suggesting BYD has a high likelihood of making their cars into weapons or something. It's not in the company or their countries interest to do that.
Sure it could happen but I bet it would only happen in a targeted way. Why risk all credibility right now and engage in cyber warfare?
Meaning TikTok in the US is complete garbage for kids, almost like a virus, whereas in China it's more educational.
If I had to place a hidden target it'd probably be around RNGs or publicly exposed services..
I don't mean that flippantly. These things are dumped in the wild, used on common (largely) open source execution chains. If you find a software exploit, it's going to affect your population too.
Wet exploits are a bit harder to track. I'd assume there are plenty of biases based on training material but who knows if these models have a MKUltra training programme integrated into them?
Spearphishing.
Building reliance and exploiting it, through state subsidies, dumping, and market manipulation.
Handicapping provision to the west for competitive advantage.
Tech ceos are going around talking about how they will rule over employees and they will be unable to work in the future except for intelligence tokens. What if China commoditizes that without spending nearly as much resources? Kind of makes the trillions of dollars invested in the US a literal joke.
Even on my phone via Edge Gallery, the DeepSeek-to-Qwen 1.5B distill was able to answer it. It messes up facts a little, but that's certainly because it's a small model, not because of censorship.
I'm really unsure how it could get less censored than this. The API is obviously much more censored because they operate from China, but that has nothing to do with the model itself.
Of course there are risks.
Really nice to see the Chinese are competing this strongly with the rest of the world. Competition is always nice for the end-consumer.
Then you can run it using some inference backend, e.g. llama.cpp, on any hardware supported by it.
However, this is a big model so even if you quantize it you need a lot of memory to be able to run it.
The alternative is to run it much more slowly, by storing the weights on an SSD. Some results on optimizing inference to work like this have already been published, and I expect this will become more common in the future.
There are cases where running a better model slowly is still preferable to running a model that gives poor results quickly, especially when you do not use it conversationally but to do some work with agents.
So does this mean I can run this on AMD? And on a consumer 9000 series card?
* https://github.com/ggml-org/llama.cpp/releases
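Once you have a quantized GGUF, the invocation is a one-liner. The model filename below is hypothetical; you'd pick whichever quant fits your memory:

```shell
# Run a (hypothetical) quantized GGUF with llama.cpp's CLI.
# -n caps the number of generated tokens; --temp sets sampling temperature.
./llama-cli -m DeepSeek-V4-Flash-Q4_K_M.gguf \
    -p "Explain mixture-of-experts routing in two sentences." \
    -n 256 --temp 0.6
```

llama.cpp builds for CUDA, ROCm (AMD), Metal, and plain CPU backends, which is what makes the "any supported hardware" claim above work in practice.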
This version of AI is mostly taking a public paper from 2017, investing in GPUs, and feeding it as much data as possible. So with a few computer scientists, no respect for intellectual property, and tons of money to burn, you have all the ingredients to create this technology.
Sam Altman and friends did it, as did the Chinese. The difference is that the Americans have been hyping it up to the extreme with all these dramatic scenarios about what would happen if someone else got its hands on it.
The Chinese made it public, among other things, to show how fragile this is as a business and as a large part of the US stock market.
I love the implication that this paper just dropped out of thin air and not decades of private AI research funded by a US company.
>The Chinese made it public, among other things to show how fragile this is as a business
The Chinese distill US models, that's why they keep trailing close but never exceeding. It's easy to make things public when you didn't take on any of the cost of developing the technology. Stealing US IP and selling cheap copies has been China's MO for decades now.
Where did you read this? From what I read in the paper, it appears to explicitly state that they used NVIDIA GPUs and their MegaMOE code, which is written in CUDA.
And in any case what does open source actually mean for an llm? It's not like you can look inside it to see what it's doing.
You can download it from the link given here at the top and you can run it on your own hardware, with whichever open-source harness you prefer, without having to worry about token cost or about subscription limits or about any future degradation in performance that you cannot control.
The recent history has demonstrated that such risks are very significant.
Being open weights is important for anyone who wants to use an LLM. Being open source is important only for a subset of those, who have the will, the knowledge and the means to train a model from its training data.
Having access to the training data used by a model would be very nice, but the reality is that for a normal LLM user it is very beneficial to use an open-weights model with an open-source harness, but it would be much harder to exploit the advantage of having access to all the information about how the LLM has been created.
AllenAi is the fullest open ai I know of
Understandable.
I asked DS itself and it denied this. It says: 'Nvidia chips are absolutely used for DeepSeek V4. The reality is a pragmatic "both-and" strategy, not an "either-or."'
And based on the DS V4 technical report (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main...), it is mentioned that:
It mentions that Nvidia is still used. It doesn't even mention that Huawei chips are used in production, only in testing and validation. (In all honesty, I relied on DS to give me the above, so I haven't vetted the information in full.)
Bro, seriously?
Edit: it seems "open source" was edited out of the parent comment.
no one is ever going to release their training data because it contains every copyrighted work in existence. everyone, even the hecking-wholesome safety-first Anthropic, is using copyrighted data without permission to train their models. there you go.
It is very much a valuable thing already; no need to taint it with a false promise.
Though I disagree about it going unused if it were indeed open source: I might not do it inside my home lab today, but at least Qwen and DeepSeek would use and build on what e.g. Facebook was doing with Llama, and they might push the open-weights model frontier forward faster.
The training scripts are in Megatron and vLLM.
1. Training data is the source. 2. Training is compilation/compression. 3. Weights are the compiled source akin to optimized assembly.
However it's an imperfect analogy on so many levels. Nitpick away.
For reference, the Huawei Ascend 950 that this thing runs on is supposed to be roughly comparable to Nvidia's H100 from 2022. In other words, things are hotting up in the GPU war!
Nvidia's forward PE ratio is only 20 for 2026. That's much lower than companies like Walmart and Costco. It's also growing nearly 100% YoY and has a $1 trillion backlog.
I think Nvidia is cheap.
One set of models runs on 8GB VRAM / 16GB RAM and another runs on 24GB VRAM / 64GB RAM. They are very useful for easy and moderately complex code, respectively.
The latest open, small models are incredibly useful even at smaller sizes when configured properly (quant size, sampling params, careful use of context etc).
That's a very strange comment. Why would anyone run a dense model on a low-end computer? An 8B model is only going to make sense if you have a dGPU. And a Qwen3.6 or Gemma4 MoE isn't going to be "beaten the hell out of" for most tasks, especially if you use tools.
Finally, over the lifetime of your computer, your ChatGPT subscription is going to cost more than the reference computer itself! So the real question should be whether you're better off with a $1000 computer and a ChatGPT subscription or with a $2000 computer (assuming a conservative lifetime of 4 years for the computer).
My Strix Halo desktop (which I paid ~1700€ before OpenAI derailed the RAM market) paired with Qwen3.5 is a close replacement for a $200/month subscription, so the cost/benefit ratio is strongly in favor of the local model in my use case.
The complexity of following model releases and installing things needed for self-hosting is a valid argument against local models, but it's absolutely not the same thing as saying that local models are too bad to use (which is complete BS).
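With round numbers, the four-year comparison sketched above works out roughly like this (all figures are illustrative assumptions):

```python
# Four-year total cost: cheap machine + subscription vs. bigger machine
# running local models. All prices are illustrative assumptions.
years = 4
sub_cheap = 1000 + 20 * 12 * years  # $1000 PC + $20/mo plan
local_big = 2000                    # $2000 PC, local models only
sub_heavy = 200 * 12 * years        # the $200/mo plan in the Strix Halo comparison
print(sub_cheap, local_big, sub_heavy)  # 1960 2000 9600
```

At $20/mo the two options are roughly a wash; against a $200/mo plan the local machine wins by a wide margin, which matches the cost/benefit claim above.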
Biggest risk I see is Nvidia having delays / bad luck with R&D / meh generations for long enough to depress their growth projections; and then everything gets revalued.
It's like ricing your Linux distro, sure it's fun to spend that time but don't make the mistake of thinking it's productive, it's just another form of procrastination (or perhaps a hobby to put it more charitably).
At this point I would just pick the one who's "ethics" and user experience you prefer. The difference in performance between these releases has had no impact on the meaningful work one can do with them, unless perhaps they are on the fringes in some domain.
Personally I am trying out the open models cloud hosted, since I am not interested in being rug pulled by the big two providers. They have come a long way, and for all the work I actually trust to an LLM they seem to be sufficient.
New model comes out, has some nice benchmarks, but the subjective experience of actually using it stays the same. Nothing's really blown my mind since.
Feels like the field has stagnated to a point where only the enthusiasts care.
Since then it's just been a cycle of the old model being progressively lobotomised and a "new" one coming out that if you're lucky might be as good as the OG Opus 4.5 for a couple of weeks.
Subjective but as far as I can tell no progress in almost a year, which is a lifetime in 2022-25 LLM timelines
Can't argue with subjective experience, but if there were some tasks that you thought LLMs can't do two years ago, maybe try again today. You might be surprised.
DeepSeek-V4-Flash: https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash
DeepSeek-V4-Pro: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
Back in Nov 2025, Opus 4.5 (80.9%) was the first proprietary model to do so.
So it os hard to tell how much of a model gain is due to skill, and how much - overfitting.
And we got new base models, wonderful, truly wonderful
https://openrouter.ai/deepseek/deepseek-v4-flash
`https://openrouter.ai/api/messages with model=deepseek/deepseek-v4-pro, OR returns an error because their Anthropic-compat translator doesn't cover V4 yet. The Claude CLI dutifully surfaces that error as "model...does not exist"
This “no harm to me” meme about a foreign totalitarian government (with plenty of incentive to run influence ops on foreigners) hoovering your data is just so mind-bogglingly naive.
Relatively speaking, DeepSeek is less untrustworthy than Grok.
When I try ChatGPT on current events from the White House it interprets them as strange hypotheticals rather than news, which is probably more a problem with DC than with GPT, but whatever.
That would be a great argument if the American models weren’t so heavily censored.
The Chinese model might dodge a question if I ask it about 1-2 specific Chinese cultural issues but then it also doesn’t moralize me at every turn because I asked it to use a piece of security software.
Even for minor stuff like beeing addicted to drugs.
Looks pretty totalitarian to me.
yes, this is exactly what I'm saying.
This is why I’ve been urging everyone I know to move away from American based services and providers. It’s slow but honest work.
China is a nation built for peace, while western nations are built for war.
But for folks on the opposite side of the world, the threats are more like "they're selling us electric cars and solar panels too cheaply" and the hypothetical "these super cheap CCTV cameras could be used for remote spying"
- Sam Altman & Worldcoin collecting everyone's eyeball scan - Discord attempting to roll out worldwide age & id verification - LinkedIn collecting data on your web browser extensions - WhatsApp collecting browser data via a local server running on device
Its sad to see how you have regulated yourselves into a position where Mistral is your only claim.
My country’s per capita income is $2500 a year. We can’t pay perpetual rent to OAI/Anthropic
This sounds whole lot like potatoh potahto. I think the former argument is very much the correct one: China can undercut everyone and win, even at a loss. Happened with solar panels, steel, evs, sea food - it's a well tested strategy and it works really well despite the many flavors it comes in.
That being said a job well done for the wrong reasons is still a job well done so we should very much welcome these contributions, and maybe it's good to upset western big tech a bit so it's remains competitive.
Just this week they published a serious foundational library for LLMs https://github.com/deepseek-ai/TileKernels
Others worth mentioning:
https://github.com/deepseek-ai/DeepGEMM a competitive foundational library
https://github.com/deepseek-ai/Engram
https://github.com/deepseek-ai/DeepSeek-V3
https://github.com/deepseek-ai/DeepSeek-R1
https://github.com/deepseek-ai/DeepSeek-OCR-2
They have 33 repos and counting: https://github.com/orgs/deepseek-ai/repositories?type=all
And DeepSeek often has very cool new approaches to AI copied by the rest. Many others copied their tech. And some of those have 10x or 100x the GPU training budget and that's their moat to stay competitive.
The models from Chinese Big Tech and some of the small ones are open weights only. (and allegedly benchmaxxed) (see https://xcancel.com/N8Programs/status/2044408755790508113). Not the same.
So you can’t see what facts are pruned out, what biases were applied, etc. Even more importantly, you can’t make a slightly improved version.
This model is as open source as a windows XP installation ISO.
And you think the US tech giants don't have any ulterior motives?!
I just want to remind you that this is happening at the same time as Anthropic A/B tests removal of Code from Pro Plan, and as OpenAI releases gpt-5.5 2x more expensive than gpt-5.4...
That’s a big if. It’s my experience that models that perform very well on benchmarks do not necessarily perform well in real life.
I’ve mostly started ignoring the benchmarks and run my own evals.
Well, yeah... Like Opus 4.5, 4.6, 4.7. Top of the benchmarks and yet it's a pile of crap at the moment and has been for months.
>Can the same be said about DeepSeek or any other open-source model provider performing distillation?
Open source models that distill from SoTA reminds me of the story of Robin Hood -- robbing the rich and giving it to the poor. So to answer your question: yes, but it's better than the alternative where only a select few companies have SoTA models.
Oh, so people might be forced to give back the AI earnings? Should I be worried about the last year's capital gains on my portfolio?
Altman and Amodei are so mad about muhh model when they steal our data and pollute the Internet with slop.
https://x.com/teortaxesTex/status/2026130112685416881
For OSS model, I have z.ai yearly subscription during the promo. But it's a lot more expensive now. The model is good imo, and just need to find the right providers. There are a lot of alternatives now. Like I saw some good reviews regarding ollama cloud.
But more broadly: openrouter solves the problem of making a broad range of models available with a single payment endpoint, so you can just switch around as much as you like.
I have tasks that used to take ~3-5min with Sonnet 4.6. With OpenRouter Kimi, the same task takes 10+ min. It's also just obviously slower in opencode sessions. The results are good, and I love the lower cost, but the speed can be frustrating.
If you're trying to make a buck while unemployed, sure get a subscription. Otherwise learn how to work again without AI, just focus on the interesting stuff.
Another way to keep the ability to try out new models is to buy a reseller subscription like Cursor’s.
I'm on Max x5 plan and any of the 'good' models like Kimi 2.6, GLM, DeepSeek would have cost 3-5x in per-token billing for what I used on my Claude plan the last three months
So unless my Claude fudged the maths to make itself look better, seems like I'm getting a good deal
input: $0.14/$0.28 (whereas gemini $0.5/$3)
Does anyone know why output prices have such a big gap?
Model was released and it's amazing. Frontier level (better than Opus 4.6) at a fraction of the cost.
As a non-Opus user, I'll continue to use the cheapest fastest models that get my job done, which (for me anyway) is still MiniMax M2.5. I occasionally try a newer, more expensive model, and I get the same results. I have a feeling we might all be getting swindled by the whole AI industry with benchmarks that just make it look like everything's improving.
The tricky part is that the "number of tokens to good result" does absolutely vary, and you need a decent harness to make it work without too much manual intervention, so figuring out which model is most cost-effective for which tasks is becoming increasingly hard, but several are cost-effective enough.
Substantially worse at following instructions and overoptimized for maximizing token usage
Codex is just so much better, or the genera GPT models.
https://github.blog/news-insights/company-news/changes-to-gi...
I do some stuff with gemini flash and Aider, but mostly because I want to avoid locking myself into a walled garden of models, UIs and company
If you're feeling frisky, Zed has a decent agent harness and a very good editor.
So while I agree mixed model is the way to go, opus is still my workhorse.
In contrast ChatGPT 5.3 and also Opus has a 90% rate at least on this same project. (Embedded)
All other tests were the same. What are you doing with these models?
Opencode was getting there, but it seems the founders lost interest. Pi could be it, but its very focused on OpenClaw. Even Codex cli doesnt have all of it.
which harness works well with Deepseek v4 ?
This is free... as in you can download it, run it on your systems and finetune it to be the way you want it to be.
In theory, sure, but as other have pointed out you need to spend half a million on GPUs just to get enough VRAM to fit a single instance of the model. And you’d better make sure your use case makes full 24/7 use of all that rapidly-depreciating hardware you just spent all your money on, otherwise your actual cost per token will be much higher than you think.
In practice you will get better value from just buying tokens from a third party whose business is hosting open weight models as efficiently as possible and who make full use of their hardware. Even with the small margin they charge on top you will still come out ahead.
It's about 2 months behind GPT 5.5 and Opus 4.7.
As long as it is cheap to run for the hosting providers and it is frontier level, it is a very competitive model and impressive against the others. I give it 2 years maximum for consumer hardware to run models that are 500B - 800B quantized on their machines.
It should be obvious now why Anthropic really doesn't want you to run local models on your machine.
Doesn't mean Deepseek v4 isn't great, just benchmarks alone aren't enough to tell.
> In our internal evaluation, DeepSeek-V4-Pro-Max outperforms Claude Sonnet 4.5 and approaches the level of Opus 4.5.
If its coding abilities are better than Claude Code with Opus 4.6 then I will definitely be switching to this model.
It's still a "preview" version atm.
There we go again :) It seems we have a release each day claiming that. What's weird is that even deepseek doesn't claim it's better than opus w/ thinking. No idea why you'd say that but anyway.
Dsv3 was a good model. Not benchmaxxed at all, it was pretty stable where it was. Did well on tasks that were ood for benchmarks, even if it was behind SotA.
This seems to be similar. Behind SotA, but not by much, and at a much lower price. The big one is being served (by ds themselves now, more providers will come and we'll see the median price) at 1.74$ in / 3.48$ out / 0.14$ cache. Really cheap for what it offers.
The small one is at 0.14$ in / 0.28$ out / 0.028$ cache, which is pretty much "too cheap to matter". This will be what people can run realistically "at home", and should be a contender for things like haiku/gemini-flash, if it can deliver at those levels.
LMAO
I have no idea why you'd think that, but this is straight from their announcement here (https://mp.weixin.qq.com/s/8bxXqS2R8Fx5-1TLDBiEDg):
> According to evaluation feedback, its user experience is better than Sonnet 4.5, and its delivery quality is close to Opus 4.6's non-thinking mode, but there is still a certain gap compared to Opus 4.6's thinking mode.
This is the model creators saying it, not me.
Claude4.6 was almost 10pp better at at answering questions from long contexts ("corpuses" in CorpusQA and "multiround conversations" in MRCR), while DSv4 was a staggering 14pp better at one math challenge (IMOAnswerBench) and 12pp better at basic Q&A (SimpleQA-Verified).
That's literally what the I Ching calls "good fortune."
Competition, when no single dragon monopolizes the sky, brings fortune for all.
The US-China contest aside - it is in the application layer llms will show their value. There the field, with llm commoditization and no clear monopolies, is wide open.
There was a point in time where it looked like llms would the domain of a single well guarded monopoly - that would have been a very dark world. Luckily we are not there now and there is plenty of grounds for optimism.
And China may have changed in some ways but there have been no signals it would not repeat that event if it thought circumstances warranted.
These are not equivalent.
We conduct amoral behavior with terrorist regimes for dollars.
TikTok and Hasan has really turned the West against itself.
Liberal democracies have moral high ground over authoritarian dictatorships (at least along that one dimension)
The US is backsliding tragically (and stupidly) and may lose that moral high ground, but the rest of the western democracies will still have it
The elected government of the US has the moral highground of over the regime that killed the KMT in it's weakened state after the KMT defeated Japan, went on a rampage against the educated classes, mowed down its own people with machineguns and tanks when they demanded a say in their own governments, and kidnaps people advocating for democracy to this day, including Jack Ma.
> despite starting a new war... on behalf of Israel every six months.
The war started when Hamas, funded by Iran, went on a murder and rape rampage against Israeli civilians.
Thinks America is starting wars on behalf of Israel.
LMAO
Fully agree. From a US perspective, that sucks. For everyone else it's pretty great.
At this point the world's opinions of China are better than those of the US in some polls. One country invests and helps build infrastructure on a massive scale globally, the other alienates allies, causes countless conflicts, and openly threatens to end civilizations.
Indeed, even if one isn't partial to China, there's reasons to be glad that an increasingly hostile US has powerful competition.
> This is about who will dominate the world of tomorrow.
For this you'd need a technological moat. So far the forerunners have burned a lot of money with no moat in sight. Right now Europe is happy just contributing on research and doing the bare-minimum to maintain the know-how. Building a frontier model would be lobbing money into the incinerator for something that will be outdated tomorrow. European investors are too careful for that - and in this case seem to be right.
This is how I see it. The US has openly threatened multiple times to annex my country, and has repeatedly threatened every western nation. Letting the US have a monopoly on... well.. anything, is really bad for the world. The more countries that have their own production for various critical things like computer chips, medicine, etc, the better it is for the world at it distributes power.
People in the US don't seem to understand that with the current administration the US is seen as a potentially very hostile nation. While I don't think China is a friend to Canada or the west, at least it provides alternatives when the US tries to use it's monopolies against us. And vice versa too.
>Building a frontier model would be lobbing money into the incinerator for something that will be outdated tomorrow. European investors are too careful for that - and in this case seem to be right.
Strong disagree here. Mistral does great work, in the long term being a few months or even a year behind is a non-issue. Also Cohere just merged with Aleph Alpha to continue producing foundational models. It's extremely important that the middle powers continue to do this.
I am not washing away the authoritarianism, but take a look at other economic super powers directionality. Or that of tech ceos as well. At least Chinese tech companies aren't going around praising wwii Germany, writing manifestos, and bombing children at school or fisherman on whims. It is difficult not to see more countries regardless of leadership putting their hat in the ring as a net positive. Especially if it increases sustainability and lowers the price, which this very clearly does. It's even open source...
China's policies and government aren't morally defensible and I do fear that they will become more aggressive in spreading their influence and policies onto other countries, but from an economic standpoint what they're doing is super effective. While the previous world power (the US) is stuck in infighting and going through cycles of fixing/undoing the previous administration's damages, instead of planning ahead.
Yet, it's the democratic regime which is causing all the chaos around the world and disrespecting the leadership of other jurisdictions, just to keep pushing the petrol dollar going up.
Do we ever think there's any subtle difference between authoritarian and democratic? Where democracy ultimately makes the world a better place?
And in the hardware side, RISC-V is gaining a lot of traction in China. So the dependency on a single supplier is lower with the Chinese tech stack than with most western options.
Alternative being the current reality and world being dominated by US. Let's ask people in Middle East/Asia/South America about how they feel about that. In this current day and age, how is this statement even relevant?
I personally love the bit "us initiated tech war" lol. thats right, they started making AI its their fault! bad imperialist US !
yeah, v5 will do better
It’s this sort of example (and not properly supporting Ukraine, and not agreeing how to collectively deal with migrants, and not agreeing how to coordinate defence, and myriad other examples) that highlights what a pointless mess the EU is. It’s not a unified block - it’s 27 self-interested entities squabbling and playing petty power games, while totally failing to plan for the future with vision.
The EU could/should have ensured that a European equivalent to OpenAI or Anthropic could thrive, and had competitive frontier models already; instead, they’re years and countless billions behind.
the people and industry arent what matter there
I think they are leaders in the democratization of LLMs. Almost everyone has a computer right now that can run a useful variant of a Mistral model. I hope they keep their focus because what they are aiming for likely has the biggest impact on the average person and would be the best case scenario for the technology in general.
Their main selling point is: They are neither US-American nor Chinese. That's a real moat in today's world. I think at the moment they feel quite comfortable.
I don't know what the problem is. Are we europeans to stupid? Do we just not have enough money / VC money? Are we not proud enough?
:(
I feel uneasy over China dominance as much as the US.
I trust US more still as Europe has a post WW2 relationship. I notice many comments being pro China but they seem to be from the third world (one mentioned a very low salary) I feel the opening of the internet was a mistake.
China is a totilitarian dictatorship. This is a fact.
Look into Mistral AI too :)
For context, I am Swedish.
Yes this is a new account, please focus on the content.
I think their stance often comes from a strong anti-Western bias, and sometimes from feelings of resentment.
Dont get me wrong, Sweden is a cool country, but still my point stands.
Trust whoever you want, I just don't have the patience (or money) for American models.
Yeah, I also really hate when poor people think they're allowed to talk.
The idea that China is worse than America is laughable. LMK when China invades 5 countries in a span of 20 years unimpeded by anyone else in the world and maybe I'll be scared.
Until then it's quite clear how consumers benefit from actual competition and it's not because of the US.
Also you saying you trust the US when they just threatened to invade Greenland (a threat so credible that Denmark was planning a full scale resistance against US troops).
Sorry but the curtains are truly coming down and the US will become one of the most hated nations in the world while 100s of millions will needlessly starve and die because of the actions of Americans that simply don't give a fuck.
FWIW, I'm not just talking about Trump either. Democratic politicians are just as much to blame, they champion corporatism and imperialism as much as Republicans and the only issues D leadership seems to have is that the "right process" wasn't being followed.
I say this as someone who is a literal democratic operative within the party.
It also seems like clashes with India, every southeast asian country with internationally recognized territory rights in the South China sea, the forcible takeover of Hong Kong, arming and economically supporting Russia, Pakistan and Iran are bad, and the increasing probability of a hot war to take over Taiwan should count as bad, perhaps the most urgently dangerous threat to global peace in the 21st century.
The United States track record post WW2 is a complicated combination of monstrously immoral Kissenger and Bush style overthrows of democracies and genuinely valuable maintenance of a post WW2 democratic order focused on things like free speech and human rights. I stay with full sincerity that in the decade plus that I've been here on hn seeing whataboutism as a strategy for defending China, I'm yet to encounter anything that feels like a sincere engagement with United States role in the world as a combination of positives and negatives, it's always flatly one-sided messaging that feels like it's aimed at a favorable audience that already agree rather than like it's sincerely attempting to persuade.
This war could have been handled much differently and better, but acting like America attacked Iran for no reason is laughable. It is in fact America’s inexplicable reticence to kill Iranian civilians that is the reason this is going on for this long. America could have ended this in a few days if it had stopped worrying about being criticised by the rest of the world that hates it anyway.
https://www.nytimes.com/2026/04/07/us/politics/trump-iran-wa...
https://i.kym-cdn.com/photos/images/original/002/352/212/95b...
China and Russia trade in yuan and rubles. India and Russia do oil deals in rupees. China and Brazil trade in yuan. The US hasn't bombed any of them.
Also, feeling the opening of the internet as a mistake show the degree of your ignorance, people from third world countries also have the right to speak as much as you do, your opinion is not more valid than anyone else's.
For context, I am Italian-Brazilian, so I pretty much have been exposed to both sides (western and non-western, even though we can argue that Brazil is more west aligned).
They sanctioned the hell out of Huawei and now Huawei is bigger than ever
America is just not able to digest the idea that another country can be as good, if not better, at innovation
China's fall in the 19th century came at them for the same reason. How could these European savages be stronger, thus better than us? Our intelligence service must be out of their mind.
Sovereign and non-sovereign nations have completely different decision matrices for dealing with external threats
It costs 100-1000x less manpower, money, and time to hug the heels of innovators than to actually pioneer. Say what you will about America but they absolutely lead technological innovation and it's not even remotely close.
China had literally 60M people die in a famine when JFK was president and Elvis was the biggest thing. The country was basically farmland and basic industries 40 years ago
Why would you even compare their capabilities today vs a country that has been a sovereign nation for 250 years?
You look at trajectories, not the present
Walmart is a horrible company owned by horrible people and yet it’s cheap so it dominates.
If the quality really is in the Opus 4.6 range (considering how bad 4.7 is), then it’s a pretty big deal.
Deepseek is a mid model. not SOTA.
It’s a burned ccp money at this point . They will not be able to serve it until H2 2026 . Even at this point if you look at opus 4.7 and gpt 5.5 this model is just mediocre.
By the time they can serve it nobody will care at all.
Also it's tech they can be sure we can't cut them out of or tariff and money flowing from Chinese companies to other Chinese companies which we appreciate the benefits of when the shoe is on the other foot.
For me as a consumer, competition is good - that means companies have less leverage over me, which is beneficial even if I decided to never use a Chinese model ever.
If/when they overtake the US, all things aside, they deserve it. There is no world where the US overtakes China but there’s a world where China overtakes the US. Best outcome for the US atm is parity.
Just remarkable the things they’ve accomplished in the time they’ve accomplished them.
1. There will be no moat where one company "owns" AI. China will see to that. It's simply too much in their national interest for that not to happen;
2. This is incredibly bad news for OpenAI who have raised so much money with so (comparabley( little revenue that the only way they can get a return on that is to "win" and be that company that "owns" AI; and
3. China's chipmaking will catch up with Taiwan within the next decade (with commercial EUV at scale within 5 years). I liken this to American hubris over the development of the atomic bomb where in 1945 many American leaders and military thought the USSR would either never get the atomic bomb or it would take 20+ years. It took 4. And they USSR's first hydrogen bomb was detonated a year after the US's.
Whereas the USSR did this with espionage. times have changed. Now all China has to do is throw a few million dollars at hiring the right people froM ASML and elsewhere. China has the track record of delivering on long term projects. Closing the lithography gap will be no different.
its naive to think they would have stayed on a 'western' stack.
Most of the time 'losing' isn't making a bad choice its being put in a situation where you have no good choices.
I’ve talked to the folks over at Unitree multiple times and they say “yeah we’ll be hiring overseas soon” and then they never do and they only have five openings in China
You just aren't going to this too much in the US or any countries fully aligned with the US for fear of competition. It doesn't benefit anyone really. It's not like I get richer when Ford says more vehicles or Meta makes more teenagers suicidal, so why should we care? It'll hurt the country in the long run too.
Was expecting that the release would be this month [1], since everyone forgot about it and not reading the papers they were releasing and 7 days later here we have it.
One of the key points of this model to look at is the optimization that DeepSeek made with the residual design of the neural network architecture of the LLM, which is manifold-constrained hyper-connections (mHC) which is from this paper [2], which makes this possible to efficiently train it, especially with its hybrid attention mechanism designed for this.
There was not that much discussion around it some months ago here [3] about it but again this is a recommended read of the paper.
I wouldn't trust the benchmarks directly, but would wait for others to try it for themselves to see if it matches the performance of frontier models.
Either way, this is why Anthropic wants to ban open weight models and I cannot wait for the quantized versions to release momentarily.
[0] https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main...
[1] https://news.ycombinator.com/item?id=47793880
[2] https://arxiv.org/abs/2512.24880
[3] https://news.ycombinator.com/item?id=46452172
Do you have a source?
It's hard not to see Anthropic's messaging of "this tech that we're pushing on you is going to take your job and maybe kill you" as being about anything other than regulatory capture, with the goal of the government shutting down competitors.
I think OpenAI and Anthropic are both really in a tough spot - spending so much on what is becoming a commodity product for which neither seems positioned to be low cost producer. Maybe a bit like the UK-France channel tunnel project where the product itself is a success but a bloodbath for those who invested to build it.
If I considered myself a 10X programmer, now I am 100X. Love DeepSeek.
[1] https://github.com/kuyawa/mecha-ai
For context, for an agent we're working on, we're using 5-mini, which is $2/1m tokens. This is $0.30/1m tokens. And it's Opus 4.6 level - this can't be real.
I am uncomfortable about sending user data which may contain PII to their servers in China so I won't be using this as appealing as it sounds. I need this to come to a US-hosted environment at an equivalent price.
Hosting this on my own + renting GPUs is much more expensive than DeepSeek's quoted price, so not an option.
As a European I feel deeply uncomfortable about sending data to US companies where I know for sure that the government has access to it.
I also feel uncomfortable sending it to China.
If you'd asked me ten years ago which one made me more uncomfortable. China.
But now I'm not so sure, in fact I'm starting to lean towards the US as being the major risk.
It's doesn't seem all that out there compared to the other Chinese model price/performance? Kimi2.6 is cheaper even than this, and is pretty close in performance
With DS tech though the worry is generally more capacity. Haven't seen issues with v4 but in the past their combination of quality and pricing means they get overloaded.
This is a pretty interesting thing they've built in my opinion, and not something I'd expect to be buried in the model paper like this. Does anyone have any details about it? Google doesn't seem to find anything of note, and I'd love to dive a bit deeper into DSec.
In my tests too[0], it doesn't reach top 10. One issue, which they also mentioned in their post, is that they can't really serve well the model at the moment, so V4-Pro is heavily rate-limited and gives a lot of timeout errors when I try to test it. This shouldn't be an issue though, considering the model is open-source, but it makes it hard to accurately test at the moment.
[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...
I would say I wouldn't notice this wasn't Opus 4.6. What I asked was looking at a feature implemented recently, and how it could be improved. Consumed 3.3 million tokens and create a much better flow.
It had a bug when I started the implementation though related to the API, which I suppose it is something they didn't catch when making their API compatible with CC.
I expect once the API issues are fixed, for v4-pro to be around the same level as GLM-5.
(I am confused by the results your website is presenting)
I see that you tried to justify this lower in the thread, but no… it completely invalidates your benchmark. You are not testing the model. You are conflating one specific model host and model performance, and then claiming you are benchmarking the model. All major models are hosted by multiple different services.
In the real world, clients will just retry if there is a server error, and that will not impact response quality at all, and the workflow the model is being used in will not fail. If a workflow is so poorly coded that it doesn’t even have retry logic, then that workflow is doomed no matter which host you use. But again, reliability of the host is separate from the model.
You can make your benchmark valid by having separate leaderboards for model quality and host reliability. I’m not saying to throw the whole thing away. But the current claim is not valid.
And you’re also making an unsourced claim that everyone else has already determined this model sucks? Nah. The first result from Artificial Analysis shows good things: https://x.com/ArtificialAnlys/status/2047547434809880611
But I am still waiting to see the results from the full suite of AA benchmarks.
They have Gemini 2.5 Flash ahead of Opus 4.6: https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu...
Absolutely worthless benchmark but every release has a comment linking to this nonsense.
Why does it matter if the model/architecture/weights are open source or not, given it's their proprietary inference hardware they're currently having issues with? Proprietary or not, the same issue would still be there on their platform.
If the conclusion is: "DeepSeek v4 is this good, if you use it from DeepSeek" (which is how most people would use it anyway), then it makes sense to count API errors as failures.
But if the conclusion must be "The DeepSeek v4 model is this good when self-hosted and run under ideal conditions", then the model should be tested locally, skipping all invalid calls.
I am still debating what I should do in this case: if a leaderboard shows a model as #1, but people who try it through the official provider see it fail half the time, that's not a good leaderboard either.
I am considering adding a "reliability" column: retry API errors until the test completes, BUT track how many retries were needed and compute a separate reliability score. But here comes a different problem: reliability varies over time and across providers, so that's tougher to test.
If you want to measure their API, do so, but don't place it under the same category as testing the model itself, as they're two different metrics.
https://simonwillison.net/2026/Apr/24/deepseek-v4/
Both generated using OpenRouter.
For comparison, here's what I got from DeepSeek 3.2 back in December: https://simonwillison.net/2025/Dec/1/deepseek-v32/
And DeepSeek 3.1 in August: https://simonwillison.net/2025/Aug/22/deepseek-31/
And DeepSeek v3-0324 in March last year: https://simonwillison.net/2025/Mar/24/deepseek/
As in have the model consider its generated SVG, and gradually refine it, using its knowledge of the relative positions and proportions of the shapes generated, and have it spin for a while, and hopefully the end result will be better than just oneshotting it.
Or maybe going even one step further - most modern models have tool use and image recognition capabilities - what if you have it generate an SVG (or parts/layers of it, as per the model's discretion) and feed it back to itself via image recognition, and then improve on the result.
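The refinement loop described above is easy to sketch. Here `generate` is a hypothetical stand-in for the model call (prompt plus previous SVG in, revised SVG out); in a real harness it would be a chat-completion call, possibly with the rendered image attached for the vision round-trip:

```python
def refine_svg(generate, prompt, rounds=4):
    """Iteratively ask a model to critique and improve its own SVG.

    generate(prompt, previous_svg) -> new SVG string; a stand-in for
    whatever model API the harness actually uses.
    """
    svg = generate(prompt, None)  # initial one-shot attempt
    for _ in range(rounds):
        svg = generate(
            prompt + "\nCritique the SVG below for proportions and "
            "relative positions of shapes, then output an improved version.",
            svg,
        )
    return svg
```

Whether the result actually beats one-shotting is an open question, but the harness itself is this small.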
I think it'd be interesting to see, as for a lot of models, their oneshot capability in coding is not necessarily correlated with their in-harness ability, and it's the latter that really matters.
I should try it again with the more recent models.
Could you please try with Opus 4.7? I think there's a chance of it doing well, considering the design/vision focus.
Let me tell you how much the Pro one sucks... It looks like a failed Pedersen[1]. The rear wheel intersects with the bottom bracket, so it wouldn't even roll. Or rather, this bike couldn't exist.
The flash one looks surprisingly correct, with some wild fork offset and the slackest of seat tubes. It's got some lowrider[2] aspirations with the small wheels, but with longer, Rivendellish[3], chainstays. The seat post sits at a different angle than the seat tube, so good luck lowering that.
[1] https://en.wikipedia.org/wiki/Pedersen_bicycle
[2] https://en.wikipedia.org/wiki/Lowrider_bicycle
[3] https://www.rivbike.com/
I wonder which model will try some more common spoke lacing patterns. Right now there seems to be a preference for radial lacing, which is not super common (but simple to draw). The Flash and Pro ones use 16-spoke rims, which actually exist[1] but are not super common.
The Pro model fails badly at the spokes. Heck, the spokes sit on the outside of the drive side of the rim and tire. Have a nice ride riding on the spokes (instead of the tire) welded to the side of your rim.
Both bikes have the drive side on the left, which is very very uncommon. That can't exist in the training data.
[1] https://cicli-berlinetta.com/product/campagnolo-shamal-16-sp...
1) An LLM is not AGI, because surely AGI would imply that Pro does better than Flash?
2) And because of the above, the pelican example is most likely already being benchmaxxed.
at the top of the linked pages.
How much does the drawing change if you ask it again?
Have you noticed deepseek-v4-pro performing worse than deepseek-v4-flash? It performed even worse than qwen3.5-27b. I found it surprising, and I'm wondering if there is a bug in my software, because I had to implement sending the `reasoning_content` back, otherwise the API failed with a BadRequestError.
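For anyone hitting the same BadRequestError: per the comment above, the v4 API appears to want the assistant's `reasoning_content` field passed back on subsequent turns. Treat the exact field handling as an assumption (earlier deepseek-reasoner docs required the opposite). A minimal helper that preserves it when rebuilding the message history:

```python
def append_assistant_turn(messages, response_message):
    """Append an assistant turn, keeping reasoning_content if present.

    response_message is the dict-like message from an OpenAI-compatible
    chat completion response.
    """
    turn = {
        "role": "assistant",
        "content": response_message.get("content", ""),
    }
    # Field name taken from the comment above; assumed required by the v4 API.
    if response_message.get("reasoning_content"):
        turn["reasoning_content"] = response_message["reasoning_content"]
    messages.append(turn)
    return messages
```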
It's five times bigger in both total and active parameters!
The website now has a link to the announcement on Twitter here https://x.com/deepseek_ai/status/2047516922263285776
Copying text of that below
DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length.
DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models.
DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice.
Try it now at http://chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today!
Tech Report: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main...
Open Weights: https://huggingface.co/collections/deepseek-ai/deepseek-v4
https://xcancel.com/deepseek_ai/status/2047516922263285776
"Due to constraints in high-end compute capacity, the current service capacity for Pro is very limited. After the 950 supernodes are launched at scale in the second half of this year, the price of Pro is expected to be reduced significantly."
So it's going to be even cheaper
Gemini-3.1-Pro at 91.0
Opus-4.6 at 89.1
GPT-5.4, Kimi2.6, and DS-V4-Pro tied at 87.5
Pretty impressive
If AI was so good at coding, why can’t it actually make a usable Gemini/AI Studio app?
In my experience, Gemini is the most insightful model for hard problems (particularly math problems that I work on).
https://api-docs.deepseek.com/guides/coding_agents#integrate...
But in this case, it's more likely just to be a tooling issue.
Stuff that was prohibitive six months ago is now up for grabs. We keep working at the infra level now, switching models whenever we run out of credits or want a different result. The question is how we build context and architecture and ensure the agent is effective and efficient. Wouldn't it be good if we simply used less energy to make these AI calls?
[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...
Where previously I was wary of under-providing intelligence, I'm now more excited about the idea of being able to give these pretty large, intelligent models to my application. For sub-agents, we can fine-tune them and reasonably expect them to perform as well as Opus on a specific subtask, of which my applications have many.
In other words, we can run a general-purpose intelligent model, Sonnet or Opus, orchestrating a fleet of, let's say, 30 to 50 of these fine-tuned sub-agents. By doing that, I can get very low pricing versus what it would have cost if I used Opus or Sonnet for everything.
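The routing layer of that architecture can be sketched very simply. Model names and task types here are purely illustrative, not a real API:

```python
# Hypothetical mapping from subtask type to a cheap fine-tuned model;
# anything unrecognized falls back to the expensive generalist.
SUB_AGENTS = {
    "extract_fields": "ft-flash-extractor",
    "classify_ticket": "ft-flash-classifier",
    "summarize_thread": "ft-flash-summarizer",
}
FALLBACK = "opus"

def route(task_type):
    """Pick the fine-tuned sub-agent for a task, else the orchestrator model."""
    return SUB_AGENTS.get(task_type, FALLBACK)
```

The hard part, as the replies below note, is not the dispatch table but making the fine-tuned models actually good enough at their subtasks.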
I've heard so many people saying this for the last year, and even tried doing it myself too, and never seen a successful application of it, nor succeeded myself either with SOTA models that are smart but slow or local models that are dumb but fast (even with beefy hardware).
What makes you believe this is possible in the first place? Every "swarm of agents" implementation I've seen only been able to produce lowest quality of code, most of the time vastly bloated, but surely you must have seen something working in practice that you could share with the rest of us?
Kimi 2.6 went hard and left me with a buggy mess. GLM 5.1 hedged and made a 25 line change (but it was an improvement). DS V4 went hard, fixed its issues along the way, and left me with a significantly nicer codebase! (...that I will now be spending some time testing before releasing to the project)
[0]: lmcli (simple, Go, nice UX, MIT licensed, works well with DS V4) https://codeberg.org/mlow/lmcli
dang, probably the two should be merged and that be the link
Are there comparisons between Pro non-thinking and Flash thinking? I don't really get the use case for Flash thinking or Pro non-thinking.
Not gonna happen
Which strikes me as odd - I would have assumed someone had an edge of at least 10% extra GPUs.
Keep an eye on https://huggingface.co/unsloth/models
Update ten minutes later: https://huggingface.co/unsloth/DeepSeek-V4-Pro just appeared but doesn't have files in yet, so they are clearly awake and pushing updates.
I have never tried one yet but I am considering trying that for a medium sized model.
As I understand it, if DeepSeek v4 Pro is 1.6T total with 49B active, only ~49B params are read per token, so ~100GB at 16-bit or ~50GB at 8-bit for the active weights.
v4 Flash is 284B, 13B active so might even fit in <32GB.
V4 is natively mixed FP4 and FP8, so significantly less than that. 50 GB max unquantized.
My Mac can fit almost 70B (Q3_K_M) in memory at once, so I really need to try this out soon at maybe Q5-ish.
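The arithmetic in the comments above is just params × bytes-per-param. A quick sketch (precision in bits, ignoring KV cache and runtime overhead):

```python
def weight_gb(params_billion, bits):
    """Approximate weight memory in GB: params * (bits / 8) bytes."""
    return params_billion * 1e9 * (bits / 8) / 1e9

# 49B active params: 98 GB at 16-bit, 49 GB at 8-bit.
# Note: for an MoE model the full parameter set (1.6T here) still has to
# be stored somewhere; only the per-token bandwidth scales with active params.
```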
Streaming weights from RAM to GPU for decode makes no sense at all because batching requires multiple parallel streams.
Streaming weights from SSD _never_ makes sense because the delta between SSD and RAM is too large. There is no situation where you would not be able to fit a model in RAM and also have useful speeds from SSD.
Note: these were just two that I starred when I saw them posted here. I have not looked at them seriously yet.
https://github.com/danveloper/flash-moe
https://github.com/t8/hypura
"Not seduced by praise, not terrified by slander; following the Way in one's conduct, and rectifying oneself with dignity." (不诱于誉,不恐于诽,率道而行,端然正己)
(It is mainly used to express the way a Confucian gentleman conducts himself in the world. It reminds me of an interview I once watched with an American politician, who said that, at its core, China is still governed through a Confucian meritocratic elite system. It seems some things have never really changed.
In some respects, Liang Wenfeng can be compared to Linux. The political parallel here is that the advantages of rational authoritarianism are often overlooked because of the constraints imposed by modern democratic systems. )
Strix halo has 256 GB/s bandwidth for $2500. The Flash model has 13 GB activations.
256 / 13 = 19.7 tokens per second
Except you cannot fit it into the 128 GB maximum RAM that Strix Halo supports. So move on.
Another option is Threadripper. That's 8 memory channels. Using older DDR4-3200 you get roughly 200 GB/s. For $2000.
200 / 13 = 15.4 tokens per second
But, a chunk of per-token weights is actually always the same and not MoE, so you would offload that to a GPU and get a decent speedup. Say 25 tokens per second total.
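The estimates above are simply memory bandwidth divided by bytes read per token, since decode has to stream all active weights once per token. As a sketch:

```python
def decode_tps(bandwidth_gb_s, active_gb):
    """Rough decode tokens/sec: each generated token reads all active
    weights once, so throughput = memory bandwidth / active-weight size."""
    return bandwidth_gb_s / active_gb

# Strix Halo: 256 GB/s over ~13 GB of active weights -> ~19.7 t/s
# Threadripper, 8ch DDR4-3200 (~200 GB/s) -> ~15.4 t/s
```

This is an upper bound; real throughput is lower once compute, cache misses, and KV-cache reads enter the picture.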
Then likely some expensive Mac. No idea.
Eventually you arrive at a mining rig chassis with a beefy board and multiple GPUs. That has the benefit of pipelining. You run part of the model on one GPU and move on, so another batch can start on the first one. Low (say 30-100) tps individually, but a lot more in parallel. Best get it with other people.
A mac with 256 GB memory would run it but be very slow, and so would be a 256GB ram + cheapo GPU desktop, unless you leave it running overnight.
The big model? Forget it, not this decade. You can theoretically load from SSD but waiting for the reply will be a religious experience.
Realistically the biggest models you can run on local-as-in-worth-buying-as-a-person hardware are between 120B and 200B, depending on how far you’re willing to go on quantization. Even this is fairly expensive, and that’s before RAM went to the moon.
The flash version here is 284B A13B, so it might perform OK with a fairly small amount of VRAM for the active params and all regular ram for the other params, but I’d have to see benchmarks. If it turns out that works alright, an eBay server plus a 3090 might be the bang-for-buck champ for about $2.5K (assuming you’re starting from zero).
https://news.ycombinator.com/item?id=47885014
https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
Codex shows ~258k for me and Claude Code often shows ~200k, so I’m curious how DeepSeek is exposing such a large window.
The 1M window might be usable, but it will probably underperform against a smaller window of course.
DeepInfra, as far as I'm aware, doesn't log your prompts and doesn't retain them in most cases, except for "debugging purposes". As per their privacy policy[1]: "We understand that the inputs you provide to our API and the outputs it generates may contain your Personal Information. We will not store, sell, or train using this data unless we have your explicit consent. We might sometimes store, for a limited period of time, the inputs and outputs to API calls for debugging purposes."
They're not EU-based, though, and I'm not sure how "private" their inference actually is. The throughput is also not the best; sometimes it can be really slow (although right now both DeepSeek-V4 models seem to be doing fine). However, they have good pricing, probably one of the best on the market.
I'm not affiliated with them in any way, but when I want to test something too big for my local hardware (I'm not a power user of LLMs, chatbots, or agents, not at all; I do it just out of curiosity), DeepInfra is usually my go-to provider.
[1] https://deepinfra.com/privacy
DeepSeek also tends to follow prompts more closely IME, plus the thinking is shown, so I think it's able to register as a 'tool' more easily for the non-tech-inclined for whom that appeals.
Yes, you're absolutely right, and no, Jordan Keller does not work for Moonshot. He is the original author of the algorithm, so credit goes to him.
There's a lot of legwork to go from prototyping to proper development though. The reason I said what I did is because Moonshot has the first research publication on it that I'm aware of. Could definitely have used better language though, my apologies to Jordan!
is this not cool tech, available for use?
I look forward to seeing what gets made on top of DeepSeek 4, more than what it means for US politics.
Especially with how open DeepSeek is with its advancements, I'm excited to see how they get applied to SOTA Western models.
I hope that DeepSeek wins the AI race or at least gets ahead to the point where it becomes infeasible for bans and regulations against it. It's ridiculous that American legislators are advocating for less regulations for DeepSeek except for their own racist ideas about which AI should be approved or not.
> We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models — DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) — both supporting a context length of one million tokens. DeepSeek-V4 series incorporate several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; (3) and the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for open models, outperforming its predecessors in core tasks. Meanwhile, DeepSeek-V4 series are highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.
1: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main...
It's trendy to say the US govt is now authoritarian, but that's just pure naïve groupthink.
But if it does, then in the following week we'll see DeepSeek 4 flood every AI-related online space: thousands of posts swearing it's better than the latest models from OpenAI/Anthropic/Google but costs only pennies.
Then a few weeks later it'll be forgotten by most.
If one finds it difficult to set up OpenCode to use whatever providers they want, I won't call them 'dev'.
The only real friction (if the model is actually as good as SOTA) is to convince your employer to pay for it. But again if it really provides the same value at a fraction of the cost, it'll eventually cease to be an issue.