Discussion (218 Comments)
But there is no best one. There's just the best one for you, based on whatever your criteria is. It's likely we'll end up in a "Windows vs MacOS vs Linux" style world, where people stick to their camps that do a particular thing a particular way.
They are open source and cost waaaay less per token than American models.
I’m using them right now on the $20 Ollama cloud plan and I can actually work with them on my side projects without reaching the limits too much. With Claude Pro $20 plan my usage can barely survive one or two prompts.
And I chose Ollama cloud just because their CLI is convenient to use, but there are a lot of other providers for those models, so you aren't even stuck with shitty conditions and usage rules.
To me that's a pretty bad thing for the American economy.
You know, for the rest of the economy that is not big tech.
And investors pumping money into the US AI circular money flow just make innovation everywhere else slower. If not for the GPU/memory drought, running stuff locally (or just in a competing cloud) would be far cheaper.
On a different note, is Ollama cloud good?
I'd say they have reliability issues but for the price it's worth it.
I like that usage isn't measured per token but per computation time, which means that you get more usage when models become more efficient.
There is more to the American economy than big tech.
And that's precisely why this has started: https://www.wired.com/story/super-pac-backed-by-openai-and-p...
Most of the stock market valuation is big-tech, and most of people's retirements are the stock market, so... if the AI bubble bursts a lot of the US will be affected.
The author didn't do any of that. They ran each model once on each of 13 (so far) problems and then they chose to highlight the results for the 12th problem. That's not even p-hacking, because they didn't stop to think about p-values in the first place.
LLM quality is highly variable across runs, so running each model once tells you about as much about which one is better as flipping two coins once and having one come up heads and the other tails tells you about whether one of them is more biased than the other.
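The coin-flip point can be made concrete with a quick simulation (a hypothetical sketch; `single_run_verdict` and the 60% solve rate are made up for illustration):

```python
import random

def single_run_verdict(p_a, p_b, trials=100_000, seed=0):
    """Fraction of experiments where a single pass/fail run per model
    declares a 'winner', even when that verdict is uninformative."""
    rng = random.Random(seed)
    decided = 0
    for _ in range(trials):
        a = rng.random() < p_a  # model A solves the problem on this run
        b = rng.random() < p_b  # model B solves it on this run
        if a != b:
            decided += 1
    return decided / trials

# Two equally good models (60% solve rate each) still produce a
# "winner" on a single problem almost half the time.
print(single_run_verdict(0.6, 0.6))  # ≈ 0.48
```

In other words, a single-run head-to-head crowns a winner nearly half the time even between identical models, which is exactly the coin-flip argument above.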
I reckon we'll have similar suites comparing different aspects of models.
And, at some point, we'll be dealing with models skewing results whenever they detect they're being benchmarked, like it happened before with hardware. Some say that's already happening with the pelican test.
Most people who have computers could run inference for even the biggest LLMs, albeit very slowly when they do not fit in fast memory.
On the other hand, training or even fine tuning requires both more capable hardware and more competent users. Moreover the effort may not be worthwhile when diverse tasks must be performed.
Instead of attempting fine-tuning, a much simpler and more feasible strategy is to keep multiple open-weights LLMs and run them all for a given task, then choose the best solution.
This can be done at little cost with open-weights models, but it can be prohibitively expensive with proprietary models.
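A minimal sketch of that fan-out strategy (hypothetical: `run_model` and `score` stand in for whatever inference calls and evaluation, e.g. running a test suite, you actually use):

```python
# Fan one task out to several open-weights models and keep the
# best-scoring answer, instead of fine-tuning any single model.
def best_of_models(task, models, run_model, score):
    candidates = [(name, run_model(name, task)) for name in models]
    return max(candidates, key=lambda pair: score(pair[1]))
```

With open weights this costs only extra inference; with proprietary per-token pricing, running every model on every task adds up fast.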
We as an industry cannot determine if one software engineer is objectively better than another, on practically any dimension, so why do we think we can come to an objective ranking of models?
But I'm more optimistic about testing programming models. You can run repeated tests, and compare median performance. You can run long tests, like hundreds of hours, while getting more than a few humans to complete half-day tests is a huge project. And you can do ablation testing, where you remove some feature of the environment or tools and see how much it helps/hurts.
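A rough sketch of the repeated-runs and ablation ideas (hypothetical helpers; `run_once` stands in for one full benchmark run that returns a score):

```python
from statistics import median

def benchmark(run_once, n_runs=9):
    """Run the same eval n times and report the median score,
    which is far more stable than a single sample."""
    return median(run_once() for _ in range(n_runs))

def ablation(run_with, run_without, n_runs=9):
    """Score delta when a tool or environment feature is removed."""
    return benchmark(run_with, n_runs) - benchmark(run_without, n_runs)
```

This is cheap for models and prohibitively expensive for humans, which is the asymmetry the comment points at.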
And we can judge developer performance; it just takes 6 months to a year of working with a team, so it's hard to turn into a metric.
https://ghzhang233.github.io/blog/2026/03/05/train-before-te...
It just hasn't been widely adopted yet. And it might be in each of their particular interests that it continues to stay so for a while. It's basically like p-hacking.
I try one or two of my use cases with new models or harnesses, make my own often subjective judgements, and largely ignore benchmarks.
Blogging and writing in general are a business, or feed other tech adjacent businesses, and a lot of writing about evals is attention getting - nothing wrong with that but there is a lot of noise.
Because they're non-deterministic, because of constant updates and changes, and because the models are throttled according to the number of users, releases, etc.
It's very difficult to justify spending on their models in a world where DeepSeek costs a fraction and Chinese open models exist and perform as well as what is considered state of the art; it only depends on you adjusting how you use them.
A couple of days ago I canceled ChatGPT and started to try out DeepSeek. Let's see how it goes.
> The Word Gem Puzzle is a sliding-tile letter puzzle. The board is a rectangular grid (10×10, 15×15, 20×20, 25×25, or 30×30) filled with letter tiles and one blank space.
Just last week my superior asked me to implement that for a customer. /s
Maybe some real, real-world task would be good? Add some database, some REST, some random JS framework and let it figure out a full-stack task instead of creating some rectangles?
We've been doing this at scale at https://gertlabs.com/rankings, and although the author looks to be running unique one-off samples, it's not surprising to see how well Kimi K2.6 performed. Based on our testing, for coding especially, Kimi is within statistical uncertainty of MiMo V2.5 Pro for top open weights model, and performs much better with tools than DeepSeek V4 Pro.
GPT 5.5 has a comfortable lead, but Kimi is on par with or better than Opus 4.6. The problem with Kimi 2.6 is that it's one of the slower models we've tested.
Fits with another comment from yesterday on here saying the flash models are just better at tool calling.
Planning with GPT 5.5 and implementation with a flash model could be the bang-for-the-buck route.
And this seems very much in line with the methodology in ARC-AGI-3.
The results here, in the OP article and in https://www.designarena.ai all tell a similar story: Kimi K2.6 is up and in the SOTA mix.
For our testing, we use hundreds of different environments across disciplines, and it seems to line up with subjective experience better than other benchmarks. We test coding, agentic coding, and non-coding reasoning in the environments.
Would you? I am not very knowledgable on LLMs, but my understanding was that each query was essentially a stateless inference with previous input/output as context. In such a case, a single puzzle, yielding hundreds of queries, is essentially hundreds of paths dependent but individual tests?
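That statelessness can be sketched in a few lines (hypothetical `call_llm` standing in for any chat-completion API; the point is that the full transcript is resent on every query):

```python
def run_conversation(call_llm, user_turns):
    """Each turn is a fresh, stateless call: the model's only 'memory'
    is the transcript we resend in full every time."""
    messages = []
    for turn in user_turns:
        messages.append({"role": "user", "content": turn})
        reply = call_llm(messages)  # entire history goes back over the wire
        messages.append({"role": "assistant", "content": reply})
    return messages
```

So yes, one puzzle generating hundreds of queries is really hundreds of path-dependent but individually sampled calls, which is part of why run-to-run variance is so high.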
Not only is performance dependent on the language and tasks given, but also on the prompts used and the expected results.
In my own internal tests it was really hard to judge whether GPT 5.5 or Opus 4.7 is the better model.
They have different styles and it's basically up to preference. There were even times where I gave the win to one model only to think about it more and change my mind.
At the end of the day I think I slightly prefer Opus 4.7.
It's a strong signal for a job, but the soft skills are sometimes going to get Claude Opus 4.6 a job over smarter applicants. That's what we'd really like to measure objectively, and are actively working on.
I just had an issue where Claude CLI with Opus 4.7 High could not figure out why my Blazor Server program was inert, buttons didn't do anything etc. After several rounds, I opened the web console and found that it failed to load blazor.js due to 404 on that file. I copied the error message to Claude CLI and after another several unproductive rounds I gave up.
I then moved to Codex, with ChatGPT 5.5 High. I gave it the code base, problem description and error codes. Unlike Claude CLI it spun up the project and used wget/curl to probe for blazor.js, and found indeed it was not served. It then did a lot more probing and some web searches and after a while found my project file was missing a setting. It added that and then probed to verify it worked.
So Codex fixed it in about 20 minutes without me laying hands on it (other than approve some program executions).
However, I'm not convinced this shows GPT 5.5 being that much better than Opus 4.7. It could very well be the harness around it, the system prompts used in the harness and tools available.
For reference this was me just trying to see how good the vibecoding experience is now, so was trying to do this as much hands-off as possible.
My guess is that it is the fault of the model rather than the harness, I believe Opus to be much worse than it was for whatever reasons. Though I suppose it could be Code’s fault somehow. For the time being though Codex is much better which I never thought I’d be saying.
I plan to run tests using Pi so they have the same system prompt and harness, but I’m suspicious that it’s only the subscription level Claude Code that is worse and we’re not allowed to use that with Pi.
A model that can more effectively make use of the tools presented to it is going to be better. You're not wrong about the system prompt; these can have quite a pronounced effect, especially when what the agent is bridging to is not just a case of bash + read/write; you need the prompt (and tool descriptions) to steer and reinforce what it should actually do because most models are heavily over-trained on executing bash lines.
When it comes to more basic agent usage that just runs in a terminal and executes bash ultimately most models are going to do just fine as long as you provide the very basics.
Regarding your case in this post it could be any number of issues: The provider being over-provisioned, leaving less time for your case, the model just not being particularly great, your previous context (in your original session) subtly nudging the model to not do the correct thing, and so on.
The truth is that you simply can't really know what the exact cause of this behavior you experienced is, but I think you're also working hard to cope on behalf of Anthropic.
All in all I think you're placing a bit too much faith in agents and their effect. If you slim down and use something like Pi instead you'll likely get a more accurate sense of what agents do and don't do, and how it affects things. You can then also add your own things and experiment with how that impacts things as well.
I've written an agent that only allows models to send commands to Kakoune (a niche text editor that I use) and can say that building an agent that just executes bash + read/write in 2026 is probably the easiest proposition ever. I say this because a lot of the work I've had to do has been to point them in the direction of not constantly trying to write bash lines; models all seem to tend towards this so if you just wanted to do that anyway most of your work is already done. The vast amount of the work in those types of agents is better spent fixing model quirks and bad provider behavior in terms of input/output.
While it may be possible to get better numbers from certain providers, we try to establish a common baseline. I.e. if we measure that Kimi K2.6 averages 450s on a task and GLM 5.1 averages 400s, you might be able to improve that number on a provider like Fireworks but GLM 5.1 would also likely be 10% faster on the premium provider. This is a caveat worth considering when comparing to proprietary model speeds on the site, though.
Looking back at ChatGPT and Claude a couple of years ago, very small Qwen models are basically equal in coding to what those cloud-based models could do then. Also factoring in scaling laws (going from 9B to 18B is roughly a 40% increase, whereas 18B to 35B is 20%), I expect there will be at least a change in price for cloud-based models.
Adobe used to be $600 per month, then it became $20 when distribution scaled.
The simple truth is, cloud models are always going to be strictly superior to open ones, simply because cloud model vendors can run those same open models too. And they still retain the economies of scale and efficiency that come from operating large data centers full of specialized hardware, so at the very least they can always offer open models at a price per token that's much less than anyone else's electricity bill for the compute. But on top of that, they still have researchers working on models and everything around them; they can afford to put top engineers on keeping their harness always ahead of whatever is currently most popular on Github, etc.
This proves the strict inequality in my claim is preserved, everything beyond that is just debating the size of their advantage.
What if you have a good-enough model but the cloud model providers are better at procuring hardware for inference?
What product is this referring to? I haven't heard about Adobe having any offering that is quite that expensive?
I have been using Sonnet and others (DeepSeek, ChatGPT, MiniMax, Qwen) for my compiler/vm project and the Claude Pro plan is mostly unusable for any serious coding effort. So I use it in chat mode in the browser where it cannot needlessly read your entire project, and use Kimi on the OpenCode Go plan with pi.
Kimi consistently exceeded Sonnet on the C+Python project. Never had to worry about it doing anything other than what I asked it to do. GLM crapped the bed once or twice. Kimi never did.
Why? Seems to go against the opinion of the masses who mostly use Claude Pro for serious coding.
All these coding tools are extremely wasteful as far as resources are concerned. Almost designed to make you move to the next tier. You have to consciously restrict their scope all the time to make your plans last. Even with Kimi/MiniMax a 3-4 hour session often ends up with 50-70M cached reads. Not a small amount at all.
Kimi K2.6 is definitely a frontier-sized model, so on the one hand it's not that surprising it's up there with the closed frontier models.
Being open is nice though, even though it doesn't matter that much for folks like me with a single consumer GPU.
Of course it matters because that makes coding plans much cheaper than those from Anthropic and OpenAI.
For personal use I have coding plans with GLM 5.1, Kimi K2.6, MiniMax M2.7 and Xiaomi MiMo V2.5 Pro and I am getting a lot of bang for the buck.
With Claude Max I was hitting the limits very fast.
You can always distill this for your little RTX at home. But models shaped for consumer hardware will never win wide adoption or remain competitive with frontier labs.
This is something that _can_ compete. And it will both necessitate and inspire a new generation of open cloud infra to run inference. "Push button, deploy" or "Push button, fine tune" shaped products at the start, then far more advanced products that only open weights not locked behind an API can accomplish.
Now we just need open weights Nano Banana Pro / GPT Image 2, and Seedance 2.0 equivalents.
The battle and focus should be on open weights for the data center.
Open weights is great if you want to do additional training, or if you need on-prem for security.
The power of giving universities, companies, and hackers "full" models should not be understated.
Here are just a few ideas for image, video, and creative media models:
- Suddenly you're not "blocked" for entire innocuous prompts. This is a huge issue.
- You can fine tune the model to learn/do new things. A lighting adjustment model, a pose adjustment model. You can hook up the model to mocap, train it to generate plates, etc.
- You can fine tune it on your brand aesthetic and not have it washed out.
The value of open source is not that you will run it locally, it's that anyone can run it at all.
Even if you can't afford to purchase the hardware to run large open source models, someone could, price it at half the cost of the closed-source models, and still make a profit.
The only reason you are not seeing that happen right now is because the current front-running token-providers have subsidised their inference costs.
The minute they start their enshittification the market for alternatives becomes viable. Without open-source models, there will never be a viable alternative.
Even if they wanted to charge only 80% of what a developer costs, the existence of open source models that are not far behind is a forcing function on them. There is no moat for them.
The enshittification will go unnoticed at first but I'm already finding my favourite frontier models severely nerfed, doing incredibly dumb stuff they weren't in the past.
We need open weight models to have a stable "platform" when we rely on them, which we do more and more.
That said, I do fully agree that it is valuable to have open near-frontier models, as a balance to the closed ones.
That would take something close to a global conspiracy of every technologist lying continuously to keep the tweaks secret. If necessary, I personally will rent some servers and run a vanilla Kimi K2.6 deployment for people to use at reasonable prices. I don't expect to ever make good on that threat because they are grim times indeed if I'm the first person doing something AI related, but the skill level required to load up a model behind an API is low.
So it isn't hard to see how there will be unadulterated Kimi models available, and from there it is really, really straightforward to tell if someone is quantising a model; just run some benchmarks against 2 different providers who both claim to serve the same thing. If one is quantising and another isn't, there's a big difference in quality.
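The two-provider check described here can be as simple as the following (a hedged sketch; the 0.05 threshold is arbitrary, and a real comparison needs many runs per provider given the run-to-run variance discussed elsewhere in this thread):

```python
from statistics import mean

def compare_providers(scores_a, scores_b, threshold=0.05):
    """Crude quantisation check: two providers claiming identical weights
    shouldn't differ much on the same benchmark. A big gap is suspicious."""
    gap = mean(scores_a) - mean(scores_b)
    return abs(gap) > threshold, gap
```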
The current ranking of all tests makes more sense (well, except for how well Gemini does)
https://aicc.rayonnant.ai
Personally what I've found that has made coding agents more and more useful over the last year is that they have gotten a higher and higher floor, not that they have gotten a higher and higher ceiling. They were already plenty smart a year ago, it was just that they failed so often and so spectacularly that it made them a liability. Now they have become much more reliable, which is the key thing that has transitioned them into being actually useful. For the most part I don't use them to work on really intellectually difficult tasks. I mostly use them to work on very boring and labor intensive tasks. Most commercial software development work is just boring drudgery like this. Certainly the bulk of what I need them for is. I need them to just not crap their pants all the time while they're at it.
So I'm kinda wary seeing the poor reliability of Kimi.
I'm not sure this is enough data to form an opinion, but going by what we have Kimi would be as reliable as Claude
DNP = Did not participate
In this regard, Kimi got more and better medals than Claude.
I could easily see us in a place 2 years from now where this coding application is fully commoditised.
I would like to see more effort making the flash variants work for coding. They are super economical to use to brute force boilerplate and drudgery, and I wonder just how good they can be with the right harness, if it provides the right UX for the steering they require.
As much as vibe coding has captured the zeitgeist, I think long term using them as tools to generate code at the hands of skilled developers makes more sense. Companies can only go so long spending obscene amounts of money for subpar unmaintainable code.
Yes, at least probably with each other
>Truly a great time to be alive.
Do wish outcome had been better for those whose work ended up in the training sets, wish competition could’ve found ways to agree on distillation practices, wish globally we’d planned as fast as we’re developing…
Tremendous excitement too
Still interesting though. The fact that an open weight model is close enough for that to matter is probably the real story.
Maybe it’s better in one particular case here and there and I think this blog post is example of that.
There always seem to be pockets where closed frontier models perform slightly better.
This has already happened.
I have downloaded both the big Pro model and the smaller but multimodal MiMo-V2.5.
https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro
https://huggingface.co/XiaomiMiMo/MiMo-V2.5
https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-Base
https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Base
The download of MiMo-V2.5-Pro takes 963 GB, while that of MiMo-V2.5 takes 295 GB.
For comparison, the download of Kimi-K2.6 takes 555 GB.
Not as good or as fast as Claude Code on Opus now but definitely enough for casual/hobby use. The best part is multiple choices for providers, if opencode gimps their service, I’ll switch
They are at best 30 days behind, and at worst 2 months behind. The last issue is being able to run the best one on conventional hardware without a rack of GPUs.
The MacBooks and Mac minis are behind on hardware, but in the next 2 years at worst, the advancements of the M-series machines will make it possible.
All of this is why companies like Anthropic feel like they have to use "safety" to stop you from running local models on your machine and get you hooked on their casino wasting tokens with a slot machine named Claude.
In the real world, you don't hire a plumber and expect him to also do your landscaping, fix your car, and tailor your clothes. It would seem like a much better use of resources if I could download an app that specialized in shell, Python, and C coding for example, or maybe even that would be 3 apps that communicated. Maybe I could even run them on a regular machine with 16GB of RAM. I don't need one huge model that can do that and code in Fortran, COBOL, and Lisp.
As humans, we've done pretty well by specializing. I hope this gets explored more with smaller, focused AI models vs the current path of one model to rule them all that can only be run in a data center the size of a country.
I would daresay for "coding tasks", you actually _want_ a model that can code "in all languages".
Sure, it might be that outdated language XYZ is really useless to you or the task you want, but being exposed to their limits, philosophy and concerns across environment, framework and organization, among other things, means for example you get insights of your problems from other areas and points of view.
That's after all how we got Newtonian physics and calculus, right? A person studying physics someday noticed how the "math of the day" wasn't able to calculate some results without a lot of elbow grease. He then "found" the "missing math" and with it was able to generalize what at the time was considered a bunch of isolated phenomena into a cohesive corpus of knowledge.
So for example, I want my code to have mechanical sympathy like Fortran; well defined input/output interfaces, and not-interweaved control structures, like COBOL; stateless, side-effects-free business logic like Lisp.
People have claimed that fine-tuning is good because no model can be 'that' general ever since GPT-3, and with every generation it becomes less true.
Its weakness is that it seems to yak on and on when it needs to plan out something big or read through and make sense of how to use a niche piece of a complex library. To the point where it can fill up its 256k window and rack up a bill (no cache). I have had better experience with GLM 5.1 in those cases.
Anyone out there relate?
https://www.maxtaylor.me/articles/i-benchmarked-caveman-agai...
> Caveman only affects output tokens — thinking/reasoning tokens are untouched.
The problem is the thinking. But it could help to tune my system prompt for Kimi.
Not to invalidate these benchmark results, because they are useful, but the real usefulness is what they are capable of doing when real people interact with them at scale.
Regardless, this is good news, because now that Microsoft is basically giving up on its all-in strategy with Github's Copilot and Anthropic is playing the "I'm too good for you" game, it's about time for them to get pressed into not making this AI world a divide between the haves and the have-nots.
I have to use a supposedly frontier model at work and I hate it.
As I said, you can blame the model, but it is nothing that the harness cannot take care of more deterministically.
It is a lot trickier to use Kimi compared to Sonnet, hence why it seems that Sonnet is more powerful, while I think it is down to the harness.
If someone were to not use your harness and rather use some stock harness though, what is the one that you would recommend? I am curious about that too.
The initial models were corrected by programmers which gave a very high quality feedback signal. Whereas with vibe coding on the rise, you’ll lose that signal.
Q8 K XL quantization for instance is around 600GB on disk. I would bet about 700GB of VRAM needed.
Quantizations lower than Q8 are probably worthless for quality.
Or 2.05TB on disk for the full precision GGUF.
https://huggingface.co/unsloth/Kimi-K2.6-GGUF
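For back-of-envelope estimates like these, the arithmetic is just parameters times bits per weight (a rough sketch; the overhead factor for KV cache, activations, and metadata is a guess, not a measurement):

```python
def model_size_gb(n_params_billion, bits_per_weight, overhead=1.1):
    """Back-of-envelope disk/VRAM estimate for a quantized model.
    `overhead` is a guessed fudge factor for KV cache and metadata."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9
```

For example, a hypothetical 1T-parameter model at 8 bits per weight lands around a terabyte on disk, the same ballpark as the sizes quoted above; full-precision 16-bit weights double that.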
If you can afford the hardware to run Kimi K2.6 at any decent speed for more than 1 simultaneous user, you probably have a whole team of people on staff who are already very familiar with how to benchmark it vs Claude, GPT-5.5, etc.
If you run it on your own hardware, you can run it 24/7 without worrying about token price or reaching the subscription limits and it is likely that you can do more work, even on much slower hardware. Customizing an open-source harness can also provide a much greater efficiency than something like Claude Code.
For any serious application, you might be more limited by your ability to review the code, than by hardware speed.
https://huggingface.co/moonshotai/Kimi-K2.6
I have downloaded Kimi-K2.6 (the original release).
For comparison (sorted in decreasing size: 3 bigger models and 3 smaller models, all recently launched):
We know these models can solve much more difficult problems; something isn't right.
Gemini doesn't feel quite as smart these days. It does well with very long conversations. Except it has bugs where all context gets lost or pruned, and it will lie and gaslight about it. There's also no branching, so once context is lost you have to start over. Presentation is decent. Empathy is fairly good, except if users get frustrated, it gets more and more flustered and breaks down.
Awesome to have an open model that can compete, but damn, it would be so much better if you could run it locally. Otherwise, it's almost so difficult to run (e.g. self-host) that it's just way more convenient to pay OpenAI, Claude, etc.
Getting a coding plan from Kimi.com will make coding 20x cheaper than using Anthropic.
BTW, I am using it with Claude Code.
What I do see in my own work and that of others around me, is that Claude consistently outperforms Gemini and to a lesser extent Codex.
With Claude eating tokens with declining return, concessions have to be made and Codex is a usable middle ground.
I use Kimi in Kagi's Assistant for non-code or generic programming questions and am quite happy with its no-bullshit responses.
Now imagine a company burning $200,000/month on AI spend. Real numbers. Not every company is, but some are.
Why wouldn't such a company deploy an open-weight model (Kimi 2.6 or DeepSeek v4) on its own hardware (rented or otherwise) to save about $2.4 million a year?
And these are the landmines the Chinese cleverly set up. Not saying intentionally or otherwise.
But the end result is that good luck recouping your investments; you can pretty much kiss goodbye to any ROI. The bucket has a hole in the bottom and the bubble bust is guaranteed.
PS: Even setting open-weight models aside, the economics do not make sense, nor is the code generated by these SOTA models reliable enough to be deployed as is. Anyone claiming otherwise either hasn't worked on a real software stack with real users or hasn't used AI long enough to witness the AI slop and how hard it is to untangle or de-slopify the AI-generated code. These trillion-dollar valuations are absurd anyway.