Discussion (141 Comments) | Read Original on HackerNews
These questions are not even about AI: if I gave money to a human agency and was handed something they told me works, I would ask the same questions. If I did not know how to evaluate it, I would hire people who do. With LLMs the verification part is what bothers me the most.
The only decent software engineering perspective I’ve seen has been from Mitchell Hashimoto.
They can just summon bespoke software out of the ether that only handles the use cases of themselves and a few of their collaborators.
Making “side projects” was not possible for non-developers before powerful LLMs. Now it is.
Imagine not being an architect and using Claude to put together a building plan, then concluding it’s basically done but we might need a real architect to double check the measurements. It may even be true but I’d be skeptical if it’s always non-architects saying this.
The trick to getting good at using LLMs for software is to learn how to make _all_ projects low-stakes.
this doesn't really work in the real world. There are many things that actually matter, engineering is fundamentally about handling them.
I clicked one of his examples intrigued "a snake game where the snake is self-aware and crazy things happen;". Played for 1-2 minutes, and it's the classic 1980s snake game. Am I missing something? What is "self-aware" about it? Some funny messages at the bottom of the screen? And what are the "crazy things"?
I will say, the way the act of eating creates a "bulge distortion" that flows down the length of the snake is a nice touch though.
But yes, you are right - I don't build roads, and I don't know what it costs to build a road or how to judge the quality of a correctly built one, nor will I ever care or learn.
The lack of downvotes on posts on HN has always felt like more of a bug than a feature to me.
Yet, I can't deny the reality that I observe working with LLMs every day. If this truly is a step-function (as some are suggesting), then I have absolutely zero concern for the quality of the code.
It also burned through my usage quota like a late-90s Hummer.
Yeah. I have a Max 5x subscription and Fable burned through 16% of my weekly quota in a 40 minute code review session. It didn't even finish the review, it switched back to Opus 4.8 in the critical memory safety parts where I actually needed Fable.
I feel like I'm going to get priced out of these models soon. I should probably try to get the most out of Fable until June 22nd.
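For a rough sense of what that burn rate means over a week, here is a back-of-the-envelope sketch. It assumes usage scales linearly with session time, which real sessions may well not do; the function name is made up for illustration, and the only inputs are the 16% and 40 minutes quoted above.

```python
def weekly_hours_at_burn_rate(fraction_used: float, session_minutes: float) -> float:
    """Hours of comparable sessions that a full weekly quota would cover.

    Assumes linear usage: a session twice as long burns twice the quota.
    """
    if not 0 < fraction_used <= 1:
        raise ValueError("fraction_used must be in (0, 1]")
    if session_minutes <= 0:
        raise ValueError("session_minutes must be positive")
    # Minutes needed to burn 100% of the quota, converted to hours.
    return (session_minutes / fraction_used) / 60


# Figures from the comment above: 16% of the weekly quota in 40 minutes.
hours = weekly_hours_at_burn_rate(0.16, 40)
print(f"Roughly {hours:.1f} hours of comparable review per week")
```

Under that linear assumption the whole weekly quota covers a little over four hours of this kind of review.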
It's not just salary, but also safety/labor regulation, legal risk, vacations, sick time, personal conflicts, HR, benefits.
Even when automation is more expensive on paper, it's generally still cheaper
You underestimate what these models cost. Uber's budget is $1,500/dev/month. I gather that was put in place because the devs were going through $6,000/dev/month, which Uber decided could not be cost justified.
Fable costs at least twice as much, or $12,000/dev/month.
Fable can apparently work for hours without supervision, which means a skilled engineer can now have it working on many tasks concurrently. I would not be at all surprised if they can put a nought or two on that number. If you do that, you are well out of "what a human costs" territory.
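To put numbers on that, here is a small sketch of the arithmetic using only the figures quoted in this subthread. Reading "a nought or two" as 10x and 100x, and treating concurrency as a plain linear multiplier on spend, are both assumptions for illustration and not anything the vendors have published.

```python
# Figures quoted in the comments above (USD per developer per month).
OBSERVED_SPEND = 6_000    # what devs were reportedly going through
FABLE_MULTIPLIER = 2      # "at least twice as much"


def monthly_cost(base: int, multiplier: int, concurrency: int = 1) -> int:
    """Monthly spend if cost scales linearly with concurrent tasks (an assumption)."""
    return base * multiplier * concurrency


# 1x is the single-session figure; 10x and 100x model "a nought or two".
for concurrency in (1, 10, 100):
    month = monthly_cost(OBSERVED_SPEND, FABLE_MULTIPLIER, concurrency)
    print(f"{concurrency:>3}x: ${month:,}/dev/month, ${month * 12:,}/dev/year")
```

Even the 1x row works out to $144,000 per developer per year, and the 10x row passes a million.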
As far as I can tell this part of the job isn't really on anyone's radar anymore.
I can't help thinking there might be some kind of strategic issue here.
Perhaps someone should ask Mythos about it.
If you get $100,000 per year as a SWE, and Anthropic offers a coding model for $100,000 per year (but working 24/7), then you'll have to give up all of those addons that make the fully burdened cost of the employee. Say goodbye to vacation, sick time, benefits, etc.
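As a quick sanity check on that comparison, the per-hour arithmetic looks like this. The $100,000 figures come from the comment; the 40-hour week over 52 weeks is my assumption, and it compares salary only, not the fully burdened cost the comment is talking about.

```python
ANNUAL_COST = 100_000              # same headline price for the human and the model

MODEL_HOURS_PER_YEAR = 24 * 365    # "working 24/7" -> 8,760 hours
HUMAN_HOURS_PER_YEAR = 40 * 52     # assumed 40-hour weeks, no time off -> 2,080 hours


def cost_per_hour(annual_cost: float, hours_per_year: float) -> float:
    """Annual cost spread evenly over the hours actually worked."""
    return annual_cost / hours_per_year


model_rate = cost_per_hour(ANNUAL_COST, MODEL_HOURS_PER_YEAR)
human_rate = cost_per_hour(ANNUAL_COST, HUMAN_HOURS_PER_YEAR)
print(f"model: ${model_rate:.2f}/hour, human: ${human_rate:.2f}/hour")
```

At the same annual price the always-on model comes out at roughly a quarter of the human hourly rate, before any benefits or overhead are counted on the human side.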
We know this model will be cheaper and faster with time.
And we have not even reached the timeframe where we have ASIC-style models.
OpenAI has to ship something that beats Fable, otherwise Anthropic has won. China is currently overtaking in cars, PV, batteries, and very soon silicon chip making; it has every incentive to take over AI as well.
"Posterior beliefs about market demand are purely referencedependent: holding dollars raised constant, they track only performance relative to the founder’s self-chosen goal—jumping half a standard deviation at the threshold, responding steeply for the first ten points past it, and flattening thereafter"
Humans generally don't verbalize data this way. The summary document is also very fluffy.
Academic social scientists on the other hand...
IMO, I don't see anything on that line that looks like an AI. LLMs also have a tendency to bring words from one context into another, but this is also the context I would expect to see those words.
Every sw dev knows this is a very dangerous, and unrealistic, assumption.
In a project like mine (https://github.com/tsz-org/tsz) I am constantly frustrated that models were not doing enough research and were not taking into account other situations. Again and again models would produce code that would fix one thing and break 2 other tests that were "unrelated".
With Fable it seems like tasks are taking much longer (I have not seen a pull request from Fable sessions yet), but reading the transcripts of those sessions I can see how it is doing the right thing by not leaving any stone unturned.
As the article says, it's hard to communicate this "feeling" about models because it is very project specific, but I thought I'd share.
But overall, this is pretty normal for compilers to have this sort of "unexpected" tests failing due to some work in an area. It happened to me when I was coding everything manually back in the day too
That's not what a clean setup means... I mean good separation of concerns, established invariants, etc.
Personally I don't really care, because I like coding and learning myself and DeepSeek Flash is all I really care about. But it's really easy to have a ton of benchmarks where the top models can't get anywhere close - and I like to test them on these problems to see how good they are getting.
Fable 5 is def a little better than 4.8 btw.
On the margins, suppose the prompt is literally: "Build a feature complete, high polish Facebook clone". Facebook is complex but likely not super complicated tech, and still I would assume that (after having burned through a substantial amount of tokens) you would find substantial enough differences in the outcomes between different models on that prompt on various fronts.
The above ask is obviously not useful, but what's preventing you from taking on bigger chunks until you approach the limit? At some point you would hit a boundary, where the diff will be obvious.
A small portion of this effort is having a high-quality Lua-in-Rust repo. I’m using Mythos to fix some of the performance issues with my Lua interpreter that GPT 5.5 / Opus 4.8 had stonewalled on.
Not sure if Mythos will be able to crack this but it has been running for a couple hours now with some promising results.
Performance charts linked here if you're curious: https://github.com/ianm199/lua-rs
The other reason is that because mlua is just a wrapper around the C code, it has unsafe you can't really get around. So for example Lua is used in Redis, which has this critical CVE https://github.com/redis/redis/security/advisories/GHSA-4789... that a memory safe version of Lua wouldn't have to deal with.
Mlua is still fine or even better for many other cases though!
Myth. Total myth! I recently had to beg for more RAM after continually hitting swap space which causes tools like dictation to stop working, failure to load certain websites without rebooting, and so on. Devs do in fact need powerful machines and the ~$500-1000 an employer saves upfront in machine costs is dwarfed by productivity losses.
Giving your engineering employees new machines in a 2-year cycle that are between the middle and high end is one of the cheapest ROI decisions that a tech org can make.
> Again, it wasn’t perfect. As an expert, I was able to spot some errors and omissions (some as a result of the design I had asked for) that I had the AI correct
That's the bit that stuck out to me - that's longer than I would expect to work on a problem in a day or even expect to go back & fix the output of something that has a core reward loop of hours.
My customers are currently clamoring to push down my agent response times from 85 seconds down to below the 20s mark.
At the same time, it is very dissonant to see the industry heading towards hour+ long workflows with an agent.
We're gonna go back to the days where our bosses ask why we're just sitting around, but instead of saying "compiling," we'll just say, "waiting for Claude."
It's some prompt-engineered AI harness that guides the AI to create stats after it researches a subject and ingests the data, but I'm not sure what it is that the tool actually does on top of this.
Will Claude's code be perfect in one shot? Probably not, will it get you 80 to 90% of the way there with your chosen design patterns in under a few hours? Absolutely.
Sounds like we've nearly reached in coding the point where Paul Bunyan [0] has his epic competition with the chainsaw... and loses by 1/4" and history forever changes...
[0] https://www.britannica.com/topic/Paul-Bunyan
https://xkcd.com/303/
At this point, pay me significantly more, and I'll do it.
Ha ha, that's how you negotiate yourself out of a job!
I'm amazed we're so far into SOTA bloat that the Chinese will kill once they start etching silicon with these models.
https://isochronic-passage-chart.netlify.app/
Doesn’t work too well on mobile but looks interesting
I also see some logic flaws. It overlooks the option of going to a major hub to access faster aircraft, rather than hopping on local hubs.
Also, immigration and customs are cleared at the first airport you arrive at in the country, not at the last one.
In some countries, you need to clear immigration even while going to a third country, so 1 hour is not enough to do it.
> Switched to Opus 4.8: Fable 5 has safety measures that flag messages on most cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Send feedback or learn more.
And I'm excited to try it, but also have a fear that I will like it too much and then won't have access to it in 2 weeks... But maybe I will and maybe it will be worth it and I'll just pay a bunch of extra for it and it'll be great!
I think the article could be improved by actually sharing more feelings. I clicked on the article for feelings but I didn't see that many feelings described.
[1] https://isochronic-passage-chart.netlify.app/
[2] https://mapitout.welcome-to-nl.nl/
[3] https://commutetimemap.com/
[4] https://andrewding.ca/flightisochrones/
He is a professor but sadly also an AI shill. He should switch to advertising washing powder.
I don’t see why working longer is a pro. The results don’t seem much better than you’d get from putting Opus in a long loop.
Care to share the results you got from Opus working on the same prompt? It should be easy to compare quality.
The first item on the article, the first thing it showed, was wrong though.
It is 100% faster to go from London to New York in 1881 than to Volgograd. Or to any of the Russian hinterland colored green, or Turkey, or Egypt.
the map is for 2026, yeah?
There is only one hint: 475k tokens in the screenshot when OP asked the model to fix some behaviour, but it would be fascinating to know the total tokens amount.
Is it a hard problem or is it just labor intensive?
Edit: A couple hours in and I just got my first gaslighting attempt from the model. Good times!
Just an FYI this guy is an AI hype-beast. Some of his tweets are truly out there.
What makes me excited is that GPT 5.6 (it's actually GPT 6) is going to be crazy
What?