Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview

113

GodelNumbering•about 4 hours ago•36 comments

Scored 65.2% vs. Google's official 47.8% and the existing top closed-source agent, Junie CLI, at 64.3%.

Since there have been a lot of reports of deliberate cheating on TerminalBench 2.0 lately (https://debugml.github.io/cheating-agents/), I would also like to clarify a few things:

1. Absolutely no {agents/skills}.md files were inserted at any point. No cheating mechanisms whatsoever.

2. The CLI agent was run in a leaderboard-compliant way (no modification of resources or timeouts).

3. The full TerminalBench run was done using the fully open source version of the agent; there is no difference between what is on GitHub and what was run.

I was originally going to wait for it to land on the leaderboard, but it has been 8 days and the maintainers have unfortunately not responded (there is a large backlog of pull requests on their HF), so I decided to post anyway.

HF PR: https://huggingface.co/datasets/harborframework/terminal-ben...

It is astounding how much the harness matters, based on this and other experiments I have done.

Discussion (36 Comments)

GodelNumbering•about 3 hours ago

Interesting things Dirac does:

1. Uses an optimized version of Hash-Anchored edits for file editing (https://dirac.run/posts/hash-anchors-myers-diff-single-token)

2. Utilizes language's AST to decide what to fetch into context, entirely avoids large code file reads

3. Batches all operations. Does large number of reads/edits simultaneously (you can see a video demo for deepseek-v4-flash here https://www.reddit.com/r/LocalLLaMA/comments/1suhdki/tested_...)

4. Allows the model to execute code to analyze things on the fly, so the model can simply write a bash/python/perl script to accomplish things where appropriate

5. A lot of context curation and opportunistic context updates, i.e. put into context anything that you are certain the model would ask for next
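To make item 1 concrete, here is a minimal sketch of the general idea behind hash-anchored edits: each line gets a short content hash, and an edit is applied only if the anchor it cites still matches the line. This is an illustration under assumptions, not Dirac's implementation (the linked post describes single-token anchors and a Myers diff, neither of which is modeled here). The names `anchor`, `render_with_anchors`, and `apply_edit`, and the 8-character SHA-1 prefix, are hypothetical.

```python
import hashlib


def anchor(line: str, width: int = 8) -> str:
    """Short content hash used as a line anchor (hypothetical scheme)."""
    return hashlib.sha1(line.encode("utf-8")).hexdigest()[:width]


def render_with_anchors(lines: list[str]) -> list[str]:
    """What a model might be shown: every line prefixed by its anchor."""
    return [f"{anchor(line)}|{line}" for line in lines]


def apply_edit(lines: list[str], line_no: int, expected: str,
               new_lines: list[str]) -> list[str]:
    """Replace the 1-indexed line `line_no`, but only if its current
    content still hashes to `expected`; otherwise reject the edit."""
    current = lines[line_no - 1]
    if anchor(current) != expected:
        raise ValueError(f"stale anchor at line {line_no}")
    return lines[: line_no - 1] + list(new_lines) + lines[line_no:]


source = ["def f():", "    return 1"]
seen = anchor(source[1])  # the anchor the model saw when it read the file
edited = apply_edit(source, 2, seen, ["    return 2"])
```

Replaying the same edit against `edited` raises `ValueError`, because line 2 no longer hashes to the anchor the model saw. That is the property that makes this kind of edit safer than editing by bare line number.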

rgbrgb•9 minutes ago

It would be really cool to do a causality investigation to determine which one of these boosts it so much / quantify how much each matters. Who knows, they may all interact in a sum-is-greater-than-parts way that only improves the score when shipped altogether.

deskamess•about 3 hours ago

I always wondered why ASTs did not play a bigger part in both editing and scoping of changes/parsing code. I thought I read an article where they said 'grep' was just as effective. It kinda made sense for the case they were talking about.

sigbottle•4 minutes ago

It's not intuitive to humans, even after learning parsing theory. I can do basic name refactorings. I've even written neovim plugins to do 1 specific thing with the AST (dfs down and delete one subtree which I understand). Those are fine.

I would not be comfortable doing an on-the-fly "rewrite all subtrees that match this pattern" kind of edit.

It seems like a tool that's good for LLMs though.

GodelNumbering•about 3 hours ago

Grep is effective for the most part, except for situations like when you have huge codebases and the thing you're looking for is used in too many places both as symbol and non-symbol.

Another annoying thing about plain grep is that LLMs often end up pulling in bundled packages, where a single line is large enough to ruin the context window.
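One possible mitigation for the long-line problem, sketched under assumptions: a grep-like search that reports oversized matching lines as a placeholder instead of returning them verbatim. The thread does not say Dirac does this; `grep_lines` and the 200-character cutoff are hypothetical.

```python
import re


def grep_lines(text: str, pattern: str, max_len: int = 200) -> list[str]:
    """Return 'lineno: content' for matching lines, replacing any line
    longer than `max_len` with a short placeholder."""
    rx = re.compile(pattern)
    hits = []
    for no, line in enumerate(text.splitlines(), start=1):
        if not rx.search(line):
            continue
        if len(line) > max_len:
            # A minified bundle line would otherwise flood the context.
            hits.append(f"{no}: [line of {len(line)} chars omitted]")
        else:
            hits.append(f"{no}: {line}")
    return hits


text = "foo = 1\n" + "x" * 5000 + "foo"
hits = grep_lines(text, "foo")
```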

embedding-shape•about 3 hours ago

> Grep is effective for the most part

It's very effective in well-written and well-designed code bases where concepts tend to be well formed enough not to share a name with everything else, so grepping for symbols gives you good search results.

Projects where the god-object or core concepts have generic names like "Tree", "Node" or other things that are used everywhere tend to be just short of impossible to search with grep and friends.
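The symbol vs. non-symbol distinction can be shown with a small comparison. This sketch uses Python's standard `ast` module as a stand-in for tree-sitter (an assumption made to keep the example self-contained; Dirac itself uses tree-sitter). `text_hits` and `symbol_hits` are hypothetical helper names.

```python
import ast

SOURCE = '''
class Node:
    pass

def build():
    # Node is mentioned in this comment
    label = "Node"
    return Node()
'''


def text_hits(source: str, name: str) -> list[int]:
    """grep-style: every line that contains the text, symbol or not."""
    return [no for no, line in enumerate(source.splitlines(), 1) if name in line]


def symbol_hits(source: str, name: str) -> list[int]:
    """AST-style: only class definitions and identifier references."""
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name == name:
            hits.add(node.lineno)
        elif isinstance(node, ast.Name) and node.id == name:
            hits.add(node.lineno)
    return sorted(hits)


by_text = text_hits(SOURCE, "Node")      # lines 2, 6, 7, 8
by_symbol = symbol_hits(SOURCE, "Node")  # lines 2, 8
```

The text search also returns the comment and the string literal, while the AST walk returns only the definition and the real reference.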

messh•32 minutes ago

Anchor based editing requires injecting new anchors to the context, and dirac does so via a diff. So how is this more efficient (token-wise) than search and replace?? Even at a single token per hash. Also, code is read more than written so these just add up. I experimented once with stable anchors, albeit longer than a single token, and found it a downgrade.

My conclusion is that the efficiency dirac sees comes mainly from showing file skeleton by default
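For readers unfamiliar with the term, a "file skeleton" is roughly an outline of a file's definitions without their bodies. The sketch below shows one way to produce such an outline, again using Python's `ast` module as a stand-in for tree-sitter. It is an assumption about the general technique, not Dirac's actual output format, and `skeleton` is a hypothetical name.

```python
import ast


def skeleton(source: str) -> list[str]:
    """Outline classes, methods, and top-level functions with line
    numbers, omitting all bodies. Functions nested inside other
    functions are not listed."""
    out: list[str] = []

    def visit(body: list[ast.stmt], depth: int) -> None:
        indent = "  " * depth
        for node in body:
            if isinstance(node, ast.ClassDef):
                out.append(f"{indent}class {node.name}  # L{node.lineno}")
                visit(node.body, depth + 1)
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                out.append(f"{indent}def {node.name}({args})  # L{node.lineno}")

    visit(ast.parse(source).body, 0)
    return out


src = "class A:\n    def m(self, x):\n        return x\n\ndef f(a, b):\n    return a + b\n"
outline = skeleton(src)
```

The outline for `src` is three short lines, and its size grows with the number of definitions rather than with the length of their bodies.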

sally_glance•about 1 hour ago

Is there a complete list of the tools somewhere? I'm interested in how you chose to expose the AST specifically. In my own harness attempts I wanted to keep the number of tools absolutely minimal and briefly experimented with including an AST lib to use via an execute_python tool (plus some examples in the system prompt). Results were mixed though, with most models preferring ripgrep.

jbellis•37 minutes ago

> Batches all operations. Does large number of reads/edits simultaneously...

I wasn't sure what this meant, so I looked at the source. It seems to be referring to tool APIs being designed around taking multiple targets as a list parameter, instead of hoping the model makes appropriately parallel tool calls. (This matches my experience btw, models are reluctant to make a large number of parallel calls simultaneously, and this seems more pronounced with weaker models.)

verdverm•31 minutes ago

I think Anthropic may have mentioned this first, this pattern is also something my custom agent's tools are designed around, pretty sure I picked it up from them.
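The list-parameter tool design described in this sub-thread can be sketched as follows. The tool name `read_files` and its return shape are hypothetical and not taken from Dirac's source; the point is only that one call carries many targets, and a failure on one target does not abort the rest.

```python
from pathlib import Path


def read_files(paths: list[str]) -> dict[str, str]:
    """One tool call, many targets. Per-target errors are reported in
    the result instead of being raised, so the batch always completes."""
    results: dict[str, str] = {}
    for p in paths:
        try:
            results[p] = Path(p).read_text(encoding="utf-8")
        except OSError as exc:
            results[p] = f"ERROR: {exc.__class__.__name__}"
    return results
```

With this shape the model does not need to issue parallel tool calls at all: a single call such as `read_files(["a.py", "b.py", "c.py"])` returns everything in one round trip.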

UncleOxidant•about 1 hour ago

> Utilizes language's AST to decide what to fetch into context,

Does that mean that it's only going to work with certain languages for which it has parsers available?

GodelNumbering•about 1 hour ago

It uses tree-sitter wasms. Currently, 14 languages are available (https://github.com/dirac-run/dirac/tree/master/src/services/...)

The agent would work even without a language parser, just that the AST-based functionalities won't work

gavinray•about 1 hour ago

Yes

blurbleblurble•about 1 hour ago

Did you consider incorporating ast-grep or gritql?

Congratulations, great work.

sally_glance•about 1 hour ago

Can't speak for OP but I tried providing ast-grep in the execution context of an execute_bash tool, but even with pretty aggressive steering most models just don't seem to use it a lot. More expensive/SOTA models or higher reasoning increases the chances but lowers speed and raises cost. Maybe due to training bias for exploration tasks?

blurbleblurble•about 1 hour ago

Yes, I've tried this passive approach too and didn't dig much further after that. I thought maybe they'd figured out something more intentional in the prompting to enable these kinds of approaches.

mdasen•about 2 hours ago

It's really interesting how much the AI harness seems to matter. Going from 48% via Google's official results to 65% is a huge jump. I feel like I'm constantly seeing results that compare models and rarely seeing results that compare harnesses.

Is there a leaderboard out there comparing harness results using the same models?

manx•36 minutes ago

We probably want to compare the cartesian product of model+harness.
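As a small illustration of what a model+harness grid would look like, the sketch below enumerates every pairing. The model and harness names are placeholders, and `evaluation_grid` is a hypothetical helper; scores are left empty because no such results appear in this thread.

```python
from itertools import product

models = ["model-a", "model-b"]                     # placeholder names
harnesses = ["harness-x", "harness-y", "harness-z"]  # placeholder names


def evaluation_grid(models: list[str], harnesses: list[str]) -> list[dict]:
    """One row per (model, harness) pair, with the score still to be filled in."""
    return [
        {"model": m, "harness": h, "score": None}
        for m, h in product(models, harnesses)
    ]


grid = evaluation_grid(models, harnesses)
```

The number of benchmark runs grows as models × harnesses, which is part of why such a leaderboard is expensive to maintain.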

GodelNumbering•about 1 hour ago

I really wish there was! I even thought of creating one, but it would be a conflict of interest

adyavanapalli•about 2 hours ago

I haven't tried it, but I'm curious why you decided to implement a whole new harness over just writing extensions in pi. From whatever I've done with pi so far, the extension api is quite extensive. Hash anchored edits, for example, can definitely be implemented in pi. Anyhow, thank you for showing us your project and will be checking it out later. Cheers!

GodelNumbering•about 2 hours ago

A few months ago one afternoon I was very frustrated with how slow Cline was being so decided to look under the hood. Decided to make a couple of changes. Got sucked in. About 70k lines of change, another 40k lines of deletions and two months later, here we are.

mring33621•21 minutes ago

The best kind of project. I'm trying this today. I've been happily using OpenCode so far.

Aeroi•about 1 hour ago

harness definitely makes a difference for the benchmarks. I ran my agent Camera Search against a few benchmarks and was able to beat Opus 4.7.

I created a real-world benchmark for mining, oil & gas, construction, etc. called FieldOps-bench, and it basically proves that vertical agents and specialized harnesses, tools, and systems still outperform SOTA models alone.

avereveard•about 2 hours ago

"astounding how much the harness matters" is the right read and it should be the lasting one. the model is rentable, the prompts are rentable, the benchmark numbers are mostly a function of the harness around them. swapping Gemini for Sonnet underneath the same harness has a smaller bench delta than swapping the harness around the model. the cheating-agents post you linked is the same observation through a different lens, the harness is what's being measured, the model is just the substrate.

that said, context management seems to be solving today's model problems more than being a universal property, and will probably be obsoleted a few model generations down the road, as tools obsoleted RAG context injection from question embeddings.

himata4113•about 1 hour ago

That's why ARC-AGI-3 doesn't allow the use of a harness. The model has to create the harness instead.

bryanhogan•about 3 hours ago

If I understand correctly, this is a heavily improved Cline fork? Does that mean features such as plan and act mode are also still there?

GodelNumbering•about 3 hours ago

Yes, plan+act mode is one thing I loved about Cline!

michelhabib•about 1 hour ago

woow, looks very good. I'm wondering if you do any optimizations for cli in general, since you're not using MCP. I'm building my own CLI for AI Agents, and was always concerned with context rot.

Mashimo•about 3 hours ago

Interesting. Would love a comparison to pi.dev (Not Ohmypi)

How does this perform in day to day coding tasks, outside of benchmarks?

GodelNumbering•about 3 hours ago

https://github.com/dirac-run/dirac#-evals

README has an eval of 8 tasks over 7 agents (including both pi and omp). Pi-mono has the second-lowest cost across the 8 tasks (after Dirac) but occasionally produces incomplete changes.

Interestingly, 2 tasks where pi missed some changes both were the tasks that benefitted from AST symbol understanding (e.g. find all instances of things that refer to this symbol and change those things). Since pi relies on bash type tooling, it missed some occurrences

howdareme•about 3 hours ago

Going to assume you didn't capture the data, but could you add time taken to completion for each if you have it?

messh•about 2 hours ago

re. bash type tooling: it doesn't mean an agent cannot use the AST. Using the tree-sitter CLI, this should be perfectly possible

martinald•about 3 hours ago

Very interesting! I've often thought static analysis could really help agents (I wrote this last summer: https://martinalderson.com/posts/claude-code-static-analysis...), but despite being hyped for LSPs in Claude Code it turned out to be very underwhelming (for many of the reasons that they can be annoying in a "real" IDE, ie static analysis starts firing mid edit and complaining and cached analysis getting stuck).

Curious to know if this has been an issue with your AST approach on larger projects?

The hash line based numbering is very interesting too (though I see on Opus 4.5+ far far fewer editing errors).

I've often thought that even if model progress stopped today, we'd still have _years_ of improvements thru harness iteration.

GodelNumbering•about 3 hours ago

Wrt LSP, it uses the default LSP mechanism of the IDE provider.

For AST, it uses tree-sitter WASMs (ships them with the package), and maintains queries (https://github.com/dirac-run/dirac/tree/master/src/services/...)

To keep performance fast, it stores the symbols DB (using sqlite) in the workspace's directory and incrementally updates it based on timestamps. Then it uses this DB to resolve symbol queries
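The timestamp-based incremental update described above can be sketched roughly like this. The schema, the function names (`open_db`, `index_file`, `lookup`), and the use of Python's `ast` module in place of tree-sitter queries are all assumptions made for illustration; Dirac's actual tables and parsing are not shown in this thread.

```python
import ast
import os
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) schema: one row per file, one per symbol."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, mtime REAL)")
    db.execute("CREATE TABLE IF NOT EXISTS symbols (path TEXT, name TEXT, line INTEGER)")
    return db


def index_file(db: sqlite3.Connection, path: str) -> bool:
    """Re-index `path` only if its mtime differs from the stored one.
    Returns True when the file was (re)indexed, False when skipped."""
    mtime = os.path.getmtime(path)
    row = db.execute("SELECT mtime FROM files WHERE path = ?", (path,)).fetchone()
    if row is not None and row[0] == mtime:
        return False  # unchanged since the last pass
    with open(path, encoding="utf-8") as fh:
        tree = ast.parse(fh.read())
    db.execute("DELETE FROM symbols WHERE path = ?", (path,))
    for node in ast.walk(tree):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            db.execute("INSERT INTO symbols VALUES (?, ?, ?)",
                       (path, node.name, node.lineno))
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, mtime))
    db.commit()
    return True


def lookup(db: sqlite3.Connection, name: str) -> list[tuple[str, int]]:
    """Resolve a symbol name to (path, line) pairs from the index."""
    return db.execute(
        "SELECT path, line FROM symbols WHERE name = ? ORDER BY path, line",
        (name,),
    ).fetchall()
```

An index like this is only as fresh as its last update pass: a lookup made after an edit but before the next `index_file` call returns the old locations.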

martinald•about 3 hours ago

Yes I understand, but do you not have issues that it drifts out of date and confuses the agents (especially on longer running tasks)?

Like even "full" Visual Studio and Resharper have issues with this. Eg, you start editing file x, 'intellisense' runs, says there are loads of errors... because you haven't finished editing yet.

tuo-lei•about 1 hour ago

same issue from the other side. when a human is editing, the LSP fires mid-keystroke and shows bogus errors for a second, whatever. with an agent doing 5 edits in a row, the symbol DB is always behind by one edit, so the next lookup pulls stale references. you can re-index synchronously after each edit but that kills the batching speed.

nthypes•about 3 hours ago

Can't OpenCode reach the same just developing this as a feature or plug-in? Like anchored edit?

mdasen•about 2 hours ago

Sure. Dirac is just a fork of the Cline harness and obviously OpenCode could take the same techniques and implement them. I don't know how difficult it would be to implement them in OpenCode, but given that Dirac and OpenCode are both open source, a future version of OpenCode could always be a re-branded Dirac (I'm sure there are ways to implement Dirac's techniques without having to completely replace OpenCode's underlying code base, but this illustrates that at the extreme, they could clearly just take Dirac in its entirety to get the same results).

blueTiger33•about 2 hours ago

Starred it. Will try it later. One question though, to make it simpler for me: in what tasks does this model shine, and how do you improve the score? I already use some skills to cut down CC costs, like caveman, rtk cli and a few others. Just want to understand

GodelNumbering•about 2 hours ago

I did limited testing using Sonnet on CC vs Sonnet on Dirac. I could not confirm the costs however

redrove•about 2 hours ago

I keep trying to use dirac-cli with codex and it won't work: Error: Codex API error: Codex API request failed: 400.

Any ideas?

GodelNumbering•about 2 hours ago

Assuming you logged in with OAuth, I am guessing you are trying to use gpt-5.5?

In my tests, it worked using gpt-5.4 for me and I assumed gpt-5.5 is not available to me because I am on the free plan

Do you have the subscription that allows 5.5? If so, I can look into what changed in the API. Sorry, I rarely use OpenAI so it is a bit of an untrodden path

redrove•about 1 hour ago

Yes I'm on ChatGPT Pro (OAuth) and I'm trying to use gpt-5.5-xhigh.

That was the issue, 5.4 works just fine.

Support for service: priority (GPT /fast mode) would also be cool!

snqb•about 2 hours ago

how well does it do on frontier models like Opus 4.6?

GodelNumbering•about 2 hours ago

I have only done functionality testing, no benchmark testing on Opus (decided to pay my rent instead)

aetherspawn•about 3 hours ago

Sorry I couldn’t really figure out if this was a harness, a fine tuned model, or both. Can we use Qwen with this for example? Is the performance expected to be better in that case?

GodelNumbering•about 3 hours ago

The model was the default gemini-3-flash-preview.

Harness was https://www.npmjs.com/package/dirac-cli

Since Dirac is a heavily modified fork of Cline, it supports all models Cline supported, including Qwen and all popular open/closed models

As a matter of fact, I am trying to run terminal bench 2.0 using some OSS models at the moment but the slow inference speeds are causing tasks to timeout

neonstatic•about 2 hours ago

I am a bit confused. What languages does it help with? You mention AST manipulation, so I am assuming it's not universally applicable, e.g. to Rust?

deviation•35 minutes ago

AST (Abstract Syntax Tree) is essentially a search algorithm to better help the agent do its job.

nthypes•about 3 hours ago

No CLI? Only VSCode extension?

GodelNumbering•about 3 hours ago

CLI too (you can't run tbench without the CLI, as it runs in an isolated docker env): `npm install -g dirac-cli`