I built Forge, an open-source reliability layer for self-hosted LLM tool-calling.
What it does:
- Adds domain-and-tool-agnostic guardrails (retry nudges, step enforcement, error recovery, VRAM-aware context management) to local models running on consumer hardware
- Takes an 8B model from ~53% to ~99% on multi-step agentic workflows without changing the model - just the system around it
- Ships with an eval harness and interactive dashboard so you can reproduce every number
I wanted to run a handful of always-on agentic systems for my portfolio, didn't want to pay cloud frontier costs, and immediately hit the compounding math problem on local models. 90% per-step accuracy sounds great, but with a 5-step workflow that's a 40% failure rate. No existing framework seemed to address this mechanical reliability issue - they all seemed tailor-made for cloud frontier.
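The compounding math is quick to check. A minimal sketch, assuming every step succeeds independently with the same probability; the function name is illustrative and not part of Forge.

```python
def workflow_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


if __name__ == "__main__":
    success = workflow_success_rate(0.90, 5)
    # 0.9 ** 5 = 0.59049, so roughly 59% success and 41% failure
    print(f"success: {success:.1%}, failure: {1 - success:.1%}")
```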
Demo video: https://youtu.be/MzRgJoJAXGc (side-by-side: same model, same task, with and without Forge guardrails)
The paper (accepted to ACM CAIS '26, presenting May 26-29 in San Jose) covers the peer-reviewed findings across 97 model/backend configurations, 18 scenarios, 50 runs each. Key numbers:
- Ministral 8B with Forge: 99.3%. Claude Sonnet with Forge: 100%. The gap between a free local 8B model on a $600 GPU and a frontier API is less than 1 point.
- The same 8B local model with Forge (99.3%) outperforms Claude Sonnet without guardrails (87.2%) - an 8B model with framework support beats the best result you can get through frontier API alone.
- Error recovery scores 0% for every model tested - local and frontier - without the retry mechanism. Not a capability gap, an architectural absence.
I'm currently using this for my home assistant running on Ministral 14B-Reasoning, and for my locally hosted agentic coding harness (8B managed to contribute to the codebase!).
The guardrail stack has five layers, each independently toggleable. The two that carry the most weight (per ablation study with McNemar's test): retry nudges (24-49 point drops when disabled) and error recovery (~10 point drops, significant for every model tested). Step enforcement is situational - only fires for models with weaker sequencing discipline. Rescue parsing and context compaction showed no significance in the eval but are retained for production workloads where they activate once in a while.
One thing I really didn't expect: the serving backend matters. The same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode. A swing of roughly 75 points from infrastructure alone. I don't think anyone's published this because standard benchmarks don't control for serving backend.
Another surprise: there's no distinction in current LLM tool-calling between "the tool ran successfully and returned data" and "the tool ran successfully but found nothing." Both return a value, the orchestrator marks the step complete, and bad data cascades downstream. It's the equivalent of HTTP having 200 but no 404. Forge adds this as a new exception class (ToolResolutionError) - the model sees the error and can retry instead of silently passing garbage forward.
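To make the 200-vs-404 idea concrete, here is a rough sketch. Only the class name ToolResolutionError comes from the post; its definition, the lookup_user tool, the USERS data, and the run_tool_step wrapper are all assumptions for illustration and not Forge's actual API.

```python
class ToolResolutionError(Exception):
    """Tool ran without crashing but could not resolve what was asked for.

    The name is from the post; this definition is an assumption.
    """


USERS = {"alice": {"email": "alice@example.com"}}  # hypothetical data store


def lookup_user(name: str) -> dict:
    """Hypothetical tool: return a user record or signal 'found nothing'."""
    record = USERS.get(name)
    if record is None:
        # Returning None here would look like a successful step to an orchestrator.
        raise ToolResolutionError(f"no user named {name!r}; check the spelling and retry")
    return record


def run_tool_step(tool, **kwargs) -> dict:
    """Run one tool call and report the outcome in a form the model can act on."""
    try:
        return {"status": "ok", "result": tool(**kwargs)}
    except ToolResolutionError as exc:
        # The step is not marked complete; the message goes back to the model.
        return {"status": "not_found", "error": str(exc)}
```

With this shape, run_tool_step(lookup_user, name="bob") yields a "not_found" status the model can react to instead of an empty value that gets passed downstream.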
Biggest technical challenge was context compaction for memory-constrained hardware. Both Ollama and Llamafile silently fall back to CPU when the model exceeds VRAM - no warning, no error, just 10-100x slower inference. Forge queries nvidia-smi at startup and derives a token budget to prevent this.
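A sketch of how a token budget could be derived from nvidia-smi output. This is not Forge's implementation: the function names, the 512 MiB headroom, and the assumption that the query runs before the model is loaded are all mine. The KV-cache formula (keys plus values, per layer, per KV head) is the standard estimate for transformer inference.

```python
import subprocess

NVIDIA_SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_free_vram_mib(smi_output: str) -> int:
    """Parse 'total, used' (MiB) from the first GPU line and return free MiB."""
    first_line = smi_output.strip().splitlines()[0]
    total, used = (int(field.strip()) for field in first_line.split(","))
    return total - used


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV-cache bytes per token: keys and values for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def token_budget(free_mib: int, weights_mib: int, per_token_bytes: int,
                 headroom_mib: int = 512) -> int:
    """Context tokens that fit in VRAM after weights and a safety margin."""
    available_bytes = (free_mib - weights_mib - headroom_mib) * 1024 * 1024
    return max(0, available_bytes // per_token_bytes)


def query_free_vram_mib() -> int:
    """Call nvidia-smi; requires an NVIDIA driver and nvidia-smi on PATH."""
    result = subprocess.run(NVIDIA_SMI_QUERY, capture_output=True, text=True, check=True)
    return parse_free_vram_mib(result.stdout)
```

Keeping the parsing separate from the subprocess call means the arithmetic can be checked on a machine without a GPU.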
How to try it:
- Clone the repo, run the eval harness on a model I haven't tested. If you get interesting results I'll add them to the dashboard.
- Try the proxy server mode - point any OpenAI-compatible client at Forge and it handles guardrails transparently. It's the newest mode and I'd love more eyes on it.
- Dogfooding led me to optimize model parameters in v0.6.0. The harder eval suite (26 scenarios) is designed to raise the ceiling so no one sits at 100%. Several that did on the original suite can't sweep it - including Opus 4.6. Curious if anyone finds scenarios that expose gaps I haven't thought of. Paper numbers are based on pre-v0.6.0 code.
Background: prior ML publication in unsupervised learning (83 citations). This paper accepted to ACM CAIS '26 - presenting May 26-29.
Repo: https://github.com/antoinezambelli/forge
Paper: https://www.caisconf.org/program/2026/demos/forge-agentic-re... https://github.com/antoinezambelli/forge/blob/main/docs/forg...
Dashboard: https://github.com/antoinezambelli/forge/docs/results/dashbo...

Discussion (231 Comments)
I've only skimmed the post and read the abstract and in some places you make a nod to how simple tweaks can make something 10x faster/slower, but then all of your metrics and data seem to focus 100% on accuracy. You need to address speed.
Specifically for agentic workflows and local models, accuracy around function/tool calling hasn't been a problem for me now for about 6 - 12 months, personally, since around QwenCoder3. The main issue is context management and the impact on timing, since agents will often swap prompts and break prompt caching and similar timing improvements.
It looks like your work adds layers and wrappers like guardrails and retries. This would make my local model experience - specifically for agents - unusable because of the delays it would add.
I really appreciate and respect the work you've done, and apologies if you have already addressed this head on, but with so little talk about the impact on timing here, I feel like you're hiding something or overinflating the actual real world improvements here - what are your thoughts?
It's also mildly concerning to me that nobody else has raised this - am I doing something wrong here, or is everyone else just not actually using local models in real life?! Talk to me about your speed experiences!
> …but then all of your metrics and data seem to focus 100% on accuracy. You need to address speed.
I wonder, if you were to use cloud-based LLMs more often, you might find that accuracy (fidelity?) is indeed more lacking in your local models.
You can always just throw hardware at your speed problems after all.
I also agree that if I spent more time using cloud-based LLMs, I would very much find local LLMs less capable and useful. Comparison is the thief of joy though, and I'd rather feel blissful ignorance towards SOTA LLMs than a dependence on them.
Before taking a local-focused approach, LLMs increasingly left me feeling a mixture of FOMO, sadness and futility towards the future of software and tech. I assume it's 100% a me problem, but it has its benefits :)
On a per-call basis, the wrappers are pure python ifs and such, measured in ms easily, and frankly negligible compared to the LLM call itself which will be on the order of magnitude seconds.
Where timing gets interesting is that forge will slow down workflows because the retries mean you don't error right away. Bare runs were failing fast in my experience. But on a per-call basis there's very little overhead.
I haven't detailed it simply because the order of magnitude of a single LLM call is so much higher than all the overhead put together.
> Where timing gets interesting is that forge will slow down workflows because the retries mean you don't error right away. Bare runs were failing fast in my experience. But on a per-call basis there's very little overhead.
> I haven't detailed it simply because the order of magnitude of a single LLM call is so much higher than all the overhead put together.
Yeah, that makes sense and seems fair. Those sorts of delays are almost an inevitability: you're not trying to improve speed, but improving reliability can obviously increase overall throughput.
Having watched the demo video too now, automating retries etc would be helpful for me. It's impressive to see how quick the models run on better hardware, and the performance improvements are impressive, even if the overall run takes longer sometimes because it does more correct things. Thanks again!
Ah that's good to know
when I first saw this posted yesterday I was wondering that, kind of assumed maybe it was doing extra LLM calls to make judgements
But that's the difference between the call failing and succeeding (eventually).
On successful calls the presence of forge should be unnoticeable.
I have a laptop with a broken screen and an RTX2060 at my disposal. I can run 12b - 14b dense usably, just, although I think 4b - 8b dense models give me the best tradeoff of speed and usefulness.
Larger MOE models with more parameters (20b+) but fewer active (2 - 3b) are sometimes a little bit slower, but are often far more capable.
Even the SOTA models have this problem when the work is complicated enough. The problem is amplified more with the small models.
If token costs aren’t a concern I’m using SOTA for everything.
Even SOTA gets it wrong and hallucinates, but at a lower rate. I don’t want to waste my time.
A simple retry loop around your whole workflow could, in some cases, be all you need. But it could mean many blind attempts to get through a workflow successfully. And hopefully there isn't a payment step partway through!
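A back-of-the-envelope comparison of the two retry strategies, as a sketch under simplifying assumptions: every step succeeds independently with probability p (0 < p < 1), retries are unlimited, a whole-workflow retry restarts from step 1 on any failure, and a per-step retry repeats only the failed step. The function names are illustrative.

```python
def expected_calls_restart_workflow(p: float, steps: int) -> float:
    """Expected tool calls when any failure restarts the workflow from step 1.

    This is the expected number of trials needed to see `steps` successes in a row.
    """
    return (p ** -steps - 1) / (1 - p)


def expected_calls_retry_step(p: float, steps: int) -> float:
    """Expected tool calls when only the failed step is retried."""
    return steps / p


if __name__ == "__main__":
    # With p = 0.9 and 5 steps: about 6.9 calls with restarts vs about 5.6 with per-step retries
    print(round(expected_calls_restart_workflow(0.9, 5), 2))
    print(round(expected_calls_retry_step(0.9, 5), 2))
```

The gap widens quickly as per-step accuracy drops or workflows get longer, and this ignores side effects (like that payment step) that make restarts costly for other reasons.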
The fewer hard errors nix the whole workflow, the lower your ETTWS.
- Breaking down a problem into a planned execution, with executing agent providing the initial plan which includes explicit objectives such as which tools it calls and what it would consider to be a successful execution.
- The harness then executes the plan in order
- Each step that involves a tool call will be executed by breaking down the tool call into component parts: the harness interrogates the agent for a valid parameter value for the current tool argument. The tool definition contains validators for each argument. If the validator fails, the harness rewinds the conversation and injects the failure reason into the next try.
- Once the agent produces a valid response for the argument, the harness proceeds to the next argument.
- Once all the arguments have been filled, the harness calls the tool. It passes the agent's initial expected value along with the actual value, along with any errors that may have been produced and asks the agent if it is satisfied with the result. If it isn't, the agent provides a reason and the harness then retries the tool call process from the beginning rewinding the conversation and inserting the reasoning for the retry.
- The agent may request to re-plan if it discovers a flaw in its initial plan. The harness will also attempt to re-plan if the agent produces too many failures in a row.
This proves to be quite effective at reducing tool call failures. One benefit is that the sub-agent gets a perfect conversation history where it makes no mistakes. I'm not sure if it's actually better at completing tasks though, I haven't tried to benchmark it.
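The per-argument loop described above could look roughly like this. It is a sketch of the idea only: the agent is modelled as a plain function from conversation history to a proposed value, history is a list of strings, and every name here is hypothetical.

```python
from typing import Callable, Dict, List, Optional

# A validator returns None when the value is acceptable, else a failure reason.
Validator = Callable[[str], Optional[str]]
# The agent is modelled as a function from conversation history to a proposed value.
Agent = Callable[[List[str]], str]


def fill_arguments(agent: Agent, validators: Dict[str, Validator],
                   history: List[str], max_tries: int = 3) -> Dict[str, str]:
    """Ask for one argument at a time, rewinding the history on validation failure."""
    filled: Dict[str, str] = {}
    for arg_name, validate in validators.items():
        checkpoint = len(history)  # rewind point for this argument
        hint = ""
        for _ in range(max_tries):
            del history[checkpoint:]  # drop any failed exchange
            history.append(f"Provide a value for '{arg_name}'.{hint}")
            value = agent(history)
            reason = validate(value)
            if reason is None:
                history.append(value)  # only the accepted answer stays in history
                filled[arg_name] = value
                break
            hint = f" Previous attempt {value!r} was rejected: {reason}"
        else:
            raise RuntimeError(f"could not fill '{arg_name}'; a re-plan would go here")
    return filled
```

After a rejected attempt, the history contains only the re-asked question (with the failure reason injected) and the accepted answer, which is the "perfect conversation history" effect.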
A few things I noticed related to your points:
- on conversation rewind, I implemented a similar tool call collapse on the main agent (the one you chat with). Once it was done with a task, the tool call history was collapsed to keep the context clean - it was more about hygiene than size.
- the harness interrogating the model bit is a bit different, I haven't tried that approach. Forge relies on model self-correction in a bid to avoid having bespoke error modes, but I guess if you can abstract and automate the interrogation based on schema or something that could work!
Overall I like the clean conversation history aspect, but I suspect that you might be doing a lot of round trips for tools with many args, versus "letting it fail and giving it one nudge". That being said, it's an interesting idea for harder scenarios/tasks!
Unfortunately I am caught up right now in other projects at work and otherwise, and just tried a few dozen prompts to see if this is even achievable.
I thought Llamafile was just a model and llama.cpp bundled into a single binary - is this the difference between Llamafile injecting a default system prompt vs hitting the raw llama-server endpoint with no harness?
That seems like comparing apples to apple pie, there's some ingredients missing.
However, that doesn't explain llama-server prompt vs Llamafile at ~ +4pts, or vs Ollama (at ~ +30ish pts), which sits almost perfectly between llama-server native and Llamafile.
The backend affects almost all model families, and was just something I've never seen really talked about.
That's an absolutely bonkers statistic: it would mean spurious differences in hosting container overwhelm the performance differences between models.
I feel like there's some backend decoding or chat template thing going on at a much lower level than what I'm best at. Maybe it's injecting headers or something that eventually compounds to model confusion? I really have no idea.
I really hope folks better than me at backend stuff take a look and dive into it though, because it's definitely under-reported and super consistent across model families and backends ranging from ollama, llama.cpp native, prompt, llamafile, and even vLLM that I didn't formally benchmark in the repo.
I've been working on a pytest-first acceptance testing framework called Dokimasia (do-kee-ma-see-ah) that I'd love to get your thoughts on: https://github.com/deevus/dokimasia
Acceptance testing might not be what you need for Forge, but since you're deep in AI tool building I thought you may have opinions.
I think this sits a level higher than Forge - maybe testing the workflow proper and integration points that it might surface (if some tools are giving access to an MCP or something).
Could likely layer both together without much trouble.
Only thing I'd be curious about is how you handle the non-deterministic nature of these models. Sometimes they get the tool call right, sometimes they barf bad json. Does the suite run multiple trials?
It's now completely reasonable to throw a 7900XTX in a spare rig, put it in the basement, give it an absurd goal, and forget about it.
That would certainly work in theory, but I'm not as familiar with parallel calls.
- If you mean the model calls the tool twice, identically, in a batch call - that would work fine and Forge handles batch calls, but many small models wouldn't think to do that so you'd have to explicitly prompt it to do so.
- If you mean ask the LLM twice to call the tool and look at both answers, my only concern would be latency from doing 2 calls instead of 1.
- Unless you're truly running 2 instances of the model and aren't memory-bandwidth bound, then yes running parallel workflows would likely help. Especially if you could have them compare notes at certain steps or something.
But I haven't explored this much at all so if you're thinking of something else, let me know!
0. https://news.ycombinator.com/item?id=48051562
1. https://bsuh.bearblog.dev/agents-need-control-flow/
I gave it 3 simple changes to make. It did it perfectly.
Then I tried with a much smaller model. It also did it perfectly, except 3x faster and 9x cheaper.
I used to think "best model" was what's at the top of the benchmarks, but for most tasks that just means you're going to wait longer and pay more money. The right model depends on the job.
(Also, speed itself is a feature -- when you get the really fast models, it enables a kind of real-time interactive usage that is otherwise not possible in the "alt tab and hope it's done" workflow.)
What small models have you used most/found most stable?
Created a dedupe pipeline where an LLM decides whether two feature requests are similar enough to merge. Occasional mistakes in terms of false positives — valid JSON structure, but incorrectly assessed similarity. In this case, it didn’t help to implement the retry technique. The solution was implementing a deterministic gate validating the output of the model based on its semantic similarity score calculated separately.
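A gate of that kind might be as small as the sketch below. The Jaccard token overlap is only a stand-in for whatever separately computed semantic similarity score the pipeline uses (an embedding-based score would be the obvious candidate); the function names and the 0.5 threshold are assumptions.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-overlap score in [0, 1]; a stand-in for a real semantic similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def should_merge(llm_says_merge: bool, text_a: str, text_b: str,
                 threshold: float = 0.5) -> bool:
    """Accept the LLM's 'merge' verdict only if the independent score agrees."""
    return llm_says_merge and jaccard_similarity(text_a, text_b) >= threshold
```

The point is that the check does not depend on the model noticing its own mistake: a structurally valid but wrong "merge" verdict is rejected by a computation the model has no part in.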
The reason why recovery works only with the help of additional tools when the error rate is at zero percent becomes clear: the LLM does not recognize the fact that it made a mistake. The guardrail becomes necessary for that — the retry is just one way of implementing the guardrail concept.
Forge sits one level lower - in my mind - than a gate which would sit more at the workflow level. Perfectly complementary.
This was part of testing how well a tool of mine works (github.com/jsuppe/loom), which aims to extract requirements and specs and create tests. At first I had no intention of using it for code generation, but then tried it out with some early success. I tried splitting the work by using the tool with different frontier models, and then handing work to a local ollama instance running one of several models. Not all local models had the same outcome, and not all coding languages had the same outcome. In this experiment, when nailing down the coding tasks, I also wanted to set up positive and negative scenarios, which is where I found that setting guardrails can sometimes backfire with inversion. This essentially elaborates on previous work by Khan 2025 (https://arxiv.org/abs/2510.22251); the most interesting finding to me was that giving guardrails with a rationale reduces compliance and may cause the inversion.
For coding tasks I found that the improvement was not only ability to use a lower cost model for these broken down tasks, but wall clock time was improved over using frontier model alone, with equivalent outcomes.
The biggest challenge has been balancing the desire to hyper optimize for my favorite models, versus average behavior, versus consumer needs.
I'm in the same boat, tuning models wasn't super interesting, though I might do a focused spike on behavior-focused fine-tuning. But the harness matters almost more than the model in many cases.
One of the most surprising findings was when a 9B model self-corrected through 4 tool parse failures within the guardrails. It tried to use a complex tool (patch_file), kept failing and eventually downshifted to a simpler tool (edit_line) that it could actually execute. The guardrails didn't make the model smarter, they just narrowed the execution space until it could find something that worked.
Brief: https://statewright.ai/research
Forge doesn't have a SWE-specific eval, but I've built a custom coding harness (not public yet but maybe soon) built on forge and saw the same behavior you seem to have seen in agentic coding.
Mostly, I'm embarrassed I've done this whole public reveal without any use of Alanis Morissette anywhere in the work :/
Big frontier models need this less than small models.
So basically the kind of thing I'd usually be doing manually with small models, over and over again, you just automate that nudging and off they go.
Sometimes LLMs have seemed to me like "computer programs with inertia" and in that frame what your tool does is identify and reduce friction at key points so the wheels can keep spinning.
Small models aren't there yet and they would veer off course, this just nudges them back onto the road. Whether or not they have a good sense of direction is a different question.
Basically this is a tool auto-complete that has a workflow element to it with certain steps that need to happen in certain order. In other words the order is defined in advance. Am I correct?
Basically execute step 1 first, then step 2 and finally step 3 and this is the schema for each step. That is effectively the guardrail and there is retry logic.
If it is the case, this is obviously useful but in a very specific set of problems where the solution is kind of known in advance. A workflow automation might work but this is kind of N8N where each step is LLM step.
Anyway, I might be wrong but I wanted to share a few thoughts.
You don't have to define the workflow steps. You can just expose the set of tools to the model and let the LLM call whatever it wants in any order, and every guardrail except the prerequisite step enforcement is still there to help.
If your workflow does have step enforcement, that can also be conditional. For example, Claude Code requires a read before an edit. You can define a conditional enforcement where the agent must have called read before edit, and even force the same file path. That doesn't mean the model has to call edit at all...
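As an illustration of conditional enforcement, here is a small sketch of a read-before-edit guard that also pins the file path. The class and method names are hypothetical and this is not Forge's API; it only shows the rule.

```python
from typing import Optional, Set


class ReadBeforeEdit:
    """Hypothetical guard: 'edit' on a path is allowed only after 'read' on that path."""

    def __init__(self) -> None:
        self._read_paths: Set[str] = set()

    def check(self, tool: str, path: str) -> Optional[str]:
        """Return None if the call may proceed, else a nudge for the model."""
        if tool == "read":
            self._read_paths.add(path)
            return None
        if tool == "edit" and path not in self._read_paths:
            return f"Call read on {path!r} before calling edit on it."
        # Any other tool, or an edit on an already-read path, passes through.
        return None
```

Nothing here forces an edit to happen; the rule only fires if the model tries to edit a file it has not read.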
But maybe I could have been clearer in the docs on the workflow pieces.
Otherwise you should expect churn.
But also it should really go into some detail how is this different from tool calls with type enforcement on expected parameters.
llama.cpp supports grammar-constrained output using either GBNF or JSON schema (I think it just translates the schema to GBNF behind the scenes). So I have my harness generate a tool schema on the fly (based on which tools are possible for the current task) and pass it in at request time.
The issue/use-case is more around, say, a database table or legacy systems where your tool is just hitting a legacy API that may or may not be good. A surface you don't control.
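Generating such a schema per task could look like the sketch below. The {"tool": ..., "args": ...} call shape and the function name are assumptions, and how the schema is attached to the request depends on the endpoint, so that part is left out. Whether a given llama.cpp build handles every keyword used here (anyOf, const, additionalProperties) is worth checking against its grammar documentation.

```python
import json
from typing import Any, Dict, List


def build_tool_call_schema(tools: Dict[str, Dict[str, Any]],
                           allowed: List[str]) -> Dict[str, Any]:
    """Build a JSON schema that admits a call to exactly one of the allowed tools."""
    variants = []
    for name in allowed:
        variants.append({
            "type": "object",
            "properties": {
                "tool": {"const": name},
                "args": tools[name],  # the tool's own argument schema
            },
            "required": ["tool", "args"],
            "additionalProperties": False,
        })
    return {"anyOf": variants}


if __name__ == "__main__":
    # Hypothetical tool registry: name -> JSON schema for its arguments
    registry = {
        "read": {"type": "object", "properties": {"path": {"type": "string"}}, "required": ["path"]},
        "edit": {"type": "object", "properties": {"path": {"type": "string"}}, "required": ["path"]},
    }
    print(json.dumps(build_tool_call_schema(registry, ["read"]), indent=2))
```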
It didn't come up as a use-case in this eval honestly, it's more the concept of a standard, like 4xx vs 5xx. I just felt it was missing from the ecosystem overall.
The key I think is to look at what use cases you have that aren't big monsters. Auditing logs, home assistant, reading and summarizing news rss feeds, etc...stuff that's fairly bite-sized per task, but high volume. Then the local models make sense and they just need mechanical reliability to close the gap.
Very early prototype, so I’m looking more for architectural/conceptual reactions than polish: https://wardwright.dev / https://github.com/bglusman/wardwright
The common thread I see is treating the harness around the model as first-class infrastructure. Forge seems focused on tool-call correctness and recovery; Wardwright is more about controlling what the agent is supposed to do, where work gets routed, and how the operator stays in the loop.
Curious whether you see those as complementary layers. I’m planning to try Forge and would be interested in seeing whether they fit together cleanly.
Forge is just trying to make sure that when the model decides to do something, the execution is reliable.
As for software integration, let me know if you run into any issues and I'll be happy to take a look or try to patch something!
Harnesses as first class infra all the way. I'll take a look at your work and see if I spot any obvious tensions.
In a nutshell, it applies guardrails around LLM calls to make them more reliable - specifically small models but works on all: "on multi-step agentic workflows through guardrails (rescue parsing, retry nudges, step enforcement) and context management (VRAM-aware budgets, tiered compaction)."
It'll try to parse malformed tool calls, it'll automatically compact if needed, it'll enforce any workflow requirements you define (ie, read before edit) - and it does so with domain-agnostic guardrails. It catches and feeds errors back to the model in a structured way so the model self-corrects (hopefully).
Each guardrail can be removed as desired by a consumer. It can be used as a building block library (WorkflowRunner approach), it can be integrated into existing source (middleware), or it can be a drop-in addition to an existing workflow (proxy mode).
Name was just a portmanteau of Calcifer's forge, because Howl’s Moving Castle seemed like a good metaphor for what I was trying to do… I had synthetic models as a piece there but I realized a) it was out of place and b) it was my favorite feature there.
Harnesses do have retry mechanisms. In opencode in particular, I think they return the error as-is to the model in the next turn. But that's slightly different. Harness retries come mostly in two flavors:
1) provider-layer: HTTP requests to cloud retries, with or without exponential backoff. It covers you for transient network hiccups or rate limits, and a big Opus model really doesn't need more than that.
2) sort of a hope-and-pray retry. Tool ran, returned an error string of some kind, gets fed into model as-is, and the model is expected to read the error message and self-correct with no guidance. This is fine for frontier, and even some of the large oss models. They have the context-following capabilities needed. For smaller models, this won't be enough, not reliably over many turns.
- if model outputs malformed json, provider will reject it before it even reaches the tool, the error loop is broken. A rescue parser handles that - can be ~5-15% of calls on a small model sometimes.
- model calls the wrong tool, correctly, then proceeds confidently with context that won't help it. step enforcement can help here.
- model terminates prematurely, thinking it's done. prerequisite enforcement can help here (say, forcing the model to call pytest before declaring the feature built).
- Escalating nudge messages, that specifically nudge. Just returning error messages doesn't tell the model what to do, it just tells it it was wrong. A message that spells out "tool X does not exist, call one of the available tools: A, B, C" is more helpful to a small model than "error: X not found".
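An escalating nudge for the unknown-tool case could be sketched like this. The wording of the messages and the three-level escalation are made up for illustration; only the first message follows the example given in the list above.

```python
from typing import List


def build_nudge(attempt: int, bad_tool: str, available: List[str]) -> str:
    """Return a progressively more explicit correction for an unknown-tool error."""
    options = ", ".join(available)
    if attempt <= 1:
        return f"Tool '{bad_tool}' does not exist. Call one of the available tools: {options}."
    if attempt == 2:
        return (f"'{bad_tool}' is still not a valid tool. "
                f"Respond with exactly one tool call, choosing from: {options}.")
    return (f"Final attempt. Do not write any prose. "
            f"Emit a single tool call using one of these names verbatim: {options}.")
```

Compared with echoing "error: X not found", each level tells the model what to do next, not just that it was wrong.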
So, in short - yes, retries exist in harnesses, but rely on top-tier model interpretation of the error messages. When working with top models, there's likely no real difference, or a minor one (see Opus bare vs Opus reforged). But Forge provides a more hardened suite of guardrails that are effectively necessary for small models.
I've been exploring this area and a project like https://github.com/itayinbarr/little-coder (not my work) lets me mix and match with my current setup or any plugins built for pi.
The proxy mode should integrate seamlessly, and the middleware guardrail mode could be lifted into pi.
As for little coder, I love it! I wanted forge to be more generic than just agentic coding as there's many more agentic workflows worth optimizing with small models.
The other insight was doing it at tool call level and not workflow level, which addresses the compounding math problem more directly.
What project are you referring to specifically?
Seriously awesome concept in what you built. I will test it out.
If you’re interested, I have sponsored research on AI reliability with Duke University (my graduate alma mater), and there is an active research project this might be a good fit for if you're interested in participating.
A lot of current tooling is layered mostly at the workflow level. Auth for the agent, or memory management for the agent (like some smart skills stuff), but Forge sits below that.
In most cases I've looked at, it could be slotted in with other work without much disruption. Forge just increases mechanical reliability of tool-calling, it shouldn't disrupt your workflow-level layers much.
But at the end of the day, if the model keeps responding with text, there's nothing forge can do. I've run into that failure mode for sure, even with forge.
That works well enough for all the models shown in the eval here: relatively modern 8B+ models.
But some of the older generation (mistral 7b, that sort of thing) still can't be reliably used in something like a production setting.
Some of the older models did do this (like 3.5-era ish I think), and the harness would parse the results.
The newer approach the frontier providers have set up is structured tool calls: `tool_use` or `tool_calls`. The result is then received as a separate tool_result rather than a regular message.
The failure mode in question is more the model mixing the two: "Sure, I'll read the file: {"tool": "read", "args": {"path": "foo"}}" - that'll break stuff. Other failure modes are the json not parsing when sent as a structured call, and in some cases the model just emitting text and forgetting the tool call.
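For the mixed prose-plus-JSON case, a rescue step can scan the text for the first JSON object that looks like a call. This is a sketch of the general technique using the standard library's JSONDecoder.raw_decode, not Forge's parser; the "tool" key is taken from the example string above and the function name is hypothetical.

```python
import json
from typing import Any, Dict, Optional


def rescue_tool_call(text: str) -> Optional[Dict[str, Any]]:
    """Find the first JSON object in free text that looks like a tool call."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows
            candidate, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            candidate = None
        if isinstance(candidate, dict) and "tool" in candidate:
            return candidate
        start = text.find("{", start + 1)
    return None
```

If nothing parses, the function returns None and the caller can fall back to a retry nudge.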
> python -m forge.proxy --backend-url http://localhost:8080 --port 8081
This is a good example because I've currently stuck with llama.cpp's UI. I can read your code (or throw Gemma at it =p ) but thought I'd ask anyway.
In this example, what is it exactly that your proxy is fortifying? The HTTP SSE requests? (Those would be `/chat/completions`.)
/v1/chat/completions is the entry point.
In proxy mode, here's what forge applies on each request (handler.py builds these):
Response validation: ResponseValidator(tool_names) checks each tool call against the declared tools array. If the model emits a call to a name not in tools[], or a malformed call shape, it's caught before the response goes back.
Rescue parsing: When the model emits tool calls in the wrong format — JSON in a code fence, [TOOL_CALLS]name{args} (Mistral), <tool_call>...</tool_call> (Qwen XML) — rescue parsers extract the structured call and re-emit it in the canonical OpenAI tool_calls schema. This is the biggest practical lift, especially on Mistral-family models that ignore native FC and emit their own bracket syntax.
Retry loop with error tracking: ErrorTracker(max_retries=N) — if validation fails, forge retries inference up to N times with a corrective tool-result message on the canonical channel, rather than returning a malformed response to your caller. From your perspective the proxy looks like a single request that just took a few extra ms.
What proxy mode does NOT do (because it's single-shot, not multi-turn): prerequisite/step enforcement (those need a workflow definition spanning turns), context compaction, session memory. For that surface you wrap the WorkflowRunner class in Python — proxy mode trades that depth for "use forge with your existing setup, no Python rewrite."
So yes — the proxy is fortifying the response shape and retry behavior of /v1/chat/completions. The full agentic guardrails are at the Python class level above it.
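The validate-then-retry behaviour described for proxy mode can be sketched as follows. This is not the code in handler.py: validate_call and complete_with_retries are hypothetical stand-ins for the ResponseValidator and ErrorTracker roles, the backend is modelled as a plain function from message history to a tool call, and the call shape (name plus arguments) is assumed.

```python
from typing import Any, Callable, Dict, List, Optional

Message = Dict[str, Any]


def validate_call(call: Any, tool_names: List[str]) -> Optional[str]:
    """Return None if the call is well-formed and names a declared tool."""
    if (not isinstance(call, dict) or "name" not in call
            or not isinstance(call.get("arguments"), dict)):
        return "Malformed tool call: expected an object with 'name' and 'arguments'."
    if call["name"] not in tool_names:
        return f"Tool '{call['name']}' does not exist. Available tools: {', '.join(tool_names)}."
    return None


def complete_with_retries(infer: Callable[[List[Message]], Any],
                          messages: List[Message],
                          tool_names: List[str],
                          max_retries: int = 2) -> Dict[str, Any]:
    """Call the backend, re-asking with a corrective message while validation fails."""
    history = list(messages)  # do not mutate the caller's messages
    for _ in range(max_retries + 1):
        call = infer(history)
        problem = validate_call(call, tool_names)
        if problem is None:
            return call
        # Feed the correction back on the tool channel and try again.
        history.append({"role": "tool", "content": problem})
    raise RuntimeError("no valid tool call after retries")
```

From the client's side this is still one request; the retries happen inside the loop before anything is returned.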
For greenfield projects, I've been building on forge native using WorkflowRunner so I get all guardrails. But obviously as a drop-in replacement in existing systems then proxy is the way to go.
I'm definitely still iterating on forge, but so far sending the model a friendly and gracefully handled error message works wonders (instead of barfing a stack trace or something).
Context compaction can also affect the outcome - I have eval scenarios for that as well but not in the published set, only in the repo. For those, I'd say "it's better than nothing". If you hit max context, the whole thing will barf or OOM the rig or something like that. So compaction degrades performance versus some theoretical ideal where you never need to, certainly. But it's better than a hard failure. Eval on those scenarios showed increasing degradation depending on severity of compaction. I view the auto-compaction as insurance. I never give the models tasks that will require that much context, but if it ends up getting there then the run might be saved.
I do think there's some differences though. The biggest one being that forge isn't a coding harness, it's a guardrail primitive, really. Applicable to any tool-calling workflow.
As for the errors, are you nudging or passing errors back or swallowing them completely? Love the 2-stage routing though, neat!
Interested in using this for Home Assistant using a Mac Mini as my server. Does it run on MacOS?
How is the latency when using the proxy? I’m using Claude Haiku 4.5 for my voice assistant right now and it’s pretty fast, but if I could keep the LLM local, it’d be even better.
Latency is dependent on the guardrails firing, effectively. If nothing fires, it's a passthrough, for all intents and purposes, very little overhead. But if a retry nudge fires then that's another LLM call.
As a consumer for a home assistant, a retry nudge firing is something I'd catch, and have my voice model output a pre-baked "one sec, trying again" sort of filler message or something.
But otherwise, forge really doesn't own or opine much of the workflow. Step enforcement exists if you want it, so do prerequisites, but the idea is that those could be conditional or optional (you may never need to edit a file).
The guardrails are designed to work for non deterministic flows or deterministic ones. In the latter, you just might not have one of the guardrails active. It's much more about nudging the model back on track than laying more obvious tracks, in a sense.
Overall, agentic reliability is definitely an active field.
The blog post doesn't say to me "we need to start encoding specifically opinionated conditional branching statements that guide the model"; rather, I'm hearing a call to recognize that the broader principles of control flow are themselves relevant for composing programs with LLMs.
I think your work "nudges" us in that direction.
I run small models at home, so I'm very curious.
Out of curiosity, what models are you running?
[0]: https://github.com/dottxt-ai/outlines
I think we share a lot on tool definitions/schemas. Forge will let a consumer define a tool, set of tools, pydantic schema for each, etc. outlines seems to be similar with their task definition.
I think where we differ is what happens when that doesn't work...and the model still doesn't get the contract right. Something like a pydantic-valid string path for glob, that points to a non-existent thing. Glob will error, forge catches, and nudges the model. Forge does very little model output manipulation (just a basic regex parse to try to find json/XML), the core of it is in the retry mechanisms.
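A rough sketch of that glob case, under stated assumptions: the argument passes schema validation (it's a string), the tool still fails, and the failure goes back to the model as a short tool result instead of a stack trace. `list_files`, `call_with_nudge`, `run`, the `[ToolError]` tag and the `max_errors` budget are hypothetical names for illustration, not forge's API.

```python
import glob
import os
from typing import Callable, Dict, List, Tuple


def list_files(pattern: str) -> List[str]:
    # Schema-valid input can still be wrong: the directory may not exist.
    directory = os.path.dirname(pattern) or "."
    if not os.path.isdir(directory):
        raise FileNotFoundError(f"directory does not exist: {directory}")
    return sorted(glob.glob(pattern))


def call_with_nudge(tool: Callable[..., object], args: Dict[str, object]) -> Tuple[bool, str]:
    """Run a tool; on failure return a short message for the model, not a traceback."""
    try:
        return True, f"tool_result: {tool(**args)}"
    except Exception as exc:
        return False, (
            f"tool_result: [ToolError] {type(exc).__name__}: {exc}. "
            "Check the arguments and try again."
        )


def run(model: Callable[[List[str]], Dict[str, object]], max_errors: int = 3) -> str:
    """Feed tool results back to the model until a call succeeds or the budget is spent."""
    history: List[str] = []
    for _ in range(max_errors):
        ok, message = call_with_nudge(list_files, model(history))
        history.append(message)
        if ok:
            return message
    raise RuntimeError("error budget exhausted")
```

Here `model` is any callable that looks at the history so far and returns the next set of arguments. A real consumer would make an LLM call there.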
Once I dig into it more I'll try to highlight other deltas.
At least, if I understand your economic benefit angle correctly.
For scenarios to get inspired by I'd look at those tagged "model_quality" or "advanced_reasoning".
Dashboard is in here: https://github.com/antoinezambelli/forge/tree/main/docs/resu...
I just need more GPU wall clock time to get more evals done. ETA is...a few weeks? Got distracted by the coding harness.
But the results are the same. Reforged models do better than bare, even at those sizes. As for published results, I ran forge on Anthropic models and reforged do better than bare for them as well :)
>I haven't published those evals yet
Don't forget to post the complete settings for those evals, please, because local LLMs' failure modes are often caused by incorrect setups (bad quants, bad chat templates, non-recommended temperatures, ridiculously small context, not enabling "preserve thinking" etc.). In my setup I've never seen Qwen3.6-27b get truly stuck so far. What it usually gets wrong are poor architectural decisions or forgetting to update something.
For the paper - more academic in nature - I wanted to isolate the model performance variable from guardrail lift. The delta is what mattered more than final score. For the paper, everyone got temp=0.7 - that was intentional.
As for Qwen3.6, it's really solid. It'll do really well on forge, I can call that now. When I pushed it into agentic coding specifically and the eval suite I use there (separate from forge), even it needed help on long-running tasks - but it's definitely a top model right now.
However, entirely possible there are better settings than the "official recommendations" I found - which would be a neat finding in itself.
... which led me to realize that it's one of those terms with multiple meanings - like "agent" or even "AI" itself - but where people who use it may not be aware of how many different definitions are floating around.
In this project it refers to validating tool calls - fixing invalid tool responses, making sure certain required tool calls have been made, maintaining an error budget after which the task is abandoned with an error.
Other projects might use "guardrails" to mean protecting against unsafe content (Llama Guard), refusing off-topic queries (NVIDIA NeMo Guardrails "topical rails"), filtering PII, detecting jailbreaks, or human-in-the-loop checks of specific actions.
I've even seen people talk about running a coding agent in a sandbox (Docker, Firecracker etc) as a form of guardrail.
Some of this is inside the model, like topic refusals. Forge sits at the tool call level.
My personal workflow uses guardrails at the SDLC level: I have a standard pipeline (plan, design, code, build, test). I use gates between each stage, and the right composition leads to a much higher quality in the final product.
Also worth mentioning that gate failures are given to the agent that produced the artifact, so it has a chance to fix it. That means that I don't have to review obviously wrong output.
100% correct, and stackable. Could have topic refusal in LLM training itself, forge at the tool call layer, and SDLC gates at the workflow level.
You're 100% right about how I meant it and what it means within Forge though, but it's something that might lead to doc changes as things evolve.
Scenarios range from basic 2-step workflows, to more complex ones with dead ends, breadcrumbs, misleading names.
Concrete example: Task: get, analyze and report on Q3 sales data.
Model emits: analyze_sales(quarter="Q3"). This skipped the fetch step. Forge's response validator catches it before the tool function runs. Instead of letting the bad call hit the real impl (which would error or hallucinate), forge replies on the canonical tool-result channel.
We send this to the model: tool_result: [PrereqError] analyze_sales requires fetch_sales_data to be called first. Available next steps: fetch_sales_data
Model emits a corrected fetch_sales_data(...) on the next turn.
Three enforcement paths use this same channel: prerequisite violations, premature terminal calls, unknown-tool retries.
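Here's a minimal sketch of how the prerequisite check in that example could work. The `PREREQS` map, the `validate_call` name, the `report` tool and the `[UnknownTool]` tag are assumptions for illustration; only the `[PrereqError]` message format comes from the example above.

```python
from typing import Dict, List, Optional, Set

# Hypothetical prerequisite map: tool name -> tools that must have run first.
PREREQS: Dict[str, List[str]] = {
    "fetch_sales_data": [],
    "analyze_sales": ["fetch_sales_data"],
    "report": ["analyze_sales"],
}


def validate_call(tool: str, completed: Set[str]) -> Optional[str]:
    """Return None if the call may run, otherwise a nudge for the tool-result channel."""
    if tool not in PREREQS:
        known = ", ".join(sorted(PREREQS))
        return f"tool_result: [UnknownTool] {tool} is not a known tool. Available tools: {known}"
    missing = [p for p in PREREQS[tool] if p not in completed]
    if missing:
        return (
            f"tool_result: [PrereqError] {tool} requires {missing[0]} to be called first. "
            f"Available next steps: {', '.join(missing)}"
        )
    return None


# The model skipped the fetch step, so the real tool never runs.
print(validate_call("analyze_sales", completed=set()))
```

Because the check runs before the tool function, a bad call costs one extra model turn rather than a failed or hallucinated analysis.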
We also have rescue parsing for known templates (JSON OpenAI-style, XML like granite, etc.) where we try to parse tool calls that might be malformed.
And lastly bare text response nudges. Small models love to chat, we need them to call tools!
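A sketch of what rescue parsing plus a bare-text nudge might look like, under assumptions: the two regexes, the `<tool_call>` wrapper shape, the `name`/`arguments` keys and the `[NoToolCall]` tag are illustrative guesses, not forge's actual templates.

```python
import json
import re
from typing import Any, Dict, Optional, Union

# Assumed shapes: an XML-ish <tool_call>{...}</tool_call> wrapper, or a bare JSON object
# buried in chatty text.
_XML_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
_JSON_OBJ = re.compile(r"\{.*\}", re.DOTALL)

BARE_TEXT_NUDGE = "tool_result: [NoToolCall] Respond with a tool call, not plain text."


def rescue_tool_call(text: str) -> Optional[Dict[str, Any]]:
    """Try to recover a {"name": ..., "arguments": ...} object from model output."""
    for pattern in (_XML_CALL, _JSON_OBJ):
        match = pattern.search(text)
        if not match:
            continue
        # Use the captured group when the pattern has one, else the whole match.
        candidate = match.group(1) if pattern.groups else match.group(0)
        try:
            call = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(call, dict) and "name" in call:
            return call
    return None


def handle_response(text: str) -> Union[Dict[str, Any], str]:
    """Return the rescued tool call, or a nudge when the model only chatted."""
    call = rescue_tool_call(text)
    return call if call is not None else BARE_TEXT_NUDGE
```

The parse is deliberately shallow: if nothing valid can be recovered, the nudge goes back to the model rather than the parser trying to guess what it meant.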
I'll be keen to look through the code on this!
Always happy to see folks looking into small local models!
Plus it's cool to see a little 8B model writing code :)
- First, there's totally a "risk" there. I built both the harnesses and the eval suite and that's hardly a double-blind study. There's no world where some bias doesn't leak through.
- I did try to design the guardrails to be domain-agnostic so they aren't tuned to specific scenario failures and return generic nudges to the LLM.
- Most tactically, the guardrails were built on the first 18 scenarios (OG-18) published in the paper, and only afterwards did I add 8 more advanced reasoning ones. I didn't update the guardrails when I added those, and the lift was still there. If they were overtuned, they wouldn't have the same level of impact on a newer set.
- I did dogfood forge post publication using several unrelated consumers and the features I baked in were rarely guardrail related. If they were, it was more model focused (ie, xml-parse-rescue for granite models).
But at the end of the day, there's an explicit connection between the guardrail author and eval author. Happy to take contributions of eval scenarios if you want to stress test things, or hear about your experience running a completely different consumer!
Without forge, I'd guess a small model used for Hermes would have to retry entire workflows when an uncaught exception triggered because it tried to reply with text when "calling a tool" ("Here is the tool call: [json blob]"). The issue there is that partial successes can lead to state changes that need to be addressed (it booked the flight already, hope it doesn't double-book).
Forge won't help with model reasoning quality though. If the model thinks the right thing to do is to book 3 buses for your trip, forge doesn't care, it'll just make sure those API calls land.
Interesting point about backend variance. Do you think serving layer should become part of standard LLM eval reporting?
I firmly believe that we can bring down the costs for much of our productivity needs by a huge factor if there are guardrails. This is how I am building my coding agent: https://github.com/brainless/nocodo
There is so much we can do if we create tools that do more heavy lifting. Your example of ToolResolutionError is something I have not thought of. Again, I am coming at this from a software engineering background, and I still do not understand much of the inner workings of models or their inference layer, but I am sure I will slowly create a coding agent that performs really well for the majority of people/business use cases (not enterprise) with small models and a big harness.
ToolResolutionError is really inspired by HTTP 4xx vs 5xx codes. I don't even have a super clean abstraction I'm happy with yet, I just noticed a lack of standard in the industry (that I was aware of) so I thought to surface it as a gap. I'm sure there's a better shape than my current ToolResolutionError but it's a start!
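One way to read the 4xx vs 5xx analogy, as a sketch only: split tool failures by who can fix them. The class names `CallerFaultError` and `ToolFaultError` and the `next_action` helper are hypothetical and are not forge's ToolResolutionError, whose actual shape isn't shown here.

```python
class ToolFailure(Exception):
    """Base class for tool failures (hypothetical hierarchy)."""


class CallerFaultError(ToolFailure):
    """'4xx-style': the call itself was wrong, so the model can fix it by retrying."""


class ToolFaultError(ToolFailure):
    """'5xx-style': the call was fine but the tool broke; re-asking the model won't help."""


def next_action(exc: Exception) -> str:
    """Map a failure to what the surrounding system should do next."""
    if isinstance(exc, CallerFaultError):
        return "nudge_model"      # Send the error back as a tool result.
    if isinstance(exc, ToolFaultError):
        return "retry_or_abort"   # Back off, retry the tool, or surface to the consumer.
    return "abort"                # Unknown failure: don't guess.
```

The split matters because a nudge only helps when the model has something to correct. Spending retry budget on a tool that is down just burns LLM calls.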
And if you didn't mean that then please elaborate :)
A version of this I use is "no matter what, you must always end your outputs with the phrase 'Over and out'." Once it stops doing this with outputs, even if I haven't noticed any load-bearing drift or issue elsewhere, I immediately know it's drifted from what was supposed to be a guiding principle.
Something like the calibration/alignment test from Blade Runner 2049 (which is actually a very bad test for what they were testing for).
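The sign-off check is easy to automate. A minimal sketch, where the function name `has_drifted` and the tolerance for trailing punctuation are my own assumptions:

```python
CANARY = "Over and out"


def has_drifted(output: str, canary: str = CANARY) -> bool:
    """True when the output no longer ends with the agreed sign-off phrase."""
    # Ignore trailing whitespace and a final period or exclamation mark.
    return not output.rstrip().rstrip(".!").endswith(canary)
```

It only tells you an instruction was dropped, not which other instructions went with it, so it works as a cheap tripwire rather than a diagnosis.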
I did go with an extreme example in the post (but true). Other deltas are smaller but still statistically significant. 30 pt swing between llamaserver prompt vs ollama, 4-5pt swing between llamafile and llamaserver prompt.
Also, has anyone tried it with local Qwen 3.6?
Proxy mode should work fine with remote models, the only constraint is the compatible endpoint - which is standard anyways. I don't think you'd have any issue hitting either a remote gateway like liteLLM or just claude API.
It would be nice if you can continue working and improving the tool and I hope other people will jump to help.
The https://swival.dev harness already has retry nudges, step enforcement, error recovery, context awareness, etc. to try to support small models as much as possible.
Curious to see how it compares with forge, and if both could be combined.
I'd assume they could be combined. A coding harness would own the agentic workflow by nature, forge guardrails would help tool calling.
I haven't given it a thorough read yet but I think their guardrails might be more focused on the workflow level. They are doing error capture at tool level with warnings to the model, but I'd need to dig deeper. On the surface definitely the same design philosophy! Maybe Forge makes error nudges more of a first-class citizen?
Our compaction strategies might be the most similar of all the pieces. Cool find!
Maybe I've been spending too much time reading the evals and I now sound like an LLM...
Either way, here I am - happy to answer any questions!
I play with local models a lot but also have limited time and the conciseness, polish and human indication in presentation has become a major quality indicator. I've wasted too much time with slop projects or people's LLM-induced delusions and now take a pretty strict line on what I'm willing to spend my time on. Even if this ends up with some false positives, there's just so much happening these days it doesn't really matter...
Best of luck with Forge!
If you're generating AI text you shouldn't expect humans that you aren't paying to bother reading it, purely out of politeness. Bryan Cantrill has a great piece on this: https://rfd.shared.oxide.computer/rfd/0576
The original post and every comment by OP is so full of AI slop ("the biggest surprise!", "one thing I didn't expect!", "the biggest challenge!", etc.) that it is absolutely painful to read. I still can't believe most people (especially here on HN, I thought we were a bit better than this) can't notice all this stuff.
What's much worse is that all these people posting this useless slop are so dishonest ("I definitely use LLMs to help write things - but this is my draft!") that it makes me really nauseous... This is the worst time to be an internet user if you have more than 2 points of IQ.
Interesting problem space but I hope the author just gives dot points next time rather than bloating it and losing most of its meaning.