Demo: https://www.youtube.com/watch?v=oyfTMXVECbs
Before Twill, when we were building with Claude Code locally, we kept hitting three walls:
1. Parallelization: two tasks that both touch your Docker config or the same infra files are painful to run locally at once, and manual port rebinding and separate build contexts don't scale past a couple of tasks.
2. Persistence: close your laptop and the agent stops. We wanted to kick off a batch of tasks before bed and wake up to PRs.
3. Trust: giving an autonomous agent full access to your local filesystem and processes is a leap, and a sandbox per task felt safer to run unattended.
All three pointed to the same answer: move the agents to the cloud and give each task its own isolated environment.
So we built what we wanted. The first version was pure delegation: describe a task, get back a PR. Then multiplayer, so the whole team can talk to the same agent, each in their own thread. Then memory, so "use the existing logger in lib/log.ts, never console.log" becomes a standing instruction on every future task. Then automation: crons for recurring work, event triggers for things like broken CI.
This space is crowded. AI labs ship their own coding products (Claude Code, Codex), local IDEs wrap models in your editor, and a wave of startups build custom cloud agents on bespoke harnesses. We take a different path: reuse the lab-native CLIs in cloud sandboxes. Labs will keep pouring RL into their own harnesses, so those harnesses only get better over time. That way there's no vendor lock-in, and you can pick a different CLI per task or combine them.
When you give Twill a task, it spins up a dedicated sandbox, clones your repo, installs dependencies, and invokes the CLI you chose. Each task gets its own filesystem, ports, and process isolation. Secrets are injected at runtime through environment variables. After a task finishes, Twill snapshots the sandbox filesystem so the next run on the same repo starts warm with dependencies already installed. We chose this architecture because every time the labs ship an improvement to their coding harness, Twill picks up the improvement automatically.
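The lifecycle above can be sketched as a small state machine. This is an illustrative sketch only, not Twill's actual API or agentbox-sdk: the `Sandbox` class, `run_task`, and the snapshot store are hypothetical names standing in for the spin-up → clone → install → invoke → snapshot flow described here.

```python
from dataclasses import dataclass, field

@dataclass
class Sandbox:
    repo: str
    cli: str                      # e.g. "claude-code" or "codex"
    warm: bool = False            # restored from a filesystem snapshot?
    log: list = field(default_factory=list)

    def step(self, action: str):
        self.log.append(action)

def run_task(repo: str, cli: str, snapshot_store: dict) -> Sandbox:
    # Start warm if a snapshot exists for this repo, otherwise cold.
    box = Sandbox(repo=repo, cli=cli, warm=repo in snapshot_store)
    box.step("restore-snapshot" if box.warm else "clone")
    if not box.warm:
        box.step("install-deps")  # only paid on the first, cold run
    box.step(f"invoke:{cli}")
    # Snapshot the filesystem so the next run on this repo starts warm.
    snapshot_store[repo] = f"snapshot-of-{repo}"
    box.step("snapshot")
    return box

store = {}
first = run_task("acme/app", "claude-code", store)   # cold: clone + install
second = run_task("acme/app", "codex", store)        # warm: deps cached
print(first.log)
print(second.log)
```

Note the second run picks a different CLI against the same warm snapshot, which is the per-task harness choice the post describes.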
We’re also open-sourcing agentbox-sdk, https://github.com/TwillAI/agentbox-sdk, an SDK for running and interacting with agent CLIs across sandbox providers.
Here’s an example: a three-person team assigned Twill to a Linear backlog ticket about adding a CSV import feature to their Rails app. Twill cloned the repo, set up the dev environment, implemented the feature, ran the test suite, took screenshots, and attached them to the PR. The PR needed one round of revision, which they requested through GitHub. For more complex tasks, Twill asks clarifying questions before writing code and records a browser session video (using Vercel's Webreel) as proof of work.
Free tier: 10 credits per month (1 credit = $1 of AI compute at cost, no markup), no credit card. Paid plans start at $50/month for 50 credits, with BYOK support on higher tiers. Free pro tier for open-source projects.
We’d love to hear how cloud coding agents fit into your workflow today, and if you try Twill, what worked, what broke, and what’s still missing.

Couple of learnings to share that I hope could be of use:
1) Execution sandboxing is just the start. For any enterprise usage you also want fairly tight network egress control, to limit the chance of accidental leaks or malicious exfiltration if there's any risk of untrusted material getting into model context. Speaking as a decision maker at a tech company: we do actually review stuff like this when evaluating tools.
2) Once you have proper network sandboxing, you can secure credentials much better: give the agent only dummy surrogates and swap them for the real creds on the way out.
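The surrogate-credential idea in point 2 can be sketched as follows. This is a minimal illustration, assuming an egress proxy that can rewrite outbound headers; `CredentialBroker` and its methods are hypothetical names, not any real product's API.

```python
import secrets

class CredentialBroker:
    """Hands dummy tokens to the sandbox; swaps in real ones at egress."""

    def __init__(self):
        self._real_by_surrogate = {}

    def issue_surrogate(self, real_token: str) -> str:
        # The sandboxed agent only ever sees this random placeholder.
        surrogate = "surrogate-" + secrets.token_hex(8)
        self._real_by_surrogate[surrogate] = real_token
        return surrogate

    def rewrite_outbound(self, headers: dict) -> dict:
        # Called by the egress proxy: replace surrogates with real creds
        # on the way out, so the real secret never enters the sandbox.
        out = dict(headers)
        auth = out.get("Authorization", "")
        for surrogate, real in self._real_by_surrogate.items():
            if surrogate in auth:
                out["Authorization"] = auth.replace(surrogate, real)
        return out

broker = CredentialBroker()
dummy = broker.issue_surrogate("sk-real-123")
sent = broker.rewrite_outbound({"Authorization": f"Bearer {dummy}"})
print(sent["Authorization"])  # the proxy restored the real token
```

A nice property of this design is that a leaked sandbox transcript only ever contains worthless surrogates.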
3) Sandboxed agents with automatic provisioning of a workspace from git can be used for more than just development tasks. In fact, it might be easier to find initial traction with more constrained, and thus more predictable, tasks. E.g., "ask my codebase" or "debug CI failures".
[1] https://airut.org [2] https://haulos.com/blog/building-agents-over-email/
I've also (separately) got a tool for local dev that sets up containers and does SSL interception on traffic from the agent, so it could also swap creds and similar.
The reason they're separate is that in a corp environment the expectation is very strongly that an email account = a human. You can't easily provision full employee accounts for AIs, HR doesn't know anything about that :) In my own company I am HR, so that's not a problem.
I love the idea of emailing agents like we email humans! Thank you for sharing your learnings:
1. Network constraints vary quite a bit from one enterprise customer to another, so right now this is something we handle on a case-by-case basis with them.
2. We came to the same conclusion. For sensitive credentials like LLM API keys, we generate ephemeral keys so the real keys never touch the sandbox.
3. Totally right, we support constrained tasks too (ask mode, automated CI fixes). We've gone back and forth on whether to go vertical-first or stay generic. We're still figuring out where the sweet spot is. The constrained tasks are more reliable today, but the open-ended ones are where teams get the most leverage.
Obviously cloud is better for making money, and some kind of VPC or local cloud solution is best for enterprise, but perhaps for individual devs, a self-hosted system on a home desktop computer running 24/7 (hybrid desktop / server) would be the best solution?
My efforts will be in improving agentic requirements gathering and assessment.
This assertion needs some support for those of us that don't have a macro insight into the industry. Are you seeing this from within FAANG shops? As a solo developer? What? Honest question.
I anticipate that once I have some more complex agentic scaffolds set up to do things like automatically explore promising directions for the project, then leaving the AI system on overnight becomes a necessity.
I also have Claude Cowork automations running constantly. As-is, I can't shut down my laptop, and it gets frustrating when my laptop is in my backpack all day because of commutes or travel.
Cloud starts to matter when you want to (a) run a swarm of agents on multiple independent tasks in parallel, (b) share agents across a team, or (c) not worry about keeping a machine online
Other than that, I agree with what you said. I don't know what the tradeoffs for local on-premises and cloud agents are in terms of other areas like convenience, but I do think that scalability in the cloud is a big advantage.
Additional question - what types of sandboxes do you use? (just Docker, or also Firecracker, etc.?)
Original comment:
Congrats on the launch!
What's the benefit over Cursor cloud agents with computer use? (other than preventing vendor lock-in?)
https://cursor.com/blog/agent-computer-use
Or the existing Claude Code Web?
The biggest practical difference from cloud solutions: agents that run on your own machine can interact with your actual environment. Our agents browse the web, manage a Discord bot, push to git repos, and read email. They share a filesystem so one agent's output is another agent's input.
The tradeoff is obvious: no isolation, no scaling, and if your home server goes down, everything stops. But for a single developer who wants an AI that actually does things (not just produces PRs), local gives you reach that sandboxed cloud agents cannot.
The "prompt to PR" model is clean for dev work. For everything else (marketing, monitoring, data collection, content creation), agents need to touch the real world, and that is harder to sandbox.
On triggers: cron, GitHub (PRs, issues, @twill mentions in review comments), Slack, Linear, Notion, and Asana webhooks, plus CLI and web. In the PR-comment workflow, you tag @twill with an instruction. That said, you can also set up a daily cron on Twill that checks PRs with a specific label like "Confidence Score: x/5" and tell it to auto-approve when it's 5/5, for example.
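The cron-driven auto-approve check described above comes down to parsing a confidence label and gating on the score. A minimal sketch of that decision logic (the label format is the one from this thread; the function name is illustrative, and fetching/approving PRs is left out):

```python
import re

def should_auto_approve(labels: list[str]) -> bool:
    """Return True only when a 'Confidence Score: x/5' label reads 5/5."""
    for label in labels:
        m = re.fullmatch(r"Confidence Score:\s*(\d)/5", label)
        if m:
            return int(m.group(1)) == 5
    return False

print(should_auto_approve(["Confidence Score: 5/5"]))  # True
print(should_auto_approve(["Confidence Score: 3/5"]))  # False
print(should_auto_approve(["bug"]))                    # False
```

In a real cron task the labels would come from your Git host's API and a 5/5 hit would trigger an approval review; everything else is left for a human.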
On setup scripts: per-repo entrypoint script, env vars, and ports, all accessible in the UI. There is a dedicated Dev Environment agent mode that you start with to set up the infra. You can steer the agent on how to set things up if it gets stuck, so this should be smooth. The agent can also rewrite the entrypoint mid-task.
There is also a Twill skill you can add to your local agents to dispatch tasks to Twill. Meaning you can research and plan locally using your CLI and delegate the implementation to a sandbox on Twill.
However, reusing the GitHub workflows out of the box feels really nice.
One question, do you have plans for any other forms of sandboxing that are a little more "lightweight"?
Also how do you add more agent types, do you support just ACP?
For the lightweight sandbox, can you give an example?
Currently we support the main coding CLIs; ACP support hasn't shipped yet.
For example, Monty by the Pydantic team, or the Anthropic sandbox, which I believe uses OS-level primitives.
1. It’s really not that hard to stand this up on your own. GitHub agentic workflows gets you 95% of the way there already.
2. Anthropic and Cursor are already playing in this space and will likely eat your lunch.
IMO, the only way you can survive is to make this deployable behind the firewall. If you could do that then I would seriously consider using your product.
On labs eating our lunch: it's definitely a risk. Our bet is that reusing the lab-native CLIs is enough to position ourselves in the market.
On behind the firewall: it's something we're looking into. We open-sourced agentbox-sdk in that direction.
Defining the plan and acceptance criteria for a long-running task is the hard part.
We recently added a Ralph loop mode in that spirit. The implementation won't start until the human and agent align on verifiable criteria and a different agent judges if criteria are met at the end of each run.
Overall I think this problem is not yet completely solved, and improvements to both the UX and model judgment are needed.
The Ralph loop mode also has the concept of a per-task budget.
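The shape of that loop can be sketched in a few lines. This is a hedged illustration of the pattern described above, not Twill's implementation: align on verifiable criteria up front, iterate, have a separate judge decide whether the criteria are met, and stop when the budget runs out. All names and the toy `implement`/`judge` stand-ins are hypothetical.

```python
def ralph_loop(criteria, implement, judge, budget: int) -> dict:
    # No implementation starts until verifiable criteria exist.
    if not criteria:
        raise ValueError("no verifiable acceptance criteria agreed")
    for attempt in range(1, budget + 1):
        result = implement(criteria)
        if judge(result, criteria):       # a *different* agent judges the run
            return {"done": True, "attempts": attempt}
    return {"done": False, "attempts": budget}

# Toy stand-ins: each run satisfies two more criteria; the judge
# simply checks whether all criteria are covered.
state = {"passing": 0}

def implement(criteria):
    state["passing"] = min(state["passing"] + 2, len(criteria))
    return state["passing"]

def judge(passing, criteria):
    return passing == len(criteria)

result = ralph_loop(["tests pass", "lint clean", "docs updated"],
                    implement, judge, budget=5)
print(result)  # finishes on the second attempt
```

The budget cap is what makes the mode safe to fire and forget: a task that never converges burns a bounded number of runs, not an unbounded one.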
You are also free to swap or combine these harnesses as you please, which is something Anthropic can't do. For instance, Claude Code implements and Codex reviews.
So the SWE workflow is pre-built (research, planning, verification, PR, proof of work). Twill is also agnostic to the agent, so you can use Codex, for instance. Additionally, you have more flexibility on sandbox sizing with Twill.
Are there benchmarks out there that back this claim?
Unfortunately, it's undesirable in practice because people are token-constrained even before this. One option is retrying only on failure, but even that is a bit tricky...
Install LXC on a server. Start a container called dev.
Add 25 lines to your zshrc
I say dev1 and it spins up fresh
Dev2 copies from that and is a fresh container.
Auto uses tmux.
Claude Code with bypass mode. Do anything. Close laptop. Come back later.
It even has a lock mode blocking all internet access except to the LLM provider.
SSH key agent forwarding goes through the 1Password CLI, so it can't even push to GitHub unless I reconnect.
I feel like the Dropbox quote from years ago, but it's a lot easier than people think, and it's weird to delegate to another service something a dev should already understand how to do.
All I have to do to get the same issue-to-PR flow is open dev1, open Claude, and have the GH CLI and the task-system MCP.
Do /loop watch for new ticket assigned to me and complete it and push it up.
However, once you want to trigger tasks from Slack, Linear, or GitHub issues or onboard teammates who aren't comfortable wiring up LXC + tmux + agent forwarding, a managed layer is needed.
I think we're at a moment where builders with great setups like yours and products like ours are feeding each other good ideas. The patterns you figure out in your zshrc inform what we productize, and the workflows we ship give you new things to try. It's a virtuous circle. Everyone should use the right-sized solution for their situation.
I am anecdotally aware of at least one project where the author can’t recall off the top of his head what the stack is.
This is what enables Twill to self-verify its work before opening a PR.
- Twill is CLI-agnostic, meaning you can use Claude Code, Codex, or Gemini. Jules only works with Gemini.
- We focus on the delegation experience: Twill has native integrations with your typical stack, like Slack or Linear. PRs come back with proof of work, such as screenshots or videos.
Something very useful that will most likely be harder for you is code search: having a proper index over hundreds of code repos so the agent can find where code is called from, or work out what the user means when they use an acronym or a slightly incorrect name.
It's quite nice to use and I'm sure someone will make a strong commercial offering. Good luck
That said, there are workarounds, like cloning all the repos and enabling LSP (coding CLIs have added that feature), or using a dedicated solution for codebase indexing and adding a skill/MCP.
Super fast models spamming grep commands are also fun to watch!
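The "clone everything and grep" workaround mentioned above amounts to a naive cross-repo text search with no index. A minimal sketch, with illustrative repo names and a made-up `import_csv` symbol (this is what the agent's grep spam approximates, minus ranking and language awareness):

```python
import tempfile
from pathlib import Path

def find_references(root, symbol):
    """Brute-force scan every .py file under root for a symbol."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if symbol in line:
                hits.append((path.name, lineno, line.strip()))
    return hits

# Toy "multi-repo" checkout to search over.
root = Path(tempfile.mkdtemp())
(root / "repo_a").mkdir()
(root / "repo_b").mkdir()
(root / "repo_a" / "api.py").write_text("def import_csv():\n    pass\n")
(root / "repo_b" / "job.py").write_text("from repo_a.api import import_csv\n")

hits = find_references(root, "import_csv")
print(hits)  # definition in repo_a, call site in repo_b
```

A real index buys you what this can't: symbol-aware lookups, fuzzy matching on slightly wrong names, and sub-second answers at hundreds-of-repos scale.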
Curious to know how you implemented it in house.
Run a copy of this in the same VPC. Monorepos would definitely help, but that's not the structure we have. I didn't want to rely on API limits (or stability) at GitHub for such a core feature.
Using this we've had agents find dead APIs across multiple repos that can be cleaned up and the like. Very useful.
One learning we had is that most of the time, CLI > MCP.
On the cost for solo devs, yeah, if you're one person running one agent at a time on your laptop, the sub is probably the better deal today. No argument there. The cloud agent model starts to make sense when you want to fire off multiple tasks in parallel.
Also you can fire and forget tasks (my favorite) and don't have to keep your laptop running at night.