Build a background coding agent that works while you sleep
Written by Utpal Nadiger
I have a backlog of issues I'd rather not touch. Stuff like missing CLI flags, typos, flaky tests, and the list goes on. Out of equal parts curiosity and laziness, I wanted to hand these issues to one of these background coding agents and see if they could handle the work without me having to do it.
After testing hosted options such as Devin, I decided to control the entire stack.
I'll walk you through building your own background coding agent using OpenComputer's VMs and sandboxes in about 250 lines of Python, at roughly $0.30 per task at Claude's current pricing.
Know the basics? Jump to the implementation of the recipe or just grab the full working code.
What you'll build
A self-hosted background coding agent that activates when you add the agent label to a GitHub issue, works out a fix, and opens a draft PR. You only review the result.
Here's how I structured the setup:
- A FastAPI server. Something to catch GitHub webhook events and dispatch them. FastAPI was the easy option.
- The OpenComputer API. Each tagged issue gets its own disposable Linux VM, booted from a base image that already ships Claude Code, git, Node, and Python, with the gh CLI added at boot (or baked into a snapshot, see Step 1). The VM is torn down the moment the run ends, so I pay for the compute only while it's working.
- Claude Code in headless mode. Inside the VM, it reads the issue, edits the repo, runs the tests, and commits.
- The gh CLI. It opens the draft PR as the last step, once the branch is pushed, using a token that is injected only into the commands that need it and never baked into the image.
A typical task spends $0.20 to $0.40 on the Anthropic API, and the VM time underneath it is a fraction of a cent. Route the trivial tickets to a cheaper model, and a task runs for a couple of cents.
This recipe was inspired by Ramp's Inspect agent, where they completely own the stack top to bottom: Sentry, Datadog, LaunchDarkly, and Braintrust wired in, internal tools on hand, dev images refreshed every 30 minutes. The agent is only limited by model intelligence. What we're building is the foundation a background coding agent system can sit on: a VM you control, an agent inside it, and a trigger that wakes it up.
How the pieces fit
Every issue gets its own throwaway Linux VM that boots in 2 to 3 seconds, does the work, and gets torn down when it's done. That one property is what keeps the rest of the design simple. Here's how OpenComputer delivers it.
The persistent box
A sandbox is a full Linux VM with its own kernel, memory, disk, network namespace, and process table, isolated by KVM at the hardware level. You get root, you install what you want, and it persists between runs.
A snapshot works much like a Docker image, with two differences: it captures live VM state, and it's addressed by the content hash of its definition instead of a tag.
Persistent sandboxes on OpenComputer cost $8.76/month for a 1 GB instance, comparable to a VPS or a Railway instance with a similar config. That base tier is more than enough to run a FastAPI server.
A sandbox per issue
When an issue arrives, the server spins up a fresh sandbox for it, runs Claude Code inside, opens the PR, and kills the sandbox once it's done. The repo cloning, code edits, and testing all happen inside that isolated sandbox, never on the server.
Why can't we run everything in the same persistent sandbox?
Nearly everything the agent touches is code you didn't write, running with --dangerously-skip-permissions. A poisoned README or a bad dependency could read the GitHub token or trash the filesystem, so you want it nowhere near the server. Inside a throwaway sandbox, the damage has nowhere to go: the sandbox dies at the end of the run, and the box keeps serving webhooks as if nothing happened.
When a run fills the disk, leaks memory, or wedges in a loop, it takes its own sandbox down with it and leaves the server untouched, and the persistent box never accumulates random code. You also get concurrency without doing anything for it. Two issues labeled at the same moment simply run side by side.
Prerequisites
- Python 3.10 or newer
- An OpenComputer account and API key from app.opencomputer.dev
- An Anthropic API key from console.anthropic.com with Claude Code access
- A GitHub Personal Access Token with repo scope (the agent pushes branches and opens PRs as this token's user)
- A GitHub repository you control with at least one bug-fix-sized issue in it
- A persistent OpenComputer instance to run the webhook server on
Environment setup
Install the SDK and the rest of the dependencies:
pip install opencomputer-sdk fastapi "uvicorn[standard]" httpx python-dotenv
Configure your keys in .env:
.env
OPENCOMPUTER_API_KEY=osb_... # app.opencomputer.dev — API Keys
ANTHROPIC_API_KEY=sk-ant-... # console.anthropic.com — API Keys
GITHUB_TOKEN=ghp_... # github.com/settings/tokens — repo scope
GITHUB_WEBHOOK_SECRET=letmein # any random string, paste the same one into GitHub
PORT=3000
Project structure (only the files this cookbook creates):
- build_snapshot.py: run once to bake the coder snapshot
- agent.py: spins up a sandbox, hands the task to Claude Code, opens the PR
- server.py: FastAPI webhook server
- .env
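The letmein value in the .env example is only a placeholder; a real webhook secret should be long and random. A minimal sketch using Python's standard secrets module (the helper name new_webhook_secret is hypothetical, not part of the cookbook's files):

```python
import secrets


def new_webhook_secret(num_bytes: int = 32) -> str:
    """Return a random hex string suitable for GITHUB_WEBHOOK_SECRET."""
    # token_hex(n) draws n random bytes and returns them as 2*n hex characters.
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Paste the printed line into .env and the same value into GitHub.
    print(f"GITHUB_WEBHOOK_SECRET={new_webhook_secret()}")
```

Run it once, paste the value into .env, and use the same value when you configure the webhook in Step 4.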
Want your coding agent to set this up for you?
Just grab the API keys and tokens and hand this prompt over to your coding agent. It'll take care of the rest.
Build the self-hosted background coding agent described in this post: [LINK]. Read it and follow it as the source of truth for the code and setup.
What it is: a FastAPI webhook server that, when a GitHub issue is labeled `agent`, spins up a disposable OpenComputer sandbox, runs Claude Code headless on the issue, and opens a draft PR.
Files to create:
- build_snapshot.py (optional, one-time snapshot build)
- agent.py (handles one issue inside a sandbox)
- server.py (the webhook receiver)
- .env (secrets and config)
Environment variables (.env):
- OPENCOMPUTER_API_KEY
- ANTHROPIC_API_KEY
- GITHUB_TOKEN (repo scope)
- GITHUB_WEBHOOK_SECRET
Ask me for: the four values above, the target GitHub repo, and confirmation once I've created the `agent` label and pointed a webhook (Issues events) at the server.
Implementation
Three files, built in the order they run: build_snapshot.py (one-time setup), agent.py (handles one issue), and server.py (the webhook receiver).
Step 1: Pre-install the agent's tools into a snapshot
Installing Claude Code and gh on every run would cost ~90 seconds per issue, waiting on npm and apt. We'll instead bake the required tools into a snapshot so every sandbox boots with Claude and gh already in $PATH.
Here's the snapshot file that we're using.
build_snapshot.py
# build_snapshot.py
import asyncio
import sys

from opencomputer import Snapshots, Image
from dotenv import load_dotenv

load_dotenv()


async def main() -> None:
    snapshots = Snapshots()
    image = (
        Image.base()
        .apt_install(["curl", "git", "jq", "build-essential", "ca-certificates"])
        .run_commands(
            # gh CLI from the official apt repo (build runs as non-root, so sudo)
            "curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg "
            "| sudo dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg",
            "sudo chmod go+r /usr/share/keyrings/githubcli-archive-keyring.gpg",
            "echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg]"
            " https://cli.github.com/packages stable main'"
            " | sudo tee /etc/apt/sources.list.d/github-cli.list",
            "sudo apt update && sudo apt install -y gh",
            "sudo npm install -g @anthropic-ai/claude-code",
            # Git identity so the agent's commits have an author
            "git config --global user.email 'agent@sleep.dev'",
            "git config --global user.name 'sleep-agent'",
            "git config --global init.defaultBranch main",
        )
        .workdir("/workspace")
    )
    snap = await snapshots.create(
        name="coder",
        image=image,
        # Stream build output to the terminal as it arrives
        on_build_logs=lambda line: sys.stdout.write(line),
    )
    print(f"\nSnapshot ready: {snap['name']} ({snap['id']})")


if __name__ == "__main__":
    asyncio.run(main())
Run it once:
python build_snapshot.py
The build starts from OpenComputer's Ubuntu 22.04 base (Python, Node, and build tools already present) and layers on what the agent needs: gh from its official apt repo, Claude Code from npm, and a git identity so commits have an author. The first build takes 3 to 4 minutes, but after that it's nearly instant.
Step 2a: Booting and prepping the sandbox
With the snapshot ready, we'll work on agent.py. This is one function that takes a GitHub issue, boots a sandbox, clones the repo, hands the job to Claude Code, and opens a draft PR with the fix.
agent.py
# agent.py
import asyncio
import os
from dataclasses import dataclass

import httpx
from opencomputer import Sandbox
from opencomputer.exec import ProcessResult

USE_SNAPSHOT = os.environ.get("USE_SNAPSHOT", "0") == "1"
# Cheaper/faster than Opus for bug-fix-sized tickets. Override with CLAUDE_MODEL.
MODEL = os.environ.get("CLAUDE_MODEL", "claude-sonnet-4-6")

# Base already ships claude/git/node/python; only gh is missing.
RUNTIME_PREP = " && ".join([
    "curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg "
    "| sudo dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg",
    "sudo chmod go+r /usr/share/keyrings/githubcli-archive-keyring.gpg",
    "echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg]"
    " https://cli.github.com/packages stable main'"
    " | sudo tee /etc/apt/sources.list.d/github-cli.list",
    "sudo apt-get update -qq",
    "sudo apt-get install -y -qq gh",
    "git config --global user.email 'agent@sleep.dev'",
    "git config --global user.name 'sleep-agent'",
    "git config --global init.defaultBranch main",
])


@dataclass
class IssueTask:
    repo: str  # "owner/name"
    issue_number: int
    title: str
    body: str


async def _exec(sandbox, cmd, *, cwd=None, env=None, timeout=60,
                retries=3, check=True) -> ProcessResult:
    """Run a command, retrying transient gateway errors (the platform throws the
    occasional 524). With check=True, a non-zero exit raises with stderr."""
    last = None
    for attempt in range(retries):
        try:
            r = await sandbox.exec.run(cmd, timeout=timeout, cwd=cwd, env=env)
            break
        except (httpx.HTTPStatusError, httpx.ReadTimeout, httpx.RemoteProtocolError) as e:
            last = e
            await asyncio.sleep(4)
    else:
        # The loop never hit `break`, so every attempt failed
        raise RuntimeError(f"exec failed after {retries} tries: {last}")
    if check and r.exit_code != 0:
        # The clone command embeds the GitHub token and failures get posted as
        # issue comments, so redact the token before it reaches the message.
        token = os.environ.get("GITHUB_TOKEN")
        shown, stderr = cmd, r.stderr
        if token:
            shown, stderr = shown.replace(token, "***"), stderr.replace(token, "***")
        raise RuntimeError(f"`{shown[:60]}...` exited {r.exit_code}: {stderr[:500]}")
    return r


async def run_agent(task: IssueTask) -> str:
    token = os.environ["GITHUB_TOKEN"]
    # Secrets are injected per-command (exec env), NOT via Sandbox.create(envs=).
    # Passing them at create time routes egress through the secrets proxy, which
    # blocks all outbound traffic unless an egress allowlist is configured. Per
    # exec keeps the sandbox on open egress so git/claude/pip just work.
    secret_env = {
        "ANTHROPIC_API_KEY": os.environ["ANTHROPIC_API_KEY"],
        "GITHUB_TOKEN": token,
        "GH_TOKEN": token,
    }
    metadata = {"issue": f"{task.repo}#{task.issue_number}"}
    if USE_SNAPSHOT:
        sandbox = await Sandbox.create(
            snapshot="coder",
            timeout=1800,  # 30 min idle ceiling
            metadata=metadata,
        )
    else:
        sandbox = await Sandbox.create(
            template="base",
            timeout=1800,
            metadata=metadata,
        )
    # exec.run is one buffered HTTP call; the SDK's default client timeout (30s)
    # is shorter than a real Claude run, so widen it to the process ceiling.
    sandbox._client._timeout = httpx.Timeout(1700.0)
    try:
        if not USE_SNAPSHOT:
            await _exec(sandbox, RUNTIME_PREP, timeout=300)
        # Clone the repo with a token-authenticated URL
        await _exec(
            sandbox,
            f"git clone https://x-access-token:{token}@github.com/{task.repo}.git repo",
            cwd="/workspace",
            timeout=120,
        )
        branch = f"agent/issue-{task.issue_number}"
        # Write the task so Claude can re-read it when it gets lost mid-loop
        await sandbox.files.write(
            "/workspace/repo/TASK.md",
            "\n".join([
                f"# Issue #{task.issue_number}: {task.title}",
                "",
                task.body or "_(no body)_",
                "",
                "## Working Instructions",
                "",
                "- Read the relevant code before editing.",
                "- Run the project's existing test suite after your changes.",
                "- If tests fail, fix them before stopping.",
                "- Keep the diff focused. Do not refactor unrelated files.",
                "- Delete this TASK.md file before committing.",
            ]),
        )
        await _exec(sandbox, f"git checkout -b {branch}", cwd="/workspace/repo", timeout=30)
        # ... continues below
A fresh VM boots in 2 to 3 seconds. By default, it's OpenComputer's base image, which already ships Claude Code, git, Node, and Python, so RUNTIME_PREP only has to add the gh CLI; set USE_SNAPSHOT=1 to boot the coder snapshot from Step 1 instead and skip even that. The API key and GitHub token are handed to the commands that need them as environment variables rather than baked into the image, and a 30-minute timeout deletes the VM if a run stalls.
Step 2b: Running Claude and shipping the PR
This is the second half of our agent.py file. Here, Claude runs and commits the result and kills the VM so we don't have zombie instances on OpenComputer.
agent.py (continued)
        # Hand it to Claude. --dangerously-skip-permissions is safe here because
        # we're in a fresh, disposable VM (and lets it read CLAUDE.md if present).
        # No retry: re-running a non-idempotent agent would double the spend.
        # check=False so the explicit exit-code check below owns the error message.
        claude_result = await _exec(
            sandbox,
            'claude -p "$(cat TASK.md)" --dangerously-skip-permissions '
            f"--model {MODEL} --max-turns 50 --output-format json",
            cwd="/workspace/repo",
            env=secret_env,
            timeout=1500,
            retries=1,
            check=False,
        )
        if claude_result.exit_code != 0:
            raise RuntimeError(
                f"Claude exited {claude_result.exit_code}: {claude_result.stderr[:500]}"
            )
        # The title is untrusted input. Escape every character the shell still
        # interprets inside double quotes (\, ", $, `) so it cannot inject commands.
        safe_title = task.title
        for ch in ("\\", '"', "$", "`"):
            safe_title = safe_title.replace(ch, "\\" + ch)
        # Commit and push. Claude often commits its own work, so only commit when
        # something is actually staged, then always push (a hard && chain here
        # would swallow the push whenever Claude already committed).
        await _exec(
            sandbox,
            "rm -f TASK.md && git add -A && "
            "(git diff --cached --quiet || "
            f'git commit -m "fix: address #{task.issue_number} ({safe_title})") && '
            f"git push --set-upstream origin {branch}",
            cwd="/workspace/repo",
            env=secret_env,
            timeout=120,
        )
        # Open the draft PR
        pr = await _exec(
            sandbox,
            f'gh pr create --draft --title "fix: {safe_title}" '
            f'--body "Closes #{task.issue_number}\n\n'
            '_Drafted by sleep-agent. Review the diff before merging._"',
            cwd="/workspace/repo",
            env=secret_env,
            timeout=60,
        )
        return pr.stdout.strip()
    finally:
        # Always tear the VM down, even when a step above raised
        await sandbox.kill()
We hand the task to Claude Code in headless mode, capped at 50 turns so it doesn't get stuck in a loop, with the output format set to JSON so we can track the cost per task later.
You'll also note I'm using the --dangerously-skip-permissions flag. That's acceptable here because the VM is disposable and isolated (the main reason we didn't run the background coding agent in our persistent VM). If a Claude run goes awry, the damage stays inside a sandbox that's about to be deleted, away from the server and the rest of our environment.
When Claude's done, we commit, push the branch, and open a draft PR through the gh CLI, which already has the token and prints the PR URL back to us.
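To follow up on tracking cost per task, here is a minimal sketch that pulls the spend out of Claude's JSON output. It assumes the result object carries total_cost_usd and num_turns fields; those field names are an assumption about the CLI's output, so check them against a real run. RunStats and parse_run_stats are hypothetical helpers, not part of agent.py.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunStats:
    cost_usd: Optional[float]  # None when the field is missing or unparseable
    num_turns: Optional[int]


def parse_run_stats(stdout: str) -> RunStats:
    """Extract cost and turn count from `claude --output-format json` stdout."""
    try:
        data = json.loads(stdout)
    except json.JSONDecodeError:
        return RunStats(None, None)
    if not isinstance(data, dict):
        return RunStats(None, None)
    # Field names are assumptions; .get() keeps this safe if they differ.
    return RunStats(cost_usd=data.get("total_cost_usd"), num_turns=data.get("num_turns"))


# Illustrative input only, not output captured from a real run.
sample = '{"total_cost_usd": 0.27, "num_turns": 14}'
stats = parse_run_stats(sample)
print(stats)
```

In agent.py this would be called as parse_run_stats(claude_result.stdout), and the numbers could go into the final issue comment or a log line.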
Step 3: Receive GitHub webhooks and trigger the agent
Our persistent FastAPI server listens for labeled issues and reacts instantly. When an issue is labeled agent, the handler does three things:
- Acknowledge the webhook before GitHub's 10-second timer fires.
- Launch the agent in the background.
- Comment on the issue so you know it triggered.
Here's the server.py file that you can copy over.
server.py
# server.py
import asyncio
import hashlib
import hmac
import os

import httpx
from fastapi import FastAPI, Header, HTTPException, Request
from dotenv import load_dotenv

# Load .env before importing agent, which reads env vars at import time.
load_dotenv()

from agent import IssueTask, run_agent

WEBHOOK_SECRET = os.environ["GITHUB_WEBHOOK_SECRET"].encode()
GH_TOKEN = os.environ["GITHUB_TOKEN"]
GH_API = "https://api.github.com"
GH_HEADERS = {
    "Authorization": f"Bearer {GH_TOKEN}",
    "Accept": "application/vnd.github+json",
    "X-GitHub-Api-Version": "2022-11-28",
}

app = FastAPI()

# Hold strong references to in-flight runs: the event loop keeps only weak
# references, so an unreferenced task can be garbage-collected mid-run.
background_tasks: set[asyncio.Task] = set()


def verify_signature(body: bytes, signature: str) -> bool:
    expected = "sha256=" + hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


async def post_comment(repo: str, issue_number: int, body: str) -> None:
    async with httpx.AsyncClient(timeout=30) as client:
        r = await client.post(
            f"{GH_API}/repos/{repo}/issues/{issue_number}/comments",
            headers=GH_HEADERS,
            json={"body": body},
        )
        r.raise_for_status()


async def handle_labeled(payload: dict) -> None:
    repo = payload["repository"]["full_name"]
    issue_number = payload["issue"]["number"]
    task = IssueTask(
        repo=repo,
        issue_number=issue_number,
        title=payload["issue"]["title"],
        body=payload["issue"].get("body") or "",
    )
    await post_comment(repo, issue_number, "🤖 sleep-agent picked this up. Draft PR incoming.")
    try:
        pr_url = await run_agent(task)
        await post_comment(repo, issue_number, f"✅ Draft PR ready: {pr_url}")
    except Exception as exc:
        await post_comment(
            repo, issue_number,
            f"❌ sleep-agent failed:\n\n```\n{exc}\n```",
        )


@app.post("/webhook")
async def webhook(
    request: Request,
    x_hub_signature_256: str = Header(default=""),
    x_github_event: str = Header(default=""),
):
    # Verify against the raw bytes; re-serialized JSON would not match the HMAC
    body = await request.body()
    if not verify_signature(body, x_hub_signature_256):
        raise HTTPException(status_code=401, detail="bad signature")
    if x_github_event != "issues":
        return {"status": "ignored"}
    payload = await request.json()
    if payload.get("action") != "labeled":
        return {"status": "ignored"}
    if (payload.get("label") or {}).get("name") != "agent":
        return {"status": "ignored"}
    # Fire-and-forget. Don't block the webhook response on the agent run.
    task = asyncio.create_task(handle_labeled(payload))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return {"status": "accepted"}


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=int(os.getenv("PORT", "3000")))
You need to run this on the persistent VM because GitHub gives your server a 10-second window to acknowledge each webhook, so the server has to be listening at all times.
The handler we've designed verifies the signature, checks for the agent label, schedules the work to keep running after the response goes out, and returns in milliseconds. Each labeled issue runs as its own background task with its own sandbox, so two issues labeled at once become two independent runs. The handler comments once when it picks up the issue, then again with the PR link or the error trace when the run finishes.
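To exercise the signature check without waiting for GitHub, you can sign a payload yourself. A minimal sketch, assuming the same HMAC-SHA256 scheme the handler verifies; the secret and payload values are made up for illustration, and sign_payload is a hypothetical helper:

```python
import hashlib
import hmac
import json


def sign_payload(secret: bytes, body: bytes) -> str:
    """Build the value GitHub sends in X-Hub-Signature-256 for this body."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Constant-time comparison, the same check the webhook handler performs."""
    return hmac.compare_digest(sign_payload(secret, body), signature)


# Hypothetical test delivery: only the fields the handler reads.
secret = b"example-secret"
body = json.dumps({
    "action": "labeled",
    "label": {"name": "agent"},
    "repository": {"full_name": "owner/name"},
    "issue": {"number": 1, "title": "Add --version flag", "body": ""},
}).encode()

signature = sign_payload(secret, body)
print(verify_signature(secret, body, signature))         # signed body, unchanged
print(verify_signature(secret, body + b" ", signature))  # body altered after signing
```

The signature covers the exact bytes on the wire, so send body unmodified with the X-Hub-Signature-256 and X-GitHub-Event: issues headers. A valid delivery to a live server starts a real agent run.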
Step 4: Point GitHub at your server
Expose your server with ngrok during development, or serve it straight from the persistent OpenComputer VM in production:
ngrok http 3000
Then in your repo, go to Settings → Webhooks → Add webhook. Set the payload URL to your server's public HTTPS URL with /webhook appended (your ngrok URL in development, or something like https://sb-abc123-p3000.workers.opencomputer.dev/webhook on the persistent VM), set the content type to application/json, paste in the same secret you put in GITHUB_WEBHOOK_SECRET, choose "Let me select individual events", and check Issues only.
Create the label the agent listens for under Issues → Labels → New label → agent.
Step 5: Run it and ship your first sleeping PR
Start the server. You could also turn the server into a service to allow the script to run persistently.
python server.py
To test this, open an issue that's small and quick to resolve. Some good first targets:
- Add a --version flag to the CLI that prints the version from pyproject.toml.
- Sort the /api/users response by created_at descending.
- Replace bare except: clauses with except Exception: across src/.
Label the issue agent and come back in a couple of minutes; the exact wait depends on repo size and how much code Claude has to read. The bot comments with the PR URL when it's done.
Read the diff, mark it ready for review if it's close, or comment with more direction and re-label to run it again.
What it actually costs
A typical issue runs $0.20 to $0.40 in Anthropic API spend, and the VM time underneath it (3 to 8 minutes, then gone) comes in under a cent. To spend less, cut Claude's turns or route the easy stuff to Haiku, because fiddling with the VM won't move the needle.
The best predictor of cost is how much Claude has to read to build context.
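Putting the post's numbers together gives a rough monthly bill. The sketch below is back-of-the-envelope arithmetic, not a pricing calculator: the 100 tasks per month is a made-up workload, and the $0.01 VM cost per task is an assumed upper bound taken from "under a cent".

```python
PERSISTENT_BOX_USD = 8.76  # 1 GB persistent sandbox per month, from the post


def monthly_cost(tasks: int, api_usd_per_task: float,
                 vm_usd_per_task: float = 0.01) -> float:
    """Persistent box plus per-task API and VM spend, rounded to cents."""
    return round(PERSISTENT_BOX_USD + tasks * (api_usd_per_task + vm_usd_per_task), 2)


# Hypothetical workload: 100 labeled issues a month at the post's $0.20-$0.40 range.
low = monthly_cost(100, 0.20)
high = monthly_cost(100, 0.40)
print(f"100 tasks/month: ${low:.2f} to ${high:.2f}")
```

The API term dominates at any realistic volume, which is why trimming turns or switching models matters more than tuning the VM.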
Where your background coding agent can fail
A handful of failure patterns showed up often enough to name.
Claude refactoring files you didn't ask it to
Claude opens a file, decides the indentation is wrong or a name could be clearer, and reformats it on the way past. The TASK.md line ("Keep the diff focused. Do not refactor unrelated files.") is honored maybe 70% of the time; the other 30% is why you still need manual code review.
The edit-test-fail loop
Claude makes a change, a test fails, it tries something else, that fails too, and the cycle repeats for forty turns. The --max-turns cap stops it eventually, but the loop can still burn $0.50 first, and you need to figure out what happened to avoid wasting tokens.
Repos with no obvious test command make this worse, since Claude can easily go through the turn limits just figuring out how to run the tests.
Drop a CLAUDE.md in the repo root with the exact test command and your conventions. The headless CLI reads it on startup, and it's the single change that moves the success rate the most.
GitHub PATs rate-limit at 5,000 req/hour
A single run spends a few dozen REST calls (read the issue, post comments, push the branch, open the PR), so one PAT handles about a hundred runs an hour. Run ten agents in parallel off the same token, and you'll hit the ceiling, especially if that PAT also drives other automation. Once you're there, switch to a GitHub App with per-installation tokens, since each installation gets its own 5,000/hour budget.
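The "about a hundred runs an hour" figure is simple division. A small sketch, assuming roughly 50 calls per run as a stand-in for "a few dozen" (measure your own runs before relying on it); max_runs_per_hour is a hypothetical helper:

```python
PAT_LIMIT_PER_HOUR = 5_000  # rate limit for a personal access token, from the post


def max_runs_per_hour(calls_per_run: int, reserved_calls: int = 0) -> int:
    """How many agent runs fit in one hour of a PAT's REST budget.

    reserved_calls is the share of the budget other automation already uses.
    """
    if calls_per_run <= 0:
        raise ValueError("calls_per_run must be positive")
    available = max(0, PAT_LIMIT_PER_HOUR - reserved_calls)
    return available // calls_per_run


print(max_runs_per_hour(50))                       # whole budget for the agent
print(max_runs_per_hour(50, reserved_calls=2000))  # PAT shared with other tools
```

If the second number is close to your real volume, that's the signal to move to a GitHub App.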
Now for the tradeoff
It feels awesome to have work done without lifting a finger. But it's not all A-OK.
I'd spend a lot of time testing and refining these workflows on low-risk repositories before they ever touch production.
That's because we have to trust Claude (or another LLM) to make judgment calls in a sandbox with no human review until the draft PR is opened. And considering how easily LLMs overengineer solutions, I would not start merging PRs blindly.
Apart from that, I don't see much of an issue. A well-built agent could be extremely helpful in getting the simpler parts of your day-to-day workflows settled.