How I run dev work with Claude, on autopilot and at the keyboard
A few of you asked how my setup works, so here it is end to end. It runs in two modes that share one quality bar. On autopilot, a cron scheduler wakes a Claude Code session every hour to pull a GitHub issue and work it in an isolated git worktree. Before any pull request reaches me, it's been read by a second model and a separate coach agent whose only job is to ask whether the work is good. Hands-on, I'm at the keyboard and we run heavier multi-agent reviews together on bigger changes. Either way, the Codex cloud connector reviews the PR, I have the final say, and I get a Telegram message when something's done.
Quality comes from stacking independent checks. No single model has to be brilliant; enough of them have to disagree. Stack a few and a single miss gets caught in review, before it can become a bad merge.
Not a developer? Start with the plain-language key below — it defines every term the page leans on. None of this is as complicated as the words make it sound.
This is a developer setup, but the ideas behind it are simple. If a term below is unfamiliar, here's the plain version. The rest of the page builds on these.
- issue: A to-do item logged on GitHub, the website where the code lives. The list of open issues is the work queue.
- pull request: A finished change, bundled up and held for review. Nothing in it goes live until it's approved. Usually shortened to PR.
- worktree: A separate, private copy of the project's files. Each session gets its own, so two can run at the same time without overwriting each other.
- session: One run of Claude Code working on a task, from start to handoff.
- model: An AI system. This setup uses four: Claude writes the code, Codex reviews it, Copilot breaks ties, and Gemini sorts email.
- commit / merge: To commit is to save a set of changes as one step. To merge is to accept those changes into the live version of the code.
- cron: A built-in timer that runs a job automatically at set times, with no one watching. The thing that wakes a session every hour.
- CLI: A command-line tool: you run it by typing a command, instead of clicking through an app or website.
- context window: How much text a model can hold in mind at once. It's limited, and the work gets worse as it fills up.
- compaction: Summarizing a long session down to its essentials, to free up room in the context window.
- reasoning effort: A setting for how hard a model thinks before it answers. More effort costs more and runs slower, and isn't always better.
The big loop
[autopilot] The autonomous mode runs as one cycle. Open issues are the queue. A scheduled session pulls one, does the work in a worktree, runs it through the pre-PR review chain, opens a pull request, and pings me. I review on my phone and either approve the merge or send it back. The loop closes and the next wake starts again.
Fig. 1 — the autonomous loop
When sessions wake up
[autopilot] Two Raspberry Pis (small, inexpensive, low-power computers) do the work: a small services Pi runs the scheduler, the Telegram bot, and the notifications poller; a second Pi handles heavier compute. Cron on the services Pi drives everything.
- Scheduled check-ins — a wake fires every hour from 7 a.m. to 9 p.m. This is the one that pulls an issue and progresses it.
- Event wakes — every 15 minutes a lighter pass drains a queue of things that just happened: a new meeting transcript, an email from me, a forwarded message, a Slack mention.
- The notifications poller — also every 15 minutes, it checks email, shared Drive files, and 38 Slack channels, and writes anything interesting into that queue.
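The three schedules above map onto standard five-field cron expressions. A minimal sketch, assuming "7 a.m. to 9 p.m." includes the 9 p.m. wake; the script paths are hypothetical placeholders, since the post doesn't name the real ones:

```python
from datetime import datetime

# Five-field cron expressions: minute hour day-of-month month day-of-week.
# The script paths are hypothetical, not the author's actual files.
SCHEDULE = {
    "checkin_wake": ("0 7-21 * * *", "/opt/wakes/checkin.sh"),
    "event_wake": ("*/15 * * * *", "/opt/wakes/drain_events.sh"),
    "notifications_poller": ("*/15 * * * *", "/opt/wakes/poll_notifications.sh"),
}


def crontab_lines(schedule=SCHEDULE):
    """Render each entry as one crontab line: expression, then command."""
    return [f"{expr} {command}" for expr, command in schedule.values()]


def is_checkin_time(moment: datetime) -> bool:
    """True when `0 7-21 * * *` would fire: on the hour, 07:00 through 21:00."""
    return moment.minute == 0 and 7 <= moment.hour <= 21
```

Printing `crontab_lines()` gives text you could paste into a crontab after swapping in your own paths.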
A wake is a Claude Code session in a tmux pane (a terminal that keeps running on the Pi after I disconnect), at high reasoning effort with a hard timeout (a time limit that ends a session if it runs too long). The session picks its work, does 15 to 30 minutes of it, then hands off: a Telegram summary, a work-log doc, and any board updates.
Fig. 2 — wake and event lifecycle
Fig. 2b — what a finished wake sends me, names and projects redacted
Claude check-in wake complete (18 min)
Midday wake (Fri May 29) — I'm out.
Progressed issue #18 in a repo (tech-debt): a freshness check was flagging years buried in HTML attributes (upload paths, CSS classes, script bodies) as false-positive findings. Fixed it with a parser that reads reader-facing text only (element text + alt / title / aria-label). Live re-audit on a section: 6 false positives down to 2 true ones. PR #61, 17 new tests, 131 green, lint clean. Awaiting the cloud review + my merge.
New since last check-in: Google Docs comments on a draft doc from two colleagues — FYI. No new emails.
codex: 1 finding / 1 fixed / 2 clean passes · coach: applied
Wrap the session in timeout --foreground. Plain timeout starts a new process group and kills the Node tree before it produces output — you get zero-byte result files and silent multi-day outages that look like nothing ran. The idle-killer also watches CPU time in the process subtree, so long, quiet tool phases (a slow API call, a browser run) don't trip a false "this session is stuck."
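As a rough illustration of that wrapper, here is one way a launcher might build the command. The helper names and the wake.sh script are hypothetical; the only things taken from the post are the timeout command and its --foreground flag. The 124 exit status is the documented coreutils behavior when the limit is hit.

```python
import subprocess


def wrap_with_timeout(command, limit_seconds):
    """Prefix a command with coreutils `timeout --foreground <N>s`."""
    if limit_seconds <= 0:
        raise ValueError("limit_seconds must be positive")
    return ["timeout", "--foreground", f"{limit_seconds}s", *command]


def run_wake(command, limit_seconds):
    """Run the wrapped command and report whether the hard limit ended it.

    coreutils timeout exits with status 124 when the time limit is reached.
    """
    result = subprocess.run(wrap_with_timeout(command, limit_seconds))
    return {"returncode": result.returncode, "timed_out": result.returncode == 124}


if __name__ == "__main__":
    # Hypothetical wake script and a 30-minute cap.
    print(wrap_with_timeout(["./wake.sh"], 1800))
```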
What each session is told
[autopilot] The prompt is assembled fresh each wake. A picker shortlists three candidates — an open GitHub issue, a backlog item, or a wildcard — and the session commits to one. Then comes a stack of shared instruction blocks: how to write a work-log doc, how to update the board and CRM safely (my task tracker and contacts list), an SMTP-only email rule (send through a standard mail server, never a model's own email feature), and how to leave a Telegram summary.
A quality-bar check brackets the work. Before it starts, the session has to name the single best move that will make the output better than a rote pass — a fact to verify instead of assert, an earlier work-log to build on instead of repeating it, or a claim to test empirically. Before it wraps up, it re-reads what it produced and confirms it made that move.
The sessions commit under my git identity, so I needed a way to tell "a wake did this" from "I did this." Each wake carries a receipt token like wake-20260529T1405-a1b2c3. A verifier later confirms the session progressed its item by finding that token in a commit, a PR body, or an issue comment — and records a verdict (shipped an artifact, changed state, blocked on me, and so on). No token, no credit, so the record of what each wake moved stays accurate.
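The receipt check can be sketched in a few lines. This assumes the token is a timestamp plus six random hex characters, which matches the example above but is a guess at the real format; the function names and the artifact dictionary are hypothetical.

```python
import re
import secrets
from datetime import datetime

# Assumed shape, inferred from the example token: wake-YYYYMMDDTHHMM-xxxxxx
TOKEN_PATTERN = re.compile(r"wake-\d{8}T\d{4}-[0-9a-f]{6}")


def new_receipt_token(now: datetime) -> str:
    """Build a token from the wake time plus six random hex characters."""
    return f"wake-{now:%Y%m%dT%H%M}-{secrets.token_hex(3)}"


def find_receipt(token: str, artifacts: dict[str, str]) -> list[str]:
    """Return the names of the artifacts whose text contains the token."""
    return [name for name, text in artifacts.items() if token in text]


if __name__ == "__main__":
    token = new_receipt_token(datetime.now())
    artifacts = {
        "commit": f"Fix freshness parser\n\n{token}",
        "pr_body": "Parser now reads reader-facing text only.",
    }
    # An empty result would mean "no token, no credit" for this wake.
    print(token, find_receipt(token, artifacts))
```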
Hands-on sessions
[at the keyboard] Not everything runs on a timer. When I'm at the keyboard the same machinery is there, but I'm in the loop in real time — so we can take on bigger, riskier changes than an unattended wake should, and I can steer mid-flight.
- superjawn for anything non-trivial. I run it through superjawn — my fork of superpowers (a workflow add-on for Claude Code) that forces a research phase before any brainstorming or planning, so the work starts from evidence. It writes the spec before touching code and I approve it; ambiguous specs produce ambiguous code, so the spec is where the time goes.
- Fan out. Instead of one session grinding through a problem, I have it spawn several subagents in parallel (extra AI helpers the main session launches, each on one piece of the problem) — one exploring the codebase, one running a codex review, one acting as the coach, one driving the Copilot CLI — then synthesize what they bring back. More compute on the same problem, and independent passes that catch each other's misses.
- Reviews on demand. The codex and coach passes from the autonomous flow are available here too, and I can fire a Copilot CLI review on a single fix without spending one of the once-per-PR cloud reviews.
- Verify before "done." Nothing gets called finished on a claim. The session has to prove it — run the test and show the output, diff the behavior against main, demonstrate the fix fires. "It should work now" is not done.
- I'm the gate at every step — the plan, the fan-out, the diff, the merge — which is what makes the bigger changes safe to attempt.
Fig. 3 — hands-on multi-agent fan-out
Autopilot and hands-on share the same reviewers and the same merge discipline. What changes is who closes the loop: when I'm away, a Telegram approve/deny on a finished PR; at the keyboard, me approving the plan, the fan-out, and the diff as we go. Both modes run through the next section.
Interfaces have their own AI-slop failure mode, so design gets the same treatment as code. When a session builds a page, it works against an explicit design rulebook: a distinctive typeface over the default sans, one committed aesthetic over a timid mix, and a list of template tells to avoid — the accent stripe down the edge of a card, the kicker label floating above a heading, the lone gradient on plain text. This page went through that pass. Same idea as the code reviews: name the generic defaults up front so the work clears a higher bar than "it renders."
Three reviewers check one change
[both modes] The review stack is the core of the setup, and it runs in both modes. Work gets done with Claude Opus at high reasoning. Before anything is committed, a different model reads the change.
First, Codex reviews the uncommitted diff — the exact lines the change adds or removes, before they're saved — for bugs, security, swallowed errors, and style, in at most two passes: one to surface findings, at most one more to confirm the fixes. The effort is matched to the risk — low for a trivial, well-tested diff; high for a non-trivial or security-touching one. High-severity findings get fixed in place; anything that would balloon the scope becomes its own issue instead.
Then an independent coach: a separate Claude subagent at high reasoning that judges quality. Codex already covered correctness, so the coach reads the diff and answers two questions — would this impress a sharp reviewer or land as competent but forgettable, and what single change would raise it most? It runs even when Codex is down, and skips itself when there's no code to review.
Only then does the change get committed and the PR opened, where the Codex cloud connector reviews it again, independently, on GitHub — reasoning about the live repo, not just the diff. The last gate is me.
A second model reviews the high-reasoning model's work before I open the PR — then a coach asks the question neither bug-checker does: is this good?
Codex reviews at an effort set by the change, not a fixed default. A trivial, well-tested diff gets a low-reasoning pass: it's far lighter on tokens, and it keeps the reviewer from overthinking — turn a code reviewer up too high on a simple change and it starts inventing problems, rewriting things that were already fine, and burying the two findings that matter under a pile of speculation. A non-trivial or security-touching change gets a high-reasoning pass instead, where the stronger judgment is worth the cost and is usually right in one shot. Either way it's capped at two passes, not run to convergence. The model writing the code and the coach judging it always run at high reasoning.
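The two rules in that paragraph, effort chosen by risk and a hard cap of two passes, might look something like this. It's a sketch under stated assumptions: the 40-line threshold for "trivial" is invented for illustration, and review and fix stand in for whatever actually calls the reviewer and applies its findings.

```python
from dataclasses import dataclass


@dataclass
class DiffSummary:
    lines_changed: int
    has_tests: bool
    touches_security: bool


def pick_effort(diff: DiffSummary, trivial_max_lines: int = 40) -> str:
    """Low effort only for a small, tested, non-security diff; high otherwise.

    The 40-line threshold is a made-up placeholder, not the author's number.
    """
    if diff.touches_security:
        return "high"
    if diff.lines_changed <= trivial_max_lines and diff.has_tests:
        return "low"
    return "high"


def review_with_cap(review, fix, max_passes: int = 2):
    """Run at most max_passes reviews, applying fixes between passes.

    Returns (passes_used, remaining_findings). It never loops to convergence:
    findings still open after the last pass are handed back to the caller.
    """
    findings = []
    for pass_number in range(1, max_passes + 1):
        findings = review()
        if not findings:
            return pass_number, []
        if pass_number < max_passes:
            fix(findings)
    return max_passes, findings
```

Anything left over after the second pass is a candidate for its own issue, not a third review round.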
Fig. 4 — the pre-PR review chain
Stacked end to end, that's five independent checks before anything lands — each one looking for something the others don't:
Why two reviewers instead of one stronger one
- Different blind spots. The model that wrote the code is the worst judge of it. A different model, and especially a different vendor, fails differently — so it catches things the author rationalized.
- Different jobs. A correctness pass and a high-reasoning "is this good" pass are not the same review. Asking one model to do both at once gets you a worse version of each.
- It's nearly free. Codex runs on my ChatGPT subscription and the coach runs on my Claude subscription — both through CLIs, no metered API calls. Adding a reviewer costs latency, not dollars.
Issues are the to-do list
[both modes] I abuse GitHub issues on every repo. They're the queue the scheduler pulls from, and they're how out-of-scope work gets captured without derailing the change in front of me. When a session notices a problem outside its current task — a bug it tripped over, a follow-up a reviewer flagged as scope creep — it doesn't widen the diff. It files an issue right there, with a finding, the impact, and a suggested fix, and moves on. That issue becomes a future session's work.
It's the most useful habit in the whole setup. Small PRs stay small and reviewable, nothing gets lost, and the backlog builds itself from real work.
Fig. 5 — issues as the to-do list
A couple of times, a session picked up an issue I'd already half-finished and never closed, and redid work that was mostly done. Each time, the independent review on the PR caught the duplication before I merged. That's why the reviewers are stacked and independent: when something slips past one layer, the next one catches it as a comment on a PR, before it reaches main.
Worktrees keep sessions out of each other's way
[both modes] All work happens in git worktrees. Each session gets its own checkout of the repo on its own branch, so two sessions can touch the same repo at the same time without stepping on each other, and the main checkout always stays clean. When a session is done, its branch becomes a PR and the worktree goes away.
Fig. 6 — worktree isolation
One sharp edge with worktrees and automated review
If a worktree's branch is behind main (missing changes that have since gone live), an automated reviewer reading "the diff" can see the PR's real changes plus the reverse of every sibling change that already merged into main — and flag those reversals as new bugs. The fix is to rebase the worktree onto current main (replay its changes on top of the latest code) before running the review, so the reviewer only sees what the PR changes.
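A pre-review guard for that sharp edge could be sketched as follows. The git subcommands are standard; the function names, the origin/main default, and the worktree path are assumptions for illustration. Note that git refuses to start a rebase with uncommitted changes, so a real version would commit or stash first.

```python
import subprocess


def commits_behind(worktree: str, base: str = "origin/main") -> int:
    """Count commits on `base` that the worktree's branch does not have yet."""
    result = subprocess.run(
        ["git", "-C", worktree, "rev-list", "--count", f"HEAD..{base}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def pre_review_commands(worktree: str, remote: str = "origin", branch: str = "main"):
    """Commands to run before a review: refresh the remote, then rebase onto it."""
    return [
        ["git", "-C", worktree, "fetch", remote],
        ["git", "-C", worktree, "rebase", f"{remote}/{branch}"],
    ]


def prepare_for_review(worktree: str) -> None:
    """Fetch, then rebase only if the branch is actually behind."""
    fetch, rebase = pre_review_commands(worktree)
    subprocess.run(fetch, check=True)
    if commits_behind(worktree) > 0:
        subprocess.run(rebase, check=True)
```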
Who does what
[both modes] Four models, each on a job that fits it, and a hard rule underneath: no direct API calls — the pay-as-you-go way of calling a model, billed by the word. Everything runs through a CLI on a subscription I already pay for, so adding a model to the pipeline doesn't add a metered bill.
Fig. 7 — who does what
Claude Opus runs the sessions and the coach subagent — the model doing the work, and a separate instance judging it.
Codex reviews the uncommitted diff before commit, at an effort set by the change — low for a simple diff, high for a risky one. Runs on my ChatGPT subscription.
The Codex cloud connector reviews the PR on GitHub once it opens, reasoning about the live repo state, not just the diff — a second look from a different surface than the pre-PR pass.
Gemini handles the cheap, high-volume jobs: triaging email and returning structured output (answers in a fixed, machine-readable format) from the CLI.
| Model | Role | Invoked via | Marginal cost |
|---|---|---|---|
| Claude Opus | Sessions + coach review | claude CLI | subscription |
| Codex | Pre-PR review, effort by risk | codex exec CLI | none (OAuth) |
| Codex (cloud connector) | Independent PR review | chatgpt-codex-connector bot | subscription (weekly allowance) |
| Gemini | Email triage, structured output | gemini CLI | subscription |
The reviews are grounded in each repo. Every repo carries a per-repo CLAUDE.md that spells out the project's architecture, the patterns to enforce, and the specific mistakes to watch for — concrete, repo-specific bugs like a missing query limit, a NaN write, a wrong dependency arrow, or a factual error in the docs. The Claude sessions read it on every wake, and the cloud connector reasons about the live repo on GitHub rather than the diff alone — so the review comes back shaped by that repo's context. The project carries its own standards, and every model that works in it inherits them.
The cloud connector fires once when the PR opens and reasons about live repo state. It's subscription-billed — no metered minutes — but it still spends the ChatGPT plan's finite weekly and 5-hour allowance, so I treat it as a once-per-PR resource: I address what it raises in one fix round and don't re-trigger it. The local passes — the Codex CLI review and the coach — do the iterating before the PR ever opens.
Context and pre-compaction prep
[both modes] Long sessions degrade as the context window fills — the model gets measurably worse well before it runs out of room. I call the last stretch the dumb zone and try to never work in it. Three habits keep sessions sharp.
First, I compact early and on purpose. Autocompact is set to trigger around 300K tokens (the units models count text in, each one roughly three-quarters of a word), and when I hit a natural stopping point I run a manual compact with a note about what we're doing next — so the summary that survives is built around the next task, not whatever happened to be on screen.
Second, before a planned compaction I have the session write a durable handoff file to disk — the task, the decisions made so far, the exact file paths, what's done and what's next. Compaction is lossy: the auto-summary keeps what the model guesses matters, and you can't predict what it drops. A file is byte-exact, and the first step after compacting is to re-read it, which restores full fidelity. This page was built across a compaction exactly that way.
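A handoff file along these lines would carry the fields listed above. This is a minimal sketch: JSON and the field names are my assumptions, and the post doesn't say what format the real handoff uses.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Handoff:
    task: str
    decisions: list[str] = field(default_factory=list)
    file_paths: list[str] = field(default_factory=list)
    done: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)


def write_handoff(handoff: Handoff, path: Path) -> None:
    """Write the handoff to disk before compacting."""
    path.write_text(json.dumps(asdict(handoff), indent=2), encoding="utf-8")


def read_handoff(path: Path) -> Handoff:
    """Re-read the handoff as the first step after compacting."""
    return Handoff(**json.loads(path.read_text(encoding="utf-8")))
```

Because the file is read back verbatim, what comes out after compaction is exactly what went in, which is the property the auto-summary can't promise.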
Third, I keep the always-loaded project instructions lean. The CLAUDE.md in each repo gets condensed to pointers — the anti-slop writing rules and the core engineering principles stay in full because they apply every session, but changelogs and old handoffs move to linked files. Every token of standing instructions is a token spent on every session, so the file earns its length or it gets trimmed.
Fig. 8 — the pre-compaction handoff
Steal this
If you want to build something like it, the parts that matter are simpler than the wiring suggests — and they hold whether you're running on a timer or sitting at the keyboard:
- Make a queue you trust. Issues work because they're already where the work lives. The scheduler just pulls from them one at a time.
- Pull one item per session. Small, bounded work produces small, reviewable PRs.
- Isolate every session. Worktrees mean concurrency never corrupts your main checkout.
- Put at least one independent reviewer between the writer and main. Use a different model than the one that wrote the code; a different vendor is better still. Stack a couple and the cost of a miss drops to a comment.
- Match the reasoning level to the task. High for writing and judgment; low for a fast, literal code review that won't overthink.
- Prove "done," don't claim it. A passing test, pasted output, a before/after beats "this should work now." Make verification a step the session has to run.
- Respect the context window. Compact before the dumb zone, and write a handoff to disk before you do.
- Keep one human gate. Everything is built to make that gate a quick yes — from my phone when I'm away, or step by step when I'm there.
An idea from Lars worth stealing: pipe the week's trending GitHub repos into the session brief, filtered for relevance, so a session reaches for an existing tool instead of reinventing one. Not wired up yet — but it's next on the list.
Prompts you can borrow
None of this depends on my exact stack. The principles above are instructions you give an agent, so here they are as text — paste them into whatever you use, Claude or Codex or something else, and adapt. They're tool-agnostic and loose on purpose; the shape carries the value. Take what fits, change the rest.
Get the kit
Most people I show this to didn't know it was a thing you could do — point an agent at your own GitHub issues and let it work them on a schedule, with the review and approval steps built in. So I packaged the autopilot half as a drop-in kit. You don't have to write the code or know Python: you hand two files to your own agent, answer its questions, and it builds the loop for your machine, whether that's a Mac, a Windows box, or Linux.
It's the careful version on purpose. What keeps autonomous work from turning into unreviewed output that quietly breaks something is bounded scope, a second model on every diff, and a person at the merge. The kit is built around those three, and it's the part most people ask about first.
You aim it as tightly as you want, then widen one ring at a time:
- Which repos — and which one is the focus right now.
- Which issues — a label like agent-ready you apply by hand is the tightest gate.
- Which files — allow docs/** only, deny .github/workflows/, or both.
- How big — cap the files and lines a single change can touch.
- Who checks and who approves — a second model reviews every diff; you merge. It never merges its own work.
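Those rings reduce to a handful of checks a session can run before it touches anything. The sketch below uses a hypothetical config shape; the kit's actual config.example.yaml may name and structure these settings differently, and the file and line caps are invented numbers.

```python
from fnmatch import fnmatch

# Hypothetical config shape and limits, for illustration only.
SCOPE = {
    "required_label": "agent-ready",
    "allow_paths": ["docs/**"],
    "deny_paths": [".github/workflows/*"],
    "max_files": 5,
    "max_lines": 200,
}


def scope_violations(labels, changed_files, lines_changed, scope=SCOPE):
    """Return the reasons a change falls outside scope; empty means in scope."""
    problems = []
    if scope["required_label"] not in labels:
        problems.append("missing label")
    for path in changed_files:
        # fnmatch's * also matches "/", so docs/** covers nested folders.
        if any(fnmatch(path, pattern) for pattern in scope["deny_paths"]):
            problems.append(f"denied path: {path}")
        elif not any(fnmatch(path, pattern) for pattern in scope["allow_paths"]):
            problems.append(f"path not allowed: {path}")
    if len(changed_files) > scope["max_files"]:
        problems.append("too many files")
    if lines_changed > scope["max_lines"]:
        problems.append("too many lines")
    return problems
```

The deny list is checked before the allow list, so a denied path stays blocked even if an allow pattern is later widened to cover it.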
github.com/jamditis/claude-skills-journalism → docs/autonomy/kit — a spec your agent follows, the config you fill in, the prompt blocks above in full, reference code, per-OS scheduler templates, and a costs breakdown. MIT-licensed. Open a session with your agent, give it BUILD-WITH-YOUR-AGENT.md and config.example.yaml, and say “set this up for me.”
I've run this end to end on one setup: a Raspberry Pi 5 (8GB) on Ubuntu 25.10, ARM64, with Python 3.13, cron, tmux, and timeout --foreground from uutils coreutils. The macOS and Windows recipes follow the same design but haven't been run end to end as of this version — testing them is planned. And “Linux” has roughly 58 million permutations once you count distros, kernels, and coreutils flavors, so even a different Linux box can differ in the details. Unless you're on that exact setup, have your agent treat the matching template with some skepticism: verify each piece on a throwaway issue before you trust an unattended schedule.
That's the whole thing. Built on a couple of Raspberry Pis, a cron file, and the rule that no model gets to be the only one who checked its own work. Questions or "I built one too" notes welcome.