This guide builds an autonomous bug-fix agent using Flue and Daytona sandboxes. Given a GitHub issue, the agent reproduces the bug with a failing test, implements the minimal fix, runs the full test suite, and opens a real pull request.
A sandbox is essential for this workflow. The agent clones unknown code, installs unknown dependencies, and executes the project’s test suite — operations that need strict isolation from your host. Daytona provisions a fresh isolated environment for every run and tears it down on completion, so an untrusted repository can never affect your host.
## 1. Workflow Overview

You point the agent at an open issue on any GitHub repository. The agent provisions a Daytona sandbox, clones the repo into it, then executes a strict Reproduce → Fix → Verify → PR workflow. When it’s done, it returns the URL of a real pull request you can review in the GitHub UI.
A successful run against vercel/ms issue #284 looks like this in the flue dev terminal:
```
[bug-fix] target: your-username/your-fork#284 (model: anthropic/claude-sonnet-4-6)
[bug-fix] sandbox ready (id: a44a184e-cf0a-4407-bb1a-02f1b8000466)
[bug-fix] installing gh CLI in sandbox...
[bug-fix] commits will be authored as Your Name <12345+your-username@users.noreply.github.com>
[bug-fix] cloning your-username/your-fork into sandbox...
[bug-fix] detected package manager: pnpm
[bug-fix] installing pnpm...
[bug-fix] installing project dependencies...
[bug-fix] resolving issue source: vercel/ms
[bug-fix] fetching issue #284 from vercel/ms...
[bug-fix] uploading skill into sandbox + excluding it from git...
[bug-fix] running TDD workflow (reproduce → fix → PR)...
[bug-fix] PR opened: https://github.com/your-username/your-fork/pull/1
[bug-fix] branch: flue/fix-issue-284
[bug-fix] files changed: src/index.ts, src/parse.test.ts
[bug-fix] tearing down agents + sandbox...
```

The four-phase TDD work (Understand → Reproduce → Fix → Pull Request) happens entirely inside the LLM’s session in the sandbox, so it doesn’t surface in the dev-server log line by line. To see those events streamed live, switch to the SSE invocation shown later in How bug-fix.ts is actually invoked.
The HTTP response body returned to your curl is the structured result the agent emits:
```json
{
  "result": {
    "branch": "flue/fix-issue-284",
    "prUrl": "https://github.com/your-username/your-fork/pull/1",
    "testFile": "src/parse.test.ts",
    "filesChanged": ["src/index.ts", "src/parse.test.ts"],
    "summary": "The parse() regex only matched plain decimal numbers in the value group (`-?\\d*\\.?\\d+`), so when format() produced scientific notation (e.g. `5.696545792019405e+297y`) via JavaScript's default number serialisation for very large Math.round() results, parse() returned NaN; the fix extends the value capture group with an optional exponent part (`(?:e[+-]?\\d+)?`) so scientific notation is accepted transparently."
  }
}
```

## 2. Project Setup

### Clone the Repository

Clone the Daytona repository and navigate to the example directory:
```
git clone https://github.com/daytonaio/daytona.git
cd daytona/guides/typescript/flue
```

### Fork a Demo Target

You need a target repository to demo against. We recommend vercel/ms, the well-known millisecond conversion utility. It’s small (one ~244-line source file), uses Jest for tests, has an MIT license, and ships with real open issues. Fork it so the agent can push branches and open PRs against your copy:

```
gh repo fork vercel/ms --clone=false
```

The agent will operate on your fork (referred to in this guide as your-username/your-fork, where your-fork is whatever you named it), so any branches and pull requests it creates land on your fork, never upstream.
### Configure Environment

Copy .env.example to .env and fill in your keys:

```
cp .env.example .env
```

| Variable | Required | Source |
|---|---|---|
| `DAYTONA_API_KEY` | yes | Daytona Dashboard |
| `ANTHROPIC_API_KEY` | yes | For this agent’s default model, anthropic/claude-sonnet-4-6. Required only if you don’t override MODEL. (Flue itself has no default; bug-fix.ts picks one.) |
| `GITHUB_TOKEN` | yes | A Personal Access Token with repo scope (create one) |
| `MODEL` | no | Override this agent’s default model. Any provider/model-id recognized by @mariozechner/pi-ai. Examples: anthropic/claude-opus-4-7, openai/gpt-5.5 |
| `DEMO_REPO` | no¹ | Default target fork in `<owner>/<repo>` form (e.g. your-username/your-fork). Used when the webhook payload omits repo |
| `DEMO_ISSUE` | no¹ | Default issue number (e.g. 284). Used when the webhook payload omits issueNumber |
| `ISSUE_REPO` | no | Override the issue source, in `<owner>/<repo>` form. By default the agent auto-detects the upstream parent of DEMO_REPO; set this if DEMO_REPO is not a fork or you want to point at a different repo |
¹ Either set both DEMO_REPO / DEMO_ISSUE in .env and trigger with an empty body, or omit them and pass repo / issueNumber on every webhook call. Payload always wins over .env.
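The payload-over-.env precedence rule can be sketched as a tiny resolver. This is a hypothetical helper for illustration, not code shipped with the guide; the names `resolveTarget`, `Payload`, and `Env` are mine:

```typescript
// Hypothetical sketch of the precedence rule: payload fields always win over .env defaults.
type Payload = { repo?: string; issueNumber?: number }
type Env = { DEMO_REPO?: string; DEMO_ISSUE?: string }

function resolveTarget(payload: Payload, env: Env): { repo: string; issueNumber: number } {
  const repo = payload.repo ?? env.DEMO_REPO
  const issueNumber =
    payload.issueNumber ?? (env.DEMO_ISSUE ? Number(env.DEMO_ISSUE) : undefined)
  if (!repo || issueNumber === undefined || Number.isNaN(issueNumber)) {
    throw new Error('Set DEMO_REPO/DEMO_ISSUE in .env or pass repo/issueNumber in the payload')
  }
  return { repo, issueNumber }
}

// Payload always wins over .env:
resolveTarget({ repo: 'me/fork', issueNumber: 42 }, { DEMO_REPO: 'other/repo', DEMO_ISSUE: '284' })
// → { repo: 'me/fork', issueNumber: 42 }
```

Either source alone is enough; only a target missing from both places is an error.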
### Install Dependencies

```
npm install
```

### Start the Agent Server

```
npm run dev
```

Flue boots a webhook server on port 3583 and discovers the bug-fix agent automatically:

```
[flue] Starting dev server (target: node)
[flue] Target: node
[flue] Found 1 role(s): test-driven-developer
[flue] Found 1 agent(s): bug-fix
[flue] Webhook agents: bug-fix
[flue] Built: dist/server.mjs
[flue] Server: http://localhost:3583
[flue] Try: curl -X POST http://localhost:3583/agents/bug-fix/test-1 \
       -H 'Content-Type: application/json' -d '{}'
[flue] Press Ctrl+C to stop
```

### Trigger the Agent
There are three equivalent ways to trigger the agent. Pick whichever fits your workflow.
Option A: drive everything from .env (default sync mode). With DEMO_REPO=your-username/your-fork and DEMO_ISSUE=<number> set in .env, fire an empty payload:
```
curl -X POST http://localhost:3583/agents/bug-fix/run-1 \
  -H "Content-Type: application/json" \
  -d '{}'
```

Option B: pass the target per call (default sync mode). Override .env (or skip it entirely) by sending the target in the payload:

```
curl -X POST http://localhost:3583/agents/bug-fix/run-1 \
  -H "Content-Type: application/json" \
  -d '{ "repo": "your-username/your-fork", "issueNumber": <number> }'
```

Replace your-username/your-fork with your fork’s slug and <number> with the issue number you want to target.
The flue dev terminal shows the orchestrator’s setup logs in real time. When the agent finishes, the response body contains the structured result.
Option C: one-shot with live tool tracing (recommended when iterating or debugging):
Stop flue dev (or leave it running; either works) and run:

```
npm run run
```

That maps to flue run bug-fix --target node --id run-1 --env .env --payload '{}'. Unlike Options A and B, this builds and spawns its own ephemeral server, POSTs with Accept: text/event-stream, and decorates every agent event (tool:start, tool:done, the LLM’s reasoning text) into a readable progress line so you can watch the LLM work in real time. The final structured result is printed at the end, ready to pipe into downstream tooling. See the Example Walkthrough below for a real npm run run trace.
Use Options A/B for quiet production-style runs against a long-lived flue dev server. Use Option C when you want to watch the LLM’s tool calls tool-by-tool.
## 3. Understanding the Architecture

This example splits responsibility between TypeScript and Markdown, the idiomatic Flue pattern. Plumbing (sandbox lifecycle, payload validation, structured outputs) lives in .ts; the agent’s reasoning and workflow live in .md.
The directory layout is defined by Flue, not by us. Flue’s CLI looks for a .flue/ workspace at the project root containing agents/, connectors/, and roles/ subdirectories, and discovers skills under .agents/skills/<skill-name>/SKILL.md. We just populate those well-known locations:
```
guides/typescript/flue/
├── .flue/                        # Flue workspace (convention)
│   ├── agents/                   # one .ts file per agent (Flue scans this dir)
│   │   └── bug-fix.ts            # orchestrator
│   ├── connectors/               # connector files referenced by agents
│   │   └── daytona.ts            # Daytona → Flue SandboxFactory adapter
│   └── roles/                    # role markdown files (subagent personas)
│       └── test-driven-developer.md
└── .agents/                      # Flue skill workspace (convention)
    └── skills/
        └── bug-fix/              # skill folder name = skill identifier
            └── SKILL.md          # actual TDD workflow logic
```

So session.skill('bug-fix', { ... }) in our agent code maps directly to .agents/skills/bug-fix/SKILL.md: Flue resolves the skill by folder name by default. If you set a name: field in the file’s frontmatter, that takes precedence over the folder name — useful when you want to keep the directory layout but rename the skill.
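The folder-name-versus-frontmatter rule can be sketched as a small resolver. This is an illustration of the precedence described above, not Flue’s actual resolver; `resolveSkillName` is a hypothetical name:

```typescript
// Hypothetical sketch: a frontmatter `name:` field (if present) takes
// precedence over the skill's folder name.
function resolveSkillName(folderName: string, skillMd: string): string {
  const fm = skillMd.match(/^---\n([\s\S]*?)\n---/) // extract the YAML frontmatter block
  if (fm) {
    const name = fm[1].match(/^name:\s*(.+)$/m)
    if (name) return name[1].trim()
  }
  return folderName // default: the directory under .agents/skills/ names the skill
}

resolveSkillName('bug-fix', '# Skill body only')                   // → 'bug-fix'
resolveSkillName('bug-fix', '---\nname: tdd-bug-fix\n---\n# Body') // → 'tdd-bug-fix'
```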
### The Daytona Connector

Daytona is a first-class Flue connector. The canonical way to install it is to pipe Flue’s connector registry to your AI coding agent:
```
flue add daytona | claude
# or: opencode | codex | cursor-agent | pi
```

This requires an AI coding-agent CLI to already be installed and authenticated locally — pick whichever one you already use (claude, opencode, codex, cursor-agent, pi, etc.). If you don’t have any installed yet, this guide ships the resulting connector pre-built so you can skip the flue add step entirely.
flue add daytona fetches the official installation instructions from https://flueframework.com/cli/connectors/daytona.md and writes them to stdout. Your AI agent reads those instructions and writes the connector adapter (.flue/connectors/daytona.ts) into your project automatically. No manual file copying, no version drift.
This guide ships the resulting .flue/connectors/daytona.ts pre-built so the demo runs without an extra step, but the file is byte-identical to what flue add daytona | <agent> produces. Once installed, you import the connector and pass it to init():
```ts
import { Daytona } from '@daytona/sdk'
import { daytona } from '../connectors/daytona'

const client = new Daytona({ apiKey: env.DAYTONA_API_KEY })
const sandbox = await client.create()

const agent = await init({
  // cleanup: true arms sandbox.delete() to fire on agent.destroy()
  // Flue does NOT auto-destroy on handler return; see Section 5.
  sandbox: daytona(sandbox, { cleanup: true }),
  model: 'anthropic/claude-sonnet-4-6',
})
```

The user owns the Daytona client lifecycle (you decide how the sandbox is created, reused, or cleaned up); Flue just adapts it for agent use. The cleanup: true option arms a sandbox.delete() callback that fires when agent.destroy() is called, but Flue does NOT auto-destroy on handler return. The orchestrator must explicitly call destroy(), which our try/finally does (covered in Section 5: Cleanup).
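The ownership contract (cleanup: true arms a callback; only an explicit destroy() fires it) can be modeled with a toy adapter. This is a sketch of the contract only, not the shipped connector, which is asynchronous and far richer; `adapt` and `SandboxLike` are invented names:

```typescript
// Toy model of the cleanup contract. `cleanup: true` only *arms* a delete
// callback; nothing fires until destroy() is called explicitly.
interface SandboxLike { delete(): void }

function adapt(sandbox: SandboxLike, opts: { cleanup?: boolean } = {}) {
  return {
    destroy() {
      // Fires only when destroy() runs, never on handler return.
      if (opts.cleanup) sandbox.delete()
    },
  }
}

let deleted = 0
const sandbox: SandboxLike = { delete() { deleted++ } }

const setupLike = adapt(sandbox, { cleanup: true }) // arms teardown
const projectLike = adapt(sandbox)                  // shares the sandbox, no teardown

projectLike.destroy() // deleted is still 0: the unarmed adapter leaves the sandbox alive
setupLike.destroy()   // deleted becomes 1: only the armed adapter deletes the sandbox
```

This mirrors the two-agent pattern used later: one adapter owns teardown, the other just shares the sandbox.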
### The Orchestrator

The agent file (.flue/agents/bug-fix.ts) is small on purpose. It provisions the sandbox, prepares the environment, and hands off to a skill that does the real work. Here’s the structural shape — for the full file (env validation, slug-format checks, package-manager auto-install, gh/git config, the try/finally cleanup), open .flue/agents/bug-fix.ts from the guide directory you cloned earlier:
```ts
// First agent: setup phase. cleanup: true arms sandbox.delete() on destroy().
const setupAgent = await init({
  sandbox: daytona(sandbox, { cleanup: true }),
  model,
})
const setup = await setupAgent.session()
// ... installing gh, setting up git, cloning the fork, installing deps,
// fetching the issue body, uploading the SKILL.md ...

// Second agent: shares the same sandbox, but rooted in the cloned project dir
// so Flue discovers our SKILL.md from `.agents/skills/bug-fix/SKILL.md`.
const projectAgent = await init({
  id: `bug-fix-${issueNumber}`,
  sandbox: daytona(sandbox), // no cleanup option — setupAgent owns teardown
  cwd: projectDir,
  model,
})
const session = await projectAgent.session()

return await session.skill('bug-fix', {
  args: { issueNumber, issueData, repo, issueRepo, packageManager },
  role: 'test-driven-developer',
  result: ResultSchema,
})
```

A few details worth highlighting:
- Two agents, one sandbox. The setup agent installs `gh`, configures auth, clones the repo, and installs dependencies. A second agent (given a different `id` and `cwd`) operates inside the cloned repo and discovers our `bug-fix` skill from `.agents/skills/bug-fix/SKILL.md`, which the orchestrator uploads into the cloned worktree just before init. Both agents share the same Daytona sandbox. The distinct `id` matters: a fresh `id` opens a fresh Flue session — see *What `<id>` actually means: sessions* for the full lifecycle.
- No `AGENTS.md` upload. Flue would happily read an `AGENTS.md` from the session cwd and prepend it to every system prompt, but that file lives at the cloned repo’s root and uploading our own would overwrite the target’s `AGENTS.md` if it has one. Every guardrail we’d put there (TDD discipline, minimal change, match host code style) is already covered by the `test-driven-developer` role and the `bug-fix` skill body, so the harness ships nothing at the worktree root.
- `.git/info/exclude` keeps the worktree clean. After uploading `SKILL.md` to `.agents/skills/bug-fix/`, the orchestrator appends `.agents/` to the cloned repo’s `.git/info/exclude` (git’s local-only ignore — it does NOT modify the target’s `.gitignore`). The harness scaffolding stays invisible to `git status` and never accidentally lands in a commit.
- Two repos, one workflow. `repo` is the user’s fork (where branches and the PR land); `issueRepo` is where the issue lives (the upstream parent, auto-detected via `gh repo view --json parent` because GitHub disables issues on forks).
- Package-manager auto-detection. The setup phase detects the package manager from the project’s lockfile, installs it if missing, and passes the name into the skill so the LLM uses the right test command.
- Structured input and output. The `PayloadSchema` validates the incoming HTTP body with Valibot, and `ResultSchema` forces the agent to return a typed `{ branch, prUrl, testFile, filesChanged, summary }` object you can pipe into downstream automation.
- Skills, not prompts. Instead of cramming the TDD workflow into a string, the agent calls a named skill and supplies a role. The actual logic lives in markdown.
### The Skill (Where the Real Logic Lives)

.agents/skills/bug-fix/SKILL.md defines the TDD workflow as four strict phases. The agent is required to run them in order, and it cannot proceed to the next phase without producing concrete evidence (a read, a failing test, a passing test, a commit):
```md
## Phase 1: Understand
Read the issue body. Identify expected vs. actual behavior.
Inspect package.json, README.md, AGENTS.md. Identify the test framework.
Read the source file(s) most likely involved. Read at least one test file.

## Phase 2: Reproduce
Create branch flue/fix-issue-{{issueNumber}}.
Write a single, focused test that asserts the expected behavior.
Run the test command. The test MUST fail.

## Phase 3: Fix
Make the minimal code change required to make the failing test pass.
Run the full test suite. All tests MUST pass.

## Phase 4: Pull Request
Commit with `fix: <summary> (#{{issueNumber}})`.
Push the branch to the user's fork.
Open a PR via `gh pr create` with reproduction + verification output.
```

Because the workflow is markdown, you can tighten it (add a “no --force push” rule), loosen it (allow multi-file fixes), or fork it for a different language without touching TypeScript.
### The Role

.flue/roles/test-driven-developer.md defines the agent’s persona: a disciplined contributor who treats the target repository as someone else’s project. The role is referenced in the skill call (role: 'test-driven-developer') and shapes how the agent makes tradeoffs (minimal change, match host code style, never disable existing tests).
### How bug-fix.ts is actually invoked

Nothing in our code calls our agent’s default export directly; Flue’s CLI does. Here’s the full chain from npm run dev to handler(ctx):
Build time (flue dev startup):
- `flue dev --target node` calls `dev()` from `@flue/sdk`, which runs `build()`. `build()` does `fs.readdirSync('.flue/agents')` and keeps any entry matching `/\.(ts|js|mts|mjs)$/`. Our `bug-fix.ts` matches → agent name is `bug-fix` (filename without extension).
- For each agent file, Flue uses the TypeScript AST to find the static `export const triggers = {...}` declaration, validating that `webhook` is `true` or `false`. Our `triggers = { webhook: true }` registers the agent for HTTP access.
- The build generates a Hono server entry that imports each agent’s default export, then esbuilds it to `dist/server.mjs`.
- The dev server spawns `node dist/server.mjs` with `PORT=3583` and `FLUE_MODE=local`.
Request time (when you curl):
The generated server mounts a single dynamic route, `POST /agents/:name/:id`. When a request arrives:
1. Validate the method, agent name, and webhook accessibility.
2. Parse the JSON body → `payload` (defaults to `{}` for empty bodies).
3. Pick a response mode based on headers:

   | Headers sent by client | Server behavior | Status |
   |---|---|---|
   | `Content-Type: application/json` (default) | Wait for handler, return `{ "result": <handlerReturn> }` | 200 |
   | `Accept: text/event-stream` | Stream SSE events (channel names are `tool_start`, `text_delta`, …, finally `result`) | 200 |
   | `x-webhook: true` | Fire-and-forget; run handler in background | 202 |

   The CLI’s pretty-print form (`[flue] tool:start`, `[flue] tool:done`) you see in `flue run` output is `flue run`’s own decoration; the underlying SSE channel names use underscores (`tool_start`, `tool_end`). If you’re consuming the SSE stream from your own client, listen for the underscore form.
4. Construct a `FlueContext` (`{ id, payload, env, init }`) and invoke our default export: `handler(ctx)`.
5. Return whatever the handler resolves with, in whichever mode was selected.
So when you run our curl example without special headers, you hit the sync mode: the connection stays open until the agent finishes (PR opened), then the server returns { "result": { branch, prUrl, ... } }.
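The header-based mode selection can be sketched as a pure function. This is a hypothetical model, not Flue’s generated routing code, and the precedence between `x-webhook` and `Accept` shown here is my assumption:

```typescript
// Hypothetical sketch of how a server might pick a response mode from headers.
type Mode =
  | { kind: 'sync'; status: 200 }
  | { kind: 'sse'; status: 200 }
  | { kind: 'background'; status: 202 }

function pickMode(headers: Record<string, string>): Mode {
  // Normalize header names (HTTP headers are case-insensitive).
  const h = Object.fromEntries(
    Object.entries(headers).map(([k, v]) => [k.toLowerCase(), v])
  )
  if (h['x-webhook'] === 'true') return { kind: 'background', status: 202 } // fire-and-forget
  if ((h['accept'] ?? '').includes('text/event-stream')) return { kind: 'sse', status: 200 }
  return { kind: 'sync', status: 200 } // default: wait for handler, wrap in { result }
}

pickMode({ 'Content-Type': 'application/json' }) // → sync, 200
pickMode({ Accept: 'text/event-stream' })        // → sse, 200
pickMode({ 'x-webhook': 'true' })                // → background, 202
```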
If you want to watch the agent’s progress live, switch to SSE:
```
curl -N -X POST http://localhost:3583/agents/bug-fix/run-1 \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{}'
```

Or use Flue’s one-shot CLI invoker, which handles SSE for you and prints the result to stdout:
```
npm run run
# equivalent to:
# flue run bug-fix --target node --id run-1 --env .env --payload '{}'
```

flue run builds, spawns the server, POSTs with Accept: text/event-stream, streams events to stderr, prints the final result to stdout, and shuts the server down. Perfect for CI.
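If you consume the raw SSE stream from your own client instead of flue run, you need to split it into event frames. A minimal parser sketch for complete frames, following the standard `event:`/`data:` wire format (the event names and payloads below are illustrative):

```typescript
// Minimal SSE frame parser: frames are separated by a blank line; each frame
// carries an optional `event:` name (default "message") and `data:` lines.
type SseEvent = { event: string; data: string }

function parseSse(chunk: string): SseEvent[] {
  const events: SseEvent[] = []
  for (const frame of chunk.split('\n\n')) {
    let event = 'message' // SSE's default event name
    const data: string[] = []
    for (const line of frame.split('\n')) {
      if (line.startsWith('event:')) event = line.slice(6).trim()
      else if (line.startsWith('data:')) data.push(line.slice(5).trim())
    }
    if (data.length) events.push({ event, data: data.join('\n') })
  }
  return events
}

parseSse('event: tool_start\ndata: {"tool":"read"}\n\nevent: result\ndata: {"prUrl":"..."}\n\n')
// → [{ event: 'tool_start', data: '{"tool":"read"}' },
//    { event: 'result', data: '{"prUrl":"..."}' }]
```

Remember the underscore channel names (tool_start, text_delta, result) noted above; the colon forms are flue run’s display decoration only.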
### What <id> actually means: sessions

The <id> segment in POST /agents/bug-fix/<id> is not just a label. It identifies a Flue session: the persistent message history and conversation metadata that agent.session() opens inside your handler.
```
POST /agents/bug-fix/run-1   ← <id> = "run-1" → session "run-1" for the bug-fix agent
POST /agents/bug-fix/run-2   ← different <id> → fresh session, no shared state
POST /agents/bug-fix/run-1   ← same <id> as before → REUSES session "run-1"
```

Same <id> reused, what actually happens:
- Your handler function runs from the top, every time (it’s just a function, not auto-resumed).
- `client.create()` makes a new Daytona sandbox each call (because our code calls it unconditionally).
- But `await agent.session()` inside the handler resolves to the same Flue session object as the previous call with that id, so the LLM sees the previous run’s message history as context for this run.
So same-id reuse persists the conversation, not the sandbox. For a chat-style agent that’s exactly what you want; for our one-shot bug-fix agent it’s mostly noise: the LLM might short-circuit with “I already analyzed this” and muddy the fresh run.
Practical guidance for this guide:
- One unique `<id>` per run (`run-1`, `run-2`, or `$(uuidgen)`). Treat each invocation as fresh.
- Pick a stable `<id>` only if you want resumability (e.g., the agent crashed mid-fix and you want the LLM to remember its prior reasoning). You’d also need to extend the orchestrator to skip sandbox setup when a sandbox already exists for that id.
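A toy model of the session table makes the reuse rule concrete. This is an illustration only, not Flue’s storage (which persists real message history, not strings):

```typescript
// Toy model: the <id> keys a persistent history, so reusing an id replays
// prior context, while a fresh id starts empty.
const sessions = new Map<string, string[]>()

function openSession(agent: string, id: string): string[] {
  const key = `${agent}/${id}`
  if (!sessions.has(key)) sessions.set(key, []) // fresh id → fresh history
  return sessions.get(key)!                     // same id → the same history object
}

const run1 = openSession('bug-fix', 'run-1')
run1.push('analyzed issue #284')

openSession('bug-fix', 'run-2') // fresh session: []
openSession('bug-fix', 'run-1') // reuses run-1: ['analyzed issue #284']
```

Note what this model deliberately omits: the sandbox. Only the conversation is keyed by id, which is exactly why same-id reuse persists history but still provisions a new sandbox.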
## 4. Example Walkthrough

Let’s trace what happens when you trigger the agent against vercel/ms issue #284. The reporter found that ms() violates its own roundtrip contract: ms(Number.MAX_VALUE) returns a string in scientific notation that parse() can no longer read back.
To watch the agent work tool-by-tool, use Flue’s one-shot CLI invoker (already wired into our package.json):
```
npm run run
```

That runs flue run bug-fix --target node --id run-1 --env .env --payload '{}', which builds, spawns an ephemeral server, POSTs with Accept: text/event-stream, and decorates each agent event into a readable progress line. Below is a trimmed real run against a fork of vercel/ms. The orchestrator’s [bug-fix] ... lines come from our setup phase; the [flue] tool:start/done ... lines and inline reasoning text are the LLM’s tool calls and thinking streamed back via SSE:
```
[bug-fix] target: your-username/your-fork#284 (model: anthropic/claude-sonnet-4-6)
[bug-fix] sandbox ready (id: b5e8e152-7c67-4176-8721-c37aab6aa331)
[bug-fix] installing gh CLI in sandbox...
[bug-fix] commits will be authored as Your Name <12345+your-username@users.noreply.github.com>
[bug-fix] cloning your-username/your-fork into sandbox...
[bug-fix] detected package manager: pnpm
[bug-fix] installing pnpm...
[bug-fix] installing project dependencies...
[bug-fix] resolving issue source: vercel/ms
[bug-fix] fetching issue #284 from vercel/ms...
[bug-fix] uploading skill into sandbox + excluding it from git...
[bug-fix] running TDD workflow (reproduce → fix → PR)...
[flue] Agent started
  I'll work through this systematically following the test-driven approach.
[flue] tool:start read /home/daytona/project/package.json
[flue] tool:done read (1477 chars)
[flue] tool:start glob **/*.test.*
[flue] tool:done glob
[flue] tool:start read /home/daytona/project/src/index.ts
[flue] tool:start read /home/daytona/project/src/parse.test.ts
[flue] tool:done read (5864 chars)
[flue] tool:done read (5082 chars)
  Now I have a clear picture. Let me verify what `format(Number.MAX_VALUE)` actually
  produces, then set up the branch and write the failing test:
[flue] tool:start bash $ node -e "const out = ms(Number.MAX_VALUE); ..."
[flue] tool:done bash (218 chars)
  I've confirmed the bug: format(Number.MAX_VALUE) produces "5.696545792019405e+297y"
  which the regex in parse() cannot match. Setting up the branch and writing the test:
[flue] tool:start bash $ git checkout -b flue/fix-issue-284
[flue] tool:done bash
[flue] tool:start edit /home/daytona/project/src/parse.test.ts
[flue] tool:done edit
[flue] tool:start bash $ pnpm test:nodejs
[flue] tool:done bash (2102 chars)
  The tests fail as expected. Now let me implement the fix:
[flue] tool:start edit /home/daytona/project/src/index.ts
[flue] tool:done edit
[flue] tool:start bash $ pnpm test
[flue] tool:done bash (1319 chars)
  All 172 tests pass (167 pre-existing + 5 new). Now let's commit and push:
[flue] tool:start bash $ git add src/index.ts src/parse.test.ts && git commit -m "fix: ..."
[flue] tool:done bash
[flue] tool:start bash $ git push origin flue/fix-issue-284
[flue] tool:done bash
[flue] tool:start bash $ gh pr create --repo your-username/your-fork --base main \
        --head flue/fix-issue-284 --title "fix: ..." --body "Closes #284 ..."
[flue] tool:done bash
[bug-fix] PR opened: https://github.com/your-username/your-fork/pull/1
[bug-fix] branch: flue/fix-issue-284
[bug-fix] files changed: src/index.ts, src/parse.test.ts
[bug-fix] tearing down agents + sandbox...
```

Reading top to bottom you can see the agent following our SKILL.md phases: it understands the project (multiple parallel read calls), confirms the bug interactively before changing anything (the node -e reproduction in bash), creates a branch + writes the failing test, runs the suite to confirm the test fails, makes the fix, reruns the suite, then commits, pushes, and opens the PR. The final [bug-fix] tearing down agents + sandbox... line is the orchestrator’s try/finally doing its work — both agents get destroyed and the Daytona sandbox is deleted before the response is returned.
Note that exact wording, file paths, test counts, and PR numbers vary between runs (the LLM is non-deterministic, and the PR number depends on how many PRs your fork already has). The shape — sandbox provision → setup → four-phase TDD workflow → PR URL → cleanup — is what’s deterministic.
The four phases below zoom in on each step.
### Phase 1: Understand

The agent reads package.json to identify the test runner (Jest, run via pnpm test), then reads the single-file source at src/index.ts (244 lines) and an existing parse test (src/parse.test.ts) to learn the project’s assertion style.
The relevant code is the parse() regex around line 77 of src/index.ts:
```ts
const match = /^(?<value>-?\d*\.?\d+) *(?<unit>...)?$/i.exec(str);
```

The value group only matches plain decimal numbers; it doesn’t accept scientific notation (e+297). That’s the bug.
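The gap is easy to demonstrate in plain TypeScript. The unit alternatives below are abbreviated stand-ins (the real group in src/index.ts lists every supported unit); only the value group matters for this bug:

```typescript
// Buggy vs. fixed value group. The unit list here is a simplified placeholder.
const before = /^(?<value>-?\d*\.?\d+) *(?<unit>years?|y|ms|s)?$/i
const after  = /^(?<value>-?\d*\.?\d+(?:e[+-]?\d+)?) *(?<unit>years?|y|ms|s)?$/i

const sci = '5.696545792019405e+297y' // what format(Number.MAX_VALUE) emits

before.exec(sci) // → null: the value group stops at the 'e', so parse() returns NaN
const m = after.exec(sci)!
m.groups!.value             // → '5.696545792019405e+297'
parseFloat(m.groups!.value) // a number again, not NaN
```

The fixed pattern is the same one the agent’s summary describes: an optional `(?:e[+-]?\d+)?` exponent part appended to the value group.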
### Phase 2: Reproduce

The agent creates a new branch and writes a single, focused test that asserts the roundtrip property the reporter described:
```ts
import { ms } from './index';

describe('issue #284: roundtrip with very large numbers', () => {
  it('format() output is always parseable back to a number', () => {
    const out = ms(Number.MAX_VALUE);
    expect(typeof out).toBe('string');
    expect(ms(out)).not.toBeNaN();
  });
});
```

It runs pnpm test and confirms the failures with the current implementation. From a real run:
```
FAIL src/parse.test.ts
  ● parse(scientific notation) › should parse scientific notation values with a unit (roundtrip with format)
    Expected: false
    Received: true
    (Number.isNaN was true — parse returned NaN)
  ● parse(scientific notation) › should parse scientific notation with y unit
  ● parse(scientific notation) › should parse scientific notation with ms unit
  ● parse(scientific notation) › should parse scientific notation with s unit
  ● parse(scientific notation) › should parse negative scientific notation with a unit
```

If the tests had unexpectedly passed, the agent would refuse to continue: a test that doesn’t fail isn’t a reproduction.
### Phase 3: Fix

With the bug reproduced, the agent makes a minimal, surgical change to src/index.ts so parse() accepts scientific notation in the numeric value group, then reruns the full suite:
```
PASS src/parse.test.ts
PASS src/index.test.ts
PASS src/format.test.ts
PASS src/parse-strict.test.ts

Test Suites: 4 passed, 4 total
Tests:       172 passed, 172 total
```

Both the new tests and every pre-existing test pass. The fix is a small regex extension (one optional group): exactly the kind of minimal change the test-driven-developer role rewards.
### Phase 4: Pull Request

The agent commits, pushes, and opens a PR against your fork:
```
$ git commit -m "fix: support scientific notation in parse() to fix roundtrip with large numbers (#284)"
$ git push origin flue/fix-issue-284
$ gh pr create --repo your-username/your-fork \
    --base main \
    --head flue/fix-issue-284 \
    --title "fix: support scientific notation in parse() to fix roundtrip with large numbers (#284)" \
    --body "Closes vercel#284 ..."
```

The PR body the agent generated includes the failing-test output from Phase 2, a one-paragraph root-cause analysis, and the passing-test output from Phase 3 — everything a human reviewer needs to merge in under five minutes.
The HTTP response body returned to your curl wraps the handler’s return value under a result key:
```json
{
  "result": {
    "branch": "flue/fix-issue-284",
    "prUrl": "https://github.com/your-username/your-fork/pull/1",
    "testFile": "src/parse.test.ts",
    "filesChanged": ["src/index.ts", "src/parse.test.ts"],
    "summary": "The parse() regex only matched plain decimal numbers in the value group (`-?\\d*\\.?\\d+`), so when format() produced scientific notation (e.g. `5.696545792019405e+297y`) via JavaScript's default number serialisation for very large Math.round() results, parse() returned NaN; the fix extends the value capture group with an optional exponent part (`(?:e[+-]?\\d+)?`) so scientific notation is accepted transparently."
  }
}
```

Open the PR URL in your browser to review the diff and merge, exactly as you would for a human-authored contribution.
## 5. Cleanup

Flue does not auto-destroy sessions when a handler returns — sessions persist for resumability via the same <id>, and the cleanup: true callback registered on our connector only fires when agent.destroy() is explicitly called. The orchestrator therefore wraps the entire two-agent flow in a try { ... } finally { ... } block:
```ts
try {
  setupAgent = await init({ sandbox: daytona(sandbox, { cleanup: true }), ... })
  // ... setup work + projectAgent + skill invocation
  return result
} finally {
  console.log('[bug-fix] tearing down agents + sandbox...')
  if (projectAgent) {
    try { await projectAgent.destroy() } catch (err) { console.error(err) }
  }
  if (setupAgent) {
    try { await setupAgent.destroy() } catch (err) { console.error(err) }
  } else {
    // setupAgent never armed cleanup: true; delete sandbox directly
    try { await sandbox.delete() } catch (err) { console.error(err) }
  }
}
```

Order matters: the project agent is destroyed first (closing its session has no sandbox impact, since it doesn’t have cleanup: true), then the setup agent’s destroy fires the registered sandbox.delete() callback. The fallback else branch handles the case where init() itself threw before setupAgent was created — in that scenario nothing armed cleanup: true, so the orchestrator calls sandbox.delete() directly on the already-created sandbox. (If client.create() itself fails earlier, no sandbox object exists at all, so there’s nothing to leak.)
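The best-effort pattern inside the finally block can be distilled into a helper. This is a sketch of the pattern, not code from the guide; the real teardown awaits async destroy()/delete() calls, while this version is synchronous to keep the shape visible, and `bestEffort` is an invented name:

```typescript
// Best-effort teardown: run every cleanup step, record failures, never let one
// failing step abort the rest (or mask the original error from the try block).
function bestEffort(steps: Array<() => void>): string[] {
  const errors: string[] = []
  for (const step of steps) {
    try {
      step()
    } catch (err) {
      errors.push(String(err)) // logged, not rethrown: cleanup keeps going
    }
  }
  return errors
}

// All three steps run even though the first one throws:
const teardownLog = bestEffort([
  () => { throw new Error('projectAgent.destroy() failed') }, // e.g. transient API error
  () => { /* setupAgent.destroy() would run here */ },
  () => { /* the sandbox.delete() fallback would run here */ },
])
// teardownLog holds exactly one entry, for the failed first step
```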
This covers the common failure paths: successful runs, caught LLM errors, and exceptions thrown during the workflow. Cleanup is best-effort — if destroy() or sandbox.delete() itself throws (transient API error, network drop), the failure is logged but the sandbox is not retried. Confirm in your Daytona Dashboard after each run, and clean up any orphans manually if you spot them.
## Key Advantages

- TDD by construction: the skill’s phase ordering forces a failing test before any fix lands, so every PR is reproducible.
- Real pull requests: the agent uses `gh` inside the sandbox to push and open PRs you can merge in the GitHub UI, not just patches you copy by hand.
- Skill-first design: tweak the workflow by editing markdown. No recompile, no redeploy.
- Structured outputs: results are validated by Valibot schemas, so downstream automation never has to parse free-form text.
- Sandbox-isolated execution: cloning, dependency installation, and test runs all happen inside Daytona, so your host stays clean even if the target repo or its dependencies are malicious.