This guide builds an autonomous bug-fix agent using Flue and Daytona sandboxes. Given a GitHub issue, the agent reproduces the bug with a failing test, implements the minimal fix, runs the full test suite, and opens a real pull request.
A sandbox is essential for this workflow. The agent clones unknown code, installs unknown dependencies, and executes the project’s test suite — operations that need strict isolation from your host. Daytona provisions a fresh isolated environment for every run and tears it down on completion, so an untrusted repository can never affect your host.
## 1. Workflow Overview

You point the agent at an open issue on any GitHub repository. The agent provisions a Daytona sandbox, clones the repo into it, then executes a strict Reproduce → Fix → Verify → PR workflow. When it’s done, it returns the URL of a real pull request you can review in the GitHub UI.
A successful run against vercel/ms issue #284 looks like this in the flue dev terminal:
```
[bug-fix] target: your-username/your-fork#284 (model: anthropic/claude-sonnet-4-6)
[bug-fix] sandbox ready (id: a44a184e-cf0a-4407-bb1a-02f1b8000466)
[bug-fix] installing gh CLI in sandbox...
[bug-fix] commits will be authored as Your Name <12345+your-username@users.noreply.github.com>
[bug-fix] cloning your-username/your-fork into sandbox...
[bug-fix] detected package manager: pnpm
[bug-fix] installing pnpm...
[bug-fix] installing project dependencies...
[bug-fix] resolving issue source: vercel/ms
[bug-fix] fetching issue #284 from vercel/ms...
[bug-fix] uploading skill into sandbox + excluding it from git...
[bug-fix] running TDD workflow (reproduce → fix → PR)...
[bug-fix] PR opened: https://github.com/your-username/your-fork/pull/1
[bug-fix] branch: flue/fix-issue-284
[bug-fix] files changed: src/index.ts, src/parse.test.ts
[bug-fix] tearing down agents + sandbox...
```

The four-phase TDD work (Understand → Reproduce → Fix → Pull Request) happens entirely inside the LLM’s session in the sandbox, so it doesn’t surface in the dev-server log line by line. To see those events streamed live, switch to the SSE invocation shown later in How bug-fix.ts is actually invoked.
The HTTP response body returned to your curl is the structured result the agent emits:
```json
{
  "result": {
    "branch": "flue/fix-issue-284",
    "prUrl": "https://github.com/your-username/your-fork/pull/1",
    "testFile": "src/parse.test.ts",
    "filesChanged": ["src/index.ts", "src/parse.test.ts"],
    "summary": "The parse() regex only matched plain decimal numbers in the value group (`-?\\d*\\.?\\d+`), so when format() produced scientific notation (e.g. `5.696545792019405e+297y`) via JavaScript's default number serialisation for very large Math.round() results, parse() returned NaN; the fix extends the value capture group with an optional exponent part (`(?:e[+-]?\\d+)?`) so scientific notation is accepted transparently."
  }
}
```

## 2. Project Setup

### Clone the Repository

Clone the Daytona repository and navigate to the example directory:
```
git clone https://github.com/daytonaio/daytona.git
cd daytona/guides/typescript/flue
```

### Fork a Demo Target

You need a target repository to demo against. We recommend vercel/ms, the well-known millisecond conversion utility. It’s small (one ~244-line source file), uses Jest for tests, has an MIT license, and ships with real open issues. Fork it so the agent can push branches and open PRs against your copy:

```
gh repo fork vercel/ms --clone=false
```

The agent will operate on your fork (referred to in this guide as your-username/your-fork, where your-fork is whatever you named it), so any branches and pull requests it creates land on your fork, never upstream.
### Configure Environment

Copy .env.example to .env and fill in your keys:

```
cp .env.example .env
```

| Variable | Required | Source |
|---|---|---|
| `DAYTONA_API_KEY` | yes | Daytona Dashboard |
| `ANTHROPIC_API_KEY` | yes | For this agent’s default model, anthropic/claude-sonnet-4-6. Required only if you don’t override MODEL. (Flue itself has no default; bug-fix.ts picks one.) |
| `GITHUB_TOKEN` | yes | A Personal Access Token with repo scope (create one) |
| `MODEL` | no | Override this agent’s default model. Any provider/model-id recognized by @mariozechner/pi-ai. Examples: anthropic/claude-opus-4-7, openai/gpt-5.5 |
| `DEMO_REPO` | no¹ | Default target fork in `<owner>/<repo>` form (e.g. your-username/your-fork). Used when the webhook payload omits repo |
| `DEMO_ISSUE` | no¹ | Default issue number (e.g. 284). Used when the webhook payload omits issueNumber |
| `ISSUE_REPO` | no | Override the issue source, in `<owner>/<repo>` form. By default the agent auto-detects the upstream parent of DEMO_REPO; set this if DEMO_REPO is not a fork or you want to point at a different repo |
¹ Either set both DEMO_REPO / DEMO_ISSUE in .env and trigger with an empty body, or omit them and pass repo / issueNumber on every webhook call. Payload always wins over .env.
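The payload-over-.env precedence rule can be sketched as a tiny resolver. This is a hypothetical helper for illustration, not code shipped with the guide; the names `resolveTarget`, `Payload`, and `Env` are mine:

```typescript
// Hypothetical sketch of the precedence rule: payload fields always win over .env defaults.
type Payload = { repo?: string; issueNumber?: number }
type Env = { DEMO_REPO?: string; DEMO_ISSUE?: string }

function resolveTarget(payload: Payload, env: Env): { repo: string; issueNumber: number } {
  const repo = payload.repo ?? env.DEMO_REPO
  const issueNumber =
    payload.issueNumber ?? (env.DEMO_ISSUE ? Number(env.DEMO_ISSUE) : undefined)
  if (!repo || issueNumber === undefined || Number.isNaN(issueNumber)) {
    throw new Error('Set DEMO_REPO/DEMO_ISSUE in .env or pass repo/issueNumber in the payload')
  }
  return { repo, issueNumber }
}

// Payload always wins over .env:
resolveTarget({ repo: 'me/fork', issueNumber: 42 }, { DEMO_REPO: 'other/repo', DEMO_ISSUE: '284' })
// → { repo: 'me/fork', issueNumber: 42 }
```

Either source alone is enough; only a target missing from both places is an error.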
### Install Dependencies

```
npm install
```

### Start the Agent Server

```
npm run dev
```

Flue boots a webhook server on port 3583 and discovers the bug-fix agent automatically:

```
[flue] Starting dev server (target: node)
[flue] Target: node
[flue] Found 1 role(s): test-driven-developer
[flue] Found 1 agent(s): bug-fix
[flue] Webhook agents: bug-fix
[flue] Built: dist/server.mjs
[flue] Server: http://localhost:3583
[flue] Try: curl -X POST http://localhost:3583/agents/bug-fix/test-1 \
       -H 'Content-Type: application/json' -d '{}'
[flue] Press Ctrl+C to stop
```

### Trigger the Agent
There are three equivalent ways to trigger the agent. Pick whichever fits your workflow.
Option A: drive everything from .env (default sync mode). With DEMO_REPO=your-username/your-fork and DEMO_ISSUE=<number> set in .env, fire an empty payload:
```
curl -X POST http://localhost:3583/agents/bug-fix/run-1 \
  -H "Content-Type: application/json" \
  -d '{}'
```

Option B: pass the target per call (default sync mode). Override .env (or skip it entirely) by sending the target in the payload:

```
curl -X POST http://localhost:3583/agents/bug-fix/run-1 \
  -H "Content-Type: application/json" \
  -d '{ "repo": "your-username/your-fork", "issueNumber": <number> }'
```

Replace your-username/your-fork with your fork’s slug and <number> with the issue number you want to target.
The flue dev terminal shows the orchestrator’s setup logs in real time. When the agent finishes, the response body contains the structured result.
Option C: one-shot with live tool tracing (recommended when iterating or debugging):
Stop flue dev (or leave it running; either works) and run:

```
npm run run
```

That maps to flue run bug-fix --target node --id run-1 --env .env --payload '{}'. Unlike Options A and B, this builds and spawns its own ephemeral server, POSTs with Accept: text/event-stream, and decorates every agent event (tool:start, tool:done, the LLM’s reasoning text) into a readable progress line so you can watch the LLM work in real time. The final structured result is printed at the end, ready to pipe into downstream tooling. See the Example Walkthrough below for a real npm run run trace.
Use Options A/B for quiet production-style runs against a long-lived flue dev server. Use Option C when you want to watch the LLM’s tool calls tool-by-tool.
## 3. Understanding the Architecture

This example splits responsibility between TypeScript and Markdown, the idiomatic Flue pattern. Plumbing (sandbox lifecycle, payload validation, structured outputs) lives in .ts; the agent’s reasoning and workflow live in .md.
The directory layout is defined by Flue, not by us. Flue’s CLI looks for a .flue/ workspace at the project root containing agents/, connectors/, and roles/ subdirectories, and discovers skills under .agents/skills/<skill-name>/SKILL.md. We just populate those well-known locations:
```
guides/typescript/flue/
├── .flue/                        # Flue workspace (convention)
│   ├── agents/                   # one .ts file per agent (Flue scans this dir)
│   │   └── bug-fix.ts            # orchestrator
│   ├── connectors/               # connector files referenced by agents
│   │   └── daytona.ts            # Daytona → Flue SandboxFactory adapter
│   └── roles/                    # role markdown files (subagent personas)
│       └── test-driven-developer.md
└── .agents/                      # Flue skill workspace (convention)
    └── skills/
        └── bug-fix/              # skill folder name = skill identifier
            └── SKILL.md          # actual TDD workflow logic
```

So session.skill('bug-fix', { ... }) in our agent code maps directly to .agents/skills/bug-fix/SKILL.md: Flue resolves the skill by folder name by default. If you set a name: field in the file’s frontmatter, that takes precedence over the folder name — useful when you want to keep the directory layout but rename the skill.
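The folder-name-versus-frontmatter rule can be sketched as a small resolver. This is an illustration of the precedence described above, not Flue’s actual resolver; `resolveSkillName` is a hypothetical name:

```typescript
// Hypothetical sketch: a frontmatter `name:` field (if present) takes
// precedence over the skill's folder name.
function resolveSkillName(folderName: string, skillMd: string): string {
  const fm = skillMd.match(/^---\n([\s\S]*?)\n---/) // extract the YAML frontmatter block
  if (fm) {
    const name = fm[1].match(/^name:\s*(.+)$/m)
    if (name) return name[1].trim()
  }
  return folderName // default: the directory under .agents/skills/ names the skill
}

resolveSkillName('bug-fix', '# Skill body only')                   // → 'bug-fix'
resolveSkillName('bug-fix', '---\nname: tdd-bug-fix\n---\n# Body') // → 'tdd-bug-fix'
```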
### The Daytona Connector

Daytona is a first-class Flue connector. The canonical way to install it is to pipe Flue’s connector registry to your AI coding agent:
```
flue add daytona | claude
# or: opencode | codex | cursor-agent | pi
```

This requires an AI coding-agent CLI to already be installed and authenticated locally — pick whichever one you already use (claude, opencode, codex, cursor-agent, pi, etc.). If you don’t have any installed yet, this guide ships the resulting connector pre-built so you can skip the flue add step entirely.
flue add daytona fetches the official installation instructions from https://flueframework.com/cli/connectors/daytona.md and writes them to stdout. Your AI agent reads those instructions and writes the connector adapter (.flue/connectors/daytona.ts) into your project automatically. No manual file copying, no version drift.
This guide ships the resulting .flue/connectors/daytona.ts pre-built so the demo runs without an extra step, but the file is byte-identical to what flue add daytona | <agent> produces. Once installed, you import the connector and pass it to init():
```ts
import { Daytona } from '@daytona/sdk'
import { daytona } from '../connectors/daytona'

const client = new Daytona({ apiKey: env.DAYTONA_API_KEY })
const sandbox = await client.create()

const agent = await init({
  // cleanup: true arms sandbox.delete() to fire on agent.destroy()
  // Flue does NOT auto-destroy on handler return; see Section 5.
  sandbox: daytona(sandbox, { cleanup: true }),
  model: 'anthropic/claude-sonnet-4-6',
})
```

The user owns the Daytona client lifecycle (you decide how the sandbox is created, reused, or cleaned up); Flue just adapts it for agent use. The cleanup: true option arms a sandbox.delete() callback that fires when agent.destroy() is called, but Flue does NOT auto-destroy on handler return. The orchestrator must explicitly call destroy(), which our try/finally does (covered in Section 5: Cleanup).
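The ownership contract (cleanup: true arms a callback; only an explicit destroy() fires it) can be modeled with a toy adapter. This is a sketch of the contract only, not the shipped connector, which is asynchronous and far richer; `adapt` and `SandboxLike` are invented names:

```typescript
// Toy model of the cleanup contract. `cleanup: true` only *arms* a delete
// callback; nothing fires until destroy() is called explicitly.
interface SandboxLike { delete(): void }

function adapt(sandbox: SandboxLike, opts: { cleanup?: boolean } = {}) {
  return {
    destroy() {
      // Fires only when destroy() runs, never on handler return.
      if (opts.cleanup) sandbox.delete()
    },
  }
}

let deleted = 0
const sandbox: SandboxLike = { delete() { deleted++ } }

const setupLike = adapt(sandbox, { cleanup: true }) // arms teardown
const projectLike = adapt(sandbox)                  // shares the sandbox, no teardown

projectLike.destroy() // deleted is still 0: the unarmed adapter leaves the sandbox alive
setupLike.destroy()   // deleted becomes 1: only the armed adapter deletes the sandbox
```

This mirrors the two-agent pattern used later: one adapter owns teardown, the other just shares the sandbox.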
### The Orchestrator

The agent file (.flue/agents/bug-fix.ts) is small on purpose. It provisions the sandbox, prepares the environment, and hands off to a skill that does the real work. Here’s the structural shape — for the full file (env validation, slug-format checks, package-manager auto-install, gh/git config, the try/finally cleanup), open .flue/agents/bug-fix.ts from the guide directory you cloned earlier:
```ts
// First agent: setup phase. cleanup: true arms sandbox.delete() on destroy().
const setupAgent = await init({
  sandbox: daytona(sandbox, { cleanup: true }),
  model,
})
const setup = await setupAgent.session()
// ... installing gh, setting up git, cloning the fork, installing deps,
// fetching the issue body, uploading the SKILL.md ...

// Second agent: shares the same sandbox, but rooted in the cloned project dir
// so Flue discovers our SKILL.md from `.agents/skills/bug-fix/SKILL.md`.
const projectAgent = await init({
  id: `bug-fix-${issueNumber}`,
  sandbox: daytona(sandbox), // no cleanup option — setupAgent owns teardown
  cwd: projectDir,
  model,
})
const session = await projectAgent.session()

return await session.skill('bug-fix', {
  args: { issueNumber, issueData, repo, issueRepo, packageManager },
  role: 'test-driven-developer',
  result: ResultSchema,
})
```

A few details worth highlighting:
- Two agents, one sandbox. The setup agent installs `gh`, configures auth, clones the repo, and installs dependencies. A second agent (given a different `id` and `cwd`) operates inside the cloned repo and discovers our `bug-fix` skill from `.agents/skills/bug-fix/SKILL.md`, which the orchestrator uploads into the cloned worktree just before init. Both agents share the same Daytona sandbox. The distinct `id` matters: a fresh `id` opens a fresh Flue session — see *What `<id>` actually means: sessions* for the full lifecycle.
- No `AGENTS.md` upload. Flue would happily read an `AGENTS.md` from the session cwd and prepend it to every system prompt, but that file lives at the cloned repo’s root and uploading our own would overwrite the target’s `AGENTS.md` if it has one. Every guardrail we’d put there (TDD discipline, minimal change, match host code style) is already covered by the `test-driven-developer` role and the `bug-fix` skill body, so the harness ships nothing at the worktree root.
- `.git/info/exclude` keeps the worktree clean. After uploading `SKILL.md` to `.agents/skills/bug-fix/`, the orchestrator appends `.agents/` to the cloned repo’s `.git/info/exclude` (git’s local-only ignore — it does NOT modify the target’s `.gitignore`). The harness scaffolding stays invisible to `git status` and never accidentally lands in a commit.
- Two repos, one workflow. `repo` is the user’s fork (where branches and the PR land); `issueRepo` is where the issue lives (the upstream parent, auto-detected via `gh repo view --json parent` because GitHub disables issues on forks).
- Package-manager auto-detection. The setup phase detects the package manager from the project’s lockfile, installs it if missing, and passes the name into the skill so the LLM uses the right test command.
- Structured input and output. The `PayloadSchema` validates the incoming HTTP body with Valibot, and `ResultSchema` forces the agent to return a typed `{ branch, prUrl, testFile, filesChanged, summary }` object you can pipe into downstream automation.
- Skills, not prompts. Instead of cramming the TDD workflow into a string, the agent calls a named skill and supplies a role. The actual logic lives in markdown.
### The Skill (Where the Real Logic Lives)

.agents/skills/bug-fix/SKILL.md defines the TDD workflow as four strict phases. The agent is required to run them in order, and it cannot proceed to the next phase without producing concrete evidence (a read, a failing test, a passing test, a commit):
```md
## Phase 1: Understand
Read the issue body. Identify expected vs. actual behavior.
Inspect package.json, README.md, AGENTS.md. Identify the test framework.
Read the source file(s) most likely involved. Read at least one test file.

## Phase 2: Reproduce
Create branch flue/fix-issue-{{issueNumber}}.
Write a single, focused test that asserts the expected behavior.
Run the test command. The test MUST fail.

## Phase 3: Fix
Make the minimal code change required to make the failing test pass.
Run the full test suite. All tests MUST pass.

## Phase 4: Pull Request
Commit with `fix: <summary> (#{{issueNumber}})`.
Push the branch to the user's fork.
Open a PR via `gh pr create` with reproduction + verification output.
```

Because the workflow is markdown, you can tighten it (add a “no --force push” rule), loosen it (allow multi-file fixes), or fork it for a different language without touching TypeScript.
### The Role

.flue/roles/test-driven-developer.md defines the agent’s persona: a disciplined contributor who treats the target repository as someone else’s project. The role is referenced in the skill call (role: 'test-driven-developer') and shapes how the agent makes tradeoffs (minimal change, match host code style, never disable existing tests).
### How bug-fix.ts is actually invoked

Nothing in our code calls our agent’s default export directly; Flue’s CLI does. Here’s the full chain from npm run dev to handler(ctx):
Build time (flue dev startup):
- `flue dev --target node` calls `dev()` from `@flue/sdk`, which runs `build()`. `build()` does `fs.readdirSync('.flue/agents')` and keeps any entry matching `/\.(ts|js|mts|mjs)$/`. Our `bug-fix.ts` matches → agent name is `bug-fix` (filename without extension).
- For each agent file, Flue uses the TypeScript AST to find the static `export const triggers = {...}` declaration, validating that `webhook` is `true` or `false`. Our `triggers = { webhook: true }` registers the agent for HTTP access.
- The build generates a Hono server entry that imports each agent’s default export, then esbuilds it to `dist/server.mjs`.
- The dev server spawns `node dist/server.mjs` with `PORT=3583` and `FLUE_MODE=local`.
Request time (when you curl):
The generated server mounts a single dynamic route, `POST /agents/:name/:id`. When a request arrives:
1. Validate the method, agent name, and webhook accessibility.
2. Parse the JSON body → `payload` (defaults to `{}` for empty bodies).
3. Pick a response mode based on headers:

   | Headers sent by client | Server behavior | Status |
   |---|---|---|
   | `Content-Type: application/json` (default) | Wait for handler, return `{ "result": <handlerReturn> }` | 200 |
   | `Accept: text/event-stream` | Stream SSE events (channel names are `tool_start`, `text_delta`, …, finally `result`) | 200 |
   | `x-webhook: true` | Fire-and-forget; run handler in background | 202 |

   The CLI’s pretty-print form (`[flue] tool:start`, `[flue] tool:done`) you see in `flue run` output is `flue run`’s own decoration; the underlying SSE channel names use underscores (`tool_start`, `tool_end`). If you’re consuming the SSE stream from your own client, listen for the underscore form.
4. Construct a `FlueContext` (`{ id, payload, env, init }`) and invoke our default export: `handler(ctx)`.
5. Return whatever the handler resolves with, in whichever mode was selected.
So when you run our curl example without special headers, you hit the sync mode: the connection stays open until the agent finishes (PR opened), then the server returns { "result": { branch, prUrl, ... } }.
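The header-based mode selection can be sketched as a pure function. This is a hypothetical model, not Flue’s generated routing code, and the precedence between `x-webhook` and `Accept` shown here is my assumption:

```typescript
// Hypothetical sketch of how a server might pick a response mode from headers.
type Mode =
  | { kind: 'sync'; status: 200 }
  | { kind: 'sse'; status: 200 }
  | { kind: 'background'; status: 202 }

function pickMode(headers: Record<string, string>): Mode {
  // Normalize header names (HTTP headers are case-insensitive).
  const h = Object.fromEntries(
    Object.entries(headers).map(([k, v]) => [k.toLowerCase(), v])
  )
  if (h['x-webhook'] === 'true') return { kind: 'background', status: 202 } // fire-and-forget
  if ((h['accept'] ?? '').includes('text/event-stream')) return { kind: 'sse', status: 200 }
  return { kind: 'sync', status: 200 } // default: wait for handler, wrap in { result }
}

pickMode({ 'Content-Type': 'application/json' }) // → sync, 200
pickMode({ Accept: 'text/event-stream' })        // → sse, 200
pickMode({ 'x-webhook': 'true' })                // → background, 202
```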
If you want to watch the agent’s progress live, switch to SSE:
```
curl -N -X POST http://localhost:3583/agents/bug-fix/run-1 \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{}'
```

Or use Flue’s one-shot CLI invoker, which handles SSE for you and prints the result to stdout:
```
npm run run
# equivalent to:
# flue run bug-fix --target node --id run-1 --env .env --payload '{}'
```

flue run builds, spawns the server, POSTs with Accept: text/event-stream, streams events to stderr, prints the final result to stdout, and shuts the server down. Perfect for CI.
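If you consume the raw SSE stream from your own client instead of flue run, you need to split it into event frames. A minimal parser sketch for complete frames, following the standard `event:`/`data:` wire format (the event names and payloads below are illustrative):

```typescript
// Minimal SSE frame parser: frames are separated by a blank line; each frame
// carries an optional `event:` name (default "message") and `data:` lines.
type SseEvent = { event: string; data: string }

function parseSse(chunk: string): SseEvent[] {
  const events: SseEvent[] = []
  for (const frame of chunk.split('\n\n')) {
    let event = 'message' // SSE's default event name
    const data: string[] = []
    for (const line of frame.split('\n')) {
      if (line.startsWith('event:')) event = line.slice(6).trim()
      else if (line.startsWith('data:')) data.push(line.slice(5).trim())
    }
    if (data.length) events.push({ event, data: data.join('\n') })
  }
  return events
}

parseSse('event: tool_start\ndata: {"tool":"read"}\n\nevent: result\ndata: {"prUrl":"..."}\n\n')
// → [{ event: 'tool_start', data: '{"tool":"read"}' },
//    { event: 'result', data: '{"prUrl":"..."}' }]
```

Remember the underscore channel names (tool_start, text_delta, result) noted above; the colon forms are flue run’s display decoration only.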
### What <id> actually means: sessions

The <id> segment in POST /agents/bug-fix/<id> is not just a label. It identifies a Flue session: the persistent message history and conversation metadata that agent.session() opens inside your handler.
```
POST /agents/bug-fix/run-1   ← <id> = "run-1" → session "run-1" for the bug-fix agent
POST /agents/bug-fix/run-2   ← different <id> → fresh session, no shared state
POST /agents/bug-fix/run-1   ← same <id> as before → REUSES session "run-1"
```

Same <id> reused, what actually happens:
- Your handler function runs from the top, every time (it’s just a function, not auto-resumed).
- `client.create()` makes a new Daytona sandbox each call (because our code calls it unconditionally).
- But `await agent.session()` inside the handler resolves to the same Flue session object as the previous call with that id, so the LLM sees the previous run’s message history as context for this run.
So same-id reuse persists the conversation, not the sandbox. For a chat-style agent that’s exactly what you want; for our one-shot bug-fix agent it’s mostly noise: the LLM might short-circuit with “I already analyzed this” and muddy the fresh run.
Practical guidance for this guide:
- One unique `<id>` per run (`run-1`, `run-2`, or `$(uuidgen)`). Treat each invocation as fresh.
- Pick a stable `<id>` only if you want resumability (e.g., the agent crashed mid-fix and you want the LLM to remember its prior reasoning). You’d also need to extend the orchestrator to skip sandbox setup when a sandbox already exists for that id.
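A toy model of the session table makes the reuse rule concrete. This is an illustration only, not Flue’s storage (which persists real message history, not strings):

```typescript
// Toy model: the <id> keys a persistent history, so reusing an id replays
// prior context, while a fresh id starts empty.
const sessions = new Map<string, string[]>()

function openSession(agent: string, id: string): string[] {
  const key = `${agent}/${id}`
  if (!sessions.has(key)) sessions.set(key, []) // fresh id → fresh history
  return sessions.get(key)!                     // same id → the same history object
}

const run1 = openSession('bug-fix', 'run-1')
run1.push('analyzed issue #284')

openSession('bug-fix', 'run-2') // fresh session: []
openSession('bug-fix', 'run-1') // reuses run-1: ['analyzed issue #284']
```

Note what this model deliberately omits: the sandbox. Only the conversation is keyed by id, which is exactly why same-id reuse persists history but still provisions a new sandbox.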
## 4. Example Walkthrough

Let’s trace what happens when you trigger the agent against vercel/ms issue #284. The reporter found that ms() violates its own roundtrip contract: ms(Number.MAX_VALUE) returns a string in scientific notation that parse() can no longer read back.
To watch the agent work tool-by-tool, use Flue’s one-shot CLI invoker (already wired into our package.json):
```
npm run run
```

That runs flue run bug-fix --target node --id run-1 --env .env --payload '{}', which builds, spawns an ephemeral server, POSTs with Accept: text/event-stream, and decorates each agent event into a readable progress line. Below is a trimmed real run against a fork of vercel/ms. The orchestrator’s [bug-fix] ... lines come from our setup phase; the [flue] tool:start/done ... lines and inline reasoning text are the LLM’s tool calls and thinking streamed back via SSE:
```
[bug-fix] target: your-username/your-fork#284 (model: anthropic/claude-sonnet-4-6)
[bug-fix] sandbox ready (id: b5e8e152-7c67-4176-8721-c37aab6aa331)
[bug-fix] installing gh CLI in sandbox...
[bug-fix] commits will be authored as Your Name <12345+your-username@users.noreply.github.com>
[bug-fix] cloning your-username/your-fork into sandbox...
[bug-fix] detected package manager: pnpm
[bug-fix] installing pnpm...
[bug-fix] installing project dependencies...
[bug-fix] resolving issue source: vercel/ms
[bug-fix] fetching issue #284 from vercel/ms...
[bug-fix] uploading skill into sandbox + excluding it from git...
[bug-fix] running TDD workflow (reproduce → fix → PR)...
[flue] Agent started
  I'll work through this systematically following the test-driven approach.
[flue] tool:start read /home/daytona/project/package.json
[flue] tool:done read (1477 chars)
[flue] tool:start glob **/*.test.*
[flue] tool:done glob
[flue] tool:start read /home/daytona/project/src/index.ts
[flue] tool:start read /home/daytona/project/src/parse.test.ts
[flue] tool:done read (5864 chars)
[flue] tool:done read (5082 chars)
  Now I have a clear picture. Let me verify what `format(Number.MAX_VALUE)` actually
  produces, then set up the branch and write the failing test:
[flue] tool:start bash $ node -e "const out = ms(Number.MAX_VALUE); ..."
[flue] tool:done bash (218 chars)
  I've confirmed the bug: format(Number.MAX_VALUE) produces "5.696545792019405e+297y"
  which the regex in parse() cannot match. Setting up the branch and writing the test:
[flue] tool:start bash $ git checkout -b flue/fix-issue-284
[flue] tool:done bash
[flue] tool:start edit /home/daytona/project/src/parse.test.ts
[flue] tool:done edit
[flue] tool:start bash $ pnpm test:nodejs
[flue] tool:done bash (2102 chars)
  The tests fail as expected. Now let me implement the fix:
[flue] tool:start edit /home/daytona/project/src/index.ts
[flue] tool:done edit
[flue] tool:start bash $ pnpm test
[flue] tool:done bash (1319 chars)
  All 172 tests pass (167 pre-existing + 5 new). Now let's commit and push:
[flue] tool:start bash $ git add src/index.ts src/parse.test.ts && git commit -m "fix: ..."
[flue] tool:done bash
[flue] tool:start bash $ git push origin flue/fix-issue-284
[flue] tool:done bash
[flue] tool:start bash $ gh pr create --repo your-username/your-fork --base main \
        --head flue/fix-issue-284 --title "fix: ..." --body "Closes #284 ..."
[flue] tool:done bash
[bug-fix] PR opened: https://github.com/your-username/your-fork/pull/1
[bug-fix] branch: flue/fix-issue-284
[bug-fix] files changed: src/index.ts, src/parse.test.ts
[bug-fix] tearing down agents + sandbox...
```

Reading top to bottom you can see the agent following our SKILL.md phases: it understands the project (multiple parallel read calls), confirms the bug interactively before changing anything (the node -e reproduction in bash), creates a branch + writes the failing test, runs the suite to confirm the test fails, makes the fix, reruns the suite, then commits, pushes, and opens the PR. The final [bug-fix] tearing down agents + sandbox... line is the orchestrator’s try/finally doing its work — both agents get destroyed and the Daytona sandbox is deleted before the response is returned.
Note that exact wording, file paths, test counts, and PR numbers vary between runs (the LLM is non-deterministic, and the PR number depends on how many PRs your fork already has). The shape — sandbox provision → setup → four-phase TDD workflow → PR URL → cleanup — is what’s deterministic.
The four phases below zoom in on each step.
### Phase 1: Understand

The agent reads package.json to identify the test runner (Jest, run via pnpm test), then reads the single-file source at src/index.ts (244 lines) and an existing parse test (src/parse.test.ts) to learn the project’s assertion style.
The relevant code is the parse() regex around line 77 of src/index.ts:
```ts
const match = /^(?<value>-?\d*\.?\d+) *(?<unit>...)?$/i.exec(str);
```

The value group only matches plain decimal numbers; it doesn’t accept scientific notation (e+297). That’s the bug.
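The gap is easy to demonstrate in plain TypeScript. The unit alternatives below are abbreviated stand-ins (the real group in src/index.ts lists every supported unit); only the value group matters for this bug:

```typescript
// Buggy vs. fixed value group. The unit list here is a simplified placeholder.
const before = /^(?<value>-?\d*\.?\d+) *(?<unit>years?|y|ms|s)?$/i
const after  = /^(?<value>-?\d*\.?\d+(?:e[+-]?\d+)?) *(?<unit>years?|y|ms|s)?$/i

const sci = '5.696545792019405e+297y' // what format(Number.MAX_VALUE) emits

before.exec(sci) // → null: the value group stops at the 'e', so parse() returns NaN
const m = after.exec(sci)!
m.groups!.value             // → '5.696545792019405e+297'
parseFloat(m.groups!.value) // a number again, not NaN
```

The fixed pattern is the same one the agent’s summary describes: an optional `(?:e[+-]?\d+)?` exponent part appended to the value group.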
### Phase 2: Reproduce

The agent creates a new branch and writes a single, focused test that asserts the roundtrip property the reporter described:
```ts
import { ms } from './index';

describe('issue #284: roundtrip with very large numbers', () => {
  it('format() output is always parseable back to a number', () => {
    const out = ms(Number.MAX_VALUE);
    expect(typeof out).toBe('string');
    expect(ms(out)).not.toBeNaN();
  });
});
```

It runs pnpm test and confirms the failures with the current implementation. From a real run:
```
FAIL src/parse.test.ts
  ● parse(scientific notation) › should parse scientific notation values with a unit (roundtrip with format)
    Expected: false
    Received: true
    (Number.isNaN was true — parse returned NaN)
  ● parse(scientific notation) › should parse scientific notation with y unit
  ● parse(scientific notation) › should parse scientific notation with ms unit
  ● parse(scientific notation) › should parse scientific notation with s unit
  ● parse(scientific notation) › should parse negative scientific notation with a unit
```

If the tests had unexpectedly passed, the agent would refuse to continue: a test that doesn’t fail isn’t a reproduction.
### Phase 3: Fix

With the bug reproduced, the agent makes a minimal, surgical change to src/index.ts so parse() accepts scientific notation in the numeric value group, then reruns the full suite:
```
PASS src/parse.test.ts
PASS src/index.test.ts
PASS src/format.test.ts
PASS src/parse-strict.test.ts

Test Suites: 4 passed, 4 total
Tests:       172 passed, 172 total
```

Both the new tests and every pre-existing test pass. The fix is a small regex extension (one optional group): exactly the kind of minimal change the test-driven-developer role rewards.
### Phase 4: Pull Request

The agent commits, pushes, and opens a PR against your fork:
```
$ git commit -m "fix: support scientific notation in parse() to fix roundtrip with large numbers (#284)"
$ git push origin flue/fix-issue-284
$ gh pr create --repo your-username/your-fork \
    --base main \
    --head flue/fix-issue-284 \
    --title "fix: support scientific notation in parse() to fix roundtrip with large numbers (#284)" \
    --body "Closes vercel#284 ..."
```

The PR body the agent generated includes the failing-test output from Phase 2, a one-paragraph root-cause analysis, and the passing-test output from Phase 3 — everything a human reviewer needs to merge in under five minutes.
The HTTP response body returned to your curl wraps the handler’s return value under a result key:
```json
{
  "result": {
    "branch": "flue/fix-issue-284",
    "prUrl": "https://github.com/your-username/your-fork/pull/1",
    "testFile": "src/parse.test.ts",
    "filesChanged": ["src/index.ts", "src/parse.test.ts"],
    "summary": "The parse() regex only matched plain decimal numbers in the value group (`-?\\d*\\.?\\d+`), so when format() produced scientific notation (e.g. `5.696545792019405e+297y`) via JavaScript's default number serialisation for very large Math.round() results, parse() returned NaN; the fix extends the value capture group with an optional exponent part (`(?:e[+-]?\\d+)?`) so scientific notation is accepted transparently."
  }
}
```

Open the PR URL in your browser to review the diff and merge, exactly as you would for a human-authored contribution.
## 5. Cleanup

Flue does not auto-destroy sessions when a handler returns — sessions persist for resumability via the same <id>, and the cleanup: true callback registered on our connector only fires when agent.destroy() is explicitly called. The orchestrator therefore wraps the entire two-agent flow in a try { ... } finally { ... } block:
```ts
try {
  setupAgent = await init({ sandbox: daytona(sandbox, { cleanup: true }), ... })
  // ... setup work + projectAgent + skill invocation
  return result
} finally {
  console.log('[bug-fix] tearing down agents + sandbox...')
  if (projectAgent) {
    try { await projectAgent.destroy() } catch (err) { console.error(err) }
  }
  if (setupAgent) {
    try { await setupAgent.destroy() } catch (err) { console.error(err) }
  } else {
    // setupAgent never armed cleanup: true; delete sandbox directly
    try { await sandbox.delete() } catch (err) { console.error(err) }
  }
}
```

Order matters: the project agent is destroyed first (closing its session has no sandbox impact, since it doesn’t have cleanup: true), then the setup agent’s destroy fires the registered sandbox.delete() callback. The fallback else branch handles the case where init() itself threw before setupAgent was created — in that scenario nothing armed cleanup: true, so the orchestrator calls sandbox.delete() directly on the already-created sandbox. (If client.create() itself fails earlier, no sandbox object exists at all, so there’s nothing to leak.)
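The best-effort pattern inside the finally block can be distilled into a helper. This is a sketch of the pattern, not code from the guide; the real teardown awaits async destroy()/delete() calls, while this version is synchronous to keep the shape visible, and `bestEffort` is an invented name:

```typescript
// Best-effort teardown: run every cleanup step, record failures, never let one
// failing step abort the rest (or mask the original error from the try block).
function bestEffort(steps: Array<() => void>): string[] {
  const errors: string[] = []
  for (const step of steps) {
    try {
      step()
    } catch (err) {
      errors.push(String(err)) // logged, not rethrown: cleanup keeps going
    }
  }
  return errors
}

// All three steps run even though the first one throws:
const teardownLog = bestEffort([
  () => { throw new Error('projectAgent.destroy() failed') }, // e.g. transient API error
  () => { /* setupAgent.destroy() would run here */ },
  () => { /* the sandbox.delete() fallback would run here */ },
])
// teardownLog holds exactly one entry, for the failed first step
```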
This covers the common failure paths: successful runs, caught LLM errors, and exceptions thrown during the workflow. Cleanup is best-effort — if destroy() or sandbox.delete() itself throws (transient API error, network drop), the failure is logged but the sandbox is not retried. Confirm in your Daytona Dashboard after each run, and clean up any orphans manually if you spot them.
## Key Advantages

- TDD by construction: the skill’s phase ordering forces a failing test before any fix lands, so every PR is reproducible.
- Real pull requests: the agent uses `gh` inside the sandbox to push and open PRs you can merge in the GitHub UI, not just patches you copy by hand.
- Skill-first design: tweak the workflow by editing markdown. No recompile, no redeploy.
- Structured outputs: results are validated by Valibot schemas, so downstream automation never has to parse free-form text.
- Sandbox-isolated execution: cloning, dependency installation, and test runs all happen inside Daytona, so your host stays clean even if the target repo or its dependencies are malicious.