This guide builds an autonomous bug-fix agent using [Flue](https://flueframework.com/) and [Daytona](https://www.daytona.io/) sandboxes. Given a GitHub issue, the agent reproduces the bug with a failing test, implements the minimal fix, runs the full test suite, and opens a real pull request.

A sandbox is essential for this workflow. The agent clones unknown code, installs unknown dependencies, and executes the project's test suite — operations that need strict isolation from your host. Daytona provisions a fresh isolated environment for every run and tears it down on completion, so an untrusted repository can never affect your host.

---

### 1. Workflow Overview

You point the agent at an open issue on any GitHub repository. The agent provisions a Daytona sandbox, clones the repo into it, then executes a strict **Reproduce → Fix → Verify → PR** workflow. When it's done, it returns the URL of a real pull request you can review in the GitHub UI.

A successful run against `vercel/ms` issue #284 looks like this in the `flue dev` terminal:

```
[bug-fix] target: your-username/your-fork#284 (model: anthropic/claude-sonnet-4-6)
[bug-fix] sandbox ready (id: a44a184e-cf0a-4407-bb1a-02f1b8000466)
[bug-fix] installing gh CLI in sandbox...
[bug-fix] commits will be authored as Your Name <12345+your-username@users.noreply.github.com>
[bug-fix] cloning your-username/your-fork into sandbox...
[bug-fix] detected package manager: pnpm
[bug-fix] installing pnpm...
[bug-fix] installing project dependencies...
[bug-fix] resolving issue source: vercel/ms
[bug-fix] fetching issue #284 from vercel/ms...
[bug-fix] uploading skill into sandbox + excluding it from git...
[bug-fix] running TDD workflow (reproduce → fix → PR)...
[bug-fix] PR opened: https://github.com/your-username/your-fork/pull/1
[bug-fix] branch: flue/fix-issue-284
[bug-fix] files changed: src/index.ts, src/parse.test.ts
[bug-fix] tearing down agents + sandbox...
```

The four-phase TDD work (Understand → Reproduce → Fix → Pull Request) happens entirely inside the LLM's session in the sandbox, so it doesn't surface in the dev-server log line by line. To see those events streamed live, switch to the SSE invocation shown later in [How `bug-fix.ts` is actually invoked](#how-bug-fixts-is-actually-invoked).

The HTTP response body returned to your `curl` is the structured result the agent emits:

```json
{
  "result": {
    "branch": "flue/fix-issue-284",
    "prUrl": "https://github.com/your-username/your-fork/pull/1",
    "testFile": "src/parse.test.ts",
    "filesChanged": ["src/index.ts", "src/parse.test.ts"],
    "summary": "The parse() regex only matched plain decimal numbers in the value group (`-?\\d*\\.?\\d+`), so when format() produced scientific notation (e.g. `5.696545792019405e+297y`) via JavaScript's default number serialisation for very large Math.round() results, parse() returned NaN; the fix extends the value capture group with an optional exponent part (`(?:e[+-]?\\d+)?`) so scientific notation is accepted transparently."
  }
}
```

### 2. Project Setup

#### Clone the Repository

Clone the Daytona [repository](https://github.com/daytonaio/daytona.git) and navigate to the example directory:

```bash
git clone https://github.com/daytonaio/daytona.git
cd daytona/guides/typescript/flue
```

#### Fork a Demo Target

You need a target repository to demo against. We recommend [`vercel/ms`](https://github.com/vercel/ms), the well-known millisecond conversion utility. It's small (one ~244-line source file), uses Jest for tests, has an MIT license, and ships with real open issues. Fork it so the agent can push branches and open PRs against your copy:

```bash
gh repo fork vercel/ms --clone=false
```

The agent will operate on your fork (referred to in this guide as `your-username/your-fork`, where `your-fork` is whatever you named it), so any branches and pull requests it creates land on your fork, never upstream.

#### Configure Environment

Copy `.env.example` to `.env` and fill in your keys:

```bash
cp .env.example .env
```

| Variable | Required | Source |
|---|---|---|
| `DAYTONA_API_KEY` | yes | [Daytona Dashboard](https://app.daytona.io/dashboard/keys) |
| `ANTHROPIC_API_KEY` | yes | Key for this agent's default model, `anthropic/claude-sonnet-4-6`; needed unless you point `MODEL` at another provider. (Flue itself has no default model; `bug-fix.ts` picks one.) |
| `GITHUB_TOKEN` | yes | A Personal Access Token with `repo` scope ([create one](https://github.com/settings/tokens)) |
| `MODEL` | no | Override this agent's default model. Any `provider/model-id` recognized by [`@mariozechner/pi-ai`](https://www.npmjs.com/package/@mariozechner/pi-ai). Examples: `anthropic/claude-opus-4-7`, `openai/gpt-5.5` |
| `DEMO_REPO` | no¹ | Default target fork in `<owner>/<repo>` form (e.g. `your-username/your-fork`). Used when the webhook payload omits `repo` |
| `DEMO_ISSUE` | no¹ | Default issue number (e.g. `284`). Used when the webhook payload omits `issueNumber` |
| `ISSUE_REPO` | no | Override the issue source, in `<owner>/<repo>` form. By default the agent auto-detects the upstream parent of `DEMO_REPO`; set this if `DEMO_REPO` is not a fork or you want to point at a different repo |

¹ Either set both `DEMO_REPO` / `DEMO_ISSUE` in `.env` and trigger with an empty body, **or** omit them and pass `repo` / `issueNumber` on every webhook call. Payload always wins over `.env`.
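
The precedence rule can be sketched as a tiny helper (hypothetical code: the real `bug-fix.ts` validates the payload with Valibot, and `resolveTarget` is an illustrative name, not a function from the guide's source):

```typescript
// Hypothetical sketch of the payload-over-.env precedence described above.
// `payload` is the parsed webhook body; `env` mirrors DEMO_REPO / DEMO_ISSUE.
interface Payload { repo?: string; issueNumber?: number }
interface Env { DEMO_REPO?: string; DEMO_ISSUE?: string }

function resolveTarget(payload: Payload, env: Env): { repo: string; issueNumber: number } {
  // Payload always wins; .env only fills the gaps.
  const repo = payload.repo ?? env.DEMO_REPO
  const issueNumber =
    payload.issueNumber ?? (env.DEMO_ISSUE !== undefined ? Number(env.DEMO_ISSUE) : undefined)
  if (!repo || issueNumber === undefined || Number.isNaN(issueNumber)) {
    throw new Error('Set repo/issueNumber in the payload or DEMO_REPO/DEMO_ISSUE in .env')
  }
  return { repo, issueNumber }
}
```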

:::note[Why the agent reads issues from one repo and PRs against another]
GitHub forks have **issues disabled by default** and never inherit issues from the upstream. So the agent reads the issue from the fork's upstream parent (auto-detected via `gh repo view --json parent`) but pushes its branch and opens the PR against your fork. This keeps the demo isolated to your account: no spam to `vercel/ms` maintainers, no extra setup on your side. Use `ISSUE_REPO` to override the auto-detection if your `repo` isn't a fork.
:::

:::caution[Token scope]
The `GITHUB_TOKEN` is passed into the sandbox so `gh` can clone, push, and open PRs from inside it. Use a [classic PAT](https://github.com/settings/tokens/new) with the `repo` scope. The token can be revoked at any time from your GitHub settings once you're done with the demo.
:::

#### Install Dependencies

:::note[Node.js version]
Flue requires Node.js 22 or newer. Confirm your version with `node --version` before continuing.
:::

```bash
npm install
```

#### Start the Agent Server

```bash
npm run dev
```

Flue boots a webhook server on port `3583` and discovers the `bug-fix` agent automatically:

```
[flue] Starting dev server (target: node)
[flue] Target: node
[flue] Found 1 role(s): test-driven-developer
[flue] Found 1 agent(s): bug-fix
[flue] Webhook agents: bug-fix
[flue] Built: dist/server.mjs
[flue] Server: http://localhost:3583
[flue] Try: curl -X POST http://localhost:3583/agents/bug-fix/test-1 \
         -H 'Content-Type: application/json' -d '{}'
[flue] Press Ctrl+C to stop
```

#### Trigger the Agent

There are three ways to trigger the agent; all produce the same result, but they differ in how much of the run you can watch. Pick whichever fits your workflow.

**Option A: drive everything from `.env`** (default sync mode). With `DEMO_REPO=your-username/your-fork` and `DEMO_ISSUE=<number>` set in `.env`, fire an empty payload:

```bash
curl -X POST http://localhost:3583/agents/bug-fix/run-1 \
  -H "Content-Type: application/json" \
  -d '{}'
```

**Option B: pass the target per call** (default sync mode). Override `.env` (or skip it entirely) by sending the target in the payload:

```bash
curl -X POST http://localhost:3583/agents/bug-fix/run-1 \
  -H "Content-Type: application/json" \
  -d '{
    "repo": "your-username/your-fork",
    "issueNumber": <number>
  }'
```

Replace `your-username/your-fork` with your fork's slug and `<number>` with the issue number you want to target.

The `flue dev` terminal shows the orchestrator's setup logs in real time. When the agent finishes, the response body contains the structured result.

**Option C: one-shot with live tool tracing** (recommended when iterating or debugging):

Stop `flue dev` (or leave it running; either works) and run:

```bash
npm run run
```

That maps to `flue run bug-fix --target node --id run-1 --env .env --payload '{}'`. Unlike Options A and B, this **builds and spawns its own ephemeral server**, POSTs with `Accept: text/event-stream`, and decorates every agent event (`tool:start`, `tool:done`, the LLM's reasoning text) into a readable progress line so you can watch the LLM work in real time. The final structured result is printed at the end, ready to pipe into downstream tooling. See the [Example Walkthrough](#4-example-walkthrough) below for a real `npm run run` trace.

Use Options A/B for quiet production-style runs against a long-lived `flue dev` server. Use Option C when you want to watch the LLM's tool calls tool-by-tool.

:::tip[What if the issue you target no longer exists?]
If the issue you point the agent at has been closed, deleted, or never existed, the run will fail honestly instead of fabricating a fix. Two failure modes:

- **Issue not found**: the setup phase's `gh issue view` returns a non-zero exit code, and the agent throws before reaching the LLM. Pick a different issue and rerun.
- **Issue is already fixed**: the LLM proceeds through Phase 1, then in Phase 2 writes a "failing" test that actually passes. The `test-driven-developer` role's hard rule (_"a test that doesn't fail isn't a reproduction"_) makes the agent stop and return early with `prUrl: ""` and a `summary` explaining the situation. No PR is opened.

Update `DEMO_ISSUE` in `.env` (or pass `issueNumber` in the payload) and try another open issue.
:::

### 3. Understanding the Architecture

This example splits responsibility between TypeScript and Markdown, the idiomatic Flue pattern. Plumbing (sandbox lifecycle, payload validation, structured outputs) lives in `.ts`; the agent's reasoning and workflow live in `.md`.

The directory layout is **defined by Flue**, not by us. Flue's CLI looks for a `.flue/` workspace at the project root containing `agents/`, `connectors/`, and `roles/` subdirectories, and discovers skills under `.agents/skills/<skill-name>/SKILL.md`. We just populate those well-known locations:

```
guides/typescript/flue/
├── .flue/                            # Flue workspace (convention)
│   ├── agents/                       # one .ts file per agent (Flue scans this dir)
│   │   └── bug-fix.ts                # orchestrator
│   ├── connectors/                   # connector files referenced by agents
│   │   └── daytona.ts                # Daytona → Flue SandboxFactory adapter
│   └── roles/                        # role markdown files (subagent personas)
│       └── test-driven-developer.md
└── .agents/                          # Flue skill workspace (convention)
    └── skills/
        └── bug-fix/                  # skill folder name = skill identifier
            └── SKILL.md              # actual TDD workflow logic
```

So `session.skill('bug-fix', { ... })` in our agent code maps directly to `.agents/skills/bug-fix/SKILL.md`: Flue resolves the skill by folder name by default. If you set a `name:` field in the file's frontmatter, that takes precedence over the folder name — useful when you want to keep the directory layout but rename the skill.
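
For example (illustrative; the guide's own SKILL.md sets no `name:` field and is resolved by its `bug-fix` folder name), frontmatter like this would rename the skill without moving the folder:

```markdown
---
name: issue-fixer
---

## Phase 1: Understand
...
```

With that frontmatter in place, the orchestrator would call `session.skill('issue-fixer', ...)` even though the folder is still `.agents/skills/bug-fix/`.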

#### The Daytona Connector

Daytona is a first-class Flue connector. The canonical way to install it is to pipe Flue's connector registry to your AI coding agent:

```bash
flue add daytona | claude
# or: opencode | codex | cursor-agent | pi
```

This requires an AI coding-agent CLI to already be installed and authenticated locally — pick whichever one you already use (`claude`, `opencode`, `codex`, `cursor-agent`, `pi`, etc.). If you don't have any installed yet, this guide ships the resulting connector pre-built so you can skip the `flue add` step entirely.

`flue add daytona` fetches the official installation instructions from `https://flueframework.com/cli/connectors/daytona.md` and writes them to stdout. Your AI agent reads those instructions and writes the connector adapter (`.flue/connectors/daytona.ts`) into your project automatically. No manual file copying, no version drift.

The shipped `.flue/connectors/daytona.ts` is byte-identical to what `flue add daytona | <agent>` produces, so the demo runs without that extra step. Once installed, you import the connector and pass it to `init()`:

```typescript
import { Daytona } from '@daytona/sdk'
import { daytona } from '../connectors/daytona'

const client = new Daytona({ apiKey: env.DAYTONA_API_KEY })
const sandbox = await client.create()

const agent = await init({
  // cleanup: true arms sandbox.delete() to fire on agent.destroy()
  // Flue does NOT auto-destroy on handler return; see Section 5.
  sandbox: daytona(sandbox, { cleanup: true }),
  model: 'anthropic/claude-sonnet-4-6',
})
```

The user owns the Daytona client lifecycle (you decide how the sandbox is created, reused, or cleaned up); Flue just adapts it for agent use. The `cleanup: true` option **arms** a `sandbox.delete()` callback that fires when `agent.destroy()` is called; Flue does NOT auto-destroy on handler return. The orchestrator must explicitly call `destroy()`, which our `try/finally` does (covered in [Section 5: Cleanup](#5-cleanup)).
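
Conceptually, the arming behavior reduces to registering a callback at construction time. The sketch below is illustrative only, not the generated connector; the adapter's real interface comes from `flue add daytona`:

```typescript
// Illustrative shape only, not the generated .flue/connectors/daytona.ts.
// The key idea: `cleanup: true` registers a destroy-time callback up front;
// nothing fires on its own when the handler returns.
interface DaytonaLikeSandbox { delete(): Promise<void> }

function daytonaSketch(sandbox: DaytonaLikeSandbox, opts: { cleanup?: boolean } = {}) {
  const onDestroy: Array<() => Promise<void>> = []
  if (opts.cleanup) {
    // Armed here; fired only when the agent's destroy() runs the callbacks.
    onDestroy.push(() => sandbox.delete())
  }
  return {
    sandbox,
    // Flue would invoke this from agent.destroy().
    async destroy() {
      for (const cb of onDestroy) await cb()
    },
  }
}
```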

#### The Orchestrator

The agent file (`.flue/agents/bug-fix.ts`) is small on purpose. It provisions the sandbox, prepares the environment, and hands off to a skill that does the real work. Here's the structural shape — for the full file (env validation, slug-format checks, package-manager auto-install, gh/git config, the `try/finally` cleanup), open `.flue/agents/bug-fix.ts` from the guide directory you cloned earlier:

```typescript
// First agent: setup phase. cleanup: true arms sandbox.delete() on destroy().
const setupAgent = await init({
  sandbox: daytona(sandbox, { cleanup: true }),
  model,
})
const setup = await setupAgent.session()
// ... installing gh, setting up git, cloning the fork, installing deps,
//     fetching the issue body, uploading the SKILL.md ...

// Second agent: shares the same sandbox, but rooted in the cloned project dir
// so Flue discovers our SKILL.md from `.agents/skills/bug-fix/SKILL.md`.
const projectAgent = await init({
  id: `bug-fix-${issueNumber}`,
  sandbox: daytona(sandbox), // no cleanup option — setupAgent owns teardown
  cwd: projectDir,
  model,
})
const session = await projectAgent.session()

return await session.skill('bug-fix', {
  args: { issueNumber, issueData, repo, issueRepo, packageManager },
  role: 'test-driven-developer',
  result: ResultSchema,
})
```

A few details worth highlighting:

- **Two agents, one sandbox.** The setup agent installs `gh`, configures auth, clones the repo, and installs dependencies. A second agent (given a different `id` and `cwd`) operates inside the cloned repo and discovers our `bug-fix` skill from `.agents/skills/bug-fix/SKILL.md`, which the orchestrator uploads into the cloned worktree just before init. Both agents share the same Daytona sandbox. The distinct `id` matters: a fresh `id` opens a fresh Flue session — see [What `<id>` actually means: sessions](#what-id-actually-means-sessions) for the full lifecycle.
- **No `AGENTS.md` upload.** Flue would happily read an `AGENTS.md` from the session cwd and prepend it to every system prompt, but that file lives at the cloned repo's root and uploading our own would overwrite the target's `AGENTS.md` if it has one. Every guardrail we'd put there (TDD discipline, minimal change, match host code style) is already covered by the `test-driven-developer` role and the `bug-fix` skill body, so the harness ships nothing at the worktree root.
- **`.git/info/exclude` keeps the worktree clean.** After uploading `SKILL.md` to `.agents/skills/bug-fix/`, the orchestrator appends `.agents/` to the cloned repo's `.git/info/exclude` (git's local-only ignore — does NOT modify the target's `.gitignore`). The harness scaffolding stays invisible to `git status` and never accidentally lands in a commit.
- **Two repos, one workflow.** `repo` is the user's fork (where branches and the PR land); `issueRepo` is where the issue lives (the upstream parent, auto-detected via `gh repo view --json parent` because GitHub disables issues on forks).
- **Package-manager auto-detection.** The setup phase detects the package manager from the project's lockfile, installs it if missing, and passes the name into the skill so the LLM uses the right test command.
- **Structured input and output.** The `PayloadSchema` validates the incoming HTTP body with [Valibot](https://valibot.dev/), and `ResultSchema` forces the agent to return a typed `{ branch, prUrl, testFile, filesChanged, summary }` object you can pipe into downstream automation.
- **Skills, not prompts.** Instead of cramming the TDD workflow into a string, the agent calls a named skill and supplies a role. The actual logic lives in markdown.

#### The Skill (Where the Real Logic Lives)

`.agents/skills/bug-fix/SKILL.md` defines the TDD workflow as four strict phases. The agent is required to run them in order, and it cannot proceed to the next phase without producing concrete evidence (a read, a failing test, a passing test, a commit):

```markdown
## Phase 1: Understand
Read the issue body. Identify expected vs. actual behavior.
Inspect package.json, README.md, AGENTS.md. Identify the test framework.
Read the source file(s) most likely involved. Read at least one test file.

## Phase 2: Reproduce
Create branch flue/fix-issue-{{issueNumber}}.
Write a single, focused test that asserts the expected behavior.
Run the test command. The test MUST fail.

## Phase 3: Fix
Make the minimal code change required to make the failing test pass.
Run the full test suite. All tests MUST pass.

## Phase 4: Pull Request
Commit with `fix: <summary> (#{{issueNumber}})`.
Push the branch to the user's fork.
Open a PR via `gh pr create` with reproduction + verification output.
```

Because the workflow is markdown, you can tighten it (add a "no `--force` push" rule), loosen it (allow multi-file fixes), or fork it for a different language without touching TypeScript.

#### The Role

`.flue/roles/test-driven-developer.md` defines the agent's persona: a disciplined contributor who treats the target repository as someone else's project. The role is referenced in the skill call (`role: 'test-driven-developer'`) and shapes how the agent makes tradeoffs (minimal change, match host code style, never disable existing tests).

#### How `bug-fix.ts` is actually invoked

Nothing in our code calls our agent's default export directly; Flue's CLI does. Here's the full chain from `npm run dev` to `handler(ctx)`:

**Build time (`flue dev` startup):**

1. `flue dev --target node` calls `dev()` from `@flue/sdk`, which runs `build()`.
2. `build()` does `fs.readdirSync('.flue/agents')` and keeps any entry matching `/\.(ts|js|mts|mjs)$/`. Our `bug-fix.ts` matches → agent name is `bug-fix` (filename without extension).
3. For each agent file, Flue uses the TypeScript AST to find the static `export const triggers = {...}` declaration, validating that `webhook` is `true` or `false`. Our `triggers = { webhook: true }` registers the agent for HTTP access.
4. The build generates a [Hono](https://hono.dev/) server entry that imports each agent's default export, then esbuilds it to `dist/server.mjs`.
5. The dev server spawns `node dist/server.mjs` with `PORT=3583` and `FLUE_MODE=local`.
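
Put together, the minimal contract a file in `.flue/agents/` must satisfy is two exports. The skeleton below is illustrative; the real `FlueContext` type comes from `@flue/sdk`, and the shape shown just mirrors the `{ id, payload, env, init }` description in this section:

```typescript
// Skeleton of an agent file as Flue's build step sees it.

// 1. Static trigger declaration: found via the TypeScript AST at build time.
export const triggers = { webhook: true }

// Illustrative context type; the real one is provided by @flue/sdk.
interface FlueContext {
  id: string                               // the <id> path segment (session id)
  payload: Record<string, unknown>         // parsed JSON body, {} if empty
  env: Record<string, string | undefined>  // process environment
  init: (opts: unknown) => Promise<unknown>
}

// 2. Default export: invoked as handler(ctx) on every POST /agents/<name>/<id>.
export default async function handler(ctx: FlueContext) {
  // ... provision sandbox, run the skill ...
  return { branch: '', prUrl: '', testFile: '', filesChanged: [], summary: '' }
}
```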

**Request time (when you `curl`):**

The generated server mounts a single dynamic route, `POST /agents/:name/:id`. When a request arrives:

1. Validate the method, agent name, and webhook accessibility.
2. Parse the JSON body → `payload` (defaults to `{}` for empty bodies).
3. **Pick a response mode based on headers:**

   | Headers sent by client | Server behavior | Status |
   |---|---|---|
   | `Content-Type: application/json` (default) | Wait for handler, return `{ "result": <handlerReturn> }` | `200` |
   | `Accept: text/event-stream` | Stream SSE events (channel names are `tool_start`, `text_delta`, …, finally `result`) | `200` |
   | `x-webhook: true` | Fire-and-forget; run handler in background | `202` |

   The pretty-printed `[flue] tool:start` / `[flue] tool:done` lines in `flue run` output are `flue run`'s own decoration; the underlying SSE channel names use underscores (`tool_start`, `tool_end`). If you're consuming the SSE stream from your own client, listen for the underscore form.

4. Construct a `FlueContext` (`{ id, payload, env, init }`) and invoke our default export: `handler(ctx)`.
5. Return whatever the handler resolves with, in whichever mode was selected.

So when you run our `curl` example without special headers, you hit the **sync mode**: the connection stays open until the agent finishes (PR opened), then the server returns `{ "result": { branch, prUrl, ... } }`.

If you want to watch the agent's progress live, switch to SSE:

```bash
curl -N -X POST http://localhost:3583/agents/bug-fix/run-1 \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{}'
```
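
If you're consuming the stream from your own client rather than curl, the frames follow standard `text/event-stream` framing with the underscore channel names. A minimal dependency-free parser (an illustrative sketch, not part of the guide's code) might look like:

```typescript
// Minimal SSE frame parser: splits a raw text/event-stream buffer into events.
// Channel names arrive in the underscore form (tool_start, tool_end, result).
interface SseEvent { event: string; data: string }

function parseSse(buffer: string): SseEvent[] {
  const events: SseEvent[] = []
  // Events are separated by a blank line; each has `event:` and `data:` fields.
  for (const frame of buffer.split('\n\n')) {
    let event = 'message'
    const data: string[] = []
    for (const line of frame.split('\n')) {
      if (line.startsWith('event:')) event = line.slice(6).trim()
      else if (line.startsWith('data:')) data.push(line.slice(5).trim())
    }
    if (data.length > 0) events.push({ event, data: data.join('\n') })
  }
  return events
}
```

Feed it chunks from a `fetch(..., { headers: { Accept: 'text/event-stream' } })` body stream and act when the `result` event arrives.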

Or use Flue's one-shot CLI invoker, which handles SSE for you and prints the result to stdout:

```bash
npm run run
# equivalent to:
# flue run bug-fix --target node --id run-1 --env .env --payload '{}'
```

`flue run` builds, spawns the server, POSTs with `Accept: text/event-stream`, streams events to stderr, prints the final result to stdout, and shuts the server down. Perfect for CI.

#### What `<id>` actually means: sessions

The `<id>` segment in `POST /agents/bug-fix/<id>` is not just a label. It identifies a Flue **session**: the persistent message history and conversation metadata that `agent.session()` opens inside your handler.

```
POST /agents/bug-fix/run-1   ← <id> = "run-1" → session "run-1" for the bug-fix agent
POST /agents/bug-fix/run-2   ← different <id> → fresh session, no shared state
POST /agents/bug-fix/run-1   ← same <id> as before → REUSES session "run-1"
```

**Same `<id>` reused, what actually happens:**

1. Your handler function runs **from the top, every time** (it's just a function, not auto-resumed).
2. `client.create()` makes a **new** Daytona sandbox each call (because our code calls it unconditionally).
3. But `await agent.session()` inside the handler resolves to the **same Flue session object** as the previous call with that id, so the LLM sees the previous run's message history as context for this run.

So same-id reuse persists the **conversation**, not the **sandbox**. For a chat-style agent that's exactly what you want; for our one-shot bug-fix agent it's mostly noise: the LLM may short-circuit with "I already analyzed this" instead of redoing the workflow.

**Practical guidance for this guide**:
- One unique `<id>` per run (`run-1`, `run-2`, or `$(uuidgen)`). Treat each invocation as fresh.
- Pick a stable `<id>` only if you want resumability (e.g., agent crashed mid-fix and you want the LLM to remember its prior reasoning). You'd also need to extend the orchestrator to skip sandbox setup when a sandbox already exists for that id.
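
A per-run unique id is one shell substitution away (a sketch; `uuidgen` is assumed available, with an epoch-seconds fallback for systems without it):

```shell
# One fresh session per invocation: put a unique suffix in the <id> segment.
RUN_ID="run-$(uuidgen 2>/dev/null || date +%s)"
echo "POST http://localhost:3583/agents/bug-fix/${RUN_ID}"
# then, with the dev server running:
# curl -X POST "http://localhost:3583/agents/bug-fix/${RUN_ID}" \
#   -H "Content-Type: application/json" -d '{}'
```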

### 4. Example Walkthrough

Let's trace what happens when you trigger the agent against [`vercel/ms`](https://github.com/vercel/ms) issue [#284](https://github.com/vercel/ms/issues/284). The reporter found that `ms()` violates its own roundtrip contract: `ms(Number.MAX_VALUE)` returns a string in scientific notation that `parse()` can no longer read back.

To watch the agent work tool-by-tool, use Flue's one-shot CLI invoker (already wired into our `package.json`):

```bash
npm run run
```

That runs `flue run bug-fix --target node --id run-1 --env .env --payload '{}'`, which builds, spawns an ephemeral server, POSTs with `Accept: text/event-stream`, and decorates each agent event into a readable progress line. Below is a trimmed real run against a fork of `vercel/ms`. The orchestrator's `[bug-fix] ...` lines come from our setup phase; the `[flue] tool:start/done ...` lines and inline italicized text are the LLM's tool calls and reasoning streamed back via SSE:

```
[bug-fix] target: your-username/your-fork#284 (model: anthropic/claude-sonnet-4-6)
[bug-fix] sandbox ready (id: b5e8e152-7c67-4176-8721-c37aab6aa331)
[bug-fix] installing gh CLI in sandbox...
[bug-fix] commits will be authored as Your Name <12345+your-username@users.noreply.github.com>
[bug-fix] cloning your-username/your-fork into sandbox...
[bug-fix] detected package manager: pnpm
[bug-fix] installing pnpm...
[bug-fix] installing project dependencies...
[bug-fix] resolving issue source: vercel/ms
[bug-fix] fetching issue #284 from vercel/ms...
[bug-fix] uploading skill into sandbox + excluding it from git...
[bug-fix] running TDD workflow (reproduce → fix → PR)...
[flue] Agent started
  I'll work through this systematically following the test-driven approach.
[flue] tool:start  read  /home/daytona/project/package.json
[flue] tool:done   read  (1477 chars)
[flue] tool:start  glob  **/*.test.*
[flue] tool:done   glob
[flue] tool:start  read  /home/daytona/project/src/index.ts
[flue] tool:start  read  /home/daytona/project/src/parse.test.ts
[flue] tool:done   read  (5864 chars)
[flue] tool:done   read  (5082 chars)
  Now I have a clear picture. Let me verify what `format(Number.MAX_VALUE)` actually
  produces, then set up the branch and write the failing test:
[flue] tool:start  bash  $ node -e "const out = ms(Number.MAX_VALUE); ..."
[flue] tool:done   bash  (218 chars)
  I've confirmed the bug: format(Number.MAX_VALUE) produces "5.696545792019405e+297y"
  which the regex in parse() cannot match. Setting up the branch and writing the test:
[flue] tool:start  bash  $ git checkout -b flue/fix-issue-284
[flue] tool:done   bash
[flue] tool:start  edit  /home/daytona/project/src/parse.test.ts
[flue] tool:done   edit
[flue] tool:start  bash  $ pnpm test:nodejs
[flue] tool:done   bash  (2102 chars)
  The tests fail as expected. Now let me implement the fix:
[flue] tool:start  edit  /home/daytona/project/src/index.ts
[flue] tool:done   edit
[flue] tool:start  bash  $ pnpm test
[flue] tool:done   bash  (1319 chars)
  All 172 tests pass (167 pre-existing + 5 new). Now let's commit and push:
[flue] tool:start  bash  $ git add src/index.ts src/parse.test.ts && git commit -m "fix: ..."
[flue] tool:done   bash
[flue] tool:start  bash  $ git push origin flue/fix-issue-284
[flue] tool:done   bash
[flue] tool:start  bash  $ gh pr create --repo your-username/your-fork --base main \
                          --head flue/fix-issue-284 --title "fix: ..." --body "Closes #284 ..."
[flue] tool:done   bash
[bug-fix] PR opened: https://github.com/your-username/your-fork/pull/1
[bug-fix] branch: flue/fix-issue-284
[bug-fix] files changed: src/index.ts, src/parse.test.ts
[bug-fix] tearing down agents + sandbox...
```

Reading top to bottom you can see the agent following our SKILL.md phases: it understands the project (multiple parallel `read` calls), confirms the bug interactively before changing anything (the `node -e` reproduction in `bash`), creates a branch + writes the failing test, runs the suite to confirm the test fails, makes the fix, reruns the suite, then commits, pushes, and opens the PR. The final `[bug-fix] tearing down agents + sandbox...` line is the orchestrator's `try/finally` doing its work — both agents get destroyed and the Daytona sandbox is deleted before the response is returned.

Note that exact wording, file paths, test counts, and PR numbers vary between runs (the LLM is non-deterministic, and the PR number depends on how many PRs your fork already has). The shape — sandbox provision → setup → four-phase TDD workflow → PR URL → cleanup — is what's deterministic.

The four phases below zoom in on each step.

#### Phase 1: Understand

The agent reads `package.json` to identify the test runner (Jest, run via `pnpm test`), then reads the single-file source at `src/index.ts` (244 lines) and an existing parse test (`src/parse.test.ts`) to learn the project's assertion style.

The relevant code is the `parse()` regex around line 77 of `src/index.ts`:

```typescript
const match = /^(?<value>-?\d*\.?\d+) *(?<unit>...)?$/i.exec(str);
```

The `value` group only matches plain decimal numbers; it doesn't accept scientific notation (`e+297`). That's the bug.
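
You can reproduce the mismatch mechanically with a simplified version of that regex. Here `[a-z]+` stands in for the real source's long unit-name alternation, and the "fixed" pattern applies the exponent extension described in the result summary:

```typescript
// Simplified stand-ins for the parse() regex before and after the fix.
// The real unit group is an alternation of unit names; [a-z]+ approximates it.
const broken = /^(?<value>-?\d*\.?\d+) *(?<unit>[a-z]+)?$/i
const fixed = /^(?<value>-?\d*\.?\d+(?:e[+-]?\d+)?) *(?<unit>[a-z]+)?$/i

const input = '5.696545792019405e+297y' // what format(Number.MAX_VALUE) emits

console.log(broken.exec(input))         // null, so parse() falls through to NaN
console.log(fixed.exec(input)?.groups)  // { value: '5.696545792019405e+297', unit: 'y' }
```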

#### Phase 2: Reproduce

The agent creates a new branch and writes a single, focused test that asserts the **roundtrip property** the reporter described:

```typescript
import { ms } from './index';

describe('issue #284: roundtrip with very large numbers', () => {
  it('format() output is always parseable back to a number', () => {
    const out = ms(Number.MAX_VALUE);
    expect(typeof out).toBe('string');
    expect(ms(out)).not.toBeNaN();
  });
});
```

It runs `pnpm test` and confirms the failures with the current implementation. From a real run:

```
FAIL  src/parse.test.ts
  ● parse(scientific notation) › should parse scientific notation values with a unit (roundtrip with format)
    Expected: false
    Received: true   (Number.isNaN was true — parse returned NaN)

  ● parse(scientific notation) › should parse scientific notation with y unit
  ● parse(scientific notation) › should parse scientific notation with ms unit
  ● parse(scientific notation) › should parse scientific notation with s unit
  ● parse(scientific notation) › should parse negative scientific notation with a unit
```

If the tests had unexpectedly passed, the agent would refuse to continue: a test that doesn't fail isn't a reproduction.

#### Phase 3: Fix

With the bug reproduced, the agent makes a minimal, surgical change to `src/index.ts` so `parse()` accepts scientific notation in the numeric value group, then reruns the full suite:

```
PASS  src/parse.test.ts
PASS  src/index.test.ts
PASS  src/format.test.ts
PASS  src/parse-strict.test.ts

Test Suites: 4 passed, 4 total
Tests:       172 passed, 172 total
```

Both the new tests and every pre-existing test pass. The fix is a small regex extension (one optional capture group): exactly the kind of minimal change the `test-driven-developer` role rewards.

#### Phase 4: Pull Request

The agent commits, pushes, and opens a PR against your fork:

```bash
$ git commit -m "fix: support scientific notation in parse() to fix roundtrip with large numbers (#284)"
$ git push origin flue/fix-issue-284
$ gh pr create --repo your-username/your-fork \
    --base main \
    --head flue/fix-issue-284 \
    --title "fix: support scientific notation in parse() to fix roundtrip with large numbers (#284)" \
    --body "Closes vercel#284 ..."
```

The PR body the agent generated includes the failing-test output from Phase 2, a one-paragraph root-cause analysis, and the passing-test output from Phase 3 — everything a human reviewer needs to merge in under five minutes.

The HTTP response body returned to your `curl` wraps the handler's return value under a `result` key:

```json
{
  "result": {
    "branch": "flue/fix-issue-284",
    "prUrl": "https://github.com/your-username/your-fork/pull/1",
    "testFile": "src/parse.test.ts",
    "filesChanged": ["src/index.ts", "src/parse.test.ts"],
    "summary": "The parse() regex only matched plain decimal numbers in the value group (`-?\\d*\\.?\\d+`), so when format() produced scientific notation (e.g. `5.696545792019405e+297y`) via JavaScript's default number serialisation for very large Math.round() results, parse() returned NaN; the fix extends the value capture group with an optional exponent part (`(?:e[+-]?\\d+)?`) so scientific notation is accepted transparently."
  }
}
```

Open the PR URL in your browser to review the diff and merge, exactly as you would for a human-authored contribution.

:::tip[Commit attribution]
Both the commit and the PR are authored under the GitHub account that owns your `GITHUB_TOKEN`. The agent calls `gh api user` at startup to resolve your login + numeric ID, then sets `git config user.email` to the GitHub-recommended `<id>+<login>@users.noreply.github.com` noreply format. GitHub recognizes that email and attaches the commit to your profile (avatar and all). The PR body still has a small `Generated by a Flue + Daytona bug-fix agent.` footer for transparency — if you'd rather drop it (some maintainers prefer not to have third-party tags on contributions), edit the PR-body template at the bottom of `.agents/skills/bug-fix/SKILL.md` and remove that line.
:::
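
The noreply address itself is simple string assembly (illustrative placeholders below; in the agent, `login` and the numeric `id` come from `gh api user`, which requires an authenticated token):

```shell
# Illustrative stand-ins: the agent resolves these via `gh api user`
# (e.g. --jq '.login' and --jq '.id'); placeholders are used here.
login="${GH_LOGIN:-your-username}"
id="${GH_ID:-12345}"

# GitHub's recognized noreply format: <numeric-id>+<login>@users.noreply.github.com
email="${id}+${login}@users.noreply.github.com"
echo "git config user.email \"$email\""
```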

### 5. Cleanup

Flue does **not** auto-destroy sessions when a handler returns — sessions persist for resumability via the same `<id>`, and the `cleanup: true` callback registered on our connector only fires when `agent.destroy()` is explicitly called. The orchestrator therefore wraps the entire two-agent flow in a `try { ... } finally { ... }` block:

```ts
try {
  setupAgent = await init({ sandbox: daytona(sandbox, { cleanup: true }), ... })
  // ... setup work + projectAgent + skill invocation
  return result
} finally {
  console.log('[bug-fix] tearing down agents + sandbox...')
  if (projectAgent) {
    try { await projectAgent.destroy() } catch (err) { console.error(err) }
  }
  if (setupAgent) {
    try { await setupAgent.destroy() } catch (err) { console.error(err) }
  } else {
    // setupAgent never armed cleanup: true; delete sandbox directly
    try { await sandbox.delete() } catch (err) { console.error(err) }
  }
}
```

Order matters: the **project** agent is destroyed first (closes its session, no sandbox impact since it doesn't have `cleanup: true`), then the **setup** agent's destroy fires the registered `sandbox.delete()` callback. The fallback `else` branch handles the case where `init()` itself threw before `setupAgent` was created — in that scenario nothing armed `cleanup: true`, so the orchestrator calls `sandbox.delete()` directly on the already-created sandbox. (If `client.create()` itself fails earlier, no sandbox object exists at all, so there's nothing to leak.)

This covers the common failure paths: successful runs, caught LLM errors, and exceptions thrown during the workflow. Cleanup is best-effort — if `destroy()` or `sandbox.delete()` itself throws (transient API error, network drop), the failure is logged and the deletion is not retried. Confirm in your [Daytona Dashboard](https://app.daytona.io/dashboard) after each run, and clean up any orphans manually if you spot them.

**Key Advantages**

- **TDD by construction**: the skill's phase ordering forces a failing test before any fix lands, so every PR is reproducible.
- **Real pull requests**: the agent uses `gh` inside the sandbox to push and open PRs you can merge in the GitHub UI, not just patches you copy by hand.
- **Skill-first design**: tweak the workflow by editing markdown. No recompile, no redeploy.
- **Structured outputs**: results are validated by Valibot schemas, so downstream automation never has to parse free-form text.
- **Sandbox-isolated execution**: cloning, dependency installation, and test runs all happen inside Daytona, so your host stays clean even if the target repo is malicious or its dependencies are.