Fix Bugs Automatically With AG2 and Daytona

This guide demonstrates how to use DaytonaCodeExecutor for AG2 to build a multi-agent system that automatically fixes broken code in a secure sandbox environment. The executor enables agents to run Python, JavaScript, TypeScript, and Bash code within isolated Daytona sandboxes, so LLM-generated code never runs on your local machine.

In this example, we build a bug fixer that takes broken code as input, analyzes the bug, proposes a fix, and verifies it by actually executing the code in a Daytona sandbox. If the fix fails, the agent sees the error output and retries with a different approach, continuing until the code passes or the turn limit (max_turns=8 in this example) is reached.

1. Workflow Overview

You provide broken code. The bug_fixer agent (LLM) analyzes it and proposes a fix wrapped in a fenced code block. The code_executor agent extracts the code block and runs it in a Daytona sandbox. If execution fails, the bug fixer sees the full error output and tries again. Once the code passes, the agents terminate and the sandbox is automatically deleted.

The key benefit: every fix attempt is verified by actually running the code — not just reviewed by the LLM.
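
The loop described above can be sketched without any agent framework. This is a minimal illustration, not AG2's implementation: propose_fix and run_code are hypothetical callables standing in for the LLM agent and the sandbox, and the toy runner below only inspects the text instead of executing it.

```python
from typing import Callable, Optional, Tuple

# Hypothetical stand-ins: `propose_fix` plays the LLM agent and `run_code`
# plays the sandbox. Neither is an AG2 or Daytona API.
ProposeFix = Callable[[str, Optional[str]], str]
RunCode = Callable[[str], Tuple[int, str]]


def fix_until_passing(
    broken_code: str,
    propose_fix: ProposeFix,
    run_code: RunCode,
    max_attempts: int = 4,
) -> Optional[str]:
    """Return the first candidate that exits with code 0, or None."""
    last_output: Optional[str] = None
    for _ in range(max_attempts):
        candidate = propose_fix(broken_code, last_output)
        exit_code, output = run_code(candidate)
        if exit_code == 0:
            return candidate
        last_output = output  # feed the failure back into the next attempt
    return None


def demo_run(code: str) -> Tuple[int, str]:
    """Toy runner: 'passes' only when the operands are in the right order."""
    if "a - b" in code:
        return 0, "All tests passed!"
    return 1, "AssertionError: Got -7"


def demo_propose(broken: str, last_output: Optional[str]) -> str:
    """Toy proposer: the first attempt is still wrong, the retry is right."""
    if last_output is None:
        return "stack.append(b - a)"
    return "stack.append(a - b)"


fixed = fix_until_passing("stack.append(b - a)", demo_propose, demo_run)
print(fixed)  # stack.append(a - b)
```

The point of the sketch is the feedback edge: the output of a failed run becomes input to the next proposal, and success is defined by the exit code rather than by the proposer's own judgment.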

2. Project Setup

Clone the Repository

Clone the Daytona guides repository and navigate to the example directory:

git clone https://github.com/daytona/guides
cd guides/python/ag2/bug-fixer-agent/openai

Install Dependencies

Install the required packages for this example:

pip install "ag2[daytona,openai]" python-dotenv

The packages include:

ag2[daytona,openai]: AG2 with the Daytona code executor and OpenAI model support
python-dotenv: Loads environment variables from a .env file

Configure Environment

Get your API keys and configure your environment:

Daytona API key: Get it from Daytona Dashboard
OpenAI API key: Get it from OpenAI Platform

Create a .env file in your project directory:

DAYTONA_API_KEY=dtn_***
OPENAI_API_KEY=sk-***
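
Before starting a session it can help to fail fast when a key is missing. The helper below is a small sketch and not part of the example repository; missing_keys is a hypothetical name, and in a real script you would pass os.environ after calling load_dotenv().

```python
from typing import Iterable, List, Mapping

REQUIRED_KEYS = ("DAYTONA_API_KEY", "OPENAI_API_KEY")


def missing_keys(
    env: Mapping[str, str], required: Iterable[str] = REQUIRED_KEYS
) -> List[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


# A plain dict is used here so the example runs without a real .env file.
example_env = {"DAYTONA_API_KEY": "dtn_***", "OPENAI_API_KEY": ""}
print(missing_keys(example_env))  # ['OPENAI_API_KEY']
```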

3. Understanding the Core Components

Before diving into the implementation, let’s understand the key components:

AG2 ConversableAgent

ConversableAgent is AG2’s general-purpose agent. Each agent can be configured as either an LLM agent (with a model and system prompt) or a non-LLM agent (llm_config=False) that responds through registered reply handlers — in our case, code execution via code_execution_config. The two agents communicate by passing messages back and forth until a termination condition is met.

DaytonaCodeExecutor

DaytonaCodeExecutor implements the AG2 CodeExecutor protocol. When used as a context manager, it creates a Daytona sandbox on entry and automatically deletes it on exit. It reuses the same sandbox across all code executions within the session, extracting and running fenced code blocks from agent messages. The language is inferred from the code block tag (```python, ```javascript, ```typescript).
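
To make the "extract fenced blocks and infer the language from the tag" step concrete, here is a simplified sketch using a regular expression. It is an illustration only: AG2 ships its own extractor, and extract_blocks is a hypothetical helper that does not handle every edge case the real one does.

```python
import re
from typing import List, Tuple

FENCE = "`" * 3  # built programmatically so the example nests cleanly in docs
BLOCK_RE = re.compile(FENCE + r"(\w*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_blocks(message: str) -> List[Tuple[str, str]]:
    """Return (language, code) pairs for each fenced block in a message."""
    return [
        (lang.lower(), code.rstrip("\n"))
        for lang, code in BLOCK_RE.findall(message)
    ]


reply = (
    "Here is the fix:\n"
    + FENCE + "python\n"
    + "print(10 - 3)\n"
    + FENCE + "\n"
)
print(extract_blocks(reply))  # [('python', 'print(10 - 3)')]
```

A message with no fenced block yields an empty list, which is why the system prompt insists that every fix is wrapped in a fence with a language tag.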

4. Implementation

Step 1: Imports and environment

import os

from autogen import ConversableAgent, LLMConfig
from autogen.coding import DaytonaCodeExecutor
from dotenv import load_dotenv

load_dotenv()

Step 2: Bug fixer system prompt

The system prompt drives the iterative fix loop. It tells the agent which languages are supported, instructs it to wrap fixes in fenced code blocks, and separates the fix message from the TERMINATE signal so the executor always runs the code before the session ends:

BUG_FIXER_SYSTEM_MESSAGE = """You are an expert bug fixer. You support Python, JavaScript, and TypeScript.
If asked to fix code in any other language, refuse and explain which languages are supported.

When given broken code:

1. Analyze the bug carefully and identify the root cause
2. Write the complete fixed code in a fenced code block using the correct language tag
3. Always include assertions or print statements at the end to verify the fix works
4. If your previous fix didn't work, analyze the error output and try a different approach
5. Once the code runs successfully, reply with just the word TERMINATE — never in the same message as a code block

Always wrap your code in fenced code blocks (```python, ```javascript, or ```typescript). Never explain without providing fixed code.
Never include TERMINATE in a message that contains a code block.
"""

Step 3: Create the agents

def fix_bug(broken_code: str, error_description: str = "") -> None:
    llm_config = LLMConfig(
        {
            "model": "gpt-4o-mini",
            "api_key": os.environ["OPENAI_API_KEY"],
        }
    )

    with DaytonaCodeExecutor(timeout=60) as executor:
        bug_fixer = ConversableAgent(
            name="bug_fixer",
            system_message=BUG_FIXER_SYSTEM_MESSAGE,
            llm_config=llm_config,
            code_execution_config=False,
            is_termination_msg=lambda x: (
                "TERMINATE" in (x.get("content") or "") or not (x.get("content") or "").strip()
            ),
        )

        code_executor = ConversableAgent(
            name="code_executor",
            llm_config=False,
            code_execution_config={"executor": executor},
        )

DaytonaCodeExecutor is used as a context manager so the sandbox is automatically cleaned up when fix_bug returns. bug_fixer owns the LLM reasoning; code_executor owns sandbox execution and never calls the LLM itself (llm_config=False).
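
The cleanup guarantee comes from Python's context manager protocol rather than from anything agent-specific. The sketch below uses a hypothetical FakeSandboxExecutor (not the Daytona class) to show that the exit hook runs even when the body raises.

```python
from typing import List


class FakeSandboxExecutor:
    """Hypothetical stand-in that mimics create-on-enter, delete-on-exit."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def __enter__(self) -> "FakeSandboxExecutor":
        self.events.append("created")
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.events.append("deleted")
        return False  # do not swallow exceptions from the body


executor = FakeSandboxExecutor()
try:
    with executor:
        executor.events.append("running")
        raise RuntimeError("fix attempt crashed")
except RuntimeError:
    pass

print(executor.events)  # ['created', 'running', 'deleted']
```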

The optional error_description parameter passes additional context about the failure, for example a stack trace, a known symptom, or a hint about the cause. In the examples below we leave it empty: the broken code is not executed before the first fix, and the agent is able to identify the bugs from the code and its embedded assertions alone.
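
The is_termination_msg lambda passed to bug_fixer is compact, so it may help to see the same logic as a named function with a few sample messages. The function body mirrors the lambda from Step 3; the sample messages are made up for illustration.

```python
from typing import Any, Dict


def is_termination_msg(message: Dict[str, Any]) -> bool:
    """Same logic as the lambda in Step 3, written out for readability."""
    content = message.get("content") or ""
    return "TERMINATE" in content or not content.strip()


samples = [
    {"content": "TERMINATE"},
    {"content": "exitcode: 0 (execution succeeded)\nCode output: ok"},
    {"content": None},
    {"content": "   "},
]
print([is_termination_msg(m) for m in samples])  # [True, False, True, True]
```

The `or ""` guard matters because content can be None; without it the `in` check would raise a TypeError. Blank or missing content also ends the chat, so the conversation does not keep going on empty replies.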

Step 4: Start the conversation

        message = f"Fix this broken code:\n\n\n{broken_code}\n"
        if error_description:
            message += f"\n\nError: {error_description}"

        code_executor.run(
            recipient=bug_fixer,
            message=message,
            max_turns=8,
        ).process()

code_executor initiates the chat because it owns the problem — the broken code. bug_fixer receives it as its first message, proposes a fix, and waits for execution results.

Assign the return value of run() before calling process() to access more details about the session:

response = code_executor.run(recipient=bug_fixer, message=message, max_turns=8)
response.process()

response.messages  # full message exchange between agents
response.cost      # token usage and cost breakdown per model
response.summary   # conversation summary (requires summary_method to be set)
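
As one possible use of the message history, the sketch below counts how many fix attempts the bug fixer made. It assumes each message is a dict with "name" and "content" keys; that shape is an assumption made for illustration, so check the actual structure of response.messages in your AG2 version. count_fix_attempts is a hypothetical helper.

```python
from typing import Dict, List, Optional

FENCE = "`" * 3


def count_fix_attempts(
    messages: List[Dict[str, Optional[str]]], fixer_name: str = "bug_fixer"
) -> int:
    """Count messages from the fixer that contain a fenced code block."""
    return sum(
        1
        for m in messages
        if m.get("name") == fixer_name and FENCE in (m.get("content") or "")
    )


# Hand-written sample history (assumed shape, not captured from a real run).
history = [
    {"name": "code_executor", "content": "Fix this broken code: ..."},
    {"name": "bug_fixer", "content": FENCE + "python\nprint(1)\n" + FENCE},
    {"name": "code_executor", "content": "exitcode: 0 (execution succeeded)\nCode output: 1"},
    {"name": "bug_fixer", "content": "TERMINATE"},
]
print(count_fix_attempts(history))  # 1
```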

5. Running the Example

The complete example ships with three broken code snippets, one per language:

Example 1 — Python: postfix evaluator with swapped operands

The subtraction and division operators pop two values from the stack but apply them in reverse order, producing wrong results for non-commutative operations.

elif token == '-':
    stack.append(b - a)   # Bug: reversed — should be a - b
elif token == '/':
    stack.append(b // a)  # Bug: reversed — should be a // b
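
A short worked example shows why the operand order matters. In postfix notation the right operand is pushed last, so it is popped first:

```python
# Postfix "10 3 -" pushes 10, then 3, so the stack is [10, 3].
stack = [10, 3]
b = stack.pop()  # right operand, popped first  -> 3
a = stack.pop()  # left operand, popped second  -> 10

correct_sub = a - b  # 7
buggy_sub = b - a    # -7

# Same pattern for "12 4 /" with floor division.
a_div, b_div = 12, 4
correct_div = a_div // b_div  # 3
buggy_div = b_div // a_div    # 0

print(correct_sub, buggy_sub, correct_div, buggy_div)  # 7 -7 3 0
```

Addition and multiplication are commutative, which is why only the subtraction and division assertions in the example fail.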

Example 2 — JavaScript: wrong concatenation order in run-length encoder

The character and count are concatenated in the wrong order in two places, producing "a2b3c2" instead of "2a3b2c".

result += str[i - 1] + count;          // Bug: should be count + str[i - 1]
result += str[str.length - 1] + count; // Bug: should be count + str[str.length - 1]

Example 3 — TypeScript: Math.min instead of Math.max in Kadane’s algorithm

Both calls use Math.min instead of Math.max, causing the algorithm to track the most negative subarray sum instead of the most positive.

currentSum = Math.min(currentSum + nums[i], nums[i]);  // Bug: should be Math.max
maxSum = Math.min(maxSum, currentSum);                  // Bug: should be Math.max
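
To see the effect of the swap, here is a Python port of the same scan with the selection function passed in as a parameter. The port and the name scan_subarrays are for illustration only; the example in the repository is TypeScript.

```python
from typing import Callable, List


def scan_subarrays(nums: List[int], pick: Callable[[int, int], int]) -> int:
    """Kadane-style scan; `pick` decides which subarray sum is tracked."""
    best = current = nums[0]
    for value in nums[1:]:
        current = pick(current + value, value)
        best = pick(best, current)
    return best


nums = [-2, 1, -3, 4, -1, 2, 1, -5, 4]
print(scan_subarrays(nums, max))  # 6  (subarray [4, -1, 2, 1])
print(scan_subarrays(nums, min))  # -5 (subarray [-5])
```

With min the scan still runs without errors, so the bug only surfaces through the assertions, which is why the fix has to be verified by execution.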

Run all examples:

python main.py

Expected output

The following shows the full agent conversation for Example 1 (Python postfix evaluator):

============================================================
Example 1: Python — Postfix Expression Evaluator Bug
============================================================
code_executor (to bug_fixer):

Fix this broken code:

def eval_postfix(expression):
    stack = []
    for token in expression.split():
        if token.lstrip('-').isdigit():
            stack.append(int(token))
        else:
            b = stack.pop()
            a = stack.pop()
            if token == '+':
                stack.append(a + b)
            elif token == '-':
                stack.append(b - a)
            elif token == '*':
                stack.append(a * b)
            elif token == '/':
                stack.append(b // a)
    return stack[0]

assert eval_postfix("3 4 +") == 7
assert eval_postfix("10 3 -") == 7, f"Got {eval_postfix('10 3 -')}"
assert eval_postfix("12 4 /") == 3, f"Got {eval_postfix('12 4 /')}"
assert eval_postfix("2 3 4 * +") == 14
print("All postfix tests passed!")

--------------------------------------------------------------------------------

>>>>>>>> USING AUTO REPLY...
bug_fixer (to code_executor):

```python
def eval_postfix(expression):
    stack = []
    for token in expression.split():
        if token.lstrip('-').isdigit():
            stack.append(int(token))
        else:
            b = stack.pop()
            a = stack.pop()
            if token == '+':
                stack.append(a + b)
            elif token == '-':
                stack.append(a - b)  # Fixed order of operands for subtraction
            elif token == '*':
                stack.append(a * b)
            elif token == '/':
                stack.append(a // b)  # Fixed order of operands for division
    return stack[0]

assert eval_postfix("3 4 +") == 7
assert eval_postfix("10 3 -") == 7, f"Got {eval_postfix('10 3 -')}"
assert eval_postfix("12 4 /") == 3, f"Got {eval_postfix('12 4 /')}"
assert eval_postfix("2 3 4 * +") == 14
print("All postfix tests passed!")
```

--------------------------------------------------------------------------------

>>>>>>>> USING AUTO REPLY...

>>>>>>>> EXECUTING CODE BLOCK (inferred language is python)...
code_executor (to bug_fixer):

exitcode: 0 (execution succeeded)
Code output: All postfix tests passed!

--------------------------------------------------------------------------------

>>>>>>>> USING AUTO REPLY...
bug_fixer (to code_executor):

TERMINATE

The agent correctly identified both reversed-operand bugs from the code and its assertions alone (the broken code itself is never executed, so there is no failure output at that point) and resolved them in a single attempt, adding its own # Fixed order of operands comments to the corrected lines.

How the message loop works

recipient=bug_fixer in run() is what connects the two agents. AG2 sets up a managed back-and-forth loop between them — after each reply, the message is automatically forwarded to the other agent. The agents have no direct reference to each other outside of that call.

Tracing the session above step by step:

code_executor.run(recipient=bug_fixer, ...) — AG2 starts the loop and code_executor sends the broken code as plain text to bug_fixer. Nothing is executed yet.
bug_fixer (LLM) analyzes the code and replies with the fix wrapped in a ```python block.
AG2 calls _generate_code_execution_reply_using_executor on code_executor — a reply method registered automatically when code_execution_config is set. It scans bug_fixer’s last message for fenced code blocks, extracts the block, and calls DaytonaCodeExecutor.execute_code_blocks().
Daytona runs the code in the sandbox and returns the exit code and output.
AG2 forwards the result (exitcode: 0 (execution succeeded)\nCode output: All postfix tests passed!) back to bug_fixer as code_executor’s reply.
bug_fixer sees the successful output and replies with TERMINATE.
AG2 checks is_termination_msg on the incoming message. It returns True, so the conversation stops, and the sandbox is deleted when the with block exits.

Note that the original broken code is never executed — only bug_fixer’s proposed fix goes into Daytona.
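
If you want to post-process the executor's replies yourself, the reply text shown in the trace can be parsed with a small helper. This is a sketch based on the format visible in the output above; parse_result is a hypothetical function, and the "(execution failed)" wording in the second sample is an assumption about how a non-zero exit is reported.

```python
import re
from typing import Tuple

RESULT_RE = re.compile(r"exitcode:\s*(-?\d+).*?\nCode output:\s?(.*)", re.DOTALL)


def parse_result(reply: str) -> Tuple[int, str]:
    """Split an execution reply into (exit_code, output)."""
    match = RESULT_RE.match(reply)
    if match is None:
        raise ValueError("unrecognized execution reply")
    return int(match.group(1)), match.group(2).strip()


ok = "exitcode: 0 (execution succeeded)\nCode output: All postfix tests passed!"
failed = "exitcode: 1 (execution failed)\nCode output: AssertionError: Got -7"

print(parse_result(ok))      # (0, 'All postfix tests passed!')
print(parse_result(failed))  # (1, 'AssertionError: Got -7')
```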

6. Complete Code

import os

from autogen import ConversableAgent, LLMConfig
from autogen.coding import DaytonaCodeExecutor
from dotenv import load_dotenv

load_dotenv()

BUG_FIXER_SYSTEM_MESSAGE = """You are an expert bug fixer. You support Python, JavaScript, and TypeScript.
If asked to fix code in any other language, refuse and explain which languages are supported.

When given broken code:

1. Analyze the bug carefully and identify the root cause
2. Write the complete fixed code in a fenced code block using the correct language tag
3. Always include assertions or print statements at the end to verify the fix works
4. If your previous fix didn't work, analyze the error output and try a different approach
5. Once the code runs successfully, reply with just the word TERMINATE — never in the same message as a code block

Always wrap your code in fenced code blocks (```python, ```javascript, or ```typescript). Never explain without providing fixed code.
Never include TERMINATE in a message that contains a code block.
"""


def fix_bug(broken_code: str, error_description: str = "") -> None:
    llm_config = LLMConfig(
        {
            "model": "gpt-4o-mini",
            "api_key": os.environ["OPENAI_API_KEY"],
        }
    )

    with DaytonaCodeExecutor(timeout=60) as executor:
        bug_fixer = ConversableAgent(
            name="bug_fixer",
            system_message=BUG_FIXER_SYSTEM_MESSAGE,
            llm_config=llm_config,
            code_execution_config=False,
            is_termination_msg=lambda x: (
                "TERMINATE" in (x.get("content") or "") or not (x.get("content") or "").strip()
            ),
        )

        code_executor = ConversableAgent(
            name="code_executor",
            llm_config=False,
            code_execution_config={"executor": executor},
        )

        message = f"Fix this broken code:\n\n\n{broken_code}\n"
        if error_description:
            message += f"\n\nError: {error_description}"

        code_executor.run(
            recipient=bug_fixer,
            message=message,
            max_turns=8,
        ).process()


if __name__ == "__main__":
    # Example 1: Python — swapped operands in postfix expression evaluator
    broken_postfix = """\
def eval_postfix(expression):
    stack = []
    for token in expression.split():
        if token.lstrip('-').isdigit():
            stack.append(int(token))
        else:
            b = stack.pop()
            a = stack.pop()
            if token == '+':
                stack.append(a + b)
            elif token == '-':
                stack.append(b - a)
            elif token == '*':
                stack.append(a * b)
            elif token == '/':
                stack.append(b // a)
    return stack[0]

assert eval_postfix("3 4 +") == 7
assert eval_postfix("10 3 -") == 7, f"Got {eval_postfix('10 3 -')}"
assert eval_postfix("12 4 /") == 3, f"Got {eval_postfix('12 4 /')}"
assert eval_postfix("2 3 4 * +") == 14
print("All postfix tests passed!")
"""

    print("=" * 60)
    print("Example 1: Python — Postfix Expression Evaluator Bug")
    print("=" * 60)
    fix_bug(broken_postfix, "")

    # Example 2: JavaScript — wrong concatenation order in run-length encoder
    broken_js = """\
function encode(str) {
    if (!str) return '';
    let result = '';
    let count = 1;
    for (let i = 1; i < str.length; i++) {
        if (str[i] === str[i - 1]) {
            count++;
        } else {
            result += str[i - 1] + count;
            count = 1;
        }
    }
    result += str[str.length - 1] + count;
    return result;
}

console.assert(encode("aabbbcc") === "2a3b2c", `Expected "2a3b2c", got "${encode("aabbbcc")}"`);
console.assert(encode("abcd") === "1a1b1c1d", `Expected "1a1b1c1d", got "${encode("abcd")}"`);
console.log("All encoding tests passed!");
"""

    print("\n" + "=" * 60)
    print("Example 2: JavaScript — Run-Length Encoder Bug")
    print("=" * 60)
    fix_bug(broken_js, "")

    # Example 3: TypeScript — Math.min instead of Math.max in Kadane's algorithm
    broken_ts = """\
function maxSubarray(nums: number[]): number {
    let maxSum = nums[0];
    let currentSum = nums[0];
    for (let i = 1; i < nums.length; i++) {
        currentSum = Math.min(currentSum + nums[i], nums[i]);
        maxSum = Math.min(maxSum, currentSum);
    }
    return maxSum;
}

console.assert(maxSubarray([-2, 1, -3, 4, -1, 2, 1, -5, 4]) === 6,
    `Expected 6, got ${maxSubarray([-2, 1, -3, 4, -1, 2, 1, -5, 4])}`);
console.assert(maxSubarray([1]) === 1,
    `Expected 1, got ${maxSubarray([1])}`);
console.assert(maxSubarray([5, 4, -1, 7, 8]) === 23,
    `Expected 23, got ${maxSubarray([5, 4, -1, 7, 8])}`);
console.log("All max subarray tests passed!");
"""

    print("\n" + "=" * 60)
    print("Example 3: TypeScript — Max Subarray Bug")
    print("=" * 60)
    fix_bug(broken_ts, "")

Key advantages of this approach:

Execution-verified fixes: Every proposed fix is actually run in a sandbox — the agent only terminates when the code passes, not just when it looks correct
Secure execution: Fix attempts run in isolated Daytona sandboxes, not on your machine
Multi-language support: Python, JavaScript, TypeScript, and Bash — language is inferred automatically from the LLM’s fenced code block
Iterative refinement: If a fix fails, the agent sees the full error output and retries automatically
Automatic cleanup: The sandbox is deleted as soon as fix_bug returns, regardless of outcome

7. API Reference

For the complete API reference of DaytonaCodeExecutor, including all configuration options and supported parameters, see the DaytonaCodeExecutor documentation.