
If you're building AI tools or autonomous agents with large language models (LLMs), generating code is only half the job. At some point, that code needs to run - often automatically and at scale. But running LLM-generated code in production environments comes with serious risks around security, reliability, and control. That’s exactly the problem Daytona is built to solve.

In this post, I'll walk through a minimal proof-of-concept that shows how to safely generate and run Python code using LangChain, OpenAI, and the Daytona SDK. The whole workflow happens inside a secure, isolated sandbox environment, which is a major step toward making AI-assisted development safer and more reproducible.

Why Daytona?

LLMs are powerful but unpredictable. If you're using them to write code, you can't always guarantee what they'll return. That unpredictability becomes a risk when the code runs in production or on shared infrastructure.

Daytona solves this by providing isolated, programmatically controlled sandbox environments that can be easily created and destroyed. These sandboxes are stateful and can be long-running, making them ideal for agents whose tasks require maintaining state over time. For example, you can spin up a sandbox, write code into it, run the code, read the results, and, when you're done, tear everything down without leaving traces or putting the host system at risk.
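
That lifecycle maps directly onto a few SDK calls. Here's a condensed sketch of it, assuming the `daytona` package from the requirements below and an API key supplied via the environment; the rest of this post breaks each step down:

```python
from daytona import Daytona

daytona_client = Daytona()         # picks up DAYTONA_API_KEY from the environment
sandbox = daytona_client.create()  # spin up an isolated sandbox
try:
    # Run a snippet inside the sandbox and read the result back
    result = sandbox.process.code_run("print('hello from the sandbox')")
    print(result.result)
finally:
    sandbox.delete()               # tear everything down when you're done
```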

What This Demo Covers

This proof-of-concept covers:

  1. Creating a Daytona sandbox.

  2. Using LangChain to generate Python code from a prompt.

  3. Executing that code securely in the sandbox.

  4. Performing basic file operations inside the sandbox (write, read, delete).

  5. Cleaning up the sandbox afterward.

Prerequisites

To run the demo, you'll need:

  • Python 3.9+

  • A Daytona API key

  • An OpenAI API key (from the OpenAI platform)

For complete setup instructions, see the Daytona Configuration Guide.

Set your keys with a .env file:

```
DAYTONA_API_KEY=your_daytona_key_here
OPENAI_API_KEY=your_openai_key_here
```

Create a requirements.txt file:

```
daytona>=0.0.1
langchain>=0.1.9
langchain_openai>=0.1.0
openai>=1.0.0
python-dotenv>=1.0.0
```

Then set up your environment:

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

Core Workflow

Here’s the core idea:

1. Write a prompt describing the feature to implement

```python
prompt = """
Write a Python function called `solve(n: int)` that returns the factorial of `n`.
Include a __main__ block that reads n from a command-line argument and prints the result.
Return raw Python code only, do not wrap it in markdown code blocks or backticks.
"""
```
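
For reference, a first response to this prompt often looks something like the snippet below. This is illustrative only; actual model output varies from run to run and, as the TDD section later shows, typically lacks input validation:

```python
import sys

def solve(n: int) -> int:
    """Return the factorial of n."""
    if n == 0:
        return 1
    return n * solve(n - 1)

if __name__ == "__main__":
    # n arrives as the first command-line argument
    print(solve(int(sys.argv[1])))
```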

2. Generate the code with LangChain + OpenAI

```python
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

# For Python 3.9 and LangChain < 1.0:
from langchain.schema import HumanMessage

# For LangChain >= 1.0 (requires Python 3.10+):
# from langchain.messages import HumanMessage

load_dotenv()

llm = ChatOpenAI()
response = llm.invoke([HumanMessage(content=prompt)])
generated_code = response.content
```
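
Even with the "raw Python only" instruction, models occasionally wrap their answer in markdown fences anyway. A small defensive guard keeps the later steps from choking on backticks; `strip_code_fences` is a hypothetical helper, not part of the Daytona or LangChain APIs:

```python
def strip_code_fences(text: str) -> str:
    """Remove markdown code fences if the model wrapped its answer in them."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]   # drop the opening fence (and any language tag)
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]  # drop the closing fence
    return "\n".join(lines)

generated_code = strip_code_fences(generated_code)
```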

3. Create the sandbox

```python
from daytona import Daytona, CodeRunParams

daytona_client = Daytona()
sandbox = daytona_client.create()
```

Learn more: Sandbox Management | Resource Configuration

4. Execute the code

```python
output = sandbox.process.code_run(generated_code, params=CodeRunParams(argv=["5"]))
```

For comprehensive patterns: Process and Code Execution
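
The object returned by `code_run` carries the run's outcome. A minimal check, assuming it exposes the same `exit_code` and `result` fields that `exec` does in the testing section below:

```python
if output.exit_code != 0:
    print(f"Execution failed:\n{output.result}")
else:
    print(f"=== Execution result ===\n{output.result}")
```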

5. Use the sandbox filesystem

```python
sandbox.fs.upload_file(b"Hello, Daytona!", "example.txt")
content = sandbox.fs.download_file("example.txt")
files = sandbox.fs.list_files("/home/daytona")
sandbox.fs.delete_file("example.txt")
```

6. Delete the sandbox

```python
sandbox.delete()
```

Sample console output:

```
=== Generated code ===
def solve(n): ...
=== Execution result ===
120
=== Filesystem demo ===
example.txt: Hello, Daytona!
Deleted sandbox. Bye!
```

Extending the Pattern: TDD with AI

In the second example, we add a layer of quality control: tests.

1. Prompt for a matching PyTest test suite

```python
test_prompt = """
Write a PyTest test suite that imports and tests the factorial function `solve(n: int)` from factorial.py.
Cover:
- Positive integers
- Zero
- Negative input (should raise ValueError)
- Non-integer input (should raise TypeError)
Return raw Python code only, do not wrap it in markdown code blocks or backticks.
"""
```
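
A suite that satisfies this prompt typically looks like the following (again illustrative; the exact assertions the model writes will vary):

```python
import pytest
from factorial import solve

def test_positive_integers():
    assert solve(5) == 120

def test_zero():
    assert solve(0) == 1

def test_negative_input():
    with pytest.raises(ValueError):
        solve(-3)

def test_non_integer_input():
    with pytest.raises(TypeError):
        solve(2.5)
```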

2. Generate the test with LangChain + OpenAI

```python
test_response = llm.invoke([HumanMessage(content=test_prompt)])
generated_tests = test_response.content
```

3. Upload both to the sandbox

```python
sandbox.fs.upload_file(generated_code.encode(), "factorial.py")
sandbox.fs.upload_file(generated_tests.encode(), "test_factorial.py")
```

4. Install and run PyTest inside the sandbox

```python
sandbox.process.exec("pip install pytest")
test_result = sandbox.process.exec("pytest test_factorial.py")
if test_result.exit_code != 0:
    print(f"Error running tests: {test_result.result}")
else:
    print(f"All tests passed successfully!\n{test_result.result}")
```

Sample console output:

```
=========================== short test summary info ============================
FAILED test_factorial.py::test_negative_input - RecursionError: maximum recur...
FAILED test_factorial.py::test_non_integer_input - RecursionError: maximum re...
========================= 2 failed, 2 passed in 0.30s ==========================
```

5. Self-Healing Code Generation

When tests fail (like in the output above where 2 tests failed), we can automatically regenerate the code with feedback from the test results. By setting a maximum number of retry attempts, we create a self-healing loop where the AI learns from its mistakes and iteratively improves the code until all tests pass:

```python
max_attempts = 20
attempt = 0

while attempt < max_attempts:
    attempt += 1
    print(f"\n--- Attempt {attempt} ---")

    sandbox.fs.upload_file(generated_code.encode(), "factorial.py")
    test_result = sandbox.process.exec("pytest test_factorial.py")

    if test_result.exit_code == 0:
        print(f"All tests passed successfully!\n{test_result.result}")
        break
    else:
        if "short test summary info" in test_result.result:
            error_summary = test_result.result.split("short test summary info")[-1]
        else:
            error_summary = test_result.result
        print(f"Tests failed:\n{error_summary}")

        response = llm.invoke(
            [
                HumanMessage(
                    content=f"{prompt}\n\nPrevious attempt failed with:\n{error_summary}"
                )
            ]
        )
        generated_code = response.content
else:
    # The while-loop's else runs only if we never hit break, i.e. no attempt passed
    print(f"\nFailed to generate passing code after {max_attempts} attempts.")
```

For the previous failing case, this mechanism successfully generates passing code on the fourth attempt:

```
--- Attempt 4 ---
All tests passed successfully!
============================= test session starts ==============================
platform linux -- Python 3.13.3, pytest-8.3.5, pluggy-1.6.0
rootdir: /home/daytona
plugins: anyio-4.9.0, langsmith-0.4.2
collected 4 items

test_factorial.py .... [100%]

============================== 4 passed in 0.01s ===============================
```

Why This Pattern Works

  • Safe by design: The sandbox ensures no AI-generated code can affect the real environment.

  • Testable: Adding TDD lets you validate AI output automatically.

  • Flexible: You can prompt for new functions, generate edge case tests, and reuse the pattern in pipelines.

Next Steps

  • Customize prompts for your domain (e.g. data parsing, calculations, config generation).

  • Explore Daytona sandbox options (CPU/memory limits, timeouts).

  • Extend to multi-file modules or async code (see the sketch below).
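
As a starting point for the multi-file case, here's a sketch that reuses only the calls already shown: upload several generated files, then run the tests against all of them. The file names and the `generated_helpers` variable are hypothetical; in practice each file would come from its own prompt:

```python
# Hypothetical layout: main module, a helper module, and the test suite
files = {
    "factorial.py": generated_code,
    "helpers.py": generated_helpers,
    "test_all.py": generated_tests,
}

for path, source in files.items():
    sandbox.fs.upload_file(source.encode(), path)

result = sandbox.process.exec("pytest test_all.py")
print("All tests passed!" if result.exit_code == 0 else f"Tests failed:\n{result.result}")
```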

Final Thoughts

At Devōt, we often build tools that involve dynamic code execution, whether for internal platforms or client-facing products. Daytona gave us a way to experiment safely, with clear boundaries and control.

To be honest, for teams like ours, working at the intersection of AI and engineering, that kind of isolation isn't just helpful; it's necessary.


Ready to build your own AI coding assistant? Check out the complete documentation.

Tags:
  • LLM
  • LangChain
  • OpenAI
  • Daytona
  • Sandbox
  • AI Agents
  • Python
  • Code Execution
  • Security
  • AI Safety
  • TDD