
If you're building AI tools or autonomous agents with large language models (LLMs), generating code is only half the job. At some point, that code needs to run - often automatically and at scale. But running LLM-generated code in production environments comes with serious risks around security, reliability, and control. That’s exactly the problem Daytona is built to solve.

In this post, I'll walk through a minimal proof-of-concept that shows how to safely generate and run Python code using LangChain, OpenAI, and the Daytona SDK. The whole workflow happens inside a secure, isolated sandbox environment, which is a major step toward making AI-assisted development safer and more reproducible.

Why Daytona?

LLMs are powerful but unpredictable. If you're using them to write code, you can't always guarantee what they'll return. That unpredictability becomes a risk when the code runs in production or on shared infrastructure.

Daytona solves this by providing isolated, programmatically controlled sandbox environments that can be easily created and destroyed. These sandboxes are stateful and can be long-running, making them ideal for agents whose tasks require maintaining state over time. For example, you can spin up a sandbox, write code into it, run the code, read the results, and, when you're done, tear everything down without leaving traces or putting the host system at risk.
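
That lifecycle maps directly onto a few SDK calls. Here's a condensed sketch of it, assuming the `daytona` package from the requirements below and an API key supplied via the environment; the rest of this post breaks each step down:

```python
from daytona import Daytona

daytona_client = Daytona()         # picks up DAYTONA_API_KEY from the environment
sandbox = daytona_client.create()  # spin up an isolated sandbox
try:
    # Run a snippet inside the sandbox and read the result back
    result = sandbox.process.code_run("print('hello from the sandbox')")
    print(result.result)
finally:
    sandbox.delete()               # tear everything down when you're done
```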

What This Demo Covers

This proof-of-concept covers:

  1. Creating a Daytona sandbox.

  2. Using LangChain to generate Python code from a prompt.

  3. Executing that code securely in the sandbox.

  4. Performing basic file operations inside the sandbox (write, read, delete).

  5. Cleaning up the sandbox afterward.

Prerequisites

To run the demo, you'll need:

  • Python 3.9+

  • A Daytona API key

  • An OpenAI API key (from the OpenAI platform)

For complete setup instructions, see the Daytona Configuration Guide.

Set your keys with a .env file:

```
DAYTONA_API_KEY=your_daytona_key_here
OPENAI_API_KEY=your_openai_key_here
```

Create a requirements.txt file:

```
daytona>=0.0.1
langchain>=0.1.9
langchain_openai>=0.1.0
openai>=1.0.0
python-dotenv>=1.0.0
```

Then set up your environment:

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

Core Workflow

Here’s the core idea:

1. Write a prompt describing the feature to implement

```python
prompt = """
Write a Python function called `solve(n: int)` that returns the factorial of `n`.
Include a __main__ block that reads n from a command-line argument and prints the result.
Return raw Python code only, do not wrap it in markdown code blocks or backticks.
"""
```
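
For reference, a first response to this prompt often looks something like the snippet below. This is illustrative only; actual model output varies from run to run and, as the TDD section later shows, typically lacks input validation:

```python
import sys

def solve(n: int) -> int:
    """Return the factorial of n."""
    if n == 0:
        return 1
    return n * solve(n - 1)

if __name__ == "__main__":
    # n arrives as the first command-line argument
    print(solve(int(sys.argv[1])))
```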

2. Generate the code with LangChain + OpenAI

```python
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

# For Python 3.9 and LangChain < 1.0:
from langchain.schema import HumanMessage

# For LangChain >= 1.0 (requires Python 3.10+):
# from langchain.messages import HumanMessage

load_dotenv()

llm = ChatOpenAI()
response = llm.invoke([HumanMessage(content=prompt)])
generated_code = response.content
```
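
Even with the "raw Python only" instruction, models occasionally wrap their answer in markdown fences anyway. A small defensive guard keeps the later steps from choking on backticks; `strip_code_fences` is a hypothetical helper, not part of the Daytona or LangChain APIs:

```python
def strip_code_fences(text: str) -> str:
    """Remove markdown code fences if the model wrapped its answer in them."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]   # drop the opening fence (and any language tag)
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]  # drop the closing fence
    return "\n".join(lines)

generated_code = strip_code_fences(generated_code)
```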

3. Create the sandbox

```python
from daytona import Daytona, CodeRunParams

daytona_client = Daytona()
sandbox = daytona_client.create()
```

Learn more: Sandbox Management | Resource Configuration

4. Execute the code

```python
output = sandbox.process.code_run(generated_code, params=CodeRunParams(argv=["5"]))
```

For comprehensive patterns: Process and Code Execution
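
The object returned by `code_run` carries the run's outcome. A minimal check, assuming it exposes the same `exit_code` and `result` fields that `exec` does in the testing section below:

```python
if output.exit_code != 0:
    print(f"Execution failed:\n{output.result}")
else:
    print(f"=== Execution result ===\n{output.result}")
```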

5. Use the sandbox filesystem

```python
sandbox.fs.upload_file(b"Hello, Daytona!", "example.txt")
content = sandbox.fs.download_file("example.txt")
files = sandbox.fs.list_files("/home/daytona")
sandbox.fs.delete_file("example.txt")
```

6. Delete the sandbox

```python
sandbox.delete()
```

Sample console output:

```
=== Generated code ===
def solve(n): ...
=== Execution result ===
120
=== Filesystem demo ===
example.txt: Hello, Daytona!
Deleted sandbox. Bye!
```

Extending the Pattern: TDD with AI

In the second example, we add a layer of quality control: tests.

1. Prompt for a matching PyTest test suite

```python
test_prompt = """
Write a PyTest test suite that imports and tests the factorial function `solve(n: int)` from factorial.py.
Cover:
- Positive integers
- Zero
- Negative input (should raise ValueError)
- Non-integer input (should raise TypeError)
Return raw Python code only, do not wrap it in markdown code blocks or backticks.
"""
```
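
A suite that satisfies this prompt typically looks like the following (again illustrative; the exact assertions the model writes will vary):

```python
import pytest
from factorial import solve

def test_positive_integers():
    assert solve(5) == 120

def test_zero():
    assert solve(0) == 1

def test_negative_input():
    with pytest.raises(ValueError):
        solve(-3)

def test_non_integer_input():
    with pytest.raises(TypeError):
        solve(2.5)
```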

2. Generate the test with LangChain + OpenAI

```python
test_response = llm.invoke([HumanMessage(content=test_prompt)])
generated_tests = test_response.content
```

3. Upload both to the sandbox

```python
sandbox.fs.upload_file(generated_code.encode(), "factorial.py")
sandbox.fs.upload_file(generated_tests.encode(), "test_factorial.py")
```

4. Install and run PyTest inside the sandbox

```python
sandbox.process.exec("pip install pytest")
test_result = sandbox.process.exec("pytest test_factorial.py")
if test_result.exit_code != 0:
    print(f"Error running tests: {test_result.result}")
else:
    print(f"All tests passed successfully!\n{test_result.result}")
```

Sample console output:

```
=========================== short test summary info ============================
FAILED test_factorial.py::test_negative_input - RecursionError: maximum recur...
FAILED test_factorial.py::test_non_integer_input - RecursionError: maximum re...
========================= 2 failed, 2 passed in 0.30s ==========================
```

5. Self-Healing Code Generation

When tests fail (like in the output above where 2 tests failed), we can automatically regenerate the code with feedback from the test results. By setting a maximum number of retry attempts, we create a self-healing loop where the AI learns from its mistakes and iteratively improves the code until all tests pass:

```python
max_attempts = 20
attempt = 0

while attempt < max_attempts:
    attempt += 1
    print(f"\n--- Attempt {attempt} ---")

    sandbox.fs.upload_file(generated_code.encode(), "factorial.py")
    test_result = sandbox.process.exec("pytest test_factorial.py")

    if test_result.exit_code == 0:
        print(f"All tests passed successfully!\n{test_result.result}")
        break
    else:
        if "short test summary info" in test_result.result:
            error_summary = test_result.result.split("short test summary info")[-1]
        else:
            error_summary = test_result.result
        print(f"Tests failed:\n{error_summary}")

        response = llm.invoke(
            [
                HumanMessage(
                    content=f"{prompt}\n\nPrevious attempt failed with:\n{error_summary}"
                )
            ]
        )
        generated_code = response.content
else:
    # The while-loop's else runs only if we never hit break, i.e. no attempt passed
    print(f"\nFailed to generate passing code after {max_attempts} attempts.")
```

For the previous failing case, this mechanism successfully generates passing code on the fourth attempt:

```
--- Attempt 4 ---
All tests passed successfully!
============================= test session starts ==============================
platform linux -- Python 3.13.3, pytest-8.3.5, pluggy-1.6.0
rootdir: /home/daytona
plugins: anyio-4.9.0, langsmith-0.4.2
collected 4 items

test_factorial.py .... [100%]

============================== 4 passed in 0.01s ===============================
```

Why This Pattern Works

  • Safe by design: The sandbox ensures no AI-generated code can affect the real environment.

  • Testable: Adding TDD lets you validate AI output automatically.

  • Flexible: You can prompt for new functions, generate edge case tests, and reuse the pattern in pipelines.

Next Steps

  • Customize prompts for your domain (e.g. data parsing, calculations, config generation).

  • Explore Daytona sandbox options (CPU/memory limits, timeouts).

  • Extend to multi-file modules or async code (see the sketch below).
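
As a starting point for the multi-file case, here's a sketch that reuses only the calls already shown: upload several generated files, then run the tests against all of them. The file names and the `generated_helpers` variable are hypothetical; in practice each file would come from its own prompt:

```python
# Hypothetical layout: main module, a helper module, and the test suite
files = {
    "factorial.py": generated_code,
    "helpers.py": generated_helpers,
    "test_all.py": generated_tests,
}

for path, source in files.items():
    sandbox.fs.upload_file(source.encode(), path)

result = sandbox.process.exec("pytest test_all.py")
print("All tests passed!" if result.exit_code == 0 else f"Tests failed:\n{result.result}")
```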

Final Thoughts

At Devōt, we often build tools that involve dynamic code execution, whether for internal platforms or client-facing products. Daytona gave us a way to experiment safely, with clear boundaries and control.

To be honest, for teams like ours, working at the intersection of AI and engineering, that kind of isolation isn't just helpful; it's necessary.


Ready to build your own AI coding assistant? Check out the complete documentation.

Tags:
  • LLM
  • LangChain
  • OpenAI
  • Daytona
  • Sandbox
  • AI Agents
  • Python
  • Code Execution
  • Security
  • AI Safety
  • TDD