# Contents

AI research labs at universities run into infrastructure problems that have nothing to do with the research itself.

Shared compute clusters where one student's dependency update breaks another's experiment. Conda environments that drift silently over weeks. Hyperparameter sweeps that run sequentially because managing parallel environments on shared hardware is fragile.

These aren't edge cases. They're the daily reality in most academic ML labs, and they cost real time. Time spent debugging environments is time not spent on research.

Daytona provides isolated sandbox environments (full computers with dedicated kernels, filesystems, and network stacks) that spin up in under 90 milliseconds. Researchers interact with sandboxes through a Python SDK, so infrastructure management fits directly into existing experiment scripts. The Laude Institute, in collaboration with Stanford University, used Daytona to scale their Terminal Bench AI agent benchmark from 4–6 local Docker containers to 37,000 sandboxes in a single week.

This three-part series walks through three research workflows where Daytona solves concrete problems academic labs face today. Each part covers the problem, the solution, and working code.

Note

The examples in this series use the Python SDK. Daytona also provides SDKs for TypeScript, Go, Ruby, and Java, as well as a CLI and REST API.

Overview

  • Part 1: Reusable research environments — defining environments as code, creating named snapshots, and launching clean sandboxes from them.

  • Part 2: Parallel experiment runs — running hyperparameter sweeps and benchmark evaluations concurrently without shared-state contamination.

  • Part 3: Isolated environments for agent and RL research — safe execution of AI-generated code, network-restricted evaluation, and fast rollout environments.

TL;DR

This series shows how Daytona sandboxes can support academic AI research workflows across three practical areas: reusable research environments, parallel experiment runs, and isolated execution environments for agent and RL research.

You'll see how to use Daytona to:

  • Create named, reusable research environments with snapshots and declarative images

  • Launch clean sandboxes for experiments, evaluations, and benchmark tasks

  • Run parallel configurations without shared filesystem or process-state contamination

  • Execute AI-generated code safely inside isolated environments before trusting its output

  • Create fast, disposable rollout environments for agent and RL-style experimentation

  • Keep research workflows programmable through Daytona's SDK, filesystem, process, network, and sandbox APIs

Prerequisites

You'll need a Daytona account (sign up at app.daytona.io), an API key from the Dashboard, and Python 3.9+.

Install the SDK:

pip install daytona

Set your credentials:

export DAYTONA_API_KEY="your-api-key-here"
export DAYTONA_TARGET="us"

Initialize the client:

from daytona import Daytona

daytona = Daytona()

Part 1: Reusable Research Environments

The problem: dependency drift

A paper gets submitted. Six months later, a reviewer requests an additional ablation. The researcher opens the project, runs pip install -r requirements.txt, and immediately hits version conflicts. Dependencies have released breaking updates since the code was last touched. What should be a thirty-minute experiment becomes a two-day environment debugging session.

A requirements.txt captures package names and versions, but not the complete system state. In practice, most researchers don't pin every transitive dependency, so builds aren't deterministic across time.
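One way to see how much of a requirements file is actually pinned is to scan it for entries without an exact `==` version. The sketch below is a minimal illustration using only the standard library; the helper name `unpinned_requirements` and the sample file contents are hypothetical, and the check covers only top-level entries (transitive dependencies would still need a lock file or a `pip freeze` export).

```python
import re

# Matches an exact pin such as "numpy==1.26.4", optionally with extras like "pkg[extra]==1.0".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?\s*==\s*[^\s;]+")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned to an exact version."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e, ...)
            continue
        if not PINNED.match(line):
            loose.append(line)
    return loose


# Hypothetical file contents, for illustration only.
example = """\
numpy==1.26.4
pandas>=2.0
gymnasium
# a comment
pytest==8.2.0  # pinned
"""
print(unpinned_requirements(example))  # ['pandas>=2.0', 'gymnasium']
```

Any entry this reports can resolve to a different version the next time the environment is built, which is the drift described above.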

How Daytona helps: named snapshots

Daytona snapshots give research teams a way to name and reuse a known environment configuration. A snapshot is a sandbox template built from a Docker/OCI image or a declarative image definition. Daytona processes it and stores it in its internal registry. Once a snapshot exists, new sandboxes can be launched from it whenever the same environment is needed again.

The reproducibility guarantee comes from pinning your base image and dependencies, the same guarantee a pinned Docker image provides. Daytona adds a workflow layer on top: a Python-native way to define environments, create named snapshots, launch sandboxes from them in under 90ms, execute code inside those sandboxes, and clean up afterward. The snapshot is the starting point; the SDK handles the lifecycle.

Creating a reusable snapshot

Instead of writing a Dockerfile, the environment can be defined declaratively in Python using the Image API:

from daytona import Daytona, Image, CreateSnapshotParams

daytona = Daytona()

# Define environment with common research packages.
# Packages are unpinned here for brevity; pin exact versions for reproducible builds.
image = (
    Image.debian_slim("3.12")
    .pip_install(["numpy", "pandas", "pytest", "gymnasium"])
    .workdir("/home/daytona")
)

# Create a named, reusable snapshot
daytona.snapshot.create(
    CreateSnapshotParams(
        name="ai-research-env",
        image=image,
    ),
    on_logs=lambda chunk: print(chunk, end=""),
)

For projects that already track dependencies in requirements.txt:

image = (
    Image.debian_slim("3.12")
    .pip_install_from_requirements("requirements.txt")
    .workdir("/home/daytona")
)

Or pyproject.toml:

image = (
    Image.debian_slim("3.12")
    .pip_install_from_pyproject("pyproject.toml", optional_dependencies=["dev"])
    .workdir("/home/daytona")
)

Launching a sandbox from the snapshot

Once the snapshot exists, creating a sandbox from it is one call:

from daytona import CreateSandboxFromSnapshotParams

sandbox = daytona.create(
    CreateSandboxFromSnapshotParams(
        snapshot="ai-research-env",
        language="python",
    )
)

A quick check can confirm the environment before running a larger workload:

check = """
import numpy as np
import pandas as pd
import gymnasium as gym

print("numpy:", np.__version__)
print("pandas:", pd.__version__)
print("gymnasium:", gym.__version__)
"""

result = sandbox.process.code_run(check)
print(result.result)

Using snapshots across a research workflow

A typical naming pattern in a lab might look like:

agent-benchmark-env

paper-eval-env

rl-rollout-env

teaching-lab-env

Once those exist as named snapshots, the workflow is: launch a sandbox, run the evaluation, collect the result, and clean up. Each run starts from a known, repeatable state.

# Launch from a known environment
sandbox = daytona.create(
    CreateSandboxFromSnapshotParams(
        snapshot="paper-eval-env",
        language="python",
    )
)

# Run an evaluation, passing lightweight config via environment variables
result = sandbox.process.exec(
    "python evaluate.py",
    cwd="/home/daytona",
    env={"SEED": "42", "SPLIT": "test"},
    timeout=300,
)

print(result.result)
sandbox.delete()
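
The workflow above deletes the sandbox on its last line, so an exception during the evaluation would skip the cleanup. One possible way to make cleanup unconditional is a small context manager. This is a sketch, not part of the Daytona SDK: `temporary_sandbox` is a hypothetical helper, and `FakeSandbox` is a stand-in that exists only so the example runs without an API key. The helper assumes nothing beyond a `create` callable and a `.delete()` method on whatever it returns.

```python
from contextlib import contextmanager


@contextmanager
def temporary_sandbox(create, *args, **kwargs):
    """Create a sandbox via `create(...)` and always delete it on exit."""
    sandbox = create(*args, **kwargs)
    try:
        yield sandbox
    finally:
        sandbox.delete()  # runs even if the body raises


class FakeSandbox:
    """Stand-in used only to demonstrate the helper without a real sandbox."""

    def __init__(self):
        self.deleted = False

    def delete(self):
        self.deleted = True


fake = None
try:
    with temporary_sandbox(FakeSandbox) as sb:
        fake = sb
        raise RuntimeError("evaluation crashed")
except RuntimeError:
    pass

print(fake.deleted)  # True
```

In a real script the call would presumably look like `with temporary_sandbox(daytona.create, CreateSandboxFromSnapshotParams(...)) as sandbox:`, with the evaluation running inside the `with` body.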

Why this fits research environments

Three properties make this workflow practical for academic labs.

First, the environment definition lives close to the project. Researchers who already maintain a requirements.txt or pyproject.toml can build a snapshot directly from those files. No parallel Dockerfile to maintain.

Second, named snapshots decouple environment definition from execution. A snapshot created for a paper submission can be relaunched months later by a different researcher on a different machine. The environment isn't a local artifact. It's a named, launchable starting point stored in Daytona's registry.

Third, fast launch times make per-experiment sandboxes practical. Spinning up a clean environment per run would be too slow under a traditional VM provisioning model. At under 90ms creation time, it becomes a lightweight operation inside a research script rather than a setup step that has to happen once and then persist.

The environment is a named, launchable starting point, not a long-lived machine that changes over time.

Part 2: Parallel Experiment Runs Without Shared-State Contamination

The problem: contamination and contention

Evaluating a model across multiple hyperparameter configurations on shared infrastructure introduces two risks. First, if experiments run in the same environment and any job modifies a shared file, writes to a common temp directory, or triggers a package side effect, it can silently affect other jobs. Second, resource contention between parallel jobs on the same node introduces performance variance that degrades result reliability.

Running experiments sequentially avoids contamination but is slow. Running them in parallel on shared filesystems is fast but fragile.

How Daytona helps: per-experiment isolation

Each experiment runs in its own Daytona sandbox with its own kernel and its own filesystem. No cross-contamination between experiments. Sandboxes spin up from snapshots quickly, so per-experiment overhead is minimal.

Running a parallel sweep

The pattern: create one sandbox per configuration, execute in parallel, collect results, clean up.

import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from daytona import Daytona, CreateSandboxFromSnapshotParams

daytona = Daytona()

configs = [
    {"seed": 1, "temperature": 0.1, "task_id": "task-001"},
    {"seed": 2, "temperature": 0.2, "task_id": "task-002"},
    {"seed": 3, "temperature": 0.3, "task_id": "task-003"},
    {"seed": 4, "temperature": 0.1, "task_id": "task-004"},
]

def run_experiment(config):
    """Run one experiment in a fully isolated sandbox."""
    sandbox = daytona.create(
        CreateSandboxFromSnapshotParams(
            snapshot="ai-research-env",
            language="python",
        )
    )

    try:
        # This config is a multi-field object, so upload it as a JSON file
        sandbox.fs.upload_file(
            json.dumps(config).encode("utf-8"),
            "/home/daytona/config.json",
        )

        # Run the experiment
        result = sandbox.process.exec(
            "python run_experiment.py --config /home/daytona/config.json",
            cwd="/home/daytona",
            timeout=300,
        )

        # Download the result
        if result.exit_code == 0:
            metrics = sandbox.fs.download_file("/home/daytona/results.json")
            return {"config": config, "metrics": json.loads(metrics), "status": "success"}
        else:
            return {"config": config, "error": result.result, "status": "failed"}

    finally:
        # Always tear the sandbox down, even if the run raised
        sandbox.delete()

# Execute all experiments in parallel
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(run_experiment, cfg): cfg for cfg in configs}

    results = []
    for future in as_completed(futures):
        result = future.result()
        results.append(result)
        cfg = result["config"]
        print(f"task={cfg['task_id']} seed={cfg['seed']} status={result['status']}")

successful = [r for r in results if r["status"] == "success"]
print(f"\n{len(successful)}/{len(configs)} experiments completed")

Each experiment gets its own sandbox, starts from the same snapshot, and is torn down when finished. The ThreadPoolExecutor runs them concurrently; set max_workers so the sweep stays within the concurrent sandbox and resource limits of your Daytona plan.

The example above uploads the full config as a JSON file. That's the right call for structured, multi-field objects; the experiment script can parse it cleanly. For simpler cases where the config is just a seed and a flag or two, Daytona's process.exec accepts an env dict directly, which avoids the file overhead entirely:

result = sandbox.process.exec(
    "python run_experiment.py",
    cwd="/home/daytona",
    env={"SEED": "42", "TEMPERATURE": "0.1", "TASK_ID": "task-001"},
    timeout=300,
)

The rule of thumb: scalar values go in env vars; structured config goes in a JSON file upload. Code files, test harnesses, and dataset artifacts always go through fs.upload_file.
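
The rule of thumb above can be encoded as a small helper that decides, per config, which route to take. This is only a sketch: `split_config` is a hypothetical name, and the upper-casing of keys is an assumed convention chosen to match the `SEED` / `TEMPERATURE` style used in the example.

```python
import json

SCALARS = (str, int, float, bool)


def split_config(config: dict):
    """Return (env, payload): env vars for flat scalar configs, JSON bytes otherwise."""
    if all(isinstance(value, SCALARS) for value in config.values()):
        # Environment variables are strings, so stringify every value.
        env = {key.upper(): str(value) for key, value in config.items()}
        return env, None
    # Nested or list-valued configs are serialized for a file upload instead.
    return {}, json.dumps(config).encode("utf-8")


flat = split_config({"seed": 42, "temperature": 0.1, "task_id": "task-001"})
print(flat)
# ({'SEED': '42', 'TEMPERATURE': '0.1', 'TASK_ID': 'task-001'}, None)

env, payload = split_config({"seed": 1, "optimizer": {"name": "adam", "lr": 0.001}})
print(env, payload)
# {} b'{"seed": 1, "optimizer": {"name": "adam", "lr": 0.001}}'
```

The returned `env` dict would be passed to `process.exec(..., env=env)`, while a non-empty `payload` would go through `fs.upload_file` as in the sweep example.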

Organizing runs with labels

When running many sandboxes, labels help with organization and debugging. Daytona lets you attach key-value labels to sandboxes and filter by them:

sandbox = daytona.create(
    CreateSandboxFromSnapshotParams(
        snapshot="ai-research-env",
        language="python",
    )
)

sandbox.set_labels({
    "project": "attention-ablation",
    "task_id": "task-001",
    "seed": "1",
})

Those labels can be used to query sandboxes via the daytona.list method or the REST API. That helps when you need to see which runs are still active, filter a specific project's sandboxes, or inspect a failed run's filesystem before cleaning it up. daytona.list returns a paginated result, so iterate over its .items:

# List all sandboxes from a specific project
result = daytona.list(labels={"project": "attention-ablation"})
for s in result.items:
    print(f"{s.id}: {s.state}")

Labels are most valuable in longer-running workflows where sandboxes aren't immediately deleted after each run.

Why isolation matters for parallel sweeps

Parallel experiment runs are typically interpreted as if the only difference between runs is the configuration. That assumption breaks when multiple runs share a mutable environment.

Consider what can go wrong on shared infrastructure: a previous run leaves behind a cache file that a later run reads as output; two runs write to the same results directory; a failed run leaves partial state that affects the next one; a debugging change made between runs affects reproducibility. These failures are not always obvious. The run completes successfully while still depending on state that was never intended to be part of the experiment.

By giving each run its own sandbox, the boundary between runs is explicit. The configuration changes; the starting environment stays consistent. This matters especially for agent benchmarks and RL-style workflows, where the environment may be actively modified during the run — which is exactly what Part 3 covers.

Part 3: Isolated Environments for Agent and RL Research

The problem: the environment is part of the experiment

Agent and RL research workflows make the execution environment itself part of the experiment. Agents generate code, call tools, modify files, observe results, and retry. RL rollouts accumulate state across steps. If the environment is shared or persistent in the wrong way, results become harder to interpret.

Running AI-generated code in a local research environment creates unnecessary risk. The code may fail, write files, import unexpected packages, or make network calls. For agent research, this matters because the evaluation loop often depends on repeated execution: generate, run, observe, revise, run again.

Pattern 1: Safe code execution for agent evaluation

In an agent evaluation loop, the model generates code, the code runs, the model observes the result, and it revises based on what happened. The sandbox is where that loop physically takes place. Unlike a parameter sweep (Part 2), where each sandbox runs one command and is discarded, an agent evaluation often involves several attempts inside the *same* environment, because the agent needs to observe the effect of each attempt before making the next one.

Start by creating a sandbox with network access blocked:

from daytona import Daytona, CreateSandboxFromSnapshotParams

daytona = Daytona()

sandbox = daytona.create(
    CreateSandboxFromSnapshotParams(
        snapshot="agent-eval-env",
        language="python",
        network_block_all=True,
    )
)

Here, blocking the network is a benchmark design decision, not just a safety measure. If the task is to write a correct function using only local dependencies, outbound access would let the agent fetch a solution or call an external API, which is not what the benchmark is meant to measure. Daytona's network_block_all makes that constraint part of the environment definition. (For tasks that legitimately need specific endpoints, network_allow_list permits a set of CIDR ranges instead of blocking everything.)

Upload the test harness once. It stays in place for the whole evaluation:

tests = """
from candidate import running_average

def test_running_average():
    assert running_average([2, 4, 6]) == [2.0, 3.0, 4.0]
"""

sandbox.fs.upload_file(tests.encode("utf-8"), "/home/daytona/test_candidate.py")

The candidate code is uploaded as a file rather than passed as an environment variable because it is an actual artifact the test harness imports. This is the practical difference from Part 2's config passing: there, a value parameterized a script; here, a file *is* the thing under test.

Now run the agent's first attempt:

attempt_1 = """
def running_average(xs):
    return [sum(xs[:i]) / i for i in range(len(xs))]
"""

sandbox.fs.upload_file(attempt_1.encode("utf-8"), "/home/daytona/candidate.py")

result = sandbox.process.exec(
    "pytest -q test_candidate.py",
    cwd="/home/daytona",
    timeout=120,
)
print(result.exit_code, result.result)  # non-zero: ZeroDivisionError at i = 0

The test output, including the traceback, becomes the agent's observation. This is why the sandbox stays alive rather than being torn down after one run: the agent needs to see what its code did and respond to it. The environment persists across attempts rather than resetting.

The agent revises and submits a second attempt into the same sandbox:

attempt_2 = """
def running_average(xs):
    return [sum(xs[:i + 1]) / (i + 1) for i in range(len(xs))]
"""

# Overwrite the candidate file in place; the test harness is still present
sandbox.fs.upload_file(attempt_2.encode("utf-8"), "/home/daytona/candidate.py")

result = sandbox.process.exec(
    "pytest -q test_candidate.py",
    cwd="/home/daytona",
    timeout=120,
)
print(result.exit_code, result.result)  # 0: passed

The candidate file is overwritten while the test harness and the rest of the environment stay exactly as they were. This mirrors how a real self-debugging agent operates: it doesn't get a fresh machine between attempts, it works inside one environment and observes cumulative results.

Only once the loop finishes, pass or fail, is the sandbox deleted:

sandbox.delete()

This is a different lifecycle from Part 2. There, one sandbox maps to one run. Here, one sandbox maps to one task, which may span several attempts. The isolation guarantee is the same (generated code from one task can't touch another), but the sandbox lives for the duration of an interactive loop rather than a single command.
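
The two hand-written attempts above generalize to a loop: generate, run, observe, revise. The sketch below shows one possible shape for that loop, kept independent of any model or sandbox so it runs as-is. The names `evaluate_task`, `propose`, and `run_attempt` are hypothetical. In a real evaluation, `propose` would call the agent with the previous observation, and `run_attempt` would upload the candidate file and return `(result.exit_code, result.result)` from `sandbox.process.exec`.

```python
from typing import Callable, Optional, Tuple


def evaluate_task(
    propose: Callable[[Optional[str]], str],
    run_attempt: Callable[[str], Tuple[int, str]],
    max_attempts: int = 3,
) -> dict:
    """Generate, run, observe, and revise until the tests pass or attempts run out."""
    observation = None  # nothing to observe before the first attempt
    for attempt in range(1, max_attempts + 1):
        code = propose(observation)
        exit_code, output = run_attempt(code)
        if exit_code == 0:
            return {"passed": True, "attempts": attempt}
        observation = output  # the test output becomes the next observation
    return {"passed": False, "attempts": max_attempts}


# Stand-ins so the sketch runs without a model or a sandbox.
drafts = iter(["buggy", "fixed"])
outcome = evaluate_task(
    propose=lambda observation: next(drafts),
    run_attempt=lambda code: (0, "1 passed") if code == "fixed" else (1, "ZeroDivisionError"),
)
print(outcome)  # {'passed': True, 'attempts': 2}
```

The sandbox would be created before calling `evaluate_task` and deleted after it returns, which keeps the one-sandbox-per-task lifecycle explicit.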

Pattern 2: Fast, clean environments for RL-style rollouts

RL and agentic-RL workflows extend the same idea. A rollout is itself an interaction loop: the policy acts, the environment responds, the policy acts again, and the episode accumulates state along the way. When that environment is more than an in-process simulator, when it involves files, a terminal, a coding task, or a tool the agent invokes, it benefits from the same per-episode isolation as Pattern 1. Each rollout gets a clean sandbox so that state from one episode can't leak into the next and corrupt the training signal.

Two properties make Daytona a good fit specifically for RL rollouts, beyond the general isolation that Part 2 already covered.

The first is startup speed. RL training generates an enormous number of rollouts, and each one is often short. If spinning up the environment takes longer than the rollout itself, environment overhead dominates and accelerators sit idle waiting for episodes to begin. Sandboxes that start in under 90ms keep that overhead small relative to the rollout, which matters far more here than in a parameter sweep, where a handful of long-running jobs amortize startup cost easily.

The second is clean per-episode state. A rollout that writes files, mutates a workspace, or drives a tool needs to start from a known baseline every time. Launching each episode from the same snapshot guarantees that baseline without a manual reset step, and deleting the sandbox afterward guarantees nothing carries over.
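
The startup-speed point can be made concrete with a little arithmetic: the share of each episode's wall-clock time spent waiting for the environment is startup / (startup + rollout). The numbers below are illustrative assumptions, not measurements. The 2-second rollout and the 20-second VM boot are hypothetical, and 0.09 s is used as the upper bound of the under-90ms figure quoted in this series.

```python
def startup_overhead(startup_s: float, rollout_s: float) -> float:
    """Fraction of per-episode wall-clock time spent waiting for the environment."""
    return startup_s / (startup_s + rollout_s)


rollout_s = 2.0  # hypothetical short episode
for label, startup_s in [("90 ms sandbox", 0.09), ("20 s VM boot", 20.0)]:
    print(f"{label}: {startup_overhead(startup_s, rollout_s):.1%}")
# 90 ms sandbox: 4.3%
# 20 s VM boot: 90.9%
```

Under these assumptions the slow-booting environment spends roughly nine-tenths of each episode on setup, while the fast one spends a few percent.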

Mechanically, running many rollouts concurrently is the same pattern shown in Part 2: launch one sandbox per rollout from a shared snapshot inside a ThreadPoolExecutor, pass the seed and episode config as environment variables, run the rollout, collect the result, and delete the sandbox. The orchestration code is identical, so it isn't repeated here. What's worth carrying over from this part is *why* the per-episode sandbox matters for RL: short rollouts make startup time the bottleneck, and stateful environments make clean resets essential.

This is the regime where the pattern has been proven at scale. The Laude Institute, in collaboration with Stanford, scaled their Terminal Bench agent benchmark on Daytona to 37,000 sandboxes in a single week, with each task getting its own isolated, quickly created environment. The same shape applies whether the rollout is a benchmark task or an RL training episode.

Why agent and RL workflows use sandboxes

Agent and RL workflows need controlled environments where actions can happen safely and results can be observed and repeated.

Each task or rollout gets a dedicated filesystem, a dedicated process namespace, optional network controls, programmable file upload and download, and clean creation from a reusable snapshot. State from one episode cannot affect the next. Generated code runs in isolation from the researcher's machine. Network access can be gated to match the benchmark's requirements.

The distinction from Part 2 is the lifecycle. A parameter sweep maps one sandbox to one run: create, execute once, collect, delete. An agent or RL workflow maps one sandbox to one task or episode, which may involve several actions and observations inside the same environment before it's cleaned up. In both cases the isolation boundary is the same; what differs is how long the environment lives and how much interaction happens inside it.

For agent and RL research, the execution environment is part of what the experiment measures, not just where the code runs.

Next steps

The examples across this series are intentionally compact, but each one extends into a larger research workflow.

Connect sandbox runs to experiment tracking

A Daytona sandbox handles execution while an experiment tracker stores the result. Depending on the lab's workflow, that might be MLflow, Weights & Biases, LangSmith, or a simple JSON artifact store. A useful record might combine sandbox metadata with run metrics:

{
  "snapshot": "ai-research-env",
  "run_id": "seed-42-temperature-0.2",
  "benchmark": "code-repair",
  "task_id": "task-017",
  "config": { "seed": 42, "temperature": 0.2 },
  "metrics": { "passed": true, "score": 1.0 }
}

The sandbox handles execution; the tracker keeps the research history.
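
For the simplest case, a JSON artifact store, a record in the shape shown above could be assembled and appended to a JSON Lines file. This is a sketch using only the standard library; `build_record`, `append_record`, and the `runs.jsonl` filename are hypothetical, and deriving `run_id` from the config keys is an assumed convention that happens to reproduce the example ID.

```python
import json
import tempfile
from pathlib import Path


def build_record(snapshot, benchmark, task_id, config, metrics):
    """Combine sandbox metadata with run metrics into one tracking record."""
    run_id = "-".join(f"{key}-{value}" for key, value in config.items())
    return {
        "snapshot": snapshot,
        "run_id": run_id,
        "benchmark": benchmark,
        "task_id": task_id,
        "config": config,
        "metrics": metrics,
    }


def append_record(path: Path, record: dict) -> None:
    """Append one record as a single JSON line."""
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


record = build_record(
    snapshot="ai-research-env",
    benchmark="code-repair",
    task_id="task-017",
    config={"seed": 42, "temperature": 0.2},
    metrics={"passed": True, "score": 1.0},
)

# A temporary directory keeps the demonstration free of side effects.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "runs.jsonl"
    append_record(log_path, record)
    append_record(log_path, record)
    lines = log_path.read_text(encoding="utf-8").splitlines()

print(record["run_id"], len(lines))  # seed-42-temperature-0.2 2
```

An append-only file like this is easy to load into a dataframe later, and the same record dict could instead be sent to a tracker's logging call.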

Create snapshots for major research milestones

Snapshots are especially useful at meaningful project boundaries: a paper submission, benchmark release, class assignment, or artifact evaluation package. A lab that maintains named snapshots at these points has a named, launchable environment for every result in its history. Future runs (for a follow-up paper, a reviewer request, or a student replication) start from the same known state rather than a best-effort reconstruction.

Use network policy as part of the benchmark design

For agent workflows that execute generated code, whether a sandbox has outbound access can be part of the benchmark's definition. Some tasks may allow access to external APIs. Others may require local-only execution to test whether an agent can work within a constrained environment. Daytona's network controls make this a configurable parameter rather than an infrastructure assumption.
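
One way to treat network policy as a benchmark parameter is to store a policy name with each task and translate it into sandbox creation arguments. The sketch below is an assumption-laden illustration: `network_kwargs` and the policy names are hypothetical, and it assumes `network_allow_list` takes a comma-separated string of CIDR ranges, which should be checked against the SDK reference. The CIDR shown is a reserved documentation range, not a real endpoint.

```python
def network_kwargs(policy: str, allowed_cidrs: tuple = ()) -> dict:
    """Translate a benchmark-level policy name into sandbox creation keyword arguments."""
    if policy == "offline":
        return {"network_block_all": True}
    if policy == "allow_list":
        if not allowed_cidrs:
            raise ValueError("allow_list policy needs at least one CIDR range")
        # Assumption: the allow list is passed as one comma-separated string.
        return {"network_allow_list": ",".join(allowed_cidrs)}
    if policy == "open":
        return {}  # no restriction requested
    raise ValueError(f"unknown network policy: {policy!r}")


print(network_kwargs("offline"))
# {'network_block_all': True}
print(network_kwargs("allow_list", ("192.0.2.0/24",)))
# {'network_allow_list': '192.0.2.0/24'}
```

The resulting dict could be unpacked into the creation call, for example `CreateSandboxFromSnapshotParams(snapshot="agent-eval-env", language="python", **network_kwargs("offline"))`, so the policy is recorded alongside the task definition.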

Expand rollout environments for agentic RL

The rollout sandbox can host much more than a simple script. A single sandbox can run a terminal task, coding task, browser task, local service, or file-based benchmark harness. For agentic RL research, the useful pattern stays the same regardless of complexity: one rollout, one sandbox, one result, with the sandbox created, used, observed, and cleaned up through the SDK.

Final thoughts

AI research infrastructure is often an afterthought: something assembled once from available tools and then worked around for the life of a project. Environments drift, experiments accumulate state, and reproducibility slips.

Daytona gives research teams a programmable layer for the execution environment itself. Snapshots make environments explicit and reusable. Sandboxes give each run a clean starting point. The filesystem and process APIs make research workflows scriptable rather than manual. Network controls add the safety margin that agent and RL workflows need.

The execution environment doesn't have to be a fixed machine or a persistent setup that everyone shares and hopes stays stable. You can define it in code, create it when needed, and clean it up when the task is done, like any other resource in a well-engineered research pipeline.

Getting started