Case Study
How WolfBench Scaled Agentic Benchmarking Across Thousands of Runs With Daytona

32x
faster evaluations compared to sequential runs on a local machine
1000+
benchmark runs executed across isolated sandboxes
WolfBench is an open framework that scores how consistently model-agent combinations handle real-world terminal tasks. Backed by CoreWeave, it offers the AI community a dependable signal for production readiness.
Headquarters
San Francisco, CA
Industry
AI Evaluation, AI Benchmarks & Research
Department
Research and Development
Key Features
Harbor Compatibility, Sandbox Creation Speed, Infrastructure Scale
Learn how this agent consistency benchmarking framework partnered with Daytona to provision parallel sandboxes that executed thousands of complex runs during a critical growth phase.
Not only does Daytona's sandbox platform handle countless concurrent runs, but the team is also incredibly knowledgeable and responsive.

Wolfram Ravenwolf
AI Evangelist at CoreWeave
01 -- CHALLENGE
Continuous AI Evaluations Demanded Scalable Sandbox Infrastructure
Soon after launching WolfBench, Wolfram Ravenwolf, AI Evangelist at CoreWeave, expanded the framework’s coverage to a wide range of models, agents, and scenarios. Running evaluations at a meaningful volume during that growth phase required dedicated sandbox infrastructure.
WolfBench tests 89 scenarios across frontier AI model-agent combinations. Each combination is tested five times to control for statistical variance. On a local machine, these tests run one at a time, stretching a single model evaluation to hundreds of hours. Because models are updated frequently, that delay would leave WolfBench’s results outdated by the time they were ready.
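As a rough illustration of the arithmetic behind that estimate, the sketch below counts the runs in one model-agent combination from the figures above (89 scenarios, five repetitions each). The per-run duration and the concurrency level are hypothetical placeholders chosen only to show the calculation; they are not figures reported by WolfBench or Daytona.

```python
import math

SCENARIOS = 89    # from the case study
REPETITIONS = 5   # from the case study


def runs_per_combination(scenarios: int = SCENARIOS,
                         repetitions: int = REPETITIONS) -> int:
    """Number of benchmark runs needed for one model-agent combination."""
    return scenarios * repetitions


def wall_clock_hours(total_runs: int, minutes_per_run: float,
                     concurrency: int) -> float:
    """Idealized wall-clock time when runs execute in waves of `concurrency`.

    Ignores provisioning overhead, retries, and uneven run lengths.
    """
    waves = math.ceil(total_runs / concurrency)
    return waves * minutes_per_run / 60


runs = runs_per_combination()  # 89 * 5 = 445

# Hypothetical inputs, for illustration only.
ASSUMED_MINUTES_PER_RUN = 30
ASSUMED_CONCURRENCY = 32

sequential = wall_clock_hours(runs, ASSUMED_MINUTES_PER_RUN, concurrency=1)
parallel = wall_clock_hours(runs, ASSUMED_MINUTES_PER_RUN, ASSUMED_CONCURRENCY)
print(f"{runs} runs: {sequential} h sequential, {parallel} h in parallel")
```

Under these assumed inputs, the gap between the two numbers comes almost entirely from the concurrency level, which is why a sandbox limit matters more than the speed of any single run.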
Speed and scale weren’t the only reasons sandboxes were a necessity. Agents wrote files and modified system state as they worked, so they needed to operate in a clean, isolated environment to ensure valid test results. Once performance data was captured, sandboxes had to be torn down to avoid idle compute that would inflate costs.
To preserve the integrity of each evaluation cycle, every sandbox also had to be compatible with Harbor, an open-source evaluation harness that Wolfram used to orchestrate benchmark runs. Any compatibility gaps would require engineering work to keep the testing pipeline connected.
While WolfBench's inference was powered by CoreWeave, sandbox infrastructure wasn't part of the platform's offerings at the time. Building that capability internally would’ve diverted months of engineering time. So Wolfram set out to find a cloud sandbox provider to power WolfBench’s evaluation workflow during a period of rapid growth.
That’s when he discovered Daytona. The combination of their agent-native sandbox infrastructure and hands-on assistance convinced him he’d found what he was looking for.
At a certain point, WolfBench grew beyond what local machines could support. With 89 tests per model-agent combination, we were looking at thousands of runs in total, and weeks of sequential work. We needed a sandbox provider to carry the infrastructure load through that scaling phase so we could focus on evaluations.

Wolfram Ravenwolf
AI Evangelist at CoreWeave
02 -- SOLUTION
Powering Concurrent Evaluations In a High-Growth Period With Daytona
Daytona’s native Harbor integration meant Wolfram could connect his entire evaluation pipeline with ease. After the Daytona team helped him choose the appropriate sandbox limits, he had the infrastructure layer WolfBench needed to power its evaluations at a stage of rapid expansion.
With Daytona, Wolfram provisioned sandboxes in parallel, running all 89 WolfBench tests across proprietary and open-source model-agent combinations at scale. Each sandbox ran in full isolation, so scores reflected actual agent performance. This setup helped Wolfram keep WolfBench results accurate and evaluation cycles short.
Once performance data was captured, each sandbox was torn down immediately, clearing the way for the next run and minimizing idle compute. Beyond keeping costs predictable, this efficiency accelerated test runs and freed up resources that Wolfram could reinvest in more frequent evaluations.
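The lifecycle described above (provision a fresh sandbox per run, cap concurrency at a chosen sandbox limit, capture the result, tear down immediately) can be sketched in a few lines. This is a minimal illustration only: `FakeSandbox`, `evaluate`, and `run_all` are hypothetical names, and the in-memory stand-in is not the Daytona SDK or the Harbor harness.

```python
import asyncio


class FakeSandbox:
    """Hypothetical in-memory stand-in for a cloud sandbox (not the Daytona SDK)."""

    live = 0  # sandboxes currently provisioned

    def __init__(self, task_id: str) -> None:
        self.task_id = task_id

    async def __aenter__(self) -> "FakeSandbox":
        FakeSandbox.live += 1  # "provision" a clean environment
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        FakeSandbox.live -= 1  # teardown runs even if the task fails

    async def run(self) -> dict:
        await asyncio.sleep(0)  # placeholder for the agent's actual work
        return {"task": self.task_id, "passed": True}


async def evaluate(task_id: str, limit: asyncio.Semaphore) -> dict:
    async with limit:  # stay within the sandbox limit
        async with FakeSandbox(task_id) as sandbox:  # one fresh sandbox per run
            return await sandbox.run()  # result is captured before teardown


async def run_all(task_ids: list[str], max_concurrent: int) -> list[dict]:
    limit = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(*(evaluate(t, limit) for t in task_ids))


# 89 scenarios as in the case study; the limit of 8 is an arbitrary example value.
results = asyncio.run(
    run_all([f"scenario-{i}" for i in range(89)], max_concurrent=8)
)
print(len(results), "results;", FakeSandbox.live, "sandboxes still running")
```

Tying teardown to the context manager's exit, rather than to a separate cleanup step, is one way to make sure no sandbox outlives its run and accrues idle compute.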
Because Daytona managed the sandbox infrastructure for these WolfBench runs, Wolfram stayed focused on evolving test scenarios, analyzing performance patterns, and adding more models. This consistent iteration made WolfBench a stronger resource for developers and researchers choosing which AI agents to trust in production.
The partnership proved as valuable to Wolfram as the platform itself during that period. Daytona's team stayed close, answering questions and raising sandbox limits as WolfBench's needs grew, so evaluation runs didn’t stall. Wolfram also had direct access to Daytona’s leadership via Slack, which helped him make the most of the platform’s capabilities.
Integrating Daytona was seamless. Once we selected a sandbox limit, we started provisioning isolated environments on demand. The platform worked flawlessly.

Wolfram Ravenwolf
AI Evangelist at CoreWeave
03 -- RESULT
WolfBench Ran Evaluations 32x Faster While Partnering With Daytona
With Daytona, WolfBench gained scalable sandbox infrastructure to power thousands of concurrent model-agent evaluations during a period of rapid growth. As a result, Wolfram could focus on expanding the framework’s scope and turning evaluation results into performance insights.
32x faster evaluations compared to sequential runs on a local machine
Thousands of benchmark runs executed across isolated sandboxes
Wolfram remains open to partnering with Daytona again in the future and exploring how other platform capabilities, such as long‑running sandboxes, could support WolfBench’s evaluations.
Daytona reached out early, gave me the resources to do real work, and kept that support going. Trust was built from day one.

Wolfram Ravenwolf
AI Evangelist at CoreWeave




