Case Study

How WolfBench Scaled Agentic Benchmarking Across Thousands of Runs With Daytona

32x

faster evaluations compared to sequential runs on a local machine

1000+

thousands of benchmark runs executed across isolated sandboxes

WolfBench is an open framework that scores how consistently model-agent combinations handle real-world terminal tasks. Backed by CoreWeave, it offers the AI community a dependable signal for production readiness.

Headquarters

San Francisco, CA

Industry

AI Evaluation, AI Benchmarks & Research

Department

Research and Development

Key Features

Harbor Compatibility, Sandbox Creation Speed, Infrastructure Scale

wolfbench.ai

Learn how this agent consistency benchmarking framework partnered with Daytona to provision parallel sandboxes that executed thousands of complex runs during a critical growth phase.

Not only does Daytona's sandbox platform handle countless concurrent runs, but the team is also incredibly knowledgeable and responsive.

Wolfram Ravenwolf

AI Evangelist at CoreWeave

01 -- CHALLENGE

Continuous AI Evaluations Demanded Scalable Sandbox Infrastructure

Soon after launching WolfBench, Wolfram Ravenwolf, AI Evangelist at CoreWeave, expanded the framework’s coverage to a wide range of models, agents, and scenarios. Running evaluations at a meaningful volume during that growth phase required dedicated sandbox infrastructure.

WolfBench tests 89 scenarios across frontier AI model-agent combinations, and each combination is run five times to account for statistical variance. On a local machine, these tests run one at a time, stretching a single model evaluation to hundreds of hours. With models being updated frequently, that delay would leave WolfBench’s results outdated by the time they were ready.
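To make that scale concrete, the run count follows directly from the figures above: 89 scenarios times five repetitions. The sketch below is illustrative arithmetic only. The 30-minute run duration is an assumed placeholder, not a WolfBench measurement, and the function names are hypothetical.

```python
SCENARIOS = 89  # scenarios per model-agent combination (from the case study)
REPEATS = 5     # repetitions per combination (from the case study)


def runs_per_combination(scenarios: int = SCENARIOS, repeats: int = REPEATS) -> int:
    """Total runs needed to evaluate one model-agent combination."""
    return scenarios * repeats


def sequential_hours(total_runs: int, minutes_per_run: float) -> float:
    """Wall-clock hours when runs execute one at a time."""
    return total_runs * minutes_per_run / 60


def hours_with_speedup(sequential: float, speedup: float) -> float:
    """Wall-clock hours after applying an overall speedup factor."""
    return sequential / speedup


runs = runs_per_combination()
# ASSUMPTION: 30 minutes per run is a made-up figure for illustration.
local_hours = sequential_hours(runs, minutes_per_run=30)
# 32x is the speedup reported in this case study.
parallel_hours = hours_with_speedup(local_hours, speedup=32)

print(f"{runs} runs, {local_hours:.1f} h sequential, {parallel_hours:.1f} h at 32x")
```

Under that assumed duration, one combination needs 445 runs and roughly 222 hours of sequential work, which is in line with the "hundreds of hours" described above. Multiplying by the number of combinations under test is what pushes the total into thousands of runs.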

Speed and scale weren’t the only reasons sandboxes were a necessity. Agents wrote files and modified system state as they worked, so they needed to operate in a clean, isolated environment to ensure valid test results. Once performance data was captured, sandboxes had to be torn down to avoid idle compute that would inflate costs.

To preserve the integrity of each evaluation cycle, every sandbox also had to be compatible with Harbor, an open-source evaluation harness that Wolfram used to orchestrate benchmark runs. Any compatibility gaps would require engineering work to keep the testing pipeline connected.

While WolfBench's inference was powered by CoreWeave, sandbox infrastructure wasn't part of the platform's offerings at the time. Building that capability internally would’ve diverted months of engineering time. So Wolfram set out to find a cloud sandbox provider to power WolfBench’s evaluation workflow during a period of rapid scaling.

That’s when he discovered Daytona. The combination of their agent-native sandbox infrastructure and hands-on assistance convinced him he’d found what he was looking for.

At a certain point, WolfBench grew beyond what local machines could support. With 89 tests per model-agent combination, we were looking at thousands of runs in total, and weeks of sequential work. We needed a sandbox provider to carry the infrastructure load through that scaling phase so we could focus on evaluations.

Wolfram Ravenwolf

AI Evangelist at CoreWeave

02 -- SOLUTION

Powering Concurrent Evaluations In a High-Growth Period With Daytona

Daytona’s native Harbor integration meant Wolfram could connect his entire evaluation pipeline with ease. After the Daytona team helped him choose the appropriate sandbox limits, he had the infrastructure layer WolfBench needed to power its evaluations at a stage of rapid expansion.

With Daytona, Wolfram provisioned sandboxes simultaneously, running all 89 WolfBench tests across proprietary and open-source model-agent combinations at scale. Each sandbox operated fully isolated, ensuring scores reflected actual agent performance. This setup helped Wolfram keep WolfBench results accurate and evaluation cycles tight.

Once performance data was captured, each sandbox was torn down immediately, clearing the way for the next run and minimizing idle compute. Beyond keeping costs predictable, this efficiency accelerated test runs and freed up resources that Wolfram could reinvest in more frequent evaluations.
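The provision, run, capture, and tear-down cycle described here can be sketched generically as follows. This is a minimal illustration of the pattern, not WolfBench's pipeline: `FakeSandbox`, `sandbox`, and `run_all` are hypothetical stand-ins rather than Daytona or Harbor APIs, and the concurrency limit of 8 is an arbitrary value.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeSandbox:
    """Hypothetical stand-in for an isolated sandbox; not a real SDK class."""

    live = 0  # number of sandboxes currently provisioned

    def __init__(self, run_id: int) -> None:
        self.run_id = run_id

    async def execute(self) -> dict:
        await asyncio.sleep(0)  # placeholder for the agent working on its task
        return {"run_id": self.run_id, "passed": True}


@asynccontextmanager
async def sandbox(run_id: int):
    """Provision a fresh sandbox and always tear it down afterwards."""
    FakeSandbox.live += 1
    try:
        yield FakeSandbox(run_id)
    finally:
        FakeSandbox.live -= 1  # teardown runs even if the benchmark run fails


async def run_all(run_ids, limit: int) -> list:
    """Run every benchmark in its own sandbox, at most `limit` at a time."""
    gate = asyncio.Semaphore(limit)

    async def one(run_id: int) -> dict:
        async with gate:  # respect the sandbox limit
            async with sandbox(run_id) as sb:
                return await sb.execute()  # result is captured before teardown

    return await asyncio.gather(*(one(r) for r in run_ids))


# 445 = 89 scenarios x 5 repetitions; the limit of 8 is an arbitrary example.
results = asyncio.run(run_all(range(445), limit=8))
print(len(results), "runs finished,", FakeSandbox.live, "sandboxes still live")
```

Putting teardown in a `finally` block means a failed run still releases its sandbox, which is one way to keep idle compute from building up between runs.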

Because Daytona managed the sandbox infrastructure for these WolfBench runs, Wolfram stayed focused on evolving test scenarios, analyzing performance patterns, and adding more models. This consistent iteration made WolfBench a stronger resource for developers and researchers choosing which AI agents to trust in production.

Beyond the platform itself, the partnership proved equally valuable to Wolfram during that period. Daytona's team stayed close, answering questions and raising sandbox limits as WolfBench's needs grew, so evaluation runs didn’t stall. Wolfram also had direct access to Daytona’s leadership via Slack, which helped him make the most of the platform’s capabilities.

Integrating Daytona was seamless. Once we selected a sandbox limit, we started provisioning isolated environments on demand. The platform worked flawlessly.

Wolfram Ravenwolf

AI Evangelist at CoreWeave

03 -- RESULT

WolfBench Ran Evaluations 32x Faster While Partnering With Daytona

With Daytona, WolfBench gained scalable sandbox infrastructure to power thousands of model-agent evaluation runs across concurrent sandboxes during a period of rapid scaling. As a result, Wolfram could focus on expanding the framework’s scope and turning evaluation results into performance insights.

32x faster evaluations compared to sequential runs on a local machine
Thousands of benchmark runs executed across isolated sandboxes

Wolfram remains open to partnering with Daytona again in the future and exploring how other platform capabilities, such as long‑running sandboxes, could support WolfBench’s evaluations.