
Serve LLMs on GPU Sandboxes with SGLang


This guide demonstrates how to serve large language models on a Daytona GPU sandbox with SGLang, and query them from anywhere through a token-authenticated preview URL. The worked example serves gpt-oss-20b, OpenAI’s open-weights reasoning model, but the same script serves any model SGLang supports.

SGLang serves an OpenAI-compatible API, so existing clients work unchanged. Beyond plain chat, the guide highlights some of SGLang’s features:

  • Structured output: constrained decoding against a JSON schema, so replies are guaranteed to parse
  • Prefix caching you can see: RadixAttention reuses the KV cache of repeated prompt prefixes, and the server reports the hit count on every response
  • Batched workload: the final example classifies 273 passages from thirteen classic books by author, combining reasoning with structured output and sending them concurrently for SGLang to batch on the GPU, then checks the model’s accuracy against ground truth

The serving side is a single script, serve_sglang.py: it creates the sandbox, starts the server inside it, and prints the endpoint and its access token once the server is healthy. The model served is gpt-oss-20b, a mixture-of-experts model whose MXFP4-quantized weights fit in about 15 GB of VRAM, leaving the rest of the GPU free for serving capacity.


The script serve_sglang.py performs four steps to get a live endpoint:

  1. Create: A GPU sandbox boots straight from the official lmsysorg/sglang image, which already carries the whole serving stack, so there is nothing to build or install
  2. Serve: sglang.launch_server runs inside the sandbox as an async session command, loading the model while the script keeps control
  3. Wait: The script polls /health_generate over the preview URL until the model clears a real forward pass, echoing the startup logs as they stream
  4. Hand off: Once the server is healthy, it prints ready-to-paste export ENDPOINT=... and export TOKEN=... lines

Four clients then show the endpoint in action: query.sh (curl), query_openai.py (OpenAI SDK, including the reasoning, structured output, and cache examples below), query_litellm.py (LiteLLM), and classify_passages.py (the batch classification workload).

Clone the Daytona repository, navigate to the example directory, and install into a virtual environment:

git clone https://github.com/daytonaio/daytona.git
cd daytona/guides/python/model-serving/sglang
python3 -m venv venv
source venv/bin/activate
pip install -e .

This installs the daytona SDK along with the openai and litellm clients used by the query examples.

Get your Daytona API key from the Daytona Dashboard and set it in a .env file:

cp .env.example .env
# edit .env with your API key

The .env.example also has an optional HF_TOKEN entry. It is not needed for gpt-oss, which is not gated; it only matters if you swap in a gated model, though Hugging Face recommends a token for faster, less throttled downloads in general.

serve_sglang.py first creates a GPU sandbox in us-east-1, currently the region for GPU sandboxes, directly from the official SGLang image:

import os
import sys
import time

import requests
from dotenv import load_dotenv
from daytona import (
    CreateSandboxFromImageParams,
    Daytona,
    DaytonaConfig,
    GpuType,
    Image,
    Resources,
    SessionExecuteRequest,
)

load_dotenv()

MODEL = "openai/gpt-oss-20b"
SERVED_AS = "gpt-oss-20b"
SGLANG_IMAGE = "lmsysorg/sglang:v0.5.12.post1-cu130"
PORT = 8000
TARGET = "us-east-1"  # current region for GPU sandboxes
SESSION = "sglang"  # name of the background session the server runs in
BOOT_TIMEOUT = 900  # max seconds to wait for the server to come up

daytona = Daytona(DaytonaConfig(target=TARGET))
env_vars = {"HF_TOKEN": os.environ["HF_TOKEN"]} if os.environ.get("HF_TOKEN") else {}
sb = daytona.create(
    CreateSandboxFromImageParams(
        image=Image.base(SGLANG_IMAGE),
        resources=Resources(
            gpu=1,
            gpu_type=[GpuType.H100, GpuType.RTX_PRO_6000],  # preference order
        ),
        auto_stop_interval=0,
        ephemeral=True,
        env_vars=env_vars,
    ),
    timeout=600,
)

The stock image ships the whole serving stack; the sandbox adds the GPU. gpu_type takes a single type or a priority list, gpu=1 is the current per-sandbox maximum, and auto_stop_interval=0 keeps the endpoint alive until you delete the sandbox.

The server then runs as a session command with run_async=True, so the script keeps control while the model loads:

sb.process.create_session(SESSION)
cmd = sb.process.execute_session_command(
    SESSION,
    SessionExecuteRequest(
        command=(
            f"python3 -m sglang.launch_server --model-path {MODEL} "
            f"--served-model-name {SERVED_AS} "
            f"--port {PORT} "
            "--tool-call-parser gpt-oss --reasoning-parser gpt-oss "
            "--enable-cache-report"
        ),
        run_async=True,
    ),
)
cmd_id = cmd.cmd_id

What each flag is for:

  • --tool-call-parser and --reasoning-parser turn the model’s raw output markup into structured tool_calls and reasoning_content fields. Without them the server still runs, but tool calls arrive as unparsed text in content.
  • --enable-cache-report makes the server report prefix cache hits in each response’s usage stats, which the cache demo below relies on.

Finally the script waits for the server, polling /health_generate through the preview URL while streaming the startup logs to your terminal:

pv = sb.get_preview_link(PORT)
hdr = {"x-daytona-preview-token": pv.token}
deadline = time.time() + BOOT_TIMEOUT
ready = False
printed = 0
while time.time() < deadline:
    # logs are a cumulative snapshot; print only the new tail
    out = sb.process.get_session_command_logs(SESSION, cmd_id).output or ""
    if len(out) > printed:
        sys.stdout.write(out[printed:])
        sys.stdout.flush()
        printed = len(out)
    # the server runs until killed; an exit code means it died
    exit_code = sb.process.get_session_command(SESSION, cmd_id).exit_code
    if exit_code is not None:
        # dump_log is a helper from the full script (not shown in this excerpt)
        print(f"!! sglang exited with code {exit_code}. Full log saved to {dump_log(cmd_id)}", flush=True)
        sys.exit(1)
    try:
        if requests.get(f"{pv.url}/health_generate", headers=hdr, timeout=10).status_code == 200:
            ready = True
            break
    except requests.RequestException:
        pass  # server not reachable yet; keep polling
    time.sleep(10)

/health_generate is a stricter readiness check than a plain liveness probe: it runs an actual forward pass, so a 200 means the model is loaded and generating, not merely that the port is open. The preview link is what exposes the server outside the sandbox: pv.url is reachable from anywhere, requests authenticate with the x-daytona-preview-token header, and the URL follows the structure described in the preview docs. If the server process dies during boot, the script notices the exit code immediately, saves the full log next to the script, and exits instead of waiting out the timeout.

When the wait ends, the script leaves the sandbox up either way: on success so the endpoint keeps serving, on failure so the already-downloaded weights aren’t lost. On success it also prints the handoff:

ready - paste into your shell:
export ENDPOINT=https://8000-{sandboxId}.{daytonaProxyDomain}
export TOKEN={previewToken}
sandbox left UP: {sandboxId}
reconnect: daytona.get('{sandboxId}')
delete: daytona.get('{sandboxId}').delete()

For capacity context: on an H100 sandbox this setup reports a KV cache pool of about 1.2 million tokens (max_total_num_tokens in the startup log) against the model’s 131k context window, which is what lets the classification example later in this guide hold an 825,000-token corpus at once.

Paste the printed export lines into your shell, then use any OpenAI-compatible client. The only Daytona-specific detail is the x-daytona-preview-token header; everything else is the standard OpenAI API surface, including stream=True for token streaming.

curl -sS --connect-timeout 30 --max-time 120 "$ENDPOINT/v1/chat/completions" \
  -H "x-daytona-preview-token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Write a haiku about a sandbox where AI agents run code."}],
    "max_tokens": 4096
  }'

The subsections below drive that same client through SGLang’s features, each a runnable example in query_openai.py: streaming, reasoning effort, structured output, tool calling, and prefix caching.
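The Python snippets that follow call a `client` object that the excerpts themselves do not construct. A minimal sketch of one way to build it, mirroring the AsyncOpenAI constructor in the classification example later in this guide. The helper names `client_kwargs` and `make_client` are hypothetical, and the `api_key` value is only a placeholder, since the preview token header is what authenticates the request:

```python
def client_kwargs(endpoint: str, token: str) -> dict:
    """Keyword arguments for an OpenAI SDK client aimed at the preview URL."""
    return {
        # the OpenAI-compatible routes live under /v1
        "base_url": f"{endpoint.rstrip('/')}/v1",
        # placeholder: the SDK requires a key, the proxy checks the header below
        "api_key": "EMPTY",
        "default_headers": {"x-daytona-preview-token": token},
    }


def make_client(endpoint: str, token: str):
    """Build a synchronous client (requires the openai package)."""
    from openai import OpenAI  # imported lazily so client_kwargs has no dependencies

    return OpenAI(**client_kwargs(endpoint, token))
```

With the export lines in place, `client = make_client(os.environ["ENDPOINT"], os.environ["TOKEN"])` would give the object the examples expect.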

Set stream=True and tokens arrive as deltas instead of one final message. gpt-oss streams its reasoning channel first and the answer second, in separate fields, so printing both reasoning_content and content follows the whole generation as it comes:

stream = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[{"role": "user", "content": "Write ten haikus about tokens arriving one at a time."}],
    max_tokens=8192,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    print(delta.reasoning_content or delta.content or "", end="", flush=True)

The server was started with --reasoning-parser gpt-oss, which separates the model’s thinking from its answer. Thinking is on by default at medium effort; reasoning_effort adjusts it per request, and the parsed trace comes back in the message’s reasoning_content field:

resp = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[{"role": "user", "content": "Write a haiku about thinking before speaking."}],
    reasoning_effort="high",
    max_tokens=8192,
)
print("reasoning:")
print(resp.choices[0].message.reasoning_content)
print("answer:")
print(resp.choices[0].message.content)

The dial matters in both directions: "high" buys more careful answers on hard problems but can spend several thousand reasoning tokens, which is why this request budgets 8192; "low" makes simple tasks faster, cheaper, and less variable, which is why the classification workload below runs at low effort.

Pass a JSON schema as response_format and SGLang constrains decoding to it: every token the model emits must keep the output a valid prefix of schema-conforming JSON, so the reply is guaranteed to parse, with no validate-and-retry loop around the call.

import json

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "lines": {"type": "array", "items": {"type": "string"}, "minItems": 3, "maxItems": 3},
        "season": {"type": "string"},
    },
    "required": ["title", "lines", "season"],
}
resp = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[
        {
            "role": "user",
            "content": "Compose a haiku about GPU sandboxes, as JSON with title, lines, and season.",
        }
    ],
    response_format={"type": "json_schema", "json_schema": {"name": "haiku", "schema": schema}},
    max_tokens=4096,
)
haiku = json.loads(resp.choices[0].message.content)  # guaranteed to parse

With gpt-oss the two features compose cleanly: the model reasons freely in its thinking channel, and the grammar constrains only the final answer. A response can carry a hundred tokens of deliberation in reasoning_content and still deliver schema-perfect JSON in content.
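One caveat is worth a sketch. The guarantee covers the tokens the model actually emits, so a reply cut off by max_tokens is still a valid prefix of the schema but may not be complete JSON. The helper below checks finish_reason before parsing. The name `parse_structured` is hypothetical, and the sketch assumes the standard OpenAI response shape, where truncation is reported as finish_reason == "length":

```python
import json
from types import SimpleNamespace


def parse_structured(choice):
    """Parse a schema-constrained reply, refusing replies cut off by max_tokens."""
    if choice.finish_reason == "length":
        # generation stopped at the token budget, so the JSON may be unfinished
        raise ValueError("reply truncated by max_tokens; raise the budget and retry")
    return json.loads(choice.message.content)


# hand-built stand-in for resp.choices[0], so the sketch runs offline
complete = SimpleNamespace(
    finish_reason="stop",
    message=SimpleNamespace(
        content='{"title": "t", "lines": ["a", "b", "c"], "season": "winter"}'
    ),
)
print(parse_structured(complete)["season"])  # winter
```

Because reasoning tokens count against the same budget, a generous max_tokens is the simplest way to keep this branch from firing.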

Schemas are not the only constraint SGLang supports. Its native /generate API accepts regular expressions and EBNF grammars in sampling_params, forcing the output to match:

curl -sS "$ENDPOINT/generate" \
  -H "x-daytona-preview-token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The best color for a terminal theme is",
    "sampling_params": {"max_new_tokens": 8, "regex": " (red|green|blue|amber)"}
  }'

The response text is one of the four colors, by construction.
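On the client side, the same pattern can double as a parser for the reply. A minimal sketch, assuming the /generate response is a JSON object whose `text` field holds the completion; the function name `extract_color` and the sample response are illustrative, not captured from a real server:

```python
import re

# same pattern as in the request, including the leading space
COLOR_PATTERN = re.compile(r" (red|green|blue|amber)")


def extract_color(response: dict) -> str:
    """Return the color from a /generate response, verifying the constraint held."""
    text = response["text"]  # assumed field name for the generated text
    match = COLOR_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"unexpected completion: {text!r}")
    return match.group(1)


# hand-written sample response
print(extract_color({"text": " amber", "meta_info": {}}))  # amber
```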

Because the server was started with the gpt-oss tool-call parser, the model can emit structured tool calls. The loop is the standard OpenAI one: the model requests a call, you run it, feed the result back, and the model answers.

import json
import random


def get_weather(city):
    rng = random.Random(city.lower())  # same city, same weather
    temp = rng.randint(-5, 35)
    sky = rng.choice(["sunny", "cloudy", "rainy", "foggy", "windy"])
    return f"{temp}°C and {sky} in {city}"


tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]
messages = [{"role": "user", "content": "Write a haiku about the current weather in Lisbon."}]
resp = client.chat.completions.create(model="gpt-oss-20b", messages=messages, tools=tools, max_tokens=4096)
msg = resp.choices[0].message
if msg.tool_calls:
    # echo the assistant turn back, then append one tool message per call
    messages.append(msg.model_dump(exclude_none=True))
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_weather(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    resp = client.chat.completions.create(model="gpt-oss-20b", messages=messages, max_tokens=4096)
print(resp.choices[0].message.content)

In this example get_weather runs in the same process as the client. Daytona makes a stronger pattern natural: give each chat session its own CPU sandbox and execute the model’s tool calls there with sandbox.process.code_run(...), so model-written code runs isolated from your machine and from other sessions. The model thinks in the GPU sandbox and its decisions execute in CPU sandboxes, both on Daytona.
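Wherever the tools run, the dispatch step of the loop above generalizes to a small registry keyed by tool name. A minimal sketch with hypothetical names (`TOOL_REGISTRY`, `dispatch_tool_call`) and a stand-in tool; it returns errors as text so the model can react to them instead of the loop crashing:

```python
import json


def get_weather(city: str) -> str:
    """Stand-in tool; a real one would call a weather service."""
    return f"weather for {city} is unavailable in this sketch"


TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch_tool_call(name: str, arguments: str) -> str:
    """Run one tool call and return the text to send back as the tool message."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return f"error: unknown tool {name!r}"
    try:
        kwargs = json.loads(arguments)  # arguments arrive as a JSON string
        return str(func(**kwargs))
    except (json.JSONDecodeError, TypeError) as exc:
        # malformed JSON or wrong parameter names: report instead of raising
        return f"error: bad arguments for {name}: {exc}"


print(dispatch_tool_call("get_weather", '{"city": "Lisbon"}'))
```

In the loop, `dispatch_tool_call(call.function.name, call.function.arguments)` would take the place of the direct get_weather call.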

RadixAttention, SGLang’s prefix cache, is on by default: when two requests share a prompt prefix, the second one reuses the first one’s KV cache instead of recomputing it. With --enable-cache-report, every response reports how many prompt tokens came from cache, so the speedup is measurable from the client:

import time

context = (
    "The Daytona platform provides isolated sandboxes for AI agents to safely execute code. " * 60
)
for attempt in (1, 2):
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model="gpt-oss-20b",
        messages=[{"role": "user", "content": context + "Summarize the above in one sentence."}],
        max_tokens=32,
    )
    dt = time.perf_counter() - t0
    details = resp.usage.prompt_tokens_details  # omitted entirely on a cold cache
    cached = details.cached_tokens if details else 0
    print(f"attempt {attempt}: {dt:.2f}s, {cached}/{resp.usage.prompt_tokens} prompt tokens from cache")

A representative run against an H100 sandbox:

attempt 1: 0.90s, 0/976 prompt tokens from cache
attempt 2: 0.42s, 975/976 prompt tokens from cache

The rerun answered about twice as fast (0.42 s against 0.90 s) because only one prompt token needed a forward pass. Any shared prefix qualifies, and the cache is a radix tree, so partial overlaps count too. A long system prompt, a few-shot preamble, a document being asked ten different questions, a multi-turn conversation growing one message at a time: each pays the prefill cost once and rides the cache afterwards.

Everything so far was one request at a time, but the capacity numbers from boot (a 1.2 million token KV pool) are about concurrency. classify_passages.py puts them to work on a task with verifiable answers: it downloads thirteen classic books from Project Gutenberg (cached locally after the first run), slices them into 273 passages of roughly 3,000 tokens, and classifies every passage by author, all 273 sent at once for SGLang to batch on the GPU, roughly 825,000 tokens of prompt in one pass.

The response format is a schema whose enum is the list of candidate authors, so the model cannot answer anything but one of the thirteen names. Each prompt leads with the passage and ends with the question, so the 3,000-token passage becomes the cached prefix: a second question over the same passages reuses it instead of paying prefill again. Both passes run at reasoning_effort="low" to keep generation light on a high-volume run.

import asyncio
import json
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url=f"{os.environ['ENDPOINT']}/v1",
    api_key="EMPTY",
    default_headers={"x-daytona-preview-token": os.environ["TOKEN"]},
)
AUTHORS = ["Austen", "Bronte", "Dickens", "Doyle", "Eliot", "Hawthorne",
           "Melville", "Poe", "Shelley", "Stoker", "Twain", "Wells", "Wilde"]
AUTHOR_QUESTION = f"Which of these authors wrote this passage: {', '.join(AUTHORS)}?"
AUTHOR_SCHEMA = {"type": "object", "required": ["author"],
                 "properties": {"author": {"type": "string", "enum": AUTHORS}}}
SETTING_QUESTION = "Is this scene set indoors or outdoors?"
SETTING_SCHEMA = {"type": "object", "required": ["setting"],
                  "properties": {"setting": {"type": "string", "enum": ["indoors", "outdoors"]}}}


# the passage leads and the question trails, so the passage is the cached prefix
async def classify(passage, question, schema):
    resp = await client.chat.completions.create(
        model="gpt-oss-20b",
        messages=[{"role": "user", "content": f"{passage}\n\n{question} Reply as JSON."}],
        response_format={"type": "json_schema", "json_schema": {"name": "answer", "schema": schema}},
        reasoning_effort="low",
        max_tokens=2048,
    )
    return json.loads(resp.choices[0].message.content)


# excerpt from inside an async function; dataset is built elsewhere in the script
# pass 1 prefills and caches the passages; pass 2 asks a new question and reuses them
authors = await asyncio.gather(*(classify(p, AUTHOR_QUESTION, AUTHOR_SCHEMA) for _, p in dataset))
settings = await asyncio.gather(*(classify(p, SETTING_QUESTION, SETTING_SCHEMA) for _, p in dataset))
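Why the passage has to come first can be seen in a toy comparison of shared prefixes. This is only an illustration: it treats whitespace-separated words as tokens, which is an assumption, since the real tokenizer and chat template produce different token sequences. The function name `shared_prefix_len` is hypothetical.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


passage = "It was a dark and stormy night on the moor".split()
q1 = "Who wrote this passage ?".split()
q2 = "Is this scene set indoors or outdoors ?".split()

# passage first: the whole passage is a shared prefix across both questions
print(shared_prefix_len(passage + q1, passage + q2))  # 10
# question first: the prompts diverge at the first token, so nothing is reusable
print(shared_prefix_len(q1 + passage, q2 + passage))  # 0
```

A prefix cache can only reuse what the two prompts have in common from the start, so the question-first layout would pay full prefill on the second pass.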

A representative run against an H100 sandbox:

pass 1 - author: 273 passages in 22.6s
accuracy: 195/273 (71%)
per author: Austen 15/21, Bronte 13/21, Dickens 17/21, Doyle 20/21, Eliot 6/21,
Hawthorne 15/21, Melville 20/21, Poe 13/21, Shelley 11/21,
Stoker 17/21, Twain 16/21, Wells 18/21, Wilde 14/21
in: 825,388 tok (36,577 tok/s, 131.7M/hour)
out: 27,693 tok (1,227 tok/s)
pass 2 - setting: 273 passages in 4.8s (4.7x faster)
predominantly indoors: Wilde 19/21, Austen 18/21, Bronte 18/21, Eliot 18/21,
Doyle 17/21, Stoker 15/21, Dickens 14/21, Poe 14/21
predominantly outdoors: Melville 19/21, Twain 15/21, Wells 15/21,
Shelley 14/21, Hawthorne 13/21
in: 817,198 tok (171,405 tok/s, 617.1M/hour including cache hits)
out: 13,341 tok (2,798 tok/s)
cached: 813,103/817,198 prompt tokens from cache

Three things worth reading out of those numbers:

  • Throughput: the endpoint ingested documents at roughly 37,000 tokens per second, around 130 million input tokens per hour from one GPU sandbox. Document workloads like this are prefill-bound, which is why output is only a trickle here; a generation-heavy workload is decode-bound instead, and its output token rate would be substantially higher.
  • Accuracy with a confusion pattern: 71 percent against ground truth over thirteen candidates (it varies a few points between runs, since sampling is on by default; pass temperature=0 for deterministic output). The errors are not random: Conan Doyle and Melville come back almost perfectly, while George Eliot’s Middlemarch is the hardest to place.
  • The second question reuses the cache: pass two asks something different about the same passages, whether each scene is set indoors or outdoors. Because the passages lead every prompt, RadixAttention still holds them, so 813,000 of 817,000 prompt tokens come from cache and the pass finishes in 4.8 seconds instead of 23, about 4.7 times faster. The calls sort the library the way you might expect: the drawing-room and detective novels (Austen, Doyle, Wilde, Eliot) come out indoors, while Moby Dick and Huck Finn come out outdoors.
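The accuracy figures come from comparing each predicted author with the known author of the book a passage was cut from. A minimal sketch of that scoring step, not the script's actual code: the function name `score` and the sample pairs are made up for illustration.

```python
from collections import Counter


def score(pairs):
    """pairs: iterable of (true_author, predicted_author).

    Returns (overall_accuracy, {author: (correct, total)}).
    """
    total, correct = Counter(), Counter()
    for truth, predicted in pairs:
        total[truth] += 1
        if predicted == truth:
            correct[truth] += 1
    overall = sum(correct.values()) / sum(total.values())
    per_author = {a: (correct[a], total[a]) for a in sorted(total)}
    return overall, per_author


# made-up sample, not the guide's data
sample = [("Austen", "Austen"), ("Austen", "Eliot"), ("Doyle", "Doyle"), ("Eliot", "Austen")]
overall, per_author = score(sample)
print(f"accuracy: {overall:.0%}")  # accuracy: 50%
print(per_author)
```

Keeping the per-author counts alongside the overall figure is what exposes the confusion pattern; a single percentage would hide that one author accounts for a large share of the misses.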

Everything so far used the default preview auth: the token in the x-daytona-preview-token header, the best fit for code you control. The alternatives, in increasing order of openness:

| Setup | Client needs | Good for |
| --- | --- | --- |
| Preview token header (guide default) | base URL + custom header | your own code |
| Signed URL | URL only; expires on schedule | temporary sharing |
| Public preview + SGLang API key | base URL + api_key | pointing existing apps at your model |
| Public preview, no key | base URL only | quick demos |

Both proxy alternatives are one step away. sb.create_signed_preview_url(PORT, expires_in_seconds=3600) returns a signed URL with a short-lived token baked in, so the client needs only the URL and it expires on schedule (the default is just 60 seconds, so pass expires_in_seconds explicitly). A public preview goes further and drops the proxy’s auth entirely: set public=True in the sandbox create params (the same CreateSandboxFromImageParams used to create the sandbox), and anyone with the URL can reach the server.

SGLang also has its own key check, independent of the proxy: add --api-key your-secret-key to the launch command and the server requires Authorization: Bearer your-secret-key, exactly what OpenAI-compatible clients send as their api_key. Pair that with a public preview and the endpoint takes the standard OpenAI shape, base URL plus key, usable by any tool that accepts only those two fields.
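The two checks are independent, so the headers a client sends depend on which of them are active. A minimal sketch for raw HTTP clients; the function name `auth_headers` is hypothetical, while the header names are the ones described above:

```python
def auth_headers(preview_token=None, api_key=None):
    """HTTP headers for a request, depending on which checks are enabled."""
    headers = {"Content-Type": "application/json"}
    if preview_token is not None:
        # Daytona proxy check; not needed for a public preview or a signed URL
        headers["x-daytona-preview-token"] = preview_token
    if api_key is not None:
        # SGLang's own check, active when the server was launched with --api-key
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


print(auth_headers(preview_token="tok"))          # guide default
print(auth_headers(api_key="your-secret-key"))    # public preview + SGLang key
print(auth_headers())                             # public preview, no key
```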

Code running inside the sandbox skips all of it and talks to http://localhost:8000 directly; the SGLang image ships the openai package, so sb.process.code_run snippets using the SDK work as-is. That colocated shape fits batch inference over data uploaded into the sandbox, or a self-contained agent that calls the local model and runs the code it writes, all in one sandbox.

To serve a different model, change three things in serve_sglang.py:

  • MODEL: the Hugging Face model ID
  • SERVED_AS: the name clients will pass as model
  • The --tool-call-parser and --reasoning-parser flags, which are model-family specific (or set both to auto)

For gated models, set HF_TOKEN in your .env; the script forwards it into the sandbox automatically. Keep in mind that the model has to fit on a single GPU, since sandboxes have at most one.
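The three changes can be gathered into one place by building the launch command from parameters. A minimal sketch: the function `build_launch_command` is hypothetical and not part of serve_sglang.py, and it uses only the flags that appear in the launch command shown earlier.

```python
import shlex


def build_launch_command(model, served_as, port=8000, tool_parser=None, reasoning_parser=None):
    """Assemble the sglang.launch_server command line for a given model."""
    parts = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model,
        "--served-model-name", served_as,
        "--port", str(port),
    ]
    # parsers are model-family specific, so they are optional here
    if tool_parser:
        parts += ["--tool-call-parser", tool_parser]
    if reasoning_parser:
        parts += ["--reasoning-parser", reasoning_parser]
    parts.append("--enable-cache-report")
    return shlex.join(parts)  # quotes any argument that needs it


print(build_launch_command("openai/gpt-oss-20b", "gpt-oss-20b",
                           tool_parser="gpt-oss", reasoning_parser="gpt-oss"))
```

Leaving a parser unset simply omits its flag, which matches the fallback behavior described earlier: the server still runs, but that output arrives unparsed.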

The 20b’s big sibling, gpt-oss-120b, also fits on a single H100, which is the point of its MXFP4 quantization: 117B parameters in about 67 GB of weights. It needs three additions to the launch command, because the weights occupy about 85 percent of VRAM and the default settings leave no room for a KV cache:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  python3 -m sglang.launch_server --model-path openai/gpt-oss-120b \
  --mem-fraction-static 0.93 --cuda-graph-max-bs 64 ...

The allocator setting prevents fragmentation failures while the MXFP4 weights are converted during loading; the raised memory fraction makes room for a KV cache at all (SGLang’s automatically chosen value computes a negative pool size for this model); the graph cap keeps CUDA graph capture inside what remains.
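A back-of-envelope calculation shows why the memory fraction matters. This is a rough sketch under stated assumptions: an H100 with a nominal 80 GB of VRAM, and --mem-fraction-static taken as the share of VRAM available to weights plus the KV cache pool. It ignores activations, CUDA graphs, and other overhead, and the 0.80 value is a hypothetical lower setting chosen for contrast, not SGLang's actual default.

```python
VRAM_GB = 80.0     # assumed nominal H100 capacity
WEIGHTS_GB = 67.0  # weight size quoted in the text


def kv_pool_gb(mem_fraction_static, vram_gb=VRAM_GB, weights_gb=WEIGHTS_GB):
    """Static budget minus weights: what is left for the KV cache pool."""
    return mem_fraction_static * vram_gb - weights_gb


print(f"weights share of VRAM: {WEIGHTS_GB / VRAM_GB:.0%}")   # 84%
print(f"KV pool at 0.93: {kv_pool_gb(0.93):.1f} GB")          # 7.4 GB
print(f"KV pool at 0.80: {kv_pool_gb(0.80):.1f} GB")          # -3.0 GB
```

Any fraction below the weights' own share yields a negative pool, which is the failure mode the raised value avoids. The small gap between 84 percent here and the 85 percent quoted above likely reflects usable VRAM being slightly below the nominal figure.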

9. Going Further: One Endpoint, Many Sandboxes


A single GPU sandbox is one worker. SGLang’s companion Model Gateway is built to sit in front of many: among its load-balancing strategies is a cache-aware one that tracks each worker’s radix cache and routes prefix-sharing requests to the worker that already has them cached. Since every Daytona sandbox exposes its server through its own preview URL, a fleet of single-GPU sandboxes behind one gateway becomes a horizontally scaled endpoint, with workers joining and leaving as you create and delete sandboxes.
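The idea behind cache-aware routing can be shown with a toy router. This is an illustration of the principle only, not the Model Gateway's algorithm or API: the class `ToyRouter` is hypothetical, words stand in for tokens, and each worker's cache is modeled as the list of prompts it has already received.

```python
def shared_prefix_len(a, b):
    """Length of the longest common prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyRouter:
    """Route each prompt to the worker holding the longest matching prefix."""

    def __init__(self, workers):
        self.seen = {w: [] for w in workers}  # worker -> prompts routed there

    def route(self, prompt):
        def best_match(worker):
            return max((shared_prefix_len(prompt, p) for p in self.seen[worker]), default=0)

        # longest cached prefix wins; ties go to the least-loaded worker
        choice = max(self.seen, key=lambda w: (best_match(w), -len(self.seen[w])))
        self.seen[choice].append(prompt)
        return choice


router = ToyRouter(["sandbox-a", "sandbox-b"])
doc1 = "moby dick chapter one".split()
doc2 = "pride and prejudice chapter one".split()
print(router.route(doc1 + ["who", "wrote", "this"]))       # sandbox-a
print(router.route(doc2 + ["who", "wrote", "this"]))       # sandbox-b
print(router.route(doc1 + ["indoors", "or", "outdoors"]))  # sandbox-a
```

The third request returns to the worker that already saw the same document, which is the behavior that keeps per-worker prefix caches useful as the fleet grows.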

Constants at the top of serve_sglang.py:

| Parameter | Default | Description |
| --- | --- | --- |
| MODEL | openai/gpt-oss-20b | Hugging Face model ID to serve |
| SERVED_AS | gpt-oss-20b | Model name exposed by the API |
| SGLANG_IMAGE | lmsysorg/sglang:v0.5.12.post1-cu130 | SGLang Docker image |
| PORT | 8000 | Port the server listens on |
| TARGET | us-east-1 | Current region for GPU sandboxes |
| BOOT_TIMEOUT | 900 | Seconds to wait for the server to become healthy |

Key advantages of this approach:

  • No infrastructure to manage: one script turns the stock SGLang image into a live GPU endpoint, with no cluster to run, image to build, or drivers to install
  • Fast and ephemeral: the endpoint is live about five minutes after you run the script, and the sandbox is disposable, deleted when you are done and billed only while it runs
  • Reachable anywhere, OpenAI-compatible: the token-authenticated preview URL works from any machine, and the API is the standard OpenAI surface, so existing clients and SDKs work unchanged
  • The serving features come with it: schema-constrained JSON, measurable prefix caching, separated reasoning, and tool calling, all from the stock image and a handful of flags