Serve LLMs on GPU Sandboxes with SGLang

This guide demonstrates how to serve large language models on a Daytona GPU sandbox with SGLang, and query them from anywhere through a token-authenticated preview URL. The worked example serves gpt-oss-20b, OpenAI’s open-weights reasoning model, but the same script serves any model SGLang supports.

SGLang serves an OpenAI-compatible API, so existing clients work unchanged. Beyond plain chat, the guide highlights some of SGLang’s features:

Structured output: constrained decoding against a JSON schema, so replies are guaranteed to parse
Prefix caching you can see: RadixAttention reuses the KV cache of repeated prompt prefixes, and the server reports the hit count on every response
Batched workload: the final example classifies 273 passages from thirteen classic books by author. It combines reasoning with structured output, sends all 273 requests concurrently for SGLang to batch on the GPU, then checks the model’s accuracy against ground truth

The serving side is a single script, serve_sglang.py: it creates the sandbox, starts the server inside it, and prints the endpoint and its access token once the server is healthy. The model served is gpt-oss-20b, a mixture-of-experts model whose MXFP4-quantized weights fit in about 15 GB of VRAM, leaving the rest of the GPU free for serving capacity.

1. Workflow Overview

The script serve_sglang.py performs four steps to get a live endpoint:

Create: A GPU sandbox boots straight from the official lmsysorg/sglang image, which already carries the whole serving stack, so there is nothing to build or install
Serve: sglang.launch_server runs inside the sandbox as an async session command, loading the model while the script keeps control
Wait: The script polls /health_generate over the preview URL until the model clears a real forward pass, echoing the startup logs as they stream
Hand off: Once the server is healthy, it prints ready-to-paste export ENDPOINT=... and export TOKEN=... lines

Four clients then show the endpoint in action: query.sh (curl), query_openai.py (OpenAI SDK, including the reasoning, structured output, and cache examples below), query_litellm.py (LiteLLM), and classify_passages.py (the batch classification workload).

2. Setup

Clone the Daytona repository, navigate to the example directory, and install into a virtual environment:

git clone https://github.com/daytonaio/daytona.git
cd daytona/guides/python/model-serving/sglang
python3 -m venv venv
source venv/bin/activate
pip install -e .

This installs the daytona SDK along with the openai and litellm clients used by the query examples.

Get your Daytona API key from the Daytona Dashboard and set it in a .env file:

cp .env.example .env
# edit .env with your API key

The .env.example also has an optional HF_TOKEN entry. It is not needed for gpt-oss, which is not gated; it only matters if you swap in a gated model, though Hugging Face recommends a token for faster, less throttled downloads in general.

3. Launching the Server

serve_sglang.py first creates a GPU sandbox in us-east-1, currently the region for GPU sandboxes, directly from the official SGLang image:

Python

import os
import sys
import time

import requests
from dotenv import load_dotenv

from daytona import (
    CreateSandboxFromImageParams,
    Daytona,
    DaytonaConfig,
    GpuType,
    Image,
    Resources,
    SessionExecuteRequest,
)

load_dotenv()

MODEL = "openai/gpt-oss-20b"
SERVED_AS = "gpt-oss-20b"
SGLANG_IMAGE = "lmsysorg/sglang:v0.5.12.post1-cu130"
PORT = 8000
TARGET = "us-east-1"  # current region for GPU sandboxes
SESSION = "sglang"  # name of the background session the server runs in
BOOT_TIMEOUT = 900  # max seconds to wait for the server to come up

daytona = Daytona(DaytonaConfig(target=TARGET))
env_vars = {"HF_TOKEN": os.environ["HF_TOKEN"]} if os.environ.get("HF_TOKEN") else {}
sb = daytona.create(
    CreateSandboxFromImageParams(
        image=Image.base(SGLANG_IMAGE),
        resources=Resources(
            gpu=1,
            gpu_type=[GpuType.H100, GpuType.RTX_PRO_6000],  # preference order
        ),
        auto_stop_interval=0,
        ephemeral=True,
        env_vars=env_vars,
    ),
    timeout=600,
)

The stock image ships the whole serving stack; the sandbox adds the GPU. gpu_type takes a single type or a priority list, gpu=1 is the current per-sandbox maximum, and auto_stop_interval=0 keeps the endpoint alive until you delete the sandbox.

The server then runs as a session command with run_async=True, so the script keeps control while the model loads:

Python

sb.process.create_session(SESSION)
cmd = sb.process.execute_session_command(
    SESSION,
    SessionExecuteRequest(
        command=(
            f"python3 -m sglang.launch_server --model-path {MODEL} "
            f"--served-model-name {SERVED_AS} "
            f"--port {PORT} "
            "--tool-call-parser gpt-oss --reasoning-parser gpt-oss "
            "--enable-cache-report"
        ),
        run_async=True,
    ),
)
cmd_id = cmd.cmd_id

What each flag is for:

--tool-call-parser and --reasoning-parser turn the model’s raw output markup into structured tool_calls and reasoning_content fields. Without them the server still runs, but tool calls arrive as unparsed text in content.
--enable-cache-report makes the server report prefix cache hits in each response’s usage stats, which the cache demo below relies on.

Finally the script waits for the server, polling /health_generate through the preview URL while streaming the startup logs to your terminal:

Python

pv = sb.get_preview_link(PORT)
hdr = {"x-daytona-preview-token": pv.token}

deadline = time.time() + BOOT_TIMEOUT
ready = False
printed = 0
while time.time() < deadline:
    # logs are a cumulative snapshot; print only the new tail
    out = sb.process.get_session_command_logs(SESSION, cmd_id).output or ""
    if len(out) > printed:
        sys.stdout.write(out[printed:])
        sys.stdout.flush()
        printed = len(out)
    # the server runs until killed; an exit code means it died
    exit_code = sb.process.get_session_command(SESSION, cmd_id).exit_code
    if exit_code is not None:
        print(f"!! sglang exited with code {exit_code}. Full log saved to {dump_log(cmd_id)}", flush=True)
        sys.exit(1)
    try:
        if requests.get(f"{pv.url}/health_generate", headers=hdr, timeout=10).status_code == 200:
            ready = True
            break
    except requests.RequestException:
        pass
    time.sleep(10)

/health_generate is a stricter readiness check than a plain liveness probe: it runs an actual forward pass, so a 200 means the model is loaded and generating, not merely that the port is open. The preview link is what exposes the server outside the sandbox: pv.url is reachable from anywhere, requests authenticate with the x-daytona-preview-token header, and the URL follows the structure described in the preview docs. If the server process dies during boot, the script sees the exit code on its next poll, saves the full log next to the script, and exits instead of waiting out the timeout.

When the wait ends, the script leaves the sandbox up either way: on success it prints the handoff below so the endpoint keeps serving, and on failure it keeps the sandbox so the already-downloaded weights aren’t lost.

ready - paste into your shell:
export ENDPOINT=https://8000-{sandboxId}.{daytonaProxyDomain}
export TOKEN={previewToken}

sandbox left UP: {sandboxId}
  reconnect:  daytona.get('{sandboxId}')
  delete:     daytona.get('{sandboxId}').delete()

For capacity context: on an H100 sandbox this setup reports a KV cache pool of about 1.2 million tokens (max_total_num_tokens in the startup log) against the model’s 131k context window, which is what lets the classification example later in this guide hold an 825,000-token corpus at once.
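
The capacity claim can be checked with a few lines of arithmetic. This is a rough sketch using the rounded figures quoted in this guide; treating "131k" as 128 × 1024 is an assumption, and the exact pool size on your sandbox is whatever max_total_num_tokens reports in the startup log.

```python
# Rounded figures from this guide; the real pool size is printed as
# max_total_num_tokens in the SGLang startup log.
KV_POOL_TOKENS = 1_200_000   # approximate pool on an H100 sandbox
CONTEXT_WINDOW = 131_072     # "131k" taken as 128 * 1024 (assumption)
PASSAGES = 273
TOKENS_PER_PASSAGE = 3_000   # rough passage length

corpus_tokens = PASSAGES * TOKENS_PER_PASSAGE
full_contexts = KV_POOL_TOKENS // CONTEXT_WINDOW
headroom = KV_POOL_TOKENS - corpus_tokens

print(f"corpus: ~{corpus_tokens:,} tokens")
print(f"full-length contexts that fit: {full_contexts}")
print(f"pool left after the corpus: ~{headroom:,} tokens")
```

With these round numbers the whole corpus occupies roughly two thirds of the pool, which is why all 273 passages can be resident at the same time.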

4. Querying the Endpoint

Paste the printed export lines into your shell, then use any OpenAI-compatible client. The only Daytona-specific detail is the x-daytona-preview-token header; everything else is the standard OpenAI API surface, including stream=True for token streaming.

curl -sS --connect-timeout 30 --max-time 120 "$ENDPOINT/v1/chat/completions" \
  -H "x-daytona-preview-token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Write a haiku about a sandbox where AI agents run code."}],
    "max_tokens": 4096
  }'

import os

from openai import OpenAI

client = OpenAI(
    base_url=f"{os.environ['ENDPOINT']}/v1",
    api_key="EMPTY",  # SGLang doesn't check it; auth is the preview-token header
    default_headers={"x-daytona-preview-token": os.environ["TOKEN"]},
)

resp = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[{"role": "user", "content": "Write a haiku about a sandbox that vanishes when the work is done."}],
    max_tokens=4096,
)
print(resp.choices[0].message.content)

import os

import litellm

resp = litellm.completion(
    model="openai/gpt-oss-20b",  # generic OpenAI-compatible provider
    api_base=f"{os.environ['ENDPOINT']}/v1",
    api_key="EMPTY",
    extra_headers={"x-daytona-preview-token": os.environ["TOKEN"]},
    messages=[{"role": "user", "content": "Write a haiku about calling a model that runs in the cloud."}],
    max_tokens=4096,
)
print(resp.choices[0].message.content)

The subsections below drive that same client through SGLang’s features, each a runnable example in query_openai.py: streaming, reasoning effort, structured output, tool calling, and prefix caching.

Streaming

Set stream=True and tokens arrive as deltas instead of one final message. gpt-oss streams its reasoning channel first and the answer second, in separate fields, so printing both reasoning_content and content follows the whole generation as it comes:

Python

stream = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[{"role": "user", "content": "Write ten haikus about tokens arriving one at a time."}],
    max_tokens=8192,
    stream=True,
)
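for chunk in stream:
    if not chunk.choices:  # skip any chunk that carries no delta
        continue
    delta = chunk.choices[0].delta
    # reasoning_content is a server extension, so read it defensively
    print(getattr(delta, "reasoning_content", None) or delta.content or "", end="", flush=True)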
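
If you would rather keep the two channels apart than print them interleaved, the deltas can be accumulated into separate strings. The sketch below is one way to do that; collect_stream and fake_chunk are hypothetical helpers, and the fake chunks only imitate the attribute layout of streamed responses so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Split streamed deltas into (reasoning, answer) strings."""
    reasoning, answer = [], []
    for chunk in chunks:
        if not chunk.choices:  # e.g. a trailing chunk without a delta
            continue
        delta = chunk.choices[0].delta
        # either field may be missing or None on any given chunk
        reasoning.append(getattr(delta, "reasoning_content", None) or "")
        answer.append(getattr(delta, "content", None) or "")
    return "".join(reasoning), "".join(answer)


def fake_chunk(**fields):
    """Hypothetical stand-in for a streamed chunk, for offline testing."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(**fields))])


demo = [
    fake_chunk(reasoning_content="Count syllables. "),
    fake_chunk(reasoning_content="5-7-5.", content=None),
    fake_chunk(content="Tokens fall "),
    fake_chunk(content="like rain"),
    SimpleNamespace(choices=[]),
]
thinking, reply = collect_stream(demo)
print("reasoning:", thinking)
print("answer:", reply)
```

Against the live endpoint, the same function should accept the stream object returned by client.chat.completions.create(..., stream=True).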

Reasoning on a Dial

The server was started with --reasoning-parser gpt-oss, which separates the model’s thinking from its answer. Thinking is on by default at medium effort; reasoning_effort adjusts it per request, and the parsed trace comes back in the message’s reasoning_content field:

Python

resp = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[{"role": "user", "content": "Write a haiku about thinking before speaking."}],
    reasoning_effort="high",
    max_tokens=8192,
)
print("reasoning:")
print(resp.choices[0].message.reasoning_content)
print("answer:")
print(resp.choices[0].message.content)

The dial matters in both directions: "high" buys more careful answers on hard problems but can spend several thousand reasoning tokens, which is why this request budgets 8192; "low" makes simple tasks faster, cheaper, and less variable, which is why the classification workload below runs at low effort.

Structured Output

Pass a JSON schema as response_format and SGLang constrains decoding to it: every token the model emits must keep the output a valid prefix of schema-conforming JSON, so the reply is guaranteed to parse, with no validate-and-retry loop around the call.

Python

import json

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "lines": {"type": "array", "items": {"type": "string"}, "minItems": 3, "maxItems": 3},
        "season": {"type": "string"},
    },
    "required": ["title", "lines", "season"],
}
resp = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[
        {
            "role": "user",
            "content": "Compose a haiku about GPU sandboxes, as JSON with title, lines, and season.",
        }
    ],
    response_format={"type": "json_schema", "json_schema": {"name": "haiku", "schema": schema}},
    max_tokens=4096,
)
haiku = json.loads(resp.choices[0].message.content)  # guaranteed to parse

With gpt-oss the two features compose cleanly: the model reasons freely in its thinking channel, and the grammar constrains only the final answer. A response can carry a hundred tokens of deliberation in reasoning_content and still deliver schema-perfect JSON in content.

Schemas are not the only constraint SGLang supports. Its native /generate API accepts regular expressions and EBNF grammars in sampling_params, forcing the output to match:

curl

curl -sS "$ENDPOINT/generate" \
  -H "x-daytona-preview-token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The best color for a terminal theme is",
    "sampling_params": {"max_new_tokens": 8, "regex": " (red|green|blue|amber)"}
  }'

The response text is one of the four colors, by construction.
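
To make "by construction" concrete, here is a toy illustration of the idea behind constrained decoding: at each step, only tokens that keep the output a prefix of some allowed string may be chosen. This is not SGLang’s implementation; the vocabulary is made up, and random choice stands in for sampling from masked logits.

```python
import random

ALLOWED = [" red", " green", " blue", " amber"]  # the regex alternatives
# made-up toy vocabulary, including tokens the constraint must reject
VOCAB = [" r", "ed", " gr", "een", " bl", "ue", " am", "ber", " pink", "!"]


def allowed_tokens(prefix):
    """Tokens that keep prefix + token a prefix of some allowed string."""
    return [t for t in VOCAB if any(a.startswith(prefix + t) for a in ALLOWED)]


def constrained_sample(rng):
    """Build one output, choosing only among permitted tokens at each step."""
    out = ""
    while out not in ALLOWED:
        out += rng.choice(allowed_tokens(out))  # a real server masks logits instead
    return out


samples = {constrained_sample(random.Random(seed)) for seed in range(50)}
print(sorted(samples))
```

Tokens such as " pink" and "!" are never eligible, so every sample ends up as one of the four colors no matter what the random source prefers.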

Tool Calling

Because the server was started with the gpt-oss tool-call parser, the model can emit structured tool calls. The loop is the standard OpenAI one: the model requests a call, you run it, feed the result back, and the model answers.

Python

import json
import random


def get_weather(city):
    rng = random.Random(city.lower())  # same city, same weather
    temp = rng.randint(-5, 35)
    sky = rng.choice(["sunny", "cloudy", "rainy", "foggy", "windy"])
    return f"{temp}°C and {sky} in {city}"

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [{"role": "user", "content": "Write a haiku about the current weather in Lisbon."}]
resp = client.chat.completions.create(model="gpt-oss-20b", messages=messages, tools=tools, max_tokens=4096)
msg = resp.choices[0].message

if msg.tool_calls:
    messages.append(msg.model_dump(exclude_none=True))
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_weather(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    resp = client.chat.completions.create(model="gpt-oss-20b", messages=messages, max_tokens=4096)
    print(resp.choices[0].message.content)

In this example get_weather runs in the same process as the client. Daytona makes a stronger pattern natural: give each chat session its own CPU sandbox and execute the model’s tool calls there with sandbox.process.code_run(...), so model-written code runs isolated from your machine and from other sessions. The model thinks in the GPU sandbox and its decisions execute in CPU sandboxes, both on Daytona.
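
The loop above calls get_weather directly. With more than one tool, a name-to-function registry keeps the dispatch in one place and turns bad requests into text the model can react to. The sketch below is one possible shape, not part of the example scripts; run_tool and REGISTRY are hypothetical names, and get_weather has a placeholder body.

```python
import json


def get_weather(city):
    return f"weather report for {city}"  # placeholder body for the sketch


REGISTRY = {"get_weather": get_weather}


def run_tool(name, arguments_json, registry=REGISTRY):
    """Execute one requested call; failures are returned as text."""
    if name not in registry:
        return f"error: unknown tool {name!r}"
    try:
        return str(registry[name](**json.loads(arguments_json)))
    except (json.JSONDecodeError, TypeError) as exc:
        # malformed JSON, or arguments that do not match the function
        return f"error: bad arguments ({exc})"


print(run_tool("get_weather", '{"city": "Lisbon"}'))
print(run_tool("delete_everything", "{}"))
```

Inside the tool loop this would replace the direct call, for example result = run_tool(call.function.name, call.function.arguments).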

Prefix Caching

RadixAttention, SGLang’s prefix cache, is on by default: when two requests share a prompt prefix, the second one reuses the first one’s KV cache instead of recomputing it. With --enable-cache-report, every response reports how many prompt tokens came from cache, so the speedup is measurable from the client:

Python

import time

context = (
    "The Daytona platform provides isolated sandboxes for AI agents to safely execute code. " * 60
)
for attempt in (1, 2):
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model="gpt-oss-20b",
        messages=[{"role": "user", "content": context + "Summarize the above in one sentence."}],
        max_tokens=32,
    )
    dt = time.perf_counter() - t0
    details = resp.usage.prompt_tokens_details  # omitted entirely on a cold cache
    cached = details.cached_tokens if details else 0
    print(f"attempt {attempt}: {dt:.2f}s, {cached}/{resp.usage.prompt_tokens} prompt tokens from cache")

A representative run against an H100 sandbox:

attempt 1: 0.90s, 0/976 prompt tokens from cache
attempt 2: 0.42s, 975/976 prompt tokens from cache

The rerun answered about twice as fast because all but one of its prompt tokens skipped prefill. Any shared prefix qualifies, and the cache is a radix tree, so partial overlaps count too: a long system prompt, a few-shot preamble, a document being asked ten different questions, a multi-turn conversation growing one message at a time, each pays the prefill cost once and rides the cache afterwards.
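
The partial-overlap behavior is easier to see in miniature. The toy below stores token sequences in a plain trie and reports how many leading tokens of each new request were already present. It illustrates prefix matching only and is not how RadixAttention is implemented; PrefixCache is a hypothetical class, and words stand in for tokens.

```python
class PrefixCache:
    """Toy token trie: reports how many leading tokens were seen before."""

    def __init__(self):
        self.root = {}

    def match_and_insert(self, tokens):
        node, hits, matching = self.root, 0, True
        for tok in tokens:
            if matching and tok in node:
                hits += 1               # still on a previously stored path
            else:
                matching = False        # first miss ends the shared prefix
                node.setdefault(tok, {})
            node = node[tok]
        return hits


cache = PrefixCache()
doc = ["The", "Daytona", "platform", "provides", "sandboxes", "."]
first = cache.match_and_insert(doc + ["Summarize", "."])
second = cache.match_and_insert(doc + ["Summarize", "."])
third = cache.match_and_insert(doc + ["Translate", "."])
print(first, second, third)
```

The repeated request matches in full and the request with a new question matches only the shared document. Unlike this toy, the real server reported 975 of 976 tokens cached in the run above rather than a complete match.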

5. Classifying the Classics

Everything so far was one request at a time, but the capacity numbers from boot (a 1.2 million token KV pool) are about concurrency. classify_passages.py puts them to work on a task with verifiable answers: it downloads thirteen classic books from Project Gutenberg (cached locally after the first run), slices them into 273 passages of roughly 3,000 tokens, and classifies every passage by author, all 273 sent at once for SGLang to batch on the GPU, roughly 825,000 tokens of prompt in one pass.

The response format is a schema whose enum is the list of candidate authors, so the model cannot answer anything but one of the thirteen names. Each prompt leads with the passage and ends with the question, so the 3,000-token passage becomes the cached prefix: a second question over the same passages reuses it instead of paying prefill again. Both passes run at reasoning_effort="low" to keep generation light on a high-volume run.

Python

import asyncio
import json
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url=f"{os.environ['ENDPOINT']}/v1",
    api_key="EMPTY",
    default_headers={"x-daytona-preview-token": os.environ["TOKEN"]},
)

AUTHORS = ["Austen", "Bronte", "Dickens", "Doyle", "Eliot", "Hawthorne",
           "Melville", "Poe", "Shelley", "Stoker", "Twain", "Wells", "Wilde"]
AUTHOR_QUESTION = f"Which of these authors wrote this passage: {', '.join(AUTHORS)}?"
AUTHOR_SCHEMA = {"type": "object", "required": ["author"],
                 "properties": {"author": {"type": "string", "enum": AUTHORS}}}

SETTING_QUESTION = "Is this scene set indoors or outdoors?"
SETTING_SCHEMA = {"type": "object", "required": ["setting"],
                  "properties": {"setting": {"type": "string", "enum": ["indoors", "outdoors"]}}}

# the passage leads and the question trails, so the passage is the cached prefix
async def classify(passage, question, schema):
    resp = await client.chat.completions.create(
        model="gpt-oss-20b",
        messages=[{"role": "user", "content": f"{passage}\n\n{question} Reply as JSON."}],
        response_format={"type": "json_schema", "json_schema": {"name": "answer", "schema": schema}},
        reasoning_effort="low",
        max_tokens=2048,
    )
    return json.loads(resp.choices[0].message.content)

# pass 1 prefills and caches the passages; pass 2 asks a new question and reuses them
authors = await asyncio.gather(*(classify(p, AUTHOR_QUESTION, AUTHOR_SCHEMA) for _, p in dataset))
settings = await asyncio.gather(*(classify(p, SETTING_QUESTION, SETTING_SCHEMA) for _, p in dataset))
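
Scoring the author pass against ground truth takes only a few lines. The sketch below assumes the true author of each passage is known (the excerpt’s "for _, p in dataset" suggests dataset holds author and passage pairs, but that is an inference); score is a hypothetical helper and the sample data is made up, not taken from the run.

```python
from collections import Counter


def score(truth, predicted):
    """Return (correct, total, per-author Counter of correct answers)."""
    per_author = Counter()
    for t, p in zip(truth, predicted):
        if t == p:
            per_author[t] += 1
    return sum(per_author.values()), len(truth), per_author


# tiny made-up sample, not data from the run
truth = ["Austen", "Austen", "Melville", "Eliot"]
predicted = ["Austen", "Bronte", "Melville", "Dickens"]
correct, total, per_author = score(truth, predicted)
print(f"accuracy: {correct}/{total} ({correct / total:.0%})")
print(dict(per_author))
```

With the results above, predicted would be something like [a["author"] for a in authors], since each classify call returns the parsed JSON object.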

A representative run against an H100 sandbox:

pass 1 - author: 273 passages in 22.6s
accuracy: 195/273 (71%)
per author: Austen 15/21, Bronte 13/21, Dickens 17/21, Doyle 20/21, Eliot 6/21,
            Hawthorne 15/21, Melville 20/21, Poe 13/21, Shelley 11/21,
            Stoker 17/21, Twain 16/21, Wells 18/21, Wilde 14/21
in:  825,388 tok (36,577 tok/s, 131.7M/hour)
out: 27,693 tok (1,227 tok/s)

pass 2 - setting: 273 passages in 4.8s (4.7x faster)
predominantly indoors:  Wilde 19/21, Austen 18/21, Bronte 18/21, Eliot 18/21,
                        Doyle 17/21, Stoker 15/21, Dickens 14/21, Poe 14/21
predominantly outdoors: Melville 19/21, Twain 15/21, Wells 15/21,
                        Shelley 14/21, Hawthorne 13/21
in:  817,198 tok (171,405 tok/s, 617.1M/hour including cache hits)
out: 13,341 tok (2,798 tok/s)
cached: 813,103/817,198 prompt tokens from cache

Three things worth reading out of those numbers:

Throughput: the endpoint ingested documents at roughly 37,000 tokens per second, around 130 million input tokens per hour from one GPU sandbox. Document workloads like this are prefill-bound, which is why output is only a trickle here; a generation-heavy workload is decode-bound instead, and its output token rate would be substantially higher.
Accuracy with a confusion pattern: 71 percent against ground truth over thirteen candidates (it varies a few points between runs, since sampling is on by default; pass temperature=0 for deterministic output). The errors are not random: Conan Doyle and Melville come back almost perfectly, while George Eliot’s Middlemarch is the hardest to place.
The second question reuses the cache: pass two asks something different about the same passages, whether each scene is set indoors or outdoors. Because the passages lead every prompt, RadixAttention still holds them, so 813,000 of 817,000 prompt tokens come from cache and the pass finishes in 4.8 seconds instead of 23, about 4.7 times faster. The calls sort the library the way you might expect: the drawing-room and detective novels (Austen, Doyle, Wilde, Eliot) come out indoors, while Moby Dick and Huck Finn come out outdoors.
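
The headline figures can be re-derived from the printed run. The sketch below uses only numbers copied from the output above; the rates come out slightly lower than the printed ones, presumably because the elapsed times shown are rounded.

```python
# Figures copied from the representative run above.
PASSAGES = 273
P1_IN, P1_SECONDS = 825_388, 22.6
P2_IN, P2_SECONDS, P2_CACHED = 817_198, 4.8, 813_103

p1_rate = P1_IN / P1_SECONDS        # input tokens per second, pass 1
p1_per_hour = p1_rate * 3600
speedup = P1_SECONDS / P2_SECONDS
cache_share = P2_CACHED / P2_IN
uncached = P2_IN - P2_CACHED        # prompt tokens not served from cache

print(f"pass 1: {p1_rate:,.0f} tok/s, {p1_per_hour / 1e6:.0f}M tok/hour")
print(f"pass 2: {speedup:.1f}x faster, {cache_share:.1%} of prompt tokens cached")
print(f"pass 2 uncached: {uncached:,} tokens, {uncached / PASSAGES:.0f} per passage")
```

The last line is the interesting one: pass two left only about 15 uncached prompt tokens per passage, which fits a short new question appended to a passage that was already cached.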

6. Access and Authentication

Everything so far used the default preview auth: the token in the x-daytona-preview-token header, the best fit for code you control. The alternatives, in increasing order of openness:

Setup	Client needs	Good for
Preview token header (guide default)	base URL + custom header	your own code
Signed URL	URL only; expires on schedule	temporary sharing
Public preview + SGLang API key	base URL + `api_key`	pointing existing apps at your model
Public preview, no key	base URL only	quick demos

Both proxy alternatives are one step away. sb.create_signed_preview_url(PORT, expires_in_seconds=3600) returns a signed URL with a short-lived token baked in, so the client needs only the URL and it expires on schedule (the default is just 60 seconds, so pass expires_in_seconds explicitly). A public preview goes further and drops the proxy’s auth entirely: set public=True in the sandbox create params (the same CreateSandboxFromImageParams used to create the sandbox), and anyone with the URL can reach the server.

SGLang also has its own key check, independent of the proxy: add --api-key your-secret-key to the launch command and the server requires Authorization: Bearer your-secret-key, exactly what OpenAI-compatible clients send as their api_key. Pair that with a public preview and the endpoint takes the standard OpenAI shape, base URL plus key, usable by any tool that accepts only those two fields.
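
On the client side, the access modes differ only in what gets passed to the OpenAI constructor. The helper below is a hypothetical sketch of that mapping; client_kwargs is not part of the example scripts, the hostname is a placeholder, and for signed URLs it assumes the URL can take a /v1 path suffix.

```python
def client_kwargs(endpoint, preview_token=None, api_key=None):
    """Keyword arguments for openai.OpenAI(...) under each access mode."""
    kwargs = {
        "base_url": f"{endpoint.rstrip('/')}/v1",
        # the SDK requires some key; SGLang ignores it unless --api-key is set
        "api_key": api_key or "EMPTY",
    }
    if preview_token:
        kwargs["default_headers"] = {"x-daytona-preview-token": preview_token}
    return kwargs


# placeholder hostname, for illustration only
header_mode = client_kwargs("https://8000-example.invalid/", preview_token="tok")
key_mode = client_kwargs("https://8000-example.invalid", api_key="your-secret-key")
print(header_mode)
print(key_mode)
```

Usage would be OpenAI(**client_kwargs(...)): the preview-token mode adds the custom header, while the public-preview-plus-key mode needs nothing beyond base URL and key.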

Code running inside the sandbox skips all of it and talks to http://localhost:8000 directly; the SGLang image ships the openai package, so sb.process.code_run snippets using the SDK work as-is. That colocated shape fits batch inference over data uploaded into the sandbox, or a self-contained agent that calls the local model and runs the code it writes, all in one sandbox.

7. Swapping Models

To serve a different model, change three things in serve_sglang.py:

MODEL: the Hugging Face model ID
SERVED_AS: the name clients will pass as model
The --tool-call-parser and --reasoning-parser flags, which are model-family specific (or set both to auto)

For gated models, set HF_TOKEN in your .env; the script forwards it into the sandbox automatically. Keep in mind that the model has to fit on a single GPU, since sandboxes have at most one.

8. Scaling Up: gpt-oss-120b on One GPU

The 20b’s big sibling also fits on a single H100, which is the point of its MXFP4 quantization: 117B parameters in about 67 GB of weights. It needs three additions to the launch command, because the weights alone occupy about 85 percent of VRAM and the default settings leave no room for anything else:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python3 -m sglang.launch_server --model-path openai/gpt-oss-120b \
  --mem-fraction-static 0.93 --cuda-graph-max-bs 64 ...

The allocator setting prevents fragmentation failures while the MXFP4 weights are converted during loading; the raised memory fraction makes room for a KV cache at all (SGLang’s automatically chosen value computes a negative pool size for this model); the graph cap keeps CUDA graph capture inside what remains.
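
A back-of-the-envelope calculation shows why the memory fraction has to go up. This sketch assumes an 80 GB H100 and that --mem-fraction-static bounds model weights plus the KV cache pool, as SGLang’s documentation describes; it ignores GB versus GiB differences and every other consumer of memory, so treat the results as rough.

```python
GPU_GB = 80.0        # assumption: an 80 GB H100
WEIGHTS_GB = 67.0    # weight size quoted above
MEM_FRACTION = 0.93  # --mem-fraction-static from the launch command

static_budget = GPU_GB * MEM_FRACTION  # weights + KV cache pool
kv_pool = static_budget - WEIGHTS_GB   # what is left for the KV cache
break_even = WEIGHTS_GB / GPU_GB       # below this fraction the pool is negative

print(f"static budget: {static_budget:.1f} GB")
print(f"KV cache pool: {kv_pool:.1f} GB")
print(f"break-even fraction: {break_even:.2f}")
```

Under these assumptions any fraction below roughly 0.84 leaves a negative pool, consistent with the failure described above, and 0.93 leaves only a few gigabytes for the cache.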

9. Going Further: One Endpoint, Many Sandboxes

A single GPU sandbox is one worker. SGLang’s companion Model Gateway is built to sit in front of many: among its load-balancing strategies is a cache-aware one that tracks each worker’s radix cache and routes prefix-sharing requests to the worker that already has them cached. Since every Daytona sandbox exposes its server through its own preview URL, a fleet of single-GPU sandboxes behind one gateway becomes a horizontally scaled endpoint, with workers joining and leaving as you create and delete sandboxes.
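
The idea behind cache-aware routing fits in a few lines. The toy below sends each prompt to the worker that has already served the longest matching prefix and breaks ties by load. It is an illustration only, not the Model Gateway’s algorithm; ToyRouter and the worker names are hypothetical, and characters stand in for tokens.

```python
def shared_prefix_len(a, b):
    """Number of leading items two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyRouter:
    def __init__(self, workers):
        self.seen = {w: [] for w in workers}  # prompts each worker has served

    def route(self, prompt):
        def key(worker):
            best = max((shared_prefix_len(prompt, p) for p in self.seen[worker]), default=0)
            return (best, -len(self.seen[worker]))  # longest prefix, then least loaded

        worker = max(self.seen, key=key)
        self.seen[worker].append(prompt)
        return worker


router = ToyRouter(["sandbox-a", "sandbox-b"])
r1 = router.route("Moby-Dick ch.1 | who wrote this?")
r2 = router.route("Emma ch.1 | who wrote this?")
r3 = router.route("Moby-Dick ch.1 | indoors or outdoors?")
print(r1, r2, r3)
```

The second question about the same chapter lands on the worker that already holds it, which is the behavior that lets a fleet keep the prefix-cache benefits shown earlier.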

10. Configuration Options

Constants at the top of serve_sglang.py:

Parameter	Default	Description
`MODEL`	`openai/gpt-oss-20b`	Hugging Face model ID to serve
`SERVED_AS`	`gpt-oss-20b`	Model name exposed by the API
`SGLANG_IMAGE`	`lmsysorg/sglang:v0.5.12.post1-cu130`	SGLang Docker image
`PORT`	`8000`	Port the server listens on
`TARGET`	`us-east-1`	Current region for GPU sandboxes
`BOOT_TIMEOUT`	`900`	Seconds to wait for the server to become healthy

Key advantages of this approach:

No infrastructure to manage: one script turns the stock SGLang image into a live GPU endpoint, with no cluster to run, image to build, or drivers to install
Fast and ephemeral: the endpoint is live about five minutes after you run the script, and the sandbox is disposable, deleted when you are done and billed only while it runs
Reachable anywhere, OpenAI-compatible: the token-authenticated preview URL works from any machine, and the API is the standard OpenAI surface, so existing clients and SDKs work unchanged
The serving features come with it: schema-constrained JSON, measurable prefix caching, separated reasoning, and tool calling, all from the stock image and a handful of flags