# Serve LLMs on GPU Sandboxes with SGLang

This guide demonstrates how to serve large language models on a Daytona [GPU sandbox](https://www.daytona.io/docs/en/sandboxes.md#gpu-sandboxes) with [SGLang](https://docs.sglang.ai/), and query them from anywhere through a token-authenticated preview URL. The worked example serves [gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b), OpenAI's open-weights reasoning model, but the same script serves any model SGLang supports.

SGLang serves an OpenAI-compatible API, so existing clients work unchanged. Beyond plain chat, the guide highlights some of SGLang's features:

- **Structured output**: constrained decoding against a JSON schema, so replies are guaranteed to parse
- **Prefix caching you can see**: RadixAttention reuses the KV cache of repeated prompt prefixes, and the server reports the hit count on every response
- **Batched workload**: the final example classifies 273 passages from thirteen classic books by author, combining reasoning with structured output and sending them concurrently for SGLang to batch on the GPU, then checks the model's accuracy against ground truth

The serving side is a single script, `serve_sglang.py`: it creates the sandbox, starts the server inside it, and prints the endpoint and its access token once the server is healthy. The model served is gpt-oss-20b, a mixture-of-experts model whose MXFP4-quantized weights fit in about 15 GB of VRAM, leaving the rest of the GPU free for serving capacity.

---

### 1. Workflow Overview

The script `serve_sglang.py` performs four steps to get a live endpoint:

1. **Create**: A GPU sandbox boots straight from the official `lmsysorg/sglang` image, which already carries the whole serving stack, so there is nothing to build or install
2. **Serve**: `sglang.launch_server` runs inside the sandbox as an async session command, loading the model while the script keeps control
3. **Wait**: The script polls `/health_generate` over the preview URL until the model clears a real forward pass, echoing the startup logs as they stream
4. **Hand off**: Once the server is healthy, it prints ready-to-paste `export ENDPOINT=...` and `export TOKEN=...` lines

Four clients then show the endpoint in action: `query.sh` (curl), `query_openai.py` (OpenAI SDK, including the reasoning, structured output, and cache examples below), `query_litellm.py` (LiteLLM), and `classify_passages.py` (the batch classification workload).

### 2. Setup

Clone the [Daytona repository](https://github.com/daytonaio/daytona.git), navigate to the example directory, and install into a virtual environment:

```bash
git clone https://github.com/daytonaio/daytona.git
cd daytona/guides/python/model-serving/sglang
python3 -m venv venv
source venv/bin/activate
pip install -e .
```

This installs the `daytona` SDK along with the `openai` and `litellm` clients used by the query examples.

Get your Daytona API key from the [Daytona Dashboard](https://app.daytona.io/dashboard/keys) and set it in a `.env` file:

```bash
cp .env.example .env
# edit .env with your API key
```

The `.env.example` also has an optional `HF_TOKEN` entry. It is not needed for gpt-oss, which is not gated; it only matters if you swap in a gated model, though Hugging Face recommends a token for faster, less throttled downloads in general.

### 3. Launching the Server

`serve_sglang.py` first creates a GPU sandbox in `us-east-1`, currently the region for GPU sandboxes, directly from the official SGLang image:

```python
import os
import sys
import time

import requests
from dotenv import load_dotenv

from daytona import (
    CreateSandboxFromImageParams,
    Daytona,
    DaytonaConfig,
    GpuType,
    Image,
    Resources,
    SessionExecuteRequest,
)

load_dotenv()

MODEL = "openai/gpt-oss-20b"
SERVED_AS = "gpt-oss-20b"
SGLANG_IMAGE = "lmsysorg/sglang:v0.5.12.post1-cu130"
PORT = 8000
TARGET = "us-east-1"  # current region for GPU sandboxes
SESSION = "sglang"  # name of the background session the server runs in
BOOT_TIMEOUT = 900  # max seconds to wait for the server to come up

daytona = Daytona(DaytonaConfig(target=TARGET))
env_vars = {"HF_TOKEN": os.environ["HF_TOKEN"]} if os.environ.get("HF_TOKEN") else {}
sb = daytona.create(
    CreateSandboxFromImageParams(
        image=Image.base(SGLANG_IMAGE),
        resources=Resources(
            gpu=1,
            gpu_type=[GpuType.H100, GpuType.RTX_PRO_6000],  # preference order
        ),
        auto_stop_interval=0,
        ephemeral=True,
        env_vars=env_vars,
    ),
    timeout=600,
)
```

The stock image ships the whole serving stack; the sandbox adds the GPU. `gpu_type` takes a single type or a priority list, `gpu=1` is the current per-sandbox maximum, and `auto_stop_interval=0` keeps the endpoint alive until you delete the sandbox.

The server then runs as a session command with `run_async=True`, so the script keeps control while the model loads:

```python
sb.process.create_session(SESSION)
cmd = sb.process.execute_session_command(
    SESSION,
    SessionExecuteRequest(
        command=(
            f"python3 -m sglang.launch_server --model-path {MODEL} "
            f"--served-model-name {SERVED_AS} "
            f"--port {PORT} "
            "--tool-call-parser gpt-oss --reasoning-parser gpt-oss "
            "--enable-cache-report"
        ),
        run_async=True,
    ),
)
cmd_id = cmd.cmd_id
```

What each flag is for:

- `--tool-call-parser` and `--reasoning-parser` turn the model's raw output markup into structured `tool_calls` and `reasoning_content` fields. Without them the server still runs, but tool calls arrive as unparsed text in `content`.
- `--enable-cache-report` makes the server report prefix cache hits in each response's usage stats, which the cache demo below relies on.

:::note[Changing the model]
Parser names must match the model family and your SGLang version; both flags also accept `auto`, which detects the parser from the model's chat template. The full lists are in the SGLang [tool parser](https://docs.sglang.ai/advanced_features/tool_parser.html) and [reasoning parser](https://docs.sglang.ai/advanced_features/separate_reasoning.html) docs.
:::

Finally the script waits for the server, polling `/health_generate` through the preview URL while streaming the startup logs to your terminal:

```python
pv = sb.get_preview_link(PORT)
hdr = {"x-daytona-preview-token": pv.token}

deadline = time.time() + BOOT_TIMEOUT
ready = False
printed = 0
while time.time() < deadline:
    # logs are a cumulative snapshot; print only the new tail
    out = sb.process.get_session_command_logs(SESSION, cmd_id).output or ""
    if len(out) > printed:
        sys.stdout.write(out[printed:])
        sys.stdout.flush()
        printed = len(out)
    # the server runs until killed; an exit code means it died
    exit_code = sb.process.get_session_command(SESSION, cmd_id).exit_code
    if exit_code is not None:
        # dump_log is a helper in serve_sglang.py (not shown here) that saves the full log to a file
        print(f"!! sglang exited with code {exit_code}. Full log saved to {dump_log(cmd_id)}", flush=True)
        sys.exit(1)
    try:
        if requests.get(f"{pv.url}/health_generate", headers=hdr, timeout=10).status_code == 200:
            ready = True
            break
    except requests.RequestException:
        pass  # not reachable yet; keep polling
    time.sleep(10)
```

`/health_generate` is a stricter readiness check than a plain liveness probe: it runs an actual forward pass, so a 200 means the model is loaded and generating, not merely that the port is open. The preview link is what exposes the server outside the sandbox: `pv.url` is reachable from anywhere, requests authenticate with the `x-daytona-preview-token` header, and the URL follows the structure described in the [preview docs](https://www.daytona.io/docs/en/preview.md). If the server process dies during boot, the script notices the exit code immediately, saves the full log next to the script, and exits instead of waiting out the timeout.

When the loop ends, the script leaves the sandbox up either way: on success it prints the handoff below so the endpoint keeps serving, and on failure it keeps the sandbox so the already-downloaded weights aren't lost.

```
ready - paste into your shell:
export ENDPOINT=https://8000-{sandboxId}.{daytonaProxyDomain}
export TOKEN={previewToken}

sandbox left UP: {sandboxId}
  reconnect:  daytona.get('{sandboxId}')
  delete:     daytona.get('{sandboxId}').delete()
```

For capacity context: on an H100 sandbox this setup reports a KV cache pool of about 1.2 million tokens (`max_total_num_tokens` in the startup log) against the model's 131k context window, which is what lets the classification example later in this guide hold an 825,000-token corpus at once.
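
As a rough check of those capacity figures, the arithmetic below uses only numbers quoted in this guide. It is a back-of-the-envelope sketch: the exact context length of 131,072 tokens is an assumption behind the "131k" figure, and real admission depends on SGLang's scheduler, not on simple division.

```python
# Back-of-the-envelope check using figures quoted in this guide.
KV_POOL_TOKENS = 1_200_000   # approximate max_total_num_tokens on an H100 sandbox
CONTEXT_WINDOW = 131_072     # assumed exact value of the "131k" context window
CORPUS_TOKENS = 825_388      # prompt tokens in the classification run (section 5)

# How many maximum-length requests the pool could hold side by side.
full_context_requests = KV_POOL_TOKENS // CONTEXT_WINDOW
# How much of the pool the whole classification corpus occupies.
corpus_share = CORPUS_TOKENS / KV_POOL_TOKENS

print(f"full-context requests that fit at once: {full_context_requests}")
print(f"share of the pool the corpus occupies: {corpus_share:.0%}")
```

The pool holds roughly nine full-length contexts, and the whole corpus takes about two thirds of it, which is why the batch can be resident at once.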

### 4. Querying the Endpoint

Paste the printed `export` lines into your shell, then use any OpenAI-compatible client. The only Daytona-specific detail is the `x-daytona-preview-token` header; everything else is the standard OpenAI API surface, including `stream=True` for token streaming.

With curl (`query.sh`):

```bash
curl -sS --connect-timeout 30 --max-time 120 "$ENDPOINT/v1/chat/completions" \
  -H "x-daytona-preview-token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "Write a haiku about a sandbox where AI agents run code."}],
    "max_tokens": 4096
  }'
```

With the OpenAI SDK (`query_openai.py`):

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url=f"{os.environ['ENDPOINT']}/v1",
    api_key="EMPTY",  # SGLang doesn't check it; auth is the preview-token header
    default_headers={"x-daytona-preview-token": os.environ["TOKEN"]},
)

resp = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[{"role": "user", "content": "Write a haiku about a sandbox that vanishes when the work is done."}],
    max_tokens=4096,
)
print(resp.choices[0].message.content)
```

With LiteLLM (`query_litellm.py`):

```python
import os

import litellm

resp = litellm.completion(
    model="openai/gpt-oss-20b",  # generic OpenAI-compatible provider
    api_base=f"{os.environ['ENDPOINT']}/v1",
    api_key="EMPTY",
    extra_headers={"x-daytona-preview-token": os.environ["TOKEN"]},
    messages=[{"role": "user", "content": "Write a haiku about calling a model that runs in the cloud."}],
    max_tokens=4096,
)
print(resp.choices[0].message.content)
```

:::caution[Budget for thinking]
That `max_tokens: 4096` on a haiku request is not a typo. gpt-oss reasons before it answers, and `max_tokens` covers reasoning plus answer combined. If thinking exhausts the budget, the request returns `finish_reason: "length"` with the truncated trace in `reasoning_content` and `content: null`, which looks like the model returned nothing. The thinking length also varies widely between identical runs. Give requests generous budgets, or turn `reasoning_effort` down for simple tasks.
:::
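
Because a truncated reply looks like an empty one, it can help to test for that case explicitly. The sketch below is one possible check, not part of the guide's scripts: `thinking_exhausted_budget` and `fake_response` are hypothetical names, and the stand-in objects only mimic the attribute shape of a chat completion so the example runs without a server.

```python
from types import SimpleNamespace


def thinking_exhausted_budget(resp) -> bool:
    """True when the reply was cut off before any answer text was produced."""
    choice = resp.choices[0]
    return choice.finish_reason == "length" and not choice.message.content


def fake_response(finish_reason, content, reasoning):
    # Stand-in with the same attribute shape as a chat completion (illustration only).
    message = SimpleNamespace(content=content, reasoning_content=reasoning)
    choice = SimpleNamespace(finish_reason=finish_reason, message=message)
    return SimpleNamespace(choices=[choice])


truncated = fake_response("length", None, "Let me count syllables...")
complete = fake_response("stop", "Quiet sandbox hums", "Short plan.")

for name, resp in (("truncated", truncated), ("complete", complete)):
    print(name, thinking_exhausted_budget(resp))
```

When the check fires, retrying with a larger `max_tokens` or a lower `reasoning_effort` is usually more useful than treating the reply as empty.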

The subsections below drive that same client through SGLang's features, each a runnable example in `query_openai.py`: streaming, reasoning effort, structured output, tool calling, and prefix caching.

#### Streaming

Set `stream=True` and tokens arrive as deltas instead of one final message. gpt-oss streams its reasoning channel first and the answer second, in separate fields, so printing both `reasoning_content` and `content` follows the whole generation as it comes:

```python
stream = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[{"role": "user", "content": "Write ten haikus about tokens arriving one at a time."}],
    max_tokens=8192,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    print(delta.reasoning_content or delta.content or "", end="", flush=True)
```
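
If you want the trace and the answer as separate strings instead of one interleaved printout, the deltas can be accumulated into two buffers. This is a minimal sketch under stated assumptions: `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks stand in for a real stream so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Split a chat-completion stream into (reasoning, answer) strings."""
    reasoning, answer = [], []
    for chunk in chunks:
        if not chunk.choices:  # some chunks (e.g. usage-only) carry no choices
            continue
        delta = chunk.choices[0].delta
        # reasoning_content is a server extension field, so read it defensively
        reasoning.append(getattr(delta, "reasoning_content", None) or "")
        answer.append(getattr(delta, "content", None) or "")
    return "".join(reasoning), "".join(answer)


def fake_chunk(reasoning=None, content=None):
    # Stand-in for one streamed chunk (illustration only).
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [
    fake_chunk(reasoning="Plan: 5-7-5. "),
    fake_chunk(reasoning="Draft it."),
    fake_chunk(content="Tokens drip "),
    fake_chunk(content="like rain"),
]
thoughts, text = collect_stream(demo)
print("reasoning:", thoughts)
print("answer:", text)
```

With a live endpoint, passing the `stream` object from the example above in place of `demo` should work the same way.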

#### Reasoning on a Dial

The server was started with `--reasoning-parser gpt-oss`, which separates the model's thinking from its answer. Thinking is on by default at medium effort; `reasoning_effort` adjusts it per request, and the parsed trace comes back in the message's `reasoning_content` field:

```python
resp = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[{"role": "user", "content": "Write a haiku about thinking before speaking."}],
    reasoning_effort="high",
    max_tokens=8192,
)
print("reasoning:")
print(resp.choices[0].message.reasoning_content)
print("answer:")
print(resp.choices[0].message.content)
```

The dial matters in both directions: `"high"` buys more careful answers on hard problems but can spend several thousand reasoning tokens, which is why this request budgets 8192; `"low"` makes simple tasks faster, cheaper, and less variable, which is why the classification workload below runs at low effort.

#### Structured Output

Pass a JSON schema as `response_format` and SGLang constrains decoding to it: every token the model emits must keep the output a valid prefix of schema-conforming JSON, so the reply is guaranteed to parse, with no validate-and-retry loop around the call.

```python
import json

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "lines": {"type": "array", "items": {"type": "string"}, "minItems": 3, "maxItems": 3},
        "season": {"type": "string"},
    },
    "required": ["title", "lines", "season"],
}
resp = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[
        {
            "role": "user",
            "content": "Compose a haiku about GPU sandboxes, as JSON with title, lines, and season.",
        }
    ],
    response_format={"type": "json_schema", "json_schema": {"name": "haiku", "schema": schema}},
    max_tokens=4096,
)
haiku = json.loads(resp.choices[0].message.content)  # guaranteed to parse
```

With gpt-oss the two features compose cleanly: the model reasons freely in its thinking channel, and the grammar constrains only the final answer. A response can carry a hundred tokens of deliberation in `reasoning_content` and still deliver schema-perfect JSON in `content`.

:::caution[The grammar guarantees shape, not sense]
Describe the expected structure in the prompt too. The grammar can only mask out invalid tokens; it cannot make the model put the right values in the right fields. A prompt that agrees with the schema keeps the model and the grammar pulling in the same direction.
:::
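
Since the grammar only guarantees shape, a light semantic check after parsing can catch values that are valid JSON but make no sense. The sketch below is illustrative only: `check_haiku` and the `SEASONS` allow-list are assumptions, not part of `query_openai.py`.

```python
import json

SEASONS = {"spring", "summer", "autumn", "winter"}  # illustrative allow-list


def check_haiku(raw: str) -> list:
    """Return a list of problems that the JSON schema alone would not catch."""
    haiku = json.loads(raw)
    problems = []
    if haiku["season"].strip().lower() not in SEASONS:
        problems.append(f"unexpected season: {haiku['season']!r}")
    if any(not line.strip() for line in haiku["lines"]):
        problems.append("empty line")
    if not haiku["title"].strip():
        problems.append("empty title")
    return problems


good = json.dumps({"title": "Sandbox", "season": "spring",
                   "lines": ["GPU wakes at dawn", "weights settle in memory", "the port answers"]})
odd = json.dumps({"title": "Sandbox", "season": "GPU", "lines": ["a", "", "c"]})

print(check_haiku(good))
print(check_haiku(odd))
```

Where the set of valid values is known up front, an `enum` in the schema moves this check into the grammar itself, as the classification example in section 5 does for author names.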

Schemas are not the only constraint SGLang supports. Its native `/generate` API accepts regular expressions and EBNF grammars in `sampling_params`, forcing the output to match:

```bash
curl -sS "$ENDPOINT/generate" \
  -H "x-daytona-preview-token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The best color for a terminal theme is",
    "sampling_params": {"max_new_tokens": 8, "regex": " (red|green|blue|amber)"}
  }'
```

The response text is one of the four colors, by construction.

#### Tool Calling

Because the server was started with the `gpt-oss` tool-call parser, the model can emit structured tool calls. The loop is the standard OpenAI one: the model requests a call, you run it, feed the result back, and the model answers.

```python
import json
import random


def get_weather(city):
    rng = random.Random(city.lower())  # same city, same weather
    temp = rng.randint(-5, 35)
    sky = rng.choice(["sunny", "cloudy", "rainy", "foggy", "windy"])
    return f"{temp}°C and {sky} in {city}"

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [{"role": "user", "content": "Write a haiku about the current weather in Lisbon."}]
resp = client.chat.completions.create(model="gpt-oss-20b", messages=messages, tools=tools, max_tokens=4096)
msg = resp.choices[0].message

if msg.tool_calls:
    # echo the assistant's tool-call message back, then append one tool result per call
    messages.append(msg.model_dump(exclude_none=True))
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_weather(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    resp = client.chat.completions.create(model="gpt-oss-20b", messages=messages, max_tokens=4096)
    print(resp.choices[0].message.content)
```

In this example `get_weather` runs in the same process as the client. Daytona makes a stronger pattern natural: give each chat session its own CPU sandbox and execute the model's tool calls there with `sandbox.process.code_run(...)`, so model-written code runs isolated from your machine and from other sessions. The model thinks in the GPU sandbox and its decisions execute in CPU sandboxes, both on Daytona.
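
Once there is more than one tool, the loop needs a way to map the model's requested name to a function and to survive malformed arguments. The sketch below shows one way to do that with an allow-list; `TOOLS`, `run_tool_call`, and the canned `get_weather` are hypothetical, and the fake call objects only mimic the shape of `msg.tool_calls` entries.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> str:
    return f"18C and sunny in {city}"  # canned reply for the sketch


TOOLS = {"get_weather": get_weather}  # only names listed here may run


def run_tool_call(call) -> dict:
    """Execute one tool call and wrap the outcome as a `tool` message."""
    func = TOOLS.get(call.function.name)
    if func is None:
        result = f"error: unknown tool {call.function.name!r}"
    else:
        try:
            result = func(**json.loads(call.function.arguments))
        except (json.JSONDecodeError, TypeError) as exc:
            # bad JSON, non-object arguments, or wrong parameter names
            result = f"error: bad arguments ({exc})"
    return {"role": "tool", "tool_call_id": call.id, "content": str(result)}


def fake_call(call_id, name, arguments):
    # Stand-in for one entry of msg.tool_calls (illustration only).
    return SimpleNamespace(id=call_id, function=SimpleNamespace(name=name, arguments=arguments))


ok = run_tool_call(fake_call("call_1", "get_weather", '{"city": "Lisbon"}'))
unknown = run_tool_call(fake_call("call_2", "delete_files", "{}"))
bad_args = run_tool_call(fake_call("call_3", "get_weather", '{"town": "Lisbon"}'))
print(ok["content"], unknown["content"], bad_args["content"], sep="\n")
```

Returning the error text as the tool result, instead of raising, gives the model a chance to correct its call on the next turn.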

#### Prefix Caching

RadixAttention, SGLang's prefix cache, is on by default: when two requests share a prompt prefix, the second one reuses the first one's KV cache instead of recomputing it. With `--enable-cache-report`, every response reports how many prompt tokens came from cache, so the speedup is measurable from the client:

```python
import time

context = (
    "The Daytona platform provides isolated sandboxes for AI agents to safely execute code. " * 60
)
for attempt in (1, 2):
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model="gpt-oss-20b",
        messages=[{"role": "user", "content": context + "Summarize the above in one sentence."}],
        max_tokens=32,
    )
    dt = time.perf_counter() - t0
    details = resp.usage.prompt_tokens_details  # omitted entirely on a cold cache
    cached = details.cached_tokens if details else 0
    print(f"attempt {attempt}: {dt:.2f}s, {cached}/{resp.usage.prompt_tokens} prompt tokens from cache")
```

A representative run against an H100 sandbox:

```
attempt 1: 0.90s, 0/976 prompt tokens from cache
attempt 2: 0.42s, 975/976 prompt tokens from cache
```

The rerun answered about twice as fast (0.42 s versus 0.90 s) because only one of its 976 prompt tokens needed prefill; the other 975 came from cache. Any shared prefix qualifies, and the cache is a radix tree, so partial overlaps count too: a long system prompt, a few-shot preamble, a document being asked ten different questions, a multi-turn conversation growing one message at a time, each pays the prefill cost once and rides the cache afterwards.

:::tip[Re-measuring from cold]
`POST $ENDPOINT/flush_cache` resets the radix cache, useful when you want to demonstrate the cold-versus-warm difference again.
:::
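
To turn the two printed lines into comparable metrics, the hit rate and speedup can be derived directly. The sketch below only re-derives numbers from the representative run shown above; timings on your own sandbox will differ.

```python
# Figures copied from the representative run above.
runs = [
    {"seconds": 0.90, "cached": 0, "prompt": 976},    # attempt 1, cold cache
    {"seconds": 0.42, "cached": 975, "prompt": 976},  # attempt 2, warm cache
]

cold, warm = runs
hit_rate = warm["cached"] / warm["prompt"]   # fraction of the prompt served from cache
speedup = cold["seconds"] / warm["seconds"]  # wall-clock ratio, cold over warm
uncached = warm["prompt"] - warm["cached"]   # prompt tokens that still needed prefill

print(f"hit rate {hit_rate:.1%}, speedup {speedup:.1f}x, {uncached} prompt token(s) recomputed")
```

The speedup is smaller than the hit rate suggests because the 32 output tokens still have to be decoded one by one on both attempts.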

### 5. Classifying the Classics

Everything so far was one request at a time, but the capacity numbers from boot (a 1.2 million token KV pool) are about concurrency. `classify_passages.py` puts them to work on a task with verifiable answers: it downloads thirteen classic books from Project Gutenberg (cached locally after the first run), slices them into 273 passages of roughly 3,000 tokens, and classifies every passage by author, all 273 sent at once for SGLang to batch on the GPU, roughly 825,000 tokens of prompt in one pass.

The response format is a schema whose `enum` is the list of candidate authors, so the model cannot answer anything but one of the thirteen names. Each prompt leads with the passage and ends with the question, so the 3,000-token passage becomes the cached prefix: a second question over the same passages reuses it instead of paying prefill again. Both passes run at `reasoning_effort="low"` to keep generation light on a high-volume run.

```python
import asyncio
import json
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url=f"{os.environ['ENDPOINT']}/v1",
    api_key="EMPTY",
    default_headers={"x-daytona-preview-token": os.environ["TOKEN"]},
)

AUTHORS = ["Austen", "Bronte", "Dickens", "Doyle", "Eliot", "Hawthorne",
           "Melville", "Poe", "Shelley", "Stoker", "Twain", "Wells", "Wilde"]
AUTHOR_QUESTION = f"Which of these authors wrote this passage: {', '.join(AUTHORS)}?"
AUTHOR_SCHEMA = {"type": "object", "required": ["author"],
                 "properties": {"author": {"type": "string", "enum": AUTHORS}}}

SETTING_QUESTION = "Is this scene set indoors or outdoors?"
SETTING_SCHEMA = {"type": "object", "required": ["setting"],
                  "properties": {"setting": {"type": "string", "enum": ["indoors", "outdoors"]}}}

# the passage leads and the question trails, so the passage is the cached prefix
async def classify(passage, question, schema):
    resp = await client.chat.completions.create(
        model="gpt-oss-20b",
        messages=[{"role": "user", "content": f"{passage}\n\n{question} Reply as JSON."}],
        response_format={"type": "json_schema", "json_schema": {"name": "answer", "schema": schema}},
        reasoning_effort="low",
        max_tokens=2048,
    )
    return json.loads(resp.choices[0].message.content)

# excerpt: these awaits run inside an async function; each dataset item is a pair whose second element is the passage
# pass 1 prefills and caches the passages; pass 2 asks a new question and reuses them
authors = await asyncio.gather(*(classify(p, AUTHOR_QUESTION, AUTHOR_SCHEMA) for _, p in dataset))
settings = await asyncio.gather(*(classify(p, SETTING_QUESTION, SETTING_SCHEMA) for _, p in dataset))
```

A representative run against an H100 sandbox:

```
pass 1 - author: 273 passages in 22.6s
accuracy: 195/273 (71%)
per author: Austen 15/21, Bronte 13/21, Dickens 17/21, Doyle 20/21, Eliot 6/21,
            Hawthorne 15/21, Melville 20/21, Poe 13/21, Shelley 11/21,
            Stoker 17/21, Twain 16/21, Wells 18/21, Wilde 14/21
in:  825,388 tok (36,577 tok/s, 131.7M/hour)
out: 27,693 tok (1,227 tok/s)

pass 2 - setting: 273 passages in 4.8s (4.7x faster)
predominantly indoors:  Wilde 19/21, Austen 18/21, Bronte 18/21, Eliot 18/21,
                        Doyle 17/21, Stoker 15/21, Dickens 14/21, Poe 14/21
predominantly outdoors: Melville 19/21, Twain 15/21, Wells 15/21,
                        Shelley 14/21, Hawthorne 13/21
in:  817,198 tok (171,405 tok/s, 617.1M/hour including cache hits)
out: 13,341 tok (2,798 tok/s)
cached: 813,103/817,198 prompt tokens from cache
```

Three things worth reading out of those numbers:

- **Throughput**: the endpoint ingested documents at roughly 37,000 tokens per second, around 130 million input tokens per hour from one GPU sandbox. Document workloads like this are prefill-bound, which is why output is only a trickle here; a generation-heavy workload is decode-bound instead, and its output token rate would be substantially higher.
- **Accuracy with a confusion pattern**: 71 percent against ground truth over thirteen candidates (it varies a few points between runs, since sampling is on by default; pass `temperature=0` for deterministic output). The errors are not random: Conan Doyle and Melville come back almost perfectly, while George Eliot's Middlemarch is the hardest to place.
- **The second question reuses the cache**: pass two asks something different about the same passages, whether each scene is set indoors or outdoors. Because the passages lead every prompt, RadixAttention still holds them, so 813,000 of 817,000 prompt tokens come from cache and the pass finishes in 4.8 seconds instead of 23, about 4.7 times faster. The calls sort the library the way you might expect: the drawing-room and detective novels (Austen, Doyle, Wilde, Eliot) come out indoors, while Moby Dick and Huck Finn come out outdoors.
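
The headline figures can be re-derived from the printed run, which is a useful habit when comparing your own numbers against it. The sketch below uses only values copied from the output above; the small gap between the computed input rate and the printed 36,577 tok/s comes from the rounded 22.6-second wall time.

```python
# Per-author correct counts and totals copied from the representative run above.
correct = {"Austen": 15, "Bronte": 13, "Dickens": 17, "Doyle": 20, "Eliot": 6,
           "Hawthorne": 15, "Melville": 20, "Poe": 13, "Shelley": 11,
           "Stoker": 17, "Twain": 16, "Wells": 18, "Wilde": 14}
PER_AUTHOR = 21  # passages per author

total = PER_AUTHOR * len(correct)
accuracy = sum(correct.values()) / total
hardest = min(correct, key=correct.get)

pass1_rate = 825_388 / 22.6       # prompt tokens per second, cold cache
pass2_speedup = 22.6 / 4.8        # wall-clock ratio between the two passes
cache_share = 813_103 / 817_198   # fraction of pass-2 prompt tokens served from cache

print(f"{sum(correct.values())}/{total} correct ({accuracy:.0%}), hardest: {hardest}")
print(f"pass 1: {pass1_rate:,.0f} tok/s in; pass 2: {pass2_speedup:.1f}x faster, {cache_share:.1%} cached")
```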

### 6. Access and Authentication

Everything so far used the default preview auth: the token in the `x-daytona-preview-token` header, the best fit for code you control. The alternatives, in increasing order of openness:

| Setup | Client needs | Good for |
|-------|--------------|----------|
| Preview token header (guide default) | base URL + custom header | your own code |
| Signed URL | URL only; expires on schedule | temporary sharing |
| Public preview + SGLang API key | base URL + `api_key` | pointing existing apps at your model |
| Public preview, no key | base URL only | quick demos |

Both proxy alternatives are one step away. `sb.create_signed_preview_url(PORT, expires_in_seconds=3600)` returns a signed URL with a short-lived token baked in, so the client needs only the URL and it expires on schedule (the default is just 60 seconds, so pass `expires_in_seconds` explicitly). A public preview goes further and drops the proxy's auth entirely: set `public=True` in the sandbox create params (the same `CreateSandboxFromImageParams` used to create the sandbox), and anyone with the URL can reach the server.

SGLang also has its own key check, independent of the proxy: add `--api-key your-secret-key` to the launch command and the server requires `Authorization: Bearer your-secret-key`, exactly what OpenAI-compatible clients send as their `api_key`. Pair that with a public preview and the endpoint takes the standard OpenAI shape, base URL plus key, usable by any tool that accepts only those two fields.

Code running inside the sandbox skips all of it and talks to `http://localhost:8000` directly; the SGLang image ships the `openai` package, so `sb.process.code_run` snippets using the SDK work as-is. That colocated shape fits batch inference over data uploaded into the sandbox, or a self-contained agent that calls the local model and runs the code it writes, all in one sandbox.
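
The access modes differ only in what the client is configured with, which a small helper can make explicit. This is a sketch under assumptions: `client_kwargs` and its mode names are hypothetical, the URLs are placeholders, and it assumes a signed URL accepts the `/v1` path suffix like any other base URL.

```python
def client_kwargs(mode: str, url: str, secret: str = "") -> dict:
    """Return keyword arguments for `openai.OpenAI(...)` for one access mode."""
    kwargs = {"base_url": f"{url.rstrip('/')}/v1", "api_key": "EMPTY"}
    if mode == "preview-token":    # guide default: the proxy checks a custom header
        kwargs["default_headers"] = {"x-daytona-preview-token": secret}
    elif mode == "api-key":        # public preview plus SGLang's --api-key
        kwargs["api_key"] = secret
    elif mode in ("signed-url", "public", "local"):
        pass                       # the URL alone is enough
    else:
        raise ValueError(f"unknown mode: {mode}")
    return kwargs


token_mode = client_kwargs("preview-token", "https://sandbox.example.invalid/", "preview-token-value")
key_mode = client_kwargs("api-key", "https://sandbox.example.invalid", "your-secret-key")
local_mode = client_kwargs("local", "http://localhost:8000")
print(token_mode, key_mode, local_mode, sep="\n")
```

Each result can be unpacked into the SDK constructor, for example `OpenAI(**key_mode)`.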

### 7. Swapping Models

To serve a different model, change three things in `serve_sglang.py`:

- `MODEL`: the Hugging Face model ID
- `SERVED_AS`: the name clients will pass as `model`
- The `--tool-call-parser` and `--reasoning-parser` flags, which are model-family specific (or set both to `auto`)

For gated models, set `HF_TOKEN` in your `.env`; the script forwards it into the sandbox automatically. Keep in mind that the model has to fit on a single GPU, since sandboxes have at most one.
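
One way to keep those three changes in a single place is to build the launch command from parameters. The sketch below is an illustration, not how `serve_sglang.py` is written: `launch_command` is a hypothetical helper, and it uses only the flags that already appear in this guide, defaulting both parsers to `auto`.

```python
import shlex


def launch_command(model: str, served_as: str, port: int = 8000,
                   tool_parser: str = "auto", reasoning_parser: str = "auto") -> str:
    """Assemble the sglang.launch_server command line as a shell-safe string."""
    args = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model,
        "--served-model-name", served_as,
        "--port", str(port),
        "--tool-call-parser", tool_parser,
        "--reasoning-parser", reasoning_parser,
        "--enable-cache-report",
    ]
    return shlex.join(args)  # quotes any argument that needs it


default_cmd = launch_command("openai/gpt-oss-20b", "gpt-oss-20b",
                             tool_parser="gpt-oss", reasoning_parser="gpt-oss")
print(default_cmd)
```

The resulting string can be passed as the `command` of the `SessionExecuteRequest` shown in section 3.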

### 8. Scaling Up: gpt-oss-120b on One GPU

The 20b's big sibling also fits on a single H100, which is the point of its MXFP4 quantization: 117B parameters in about 67 GB of weights. It needs three additions to the launch command, because the weights occupy about 85 percent of VRAM and the default settings leave no room for a KV cache:

```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python3 -m sglang.launch_server --model-path openai/gpt-oss-120b \
  --mem-fraction-static 0.93 --cuda-graph-max-bs 64 ...
```

The allocator setting prevents fragmentation failures while the MXFP4 weights are converted during loading; the raised memory fraction makes room for a KV cache at all (SGLang's automatically chosen value computes a negative pool size for this model); the graph cap keeps CUDA graph capture inside what remains.

:::caution[A small KV pool]
The resulting KV pool is about 89,000 tokens, a fraction of the model's 131k context window and less than a tenth of the 20b's pool. One request can use at most that much, and the classification workload above would not fit a quarter of its batch. The 120b on one sandbox is a capable single-user reasoning endpoint, not quite a concurrent-serving one.
:::
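
The comparisons in that caution follow from figures already quoted in this guide. The sketch below just spells out the arithmetic; the exact context length of 131,072 tokens is an assumption behind the "131k" figure.

```python
POOL_120B = 89_000        # approximate KV pool for gpt-oss-120b on one H100
POOL_20B = 1_200_000      # approximate KV pool for gpt-oss-20b on one H100
CONTEXT_WINDOW = 131_072  # assumed exact value of the "131k" context window
BATCH_TOKENS = 825_388    # prompt tokens in the classification run (section 5)

context_share = POOL_120B / CONTEXT_WINDOW  # how much of one full context fits
pool_ratio = POOL_120B / POOL_20B           # 120b pool relative to the 20b pool
quarter_batch = BATCH_TOKENS // 4           # a quarter of the classification batch

print(f"share of one full context that fits: {context_share:.0%}")
print(f"relative to the 20b pool: {pool_ratio:.1%}")
print(f"quarter of the batch: {quarter_batch:,} tokens vs {POOL_120B:,} available")
```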

### 9. Going Further: One Endpoint, Many Sandboxes

A single GPU sandbox is one worker. SGLang's companion [Model Gateway](https://github.com/sgl-project/sglang/tree/main/sgl-model-gateway) is built to sit in front of many: among its load-balancing strategies is a cache-aware one that tracks each worker's radix cache and routes prefix-sharing requests to the worker that already has them cached. Since every Daytona sandbox exposes its server through its own preview URL, a fleet of single-GPU sandboxes behind one gateway becomes a horizontally scaled endpoint, with workers joining and leaving as you create and delete sandboxes.
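
To make the idea of cache-aware routing concrete, here is a toy illustration. It is not the Model Gateway's algorithm or API: `ToyRouter` is hypothetical, the worker names are placeholders, and character-level prefix matching stands in for the token-level radix cache a real worker keeps.

```python
import os


class ToyRouter:
    """Route each prompt to the worker whose past prompts share the longest prefix."""

    def __init__(self, workers):
        self.seen = {w: [] for w in workers}  # worker -> prompts routed there so far

    def _overlap(self, worker, prompt):
        # Longest common prefix (in characters) with anything this worker has seen.
        return max((len(os.path.commonprefix([prompt, p])) for p in self.seen[worker]), default=0)

    def route(self, prompt):
        # Prefer the longest shared prefix; break ties toward the less-loaded worker.
        best = max(self.seen, key=lambda w: (self._overlap(w, prompt), -len(self.seen[w])))
        self.seen[best].append(prompt)
        return best


router = ToyRouter(["worker-a", "worker-b"])
doc1, doc2 = "Moby Dick, chapter 1 ... ", "Emma, chapter 1 ... "
first = router.route(doc1 + "Who wrote this?")
second = router.route(doc2 + "Who wrote this?")
third = router.route(doc1 + "Indoors or outdoors?")
print(first, second, third)
```

The third request lands on the same worker as the first because they share the passage prefix, which is the behavior that lets a follow-up question reuse a worker's cached prefill.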

### 10. Configuration Options

Constants at the top of `serve_sglang.py`:

| Parameter | Default | Description |
|-----------|---------|-------------|
| `MODEL` | `openai/gpt-oss-20b` | Hugging Face model ID to serve |
| `SERVED_AS` | `gpt-oss-20b` | Model name exposed by the API |
| `SGLANG_IMAGE` | `lmsysorg/sglang:v0.5.12.post1-cu130` | SGLang Docker image |
| `PORT` | `8000` | Port the server listens on |
| `TARGET` | `us-east-1` | Current region for GPU sandboxes |
| `SESSION` | `sglang` | Name of the background session the server runs in |
| `BOOT_TIMEOUT` | `900` | Seconds to wait for the server to become healthy |

:::tip[SGLang Tuning]
`sglang.launch_server` is extensively configurable; the full list is in the [server arguments reference](https://docs.sglang.ai/advanced_features/server_arguments.html). For example, `--mem-fraction-static` adjusts how much VRAM the server claims, and `--context-length` trims the context window to free memory for the KV cache.
:::

---

**Key advantages of this approach:**

- **No infrastructure to manage**: one script turns the stock SGLang image into a live GPU endpoint, with no cluster to run, image to build, or drivers to install
- **Fast and ephemeral**: the endpoint is live about five minutes after you run the script, and the sandbox is disposable, deleted when you are done and billed only while it runs
- **Reachable anywhere, OpenAI-compatible**: the token-authenticated preview URL works from any machine, and the API is the standard OpenAI surface, so existing clients and SDKs work unchanged
- **The serving features come with it**: schema-constrained JSON, measurable prefix caching, separated reasoning, and tool calling, all from the stock image and a handful of flags