# Serve LLMs on GPU Sandboxes with vLLM

This guide demonstrates how to serve an open-weights model on a Daytona [GPU sandbox](https://www.daytona.io/docs/en/sandboxes.md#gpu-sandboxes) with [vLLM](https://docs.vllm.ai/en/stable/) and query it from anywhere through a token-authenticated preview URL.

The serving side is a single script: it creates the sandbox, starts vLLM inside it, and prints the endpoint and its access token once the server is healthy. The endpoint is OpenAI-compatible, so existing clients work without modification; the guide shows examples querying it with curl, the OpenAI SDK, and LiteLLM. The model served is `google/gemma-4-26B-A4B-it`, but any model vLLM can serve works the same way.

---

### 1. Workflow Overview

Four steps take you from an API key to a live endpoint:

1. **Create**: Spin up a GPU sandbox from the stock `vllm/vllm-openai` image; no custom image build is needed
2. **Serve**: Start `vllm serve` as a background session command inside the sandbox
3. **Wait**: Poll the server's `/health` endpoint through the preview URL while streaming the startup logs to your terminal
4. **Hand off**: Print paste-ready `export ENDPOINT=...` and `export TOKEN=...` lines

All four are handled by a single script, `serve_vllm.py`; once it finishes, the sandbox keeps serving until you delete it. Three small clients show the endpoint in action: `query.sh` (curl), `query_openai.py` (OpenAI SDK with chat, streaming, and tool calling), and `query_litellm.py` (LiteLLM).

### 2. Setup

#### Clone the Repository

Clone the [Daytona repository](https://github.com/daytonaio/daytona.git) and navigate to the example directory:

```bash
git clone https://github.com/daytonaio/daytona.git
cd daytona/guides/python/model-serving/vllm
```

#### Create Virtual Environment

```bash
python3 -m venv venv
source venv/bin/activate
```

#### Install Dependencies

```bash
pip install -e .
```

This installs the `daytona` SDK along with the `openai` and `litellm` clients used by the query examples.

#### Configure Environment

Get your Daytona API key from the [Daytona Dashboard](https://app.daytona.io/dashboard/keys) and set it in a `.env` file:

```bash
cp .env.example .env
# edit .env with your API key
```

The `.env.example` also has an optional `HF_TOKEN` entry. It is only required for gated Hugging Face models, though Hugging Face recommends a token for faster, less throttled downloads in general.

### 3. Understanding the Code

Let's walk through `serve_vllm.py`, the script that creates the sandbox and starts the server.

#### Creating the GPU Sandbox

The script targets `us-east-1`, currently the region for GPU sandboxes, and creates one directly from the official vLLM image:

```python
import os
import sys
import time

import requests
from dotenv import load_dotenv

from daytona import (
    CreateSandboxFromImageParams,
    Daytona,
    DaytonaConfig,
    GpuType,
    Image,
    Resources,
    SessionExecuteRequest,
)

load_dotenv()

MODEL = "google/gemma-4-26B-A4B-it"
SERVED_AS = "gemma-4-moe"
VLLM_IMAGE = "vllm/vllm-openai:v0.22.1"
PORT = 8000
TARGET = "us-east-1"  # current region for GPU sandboxes
SESSION = "vllm"  # name of the background session the server runs in
BOOT_TIMEOUT = 900  # max seconds to wait for the server to come up

daytona = Daytona(DaytonaConfig(target=TARGET))
env_vars = {"HF_TOKEN": os.environ["HF_TOKEN"]} if os.environ.get("HF_TOKEN") else {}
sb = daytona.create(
    CreateSandboxFromImageParams(
        image=Image.base(VLLM_IMAGE),
        resources=Resources(
            gpu=1,
            gpu_type=[GpuType.H100, GpuType.RTX_PRO_6000],  # preference order
        ),
        auto_stop_interval=0,
        ephemeral=True,
        env_vars=env_vars,
    ),
    timeout=600,
)
```

A few things worth noting:

- **Stock image**: `Image.base` pulls `vllm/vllm-openai` as-is. The whole serving stack ships in the image; the sandbox just adds the GPU.
- **One GPU per sandbox**: `gpu=1` is currently the per-sandbox maximum.
- **GPU preference**: `gpu_type` takes a single type or a priority list; the sandbox gets the first type with availability, here an H100 with an RTX PRO 6000 fallback.
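- **No idle stop**: `auto_stop_interval=0` keeps the endpoint alive until you delete the sandbox.
- **Ephemeral**: `ephemeral=True` means the sandbox is deleted automatically once it is stopped, rather than lingering in a stopped state.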
- **HF_TOKEN passthrough**: the script forwards your token into the sandbox if you set one; it is required only for gated models, and the default model is not gated.

#### Starting the Server

The server runs as a session command with `run_async=True`, so the script keeps control while vLLM boots:

```python
sb.process.create_session(SESSION)
cmd = sb.process.execute_session_command(
    SESSION,
    SessionExecuteRequest(
        command=(
            f"vllm serve {MODEL} --port {PORT} "
            f"--served-model-name {SERVED_AS} "
            "--enable-auto-tool-choice --tool-call-parser gemma4 "
            "--reasoning-parser gemma4 "
            "--enable-prefix-caching"
        ),
        run_async=True,
    ),
)
cmd_id = cmd.cmd_id
```

The flags expose the model under the short name `gemma-4-moe` and enable tool calling and reasoning output parsing. The call returns immediately, and `cmd_id` is the handle for asking about the command later; the wait loop below uses it to fetch logs and check whether the server is still alive.

:::note[Changing the model]
The `--tool-call-parser` and `--reasoning-parser` names must match the model family and your vLLM version, or `vllm serve` won't start. Check the vLLM [tool calling](https://docs.vllm.ai/en/stable/features/tool_calling.html) and [reasoning outputs](https://docs.vllm.ai/en/stable/features/reasoning_outputs/) docs for the right parser names when swapping models.
:::
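
When swapping models, it can help to assemble the command from its parts so the parser names sit next to the model ID. The following is a minimal sketch, assuming a hypothetical helper `build_serve_command` that is not part of `serve_vllm.py`; it uses only the flags shown above:

```python
import shlex


def build_serve_command(model, served_as, port, tool_parser=None, reasoning_parser=None):
    """Assemble a `vllm serve` command line (hypothetical helper, not from serve_vllm.py)."""
    args = ["vllm", "serve", model, "--port", str(port), "--served-model-name", served_as]
    if tool_parser:  # tool calling uses both flags together
        args += ["--enable-auto-tool-choice", "--tool-call-parser", tool_parser]
    if reasoning_parser:
        args += ["--reasoning-parser", reasoning_parser]
    args.append("--enable-prefix-caching")
    return shlex.join(args)  # quotes each argument safely for the shell


print(build_serve_command("google/gemma-4-26B-A4B-it", "gemma-4-moe", 8000, "gemma4", "gemma4"))
```

Leaving out a parser argument drops the corresponding flags, which is one way to serve a model that has no matching parser in your vLLM version.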

#### Waiting for the Server

Model download and loading take a few minutes. While waiting, the script streams the server logs to your terminal and polls `/health` through the preview URL, giving up after `BOOT_TIMEOUT` (900 seconds by default):

```python
pv = sb.get_preview_link(PORT)
hdr = {"x-daytona-preview-token": pv.token}

deadline = time.time() + BOOT_TIMEOUT
ready = False
printed = 0
while time.time() < deadline:
    # logs are a cumulative snapshot; print only the new tail
    out = sb.process.get_session_command_logs(SESSION, cmd_id).output or ""
    if len(out) > printed:
        sys.stdout.write(out[printed:])
        sys.stdout.flush()
        printed = len(out)
    # vllm serve runs until killed; an exit code means it died
    exit_code = sb.process.get_session_command(SESSION, cmd_id).exit_code
    if exit_code is not None:
        # dump_log is a helper in serve_vllm.py that saves the full log next to the script
        print(f"!! vllm exited with code {exit_code}. Full log saved to {dump_log(cmd_id)}", flush=True)
        sys.exit(1)
    try:
        if requests.get(f"{pv.url}/health", headers=hdr, timeout=10).status_code == 200:
            ready = True
            break
    except requests.RequestException:
        pass  # proxy or server not reachable yet; retry on the next pass
    time.sleep(10)
```

The preview link is the piece that exposes the server outside the sandbox: `pv.url` is reachable from anywhere, and requests authenticate with the `x-daytona-preview-token` header. The URL follows the structure `https://{port}-{sandboxId}.{daytonaProxyDomain}`, as described in the [preview docs](https://www.daytona.io/docs/en/preview.md). The same URL and token the script uses for health checks are the ones your clients will use for inference.

If the server process dies during boot, the script notices the exit code immediately, saves the full server log next to the script, and exits instead of waiting out the timeout.
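
The loop has three possible outcomes: the server becomes healthy, the process exits, or the deadline passes. Below is a minimal sketch of that control flow, with the Daytona and HTTP calls replaced by injected callables so it runs anywhere. The names `wait_for_server`, `is_healthy`, and `get_exit_code` are hypothetical and not part of `serve_vllm.py`:

```python
import time


def wait_for_server(is_healthy, get_exit_code, timeout, interval=10.0, sleep=time.sleep):
    """Return "ready", "died", or "timeout" (hypothetical helper mirroring the loop above)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_exit_code() is not None:  # a long-running server should never exit
            return "died"
        if is_healthy():
            return "ready"
        sleep(interval)
    return "timeout"


# Simulated boot: the health check fails twice, then succeeds.
checks = iter([False, False, True])
print(wait_for_server(lambda: next(checks), lambda: None, timeout=5, sleep=lambda s: None))
```

Returning a status instead of calling `sys.exit` inside the loop keeps the decision about what to do on failure with the caller.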

Once healthy, the script prints the handoff:

```
ready - paste into your shell:
export ENDPOINT=https://8000-{sandboxId}.{daytonaProxyDomain}
export TOKEN={previewToken}

sandbox left UP: {sandboxId}
  reconnect:  daytona.get('{sandboxId}')
  delete:     daytona.get('{sandboxId}').delete()
```

The sandbox stays up either way: on success so the endpoint keeps serving, on failure so the already-downloaded weights aren't lost. Delete it when you're done.

### 4. Querying the Endpoint

Paste the printed `export` lines into your shell, then use any OpenAI-compatible client. Each example below ships as a ready-to-run file in the directory you cloned: `query.sh` for curl, `query_openai.py` for the OpenAI SDK, and `query_litellm.py` for LiteLLM.

**curl** (`query.sh`):

```bash
curl -sS --connect-timeout 30 --max-time 120 "$ENDPOINT/v1/chat/completions" \
  -H "x-daytona-preview-token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-moe",
    "messages": [{"role": "user", "content": "Write a haiku about sandboxes for AI agents."}],
    "max_tokens": 64
  }'
```

**OpenAI SDK** (`query_openai.py`):

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url=f"{os.environ['ENDPOINT']}/v1",
    api_key="EMPTY",  # vLLM doesn't check it; auth is the preview-token header
    default_headers={"x-daytona-preview-token": os.environ["TOKEN"]},
)

resp = client.chat.completions.create(
    model="gemma-4-moe",
    messages=[{"role": "user", "content": "Write a haiku about ephemeral sandboxes."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```

**LiteLLM** (`query_litellm.py`):

```python
import os

import litellm

resp = litellm.completion(
    model="hosted_vllm/gemma-4-moe",  # OpenAI-compatible vLLM server
    api_base=f"{os.environ['ENDPOINT']}/v1",
    api_key="EMPTY",
    extra_headers={"x-daytona-preview-token": os.environ["TOKEN"]},
    messages=[{"role": "user", "content": "Write a haiku about agents running code in the cloud."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```

The only Daytona-specific detail in any of these is the `x-daytona-preview-token` header. Everything else is the standard OpenAI API surface. The next three examples continue with the OpenAI SDK, following `query_openai.py` through streaming, reasoning, and tool calling.
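
To make the role of the header concrete, here is a sketch of the same chat request built with only the Python standard library. The helper name `build_chat_request` and the example URL and token are placeholders for illustration; the request is constructed but not sent:

```python
import json
import urllib.request


def build_chat_request(endpoint, token, prompt, model="gemma-4-moe", max_tokens=64):
    """Build (but do not send) a chat completion request; hypothetical helper."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{endpoint.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-daytona-preview-token": token,  # the only Daytona-specific part
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; use the ENDPOINT and TOKEN printed by serve_vllm.py.
req = build_chat_request("https://8000-example.proxy.invalid", "preview-token", "Hello")
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req, timeout=120)
```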

#### Streaming

After the plain chat call, `query_openai.py` shows an example that streams tokens as they arrive:

```python
stream = client.chat.completions.create(
    model="gemma-4-moe",
    messages=[{"role": "user", "content": "Write ten haikus about tokens streaming from a sandbox."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```
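
If you also need the full text after streaming, the chunks can be collected as they print. The sketch below assumes chunks shaped like the OpenAI SDK's (`choices[0].delta.content`); `collect_stream` is a hypothetical helper, and the fake chunks stand in for a real stream so the example runs offline. It skips chunks that carry no choices, which can occur, for example, when a final usage-only chunk is requested:

```python
from types import SimpleNamespace


def collect_stream(stream, echo=print):
    """Echo streamed text as it arrives and return it joined (hypothetical helper)."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # e.g. a usage-only chunk carries no choices
            continue
        text = chunk.choices[0].delta.content or ""
        echo(text, end="", flush=True)
        parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Build an object with the same attribute path as an SDK chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


fake_stream = [fake_chunk("Tokens "), fake_chunk(None), SimpleNamespace(choices=[]), fake_chunk("flow")]
full_text = collect_stream(fake_stream)
print()
```

With a real endpoint, pass the `stream` object returned by `client.chat.completions.create(..., stream=True)` in place of `fake_stream`.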

#### Reasoning

The server was started with `--reasoning-parser gemma4`, which separates the model's thinking from its answer. For the gemma-4 family there is a catch: reasoning tokens are never generated unless the request asks for them, which is why the other examples in this guide respond directly. Passing `reasoning_effort` turns thinking mode on, and the parsed trace comes back in the message's `reasoning` field:

```python
resp = client.chat.completions.create(
    model="gemma-4-moe",
    messages=[{"role": "user", "content": "Write a haiku about GPU sandboxes."}],
    reasoning_effort="low",
    max_tokens=2048,
)
print("\nreasoning:")
print(resp.choices[0].message.reasoning)
print("answer:")
print(resp.choices[0].message.content)
```

#### Tool Calling

The script finishes with tool calling. Because the server was started with `--enable-auto-tool-choice` and a tool-call parser, the model can emit structured tool calls. The loop is the standard OpenAI one: the model requests a call, you run it, feed the result back, and the model answers.

```python
import json
import random


def get_weather(city):
    rng = random.Random(city.lower())  # same city, same weather
    temp = rng.randint(-5, 35)
    sky = rng.choice(["sunny", "cloudy", "rainy", "foggy", "windy"])
    return f"{temp}°C and {sky} in {city}"

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# client is the OpenAI client created in the first example
messages = [{"role": "user", "content": "Write a haiku about the current weather in Paris."}]
resp = client.chat.completions.create(model="gemma-4-moe", messages=messages, tools=tools, max_tokens=256)
msg = resp.choices[0].message

if msg.tool_calls:
    messages.append(msg.model_dump(exclude_none=True))
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_weather(**args)
        print(f"\ntool call: {call.function.name}({args})")
        print(f"result:    {result}")
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    resp = client.chat.completions.create(model="gemma-4-moe", messages=messages, max_tokens=256)
    print("final:")
    print(resp.choices[0].message.content)
```
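
With more than one tool, calling `get_weather(**args)` directly no longer works; each call has to be routed by name. Here is a sketch of one way to do that, assuming a hypothetical `dispatch_tool_call` helper and a name-to-function registry (neither is in `query_openai.py`). Errors are returned as strings so they can be fed back to the model as the tool result:

```python
import json


def dispatch_tool_call(registry, name, arguments_json):
    """Run the named tool with JSON-encoded arguments; return its result as a string."""
    func = registry.get(name)
    if func is None:
        return f"error: unknown tool {name!r}"
    try:
        args = json.loads(arguments_json)
        return str(func(**args))
    except (json.JSONDecodeError, TypeError) as exc:  # bad JSON or wrong arguments
        return f"error: {exc}"


def get_time(city):
    """Stand-in tool for the example; returns a fixed string."""
    return f"12:00 in {city}"


registry = {"get_time": get_time}
print(dispatch_tool_call(registry, "get_time", '{"city": "Paris"}'))
print(dispatch_tool_call(registry, "get_stock", "{}"))
```

Inside the loop above, the call would become `dispatch_tool_call(registry, call.function.name, call.function.arguments)`.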

### 5. Access and Authentication

Two independent layers decide who can reach the model: Daytona's preview proxy in front of the sandbox, and vLLM's own API key check inside it. The guide has used one mode of the proxy layer so far; here is the full picture.

#### The Daytona Layer: Preview Links

Every request so far carried the preview token as a header. That is the default mode and the best fit for code you control, since the secret stays out of URLs, logs, and browser history. Preview links support two more modes for when the header is a poor fit.

**Signed URLs** embed a short-lived token in the URL itself:

```python
signed = sb.create_signed_preview_url(PORT, expires_in_seconds=3600)
print(signed.url)  # no headers needed, expires after an hour
```

Anything that accepts only a base URL can now call the model: chat frontends, no-code tools, a colleague's notebook. And because the URL expires on schedule, sharing it is a bounded commitment rather than a permanent grant. Two details to know: the default expiry is only 60 seconds, so pass `expires_in_seconds` explicitly, and `sb.expire_signed_preview_url(PORT, signed.token)` revokes a URL early.

**Public previews** drop the proxy's authentication entirely. Create the sandbox with `public=True` in the create params, and the preview URL serves anyone who has it, for as long as the sandbox stays up.

#### The vLLM Layer: API Keys

Every query example sets `api_key="EMPTY"`. That is because vLLM, unless told otherwise, accepts any key; the field exists only to satisfy client constructors. Add `--api-key your-secret-key` to the `vllm serve` command (or set `VLLM_API_KEY` in the sandbox's `env_vars`) and the check becomes real: the server requires `Authorization: Bearer your-secret-key`, which is exactly what OpenAI-compatible clients send as their `api_key`.

The two layers do not guard the same surface, though. The vLLM key covers the inference routes (`/v1` and similar prefixes), while [other endpoints on the same server](https://docs.vllm.ai/en/stable/usage/security/) accept requests without it. The Daytona token gates everything on the port.

That makes **public preview plus vLLM API key** a combination for sharing with people you broadly trust: the endpoint behaves like a standard OpenAI-style API, configured with nothing but a base URL and an `api_key`, so it works in any tool that accepts only those two fields. For anything more exposed than that, the Daytona layer is the better fit.

| Setup | Client needs | Good for |
|-------|--------------|----------|
| Preview token header (guide default) | base URL + custom header | your own code |
| Signed URL | URL only; expires on schedule | temporary sharing |
| Public preview + vLLM API key | base URL + `api_key` | pointing existing apps at your model |
| Public preview, no key | base URL only | quick demos |
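
The rows of the table translate into different client settings. The sketch below expresses that mapping as keyword arguments for the `OpenAI(...)` constructor. `client_kwargs` and the mode names are hypothetical labels for this example, the URL and secret are placeholders, and the signed-URL branch assumes the signed URL accepts a `/v1` path suffix like the regular preview URL does:

```python
def client_kwargs(mode, base_url, preview_token=None, api_key=None):
    """Map an access setup from the table to OpenAI client keyword arguments."""
    url = f"{base_url.rstrip('/')}/v1"
    if mode == "token_header":  # guide default: the secret travels in a header
        return {
            "base_url": url,
            "api_key": "EMPTY",
            "default_headers": {"x-daytona-preview-token": preview_token},
        }
    if mode == "signed_url":  # base_url is the signed URL; assumed to accept a path suffix
        return {"base_url": url, "api_key": "EMPTY"}
    if mode == "public_with_key":  # vLLM started with --api-key
        return {"base_url": url, "api_key": api_key}
    if mode == "public":  # no authentication at either layer
        return {"base_url": url, "api_key": "EMPTY"}
    raise ValueError(f"unknown mode: {mode}")


print(client_kwargs("public_with_key", "https://8000-example.proxy.invalid", api_key="your-secret-key"))
```

A client would then be created with `OpenAI(**client_kwargs(...))`.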

#### Inside the Sandbox

Everything above governs requests arriving from outside. Code running inside the sandbox can skip all of it and talk to `http://localhost:8000` directly. The vLLM image ships Python with the `openai` package preinstalled (it is a vLLM dependency), so the SDK works there as-is:

```python
from daytona import Daytona, DaytonaConfig

daytona = Daytona(DaytonaConfig(target="us-east-1"))
sb = daytona.get("sandbox-id")  # printed by serve_vllm.py

# The string below is executed inside the sandbox, next to the vLLM server
print(sb.process.code_run("""
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="gemma-4-moe",
    messages=[{"role": "user", "content": "Write a haiku about code that never leaves its sandbox."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
""").result)
```

No auth is needed because the traffic never leaves the sandbox. If you need a package the image doesn't ship, such as `litellm`, install it first with `sb.process.exec("pip install litellm")`.

This colocated shape fits workloads where the data should live next to the model: batch inference (upload a dataset into the sandbox, process it against the local endpoint, download the results) or a self-contained agent that calls the local model and runs the code it writes, all in one sandbox.

### 6. Swapping Models

To serve a different model, change three things in `serve_vllm.py`:

- `MODEL`: the Hugging Face model ID
- `SERVED_AS`: the name clients will pass as `model`
- The `--tool-call-parser` and `--reasoning-parser` flags, which are model-family specific

For gated models, set `HF_TOKEN` in your `.env`; the script forwards it into the sandbox automatically. Keep in mind that the model has to fit on a single GPU, since sandboxes have at most one.
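
A rough way to check the single-GPU constraint is to estimate the memory the weights alone need: parameter count times bytes per parameter. The sketch below is back-of-the-envelope arithmetic, not a vLLM calculation. It assumes 16-bit weights (2 bytes each), reads the `26B` in the model name as roughly 26 billion total parameters, takes 80 GB as the memory of an H100, and uses 0.9 as the fraction of VRAM the server may claim. It ignores the KV cache, activations, and runtime overhead, which all need room too:

```python
def weight_memory_gb(params_billion, bytes_per_param=2):
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


def fits(params_billion, gpu_memory_gb, bytes_per_param=2, utilization=0.9):
    """True if the weights alone fit within the usable fraction of VRAM (assumed 0.9)."""
    return weight_memory_gb(params_billion, bytes_per_param) <= gpu_memory_gb * utilization


print(weight_memory_gb(26))         # 52.0
print(fits(26, gpu_memory_gb=80))   # True: 52 GB of weights against 72 GB usable
print(fits(70, gpu_memory_gb=80))   # False: 140 GB of weights against 72 GB usable
```

For a mixture-of-experts model, the total parameter count is the one to plug in, since all experts are loaded even though only some are active per token. A passing check is necessary but not sufficient; a tight fit leaves little room for the KV cache.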

### 7. Going Further: Sandboxes as Tool Runtimes

In the tool calling example above, `get_weather` runs in the same process as the client. Daytona makes a stronger pattern natural: give each chat session its own CPU sandbox, and execute the model's tool calls there. The GPU sandbox keeps serving every session, while each conversation gets an isolated runtime where model-written code can run, install packages, and touch files without affecting anyone else. When the session ends, delete its sandbox.

Only the tool function's body changes; instead of computing the result locally, it runs the model's request in the session's sandbox, with `sandbox.process.code_run(...)` for code or `sandbox.process.exec(...)` for shell commands, and returns the output. The sandbox can also carry whatever harness the tools need: interpreters, test runners, project dependencies. The tool-calling loop stays exactly the same. Both halves of the application run on Daytona: the GPU sandbox where the model thinks, and the CPU sandboxes where its decisions execute.
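
As a sketch of that change, the tool below takes a session's sandbox and runs model-written Python there instead of locally. `run_python` is a hypothetical tool name; it relies only on `process.code_run(...)` and the `.result` attribute used earlier in this guide. A stub sandbox stands in for a real one so the example runs without Daytona; in practice you would pass the sandbox created for the chat session:

```python
from types import SimpleNamespace


def run_python(sandbox, code):
    """Tool body: execute model-written code in the session's sandbox and return its output."""
    response = sandbox.process.code_run(code)
    return response.result


# Stub with the same attribute path as a Daytona sandbox, for offline illustration only.
def fake_code_run(code):
    return SimpleNamespace(result=f"ran {len(code.splitlines())} line(s)")


stub_sandbox = SimpleNamespace(process=SimpleNamespace(code_run=fake_code_run))
print(run_python(stub_sandbox, "print('hello')\nprint('world')"))
```

Because the sandbox is passed in as an argument, the same tool function can serve every chat session, each with its own isolated runtime.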

### 8. Configuration Options

Constants at the top of `serve_vllm.py`:

| Parameter | Default | Description |
|-----------|---------|-------------|
| `MODEL` | `google/gemma-4-26B-A4B-it` | Hugging Face model ID to serve |
| `SERVED_AS` | `gemma-4-moe` | Model name exposed by the API |
| `VLLM_IMAGE` | `vllm/vllm-openai:v0.22.1` | vLLM Docker image |
| `PORT` | `8000` | Port the server listens on |
| `TARGET` | `us-east-1` | Current region for GPU sandboxes |
| `BOOT_TIMEOUT` | `900` | Seconds to wait for the server to become healthy |

:::tip[vLLM Tuning]
`vllm serve` is extensively configurable; the full list of options is in the [serve command reference](https://docs.vllm.ai/en/stable/cli/serve.html). Anything you add to the command in `serve_vllm.py` works as it would anywhere else, for example `--max-model-len` to trim the context window and free KV-cache memory, or `--gpu-memory-utilization` to adjust how much VRAM vLLM claims.
:::

---

**Key advantages of this approach:**

- **No infrastructure to manage**: one script turns a stock Docker image into a served model on a GPU
- **Fast**: a live endpoint about five minutes after you run the script, with no provisioning or driver setup
- **Reachable anywhere**: the preview URL works from any machine, secured by a token header
- **OpenAI-compatible**: existing clients, SDKs, and frameworks work unchanged