Serve LLMs on GPU Sandboxes with vLLM


This guide demonstrates how to serve an open-weights model on a Daytona GPU sandbox with vLLM and query it from anywhere through a token-authenticated preview URL.

The serving side is a single script: it creates the sandbox, starts vLLM inside it, and prints the endpoint and its access token once the server is healthy. The endpoint is OpenAI-compatible, so existing clients work without modification; the guide shows examples querying it with curl, the OpenAI SDK, and LiteLLM. The model served is google/gemma-4-26B-A4B-it, but any model vLLM can serve works the same way.

1. Workflow Overview

Four steps take you from an API key to a live endpoint:

Create: Spin up a GPU sandbox from the stock vllm/vllm-openai image; no custom image build is needed
Serve: Start vllm serve as a background session command inside the sandbox
Wait: Poll the server’s /health endpoint through the preview URL while streaming the startup logs to your terminal
Hand off: Print paste-ready export ENDPOINT=... and export TOKEN=... lines

All four are handled by a single script, serve_vllm.py; once it finishes, the sandbox keeps serving until you delete it. Three small clients show the endpoint in action: query.sh (curl), query_openai.py (OpenAI SDK with chat, streaming, and tool calling), and query_litellm.py (LiteLLM).

2. Setup

Clone the Repository

Clone the Daytona repository and navigate to the example directory:

git clone https://github.com/daytonaio/daytona.git
cd daytona/guides/python/model-serving/vllm

Create Virtual Environment

python3 -m venv venv
source venv/bin/activate

Install Dependencies

pip install -e .

This installs the daytona SDK along with the openai and litellm clients used by the query examples.

Configure Environment

Get your Daytona API key from the Daytona Dashboard and set it in a .env file:

cp .env.example .env
# edit .env with your API key

The .env.example also has an optional HF_TOKEN entry. It is only required for gated Hugging Face models, though Hugging Face recommends a token for faster, less throttled downloads in general.

3. Understanding the Code

Let’s walk through serve_vllm.py, the script that creates the sandbox and starts the server.

Creating the GPU Sandbox

The script targets us-east-1, currently the region for GPU sandboxes, and creates one directly from the official vLLM image:

Python

import os
import sys
import time

import requests
from dotenv import load_dotenv

from daytona import (
    CreateSandboxFromImageParams,
    Daytona,
    DaytonaConfig,
    GpuType,
    Image,
    Resources,
    SessionExecuteRequest,
)

load_dotenv()

MODEL = "google/gemma-4-26B-A4B-it"
SERVED_AS = "gemma-4-moe"
VLLM_IMAGE = "vllm/vllm-openai:v0.22.1"
PORT = 8000
TARGET = "us-east-1"  # current region for GPU sandboxes
SESSION = "vllm"  # name of the background session the server runs in
BOOT_TIMEOUT = 900  # max seconds to wait for the server to come up

daytona = Daytona(DaytonaConfig(target=TARGET))
env_vars = {"HF_TOKEN": os.environ["HF_TOKEN"]} if os.environ.get("HF_TOKEN") else {}
sb = daytona.create(
    CreateSandboxFromImageParams(
        image=Image.base(VLLM_IMAGE),
        resources=Resources(
            gpu=1,
            gpu_type=[GpuType.H100, GpuType.RTX_PRO_6000],  # preference order
        ),
        auto_stop_interval=0,
        ephemeral=True,
        env_vars=env_vars,
    ),
    timeout=600,
)

A few things worth noting:

Stock image: Image.base pulls vllm/vllm-openai as-is. The whole serving stack ships in the image; the sandbox just adds the GPU.
One GPU per sandbox: gpu=1 is currently the per-sandbox maximum.
GPU preference: gpu_type takes a single type or a priority list; the sandbox gets the first type with availability, here an H100 with an RTX PRO 6000 fallback.
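No idle stop: auto_stop_interval=0 disables the idle timer, so the endpoint stays alive until you delete the sandbox.
Ephemeral: ephemeral=True means the sandbox is deleted as soon as it is stopped, rather than being kept around in a stopped state.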
HF_TOKEN passthrough: the script forwards your token into the sandbox if you set one; it is required only for gated models, and the default model is not gated.
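
The preference-order behaviour of gpu_type can be pictured with a few lines of plain Python. This is only an illustration of the "first type with availability" rule described above, not Daytona's scheduler; pick_gpu and the availability set are hypothetical names invented for the sketch.

```python
from typing import Optional, Sequence, Set


def pick_gpu(preference: Sequence[str], available: Set[str]) -> Optional[str]:
    """Return the first preferred GPU type that is currently available.

    Hypothetical helper: it models the priority-list rule only and does
    not call any Daytona API.
    """
    for gpu in preference:
        if gpu in available:
            return gpu
    return None  # nothing in the list is available


# H100 is preferred, but only the fallback is free in this made-up scenario.
print(pick_gpu(["H100", "RTX_PRO_6000"], {"RTX_PRO_6000"}))
```

Listing the faster card first and a fallback second trades a little predictability in performance for a better chance of getting a sandbox at all.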

Starting the Server

The server runs as a session command with run_async=True, so the script keeps control while vLLM boots:

Python

sb.process.create_session(SESSION)
cmd = sb.process.execute_session_command(
    SESSION,
    SessionExecuteRequest(
        command=(
            f"vllm serve {MODEL} --port {PORT} "
            f"--served-model-name {SERVED_AS} "
            "--enable-auto-tool-choice --tool-call-parser gemma4 "
            "--reasoning-parser gemma4 "
            "--enable-prefix-caching"
        ),
        run_async=True,
    ),
)
cmd_id = cmd.cmd_id

The flags expose the model under the short name gemma-4-moe and enable tool calling and reasoning output parsing. The call returns immediately, and cmd_id is the handle for asking about the command later; the wait loop below uses it to fetch logs and check whether the server is still alive.
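
If you expect to change flags often, it can help to assemble the command from parts rather than editing one long string. The sketch below is one possible way to do that with the standard library; build_vllm_command is a hypothetical helper, not part of serve_vllm.py, and it only knows about the flags that appear in the command above.

```python
import shlex
from typing import Iterable, Optional


def build_vllm_command(
    model: str,
    served_as: str,
    port: int = 8000,
    tool_call_parser: Optional[str] = None,
    reasoning_parser: Optional[str] = None,
    extra_flags: Iterable[str] = (),
) -> str:
    """Assemble a shell-safe `vllm serve` command line (hypothetical helper)."""
    args = ["vllm", "serve", model, "--port", str(port),
            "--served-model-name", served_as]
    if tool_call_parser:
        # Tool calling needs both the switch and a model-family parser.
        args += ["--enable-auto-tool-choice", "--tool-call-parser", tool_call_parser]
    if reasoning_parser:
        args += ["--reasoning-parser", reasoning_parser]
    args += list(extra_flags)
    return shlex.join(args)  # quotes any argument that needs it


print(build_vllm_command(
    "google/gemma-4-26B-A4B-it",
    "gemma-4-moe",
    tool_call_parser="gemma4",
    reasoning_parser="gemma4",
    extra_flags=["--enable-prefix-caching"],
))
```

With the arguments shown, the helper reproduces the command used in this guide; the resulting string can be passed as the command of a SessionExecuteRequest.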

Waiting for the Server

Model download and loading take a few minutes. While waiting, the script streams the server logs to your terminal and polls /health through the preview URL, giving up after BOOT_TIMEOUT (900 seconds by default):

Python

pv = sb.get_preview_link(PORT)
hdr = {"x-daytona-preview-token": pv.token}

deadline = time.time() + BOOT_TIMEOUT
ready = False
printed = 0
while time.time() < deadline:
    # logs are a cumulative snapshot; print only the new tail
    out = sb.process.get_session_command_logs(SESSION, cmd_id).output or ""
    if len(out) > printed:
        sys.stdout.write(out[printed:])
        sys.stdout.flush()
        printed = len(out)
    # vllm serve runs until killed; an exit code means it died
    exit_code = sb.process.get_session_command(SESSION, cmd_id).exit_code
    if exit_code is not None:
        print(f"!! vllm exited with code {exit_code}. Full log saved to {dump_log(cmd_id)}", flush=True)
        sys.exit(1)
    try:
        if requests.get(f"{pv.url}/health", headers=hdr, timeout=10).status_code == 200:
            ready = True
            break
    except requests.RequestException:
        pass
    time.sleep(10)

The preview link is the piece that exposes the server outside the sandbox: pv.url is reachable from anywhere, and requests authenticate with the x-daytona-preview-token header. The URL follows the structure https://{port}-{sandboxId}.{daytonaProxyDomain}, as described in the preview docs. The same URL and token the script uses for health checks are the ones your clients will use for inference.

If the server process dies during boot, the script sees the exit code on its next poll (at most about ten seconds later), saves the full server log next to the script using its dump_log helper (not shown here), and exits instead of waiting out the timeout.
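
The wait loop calls dump_log, whose definition is not reproduced in this guide. A minimal sketch of what such a helper could look like follows; the fetch_logs parameter, the directory argument, and the vllm-{cmd_id}.log file name are assumptions made for illustration, and the real helper in serve_vllm.py may differ.

```python
from pathlib import Path
from typing import Callable


def dump_log(
    cmd_id: str,
    fetch_logs: Callable[[str], str],
    directory: Path = Path("."),
) -> Path:
    """Write the full log of a session command to disk and return the path.

    Hypothetical sketch: `fetch_logs` is injected so the function can be
    exercised without a sandbox.
    """
    path = directory / f"vllm-{cmd_id}.log"  # file name is an assumption
    path.write_text(fetch_logs(cmd_id) or "", encoding="utf-8")
    return path
```

Inside the script, fetch_logs could be a small lambda wrapping the same call the loop already makes, for example lambda cid: sb.process.get_session_command_logs(SESSION, cid).output.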

Once healthy, the script prints the handoff:

ready - paste into your shell:
export ENDPOINT=https://8000-{sandboxId}.{daytonaProxyDomain}
export TOKEN={previewToken}

sandbox left UP: {sandboxId}
  reconnect:  daytona.get('{sandboxId}')
  delete:     daytona.get('{sandboxId}').delete()

The sandbox stays up either way: on success so the endpoint keeps serving, on failure so the already-downloaded weights aren’t lost. Delete it when you’re done.

4. Querying the Endpoint

Paste the printed export lines into your shell, then use any OpenAI-compatible client. Each example below ships as a ready-to-run file in the directory you cloned: query.sh for curl, query_openai.py for the OpenAI SDK, and query_litellm.py for LiteLLM.

curl -sS --connect-timeout 30 --max-time 120 "$ENDPOINT/v1/chat/completions" \
  -H "x-daytona-preview-token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-moe",
    "messages": [{"role": "user", "content": "Write a haiku about sandboxes for AI agents."}],
    "max_tokens": 64
  }'

import os

from openai import OpenAI

client = OpenAI(
    base_url=f"{os.environ['ENDPOINT']}/v1",
    api_key="EMPTY",  # vLLM doesn't check it; auth is the preview-token header
    default_headers={"x-daytona-preview-token": os.environ["TOKEN"]},
)

resp = client.chat.completions.create(
    model="gemma-4-moe",
    messages=[{"role": "user", "content": "Write a haiku about ephemeral sandboxes."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)

import os

import litellm

resp = litellm.completion(
    model="hosted_vllm/gemma-4-moe",  # OpenAI-compatible vLLM server
    api_base=f"{os.environ['ENDPOINT']}/v1",
    api_key="EMPTY",
    extra_headers={"x-daytona-preview-token": os.environ["TOKEN"]},
    messages=[{"role": "user", "content": "Write a haiku about agents running code in the cloud."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)

The only Daytona-specific detail in any of these is the x-daytona-preview-token header. Everything else is the standard OpenAI API surface. The next three examples continue with the OpenAI SDK, following query_openai.py through streaming, reasoning, and tool calling.
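
To make that point concrete, here is the curl request rebuilt with nothing but the Python standard library. It is a sketch rather than one of the shipped clients: build_chat_request is a hypothetical name, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_chat_request(endpoint: str, token: str, prompt: str,
                       model: str = "gemma-4-moe",
                       max_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for the preview URL."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{endpoint.rstrip('/')}/v1/chat/completions",
        data=body,
        headers={
            "x-daytona-preview-token": token,  # the only Daytona-specific part
            "Content-Type": "application/json",
        },
        method="POST",
    )


# To send it, with ENDPOINT and TOKEN taken from the handoff lines:
# req = build_chat_request(os.environ["ENDPOINT"], os.environ["TOKEN"], "Hello")
# with urllib.request.urlopen(req, timeout=120) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```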

Streaming

After the plain chat call, query_openai.py shows an example that streams tokens as they arrive:

Python

stream = client.chat.completions.create(
    model="gemma-4-moe",
    messages=[{"role": "user", "content": "Write ten haikus about tokens streaming from a sandbox."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
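
If you also want the full text once streaming ends, a slightly more defensive loop helps. The sketch below assumes that a stream can contain chunks with an empty choices list (for example a trailing usage-only chunk when usage reporting is requested) and chunks whose delta has no content. collect_stream and fake_chunk are hypothetical names, and the fake chunks stand in for the SDK's objects so the example runs offline.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Print streamed text as it arrives and return the concatenated result."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # e.g. a usage-only chunk
            continue
        piece = chunk.choices[0].delta.content
        if piece:  # role-only or empty deltas carry no text
            print(piece, end="", flush=True)
            parts.append(piece)
    print()
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for a streamed chunk, shaped like the SDK's objects."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


demo = [fake_chunk("Tokens "), fake_chunk(None),
        SimpleNamespace(choices=[]), fake_chunk("flow")]
text = collect_stream(demo)
```

With the real client, pass the stream object returned by client.chat.completions.create(..., stream=True) in place of demo.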

Reasoning

The server was started with --reasoning-parser gemma4, which separates the model’s thinking from its answer. For the gemma-4 family there is a catch: reasoning tokens are never generated unless the request asks for them, which is why the other examples in this guide respond directly. Passing reasoning_effort turns thinking mode on, and the parsed trace comes back in the message’s reasoning field:

Python

resp = client.chat.completions.create(
    model="gemma-4-moe",
    messages=[{"role": "user", "content": "Write a haiku about GPU sandboxes."}],
    reasoning_effort="low",
    max_tokens=2048,
)
print("\nreasoning:")
print(resp.choices[0].message.reasoning)
print("answer:")
print(resp.choices[0].message.content)
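
Because reasoning is an extra field rather than part of the standard message schema, reading it defensively avoids an AttributeError when no trace was produced. The sketch below also falls back to reasoning_content, on the assumption that some vLLM versions use that name instead; check the field name against the version you run. extract_reasoning is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import Optional


def extract_reasoning(message) -> Optional[str]:
    """Return the parsed reasoning trace, or None if the message has none.

    The fallback field name is an assumption about other server versions.
    """
    for field in ("reasoning", "reasoning_content"):
        value = getattr(message, field, None)
        if value:
            return value
    return None


# Stand-in objects so the example runs without a server.
print(extract_reasoning(SimpleNamespace(reasoning="thinking...", content="answer")))
print(extract_reasoning(SimpleNamespace(content="direct answer")))
```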

Tool Calling

The script finishes with tool calling. Because the server was started with --enable-auto-tool-choice and a tool-call parser, the model can emit structured tool calls. The loop is the standard OpenAI one: the model requests a call, you run it, feed the result back, and the model answers.

Python

import json
import random

def get_weather(city):
    rng = random.Random(city.lower())  # same city, same weather
    temp = rng.randint(-5, 35)
    sky = rng.choice(["sunny", "cloudy", "rainy", "foggy", "windy"])
    return f"{temp}°C and {sky} in {city}"

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [{"role": "user", "content": "Write a haiku about the current weather in Paris."}]
resp = client.chat.completions.create(model="gemma-4-moe", messages=messages, tools=tools, max_tokens=256)
msg = resp.choices[0].message

if msg.tool_calls:
    messages.append(msg.model_dump(exclude_none=True))
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_weather(**args)
        print(f"\ntool call: {call.function.name}({args})")
        print(f"result:    {result}")
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    resp = client.chat.completions.create(model="gemma-4-moe", messages=messages, max_tokens=256)
    print("final:")
    print(resp.choices[0].message.content)
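
With more than one tool, the loop body usually becomes a lookup by name. The following sketch shows one way to do that while turning unknown tools and malformed arguments into error strings the model can read, instead of exceptions that end the conversation. run_tool and the registry dict are hypothetical names, not part of query_openai.py.

```python
import json
from typing import Callable, Dict


def run_tool(name: str, arguments: str,
             registry: Dict[str, Callable[..., str]]) -> str:
    """Look up a tool by name, call it with JSON arguments, return text."""
    func = registry.get(name)
    if func is None:
        return f"error: unknown tool {name!r}"
    try:
        args = json.loads(arguments or "{}")
        return str(func(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Bad JSON, non-object arguments, or wrong parameter names.
        return f"error: bad arguments for {name}: {exc}"


registry = {"get_weather": lambda city: f"21°C and sunny in {city}"}
print(run_tool("get_weather", '{"city": "Paris"}', registry))
print(run_tool("get_time", "{}", registry))
```

In the loop above, result = run_tool(call.function.name, call.function.arguments, registry) would replace the direct get_weather call.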

5. Access and Authentication

Two independent layers decide who can reach the model: Daytona’s preview proxy in front of the sandbox, and vLLM’s own API key check inside it. The guide has used one mode of the proxy layer so far; here is the full picture.

The Daytona Layer: Preview Links

Every request so far carried the preview token as a header. That is the default mode and the best fit for code you control, since the secret stays out of URLs, logs, and browser history. Preview links support two more modes for when the header is a poor fit.

Signed URLs embed a short-lived token in the URL itself:

Python

signed = sb.create_signed_preview_url(PORT, expires_in_seconds=3600)
print(signed.url)  # no headers needed, expires after an hour

Anything that accepts only a base URL can now call the model: chat frontends, no-code tools, a colleague’s notebook. And because the URL expires on schedule, sharing it is a bounded commitment rather than a permanent grant. Two details to know: the default expiry is only 60 seconds, so pass expires_in_seconds explicitly, and sb.expire_signed_preview_url(PORT, signed.token) revokes a URL early.

Public previews drop the proxy’s authentication entirely. Create the sandbox with public=True in the create params, and the preview URL serves anyone who has it, for as long as the sandbox stays up.

The vLLM Layer: API Keys

Every query example sets api_key="EMPTY". That is because vLLM, unless told otherwise, accepts any key; the field exists only to satisfy client constructors. Add --api-key your-secret-key to the vllm serve command (or set VLLM_API_KEY in the sandbox’s env_vars) and the check becomes real: the server requires Authorization: Bearer your-secret-key, which is exactly what OpenAI-compatible clients send as their api_key.

The two layers do not guard the same surface, though. The vLLM key covers the inference routes (/v1 and similar prefixes), while other endpoints on the same server accept requests without it. The Daytona token gates everything on the port.

That makes public preview plus vLLM API key a combination for sharing with people you broadly trust: the endpoint behaves like a standard OpenAI-style API, configured with nothing but a base URL and an api_key, so it works in any tool that accepts only those two fields. For anything more exposed than that, the Daytona layer is the better fit.

Setup	Client needs	Good for
Preview token header (guide default)	base URL + custom header	your own code
Signed URL	URL only; expires on schedule	temporary sharing
Public preview + vLLM API key	base URL + `api_key`	pointing existing apps at your model
Public preview, no key	base URL only	quick demos
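
For clients that speak raw HTTP, the rows of the table reduce to which headers a request carries. A small sketch, with client_headers as a hypothetical helper name:

```python
from typing import Dict, Optional


def client_headers(preview_token: Optional[str] = None,
                   api_key: Optional[str] = None) -> Dict[str, str]:
    """Build the auth headers for one of the setups in the table above."""
    headers: Dict[str, str] = {}
    if preview_token:
        headers["x-daytona-preview-token"] = preview_token  # Daytona layer
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"  # vLLM layer
    return headers


print(client_headers(preview_token="preview-token"))  # guide default
print(client_headers())                               # signed URL, or public with no key
print(client_headers(api_key="your-secret-key"))      # public preview + vLLM API key
```

OpenAI-compatible SDKs build the Authorization header themselves from api_key, so the helper is only relevant when you construct requests by hand.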

Inside the Sandbox

Everything above governs requests arriving from outside. Code running inside the sandbox can skip all of it and talk to http://localhost:8000 directly. The vLLM image ships Python with the openai package preinstalled (it is a vLLM dependency), so the SDK works there as-is:

Python

from daytona import Daytona, DaytonaConfig

daytona = Daytona(DaytonaConfig(target="us-east-1"))
sb = daytona.get("sandbox-id")  # printed by serve_vllm.py

print(sb.process.code_run("""
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="gemma-4-moe",
    messages=[{"role": "user", "content": "Write a haiku about code that never leaves its sandbox."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
""").result)

No auth is needed because the traffic never leaves the sandbox. Anything the image doesn’t ship, like litellm, install first with sb.process.exec("pip install litellm").

This colocated shape fits workloads where the data should live next to the model: batch inference (upload a dataset into the sandbox, process it against the local endpoint, download the results) or a self-contained agent that calls the local model and runs the code it writes, all in one sandbox.
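
As a rough picture of the batch case, the sketch below processes a JSONL dataset line by line. The record layout (one JSON object per line with a prompt key) and the run_batch name are assumptions made for the example; the complete callable is injected so the code runs without a server, and inside the sandbox it would wrap a call to the local endpoint.

```python
import json
from typing import Callable, Iterable, Iterator


def run_batch(lines: Iterable[str],
              complete: Callable[[str], str]) -> Iterator[str]:
    """Yield one output JSONL line per input record, adding an 'output' key."""
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines in the dataset
            continue
        record = json.loads(line)
        record["output"] = complete(record["prompt"])
        yield json.dumps(record, ensure_ascii=False)


# str.upper stands in for the model call in this offline demo.
dataset = ['{"prompt": "first"}', "", '{"prompt": "second"}']
for out in run_batch(dataset, str.upper):
    print(out)
```

Because the function is a generator, results can be written to the output file as they arrive instead of being held in memory.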

6. Swapping Models

To serve a different model, change three things in serve_vllm.py:

MODEL: the Hugging Face model ID
SERVED_AS: the name clients will pass as model
The --tool-call-parser and --reasoning-parser flags, which are model-family specific

For gated models, set HF_TOKEN in your .env; the script forwards it into the sandbox automatically. Keep in mind that the model has to fit on a single GPU, since sandboxes have at most one.
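
A quick back-of-the-envelope check can tell you whether a candidate model is even plausible on one GPU. The sketch below estimates weight memory only, as parameter count times bytes per parameter, and leaves a margin for the KV cache and activations. The 0.8 headroom factor is an arbitrary rule of thumb chosen for the example, and the 80 GB figure assumes an 80 GB H100; actual usage depends on context length, quantization, and vLLM settings.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes).

    2 bytes per parameter corresponds to 16-bit weights (bf16 or fp16).
    """
    return params_billion * bytes_per_param


def fits(params_billion: float, gpu_memory_gb: float,
         bytes_per_param: float = 2.0, headroom: float = 0.8) -> bool:
    """Rough check: do the weights fit in a fraction of GPU memory?"""
    return weight_memory_gb(params_billion, bytes_per_param) <= gpu_memory_gb * headroom


# Assuming the "26B" in the model name is the total parameter count:
print(weight_memory_gb(26))   # 16-bit weights, in GB
print(fits(26, 80))           # against an 80 GB card
print(fits(70, 80))           # a 70B model at 16 bits
```

Note that for a mixture-of-experts model all experts must be resident in memory, so the total parameter count, not the active count, is the number that matters here.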

7. Going Further: Sandboxes as Tool Runtimes

In the tool calling example above, get_weather runs in the same process as the client. Daytona makes a stronger pattern natural: give each chat session its own CPU sandbox, and execute the model’s tool calls there. The GPU sandbox keeps serving every session, while each conversation gets an isolated runtime where model-written code can run, install packages, and touch files without affecting anyone else. When the session ends, delete its sandbox.

Only the tool function’s body changes; instead of computing the result locally, it runs the model’s request in the session’s sandbox, with sandbox.process.code_run(...) for code or sandbox.process.exec(...) for shell commands, and returns the output. The sandbox can also carry whatever harness the tools need: interpreters, test runners, project dependencies. The tool-calling loop stays exactly the same. Both halves of the application run on Daytona: the GPU sandbox where the model thinks, and the CPU sandboxes where its decisions execute.
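
The bookkeeping for one sandbox per session can be as small as a dictionary. The sketch below is one possible shape, with the create and delete operations injected as callables so it runs without Daytona; SessionSandboxes is a hypothetical class, not part of the SDK.

```python
from typing import Callable, Dict, Generic, TypeVar

S = TypeVar("S")


class SessionSandboxes(Generic[S]):
    """Lazily creates one sandbox per chat session and deletes it on close."""

    def __init__(self, create: Callable[[], S],
                 delete: Callable[[S], None]) -> None:
        self._create = create
        self._delete = delete
        self._by_session: Dict[str, S] = {}

    def get(self, session_id: str) -> S:
        """Return the session's sandbox, creating it on first use."""
        if session_id not in self._by_session:
            self._by_session[session_id] = self._create()
        return self._by_session[session_id]

    def close(self, session_id: str) -> None:
        """Delete the session's sandbox if it exists; otherwise do nothing."""
        sandbox = self._by_session.pop(session_id, None)
        if sandbox is not None:
            self._delete(sandbox)
```

With Daytona, create could be something like lambda: daytona.create() and delete could be lambda sb: sb.delete(); a tool function would then call pool.get(session_id).process.code_run(...) and return the output.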

8. Configuration Options

Constants at the top of serve_vllm.py:

Parameter	Default	Description
`MODEL`	`google/gemma-4-26B-A4B-it`	Hugging Face model ID to serve
`SERVED_AS`	`gemma-4-moe`	Model name exposed by the API
`VLLM_IMAGE`	`vllm/vllm-openai:v0.22.1`	vLLM Docker image
`PORT`	`8000`	Port the server listens on
`TARGET`	`us-east-1`	Current region for GPU sandboxes
`BOOT_TIMEOUT`	`900`	Seconds to wait for the server to become healthy

Key advantages of this approach:

No infrastructure to manage: one script turns a stock Docker image into a served model on a GPU
Fast: a live endpoint about five minutes after you run the script, with no provisioning or driver setup
Reachable anywhere: the preview URL works from any machine, secured by a token header
OpenAI-compatible: existing clients, SDKs, and frameworks work unchanged