
Serve LLMs on GPU Sandboxes with vLLM


This guide demonstrates how to serve an open-weights model on a Daytona GPU sandbox with vLLM and query it from anywhere through a token-authenticated preview URL.

The serving side is a single script: it creates the sandbox, starts vLLM inside it, and prints the endpoint and its access token once the server is healthy. The endpoint is OpenAI-compatible, so existing clients work without modification; the guide shows examples querying it with curl, the OpenAI SDK, and LiteLLM. The model served is google/gemma-4-26B-A4B-it, but any model vLLM can serve works the same way.


Four steps take you from an API key to a live endpoint:

  1. Create: Spin up a GPU sandbox from the stock vllm/vllm-openai image, no custom image build needed
  2. Serve: Start vllm serve as a background session command inside the sandbox
  3. Wait: Poll the server’s /health endpoint through the preview URL while streaming the startup logs to your terminal
  4. Hand off: Print paste-ready export ENDPOINT=... and export TOKEN=... lines

All four are handled by a single script, serve_vllm.py; once it finishes, the sandbox keeps serving until you delete it. Three small clients show the endpoint in action: query.sh (curl), query_openai.py (OpenAI SDK with chat, streaming, and tool calling), and query_litellm.py (LiteLLM).

Clone the Daytona repository and navigate to the example directory:

git clone https://github.com/daytonaio/daytona.git
cd daytona/guides/python/model-serving/vllm

Create and activate a virtual environment:

python3 -m venv venv
source venv/bin/activate

Install the dependencies:

pip install -e .

This installs the daytona SDK along with the openai and litellm clients used by the query examples.

Get your Daytona API key from the Daytona Dashboard and set it in a .env file:

cp .env.example .env
# edit .env with your API key

The .env.example also has an optional HF_TOKEN entry. It is only required for gated Hugging Face models, though Hugging Face recommends a token for faster, less throttled downloads in general.

Let’s walk through serve_vllm.py, the script that creates the sandbox and starts the server.

The script targets us-east-1, currently the region for GPU sandboxes, and creates one directly from the official vLLM image:

import os
import sys
import time

import requests
from dotenv import load_dotenv
from daytona import (
    CreateSandboxFromImageParams,
    Daytona,
    DaytonaConfig,
    GpuType,
    Image,
    Resources,
    SessionExecuteRequest,
)

load_dotenv()

MODEL = "google/gemma-4-26B-A4B-it"
SERVED_AS = "gemma-4-moe"
VLLM_IMAGE = "vllm/vllm-openai:v0.22.1"
PORT = 8000
TARGET = "us-east-1"  # current region for GPU sandboxes
SESSION = "vllm"  # name of the background session the server runs in
BOOT_TIMEOUT = 900  # max seconds to wait for the server to come up

daytona = Daytona(DaytonaConfig(target=TARGET))

# Forward HF_TOKEN into the sandbox only if it is set locally
env_vars = {"HF_TOKEN": os.environ["HF_TOKEN"]} if os.environ.get("HF_TOKEN") else {}

sb = daytona.create(
    CreateSandboxFromImageParams(
        image=Image.base(VLLM_IMAGE),
        resources=Resources(
            gpu=1,
            gpu_type=[GpuType.H100, GpuType.RTX_PRO_6000],  # preference order
        ),
        auto_stop_interval=0,
        ephemeral=True,
        env_vars=env_vars,
    ),
    timeout=600,
)

A few things worth noting:

  • Stock image: Image.base pulls vllm/vllm-openai as-is. The whole serving stack ships in the image; the sandbox just adds the GPU.
  • One GPU per sandbox: gpu=1 is currently the per-sandbox maximum.
  • GPU preference: gpu_type takes a single type or a priority list; the sandbox gets the first type with availability, here an H100 with an RTX PRO 6000 fallback.
  • No idle stop: auto_stop_interval=0 keeps the endpoint alive until you delete the sandbox.
  • HF_TOKEN passthrough: the script forwards your token into the sandbox if you set one; it is required only for gated models, and the default model is not gated.

The server runs as a session command with run_async=True, so the script keeps control while vLLM boots:

sb.process.create_session(SESSION)
cmd = sb.process.execute_session_command(
    SESSION,
    SessionExecuteRequest(
        command=(
            f"vllm serve {MODEL} --port {PORT} "
            f"--served-model-name {SERVED_AS} "
            "--enable-auto-tool-choice --tool-call-parser gemma4 "
            "--reasoning-parser gemma4 "
            "--enable-prefix-caching"
        ),
        run_async=True,
    ),
)
cmd_id = cmd.cmd_id

The flags expose the model under the short name gemma-4-moe and enable tool calling and reasoning output parsing. The call returns immediately, and cmd_id is the handle for asking about the command later; the wait loop below uses it to fetch logs and check whether the server is still alive.

Model download and loading take a few minutes. While waiting, the script streams the server logs to your terminal and polls /health through the preview URL, giving up after BOOT_TIMEOUT (900 seconds by default):

pv = sb.get_preview_link(PORT)
hdr = {"x-daytona-preview-token": pv.token}
deadline = time.time() + BOOT_TIMEOUT
ready = False
printed = 0

while time.time() < deadline:
    # logs are a cumulative snapshot; print only the new tail
    out = sb.process.get_session_command_logs(SESSION, cmd_id).output or ""
    if len(out) > printed:
        sys.stdout.write(out[printed:])
        sys.stdout.flush()
        printed = len(out)

    # vllm serve runs until killed; an exit code means it died
    exit_code = sb.process.get_session_command(SESSION, cmd_id).exit_code
    if exit_code is not None:
        # dump_log() is a helper in serve_vllm.py (not shown in this excerpt)
        # that saves the full server log and returns its path
        print(f"!! vllm exited with code {exit_code}. Full log saved to {dump_log(cmd_id)}", flush=True)
        sys.exit(1)

    try:
        if requests.get(f"{pv.url}/health", headers=hdr, timeout=10).status_code == 200:
            ready = True
            break
    except requests.RequestException:
        pass  # server not reachable yet; keep polling
    time.sleep(10)

The preview link is the piece that exposes the server outside the sandbox: pv.url is reachable from anywhere, and requests authenticate with the x-daytona-preview-token header. The URL follows the structure https://{port}-{sandboxId}.{daytonaProxyDomain}, as described in the preview docs. The same URL and token the script uses for health checks are the ones your clients will use for inference.
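
As a minimal sketch of how any client authenticates against the preview URL, the helper below builds the same /health request using only the standard library. The name `health_request` and the example endpoint and token values are placeholders for illustration; they are not part of serve_vllm.py.

```python
import urllib.request


def health_request(endpoint: str, token: str) -> urllib.request.Request:
    """Build a GET request for /health that carries the preview token header."""
    return urllib.request.Request(
        endpoint.rstrip("/") + "/health",
        headers={"x-daytona-preview-token": token},
        method="GET",
    )


# Placeholder values; use the ENDPOINT and TOKEN printed by serve_vllm.py.
req = health_request("https://8000-sandbox-id.proxy.example", "preview-token")
print(req.full_url)

# To actually send it (needs a live sandbox):
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       print(resp.status)
```

The same header works unchanged on the inference routes, which is why the health check and the query clients can share one URL and one token.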

If the server process dies during boot, the script notices the exit code on its next poll (the loop sleeps 10 seconds between checks), saves the full server log next to the script, and exits instead of waiting out the timeout.
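
The wait logic has three possible outcomes: the server becomes healthy, the process dies, or the deadline passes. The sketch below isolates that control flow behind injectable callables so it can be exercised without a sandbox. The function `wait_for_server` and its parameter names are hypothetical; serve_vllm.py inlines the equivalent loop.

```python
import time
from typing import Callable, Optional


def wait_for_server(
    get_logs: Callable[[], str],
    get_exit_code: Callable[[], Optional[int]],
    is_healthy: Callable[[], bool],
    timeout: float = 900,
    interval: float = 10,
    emit: Callable[[str], None] = lambda s: print(s, end="", flush=True),
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the server is up; return "ready", "died", or "timeout"."""
    deadline = time.monotonic() + timeout
    printed = 0  # how many characters of the cumulative log were already shown
    while time.monotonic() < deadline:
        snapshot = get_logs()
        if len(snapshot) > printed:
            emit(snapshot[printed:])  # only the new tail
            printed = len(snapshot)
        if get_exit_code() is not None:
            return "died"  # a long-running server should never exit
        if is_healthy():
            return "ready"
        sleep(interval)
    return "timeout"


# Wiring it to the calls used in this guide would look roughly like:
#   get_logs=lambda: sb.process.get_session_command_logs(SESSION, cmd_id).output or ""
#   get_exit_code=lambda: sb.process.get_session_command(SESSION, cmd_id).exit_code
```

Returning a state instead of calling sys.exit inside the loop leaves the caller free to decide what to do on each outcome, for example saving the log on "died" but not on "timeout".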

Once healthy, the script prints the handoff:

ready - paste into your shell:
export ENDPOINT=https://8000-{sandboxId}.{daytonaProxyDomain}
export TOKEN={previewToken}
sandbox left UP: {sandboxId}
reconnect: daytona.get('{sandboxId}')
delete: daytona.get('{sandboxId}').delete()

The sandbox stays up either way: on success so the endpoint keeps serving, on failure so the already-downloaded weights aren’t lost. Delete it when you’re done.

Paste the printed export lines into your shell, then use any OpenAI-compatible client. Each example below ships as a ready-to-run file in the directory you cloned: query.sh for curl, query_openai.py for the OpenAI SDK, and query_litellm.py for LiteLLM.

curl -sS --connect-timeout 30 --max-time 120 "$ENDPOINT/v1/chat/completions" \
  -H "x-daytona-preview-token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-moe",
    "messages": [{"role": "user", "content": "Write a haiku about sandboxes for AI agents."}],
    "max_tokens": 64
  }'

The only Daytona-specific detail in any of these is the x-daytona-preview-token header. Everything else is the standard OpenAI API surface. The next three examples continue with the OpenAI SDK, following query_openai.py through streaming, reasoning, and tool calling.
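
The SDK snippets in this guide use a `client` object whose construction is not shown. One plausible setup, assuming the ENDPOINT and TOKEN variables exported earlier, passes the preview token through the OpenAI SDK's `default_headers` option. The helper names `client_kwargs` and `make_client` are hypothetical, and query_openai.py may build its client differently.

```python
import os


def client_kwargs(endpoint: str, token: str) -> dict:
    """Constructor arguments for an OpenAI client aimed at the sandbox."""
    return {
        "base_url": endpoint.rstrip("/") + "/v1",
        "api_key": "EMPTY",  # vLLM accepts any key unless --api-key is set
        "default_headers": {"x-daytona-preview-token": token},
    }


def make_client():
    """Build the client from the exported ENDPOINT and TOKEN variables."""
    # Imported lazily so client_kwargs() stays usable without the SDK installed.
    from openai import OpenAI

    return OpenAI(**client_kwargs(os.environ["ENDPOINT"], os.environ["TOKEN"]))
```

With `client = make_client()` in place, each later snippet only differs in the arguments it passes to `client.chat.completions.create`.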

After the plain chat call, query_openai.py shows an example that streams tokens as they arrive:

stream = client.chat.completions.create(
    model="gemma-4-moe",
    messages=[{"role": "user", "content": "Write ten haikus about tokens streaming from a sandbox."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
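
If you want the full text rather than live printing, a slightly more defensive variant can collect the pieces and skip chunks that carry no choices (for instance a trailing usage-only chunk when usage reporting is requested). This is a sketch: `collect_stream` is a hypothetical helper, and the fake chunks below only imitate the shape of the SDK's streaming objects.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Join streamed text, skipping chunks with no choices or no content."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # nothing to read in this chunk
            continue
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for an SDK streaming chunk, for illustration only."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [fake_chunk("Tokens "), fake_chunk(None), SimpleNamespace(choices=[]), fake_chunk("flow")]
print(collect_stream(demo))
```

Passing the real `stream` object in place of `demo` returns the complete response as one string.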

The server was started with --reasoning-parser gemma4, which separates the model’s thinking from its answer. For the gemma-4 family there is a catch: reasoning tokens are never generated unless the request asks for them, which is why the other examples in this guide respond directly. Passing reasoning_effort turns thinking mode on, and the parsed trace comes back in the message’s reasoning field:

resp = client.chat.completions.create(
    model="gemma-4-moe",
    messages=[{"role": "user", "content": "Write a haiku about GPU sandboxes."}],
    reasoning_effort="low",
    max_tokens=2048,
)
print("\nreasoning:")
print(resp.choices[0].message.reasoning)
print("answer:")
print(resp.choices[0].message.content)
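
Because the reasoning field is only present when thinking mode was requested, reading it with a default avoids an attribute error on plain responses. The helper below is a hedged sketch: `split_reasoning` is a hypothetical name, and the `reasoning_content` fallback is an assumption about older vLLM releases, not something this guide's pinned image needs.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer); reasoning is None when no trace came back."""
    trace = getattr(message, "reasoning", None)
    if trace is None:
        # Assumption: some older vLLM versions named this field reasoning_content.
        trace = getattr(message, "reasoning_content", None)
    return trace, message.content


# Stand-in message objects, shaped like the SDK's, for illustration only.
with_trace = SimpleNamespace(reasoning="count syllables", content="a haiku")
without_trace = SimpleNamespace(content="a direct answer")

print(split_reasoning(with_trace))
print(split_reasoning(without_trace))
```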

The script finishes with tool calling. Because the server was started with --enable-auto-tool-choice and a tool-call parser, the model can emit structured tool calls. The loop is the standard OpenAI one: the model requests a call, you run it, feed the result back, and the model answers.

import json
import random


def get_weather(city):
    rng = random.Random(city.lower())  # same city, same weather
    temp = rng.randint(-5, 35)
    sky = rng.choice(["sunny", "cloudy", "rainy", "foggy", "windy"])
    return f"{temp}°C and {sky} in {city}"


tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [{"role": "user", "content": "Write a haiku about the current weather in Paris."}]
resp = client.chat.completions.create(model="gemma-4-moe", messages=messages, tools=tools, max_tokens=256)
msg = resp.choices[0].message

if msg.tool_calls:
    # Echo the assistant's tool-call message back, then append one result per call
    messages.append(msg.model_dump(exclude_none=True))
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_weather(**args)
        print(f"\ntool call: {call.function.name}({args})")
        print(f"result: {result}")
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    resp = client.chat.completions.create(model="gemma-4-moe", messages=messages, max_tokens=256)

print("final:")
print(resp.choices[0].message.content)
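
The example above has a single tool, so it calls get_weather directly. With more than one tool, a common pattern is to dispatch on the name the model returned and to report bad calls back as text instead of crashing. The sketch below assumes a hypothetical `run_tool_call` helper and a toy `get_time` tool; neither is part of query_openai.py.

```python
import json


def run_tool_call(name: str, arguments: str, registry: dict) -> str:
    """Execute one tool call and return a string suitable for a tool message."""
    func = registry.get(name)
    if func is None:
        return f"error: unknown tool {name!r}"
    try:
        args = json.loads(arguments)  # the model sends arguments as a JSON string
        return str(func(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, or arguments that do not match the function signature
        return f"error: bad arguments for {name}: {exc}"


def get_time(city):
    """Toy tool used only for this illustration."""
    return f"12:00 in {city}"


registry = {"get_time": get_time}
print(run_tool_call("get_time", '{"city": "Paris"}', registry))
print(run_tool_call("get_stock", "{}", registry))
```

Inside the loop, `run_tool_call(call.function.name, call.function.arguments, registry)` would replace the direct function call, and the returned string goes into the tool message as before.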

Two independent layers decide who can reach the model: Daytona’s preview proxy in front of the sandbox, and vLLM’s own API key check inside it. The guide has used one mode of the proxy layer so far; here is the full picture.

Every request so far carried the preview token as a header. That is the default mode and the best fit for code you control, since the secret stays out of URLs, logs, and browser history. Preview links support two more modes for when the header is a poor fit.

Signed URLs embed a short-lived token in the URL itself:

signed = sb.create_signed_preview_url(PORT, expires_in_seconds=3600)
print(signed.url) # no headers needed, expires after an hour

Anything that accepts only a base URL can now call the model: chat frontends, no-code tools, a colleague’s notebook. And because the URL expires on schedule, sharing it is a bounded commitment rather than a permanent grant. Two details to know: the default expiry is only 60 seconds, so pass expires_in_seconds explicitly, and sb.expire_signed_preview_url(PORT, signed.token) revokes a URL early.

Public previews drop the proxy’s authentication entirely. Create the sandbox with public=True in the create params, and the preview URL serves anyone who has it, for as long as the sandbox stays up.

Every query example sets api_key="EMPTY". That is because vLLM, unless told otherwise, accepts any key; the field exists only to satisfy client constructors. Add --api-key your-secret-key to the vllm serve command (or set VLLM_API_KEY in the sandbox’s env_vars) and the check becomes real: the server requires Authorization: Bearer your-secret-key, which is exactly what OpenAI-compatible clients send as their api_key.

The two layers do not guard the same surface, though. The vLLM key covers the inference routes (/v1 and similar prefixes), while other endpoints on the same server accept requests without it. The Daytona token gates everything on the port.

That makes public preview plus vLLM API key a combination for sharing with people you broadly trust: the endpoint behaves like a standard OpenAI-style API, configured with nothing but a base URL and an api_key, so it works in any tool that accepts only those two fields. For anything more exposed than that, the Daytona layer is the better fit.

| Setup | Client needs | Good for |
| --- | --- | --- |
| Preview token header (guide default) | base URL + custom header | your own code |
| Signed URL | URL only; expires on schedule | temporary sharing |
| Public preview + vLLM API key | base URL + api_key | pointing existing apps at your model |
| Public preview, no key | base URL only | quick demos |
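
The setups above differ only in which headers a request has to carry. As a rough sketch, the hypothetical helper `auth_headers` below builds the header set for each case; the header names come from this guide, while the function itself and the example secrets are illustrative.

```python
from typing import Optional


def auth_headers(preview_token: Optional[str] = None,
                 vllm_api_key: Optional[str] = None) -> dict:
    """Headers for a request, depending on which auth layers are active."""
    headers = {}
    if preview_token:  # Daytona preview proxy layer
        headers["x-daytona-preview-token"] = preview_token
    if vllm_api_key:  # vLLM's own key check (server started with --api-key)
        headers["Authorization"] = f"Bearer {vllm_api_key}"
    return headers


print(auth_headers(preview_token="preview-token"))       # guide default
print(auth_headers())                                    # signed URL, or public with no key
print(auth_headers(vllm_api_key="your-secret-key"))      # public preview + vLLM API key
```

OpenAI-compatible clients produce the Authorization header themselves from `api_key`, so in practice only the preview token needs to be added by hand.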

Everything above governs requests arriving from outside. Code running inside the sandbox can skip all of it and talk to http://localhost:8000 directly. The vLLM image ships Python with the openai package preinstalled (it is a vLLM dependency), so the SDK works there as-is:

from daytona import Daytona, DaytonaConfig

daytona = Daytona(DaytonaConfig(target="us-east-1"))
sb = daytona.get("sandbox-id")  # printed by serve_vllm.py

# The string below runs inside the sandbox, next to the vLLM server
print(sb.process.code_run("""
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="gemma-4-moe",
    messages=[{"role": "user", "content": "Write a haiku about code that never leaves its sandbox."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
""").result)

No auth is needed because the traffic never leaves the sandbox. Anything the image doesn’t ship, like litellm, install first with sb.process.exec("pip install litellm").

This colocated shape fits workloads where the data should live next to the model: batch inference (upload a dataset into the sandbox, process it against the local endpoint, download the results) or a self-contained agent that calls the local model and runs the code it writes, all in one sandbox.
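
For the batch case, the processing step that runs inside the sandbox can be quite small. The sketch below assumes a JSONL input with one {"prompt": ...} object per line; the file layout, the `run_batch` name, and the `complete` callable are all illustrative choices, not part of the example repository.

```python
import json
from pathlib import Path
from typing import Callable


def run_batch(src: Path, dst: Path, complete: Callable[[str], str]) -> int:
    """Read prompts from src (JSONL), write prompt/output pairs to dst (JSONL)."""
    count = 0
    with src.open() as fin, dst.open("w") as fout:
        for line in fin:
            if not line.strip():  # tolerate blank lines
                continue
            prompt = json.loads(line)["prompt"]
            record = {"prompt": prompt, "output": complete(prompt)}
            fout.write(json.dumps(record) + "\n")
            count += 1
    return count


# Inside the sandbox, `complete` would wrap a call to the local endpoint, e.g.
#   lambda p: client.chat.completions.create(
#       model="gemma-4-moe",
#       messages=[{"role": "user", "content": p}],
#   ).choices[0].message.content
```

Keeping the model call behind a plain callable makes the file handling easy to test locally before any GPU time is spent.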

To serve a different model, change three things in serve_vllm.py:

  • MODEL: the Hugging Face model ID
  • SERVED_AS: the name clients will pass as model
  • The --tool-call-parser and --reasoning-parser flags, which are model-family specific

For gated models, set HF_TOKEN in your .env; the script forwards it into the sandbox automatically. Keep in mind that the model has to fit on a single GPU, since sandboxes have at most one.
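
One way to keep those three knobs together is to build the serve command from parameters. This is a sketch rather than how serve_vllm.py is written: `build_serve_command` is a hypothetical helper, and the only parser values shown are the gemma4 ones this guide already uses. Check the vLLM documentation for the parser names that match another model family.

```python
import shlex
from typing import Optional


def build_serve_command(
    model: str,
    served_as: str,
    port: int = 8000,
    tool_call_parser: Optional[str] = None,
    reasoning_parser: Optional[str] = None,
) -> str:
    """Assemble a `vllm serve` command line, omitting parsers that are not set."""
    args = ["vllm", "serve", model, "--port", str(port), "--served-model-name", served_as]
    if tool_call_parser:
        args += ["--enable-auto-tool-choice", "--tool-call-parser", tool_call_parser]
    if reasoning_parser:
        args += ["--reasoning-parser", reasoning_parser]
    args.append("--enable-prefix-caching")
    return shlex.join(args)  # quotes any argument that needs it


print(build_serve_command(
    "google/gemma-4-26B-A4B-it",
    "gemma-4-moe",
    tool_call_parser="gemma4",
    reasoning_parser="gemma4",
))
```

Leaving both parsers unset yields a plain chat server, which is a reasonable starting point for a model whose parser names you have not confirmed yet.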

7. Going Further: Sandboxes as Tool Runtimes


In the tool calling example above, get_weather runs in the same process as the client. Daytona makes a stronger pattern natural: give each chat session its own CPU sandbox, and execute the model’s tool calls there. The GPU sandbox keeps serving every session, while each conversation gets an isolated runtime where model-written code can run, install packages, and touch files without affecting anyone else. When the session ends, delete its sandbox.

Only the tool function’s body changes; instead of computing the result locally, it runs the model’s request in the session’s sandbox, with sandbox.process.code_run(...) for code or sandbox.process.exec(...) for shell commands, and returns the output. The sandbox can also carry whatever harness the tools need: interpreters, test runners, project dependencies. The tool-calling loop stays exactly the same. Both halves of the application run on Daytona: the GPU sandbox where the model thinks, and the CPU sandboxes where its decisions execute.
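
A minimal sketch of such a tool, assuming each chat session already holds a handle to its own CPU sandbox: the factory below closes over that handle and forwards code to `sandbox.process.code_run`, returning the `.result` field as in the earlier in-sandbox example. The names `make_run_python_tool` and `run_python` are hypothetical, and the fake sandbox exists only so the sketch runs without Daytona.

```python
from types import SimpleNamespace


def make_run_python_tool(sandbox):
    """Return a tool function that executes model-written code in `sandbox`."""

    def run_python(code: str) -> str:
        response = sandbox.process.code_run(code)
        return response.result  # text output of the executed code

    return run_python


# Stand-in for a real session sandbox, for illustration only.
fake_sandbox = SimpleNamespace(
    process=SimpleNamespace(
        code_run=lambda code: SimpleNamespace(result=f"ran {len(code)} chars")
    )
)

run_python = make_run_python_tool(fake_sandbox)
print(run_python("print(1)"))
```

Registering `run_python` in place of get_weather, with a matching entry in `tools`, leaves the rest of the tool-calling loop untouched.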

Constants at the top of serve_vllm.py:

| Parameter | Default | Description |
| --- | --- | --- |
| MODEL | google/gemma-4-26B-A4B-it | Hugging Face model ID to serve |
| SERVED_AS | gemma-4-moe | Model name exposed by the API |
| VLLM_IMAGE | vllm/vllm-openai:v0.22.1 | vLLM Docker image |
| PORT | 8000 | Port the server listens on |
| TARGET | us-east-1 | Current region for GPU sandboxes |
| BOOT_TIMEOUT | 900 | Seconds to wait for the server to become healthy |

Key advantages of this approach:

  • No infrastructure to manage: one script turns a stock Docker image into a served model on a GPU
  • Fast: a live endpoint about five minutes after you run the script, with no provisioning or driver setup
  • Reachable anywhere: the preview URL works from any machine, secured by a token header
  • OpenAI-compatible: existing clients, SDKs, and frameworks work unchanged