This guide demonstrates how to serve an open-weights model on a Daytona GPU sandbox with vLLM and query it from anywhere through a token-authenticated preview URL.
The serving side is a single script: it creates the sandbox, starts vLLM inside it, and prints the endpoint and its access token once the server is healthy. The endpoint is OpenAI-compatible, so existing clients work without modification; the guide shows examples querying it with curl, the OpenAI SDK, and LiteLLM. The model served is google/gemma-4-26B-A4B-it, but any model vLLM can serve works the same way.
1. Workflow Overview
Four steps take you from an API key to a live endpoint:
- Create: Spin up a GPU sandbox from the stock vllm/vllm-openai image, no custom image build needed
- Serve: Start vllm serve as a background session command inside the sandbox
- Wait: Poll the server’s /health endpoint through the preview URL while streaming the startup logs to your terminal
- Hand off: Print paste-ready export ENDPOINT=... and export TOKEN=... lines
All four are handled by a single script, serve_vllm.py; once it finishes, the sandbox keeps serving until you delete it. Three small clients show the endpoint in action: query.sh (curl), query_openai.py (OpenAI SDK with chat, streaming, reasoning, and tool calling), and query_litellm.py (LiteLLM).
2. Setup
Clone the Repository
Clone the Daytona repository and navigate to the example directory:

    git clone https://github.com/daytonaio/daytona.git
    cd daytona/guides/python/model-serving/vllm

Create Virtual Environment

    python3 -m venv venv
    source venv/bin/activate

Install Dependencies

    pip install -e .

This installs the daytona SDK along with the openai and litellm clients used by the query examples.
Configure Environment
Get your Daytona API key from the Daytona Dashboard and set it in a .env file:

    cp .env.example .env
    # edit .env with your API key

The .env.example also has an optional HF_TOKEN entry. It is only required for gated Hugging Face models, though Hugging Face recommends a token for faster, less throttled downloads in general.
3. Understanding the Code
Let’s walk through serve_vllm.py, the script that creates the sandbox and starts the server.
Creating the GPU Sandbox
The script targets us-east-1, currently the region for GPU sandboxes, and creates one directly from the official vLLM image:

    import os
    import sys
    import time

    import requests
    from dotenv import load_dotenv

    from daytona import (
        CreateSandboxFromImageParams,
        Daytona,
        DaytonaConfig,
        GpuType,
        Image,
        Resources,
        SessionExecuteRequest,
    )

    load_dotenv()

    MODEL = "google/gemma-4-26B-A4B-it"
    SERVED_AS = "gemma-4-moe"
    VLLM_IMAGE = "vllm/vllm-openai:v0.22.1"
    PORT = 8000
    TARGET = "us-east-1"  # current region for GPU sandboxes
    SESSION = "vllm"  # name of the background session the server runs in
    BOOT_TIMEOUT = 900  # max seconds to wait for the server to come up

    daytona = Daytona(DaytonaConfig(target=TARGET))
    env_vars = {"HF_TOKEN": os.environ["HF_TOKEN"]} if os.environ.get("HF_TOKEN") else {}
    sb = daytona.create(
        CreateSandboxFromImageParams(
            image=Image.base(VLLM_IMAGE),
            resources=Resources(
                gpu=1,
                gpu_type=[GpuType.H100, GpuType.RTX_PRO_6000],  # preference order
            ),
            auto_stop_interval=0,
            ephemeral=True,
            env_vars=env_vars,
        ),
        timeout=600,
    )

A few things worth noting:
- Stock image: Image.base pulls vllm/vllm-openai as-is. The whole serving stack ships in the image; the sandbox just adds the GPU.
- One GPU per sandbox: gpu=1 is currently the per-sandbox maximum.
- GPU preference: gpu_type takes a single type or a priority list; the sandbox gets the first type with availability, here an H100 with an RTX PRO 6000 fallback.
- No idle stop: auto_stop_interval=0 keeps the endpoint alive until you delete the sandbox.
- HF_TOKEN passthrough: the script forwards your token into the sandbox if you set one; it is required only for gated models, and the default model is not gated.
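To make the preference-order rule concrete, here is a minimal sketch in plain Python. It only illustrates the semantics described above (the first listed type that has capacity wins); the actual placement happens on Daytona's side, and as_priority_list, first_available, the string GPU names, and the availability set are hypothetical stand-ins rather than SDK calls.

```python
from typing import Iterable, List, Optional, Sequence, Union


def as_priority_list(gpu_type: Union[str, Sequence[str]]) -> List[str]:
    """Accept a single type or a priority list and always return a list."""
    if isinstance(gpu_type, str):
        return [gpu_type]
    return list(gpu_type)


def first_available(
    gpu_type: Union[str, Sequence[str]], available: Iterable[str]
) -> Optional[str]:
    """Return the first preferred type that currently has capacity, if any."""
    pool = set(available)
    for candidate in as_priority_list(gpu_type):
        if candidate in pool:
            return candidate
    return None  # nothing in the preference list is available
```

With the guide's preference order, an H100 is chosen whenever one is free, and the RTX PRO 6000 is used only when no H100 is.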
Starting the Server
The server runs as a session command with run_async=True, so the script keeps control while vLLM boots:

    sb.process.create_session(SESSION)
    cmd = sb.process.execute_session_command(
        SESSION,
        SessionExecuteRequest(
            command=(
                f"vllm serve {MODEL} --port {PORT} "
                f"--served-model-name {SERVED_AS} "
                "--enable-auto-tool-choice --tool-call-parser gemma4 "
                "--reasoning-parser gemma4 "
                "--enable-prefix-caching"
            ),
            run_async=True,
        ),
    )
    cmd_id = cmd.cmd_id

The flags expose the model under the short name gemma-4-moe and enable tool calling, reasoning output parsing, and prefix caching. The call returns immediately, and cmd_id is the handle for asking about the command later; the wait loop below uses it to fetch logs and check whether the server is still alive.
Waiting for the Server
Model download and loading take a few minutes. While waiting, the script streams the server logs to your terminal and polls /health through the preview URL, giving up after BOOT_TIMEOUT (900 seconds by default):

    pv = sb.get_preview_link(PORT)
    hdr = {"x-daytona-preview-token": pv.token}

    deadline = time.time() + BOOT_TIMEOUT
    ready = False
    printed = 0
    while time.time() < deadline:
        # logs are a cumulative snapshot; print only the new tail
        out = sb.process.get_session_command_logs(SESSION, cmd_id).output or ""
        if len(out) > printed:
            sys.stdout.write(out[printed:])
            sys.stdout.flush()
            printed = len(out)
        # vllm serve runs until killed; an exit code means it died
        exit_code = sb.process.get_session_command(SESSION, cmd_id).exit_code
        if exit_code is not None:
            # dump_log is not shown in this excerpt; judging by its use here,
            # it saves the full server log and returns the file's path
            print(f"!! vllm exited with code {exit_code}. Full log saved to {dump_log(cmd_id)}", flush=True)
            sys.exit(1)
        try:
            if requests.get(f"{pv.url}/health", headers=hdr, timeout=10).status_code == 200:
                ready = True
                break
        except requests.RequestException:
            pass  # the proxy or server is not reachable yet; retry on the next poll
        time.sleep(10)

The preview link is the piece that exposes the server outside the sandbox: pv.url is reachable from anywhere, and requests authenticate with the x-daytona-preview-token header. The URL follows the structure https://{port}-{sandboxId}.{daytonaProxyDomain}, as described in the preview docs. The same URL and token the script uses for health checks are the ones your clients will use for inference.
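The URL structure can be checked with a few lines of standard-library Python. This is a minimal sketch, assuming the documented shape https://{port}-{sandboxId}.{daytonaProxyDomain}; parse_preview_url is a hypothetical helper, not part of the Daytona SDK, and the domain used in the usage note is a placeholder.

```python
from typing import Tuple
from urllib.parse import urlsplit


def parse_preview_url(url: str) -> Tuple[int, str]:
    """Split a preview URL into (port, sandbox_id).

    Assumes the first host label has the form {port}-{sandboxId}.
    """
    host = urlsplit(url).hostname or ""
    first_label = host.split(".", 1)[0]
    # The port is everything before the first hyphen. The sandbox ID may
    # itself contain hyphens, so partition splits only once.
    port, sep, sandbox_id = first_label.partition("-")
    if not sep or not port.isdigit() or not sandbox_id:
        raise ValueError(f"not a preview URL: {url!r}")
    return int(port), sandbox_id
```

For example, parse_preview_url("https://8000-abc-123.proxy.example.test") returns the port 8000 and the sandbox ID abc-123, which is handy when you only have the endpoint string and need the ID to reconnect or delete.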
If the server process dies during boot, the script sees the exit code on its next poll (the loop sleeps 10 seconds between polls), saves the full server log next to the script, and exits instead of waiting out the timeout.
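The wait loop combines three ideas: print only the unseen tail of a cumulative log, stop early when the process has an exit code, and stop with success when the health check passes. The sketch below separates those ideas into small functions that take plain callables, so the logic can be exercised without a sandbox. It is an illustration rather than the script's actual structure; new_tail, wait_until_ready, and their parameters are hypothetical names.

```python
import time
from typing import Callable, Optional, Tuple


def new_tail(snapshot: str, printed: int) -> Tuple[str, int]:
    """Return the unseen part of a cumulative log snapshot and the new offset."""
    if len(snapshot) > printed:
        return snapshot[printed:], len(snapshot)
    return "", printed


def wait_until_ready(
    is_healthy: Callable[[], bool],
    get_exit_code: Callable[[], Optional[int]],
    timeout: float,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the server is healthy, has exited, or the timeout passes.

    Returns one of "ready", "exited", or "timeout".
    """
    deadline = clock() + timeout
    while clock() < deadline:
        # A long-running server has no exit code; any value means it died.
        if get_exit_code() is not None:
            return "exited"
        if is_healthy():
            return "ready"
        sleep(interval)
    return "timeout"
```

In the real script, is_healthy would wrap the requests.get call against /health and get_exit_code would wrap sb.process.get_session_command(SESSION, cmd_id).exit_code. Injecting sleep and clock keeps the loop testable with fake time.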
Once healthy, the script prints the handoff:

    ready - paste into your shell:
    export ENDPOINT=https://8000-{sandboxId}.{daytonaProxyDomain}
    export TOKEN={previewToken}

    sandbox left UP: {sandboxId}
      reconnect: daytona.get('{sandboxId}')
      delete: daytona.get('{sandboxId}').delete()

The sandbox stays up either way: on success so the endpoint keeps serving, on failure so the already-downloaded weights aren’t lost. Delete it when you’re done.
4. Querying the Endpoint
Paste the printed export lines into your shell, then use any OpenAI-compatible client. Each example below ships as a ready-to-run file in the directory you cloned: query.sh for curl, query_openai.py for the OpenAI SDK, and query_litellm.py for LiteLLM.

curl:

    curl -sS --connect-timeout 30 --max-time 120 "$ENDPOINT/v1/chat/completions" \
      -H "x-daytona-preview-token: $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gemma-4-moe",
        "messages": [{"role": "user", "content": "Write a haiku about sandboxes for AI agents."}],
        "max_tokens": 64
      }'

OpenAI SDK:

    import os

    from openai import OpenAI

    client = OpenAI(
        base_url=f"{os.environ['ENDPOINT']}/v1",
        api_key="EMPTY",  # vLLM doesn't check it; auth is the preview-token header
        default_headers={"x-daytona-preview-token": os.environ["TOKEN"]},
    )

    resp = client.chat.completions.create(
        model="gemma-4-moe",
        messages=[{"role": "user", "content": "Write a haiku about ephemeral sandboxes."}],
        max_tokens=64,
    )
    print(resp.choices[0].message.content)

LiteLLM:

    import os

    import litellm

    resp = litellm.completion(
        model="hosted_vllm/gemma-4-moe",  # OpenAI-compatible vLLM server
        api_base=f"{os.environ['ENDPOINT']}/v1",
        api_key="EMPTY",
        extra_headers={"x-daytona-preview-token": os.environ["TOKEN"]},
        messages=[{"role": "user", "content": "Write a haiku about agents running code in the cloud."}],
        max_tokens=64,
    )
    print(resp.choices[0].message.content)

The only Daytona-specific detail in any of these is the x-daytona-preview-token header. Everything else is the standard OpenAI API surface. The next three examples continue with the OpenAI SDK, following query_openai.py through streaming, reasoning, and tool calling.
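To show how little is Daytona-specific, the same request can be assembled with nothing but the Python standard library. This is a sketch and not one of the files in the repository; build_chat_request and send are hypothetical helper names, and the endpoint and token are whatever the serving script printed.

```python
import json
import urllib.request


def build_chat_request(
    endpoint: str,
    token: str,
    prompt: str,
    model: str = "gemma-4-moe",
    max_tokens: int = 64,
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for the preview URL."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            # The one Daytona-specific piece: the preview token header.
            "x-daytona-preview-token": token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the first choice's message content."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

With ENDPOINT and TOKEN exported, send(build_chat_request(os.environ["ENDPOINT"], os.environ["TOKEN"], "Write a haiku about sandboxes.")) performs the same call as the curl example (after import os).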
Streaming
After the plain chat call, query_openai.py shows an example that streams tokens as they arrive:

    stream = client.chat.completions.create(
        model="gemma-4-moe",
        messages=[{"role": "user", "content": "Write ten haikus about tokens streaming from a sandbox."}],
        max_tokens=512,
        stream=True,
    )
    for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
    print()

Reasoning
The server was started with --reasoning-parser gemma4, which separates the model’s thinking from its answer. For the gemma-4 family there is a catch: reasoning tokens are never generated unless the request asks for them, which is why the other examples in this guide respond directly. Passing reasoning_effort turns thinking mode on, and the parsed trace comes back in the message’s reasoning field:

    resp = client.chat.completions.create(
        model="gemma-4-moe",
        messages=[{"role": "user", "content": "Write a haiku about GPU sandboxes."}],
        reasoning_effort="low",
        max_tokens=2048,
    )
    print("\nreasoning:")
    print(resp.choices[0].message.reasoning)
    print("answer:")
    print(resp.choices[0].message.content)

Tool Calling
The script finishes with tool calling. Because the server was started with --enable-auto-tool-choice and a tool-call parser, the model can emit structured tool calls. The loop is the standard OpenAI one: the model requests a call, you run it, feed the result back, and the model answers.

    import json
    import random


    def get_weather(city):
        rng = random.Random(city.lower())  # same city, same weather
        temp = rng.randint(-5, 35)
        sky = rng.choice(["sunny", "cloudy", "rainy", "foggy", "windy"])
        return f"{temp}°C and {sky} in {city}"


    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ]

    messages = [{"role": "user", "content": "Write a haiku about the current weather in Paris."}]
    resp = client.chat.completions.create(model="gemma-4-moe", messages=messages, tools=tools, max_tokens=256)
    msg = resp.choices[0].message

    if msg.tool_calls:
        # echo the assistant's tool-call message back into the history
        messages.append(msg.model_dump(exclude_none=True))
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = get_weather(**args)
            print(f"\ntool call: {call.function.name}({args})")
            print(f"result: {result}")
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
        # second round trip: the model answers using the tool results
        resp = client.chat.completions.create(model="gemma-4-moe", messages=messages, max_tokens=256)
        print("final:")
        print(resp.choices[0].message.content)

5. Access and Authentication
Two independent layers decide who can reach the model: Daytona’s preview proxy in front of the sandbox, and vLLM’s own API key check inside it. The guide has used one mode of the proxy layer so far; here is the full picture.
The Daytona Layer: Preview Links
Every request so far carried the preview token as a header. That is the default mode and the best fit for code you control, since the secret stays out of URLs, logs, and browser history. Preview links support two more modes for when the header is a poor fit.
Signed URLs embed a short-lived token in the URL itself:

    signed = sb.create_signed_preview_url(PORT, expires_in_seconds=3600)
    print(signed.url)  # no headers needed, expires after an hour

Anything that accepts only a base URL can now call the model: chat frontends, no-code tools, a colleague’s notebook. And because the URL expires on schedule, sharing it is a bounded commitment rather than a permanent grant. Two details to know: the default expiry is only 60 seconds, so pass expires_in_seconds explicitly, and sb.expire_signed_preview_url(PORT, signed.token) revokes a URL early.
Public previews drop the proxy’s authentication entirely. Create the sandbox with public=True in the create params, and the preview URL serves anyone who has it, for as long as the sandbox stays up.
The vLLM Layer: API Keys
Every query example sets api_key="EMPTY". That is because vLLM, unless told otherwise, accepts any key; the field exists only to satisfy client constructors. Add --api-key your-secret-key to the vllm serve command (or set VLLM_API_KEY in the sandbox’s env_vars) and the check becomes real: the server requires Authorization: Bearer your-secret-key, which is exactly what OpenAI-compatible clients send as their api_key.
The two layers do not guard the same surface, though. The vLLM key covers the inference routes (/v1 and similar prefixes), while other endpoints on the same server accept requests without it. The Daytona token gates everything on the port.
That makes public preview plus vLLM API key a combination for sharing with people you broadly trust: the endpoint behaves like a standard OpenAI-style API, configured with nothing but a base URL and an api_key, so it works in any tool that accepts only those two fields. For anything more exposed than that, the Daytona layer is the better fit.
| Setup | Client needs | Good for |
|---|---|---|
| Preview token header (guide default) | base URL + custom header | your own code |
| Signed URL | URL only; expires on schedule | temporary sharing |
| Public preview + vLLM API key | base URL + api_key | pointing existing apps at your model |
| Public preview, no key | base URL only | quick demos |
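The table can be read as a mapping from setup to client configuration. The sketch below expresses that mapping as a function returning keyword arguments in the shape the OpenAI SDK constructor accepts (base_url, api_key, default_headers). The setup names are hypothetical labels invented for this example, and it assumes a signed URL can be used as a base URL with /v1 appended, as the text above implies.

```python
from typing import Optional


def client_settings(
    setup: str,
    base_url: str,
    preview_token: Optional[str] = None,
    vllm_api_key: Optional[str] = None,
) -> dict:
    """Return keyword arguments for an OpenAI-style client constructor."""
    settings = {"base_url": base_url.rstrip("/") + "/v1", "api_key": "EMPTY"}
    if setup == "preview_token":
        # Guide default: the Daytona token travels in a custom header.
        if not preview_token:
            raise ValueError("preview_token is required for this setup")
        settings["default_headers"] = {"x-daytona-preview-token": preview_token}
    elif setup == "signed_url":
        pass  # the signed URL itself carries the short-lived token
    elif setup == "public_with_key":
        # Public preview: vLLM's own key check is the only gate.
        if not vllm_api_key:
            raise ValueError("vllm_api_key is required for this setup")
        settings["api_key"] = vllm_api_key
    elif setup == "public":
        pass  # no authentication at either layer
    else:
        raise ValueError(f"unknown setup: {setup!r}")
    return settings
```

A client would then be created with OpenAI(**client_settings("preview_token", endpoint, preview_token=token)). Only the first setup needs a custom header, which is why the other three work in tools that accept just a base URL and an api_key.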
Inside the Sandbox
Everything above governs requests arriving from outside. Code running inside the sandbox can skip all of it and talk to http://localhost:8000 directly. The vLLM image ships Python with the openai package preinstalled (it is a vLLM dependency), so the SDK works there as-is:

    from daytona import Daytona, DaytonaConfig

    daytona = Daytona(DaytonaConfig(target="us-east-1"))
    sb = daytona.get("sandbox-id")  # printed by serve_vllm.py

    print(sb.process.code_run("""
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="gemma-4-moe",
        messages=[{"role": "user", "content": "Write a haiku about code that never leaves its sandbox."}],
        max_tokens=64,
    )
    print(resp.choices[0].message.content)
    """).result)

No auth is needed because the traffic never leaves the sandbox. Anything the image doesn’t ship, like litellm, install first with sb.process.exec("pip install litellm").
This colocated shape fits workloads where the data should live next to the model: batch inference (upload a dataset into the sandbox, process it against the local endpoint, download the results) or a self-contained agent that calls the local model and runs the code it writes, all in one sandbox.
6. Swapping Models
To serve a different model, change three things in serve_vllm.py:
- MODEL: the Hugging Face model ID
- SERVED_AS: the name clients will pass as model
- The --tool-call-parser and --reasoning-parser flags, which are model-family specific
For gated models, set HF_TOKEN in your .env; the script forwards it into the sandbox automatically. Keep in mind that the model has to fit on a single GPU, since sandboxes have at most one.
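One way to keep those three changes in a single place is to build the vllm serve command from parameters. The sketch below is an assumption about how you might restructure the script, not how serve_vllm.py is written; build_serve_command is a hypothetical helper. It uses only flags that already appear in this guide, and the parser flags are added only when a parser name is given.

```python
import shlex
from typing import Optional


def build_serve_command(
    model: str,
    served_as: str,
    port: int = 8000,
    tool_call_parser: Optional[str] = None,
    reasoning_parser: Optional[str] = None,
) -> str:
    """Assemble a shell-safe `vllm serve` command line."""
    args = [
        "vllm", "serve", model,
        "--port", str(port),
        "--served-model-name", served_as,
        "--enable-prefix-caching",
    ]
    if tool_call_parser:
        # Automatic tool choice only makes sense together with a parser.
        args += ["--enable-auto-tool-choice", "--tool-call-parser", tool_call_parser]
    if reasoning_parser:
        args += ["--reasoning-parser", reasoning_parser]
    return shlex.join(args)
```

Calling build_serve_command("google/gemma-4-26B-A4B-it", "gemma-4-moe", tool_call_parser="gemma4", reasoning_parser="gemma4") yields the same flags the guide uses, in a different order. For another model family, look up the parser names in the vLLM documentation rather than guessing them.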
7. Going Further: Sandboxes as Tool Runtimes
In the tool calling example above, get_weather runs in the same process as the client. Daytona makes a stronger pattern natural: give each chat session its own CPU sandbox, and execute the model’s tool calls there. The GPU sandbox keeps serving every session, while each conversation gets an isolated runtime where model-written code can run, install packages, and touch files without affecting anyone else. When the session ends, delete its sandbox.
Only the tool function’s body changes; instead of computing the result locally, it runs the model’s request in the session’s sandbox, with sandbox.process.code_run(...) for code or sandbox.process.exec(...) for shell commands, and returns the output. The sandbox can also carry whatever harness the tools need: interpreters, test runners, project dependencies. The tool-calling loop stays exactly the same. Both halves of the application run on Daytona: the GPU sandbox where the model thinks, and the CPU sandboxes where its decisions execute.
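A minimal sketch of that shape, under stated assumptions: the tool body delegates to a callable that stands in for the session sandbox, and a small dispatcher routes each tool call by name. make_run_python_tool, dispatch_tool_call, and the run_python tool name are hypothetical; the runner is injected so the sketch works without a sandbox.

```python
import json
from typing import Callable, Dict


def make_run_python_tool(run_in_sandbox: Callable[[str], str]) -> Callable[..., str]:
    """Build a tool whose body runs code remotely instead of locally."""
    def run_python(code: str) -> str:
        return run_in_sandbox(code)
    return run_python


def dispatch_tool_call(
    tools: Dict[str, Callable[..., str]], name: str, arguments: str
) -> str:
    """Run one tool call and return the content for the tool-role message.

    `arguments` is the JSON string the model produced for the call.
    Errors are returned as text so the model can see and react to them.
    """
    if name not in tools:
        return f"error: unknown tool {name!r}"
    try:
        args = json.loads(arguments)
        return str(tools[name](**args))
    except (TypeError, ValueError) as exc:
        # ValueError covers malformed JSON; TypeError covers bad argument names.
        return f"error: {exc}"
```

In a real session the runner could be lambda code: session_sandbox.process.code_run(code).result, where session_sandbox is that conversation's CPU sandbox. Inside the tool-calling loop, each call becomes dispatch_tool_call(tools, call.function.name, call.function.arguments), and the rest of the loop is unchanged.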
8. Configuration Options
Constants at the top of serve_vllm.py:
| Parameter | Default | Description |
|---|---|---|
| MODEL | google/gemma-4-26B-A4B-it | Hugging Face model ID to serve |
| SERVED_AS | gemma-4-moe | Model name exposed by the API |
| VLLM_IMAGE | vllm/vllm-openai:v0.22.1 | vLLM Docker image |
| PORT | 8000 | Port the server listens on |
| TARGET | us-east-1 | Current region for GPU sandboxes |
| BOOT_TIMEOUT | 900 | Seconds to wait for the server to become healthy |
Key advantages of this approach:
- No infrastructure to manage: one script turns a stock Docker image into a served model on a GPU
- Fast: a live endpoint about five minutes after you run the script, no provisioning, no driver setup
- Reachable anywhere: the preview URL works from any machine, secured by a token header
- OpenAI-compatible: existing clients, SDKs, and frameworks work unchanged