Running GPT-OSS-120B on a Single NVIDIA DGX Spark - A Practical Guide
Note on the model name: OpenAI’s open-weight family ships as gpt-oss-20b and gpt-oss-120b. There is no 130B variant; this guide targets gpt-oss-120b, which is the one sized to fit the Spark’s unified memory.
A practical, single-node setup guide for serving gpt-oss-120b as a local coding backend on the GB10 Grace Blackwell DGX Spark, and wiring it into Claude Code.
1. Why this model fits the Spark
The DGX Spark has 128 GB of coherent unified LPDDR5x (~119.7 GB addressable by the GPU) but only ~273 GB/s of memory bandwidth. Token generation is bandwidth-bound, so bandwidth — not capacity — is the limiting factor.
gpt-oss-120b is a good match for two reasons:
- It fits. In its native MXFP4 weight format the full model loads into the ~120 GB unified pool with room left for KV cache.
- It’s a sparse MoE. The model has ~117B total parameters but activates only ~5.1B per token. Generation speed scales with active parameters against bandwidth, so it runs far faster than a dense model of comparable footprint.
For reference, on the same box a dense ~32B model is bandwidth-starved (~9–10 tok/s), while small-active MoE models run several times faster. Published gpt-oss-120b results on the Spark land around ~50 tokens/s on an optimized engine (SGLang), which is usable for an interactive coding agent.
Rule of thumb for the Spark: prefer MoE models with low active-parameter counts; avoid large dense models.
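The bandwidth argument above can be turned into a back-of-envelope ceiling: if every active weight is read once per generated token, decode speed cannot exceed bandwidth divided by the bytes read per token. The sketch below is an estimate under stated assumptions, not a benchmark. The bits-per-parameter figures (4.25 for MXFP4, 4.5 for a 4-bit-class dense quant) are assumptions, and the function name is hypothetical.

```python
def decode_ceiling_tok_s(active_params: float, bits_per_param: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each active weight is read once per token."""
    bytes_per_token = active_params * bits_per_param / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


SPARK_BANDWIDTH_GB_S = 273  # figure quoted in this guide

# gpt-oss-120b: ~5.1B active parameters; 4.25 bits/param is an ASSUMED MXFP4 average
moe = decode_ceiling_tok_s(5.1e9, 4.25, SPARK_BANDWIDTH_GB_S)

# Dense ~32B model; 4.5 bits/param is an ASSUMED 4-bit-class quantization
dense = decode_ceiling_tok_s(32e9, 4.5, SPARK_BANDWIDTH_GB_S)

print(f"MoE ceiling:   {moe:.0f} tok/s")
print(f"Dense ceiling: {dense:.1f} tok/s")
```

Measured throughput lands well below these ceilings because attention, KV cache reads, and kernel overheads are ignored, but the ratio between the two numbers is the useful part: it is why active parameters, not total size, predict speed on this box.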
2. Prerequisites
| Requirement | Detail |
|---|---|
| Hardware | NVIDIA DGX Spark (GB10), 128 GB unified memory |
| OS | DGX OS (Ubuntu-based, ARM64 / aarch64) |
| GPU stack | CUDA + drivers preinstalled on DGX OS; Blackwell compute capability sm_121 |
| Firmware | Update to a current firmware version before serving (see §6) |
| Disk | The 120B weights are large (~60+ GB on disk); the 4 TB NVMe is fine, but watch free space if you keep multiple quants |
| Access | A Hugging Face account + access token for openai/gpt-oss-120b |
Set your token once:
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxx"
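Before starting a download of this size, a quick preflight check can save a failed pull. The following is a minimal sketch using only the standard library; the function name and the 80 GB headroom figure are assumptions for illustration, chosen to sit above the ~60+ GB weights.

```python
import os
import shutil


def preflight(download_dir: str, needed_gb: float, env=os.environ) -> list[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")
    free_gb = shutil.disk_usage(download_dir).free / 1e9
    if free_gb < needed_gb:
        problems.append(f"only {free_gb:.0f} GB free, want {needed_gb:.0f} GB")
    return problems


# 80 GB is an ASSUMED headroom figure, not a documented requirement
for problem in preflight(".", needed_gb=80):
    print("WARNING:", problem)
```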
3. Pick an inference engine
Three viable paths, from easiest to highest-throughput. All three serve an HTTP API you can point a client at.
| Engine | Effort | API exposed | Best for |
|---|---|---|---|
| Ollama | Lowest | OpenAI-compatible | Quick start, single user |
| llama.cpp | Medium | OpenAI-compatible | Control, tuning, GGUF quants |
| SGLang | Higher | OpenAI-compatible (+ Anthropic-compatible via proxy) | Best measured throughput on Spark |
Community testing on the Spark consistently recommends llama.cpp or SGLang over Ollama for throughput on this hardware. Use Ollama to confirm everything works, then move to llama.cpp/SGLang for daily use.
4. Option A — Ollama (fastest to first token)
# Pull and run; Ollama fetches the official MXFP4 build
ollama pull gpt-oss:120b
ollama run gpt-oss:120b
Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1.
Caveats:
- Ollama defaults to a 4096-token context. Raise it for real coding work, for example with the num_ctx parameter in a Modelfile or the OLLAMA_CONTEXT_LENGTH environment variable.
- Performance is acceptable for testing but typically below a tuned llama.cpp/SGLang setup.
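The context caveat above can also be handled per request. As a sketch, assuming Ollama's native /api/chat endpoint and its options.num_ctx field, the following builds a request with a larger context window. The helper names are hypothetical, and the 16384 default is just the starting point this guide suggests elsewhere.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def build_chat_request(prompt: str, num_ctx: int = 16384,
                       model: str = "gpt-oss:120b") -> urllib.request.Request:
    """Build a non-streaming chat request that overrides the context length."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},  # per-request context window
        "stream": False,
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        return json.load(resp)["message"]["content"]
```

With the server running, chat("Say OK.", num_ctx=8192) should return a short reply. Changing num_ctx between requests may force Ollama to reload the model, so pick one value and keep it.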
5. Option B — llama.cpp (recommended for control)
Build llama.cpp with CUDA support for the Blackwell GPU, then serve a GGUF build of the model.
# Flags, in order:
#   -c 16384         context length; tune to your workload (see notes)
#   -ngl 999         offload all layers to the Blackwell GPU
#   --flash-attn on  enable flash attention
#   --no-mmap        see mmap note below
#   --kv-unified     single shared KV buffer
#   --jinja          use the model's chat template
#   -ub 2048         micro-batch size for prompt processing
~/llama.cpp/build/bin/llama-server \
  -m ~/.cache/llama.cpp/gpt-oss-120b/gpt-oss-120b.gguf \
  -c 16384 \
  -ngl 999 \
  --flash-attn on \
  --no-mmap \
  --kv-unified \
  --jinja \
  -ub 2048 \
  --host 0.0.0.0 \
  --port 8005
Flag rationale:
- -ngl 999: force all layers onto the GPU. On unified memory this keeps everything in the fast path.
- --no-mmap: there is a known mmap issue on the Spark that inflates model load time (reported ~5×). Disabling mmap fixes load times.
- --flash-attn on: standard attention speedup for transformer inference.
- -c (context): directly trades off against memory and speed. Larger context grows the KV cache and reduces tok/s. On a comparable small-active MoE, throughput dropped from ~20–25 tok/s at 16K context to ~15–17 tok/s at 32K. Start at 16K and only raise it if your task needs it.
- -ub 2048: larger micro-batch improves prompt-processing (prefill) throughput.
Endpoint: http://<spark-ip>:8005/v1 (OpenAI-compatible).
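To make the context trade-off in the flag rationale concrete, the KV cache grows linearly with context length: two tensors (keys and values) per layer, per KV head, per token. The sketch below computes that size. The architecture numbers (36 layers, 8 KV heads, head dimension 64) are assumptions to be checked against the model's config.json, and a 16-bit cache is assumed.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Full-attention KV cache size for one sequence (factor 2 = keys + values)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_tokens


# ASSUMED gpt-oss-120b shape; verify against config.json before relying on it
ASSUMED = dict(n_layers=36, n_kv_heads=8, head_dim=64)

for ctx in (16384, 32768, 131072):
    gib = kv_cache_bytes(ctx, **ASSUMED) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.2f} GiB")
```

Treat the result as an upper bound per sequence: layers that use sliding-window attention cache fewer tokens, and a quantized KV cache shrinks it further. The linear growth is what matters, since every cached token is also read back during decode.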
6. Option C — SGLang (highest measured throughput)
SGLang has explicit DGX Spark support and produced the best published gpt-oss-120b numbers (~50 tok/s).
General shape (consult the current SGLang DGX Spark docs for exact flags/container):
# Launch the SGLang server pointing at the 120B weights
python -m sglang.launch_server \
--model-path openai/gpt-oss-120b \
--host 0.0.0.0 \
--port 30000
Notes:
- The 120B is ~6× the size of the 20B build, so expect longer load times.
- For stability on the larger model, enabling swap memory on the Spark is recommended.
- Endpoint: http://<spark-ip>:30000/v1.
Firmware: keep DGX OS and the system firmware current before serving; updates can be applied from the DGX Dashboard or from the CLI.
7. Verify the server
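OpenAI-compatible smoke test against whichever engine you started. The example targets the llama.cpp port (8005); substitute 11434 for Ollama or 30000 for SGLang, and use the model name your engine registered (gpt-oss:120b on Ollama):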
curl http://localhost:8005/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-oss-120b",
"messages": [{"role": "user", "content": "Write a Python function that returns the nth Fibonacci number."}],
"max_tokens": 256
}'
A coherent code response confirms the model is loaded and serving.
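The same smoke test can be scripted so it works against any of the three engines. This is a minimal sketch using only the standard library; the port map comes from the sections above, while the helper names are hypothetical.

```python
import json
import urllib.request

# Ports as configured earlier in this guide
ENGINE_PORTS = {"ollama": 11434, "llama.cpp": 8005, "sglang": 30000}


def build_request(engine: str, model: str, prompt: str,
                  host: str = "localhost", max_tokens: int = 256):
    """Build an OpenAI-style chat completion request for the chosen engine."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    url = f"http://{host}:{ENGINE_PORTS[engine]}/v1/chat/completions"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response: dict) -> str:
    """Extract the assistant text from an OpenAI-style response body."""
    return response["choices"][0]["message"]["content"] or ""


def smoke_test(engine: str, model: str, prompt: str = "Say OK.") -> str:
    """Send one request to a running server; the long timeout allows a cold start."""
    request = build_request(engine, model, prompt)
    with urllib.request.urlopen(request, timeout=300) as resp:
        return first_reply(json.load(resp))
```

For example, smoke_test("llama.cpp", "gpt-oss-120b") exercises the server from §5. An empty reply with a 200 status usually means the token budget was spent on reasoning, so raise max_tokens before suspecting the server.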
8. Wire it into Claude Code
Claude Code speaks the Anthropic /v1/messages API, while llama.cpp/Ollama/SGLang expose an OpenAI-compatible API. You therefore need one of:
- (a) An Anthropic-compatible endpoint, exposed directly by the engine or via a bridge, or
- (b) A translation gateway (e.g. LiteLLM) that accepts Anthropic-format requests and forwards them to your OpenAI-compatible server.
Point Claude Code at a custom endpoint with the ANTHROPIC_BASE_URL environment variable (this is the official mechanism for routing through a custom endpoint).
8a. Direct / bridged endpoint
If your server (or a thin bridge in front of it) presents an Anthropic-shaped /v1/messages endpoint:
ANTHROPIC_BASE_URL=http://localhost:8005 \
ANTHROPIC_AUTH_TOKEN=dummy \
ANTHROPIC_DEFAULT_OPUS_MODEL=gpt-oss-120b \
ANTHROPIC_DEFAULT_SONNET_MODEL=gpt-oss-120b \
ANTHROPIC_DEFAULT_HAIKU_MODEL=gpt-oss-120b \
claude
- ANTHROPIC_AUTH_TOKEN carries the bearer/gateway token (dummy works for an open local server that ignores auth).
- The ANTHROPIC_DEFAULT_*_MODEL variables map Claude Code’s Opus/Sonnet/Haiku tiers onto your single local model, so every tier resolves to gpt-oss-120b.
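If you launch Claude Code from a wrapper script rather than a shell, the same five variables can be assembled programmatically. The sketch below only builds the environment mapping; the function name is hypothetical, and the variable names are the ones shown above.

```python
import os


def claude_code_env(base_url: str, model: str, token: str = "dummy") -> dict:
    """Environment variables that route every Claude Code tier to one local model."""
    env = {
        "ANTHROPIC_BASE_URL": base_url,
        "ANTHROPIC_AUTH_TOKEN": token,
    }
    for tier in ("OPUS", "SONNET", "HAIKU"):
        env[f"ANTHROPIC_DEFAULT_{tier}_MODEL"] = model
    return env


def merged_env(base_url: str, model: str) -> dict:
    """Current process environment with the routing variables layered on top."""
    return {**os.environ, **claude_code_env(base_url, model)}
```

Passing merged_env("http://localhost:8005", "gpt-oss-120b") as the env argument of subprocess.run(["claude"], ...) is equivalent to the shell invocation above, and keeps the override scoped to that one process.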
8b. LiteLLM bridge (for OpenAI-only servers)
Run LiteLLM in front of llama.cpp/Ollama, register the model under claude-* aliases, then point Claude Code at LiteLLM’s URL with the same env vars as above. This is the established pattern for using a purely OpenAI-compatible local server with Claude Code on the Spark.
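To show what such a bridge does conceptually, here is a deliberately simplified sketch of the two translations involved: an Anthropic-format request into an OpenAI chat request, and the OpenAI response back into an Anthropic-shaped message. This is not LiteLLM's implementation. It handles text only, ignores tools and streaming, omits fields such as id and usage, and all function names are hypothetical.

```python
def _text(content) -> str:
    """Flatten Anthropic content (a string or a list of blocks) into plain text."""
    if isinstance(content, str):
        return content
    return "".join(b.get("text", "") for b in content if b.get("type") == "text")


def anthropic_to_openai(req: dict, target_model: str) -> dict:
    """Map a /v1/messages request body to a /v1/chat/completions body."""
    messages = []
    if req.get("system"):
        # Anthropic keeps the system prompt outside the message list
        messages.append({"role": "system", "content": _text(req["system"])})
    for m in req["messages"]:
        messages.append({"role": m["role"], "content": _text(m["content"])})
    return {
        "model": target_model,
        "messages": messages,
        "max_tokens": req["max_tokens"],
    }


def openai_to_anthropic(resp: dict, requested_model: str) -> dict:
    """Map a chat completion response back to an Anthropic-shaped message."""
    choice = resp["choices"][0]
    stop = "max_tokens" if choice.get("finish_reason") == "length" else "end_turn"
    return {
        "type": "message",
        "role": "assistant",
        "model": requested_model,
        "content": [{"type": "text", "text": choice["message"]["content"] or ""}],
        "stop_reason": stop,
    }
```

An agentic client like Claude Code relies heavily on tool calls and streaming, which is where the real translation work lies and why an existing gateway is preferable to a hand-rolled one.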
Persisting and a caching gotcha
Add the variables to ~/.bashrc/~/.zshrc, or to ~/.claude/settings.json under an env block.
Prefix-caching note: Claude Code injects a per-request attribution hash into the system prompt, which can defeat prefix caching and slow throughput. If your serving stack doesn’t handle this automatically, set:
{
"env": { "CLAUDE_CODE_ATTRIBUTION_HEADER": "0" }
}
in ~/.claude/settings.json.
Launch Claude Code and run a small prompt to confirm requests are routing to the Spark.
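Editing settings.json by hand risks clobbering keys that are already there. A small helper can merge new variables into the env block instead. This is a sketch under the assumption that the file is plain JSON with an optional top-level env object, as shown above; the function name is hypothetical.

```python
import json
from pathlib import Path


def set_claude_env(settings_path: Path, updates: dict) -> dict:
    """Merge variables into the "env" block, preserving all other settings."""
    if settings_path.exists():
        settings = json.loads(settings_path.read_text())
    else:
        settings = {}
    settings.setdefault("env", {}).update(updates)
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings
```

For example, set_claude_env(Path.home() / ".claude" / "settings.json", {"CLAUDE_CODE_ATTRIBUTION_HEADER": "0"}) applies the prefix-caching setting while leaving any existing keys untouched.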
9. Tuning checklist
- Context length is your main lever. Bigger context = bigger KV cache = lower tok/s and more memory. Right-size it per task (16K is a sane default; raise deliberately).
- Stay on MoE. Don’t swap in dense models on this box expecting similar speed.
- --no-mmap on llama.cpp to avoid the slow-load bug.
- Enable swap for stability when loading the 120B.
- One engine, one quant. Multiple large GGUF/quant copies fill the NVMe fast.
- Watch active-vs-total params, not total size, when predicting speed.
10. Honest expectations vs. “like Opus”
On a single Spark, gpt-oss-120b is the largest coherent, frontier-style reasoning/tool-use model that fits, and it is genuinely usable in a Claude Code loop at ~50 tok/s. It is not equivalent to a current frontier closed model. The open models that most directly rival top closed models on agentic coding are trillion-parameter MoEs (e.g. Kimi K2.x, DeepSeek V4-Pro, large GLM MoEs) — those do not fit on one Spark and would require clustering two Sparks over the ConnectX-7 200G link or different hardware.
If you want a coding-specialized alternative on the same box, Qwen3-Coder variants (e.g. 30B-A3B, or Qwen3-Coder-Next in FP8/NVFP4) are smaller-active MoEs that run faster and are widely used with Claude Code on the Spark.
Source anchors
- DGX Spark hardware (GB10, 128 GB unified, 273 GB/s, sm_121, DGX OS): NVIDIA / LMSYS / StorageReview reviews.
- gpt-oss-120b on Spark (~50 tok/s, SGLang support, fits 120 GB, swap recommendation): LMSYS DGX Spark + GPT-OSS posts, Ollama Spark performance blog.
- llama.cpp flags and the --no-mmap load-time bug, context-vs-throughput figures: community Spark engine write-ups.
- Dense-vs-MoE throughput contrast and “use llama.cpp / switch to MoE” guidance: NVIDIA developer forum.
- Claude Code routing (ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, ANTHROPIC_DEFAULT_*_MODEL, CLAUDE_CODE_ATTRIBUTION_HEADER): Claude Code authentication docs, vLLM Claude Code integration docs, LiteLLM bridge example.