Clustering Two NVIDIA DGX Sparks to Serve Qwen3-30B-Thinking with Ray + vLLM


TL;DR

We took two NVIDIA DGX Spark units, wired them together over a 200 GbE link, joined them into a single Ray cluster running inside vLLM containers, and now serve Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 with tensor parallelism across both boxes. One Spark holds shard 0, the other holds shard 1, Ray dispatches the work, and vLLM exposes an OpenAI-compatible endpoint on port 8000.

The trickiest part wasn't the networking or the orchestration. It was a one-line vLLM flag — --reasoning-parser deepseek_r1 — that we had to use instead of the obvious qwen3 parser, because the model emits its reasoning block without an opening <think> tag and the strict Qwen parser silently swallows the whole reasoning trace.

This post walks through the setup end to end, then covers the parser gotcha in detail.


Why two Sparks?

A single DGX Spark (GB10, SM121, 121 GB unified memory) is plenty for any 30B-class model in FP8 — we run Qwen3.6-35B-A3B-FP8 on a single node every day. But we wanted to:

  1. Validate the multi-node story end to end before we needed it for something bigger.
  2. Give Qwen3-30B-Thinking room to breathe: 128k context with the KV cache distributed across two boxes leaves much more memory for in-flight requests than the same model squeezed onto one.
  3. Have a real-world reference for the Ray + vLLM + 200 GbE pattern that we can scale to 4 or 8 Sparks later.

The model itself — Qwen3-30B-A3B-Thinking — is a Mixture-of-Experts with 30B total / 3B active params and an explicit "thinking" mode that emits a <think>…</think> reasoning trace before the answer.


The hardware path: 200 GbE between the Sparks

Each Spark has two QSFP cages on its ConnectX NIC, presented to Linux as enp1s0f0np0 and enp1s0f1np1. We cabled enp1s0f1np1 on Node 1 directly to the same port on Node 2 — no switch in the middle — and brought the link up with static IPs on a small /30.

To verify which interface is the live one, use ibdev2netdev:

$ ibdev2netdev
mlx5_0 port 1 ==> enp1s0f0np0 (Down)
mlx5_1 port 1 ==> enp1s0f1np1 (Up)

That (Up) line identifies the interface the head-node script pins to (it is hard-coded there as MN_IF_NAME). The whole Ray + NCCL + UCX stack binds to this interface so no traffic ever leaks onto the management NIC.
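
If you would rather not hard-code the interface name, the (Up) line can be picked out mechanically. The following is a minimal sketch, assuming the output format shown above; up_ifaces is a hypothetical helper name, not part of ibdev2netdev or run_cluster.sh.

```shell
# up_ifaces: read ibdev2netdev-style lines on stdin and print the netdev
# name of every port whose last field is "(Up)".
# Hypothetical helper; assumes lines shaped like:
#   mlx5_1 port 1 ==> enp1s0f1np1 (Up)
up_ifaces() {
  awk '$NF == "(Up)" { print $(NF - 1) }'
}
```

On a Spark it could be used as MN_IF_NAME=$(ibdev2netdev | up_ifaces | head -n1), followed by a check that the result is not empty.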


Starting the Ray head node

On Node 1, we run /usr/local/bin/run_headnode.sh. It's intentionally tiny — the real plumbing is in the upstream run_cluster.sh helper from the vLLM repo; the wrapper just resolves the IP and forwards the right env vars:

# /usr/local/bin/run_headnode.sh
export MN_IF_NAME=enp1s0f1np1
export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME \
  | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
export VLLM_IMAGE=nvcr.io/nvidia/vllm:25.11-py3

echo "Using interface $MN_IF_NAME with IP $VLLM_HOST_IP"

bash run_cluster.sh $VLLM_IMAGE $VLLM_HOST_IP --head ~/.cache/huggingface \
  -e VLLM_HOST_IP=$VLLM_HOST_IP \
  -e UCX_NET_DEVICES=$MN_IF_NAME \
  -e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
  -e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
  -e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
  -e TP_SOCKET_IFNAME=$MN_IF_NAME \
  -e RAY_memory_monitor_refresh_ms=0 \
  -e MASTER_ADDR=$VLLM_HOST_IP

A few things worth pointing out:

  • Every transport gets pinned to the same NIC. UCX, NCCL, OMPI, Gloo, and vLLM's TP socket all read separate env vars. Setting only one of them is the classic mistake — the other libraries happily fall back to eth0 and you get a healthy-looking cluster that runs at 1 GbE speeds. Setting all five means every byte goes over the 200 GbE link.
  • ~/.cache/huggingface is bind-mounted so the model weights are pre-staged on disk (run hf download Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 once before you start) and shared between the host and the container.
  • RAY_memory_monitor_refresh_ms=0 disables Ray's OOM killer. On Spark's unified memory, Ray's heuristic is too aggressive and kills the vLLM worker before it's finished allocating KV cache.
  • The container itself runs nvcr.io/nvidia/vllm:25.11-py3. NGC's vLLM image already has the right CUDA / NCCL / FlashInfer stack for GB10 — building your own from PyPI is a world of pain we don't recommend.
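
Because a half-pinned cluster looks healthy, it can help to assert the pinning explicitly. Below is a minimal sketch of such a check, meant to be run in a shell inside the container (where the -e variables are visible); check_pinned is a hypothetical helper, and the variable list is simply the five names used in the wrapper script.

```shell
# check_pinned IFACE: succeed only if all five transport variables named in
# the wrapper script are set to IFACE. Prints each mismatch to stderr.
check_pinned() {
  expected="$1"
  rc=0
  for var in UCX_NET_DEVICES NCCL_SOCKET_IFNAME OMPI_MCA_btl_tcp_if_include \
             GLOO_SOCKET_IFNAME TP_SOCKET_IFNAME; do
    eval "actual=\${$var-}"   # portable indirect lookup of the variable's value
    if [ "$actual" != "$expected" ]; then
      echo "MISMATCH: $var='$actual' (expected '$expected')" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Calling check_pinned enp1s0f1np1 returns non-zero and names the offending variable if any of the five is missing or points elsewhere.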

When this script finishes you have a container named node-0 (or node-1 depending on the cluster helper's counter) running on Node 1, with Ray's head process listening on port 6379.


Starting the worker

On Node 2 we have the symmetric script. It's the same run_cluster.sh invocation, but --head is replaced with --worker, and the second positional arg is the head node's IP, not its own:

# /usr/local/bin/run_workernode.sh (on Node 2)
# On Node 2, join as worker

# Set the interface name (same as Node 1)
export MN_IF_NAME=enp1s0f1np1

# Get Node 2's own IP address
export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')

# IMPORTANT: Set HEAD_NODE_IP to Node 1's IP address
export HEAD_NODE_IP=10.0.0.3

# Set vLLM image
export VLLM_IMAGE=nvcr.io/nvidia/vllm:25.11-py3

echo "Worker IP: $VLLM_HOST_IP, connecting to head node at: $HEAD_NODE_IP"

bash run_cluster.sh $VLLM_IMAGE $HEAD_NODE_IP --worker ~/.cache/huggingface \
  -e VLLM_HOST_IP=$VLLM_HOST_IP \
  -e UCX_NET_DEVICES=$MN_IF_NAME \
  -e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
  -e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
  -e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
  -e TP_SOCKET_IFNAME=$MN_IF_NAME \
  -e RAY_memory_monitor_refresh_ms=0 \
  -e MASTER_ADDR=$HEAD_NODE_IP

The same interface-pinning rules apply on the worker. Two things to notice:

  • HEAD_NODE_IP=10.0.0.3 is the head node's address on the 200 GbE /30 we put on enp1s0f1np1, not its management IP. If you point this at the wrong NIC the workers will still find Ray, but every NCCL collective will quietly fall back to the slow path.
  • MASTER_ADDR=$HEAD_NODE_IP here, not the worker's own IP. Ray uses it to elect the rank-0 coordinator for the vLLM TP group; getting this wrong on the worker is a common reason tensor-parallel init hangs forever at startup.
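
One way to catch the wrong-NIC case before NCCL does is to ask the kernel which device it would use to reach the head node. This is a minimal sketch, assuming the usual shape of ip route get output (the target address followed somewhere by "dev <name>"); route_dev is a hypothetical helper.

```shell
# route_dev: read `ip route get <addr>` output on stdin and print the name
# of the outgoing device (the word that follows "dev").
route_dev() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}
```

On the worker, something like [ "$(ip route get "$HEAD_NODE_IP" | route_dev)" = "$MN_IF_NAME" ] || echo "head IP is not routed via $MN_IF_NAME" >&2 would flag a HEAD_NODE_IP that resolves over the management network.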

To confirm the two nodes have actually joined, exec into either container and run ray status. We have that wrapped in /usr/local/bin/ray_inference_health.sh:

export VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$')
docker exec $VLLM_CONTAINER ray status
curl http://127.0.0.1:8000/health
docker exec $VLLM_CONTAINER nvidia-smi --query-gpu=memory.used,memory.total --format=csv

You should see two nodes, two GPUs total, and one resource type called GPU with value 2. If you see only one, the worker isn't reaching the head — usually because something on the network path is firewalled or the wrong interface got picked up.
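
The "GPU with value 2" check can be scripted so the launch fails early when the worker has not joined. This is a rough sketch, assuming ray status prints a resource line shaped like " 0.0/2.0 GPU"; the exact layout may differ between Ray versions, and gpu_total is a hypothetical helper.

```shell
# gpu_total: read `ray status` output on stdin and print the cluster-wide
# GPU total as an integer. Assumes a resource line such as " 0.0/2.0 GPU".
gpu_total() {
  awk '$2 == "GPU" { split($1, parts, "/"); print int(parts[2]); exit }'
}
```

A guard such as [ "$(docker exec "$VLLM_CONTAINER" ray status | gpu_total)" = "2" ] || echo "expected 2 GPUs in the cluster" >&2 could then sit at the top of the launch script.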


Launching vLLM inside the Ray cluster

With Ray up and both nodes joined, the actual model launch is a docker exec into the head container. That's ~/docker/vllm/qwen36/launch.sh:

#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
if [[ -f "${SCRIPT_DIR}/.env" ]]; then
  set -a; source "${SCRIPT_DIR}/.env"; set +a
fi
: "${VLLM_API_KEY:?VLLM_API_KEY not set (expected in .env)}"

VLLM_CONTAINER=$(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$' | head -n1)
if [[ -z "${VLLM_CONTAINER}" ]]; then
  echo "No node-* container running. Start the Ray cluster first." >&2
  exit 1
fi

docker exec -it -e VLLM_API_KEY="${VLLM_API_KEY}" "${VLLM_CONTAINER}" /bin/bash -c '
  set -e
  exec vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 \
    --served-model-name qwen3_30b_thinking \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 2 \
    --max-num-seqs 8 \
    --max-model-len 131072 \
    --max-num-batched-tokens 32768 \
    --gpu-memory-utilization 0.70 \
    --enable-prefix-caching \
    --reasoning-parser deepseek_r1 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes
'

Key decisions:

  • --tensor-parallel-size 2: Ray sees two GPUs across two nodes and shards one half of the model to each. vLLM doesn't need --pipeline-parallel-size; TP across two nodes is what we want for a 30B MoE.
  • --max-model-len 131072: the model supports 256k natively, but 128k is what we actually need and it leaves comfortable KV headroom.
  • --gpu-memory-utilization 0.70: conservative on purpose. Spark's unified memory layout means GPU allocations and the host can fight; 70% reliably avoids OOM under load.
  • --enable-prefix-caching: a large win for system-prompted workloads, which is most of ours.
  • --enable-auto-tool-choice --tool-call-parser hermes: Qwen3 thinking models emit tool calls in the Hermes format. Auto-choice lets the model decide when to call a tool vs. answer directly.
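
As a back-of-envelope check on the 0.70 setting, the per-node budget can be worked out from figures already quoted in this post. This is a rough sketch, assuming FP8 weights cost about 1 byte per parameter and split evenly across the two nodes. It ignores activations, CUDA graphs, and any layers kept in higher precision, so the remainder is an upper bound on KV-cache space, not a measurement. mem_budget is a hypothetical helper.

```shell
# mem_budget TOTAL_GB UTIL PARAMS_B TP
# Prints: <budget per node> <approx. weights per node> <remainder>, in GB.
# Assumption: ~1 byte per parameter for FP8 weights, sharded evenly over TP.
mem_budget() {
  awk -v total="$1" -v util="$2" -v params="$3" -v tp="$4" 'BEGIN {
    budget  = total * util      # memory vLLM may claim on each node
    weights = params / tp       # GB of weights per shard at 1 byte/param
    printf "%.1f %.1f %.1f\n", budget, weights, budget - weights
  }'
}

mem_budget 121 0.70 30 2
```

With the numbers from this post it prints 84.7 15.0 69.7: roughly 85 GB claimed per node, about 15 GB of that for weights, and the rest available to KV cache and runtime overhead.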

Secrets: .env next to launch.sh

launch.sh sources ~/docker/vllm/qwen36/.env with set -a so every variable is exported into the environment. The file contains exactly two values:

# ~/docker/vllm/qwen36/.env  (mode 0600, .gitignored)
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
VLLM_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

  • HF_TOKEN is needed the first time the FP8 weights are downloaded; after that they live in ~/.cache/huggingface and the token is no longer used.
  • VLLM_API_KEY is the bearer token clients must send as Authorization: Bearer …. vLLM picks it up automatically when the env var is set inside the serving process — that's why launch.sh passes it through with docker exec -e VLLM_API_KEY=….

The : "${VLLM_API_KEY:?…}" guard fails the script fast if the .env is missing or empty, instead of starting an open server. We keep this file chmod 600, never check it into git, and rotate both secrets when anyone leaves the team.
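
The file-permission and "both values present" rules can be checked before launch as well. The following is a minimal sketch under the conventions described above (mode 600, the two keys HF_TOKEN and VLLM_API_KEY); check_env_file is a hypothetical helper and is not part of launch.sh.

```shell
# check_env_file PATH: succeed only if PATH exists, is exactly mode 600,
# and defines non-empty HF_TOKEN and VLLM_API_KEY entries.
check_env_file() {
  f="$1"
  [ -f "$f" ] || { echo "missing: $f" >&2; return 1; }
  # find prints the path only when the permission bits are exactly 600.
  [ -n "$(find "$f" -prune -perm 600)" ] || { echo "$f is not mode 600" >&2; return 1; }
  for key in HF_TOKEN VLLM_API_KEY; do
    grep -Eq "^${key}=.+" "$f" || { echo "$key missing or empty in $f" >&2; return 1; }
  done
}
```

It could be called as check_env_file "${SCRIPT_DIR}/.env" || exit 1 just before the file is sourced.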


The gotcha: <think> and the wrong reasoning parser

Here's the one that cost us an afternoon.

Qwen3-30B-Thinking emits its reasoning trace inside <think>…</think> tags. vLLM ships a parser called qwen3 that's purpose-built for this format. The obvious thing to do is:

--reasoning-parser qwen3

Don't. With this model and the FP8 image we run, the output stream looks like this:

…this is a multi-step problem, let me think about prime factorisation…
…</think>
The answer is 42.

Notice what's not there: an opening <think> tag. The model's chat template adds it as the last token of the prompt, so generation begins already inside the thinking block. The first thing the model ever emits is the reasoning text itself — the opening tag is implicit. The qwen3 parser is strict and waits for an opening tag that never arrives. The whole reasoning trace ends up in the wrong field of the OpenAI response (or gets dropped entirely, depending on the vLLM version), and clients see an empty reasoning_content followed by content that starts with </think>.

The fix is to use DeepSeek-R1's parser instead:

--reasoning-parser deepseek_r1

deepseek_r1 is lenient about the opening tag — it treats everything from the start of the response up to a closing </think> as reasoning, and everything after as the answer. That happens to be exactly the shape Qwen3-Thinking produces in practice. Reasoning lands in choices[0].message.reasoning_content, the answer lands in choices[0].message.content, and tool-calls round-trip correctly through the Hermes parser.
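
To make the lenient behaviour concrete, here is a small shell illustration of the split described above. It is a sketch of the idea only, not vLLM's parser code; split_reasoning is a hypothetical helper, and treating output with no closing tag as all reasoning is an assumption of this sketch.

```shell
# split_reasoning RAW: mimic a lenient reasoning split on a raw completion.
# Everything before the first </think> is reasoning, everything after it is
# the answer; an optional leading <think> is dropped.
split_reasoning() {
  raw="$1"
  open='<think>'
  close='</think>'
  case "$raw" in
    *"$close"*)
      reasoning="${raw%%"$close"*}"   # text before the first closing tag
      content="${raw#*"$close"}"      # text after the first closing tag
      ;;
    *)
      # Assumption: with no closing tag, treat the whole output as reasoning.
      reasoning="$raw"
      content=""
      ;;
  esac
  reasoning="${reasoning#"$open"}"    # tolerate an explicit opening tag too
  printf 'reasoning_content=%s\n' "$reasoning"
  printf 'content=%s\n' "$content"
}
```

Feeding it a completion with or without the opening tag gives the same two fields, which is the property that makes the lenient parser a fit for this model's output.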

If you ever see truncated or missing reasoning content with a Qwen thinking model, this is almost certainly the cause. The strict-vs-lenient mismatch isn't documented prominently in either project, but it's load-bearing once you wire the model into a real client.


What we ended up with

  • Two DGX Sparks, one 200 GbE cable between them.
  • Ray cluster running inside nvcr.io/nvidia/vllm:25.11-py3 containers, networking pinned to the high-speed NIC across UCX / NCCL / OMPI / Gloo / vLLM-TP.
  • vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507-FP8 with TP=2, 128k context, prefix caching, Hermes tool calls.
  • OpenAI-compatible endpoint on port 8000 with a real bearer token.
  • Inference at multi-node throughput, with the reasoning trace surfacing cleanly thanks to one carefully chosen parser flag.

Next stop: scaling the same recipe to four Sparks for a 70B-class model. The wrapper scripts barely change — just --tensor-parallel-size 4 and one more --worker invocation per box.


Appendix: useful one-liners

# Health check
curl -s -H "Authorization: Bearer $VLLM_API_KEY" http://localhost:8000/v1/models | jq .

# Ray topology from inside the container
docker exec $(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$') ray status

# Watch GPU memory on both Sparks (run on each)
watch -n 1 'nvidia-smi --query-gpu=memory.used,memory.total --format=csv'

# Tail the container's log (Ray's output; vllm serve, started via docker exec, logs to the terminal that ran launch.sh)
docker logs -f $(docker ps --format '{{.Names}}' | grep -E '^node-[0-9]+$' | head -n1)