Qwen3-4B / 8B / 32B

Qwen3-4B is the default openinfer model line: pure Rust + CUDA, no Python at build time or runtime, full-attention GQA, paged KV cache, prefix caching, CUDA Graph decode, optional pegaflow KV offload, and DSpark speculative decoding.

Launch

From the openinfer workspace root:

huggingface-cli download Qwen/Qwen3-4B --local-dir models/Qwen3-4B

export CUDA_HOME=/usr/local/cuda
cargo run --release

The default model path is models/Qwen3-4B, and openinfer-server is the workspace default member. To pass an explicit model path or port:

cargo run --release -p openinfer-server -- \
  --model-path models/Qwen3-4B \
  --port 8000

The server exposes an OpenAI-compatible /v1/completions endpoint:

curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "models/Qwen3-4B", "prompt": "The capital of France is", "max_tokens": 32}'

Streaming:

curl -N http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "models/Qwen3-4B", "prompt": "Write a haiku about Rust:", "max_tokens": 64, "stream": true}'

Useful Qwen3 flags:

# Disable CUDA Graph for debugging
cargo run --release -- --cuda-graph=false

# Pure host-tier KV offload benchmark mode
cargo run --release -- \
  --kv-offload \
  --kv-offload-host-gib 16 \
  --no-prefix-cache

# DSpark speculative decoding (greedy, single-GPU)
cargo run --release -- \
  --model-path models/Qwen3-4B \
  --dflash-draft-model-path models/dspark_qwen3_4b_block7

Qwen3-8B

Qwen3-8B uses the same architecture at a larger width (4096 hidden, 12288 intermediate, 36 layers) and runs on the same single GPU: just point --model-path at the 8B weights. No feature flags or build changes are needed.

cargo run --release -- --model-path models/Qwen3-8B

Qwen3-32B

Qwen3-32B’s BF16 weights (~63 GB) need a single large-VRAM GPU (GH200/H200 class).

huggingface-cli download Qwen/Qwen3-32B --local-dir models/Qwen3-32B

cargo run --release -- --model-path models/Qwen3-32B

Tool calling goes through /v1/chat/completions with a tools array; a get_weather round-trip returns:

{"choices":[{"message":{"role":"assistant","tool_calls":[{"function":{"name":"get_weather",
  "arguments":"{\"city\": \"Paris\"}"}}]},"finish_reason":"tool_calls"}]}

Performance

Measured on 1x RTX 5090 32GB, driver 590.48.01, CUDA 13.1 build, Qwen3-4B BF16 weights, TP1. openinfer main 70888b2, vLLM 0.24.0, same vllm bench serve client, same host, same GPU, prefix cache on, seed 42, input 1024 / output 128 for the QPS sweep. Reproducible via tools/bench/run_serving_bench.sh in the repo.

Footprint

Metric	openinfer	vLLM 0.24.0
RSS before stress, loaded and idle	771 MB	3814 MB
RSS after stress	1064 MB	3863 MB
Startup to HTTP ready, cold	2.99 s	70.0 s
Startup, warm compile cache	~3.0 s	32.7 s
GPU memory, default utilization	28832 MiB	30290 MiB

openinfer is a single process; vLLM RSS is summed over its process tree. The openinfer RSS peak during load is transient while reading safetensors through mmap; steady-state settles at 771 MB after load.

Serving Load

Poisson arrivals, 1024-token prompts, 128-token outputs, greedy (--temperature 0):

QPS	openinfer out tok/s	vLLM out tok/s	openinfer TTFT p50	vLLM TTFT p50	openinfer TPOT p50	vLLM TPOT p50
1	126.3	126.2	45.2 ms	54.9 ms	6.53 ms	6.71 ms
2	252.3	252.2	30.3 ms	38.4 ms	6.93 ms	7.08 ms
4	504.1	503.3	48.8 ms	38.7 ms	8.30 ms	7.95 ms
8	1007.8	1006.9	51.1 ms	66.9 ms	11.39 ms	11.97 ms
10	1258.3	1256.3	53.4 ms	76.3 ms	13.55 ms	14.11 ms
12	1507.7	1506.2	60.0 ms	106.0 ms	16.75 ms	18.36 ms
16	1979.9	1687.9	203.8 ms	3832.3 ms	46.92 ms	79.42 ms

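Low load (QPS 1–4) is comparable: openinfer has the lower TTFT at QPS 1–2, and vLLM is slightly ahead on both TTFT and TPOT at QPS 4. At QPS 8–12 openinfer leads on both TTFT and TPOT. At QPS 16 both systems are overloaded, but openinfer sustains about 17% more throughput (1980 vs 1688 output tok/s) and stays roughly 19× lower on TTFT p50.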
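
To make the load shape concrete, here is a minimal sketch of how a Poisson arrival schedule can be generated and what output rate it demands. This is not the benchmark client's code: splitmix64, uniform01, poisson_arrivals, and offered_out_tok_per_s are hypothetical helpers written for this note, and the tiny generator exists only so the sketch needs no crates.

```rust
/// SplitMix64 step: a small deterministic generator (illustration only).
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E3779B97F4A7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

/// Uniform sample in (0, 1]; never 0, so ln() stays finite.
fn uniform01(state: &mut u64) -> f64 {
    ((splitmix64(state) >> 11) as f64 + 0.5) / (1u64 << 53) as f64
}

/// Arrival timestamps in seconds for `n` requests at mean rate `qps`.
/// Gaps between arrivals of a Poisson process are exponential with mean 1/qps.
fn poisson_arrivals(qps: f64, n: usize, seed: u64) -> Vec<f64> {
    let mut state = seed;
    let mut t = 0.0;
    (0..n)
        .map(|_| {
            t += -uniform01(&mut state).ln() / qps;
            t
        })
        .collect()
}

/// Output token rate the server must sustain to keep up with the arrivals.
fn offered_out_tok_per_s(qps: f64, output_tokens: u32) -> f64 {
    qps * f64::from(output_tokens)
}
```

At QPS 16 with 128-token outputs the offered rate is 16 × 128 = 2048 output tok/s. A measured rate close to that means the server is keeping up; a rate well below it means requests are queueing, which is what the TTFT column shows.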

Qwen3-8B Serving Load

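Same harness, Qwen3-8B BF16, single RTX 5090 (32 GB). The 8B model carries roughly 2× the weights of 4B, so decode is correspondingly slower (TPOT p50 11.55 ms vs 6.53 ms at QPS 1). Output throughput still tracks the offered load through QPS 8, where the GPU approaches saturation: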

QPS	openinfer out tok/s	vLLM out tok/s	openinfer TTFT p50	vLLM TTFT p50	openinfer TPOT p50	vLLM TPOT p50
1	125.1	125.0	82.2 ms	97.4 ms	11.55 ms	11.63 ms
2	249.9	250.0	54.1 ms	61.5 ms	11.46 ms	11.57 ms
4	498.6	498.5	88.1 ms	103.6 ms	16.08 ms	16.24 ms
8	991.9	990.4	148.0 ms	235.1 ms	30.97 ms	35.56 ms

Qwen3-32B Serving Load

Measured on 1x GH200 120GB (aarch64, sm_90), openinfer main 5959f05, Qwen3-32B BF16, TP1, CUDA Graph on. Load to HTTP-ready is 46 s; the profiled KV budget is 21.4 GB (5360 blocks) next to the 63 GB of weights. QPS sweep with vllm bench serve: Poisson arrivals, 1024-token prompts, 128-token outputs, greedy, seed 42. Reproducible via tools/bench/run_serving_bench.sh in the repo.

load	req/s	out tok/s	TTFT p50 / p99	TPOT p50 / p99
c=1	0.35	45	134 / 137 ms	21.1 / 21.1 ms
QPS 1	0.95	122	154 / 316 ms	25.3 / 29.3 ms
QPS 2	1.91	244	103 / 289 ms	24.9 / 34.3 ms
c=4	1.24	159	286 / 296 ms	24.0 / 25.8 ms
QPS 4	3.77	482	202 / 593 ms	59.4 / 74.8 ms
c=8	2.02	258	294 / 462 ms	28.9 / 30.0 ms
QPS 8	5.22	668	16.2 / 28.0 s	107.2 / 107.5 ms

c=N rows hold N requests in flight; QPS n rows are Poisson arrivals. The single GPU saturates around 5.3 req/s and 680 output tok/s at this shape; past that (QPS 10–16) throughput stays flat and TTFT grows with queueing.
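
The c=N rows can be sanity-checked with Little's law: with N requests always in flight, the request rate is N divided by the per-request latency. The sketch below assumes that latency is TTFT plus one TPOT for each output token after the first, and it plugs in the p50 values from the table as a stand-in for the mean. closed_loop_req_per_s is a hypothetical helper, not part of openinfer or the benchmark client.

```rust
/// Estimated request rate for a closed loop that keeps `in_flight` requests active.
/// Little's law: rate = in_flight / latency per request.
/// Assumes latency = TTFT + (output_tokens - 1) * TPOT.
fn closed_loop_req_per_s(in_flight: f64, ttft_s: f64, tpot_s: f64, output_tokens: u32) -> f64 {
    let latency_s = ttft_s + f64::from(output_tokens - 1) * tpot_s;
    in_flight / latency_s
}
```

For c=8 this gives 8 / (0.294 + 127 × 0.0289) ≈ 2.02 req/s, matching the table. The c=1 and c=4 rows land within a few percent, which is as close as a p50-based estimate can be expected to get.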

Greedy output matches HF transformers (bf16, same GPU) token-for-token on 4 of 5 test prompts over the first 20 tokens. The fifth diverges at the second generated token, where HF’s own top-4 logits sit within a 0.375 spread and openinfer emits HF’s second-ranked token, 0.25 below the top.

Warm Prefix-Cache TTFT

For multi-turn chat and agent workloads, most of the prompt often lands as a warm prefix-cache hit. In this sweep, the same prompt group is sent cold once to populate GPU KV cache, then sent warm:

Input length	openinfer cold	openinfer warm p50	openinfer warm p99	vLLM warm p50	vLLM warm p99
256	16.2 ms	8.5 ms	8.8 ms	14.5 ms	19.1 ms
512	24.6 ms	8.6 ms	8.8 ms	16.0 ms	16.4 ms
1024	44.0 ms	9.2 ms	9.5 ms	18.4 ms	19.0 ms
2048	92.0 ms	10.4 ms	10.8 ms	23.7 ms	24.4 ms
4096	211.5 ms	12.7 ms	13.4 ms	34.1 ms	36.2 ms
8192	460.0 ms	21.6 ms	22.8 ms	58.6 ms	59.9 ms
16384	1143.9 ms	26.3 ms	27.9 ms	95.6 ms	98.2 ms

openinfer wins warm TTFT at every measured length; the 16k warm-cache path is 3.6× faster than vLLM p50.
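
For readers new to prefix caching, the sketch below shows the common block-hash approach: each full block of prompt tokens gets a hash chained to its parent, so one hash identifies the whole prefix up to that block. This is an illustration of the general technique, not openinfer's implementation; BLOCK_TOKENS, block_hashes, and PrefixCache are hypothetical names, and the 16-token block size is an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Tokens per KV block. Assumed value, for illustration only.
const BLOCK_TOKENS: usize = 16;

/// Chained hashes of every full block in `tokens`.
/// Each hash covers the parent hash plus the block's tokens, so it
/// identifies the entire prefix up to and including that block.
fn block_hashes(tokens: &[u32]) -> Vec<u64> {
    let mut parent: u64 = 0;
    tokens
        .chunks_exact(BLOCK_TOKENS)
        .map(|block| {
            let mut h = DefaultHasher::new();
            parent.hash(&mut h);
            block.hash(&mut h);
            parent = h.finish();
            parent
        })
        .collect()
}

/// Toy prefix cache: tracks which prefix-identifying block hashes are resident.
#[derive(Default)]
struct PrefixCache {
    resident: HashSet<u64>,
}

impl PrefixCache {
    /// Record all full blocks of a finished prefill (the cold request).
    fn insert(&mut self, tokens: &[u32]) {
        self.resident.extend(block_hashes(tokens));
    }

    /// Number of leading prompt tokens whose KV is already cached.
    fn matched_tokens(&self, tokens: &[u32]) -> usize {
        block_hashes(tokens)
            .iter()
            .take_while(|h| self.resident.contains(*h))
            .count()
            * BLOCK_TOKENS
    }
}
```

A warm request only needs prefill for the tokens past matched_tokens, which is why warm TTFT grows far more slowly with input length than cold TTFT does.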

KV Offload

With --kv-offload, sealed Qwen3 KV blocks can be restored from the pegaflow host tier instead of recomputing full prefill. The pure-L2 mode below disables cross-request HBM prefix reuse, so every prefix hit is restored from host DRAM:

cargo run --release -- \
  --kv-offload \
  --kv-offload-host-gib 16 \
  --no-prefix-cache

Input length	Cold full prefill	L2 warm p50, host restore	Speedup
256	25.4 ms	9.8 ms	2.6x
512	25.6 ms	11.6 ms	2.2x
1024	45.3 ms	15.4 ms	2.9x
2048	92.5 ms	22.9 ms	4.0x
4096	211.1 ms	37.5 ms	5.6x
8192	461.3 ms	71.4 ms	6.5x
16384	1140.5 ms	125.5 ms	9.1x

At 16k, the tiering picture is: HBM hit about 26 ms, host-tier restore about 126 ms, cold prefill about 1.14 s.

DSpark Speculative Decoding

DSpark (DeepSeek-AI, Jun 2026) adds a semi-autoregressive Markov head to a DFlash parallel drafter, raising accepted draft length by conditioning each block position on the previously sampled token. openinfer supports it behind --dflash-draft-model-path — the drafter checkpoint goes in, the target model serves as-is, and greedy verify keeps output lossless.

# Download the released DSpark block7 drafter
huggingface-cli download deepseek-ai/dspark_qwen3_4b_block7 \
  --local-dir models/dspark_qwen3_4b_block7
# https://huggingface.co/deepseek-ai/dspark_qwen3_4b_block7

# Launch with speculative decoding (greedy, single-GPU)
cargo run --release -- \
  --model-path models/Qwen3-4B \
  --dflash-draft-model-path models/dspark_qwen3_4b_block7

Single-stream TPOT drops from 5.8 ms to 3.0 ms, roughly a 2× decode speedup from amortizing each target forward over several accepted draft tokens. Concurrency sweep, greedy, on the ShareGPT and SPEED-Bench (coding) datasets:

ShareGPT:

Concurrency	baseline tok/s	DSpark tok/s	baseline TPOT p50	DSpark TPOT p50
1	170	381	5.83 ms	2.96 ms
4	576	1288	6.72 ms	3.59 ms

SPEED-Bench (coding):

Concurrency	baseline tok/s	DSpark tok/s	baseline TPOT p50	DSpark TPOT p50
1	164	314	5.87 ms	3.07 ms
4	574	988	6.73 ms	3.77 ms

DSpark raises throughput by 1.7–2.2× and cuts TPOT p50 by 1.8–2.0× across both datasets, with the larger gains on ShareGPT.

DFlash (the non-Markov predecessor, dflash_qwen3_4b_block7) is also supported via the same flag with a DFlash-format drafter checkpoint. DSpark is the recommended drafter for Qwen3-4B.
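
Why greedy verification is lossless can be seen in a few lines. The sketch below is the generic greedy acceptance rule for speculative decoding, not openinfer's kernel code; greedy_verify and its argument layout are hypothetical. It assumes the target model has scored the whole drafted block in one forward pass, yielding its own greedy choice at every draft position plus one position beyond.

```rust
/// Greedy verification of one drafted block.
///
/// `draft`: tokens proposed by the drafter.
/// `target_argmax`: the target model's greedy token at each draft position,
/// plus one extra entry for the position after the last draft token,
/// so its length is draft.len() + 1.
///
/// Returns the tokens to emit: the longest draft prefix the target agrees
/// with, followed by one token chosen by the target itself.
fn greedy_verify(draft: &[u32], target_argmax: &[u32]) -> Vec<u32> {
    assert_eq!(target_argmax.len(), draft.len() + 1);
    let accepted = draft
        .iter()
        .zip(target_argmax)
        .take_while(|(d, t)| d == t)
        .count();
    let mut out = draft[..accepted].to_vec();
    // On a mismatch this is the target's correction; on full acceptance
    // it is a bonus token from the same forward pass.
    out.push(target_argmax[accepted]);
    out
}
```

Every emitted token is one the target would have picked on its own, so the output equals plain greedy decoding. Each target forward yields between 1 and draft.len() + 1 tokens, which is where the TPOT reduction comes from.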

Architecture Notes

Qwen3-4B and Qwen3-8B use full attention with grouped-query attention: 32 query heads, 8 KV heads (GQA group 4), head dim 128, 36 layers. Qwen3-32B scales this to 64 query heads and 64 layers with the same 8 KV heads (GQA group 8).
Qwen3-4B, Qwen3-8B, and Qwen3-32B are the default pure Rust + CUDA build, with no Python build dependency.
Paged KV cache uses full-lifetime admission, so requests that cannot fit are rejected instead of hanging under memory pressure.
Prefix cache is on by default; --no-prefix-cache disables GPU prefix matching, or becomes pure-L2 host restore mode when combined with --kv-offload.
CUDA Graph decode uses pre-allocated buffers and can be disabled with --cuda-graph=false for debugging.
DSpark/DFlash speculative decoding is single-GPU, greedy-only, and forces prefix caching off (the drafter needs clean target hidden states).
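
The attention shapes above determine how much KV cache each token costs, and full-lifetime admission follows from that. The sketch below is an illustration under stated assumptions, not openinfer's scheduler: kv_bytes_per_token, blocks_for, and admit are hypothetical helpers, the KV dtype is assumed to be BF16 (2 bytes), and "full-lifetime" is read here as reserving blocks for the prompt plus max_tokens before a request starts.

```rust
/// Bytes of KV cache one token occupies across all layers (K and V).
fn kv_bytes_per_token(layers: u64, kv_heads: u64, head_dim: u64, dtype_bytes: u64) -> u64 {
    2 * layers * kv_heads * head_dim * dtype_bytes
}

/// Blocks needed to hold `tokens` tokens, rounding up to whole blocks.
fn blocks_for(tokens: u64, block_tokens: u64) -> u64 {
    (tokens + block_tokens - 1) / block_tokens
}

/// Full-lifetime admission (as assumed here): a request is admitted only if
/// blocks for its prompt plus its maximum output are free right now.
fn admit(free_blocks: u64, prompt_tokens: u64, max_tokens: u64, block_tokens: u64) -> bool {
    blocks_for(prompt_tokens + max_tokens, block_tokens) <= free_blocks
}
```

With 8 KV heads and head dim 128, the 36-layer models need 2 × 36 × 8 × 128 × 2 = 147,456 bytes (144 KiB) per token, and the 64-layer Qwen3-32B needs 262,144 bytes (256 KiB). Under an assumed block size of 16 tokens, a 1024-token prompt with 128 output tokens reserves 72 blocks, and a request arriving when only 71 are free is rejected up front rather than stalling mid-decode.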