50+ Tokens/sec on a Desktop: Running LLMs on the NVIDIA DGX Spark

XRPL Commons has been experimenting with running powerful LLMs locally for coding, docs, and internal tooling. Our CTO managed to get a 30B‑parameter model generating 51–54 tokens/sec on an NVIDIA DGX Spark by combining MoE models, FP8 quantization, and a clever community Docker workaround for Blackwell GPUs. In this article, we break down the hardware limits, the software hurdles, and the exact setup that made fast on‑prem AI possible.

by
Luc Bocahut
March 19, 2026

TL;DR: We got a 30B-parameter LLM generating 51–54 tokens/sec on the NVIDIA DGX Spark by combining Mixture-of-Experts models, FP8 quantization, and a community Docker image that fixes Blackwell compatibility issues. Full setup below, brought to you by our CTO.

Why We Wanted Local LLMs

At XRPL Commons, we’ve been experimenting with local AI infrastructure for development workflows: coding assistants, document drafting, agent systems, and internal tooling.

Our requirements were simple:

  • Fast enough for interactive use
  • Private enough to run on-premise
  • Replicable across multiple machines

The DGX Spark looked like an interesting candidate. But achieving good performance requires understanding its real constraint.

The Hardware

The DGX Spark packs significant compute into a desktop system:

  • NVIDIA GB10 Blackwell GPU (SM 12.1)
  • 128 GB unified LPDDR5X memory
  • 273 GB/s memory bandwidth
  • ARM Grace CPU (20 cores: 10 Cortex-X925 + 10 Cortex-A725)
  • 4 TB NVMe M.2 (~3.7 TB usable)
  • DGX OS (Ubuntu 24.04)

The standout feature is the 128 GB unified memory, which allows very large models to run locally.

But the critical limitation is memory bandwidth.

The Bandwidth Wall

LLM inference is primarily memory-bandwidth bound.

During autoregressive decoding, each token requires reading the active model weights from memory. With 273 GB/s bandwidth, the limits become clear:

Model        Weights (bf16)   Theoretical ceiling (273 GB/s ÷ weights)
Qwen3-32B    ~64 GB           ~4.3 tok/s
Qwen3-8B     ~16 GB           ~17 tok/s

Our first runs matched this almost exactly:

  • Qwen3-32B (bf16): 3.7 tok/s
  • Qwen3-8B (bf16): 13.1 tok/s

Large models fit comfortably in memory, but generate tokens slowly.
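A back-of-the-envelope sketch of that ceiling, assuming bf16 weights at 2 bytes per parameter (sizes rounded):

```shell
# Decode ceiling: every generated token streams the full weight set through
# memory, so ceiling = bandwidth / weight bytes (sizes are approximate).
BW=273   # GB/s on the DGX Spark
for model in "Qwen3-32B 64" "Qwen3-8B 16"; do
  set -- $model   # $1 = name, $2 = bf16 weight size in GB
  awk -v bw="$BW" -v gb="$2" -v name="$1" \
    'BEGIN { printf "%s (~%d GB bf16): ceiling ~%.1f tok/s\n", name, gb, bw/gb }'
done
```

Our observed numbers land at roughly 75–90 % of these ceilings, which is what you'd expect once KV-cache reads and framework overhead are included.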

The MoE Breakthrough

The solution is Mixture-of-Experts (MoE) models.

Instead of activating the entire network, MoE models route each token through a subset of experts.

Example:

Qwen3-30B-A3B

  • ~30.5B total parameters
  • ~3.3B active parameters per token

This dramatically changes the bandwidth math:

Model                   Active weights/token   Theoretical ceiling (273 GB/s)
Qwen3-30B-A3B (bf16)    ~6.6 GB                ~41 tok/s
Qwen3-30B-A3B (FP8)     ~3.3 GB                ~83 tok/s

In practice, routing overhead, KV-cache reads, and software stack inefficiencies reduce throughput. Real systems typically achieve 50–70% of theoretical bandwidth limits.
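The same arithmetic for the MoE model, applying that 50–70 % band (the ~3.3 GB figure assumes FP8 at roughly 1 byte per active parameter):

```shell
# MoE decode ceiling: only the ~3.3B active parameters are read per token,
# and FP8 stores ~1 byte per parameter.
awk 'BEGIN {
  bw = 273; active_gb = 3.3            # GB/s; GB read per token (FP8)
  ceiling = bw / active_gb
  printf "theoretical ceiling: ~%.0f tok/s\n", ceiling
  printf "at 50-70%% efficiency: %.0f-%.0f tok/s\n", 0.5 * ceiling, 0.7 * ceiling
}'
```

Our measured 51–54 tok/s sits comfortably inside that 41–58 tok/s band.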

The Blackwell Software Problem

The DGX Spark’s Blackwell GPU (SM 12.1) is new enough that much of the software stack is still catching up.

Issues we encountered:

  • FlashAttention 2 crashes
  • vLLM MoE kernels missing SM 12.1
  • PyTorch officially supports only SM 12.0
  • CUDA graphs disabled in standard builds

We initially built vLLM from source, patching build scripts and dependencies.

It worked, but required --enforce-eager mode (no CUDA graphs), which capped throughput at about 30 tok/s.

The Avarok Docker Image

A community project solved most of these issues.

The Avarok dgx-vllm Docker image includes:

  • Patched vLLM v0.16.0rc2
  • SM 12.1 Blackwell support
  • Custom CUTLASS kernels
  • Software fallback for missing NVFP4 instructions
  • Working FlashAttention and CUDA graphs

What previously took hours of compiling from source becomes a single Docker command.

Results

Model / setup                                            Throughput
Qwen3-32B (dense, bf16)                                  3.7 tok/s
Qwen3-8B (dense, bf16)                                   13.1 tok/s
Qwen3-30B-A3B (MoE), source build, --enforce-eager       ~30 tok/s
Qwen3-30B-A3B FP8 (MoE), Avarok image, CUDA graphs       51–54 tok/s

The winning combination:

MoE architecture + FP8 quantization + CUDA graphs via Avarok Docker

The FP8 model uses about 110 GB of GPU memory, leaving little headroom but delivering excellent throughput.
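That footprint lines up with the GPU_MEMORY_UTIL=0.85 setting in our deployment: vLLM preallocates that fraction of the unified pool for weights plus KV cache. A quick check:

```shell
# vLLM preallocates GPU_MEMORY_UTIL x total memory for weights + KV cache;
# with 0.85 on the 128 GB Spark that is ~109 GB, matching the ~110 GB we see.
awk 'BEGIN { printf "preallocated: ~%.0f GB of 128 GB\n", 0.85 * 128 }'
```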

Final Setup

Deployment is straightforward:

```shell
docker pull avarok/dgx-vllm-nvfp4-kernel:v22

docker run -d \
  --name vllm \
  --gpus all \
  --shm-size=16g \
  --restart unless-stopped \
  -p 8000:8888 \
  -v /home/$USER/.cache/huggingface:/root/.cache/huggingface \
  -e MODEL=Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 \
  -e PORT=8888 \
  -e GPU_MEMORY_UTIL=0.85 \
  -e MAX_MODEL_LEN=32768 \
  avarok/dgx-vllm-nvfp4-kernel:v22 serve
```

First startup takes 10–20 minutes (model download and CUDA graph capture). After that, the container auto-starts and exposes an OpenAI-compatible API:

http://localhost:8000/v1
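A quick smoke test of the endpoint (the prompt is just an example; this assumes the container above is up and the model has finished loading):

```shell
# Hit the OpenAI-compatible chat completions route exposed by vLLM.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-30B-A3B-Instruct-2507-FP8",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'
```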

Lessons Learned

1. Understand the bottleneck
On DGX Spark, memory bandwidth determines performance.

2. MoE models are ideal for bandwidth-limited systems
They dramatically reduce active weights per token.

3. FP8 quantization is a free win
Throughput jumped from about 30 to 51 tok/s.

4. Avoid building from source if possible
Community builds often include critical patches ahead of official releases.

5. Stop competing inference servers first
One installation was OOM-killed because Ollama was already holding ~100 GB of the unified memory.
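The pre-flight check we now run looks roughly like this (the unit name assumes a default systemd-managed Ollama install on Linux):

```shell
# See what is resident in the 128 GB unified pool before launching vLLM,
# then stop the competing inference server.
free -h
pgrep -a ollama || echo "ollama not running"
sudo systemctl stop ollama
sudo systemctl disable ollama   # keep it from reclaiming memory after reboot
```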

6. Kernel updates can break NVIDIA drivers
If nvidia-smi fails after reboot:

```shell
sudo apt install linux-modules-nvidia-580-open-$(uname -r)
```

Next Experiments

One promising direction is experimenting with reasoning-optimized models. A recent example is Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled (“Qwopus”), which distills structured reasoning from Claude 4.6 Opus into a Qwen3.5 base model.

While it is a dense model and therefore likely slower than our current MoE setup on the DGX Spark, it may offer stronger step-by-step reasoning for coding, math, and agent workflows.
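Applying the same bandwidth arithmetic gives a rough expectation before we even run it (FP8 assumed at ~1 byte per parameter; the 27B count is taken from the model name):

```shell
# Dense 27B model: every token reads all ~27 GB of FP8 weights,
# not a ~3.3 GB active slice.
awk 'BEGIN {
  ceiling = 273 / 27
  printf "dense 27B FP8 ceiling: ~%.0f tok/s\n", ceiling
  printf "expected real-world: %.0f-%.0f tok/s\n", 0.5 * ceiling, 0.7 * ceiling
}'
```

Roughly an order of magnitude slower than the MoE setup, so this would be a quality-versus-speed trade, not a replacement.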

Testing it is simple—swap the model in the same container:

```shell
-e MODEL=Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled
```

We plan to benchmark reasoning quality vs throughput alongside the current MoE setup.

References

NVIDIA DGX Spark
https://www.nvidia.com/en-us/data-center/dgx-spark/

Avarok dgx-vllm Docker project
https://github.com/avarok-ai/dgx-vllm

vLLM documentation
https://docs.vllm.ai

Qwen3-30B-A3B FP8 model
https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507-FP8

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled
https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled

Let us know what you’re building.