50+ Tokens/sec on a Desktop: Running LLMs on the NVIDIA DGX Spark

XRPL Commons has been experimenting with running powerful LLMs locally for coding, docs, and internal tooling. Our CTO managed to get a 30B‑parameter model generating 51–54 tokens/sec on an NVIDIA DGX Spark by combining MoE models, FP8 quantization, and a clever community Docker workaround for Blackwell GPUs. In this article, we break down the hardware limits, the software hurdles, and the exact setup that made fast on‑prem AI possible.

TL;DR: At XRPL Commons, we’ve been experimenting with local AI infrastructure for development workflows: coding assistants, document drafting, agent systems, and internal tooling. We got a 30B-parameter LLM running at 51–54 tokens/sec on the NVIDIA DGX Spark by combining Mixture-of-Experts models, FP8 quantization, and a community Docker image that fixes Blackwell compatibility issues. Full setup below, brought to you by our CTO.

‍

Why We Wanted Local LLMs

‍

At XRPL Commons, we’ve been experimenting with local AI infrastructure for development workflows: coding assistants, document drafting, agent systems, and internal tooling.

Our requirements were simple:

Fast enough for interactive use
Private enough to run on-premise
Replicable across multiple machines

The DGX Spark looked like an interesting candidate. But achieving good performance requires understanding its real constraint.

‍

The Hardware

‍

The DGX Spark packs significant compute into a desktop system:

NVIDIA GB10 Blackwell GPU (SM 12.1)
128 GB unified LPDDR5X memory
273 GB/s memory bandwidth
ARM Grace CPU (20 cores: 10 Cortex-X925 + 10 Cortex-A725)
4 TB NVMe M.2 (~3.7 TB usable)
DGX OS (Ubuntu 24.04)

The standout feature is the 128 GB unified memory, which allows very large models to run locally.

But the critical limitation is memory bandwidth.

‍

The Bandwidth Wall

‍

LLM inference is primarily memory-bandwidth bound.

During autoregressive decoding, each token requires reading the active model weights from memory. With 273 GB/s bandwidth, the limits become clear:

Model

‍

‍

Our first runs matched this almost exactly:

Qwen3-32B (bf16): 3.7 tok/s
Qwen3-8B (bf16): 13.1 tok/s

Large models fit comfortably in memory, but generate tokens slowly.

‍

The MoE Breakthrough

‍

The solution is Mixture-of-Experts (MoE) models.

Instead of activating the entire network, MoE models route each token through a subset of experts.

Example:

Qwen3-30B-A3B

~30.5B total parameters
~3.3B active parameters per token

This dramatically changes the bandwidth math:

Model

‍

‍

In practice, routing overhead, KV-cache reads, and software stack inefficiencies reduce throughput. Real systems typically achieve 50–70% of theoretical bandwidth limits.

‍

The Blackwell Software Problem

‍

The DGX Spark’s Blackwell GPU (SM 12.1) is new enough that much of the software stack is still catching up.

Issues we encountered:

FlashAttention 2 crashes
vLLM MoE kernels missing SM 12.1
PyTorch officially supports only SM 12.0
CUDA graphs disabled in standard builds

We initially built vLLM from source, patching build scripts and dependencies.

It worked, but required --enforce-eager mode (no CUDA graphs), which capped throughput at about 30 tok/s.

‍

The Avarok Docker Image

‍

A community project solved most of these issues.

The Avarok dgx-vllm Docker image includes:

Patched vLLM v0.16.0rc2
SM 12.1 Blackwell support
Custom CUTLASS kernels
Software fallback for missing NVFP4 instructions
Working FlashAttention and CUDA graphs

Instead of hours compiling from source, deployment becomes a single Docker command.

‍

Results

Model

‍

‍

The winning combination:

MoE architecture + FP8 quantization + CUDA graphs via Avarok Docker

The FP8 model uses about 110 GB of GPU memory, leaving little headroom but delivering excellent throughput.

‍

Final Setup

‍

Deployment is straightforward:

‍

shell

docker pull avarok/dgx-vllm-nvfp4-kernel:v22

‍

docker run -d \

--name vllm \

--gpus all \

--shm-size=16g \

--restart unless-stopped \

-p 8000:8888 \

-v /home/$USER/.cache/huggingface:/root/.cache/huggingface \

-e MODEL=Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 \

-e PORT=8888 \

-e GPU_MEMORY_UTIL=0.85 \

-e MAX_MODEL_LEN=32768 \

avarok/dgx-vllm-nvfp4-kernel:v22 serve

‍

First startup takes 10–20 minutes (model download and CUDA graph capture). After that, the container auto-starts and exposes an OpenAI-compatible API:

http://localhost:8000/v1

‍

Lessons Learned

‍

1. Understand the bottleneck
On DGX Spark, memory bandwidth determines performance.

‍

2. MoE models are ideal for bandwidth-limited systems
They dramatically reduce active weights per token.

‍

3. FP8 quantization is a free win
Throughput nearly doubled from 30 → 51 tok/s.

‍

4. Avoid building from source if possible
Community builds often include critical patches ahead of official releases.

‍

5. Stop competing inference servers first
One Spark OOM-killed during installation because Ollama was using ~100 GB of memory.

‍

6. Kernel updates can break NVIDIA drivers
If nvidia-smi fails after reboot:

‍

shell

sudo apt install linux-modules-nvidia-580-open-$(uname -r)

‍

Next Experiments

‍

One promising direction is experimenting with reasoning-optimized models. A recent example is Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled (“Qwopus”), which distills structured reasoning from Claude 4.6 Opus into a Qwen3.5 base model.

‍

While it is a dense model and therefore likely slower than our current MoE setup on the DGX Spark, it may offer stronger step-by-step reasoning for coding, math, and agent workflows.

‍

Testing it is simple—swap the model in the same container:

‍

shell

-e MODEL=Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled

‍

We plan to benchmark reasoning quality vs throughput alongside the current MoE setup.

‍

References

‍

NVIDIA DGX Spark
https://www.nvidia.com/en-us/data-center/dgx-spark/

‍

Avarok dgx-vllm Docker project
https://github.com/avarok-ai/dgx-vllm

‍

vLLM documentation
https://docs.vllm.ai

‍

Qwen3-30B-A3B FP8 model
https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507-FP8

‍

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled
https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled

‍

Let us know what you’re building.

‍

Written by

Luc Bocahut

Product Director & Solutions Architect

XRPL Commons

Luc is an experienced technical and financial executive based in Paris. His early career included managing a global macro hedge fund. He holds a Master's in Economics and Social Sciences from the Sorbonne and an MBA from EDHEC. Recently, he has served as a fractional CTO, CFO, and COS for multiple companies in France and the US, offering consulting services with a focus on the crypto space.