Atlas Inference
AI & ML interests
We built Atlas, a pure Rust LLM inference engine with custom CUDA kernels for the NVIDIA DGX Spark GB10, and we are gearing up for an open-source release: 102 tok/s on Qwen3.5-35B, 3× over vLLM on the same hardware, from a ~2.5 GB binary with a two-minute cold start (at least 10× smaller and faster).
📣 Read the launch announcement on X →
Atlas Inference
Pure Rust LLM Inference.
What is Atlas?
Atlas is a from-scratch Rust + CUDA inference engine built for the next decade of LLM deployment. No Python interpreter. No PyTorch. No 20 GB Docker image. One ~2.5 GB binary that boots in under two minutes and pins the bandwidth ceiling on every supported (Hardware × Model × Quantization) target.
We started on NVIDIA's DGX Spark (GB10 / SM121) with twelve hand-tuned model targets and a plug-and-play architecture designed so AMD, Intel, and Apple Silicon can land as community contributions, and so the next round of model families slot in the same way the Qwens did this quarter.
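What "pins the bandwidth ceiling" means, as a back-of-envelope roofline sketch (the ~273 GB/s figure for the GB10's LPDDR5X and the per-token byte count below are illustrative assumptions, not measurements): single-stream decode is memory-bound, so throughput is capped by how fast the active weights can be streamed per token.

$$
\text{tok/s} \;\lesssim\; \frac{\text{memory bandwidth}}{\text{bytes read per token}}
\;\approx\; \frac{273\ \text{GB/s}}{3\times10^{9}\ \text{active params}\times 1\ \text{byte (FP8)}}
\;\approx\; 91\ \text{tok/s}
$$

By that naive ceiling, the ~71 tok/s we list below for the FP8 Qwen3.6-35B-A3B is roughly 78% of roofline before KV-cache and activation traffic is even counted; MTP speculative decoding amortizes each weight pass across multiple tokens, which is how the MTP targets climb further.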
Why Atlas
| | Atlas | vLLM (same hardware) |
|---|---|---|
| Image size | ~2.5 GB | 20+ GB |
| Cold start | <2 min | ~10 min |
| Runtime | Rust + CUDA | Python + PyTorch |
| Dependencies | None | 200+ packages |
| Peak throughput, Qwen3.5-35B (NVFP4) | 130 tok/s | ~38 tok/s |
| Average across workloads | 111 tok/s (3.0×) | 37 tok/s |
Same hardware. Same model weights. Bring your own benchmark: scripts/sweep_all_models.sh is in the repo, and we publish the vLLM baseline command alongside ours so you can verify both. If you reproduce a faster vLLM number, file an issue. We would rather be measured than congratulated.
Quick Start
docker pull avarok/atlas-gb10:latest
docker run --gpus all --ipc=host -p 8888:8888 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
avarok/atlas-gb10:latest \
serve Sehyo/Qwen3.5-35B-A3B-NVFP4 --speculative --mtp-quantization nvfp4
Anything OpenAI- or Anthropic-compatible (curl, the OpenAI SDK, opencode, Claude Code, Cline, Open WebUI) points at port 8888:
curl http://localhost:8888/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"atlas","messages":[{"role":"user","content":"Hello!"}],"max_tokens":256}'
Per-model recipes (vision, MoE, multi-node EP=2, single-GPU 122B under the tighter memory budget) live in QUICKSTART.md.
What Ships Today
Twelve hand-tuned (Hardware × Model × Quantization) targets across the Qwen3 / Qwen3.5 / Qwen3.6 / Qwen3-Next / Qwen3-VL / Gemma-4 / Mistral / MiniMax / Nemotron-H families. Every supported model runs off one multi-model binary; the right kernel set is selected at startup from the model's config.json.
| Model | Params / active | Quant | Architecture | Throughput |
|---|---|---|---|---|
| Qwen3.5-35B-A3B (MTP K=2) | 35B / 3B | NVFP4 | GDN + Attention + MoE | ~130 tok/s |
| Qwen3-VL-30B-A3B | 30B / 3B | NVFP4 | Vision + Attention + MoE | ~97 tok/s |
| Nemotron-3-Nano-30B-A3B | 30B / 3.5B | NVFP4 / FP8 | Mamba-2 + Attention + MoE | ~88 tok/s |
| Qwen3-Next-80B-A3B | 80B / 3B | NVFP4 | SSM + Attention + MoE | ~74–87 tok/s |
| Qwen3.6-35B-A3B | 35B / 3B | FP8 | GDN + Attention + MoE + ViT | ~71 tok/s |
| Gemma-4-26B-A4B | 26B / 4B | NVFP4 | Attention + MoE (GeGLU) | ~67 tok/s |
| Qwen3.5-122B-A10B (EP=2) | 122B / 10B | NVFP4 | GDN + Attention + MoE | ~46 tok/s |
| Mistral-Small-4-119B | 119B / 6.5B | NVFP4 | MLA + MoE | ~33 tok/s |
| Nemotron-3-Super-120B-A12B | 120B / 12B | NVFP4 / FP8 | Mamba-2 + Attention + MoE | ~24 tok/s |
| MiniMax-M2.7 (EP=2) | 229B / ~10B | NVFP4 | Attention + 256-expert MoE | ~15 tok/s |
| Qwen3.5-27B (dense hybrid) | 27B | NVFP4 | Hybrid SSM + Attention | ~13 tok/s |
| Gemma-4-31B | 31B | NVFP4 | Attention (sliding + full) | ~9–11 tok/s |
Full HuggingFace IDs, methodology, and the kernel-by-kernel comparison against PyTorch eager live in the GitHub README.
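To make the startup selection above concrete, here is a hedged sketch of dispatching on a model's config.json. The `KernelSet` variants and the matching logic are invented for illustration; Atlas's real kernel registry and type names will differ.

```rust
// Hypothetical sketch only: illustrates picking a kernel set from the
// `architectures` field that HuggingFace writes into config.json.
use serde_json::Value;

#[derive(Debug)]
enum KernelSet {
    GdnAttnMoeNvfp4, // e.g. the Qwen3.5 A3B targets
    MambaAttnMoeFp8, // e.g. the Nemotron-H targets
    DenseAttention,  // fallback
}

fn select_kernel_set(config_path: &str) -> Result<KernelSet, Box<dyn std::error::Error>> {
    let cfg: Value = serde_json::from_str(&std::fs::read_to_string(config_path)?)?;
    // HuggingFace configs name the model class in an `architectures` array.
    let arch = cfg["architectures"][0].as_str().unwrap_or("");
    Ok(match arch {
        a if a.contains("Qwen3") => KernelSet::GdnAttnMoeNvfp4,
        a if a.contains("Nemotron") => KernelSet::MambaAttnMoeFp8,
        _ => KernelSet::DenseAttention,
    })
}
```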
What Works Today
| Component | Status |
|---|---|
| OpenAI- and Anthropic-compatible HTTP API (streaming + non-streaming) | ✅ |
| Tool calling (Hermes, Qwen3-Coder, Mistral formats) with grammar-constrained decoding | ✅ |
| Reasoning / thinking tokens with budget cap | ✅ |
| Concurrent batched decode + per-batch CUDA graphs | ✅ |
| MTP speculative decoding (K=2, pipelined verify) | ✅ |
| Prefix caching via radix tree (RadixAttention) + SSM snapshot cache (Marconi), 10× faster warm-cache TTFT (sketch after this table) | ✅ |
| KV cache dtypes: BF16, FP8, NVFP4, turbo3, turbo4 | ✅ |
| MoE routing up to 512 experts | ✅ |
| Vision encoder (Qwen3-VL, Qwen3.6 ViT) | ✅ |
| Multi-GPU expert parallelism (EP=2 over RoCEv2) | ✅ |
| SLO-aware scheduling, chunked prefill, active context compaction | ✅ |
| High-speed NVMe KV swap (sliding-window aware) | ✅ |
| Auto OOM pre-flight + UVM fallback on host OOM | ✅ |
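For a concrete picture of the prefix-caching row, here is a minimal Rust sketch of radix-style prefix matching over token IDs, in the spirit of RadixAttention. It is simplified to a per-token trie (production radix trees compress token runs into single edges and attach KV handles to nodes) and shares no code with Atlas.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Node {
    children: HashMap<u32, Node>,
}

#[derive(Default)]
struct PrefixCache {
    root: Node,
}

impl PrefixCache {
    /// Record a served token sequence so later requests can reuse its prefix.
    fn insert(&mut self, tokens: &[u32]) {
        let mut node = &mut self.root;
        for &t in tokens {
            node = node.children.entry(t).or_default();
        }
    }

    /// Length of the longest cached prefix of `tokens`; the scheduler can
    /// skip recomputing KV entries for those positions.
    fn longest_prefix(&self, tokens: &[u32]) -> usize {
        let mut node = &self.root;
        let mut matched = 0;
        for &t in tokens {
            match node.children.get(&t) {
                Some(next) => {
                    node = next;
                    matched += 1;
                }
                None => break,
            }
        }
        matched
    }
}
```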
Plug & Play Architecture
Atlas is built around a small set of Rust traits and a kernel registry. Each row marked 🔌 below is an abstraction boundary where a new integration plugs in without touching anything above or below it:
| Plug Point | What It Abstracts | To Add Support |
|---|---|---|
| 🔌 `trait ModelWeightLoader` | HuggingFace → layer translation | Implement one struct + add a match arm in factory.rs |
| 🔌 `trait TransformerLayer` | Per-layer compute (attn, SSM, MoE, FFN) | Compose existing primitives or implement a new layer type |
| 🔌 `trait GpuBackend` | All GPU memory and kernel ops | Swap CUDA for another accelerator backend |
| 🔌 `kernels/<hw>/<model>/<quant>/` | Hardware-tuned CUDA kernels | Drop a directory with MODEL.toml + .cu files; build.rs auto-discovers it |
| 🔌 `trait CommBackend` | Multi-GPU collectives | Implement for MPI, GDR, custom interconnects |
| 🔌 `trait StorageBackend` | NVMe KV-cache offload I/O | Implement for CXL, RDMA, other storage tiers |
A `MockGpuBackend` in `spark-runtime` lets you write and test the entire scaffold without owning the hardware; every layer above the GPU trait is hardware-agnostic.
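To make the first plug point concrete, here is a hedged sketch of what a new weight-loader integration could look like. The trait name comes from the table above, but the method signature and the factory.rs wiring noted in the comments are simplifying assumptions, not Atlas's actual API.

```rust
// Illustrative only: a simplified stand-in for Atlas's ModelWeightLoader.
// The real trait will carry more context (dtypes, shards, quantization
// metadata, ...).
trait ModelWeightLoader {
    /// Translate a HuggingFace tensor name into the engine's layer slot,
    /// or None if the tensor is unused by this architecture.
    fn translate(&self, hf_name: &str) -> Option<String>;
}

struct MyNewModelLoader;

impl ModelWeightLoader for MyNewModelLoader {
    fn translate(&self, hf_name: &str) -> Option<String> {
        // e.g. "model.layers.0.mlp.gate_proj.weight" -> "layers.0.mlp.gate_proj"
        hf_name
            .strip_prefix("model.")
            .and_then(|rest| rest.strip_suffix(".weight"))
            .map(|slot| slot.to_string())
    }
}

// Per the table above, the remaining step would be a match arm in
// factory.rs returning MyNewModelLoader for the new architecture string.
```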
What the Community is Saying
"103 tok/s sustained on the 35B, startup in 15 seconds. Night and day compared to vLLM's 10-minute torch.compile cycle. Then tried the 122B, 43.8 tok/s with MTP, a 41% speedup over our vLLM hybrid, same hardware, 2-minute startup." โ ronald_15496, Discord #general
"Testing atlas-qwen3.5-35b for over an hour on a PNY DGX Spark in an agentic workflow. Super impressed. Spark is actually awesome with Atlas." โ PersonWhoThinks, r/LocalLLaMA
"I've grown tired of vLLM and have been hoping for something. I was really surprised and impressed. I'm so glad I bought Spark because I came across this." โ tetsuro59, Discord #general
Citations
We did not invent the kernels we ship. We picked the right ideas from the right papers, fused them together, and tuned them for one chip until they pinned the bandwidth ceiling. Direct intellectual debts: FlashAttention-2 (Dao, 2024), FlashAttention-4 (Shah et al., 2025), FlashInfer (Ye et al., MLSys 2025), SageAttention 3 (Zhang et al., NeurIPS 2025), LeanAttention (Roy et al., 2024). Full references in the GitHub README.
License & Enterprise Edition
Atlas operates under a dual-license model. Both are real and intentional.
- Community Edition: AGPLv3. Free, open, copyleft. Use it on your own hardware for research, hobby projects, side projects, hosted demos.
- Enterprise Edition: commercial license. Ship Atlas inside a closed-source product, run it as a SaaS backend without inheriting the AGPLv3 source-disclosure obligation, get a support relationship with the people who wrote the kernels, and get prioritized model and hardware ports. Reach us via the website or Discord.
The commercial license keeps us building Atlas full-time; the AGPL community edition keeps the project honest. What is in this repository is what we run.
atlasinference.io · GitHub · Docker Hub · Discord