Atlas Inference
AI & ML interests
We built Atlas, a pure Rust LLM inference engine with custom CUDA kernels for the NVIDIA DGX Spark GB10, and we are gearing up for an open-source release: 102 tok/s on Qwen3.5-35B, 3× over vLLM on the same hardware, from a ~2.5 GB binary with a two-minute cold start (at least 10× smaller and faster).
📣 Read the launch announcement on X →
Atlas Inference
Pure Rust LLM Inference.
What is Atlas?
Atlas is a from-scratch Rust + CUDA inference engine built for the next decade of LLM deployment. No Python interpreter. No PyTorch. No 20 GB Docker image. One ~2.5 GB binary that boots in under two minutes and pins the bandwidth ceiling on every supported (Hardware × Model × Quantization) target.
We started on NVIDIA's DGX Spark (GB10 / SM121) with twelve hand-tuned model targets and a plug-and-play architecture designed so AMD, Intel, and Apple Silicon can land as community contributions, and so the next round of model families slot in the same way the Qwens did this quarter.
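What "pins the bandwidth ceiling" means, as a back-of-envelope roofline sketch (the ~273 GB/s figure for the GB10's LPDDR5X and the per-token byte count below are illustrative assumptions, not measurements): single-stream decode is memory-bound, so throughput is capped by how fast the active weights can be streamed per token.

$$
\text{tok/s} \;\lesssim\; \frac{\text{memory bandwidth}}{\text{bytes read per token}}
\;\approx\; \frac{273\ \text{GB/s}}{3\times10^{9}\ \text{active params}\times 1\ \text{byte (FP8)}}
\;\approx\; 91\ \text{tok/s}
$$

By that naive ceiling, the ~71 tok/s we list below for the FP8 Qwen3.6-35B-A3B is roughly 78% of roofline before KV-cache and activation traffic is even counted; MTP speculative decoding amortizes each weight pass across multiple tokens, which is how the MTP targets climb further.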
Why Atlas
| | Atlas | vLLM (same hardware) |
|---|---|---|
| Image size | ~2.5 GB | 20+ GB |
| Cold start | <2 min | ~10 min |
| Runtime | Rust + CUDA | Python + PyTorch |
| Dependencies | None | 200+ packages |
| Peak throughput, Qwen3.5-35B (NVFP4) | 130 tok/s | ~38 tok/s |
| Average across workloads | 111 tok/s (3.0×) | 37 tok/s |
Same hardware. Same model weights. Bring your own benchmark: scripts/sweep_all_models.sh is in the repo, and we publish the vLLM baseline command alongside ours so you can verify both. If you reproduce a faster vLLM number, file an issue. We would rather be measured than congratulated.
Quick Start
docker pull avarok/atlas-gb10:latest
docker run --gpus all --ipc=host -p 8888:8888 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
avarok/atlas-gb10:latest \
serve Sehyo/Qwen3.5-35B-A3B-NVFP4 --speculative --mtp-quantization nvfp4
Anything OpenAI- or Anthropic-compatible (curl, the OpenAI SDK, opencode, Claude Code, Cline, Open WebUI) points at port 8888:
curl http://localhost:8888/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"atlas","messages":[{"role":"user","content":"Hello!"}],"max_tokens":256}'
Per-model recipes (vision, MoE, multi-node EP=2, single-GPU 122B under the tighter memory budget) live in QUICKSTART.md.
What Ships Today
Twelve hand-tuned (Hardware × Model × Quantization) targets across the Qwen3 / Qwen3.5 / Qwen3.6 / Qwen3-Next / Qwen3-VL / Gemma-4 / Mistral / MiniMax / Nemotron-H families. Every supported model runs off one multi-model binary; the right kernel set is selected at startup from the model's config.json.
| Model | Params / active | Quant | Architecture | Throughput |
|---|---|---|---|---|
| Qwen3.5-35B-A3B (MTP K=2) | 35B / 3B | NVFP4 | GDN + Attention + MoE | ~130 tok/s |
| Qwen3-VL-30B-A3B | 30B / 3B | NVFP4 | Vision + Attention + MoE | ~97 tok/s |
| Nemotron-3-Nano-30B-A3B | 30B / 3.5B | NVFP4 / FP8 | Mamba-2 + Attention + MoE | ~88 tok/s |
| Qwen3-Next-80B-A3B | 80B / 3B | NVFP4 | SSM + Attention + MoE | ~74–87 tok/s |
| Qwen3.6-35B-A3B | 35B / 3B | FP8 | GDN + Attention + MoE + ViT | ~71 tok/s |
| Gemma-4-26B-A4B | 26B / 4B | NVFP4 | Attention + MoE (GeGLU) | ~67 tok/s |
| Qwen3.5-122B-A10B (EP=2) | 122B / 10B | NVFP4 | GDN + Attention + MoE | ~46 tok/s |
| Mistral-Small-4-119B | 119B / 6.5B | NVFP4 | MLA + MoE | ~33 tok/s |
| Nemotron-3-Super-120B-A12B | 120B / 12B | NVFP4 / FP8 | Mamba-2 + Attention + MoE | ~24 tok/s |
| MiniMax-M2.7 (EP=2) | 229B / ~10B | NVFP4 | Attention + 256-expert MoE | ~15 tok/s |
| Qwen3.5-27B (dense hybrid) | 27B | NVFP4 | Hybrid SSM + Attention | ~13 tok/s |
| Gemma-4-31B | 31B | NVFP4 | Attention (sliding + full) | ~9–11 tok/s |
Full HuggingFace IDs, methodology, and the kernel-by-kernel comparison against PyTorch eager live in the GitHub README.
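To make the startup selection above concrete, here is a hedged sketch of dispatching on a model's config.json. The `KernelSet` variants and the matching logic are invented for illustration; Atlas's real kernel registry and type names will differ.

```rust
// Hypothetical sketch only: illustrates picking a kernel set from the
// `architectures` field that HuggingFace writes into config.json.
use serde_json::Value;

#[derive(Debug)]
enum KernelSet {
    GdnAttnMoeNvfp4, // e.g. the Qwen3.5 A3B targets
    MambaAttnMoeFp8, // e.g. the Nemotron-H targets
    DenseAttention,  // fallback
}

fn select_kernel_set(config_path: &str) -> Result<KernelSet, Box<dyn std::error::Error>> {
    let cfg: Value = serde_json::from_str(&std::fs::read_to_string(config_path)?)?;
    // HuggingFace configs name the model class in an `architectures` array.
    let arch = cfg["architectures"][0].as_str().unwrap_or("");
    Ok(match arch {
        a if a.contains("Qwen3") => KernelSet::GdnAttnMoeNvfp4,
        a if a.contains("Nemotron") => KernelSet::MambaAttnMoeFp8,
        _ => KernelSet::DenseAttention,
    })
}
```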
What Works Today
| Component | Status |
|---|---|
| OpenAI- and Anthropic-compatible HTTP API (streaming + non-streaming) | ✅ |
| Tool calling (Hermes, Qwen3-Coder, Mistral formats) with grammar-constrained decoding | ✅ |
| Reasoning / thinking tokens with budget cap | ✅ |
| Concurrent batched decode + per-batch CUDA graphs | ✅ |
| MTP speculative decoding (K=2, pipelined verify) | ✅ |
| Prefix caching via radix tree (RadixAttention) + SSM snapshot cache (Marconi), 10× faster warm-cache TTFT (sketch after this table) | ✅ |
| KV cache dtypes: BF16, FP8, NVFP4, turbo3, turbo4 | ✅ |
| MoE routing up to 512 experts | ✅ |
| Vision encoder (Qwen3-VL, Qwen3.6 ViT) | ✅ |
| Multi-GPU expert parallelism (EP=2 over RoCEv2) | ✅ |
| SLO-aware scheduling, chunked prefill, active context compaction | ✅ |
| High-speed NVMe KV swap (sliding-window aware) | ✅ |
| Auto OOM pre-flight + UVM fallback on host OOM | ✅ |
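For a concrete picture of the prefix-caching row, here is a minimal Rust sketch of radix-style prefix matching over token IDs, in the spirit of RadixAttention. It is simplified to a per-token trie (production radix trees compress token runs into single edges and attach KV handles to nodes) and shares no code with Atlas.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Node {
    children: HashMap<u32, Node>,
}

#[derive(Default)]
struct PrefixCache {
    root: Node,
}

impl PrefixCache {
    /// Record a served token sequence so later requests can reuse its prefix.
    fn insert(&mut self, tokens: &[u32]) {
        let mut node = &mut self.root;
        for &t in tokens {
            node = node.children.entry(t).or_default();
        }
    }

    /// Length of the longest cached prefix of `tokens`; the scheduler can
    /// skip recomputing KV entries for those positions.
    fn longest_prefix(&self, tokens: &[u32]) -> usize {
        let mut node = &self.root;
        let mut matched = 0;
        for &t in tokens {
            match node.children.get(&t) {
                Some(next) => {
                    node = next;
                    matched += 1;
                }
                None => break,
            }
        }
        matched
    }
}
```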
Plug & Play Architecture
Atlas is built around a small set of Rust traits and a kernel registry. Each row marked 🔌 below is an abstraction boundary where a new integration plugs in without touching anything above or below it:
| Plug Point | What It Abstracts | To Add Support |
|---|---|---|
| 🔌 `trait ModelWeightLoader` | HuggingFace → layer translation | Implement one struct + add a match arm in factory.rs |
| 🔌 `trait TransformerLayer` | Per-layer compute (attn, SSM, MoE, FFN) | Compose existing primitives or implement a new layer type |
| 🔌 `trait GpuBackend` | All GPU memory and kernel ops | Swap CUDA for another accelerator backend |
| 🔌 `kernels/<hw>/<model>/<quant>/` | Hardware-tuned CUDA kernels | Drop a directory with MODEL.toml + .cu files; build.rs auto-discovers it |
| 🔌 `trait CommBackend` | Multi-GPU collectives | Implement for MPI, GDR, custom interconnects |
| 🔌 `trait StorageBackend` | NVMe KV-cache offload I/O | Implement for CXL, RDMA, other storage tiers |
A `MockGpuBackend` in `spark-runtime` lets you write and test the entire scaffold without owning the hardware; every layer above the GPU trait is hardware-agnostic.
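To make the first plug point concrete, here is a hedged sketch of what a new weight-loader integration could look like. The trait name comes from the table above, but the method signature and the factory.rs wiring noted in the comments are simplifying assumptions, not Atlas's actual API.

```rust
// Illustrative only: a simplified stand-in for Atlas's ModelWeightLoader.
// The real trait will carry more context (dtypes, shards, quantization
// metadata, ...).
trait ModelWeightLoader {
    /// Translate a HuggingFace tensor name into the engine's layer slot,
    /// or None if the tensor is unused by this architecture.
    fn translate(&self, hf_name: &str) -> Option<String>;
}

struct MyNewModelLoader;

impl ModelWeightLoader for MyNewModelLoader {
    fn translate(&self, hf_name: &str) -> Option<String> {
        // e.g. "model.layers.0.mlp.gate_proj.weight" -> "layers.0.mlp.gate_proj"
        hf_name
            .strip_prefix("model.")
            .and_then(|rest| rest.strip_suffix(".weight"))
            .map(|slot| slot.to_string())
    }
}

// Per the table above, the remaining step would be a match arm in
// factory.rs returning MyNewModelLoader for the new architecture string.
```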
What the Community is Saying
"103 tok/s sustained on the 35B, startup in 15 seconds. Night and day compared to vLLM's 10-minute torch.compile cycle. Then tried the 122B, 43.8 tok/s with MTP, a 41% speedup over our vLLM hybrid, same hardware, 2-minute startup." โ ronald_15496, Discord #general
"Testing atlas-qwen3.5-35b for over an hour on a PNY DGX Spark in an agentic workflow. Super impressed. Spark is actually awesome with Atlas." โ PersonWhoThinks, r/LocalLLaMA
"I've grown tired of vLLM and have been hoping for something. I was really surprised and impressed. I'm so glad I bought Spark because I came across this." โ tetsuro59, Discord #general
Citations
We did not invent the kernels we ship. We picked the right ideas from the right papers, fused them together, and tuned them for one chip until they pinned the bandwidth ceiling. Direct intellectual debts: FlashAttention-2 (Dao, 2024), FlashAttention-4 (Shah et al., 2025), FlashInfer (Ye et al., MLSys 2025), SageAttention 3 (Zhang et al., NeurIPS 2025), LeanAttention (Roy et al., 2024). Full references in the GitHub README.
License & Enterprise Edition
Atlas operates under a dual-license model. Both are real and intentional.
- Community Edition: AGPLv3. Free, open, copyleft. Use it on your own hardware for research, hobby projects, side projects, hosted demos.
- Enterprise Edition: commercial license. Ship Atlas inside a closed-source product, run it as a SaaS backend without inheriting the AGPLv3 source-disclosure obligation, get a support relationship with the people who wrote the kernels, and get prioritized model and hardware ports. Reach us via the website or Discord.
The commercial license keeps us building Atlas full-time; the AGPL community edition keeps the project honest. What is in this repository is what we run.
atlasinference.io · GitHub · Docker Hub · Discord