Atlas Inference

AI & ML interests

We built Atlas, a pure Rust LLM inference engine with custom CUDA kernels for the NVIDIA DGX Spark GB10, and we are gearing up for an open-source release: 102 tok/s on Qwen3.5-35B (3x over vLLM on the same hardware), a ~2 GB binary, and a two-minute cold start (at least 10x smaller and faster).


📣 Read the launch announcement on X →

Atlas Inference

Pure Rust LLM Inference.

Website GitHub Docker Hub Discord License: AGPLv3


What is Atlas?

Atlas is a from-scratch Rust + CUDA inference engine built for the next decade of LLM deployment. No Python interpreter. No PyTorch. No 20 GB Docker image. One ~2.5 GB binary that boots in under two minutes and pins the bandwidth ceiling on every supported (Hardware × Model × Quantization) target.

We started on NVIDIA's DGX Spark (GB10 / SM121) with twelve hand-tuned model targets and a plug-and-play architecture designed so AMD, Intel, and Apple Silicon can land as community contributions, and so the next round of model families slot in the same way the Qwens did this quarter.

Why Atlas

|  | Atlas | vLLM (same hardware) |
| --- | --- | --- |
| Image size | ~2.5 GB | 20+ GB |
| Cold start | <2 min | ~10 min |
| Runtime | Rust + CUDA | Python + PyTorch |
| Dependencies | None | 200+ packages |
| Peak, Qwen3.5-35B (NVFP4) | 130 tok/s | ~38 tok/s |
| Average across workloads | 111 tok/s (3.0×) | 37 tok/s |

Same hardware. Same model weights. Bring your own benchmark: scripts/sweep_all_models.sh is in the repo, and we publish the vLLM baseline command alongside ours so you can verify both. If you reproduce a faster vLLM number, file an issue. We would rather be measured than congratulated.

Quick Start

docker pull avarok/atlas-gb10:latest

docker run --gpus all --ipc=host -p 8888:8888 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve Sehyo/Qwen3.5-35B-A3B-NVFP4 --speculative --mtp-quantization nvfp4

Anything OpenAI- or Anthropic-compatible (curl, the OpenAI SDK, opencode, Claude Code, Cline, Open WebUI) points at port 8888:

curl http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"atlas","messages":[{"role":"user","content":"Hello!"}],"max_tokens":256}'
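The payload above is plain OpenAI chat-completions JSON, so any language can build it. As a minimal sketch, here is a hypothetical Rust helper (not part of Atlas) that assembles the same body the curl example sends; in real use you would POST it to http://localhost:8888/v1/chat/completions with an HTTP client of your choice.

```rust
// Hypothetical helper: builds the same chat-completions payload as the
// curl example. Field names follow the OpenAI chat API; this is an
// illustration, not Atlas code.
fn chat_request_body(model: &str, user_msg: &str, max_tokens: u32) -> String {
    // Minimal JSON escaping for backslashes and quotes in the message.
    let escaped = user_msg.replace('\\', "\\\\").replace('"', "\\\"");
    format!(
        "{{\"model\":\"{model}\",\"messages\":[{{\"role\":\"user\",\"content\":\"{escaped}\"}}],\"max_tokens\":{max_tokens}}}"
    )
}

fn main() {
    // POST this body to http://localhost:8888/v1/chat/completions
    // with Content-Type: application/json.
    println!("{}", chat_request_body("atlas", "Hello!", 256));
}
```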

Per-model recipes (vision, MoE, multi-node EP=2, single-GPU 122B with the tighter budget) live in QUICKSTART.md.

What Ships Today

Thirteen hand-tuned (Hardware × Model × Quantization) targets across the Qwen3 / Qwen3.5 / Qwen3.6 / Qwen3-Next / Qwen3-VL / Gemma-4 / Mistral / MiniMax / Nemotron-H families. Every supported model runs off one multi-model binary; the right kernel set is selected at startup from the model's config.json.

| Model | Params / active | Quant | Architecture | Throughput |
| --- | --- | --- | --- | --- |
| Qwen3.5-35B-A3B (MTP K=2) | 35B / 3B | NVFP4 | GDN + Attention + MoE | ~130 tok/s |
| Qwen3-VL-30B-A3B | 30B / 3B | NVFP4 | Vision + Attention + MoE | ~97 tok/s |
| Nemotron-3-Nano-30B-A3B | 30B / 3.5B | NVFP4 / FP8 | Mamba-2 + Attention + MoE | ~88 tok/s |
| Qwen3-Next-80B-A3B | 80B / 3B | NVFP4 | SSM + Attention + MoE | ~74–87 tok/s |
| Qwen3.6-35B-A3B | 35B / 3B | FP8 | GDN + Attention + MoE + ViT | ~71 tok/s |
| Gemma-4-26B-A4B | 26B / 4B | NVFP4 | Attention + MoE (GeGLU) | ~67 tok/s |
| Qwen3.5-122B-A10B (EP=2) | 122B / 10B | NVFP4 | GDN + Attention + MoE | ~46 tok/s |
| Mistral-Small-4-119B | 119B / 6.5B | NVFP4 | MLA + MoE | ~33 tok/s |
| Nemotron-3-Super-120B-A12B | 120B / 12B | NVFP4 / FP8 | Mamba-2 + Attention + MoE | ~24 tok/s |
| MiniMax-M2.7 (EP=2) | 229B / ~10B | NVFP4 | Attention + 256-expert MoE | ~15 tok/s |
| Qwen3.5-27B (dense hybrid) | 27B | NVFP4 | Hybrid SSM + Attention | ~13 tok/s |
| Gemma-4-31B | 31B | NVFP4 | Attention (sliding + full) | ~9–11 tok/s |
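Since every model above runs off one multi-model binary, startup has to map the checkpoint's config.json to the right kernel set. As a minimal sketch of that idea (the directory layout mirrors the kernels/&lt;hw&gt;/&lt;model&gt;/&lt;quant&gt;/ convention described later, but the model_type strings and lookup table here are hypothetical, not Atlas's actual ones):

```rust
use std::collections::HashMap;

// Hypothetical sketch of startup kernel-set selection: map the
// `model_type` field from a HuggingFace config.json to a hand-tuned
// kernel directory. Names are illustrative, not Atlas's real tables.
fn select_kernel_dir(model_type: &str, quant: &str) -> Option<String> {
    let supported: HashMap<&str, &str> = [
        ("qwen3_moe", "qwen3"),
        ("qwen3_next", "qwen3-next"),
        ("gemma3", "gemma"),
    ]
    .into_iter()
    .collect();
    supported
        .get(model_type)
        .map(|family| format!("kernels/gb10/{family}/{quant}"))
}

fn main() {
    // A known architecture + quantization resolves to one directory.
    assert_eq!(
        select_kernel_dir("qwen3_next", "nvfp4").as_deref(),
        Some("kernels/gb10/qwen3-next/nvfp4")
    );
    // Unknown architectures fail fast instead of falling back silently.
    assert_eq!(select_kernel_dir("llama", "nvfp4"), None);
    println!("kernel selection ok");
}
```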

Full HuggingFace IDs, methodology, and the kernel-by-kernel comparison against PyTorch eager live in the GitHub README.

What Works Today

| Component | Status |
| --- | --- |
| OpenAI- and Anthropic-compatible HTTP API (streaming + non-streaming) | ✅ |
| Tool calling (Hermes, Qwen3-Coder, Mistral formats) with grammar-constrained decoding | ✅ |
| Reasoning / thinking tokens with budget cap | ✅ |
| Concurrent batched decode + per-batch CUDA graphs | ✅ |
| MTP speculative decoding (K=2, pipelined verify) | ✅ |
| Prefix caching via radix tree (RadixAttention) + SSM snapshot cache (Marconi): 10× warm-cache TTFT | ✅ |
| KV cache dtypes: BF16, FP8, NVFP4, turbo3, turbo4 | ✅ |
| MoE routing up to 512 experts | ✅ |
| Vision encoder (Qwen3-VL, Qwen3.6 ViT) | ✅ |
| Multi-GPU expert parallelism (EP=2 over RoCEv2) | ✅ |
| SLO-aware scheduling, chunked prefill, active context compaction | ✅ |
| High-speed NVMe KV swap (sliding-window aware) | ✅ |
| Auto OOM pre-flight + UVM fallback on host OOM | ✅ |
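To make the prefix-caching row concrete: RadixAttention-style caching keeps processed token sequences in a radix/trie structure so a new request can skip prefill for its longest cached prefix. A minimal sketch under stated assumptions (the types and the plain one-token-per-node trie are hypothetical simplifications; Atlas's tree also snapshots SSM state and tracks KV block handles):

```rust
use std::collections::HashMap;

// Toy trie over token IDs illustrating the prefix-cache lookup.
#[derive(Default)]
struct PrefixNode {
    children: HashMap<u32, PrefixNode>,
}

#[derive(Default)]
struct PrefixCache {
    root: PrefixNode,
}

impl PrefixCache {
    // Record a fully processed token sequence in the cache.
    fn insert(&mut self, tokens: &[u32]) {
        let mut node = &mut self.root;
        for &t in tokens {
            node = node.children.entry(t).or_default();
        }
    }

    // Length of the longest cached prefix of `tokens`; the scheduler
    // can skip prefill for that many tokens on a warm-cache hit.
    fn longest_prefix(&self, tokens: &[u32]) -> usize {
        let mut node = &self.root;
        let mut hit = 0;
        for &t in tokens {
            match node.children.get(&t) {
                Some(next) => {
                    node = next;
                    hit += 1;
                }
                None => break,
            }
        }
        hit
    }
}

fn main() {
    let mut cache = PrefixCache::default();
    cache.insert(&[1, 2, 3, 4]); // e.g. a shared system prompt
    assert_eq!(cache.longest_prefix(&[1, 2, 3, 9, 9]), 3);
    assert_eq!(cache.longest_prefix(&[5, 6]), 0);
    println!("warm-cache prefix hits ok");
}
```

This is why the warm-cache TTFT win is largest for agentic workloads, where many requests share a long system prompt and conversation history.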

Plug & Play Architecture

Atlas is built around a small set of Rust traits and a kernel registry. Each row marked with 🔌 below is an abstraction boundary where a new integration plugs in without touching anything above or below it:

| Plug Point | What It Abstracts | To Add Support |
| --- | --- | --- |
| 🔌 trait ModelWeightLoader | HuggingFace → layer translation | Implement one struct + add a match arm in factory.rs |
| 🔌 trait TransformerLayer | Per-layer compute (attn, SSM, MoE, FFN) | Compose existing primitives or implement a new layer type |
| 🔌 trait GpuBackend | All GPU memory and kernel ops | Swap CUDA for another accelerator backend |
| 🔌 kernels/&lt;hw&gt;/&lt;model&gt;/&lt;quant&gt;/ | Hardware-tuned CUDA kernels | Drop a directory with MODEL.toml + .cu files; build.rs auto-discovers it |
| 🔌 trait CommBackend | Multi-GPU collectives | Implement for MPI, GDR, custom interconnects |
| 🔌 trait StorageBackend | NVMe KV-cache offload I/O | Implement for CXL, RDMA, other storage tiers |

A MockGpuBackend in spark-runtime lets you write and test the entire scaffold without owning the hardware: every layer above the GPU trait is hardware-agnostic.
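The mock-backend pattern can be sketched in a few lines. The trait shape and method names below are hypothetical (Atlas's real GpuBackend surface is larger); the point is that code written against the trait runs unchanged on a CPU-only mock:

```rust
// Hedged sketch of the GpuBackend plug point: a trait owning memory and
// kernel ops, plus a CPU-only mock. Method names are illustrative, not
// Atlas's actual API.
trait GpuBackend {
    fn alloc(&mut self, len: usize) -> usize; // returns a buffer handle
    fn upload(&mut self, handle: usize, data: &[f32]);
    fn scale(&mut self, handle: usize, factor: f32); // stand-in "kernel"
    fn download(&self, handle: usize) -> Vec<f32>;
}

#[derive(Default)]
struct MockGpuBackend {
    buffers: Vec<Vec<f32>>, // host memory stands in for device memory
}

impl GpuBackend for MockGpuBackend {
    fn alloc(&mut self, len: usize) -> usize {
        self.buffers.push(vec![0.0; len]);
        self.buffers.len() - 1
    }
    fn upload(&mut self, handle: usize, data: &[f32]) {
        self.buffers[handle].copy_from_slice(data);
    }
    fn scale(&mut self, handle: usize, factor: f32) {
        for x in &mut self.buffers[handle] {
            *x *= factor;
        }
    }
    fn download(&self, handle: usize) -> Vec<f32> {
        self.buffers[handle].clone()
    }
}

fn main() {
    // Everything above the trait is exercised without a GPU.
    let mut gpu = MockGpuBackend::default();
    let h = gpu.alloc(3);
    gpu.upload(h, &[1.0, 2.0, 3.0]);
    gpu.scale(h, 2.0);
    assert_eq!(gpu.download(h), vec![2.0, 4.0, 6.0]);
    println!("mock backend ok");
}
```

Swapping in a real CUDA (or other accelerator) implementation of the trait is then a drop-in change for every layer above it.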

What the Community is Saying

"103 tok/s sustained on the 35B, startup in 15 seconds. Night and day compared to vLLM's 10-minute torch.compile cycle. Then tried the 122B, 43.8 tok/s with MTP, a 41% speedup over our vLLM hybrid, same hardware, 2-minute startup." โ€” ronald_15496, Discord #general

"Testing atlas-qwen3.5-35b for over an hour on a PNY DGX Spark in an agentic workflow. Super impressed. Spark is actually awesome with Atlas." โ€” PersonWhoThinks, r/LocalLLaMA

"I've grown tired of vLLM and have been hoping for something. I was really surprised and impressed. I'm so glad I bought Spark because I came across this." โ€” tetsuro59, Discord #general

Citations

We did not invent the kernels we ship. We picked the right ideas from the right papers, fused them together, and tuned them for one chip until they pinned the bandwidth ceiling. Direct intellectual debts: FlashAttention-2 (Dao, 2024), FlashAttention-4 (Shah et al., 2025), FlashInfer (Ye et al., MLSys 2025), SageAttention 3 (Zhang et al., NeurIPS 2025), LeanAttention (Roy et al., 2024). Full references in the GitHub README.

License & Enterprise Edition

Atlas operates under a dual-license model; both licenses are real and intentional.

  1. Community Edition: AGPLv3. Free, open, copyleft. Use it on your own hardware for research, hobby, side projects, hosted demos.
  2. Enterprise Edition: commercial license. Ship Atlas inside a closed-source product, run it as a SaaS backend without inheriting the AGPLv3 source-disclosure obligation, get a support relationship with the people who wrote the kernels, and get prioritized model and hardware ports. Reach us via the website or Discord.

The commercial license keeps us building Atlas full-time; the AGPLv3 community license keeps the project honest. What is in this repository is what we run.


atlasinference.io · GitHub · Docker Hub · Discord
