Qwen3.5-397B-A17B-TurboQuant-MLX-4bit

4-bit MLX weight-quantized build of Qwen/Qwen3.5-397B-A17B (397B total / 17B active Sparse MoE, multimodal) prepared with TurboQuant randomized Hadamard rotations. Optimized for Apple Silicon via MLX.

4-bit is the recommended production-grade variant for most Apple Silicon users: it fits on 256 GB Macs at short-to-medium context (roughly 8K to 32K tokens, per the memory estimates below), and is the lowest bit-width at which TurboQuant maintains strong reasoning.

Quickstart

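from mlx_lm import load, generate

# Fetch (or reuse the locally cached) weights and tokenizer from the Hugging Face Hub.
model, tokenizer = load("majentik/Qwen3.5-397B-A17B-TurboQuant-MLX-4bit")

# Format the conversation with the model's own chat template.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a Swift function to compute Fibonacci numbers."}],
    add_generation_prompt=True,
)

# verbose=True already streams the completion to stdout, so the returned text is not printed again.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)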

Model Specs

Property	Value
Base model	Qwen/Qwen3.5-397B-A17B
Architecture	Sparse Mixture-of-Experts (MoE)
Total parameters	397B
Active per token	17B
Modalities	Image + Text → Text (`image-text-to-text`)
Context window	256K tokens
Weight quantization	4-bit MLX (TurboQuant pre-rotation)
Approx. disk footprint	~220 GB
License	Apache 2.0
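
The ~220 GB disk footprint can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes MLX's default affine quantization layout (groups of 64 weights, each carrying one 16-bit scale and one 16-bit bias); the group size used for this build is not stated here, and the estimate ignores any layers that may be kept at higher precision. The function name and its parameters are illustrative only.

```python
# Rough size estimate for a 4-bit weight-quantized checkpoint.
# Assumptions (not confirmed by this card): group size 64, with one
# 16-bit scale and one 16-bit bias stored per group of weights.

def quantized_size_gb(n_params: float, bits: int = 4, group_size: int = 64,
                      scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Return the approximate checkpoint size in decimal gigabytes."""
    # Per-group metadata amortized over the weights in the group.
    overhead_bits = (scale_bits + bias_bits) / group_size
    bits_per_weight = bits + overhead_bits
    return n_params * bits_per_weight / 8 / 1e9

size = quantized_size_gb(397e9)  # 397B total parameters
print(f"{size:.0f} GB")
```

Under these assumptions each weight costs 4.5 bits on average, which puts the estimate in the same range as the figure in the table.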

RotorQuant vs TurboQuant

Aspect	TurboQuant (this repo)	RotorQuant
Rotation	Randomized Hadamard (static)	Learned orthogonal rotors (data-calibrated)
Calibration	Zero-shot	~512 sample calibration pass
Accuracy @ 4-bit	~98.6% of FP16 baseline	~99.1% of FP16 baseline
Best for	Fastest turnaround	Highest fidelity at same bit-width
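
To illustrate what a "randomized Hadamard (static)" rotation does, here is a minimal pure-Python sketch. It is not the pipeline used to produce this repo, which is not described here: it applies a random sign flip followed by an orthonormal Walsh-Hadamard transform to a toy weight vector, quantizes to 16 levels, and compares the reconstruction error with and without the rotation. All function names, the toy data, and the single-group min/max quantizer are assumptions made for the example.

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    x = list(x)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]

def rotate(w, signs):
    """Apply y = H D w, where D is a diagonal of random +/-1 signs."""
    return fwht([s * v for s, v in zip(signs, w)])

def unrotate(y, signs):
    """Invert the rotation: w = D H y (the scaled H and D are each their own inverse)."""
    return [s * v for s, v in zip(signs, fwht(y))]

def fake_quant_4bit(v):
    """Round to 16 evenly spaced levels between min(v) and max(v), then dequantize.

    The whole vector is treated as one quantization group for simplicity.
    """
    lo, hi = min(v), max(v)
    step = (hi - lo) / 15 or 1.0  # fall back to 1.0 on constant input
    return [lo + round((x - lo) / step) * step for x in v]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

rng = random.Random(0)
n = 256
w = [rng.uniform(-0.1, 0.1) for _ in range(n)]
w[3] = 8.0  # a single outlier that dominates the quantization range
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]

err_plain = mse(w, fake_quant_4bit(w))
err_rot = mse(w, unrotate(fake_quant_4bit(rotate(w, signs)), signs))
print(f"plain 4-bit MSE:   {err_plain:.6f}")
print(f"rotated 4-bit MSE: {err_rot:.6f}")
```

Because the rotation is orthogonal, it preserves the vector's norm while spreading the outlier's energy across all coordinates, so the 16 levels cover a much narrower range. No calibration data is involved, which is the sense in which the table calls this approach zero-shot.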

Memory Estimates (4-bit MLX)

Context	Active memory (approx.)
8K	~228 GB
32K	~238 GB
128K	~268 GB
256K	~298 GB
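
For context lengths between the listed rows, a rough figure can be obtained by interpolating the table. The sketch below assumes memory grows linearly between neighbouring rows, which is a simplification for illustration rather than a measured property of this model; the names are hypothetical.

```python
# (context length in thousands of tokens, approximate active memory in GB),
# copied from the table above.
MEMORY_TABLE_GB = [(8, 228), (32, 238), (128, 268), (256, 298)]

def estimate_memory_gb(context_k: float) -> float:
    """Linearly interpolate between neighbouring table rows.

    Linear growth between rows is an assumption, not a measurement.
    """
    points = MEMORY_TABLE_GB
    if not points[0][0] <= context_k <= points[-1][0]:
        raise ValueError("context outside the range covered by the table")
    for (c0, m0), (c1, m1) in zip(points, points[1:]):
        if c0 <= context_k <= c1:
            return m0 + (m1 - m0) * (context_k - c0) / (c1 - c0)

print(estimate_memory_gb(64))  # a 64K context, between the 32K and 128K rows
```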

Hardware Requirements

Minimum: Apple Silicon with 256 GB unified memory, which covers contexts up to roughly 32K tokens per the estimates above
Recommended: 384 GB+ unified memory for the full 256K context
Does not fit on 96 GB, 128 GB, or 192 GB Macs; use the 2-bit variant or a smaller model
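
The requirements above follow directly from the memory estimates. As a small sketch, the helper below looks up the largest listed context whose estimate fits a given amount of unified memory. The function name and the `headroom_gb` parameter (an allowance for the OS and other apps) are hypothetical, and the numbers are simply the approximate figures from the "Memory Estimates" table.

```python
# Approximate active memory per context length, copied from the
# "Memory Estimates (4-bit MLX)" table: (context label, GB).
MEMORY_ESTIMATES_GB = [("8K", 228), ("32K", 238), ("128K", 268), ("256K", 298)]

def largest_context_that_fits(unified_memory_gb, headroom_gb=0):
    """Return the largest listed context whose estimate fits, or None.

    headroom_gb is a hypothetical allowance for the OS and other apps.
    """
    budget = unified_memory_gb - headroom_gb
    fitting = [label for label, gb in MEMORY_ESTIMATES_GB if gb <= budget]
    return fitting[-1] if fitting else None

for mem in (192, 256, 512):
    print(mem, "GB ->", largest_context_that_fits(mem))
```

In practice some memory has to be left for the system, so passing a nonzero `headroom_gb` gives a more conservative answer.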
