⚡ Gemma 4 31B IT NVFP4 Turbo GGUF
Requires ggml-org/llama.cpp#21971
A repackaged nvidia/Gemma-4-31B-IT-NVFP4 that is 68% smaller in GPU memory and ~2.5× faster than the base model, while retaining nearly identical quality (1-3% loss). Fits on a single RTX 5090 (🎉).
Approach
Three changes were made:
- Quantized all self-attention weights from BF16 → FP4 (RTN, group_size=16, matching modelopt NVFP4 format)
- Updated the architecture to `Gemma4ForCausalLM` and the quantization config accordingly
- Stripped the vision and audio encoders
Everything else is untouched — MLP layers keep NVIDIA's calibrated FP4, embed_tokens stays BF16, all norms preserved, so we retain all the nvidia/Gemma-4-31B-IT-NVFP4 optimizations.
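The attention-weight step can be sketched as plain round-to-nearest FP4 with per-group scaling. The snippet below is a minimal illustration assuming the E2M1 value grid used by NVFP4; it is not the actual conversion script, and the real format additionally stores the per-group scales in FP8 rather than full precision.

```python
import numpy as np

# E2M1 (FP4) representable magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX = 6.0

def rtn_fp4(weights, group_size=16):
    """Round-to-nearest FP4 with per-group scaling (simplified sketch).

    `weights` must have a total size divisible by `group_size`.
    Returns the dequantized weights, i.e. what the model would see.
    """
    w = weights.reshape(-1, group_size)
    # One scale per group, chosen so the largest magnitude maps to FP4_MAX.
    scales = np.abs(w).max(axis=1, keepdims=True) / FP4_MAX
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    scaled = w / scales
    # Round each scaled value to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return (q * scales).reshape(weights.shape)

w = np.random.default_rng(1).standard_normal((4, 16)).astype(np.float32)
dq = rtn_fp4(w)
```

No calibration data is involved: the quantized value of each weight depends only on the weight itself and its 15 group neighbors, which is what makes RTN fully reproducible.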
Why RTN didn't hurt quality
RTN (Round-To-Nearest) is the simplest quantization method — no calibration data, fully reproducible. It worked here because:
- FP4 with group_size=16 and per-group scaling preserves relative weight distributions well
- Self-attention weights tend to be normally distributed near zero, where the FP4 grid has finest resolution (0, 0.5, 1.0, 1.5)
- MLP layers (more sensitive to quantization) keep NVIDIA's calibrated FP4
- `embed_tokens` stays BF16, preventing quantization noise from propagating through all layers
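The grid-resolution point above can be checked numerically: the spacing between adjacent FP4 magnitudes grows from 0.5 near zero to 2.0 near the top of the range, so most of a zero-centered normal distribution lands in the finest-grained region. The scaling below (mapping 3σ to the FP4 max) is an illustrative assumption, not how any particular layer is scaled.

```python
import numpy as np

# E2M1 (FP4) representable magnitudes and the gaps between them
fp4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
gaps = np.diff(fp4)  # steps of 0.5 near zero, widening to 2.0 at the top

# Scale a standard normal so its 3-sigma point hits the FP4 max (6.0),
# then measure how much mass falls where the grid step is still 0.5.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000) * (6.0 / 3.0)
frac_fine = np.mean(np.abs(x) <= 2.0)  # roughly 68%: the 1-sigma mass
```

So for roughly two thirds of normally distributed weights, RTN rounds within the tightest part of the grid, which is consistent with the small quality loss observed.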
License
Apache 2.0 — same as the base model.
Credits
- Google DeepMind for Gemma 4
- NVIDIA for the modelopt NVFP4 checkpoint
Model tree for CISCai/gemma-4-31B-it-NVFP4-turbo-GGUF
- Base model: google/gemma-4-31B-it
- Quantized from: LilaRest/gemma-4-31B-it-NVFP4-turbo
Evaluation results
- GPQA Diamond accuracy (self-reported): 72.73
- MMLU Pro accuracy (self-reported): 83.93