⚡ Gemma 4 31B IT NVFP4 Turbo GGUF

Requires ggml-org/llama.cpp#21971

A repackaged nvidia/Gemma-4-31B-IT-NVFP4 that uses 68% less GPU memory and runs ~2.5× faster than the base model, while retaining nearly identical quality (1–3% loss). Fits on a single RTX 5090 (🎉).

Approach

Three changes were made:

  1. Quantized all self-attention weights from BF16 → FP4 (RTN, group_size=16, matching modelopt NVFP4 format)
  2. Updated architecture to Gemma4ForCausalLM and quantization config accordingly
  3. Stripped the vision and audio encoders

Everything else is untouched — MLP layers keep NVIDIA's calibrated FP4, embed_tokens stays BF16, all norms preserved, so we retain all the nvidia/Gemma-4-31B-IT-NVFP4 optimizations.
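The RTN step in change 1 can be sketched in a few lines. This is a toy NumPy illustration, not modelopt's actual NVFP4 packing (real NVFP4 stores packed 4-bit codes with FP8 E4M3 per-group scales; here the scales stay in float and values are immediately dequantized):

```python
import numpy as np

# FP4 (E2M1) representable magnitudes; NVFP4 groups 16 values per scale.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def rtn_nvfp4(w, group_size=16):
    """Round-to-nearest FP4 quantize/dequantize of a 1-D weight vector."""
    w = np.asarray(w, dtype=np.float64)
    pad = (-len(w)) % group_size
    groups = np.pad(w, (0, pad)).reshape(-1, group_size)
    # Per-group scale maps the group's max magnitude onto the top grid value.
    scales = np.abs(groups).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales[scales == 0] = 1.0
    scaled = groups / scales
    # Round each magnitude to the nearest FP4 grid point, keeping the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    deq = np.sign(scaled) * FP4_GRID[idx] * scales
    return deq.reshape(-1)[: len(w)]
```

No calibration data is involved: the output depends only on the weights, which is what makes the process fully reproducible.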

Why RTN didn't hurt quality

RTN (Round-To-Nearest) is the simplest quantization method — no calibration data, fully reproducible. It worked here because:

  • FP4 with group_size=16 and per-group scaling preserves relative weight distributions well
  • Self-attention weights tend to be normally distributed near zero, where the FP4 grid has finest resolution (0, 0.5, 1.0, 1.5)
  • MLP layers (more sensitive to quantization) keep NVIDIA's calibrated FP4
  • embed_tokens stays BF16, preventing noise from propagating through all layers
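The second bullet can be checked empirically. The sketch below (a simplified emulation of per-group scaling, using synthetic Gaussian weights rather than the actual model tensors) measures how many values land where adjacent FP4 grid points are only 0.5 apart:

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

rng = np.random.default_rng(0)
w = rng.standard_normal(1 << 16)  # stand-in for a self-attention weight tensor

# Emulate per-group scaling: map each group's max |w| onto the top grid value.
groups = w.reshape(-1, 16)
scaled = groups / (np.abs(groups).max(axis=1, keepdims=True) / FP4_GRID[-1])

# Grid spacing is 0.5 up to |x| = 2; beyond that it widens to 1.0, then 2.0.
fine = np.mean(np.abs(scaled) <= 2.0)
print(f"fraction of values in the fine-grained |x| <= 2 region: {fine:.2f}")
```

For normally distributed weights, the bulk of the mass sits near zero after scaling, so most values get the tightest rounding step, which is why plain RTN loses so little here.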

License

Apache 2.0 — same as the base model.

Credits
