Qwen3-Next-80B-A3B-Instruct FastMTP Speculator — UltraChat Epoch 3

This is a FastMTP (Multi-Token Prediction) speculator for Qwen/Qwen3-Next-80B-A3B-Instruct, saved after epoch 3 of fine-tuning on UltraChat 200k.

Model Details

  • Architecture: FastMTPSpeculator
  • Base model: Qwen/Qwen3-Next-80B-A3B-Instruct
  • Speculative tokens: 3
  • Training dataset: HuggingFaceH4/ultrachat_200k
  • Epoch: 3
  • dtype: bfloat16

What is FastMTP?

FastMTP is a speculative decoding algorithm that attaches lightweight Multi-Token Prediction (MTP) heads to a base model. During inference, the MTP heads draft multiple tokens in parallel, which the base model verifies in a single forward pass — improving throughput without changing output quality.
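The draft-then-verify loop can be sketched with a toy example. The "base model" below is a hypothetical deterministic stand-in (not the real Qwen3-Next model), used only to show how a drafted sequence is accepted up to the first mismatch, with the base model supplying the correction:

```python
# Toy illustration of the draft-then-verify loop behind FastMTP-style
# speculative decoding. `base_next_token` is a hypothetical stand-in
# for the base model's greedy next-token choice.

def base_next_token(token: int) -> int:
    """Stand-in for the base model's greedy next-token prediction."""
    return (token * 7 + 3) % 100

def verify_draft(last_token: int, draft: list[int]) -> list[int]:
    """Accept the longest prefix of the draft that matches the base
    model's greedy choices; in real speculative decoding, all draft
    positions are checked in a single base-model forward pass."""
    accepted = []
    current = last_token
    for proposed in draft:
        expected = base_next_token(current)
        if proposed != expected:
            accepted.append(expected)  # base model's correction
            return accepted
        accepted.append(proposed)
        current = proposed
    accepted.append(base_next_token(current))  # bonus token when all match
    return accepted
```

With this stand-in, a partially correct draft is truncated at the mismatch (`verify_draft(1, [10, 73, 99])` returns `[10, 73, 14]`), while a fully correct draft yields all three tokens plus one bonus token, which is where the throughput gain comes from.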

This speculator is compatible with vLLM and the speculators library.

Usage

from speculators import SpeculatorModel

# Load the trained MTP speculator heads from the Hub
speculator = SpeculatorModel.from_pretrained("inference-optimization/Qwen3-Next-80B-A3B-Instruct-MTP-ultrachat-epoch3")

For vLLM inference, pass the repo ID as the speculative_model argument alongside the base model.
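A minimal sketch of that vLLM setup is below. The exact argument names vary across vLLM versions (older releases take `speculative_model` directly, newer ones a `speculative_config` dict), so check the docs for your installed version; the `LLM` construction is shown in comments since it requires the 80B base model to run:

```python
# Hedged sketch of a vLLM speculative-decoding config for this
# speculator. Argument names are version-dependent assumptions;
# only the config dict is exercised here.
spec_config = {
    "model": "inference-optimization/Qwen3-Next-80B-A3B-Instruct-MTP-ultrachat-epoch3",
    "num_speculative_tokens": 3,  # matches this speculator's 3 MTP heads
}

# from vllm import LLM
# llm = LLM(
#     model="Qwen/Qwen3-Next-80B-A3B-Instruct",
#     speculative_config=spec_config,
# )
```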

Training Details

  • Fine-tuning dataset: HuggingFaceH4/ultrachat_200k
  • Checkpoint: End of epoch 3 (directory index 2)
  • MTP loss step weights: [0.51, 0.31, 0.18] (exponential decay beta=0.6, normalized)
  • Frozen during training: embed_tokens, lm_head
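The step weights above can be recomputed from the stated decay schedule, which clarifies where [0.51, 0.31, 0.18] comes from:

```python
# Recomputing the MTP loss step weights: exponential decay with
# beta = 0.6 over the 3 speculative steps, normalized to sum to 1.
beta = 0.6
raw = [beta ** k for k in range(3)]           # [1.0, 0.6, 0.36]
total = sum(raw)                              # 1.96
weights = [round(w / total, 2) for w in raw]  # [0.51, 0.31, 0.18]
```

Earlier speculative positions are weighted more heavily because they are drafted from less stale context and accepted more often.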
