Qwen3-Next-80B-A3B-Instruct FastMTP Speculator — UltraChat Epoch 3

This is a FastMTP (Multi-Token Prediction) speculator for Qwen/Qwen3-Next-80B-A3B-Instruct, saved after epoch 3 of fine-tuning on UltraChat 200k.

Model Details

Property	Value
Architecture	FastMTPSpeculator
Base model	Qwen/Qwen3-Next-80B-A3B-Instruct
Speculative tokens	3
Training dataset	HuggingFaceH4/ultrachat_200k
Epoch	3
dtype	bfloat16

What is FastMTP?

FastMTP is a speculative decoding method that attaches lightweight Multi-Token Prediction (MTP) heads to a base model. During inference, the MTP heads draft multiple tokens in parallel (three per step for this checkpoint), and the base model verifies them in a single forward pass. Each accepted draft token saves one base-model decoding step, so throughput improves, and because the base model checks every token, output quality is unchanged.
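The draft-then-verify step can be illustrated with a minimal sketch of greedy acceptance. This is a simplified illustration, not the vLLM or speculators implementation: the function name `accept_draft` and the token ids are hypothetical, and it assumes greedy decoding, where a draft token is kept only if it equals the token the base model itself would have chosen at that position.

```python
from typing import List


def accept_draft(draft: List[int], target: List[int]) -> List[int]:
    """Return the tokens emitted in one speculative step (greedy case).

    draft:  k token ids proposed by the MTP heads.
    target: k + 1 token ids chosen by the base model in its single
            verification pass; target[i] is the base model's choice
            given the context followed by draft[:i].
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must hold one more token than draft")

    emitted: List[int] = []
    for i, token in enumerate(draft):
        if token != target[i]:
            break  # first disagreement: discard this and all later drafts
        emitted.append(token)

    # Always emit one token from the base model itself: the correction at
    # the first mismatch, or a bonus token if every draft was accepted.
    emitted.append(target[len(emitted)])
    return emitted


# Hypothetical token ids: the first two drafts match, the third does not.
print(accept_draft([5, 7, 9], [5, 7, 2, 4]))  # [5, 7, 2]
```

Every emitted token is one the base model would have produced on its own, which is why the output matches ordinary greedy decoding. With 3 speculative tokens, a single verification pass emits between 1 and 4 tokens.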

This speculator is compatible with vLLM and the speculators library.

Usage

from speculators import SpeculatorModel

speculator = SpeculatorModel.from_pretrained("inference-optimization/Qwen3-Next-80B-A3B-Instruct-MTP-ultrachat-epoch3")

For vLLM inference, pass the repo ID as the speculative_model argument alongside the base model.

Training Details

Fine-tuning dataset: HuggingFaceH4/ultrachat_200k
Checkpoint: End of epoch 3 (zero-based checkpoint directory index 2)
MTP loss step weights: [0.51, 0.31, 0.18] (exponential decay beta=0.6, normalized)
Frozen during training: embed_tokens, lm_head
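The MTP loss step weights listed above can be reproduced with a short calculation. This sketch assumes that "exponential decay beta=0.6, normalized" means raw weights of beta**k for step k = 0, 1, 2, divided by their sum. The helper name `mtp_step_weights` is hypothetical and not part of the speculators library.

```python
from typing import List


def mtp_step_weights(num_steps: int, beta: float) -> List[float]:
    """Exponentially decayed per-step loss weights, normalized to sum to 1."""
    raw = [beta ** k for k in range(num_steps)]  # 1, beta, beta^2, ...
    total = sum(raw)
    return [w / total for w in raw]


weights = mtp_step_weights(num_steps=3, beta=0.6)
print([round(w, 2) for w in weights])  # [0.51, 0.31, 0.18]
```

Under this reading the raw weights are 1, 0.6, and 0.36, which sum to 1.96. The first drafted token therefore carries about half of the total loss, and later draft positions count progressively less.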

Model size: 2B params (Safetensors, BF16)
