This is a FastMTP (Multi-Token Prediction) speculator for Qwen/Qwen3-Next-80B-A3B-Instruct, saved after epoch 3 of fine-tuning on UltraChat 200k.
| Property | Value |
|---|---|
| Architecture | FastMTPSpeculator |
| Base model | Qwen/Qwen3-Next-80B-A3B-Instruct |
| Speculative tokens | 3 |
| Training dataset | HuggingFaceH4/ultrachat_200k |
| Epoch | 3 |
| dtype | bfloat16 |
FastMTP is a speculative decoding algorithm that attaches lightweight Multi-Token Prediction (MTP) heads to a base model. During inference, the MTP heads draft multiple tokens in parallel, and the base model verifies them in a single forward pass. This improves throughput without changing output quality.
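The draft-then-verify step can be illustrated with a toy sketch. This is not the FastMTP or vLLM implementation: it assumes greedy decoding, and `verify_draft` and `toy_base` are hypothetical names introduced only for illustration. A real engine scores all draft positions in one forward pass of the base model; the loop below calls the base model once per position purely for readability.

```python
from typing import Callable, List


def verify_draft(
    context: List[int],
    draft: List[int],
    base_next_token: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens emitted by one draft-and-verify step (greedy case).

    Draft tokens are accepted while they match what the base model would
    have produced. At the first mismatch the base model's token is used
    instead and the remaining draft tokens are discarded.
    """
    accepted: List[int] = []
    for token in draft:
        expected = base_next_token(context + accepted)
        if token != expected:
            # Rejected: keep the base model's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Every draft token matched, so the base model adds one extra token.
    accepted.append(base_next_token(context + accepted))
    return accepted


def toy_base(tokens: List[int]) -> int:
    """Hypothetical stand-in for the base model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


# Two of three draft tokens match; the third is replaced by the base model.
print(verify_draft([1, 2], [3, 4, 9], toy_base))
# All three draft tokens match, so four tokens are emitted in one step.
print(verify_draft([1, 2], [3, 4, 5], toy_base))
```

Because every emitted token is one the base model would have chosen itself, the output matches plain greedy decoding; the gain comes from emitting several tokens per verification step when the drafts are accurate.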
This speculator is compatible with vLLM and the speculators library.
```python
from speculators import SpeculatorModel

speculator = SpeculatorModel.from_pretrained("inference-optimization/Qwen3-Next-80B-A3B-Instruct-MTP-ultrachat-epoch3")
```
For vLLM inference, pass the repo ID as the `speculative_model` argument alongside the base model.
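A minimal sketch of what that might look like with vLLM's offline `LLM` API follows. It rests on assumptions: recent vLLM releases take the draft model through a `speculative_config` dictionary rather than a standalone `speculative_model` argument, so the exact keyword depends on the installed version and should be checked against its documentation. The helper names `build_engine_kwargs` and `main` are hypothetical, and the value of 3 speculative tokens is taken from the table above.

```python
BASE_MODEL = "Qwen/Qwen3-Next-80B-A3B-Instruct"
SPECULATOR = "inference-optimization/Qwen3-Next-80B-A3B-Instruct-MTP-ultrachat-epoch3"


def build_engine_kwargs(num_speculative_tokens: int = 3) -> dict:
    """Build keyword arguments for vllm.LLM (hypothetical helper).

    Assumes a vLLM version that accepts a `speculative_config` dict;
    older versions used separate speculative-decoding arguments.
    """
    return {
        "model": BASE_MODEL,
        "speculative_config": {
            "model": SPECULATOR,
            "num_speculative_tokens": num_speculative_tokens,
        },
    }


def main() -> None:
    # Imported here so the helper above can be inspected without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**build_engine_kwargs())
    params = SamplingParams(temperature=0.0, max_tokens=64)
    outputs = llm.generate(["Explain speculative decoding in one sentence."], params)
    print(outputs[0].outputs[0].text)
```

Calling `main()` requires vLLM and enough GPU memory to host the 80B base model; additional engine options such as tensor parallelism are omitted here and would depend on the hardware.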