Instructions for using QwenPilot/FIPO_32B with libraries and local apps.

Libraries

How to use QwenPilot/FIPO_32B with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="QwenPilot/FIPO_32B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("QwenPilot/FIPO_32B")
model = AutoModelForCausalLM.from_pretrained("QwenPilot/FIPO_32B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and tokenize in one step
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

Local Apps

vLLM

How to use QwenPilot/FIPO_32B with vLLM:

Install from pip and serve model

```bash
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "QwenPilot/FIPO_32B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "QwenPilot/FIPO_32B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
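
The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes the server returns the usual OpenAI-style `choices[0].message.content` field.

```python
import json
import urllib.request


def build_chat_request(prompt, model="QwenPilot/FIPO_32B",
                       base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body (an assumption, not checked here).
    """
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("What is the capital of France?")` should return the reply text. Passing `base_url="http://localhost:30000"` would point the same helper at a server on a different port.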

Use Docker

```
docker model run hf.co/QwenPilot/FIPO_32B
```

SGLang

How to use QwenPilot/FIPO_32B with SGLang:

Install from pip and serve model

```bash
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "QwenPilot/FIPO_32B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "QwenPilot/FIPO_32B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "QwenPilot/FIPO_32B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "QwenPilot/FIPO_32B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Docker Model Runner
How to use QwenPilot/FIPO_32B with Docker Model Runner:
```
docker model run hf.co/QwenPilot/FIPO_32B
```

Note:

Important update: we recently discovered that the checkpoint uploaded previously was from an intermediate training step and did not represent a fully converged model. We have since corrected this by replacing it with the proper converged checkpoint. We apologize for the oversight and will ensure the checkpoint remains up to date going forward.

FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

🏠 Homepage | 📝 Paper PDF | 🤗 Hugging Face | 🤖 ModelScope | 🐱 GitHub

Qwen Pilot, Alibaba Group | Published on March 20, 2026

FIPO is a value-free RL recipe for eliciting deeper reasoning from a clean base model. The central idea is simple: GRPO-style training works, but its token-level credit assignment is too coarse. FIPO densifies that signal with a discounted Future-KL term that reflects how the rest of the trajectory evolves after each token. Empirically, this finer-grained reinforcement lets the model break through the length stagnation observed in standard baselines. Trained on Qwen2.5-32B-Base, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and raises AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0%, relative to the DAPO baseline.

Overview

Figure 1. FIPO vs. baselines on AIME 2024. FIPO shows that pure RL training alone can outperform reproduced pure-RL baselines such as DAPO and DeepSeek-R1-Zero-32B, surpass o1-mini, and produce substantially longer responses on average.

Highlights

- Pure RL only: FIPO outperforms reproduced DAPO and DeepSeek-R1-Zero-32B, and surpasses o1-mini on AIME 2024.
- Dense advantage formulation: instead of assigning one uniform outcome-level signal to all tokens, FIPO reweights each token by the discounted signed shift of its future trajectory.
- Deeper reasoning: on Qwen2.5-32B-Base, FIPO breaks the usual 4k-token plateau and extends average reasoning length to 10,000+ tokens.
- Stronger performance: AIME 2024 Pass@1 improves from 50.0% to a peak of 58.0%.

Core Change

FIPO keeps the standard PPO/DAPO scaffold but changes how token-level updates are weighted. The local signal is the signed log-probability shift between the current and old policies:

$\Delta \log p_t = \log \pi_\theta(y_t \mid x, y_{1:t-1}) - \log \pi_{old}(y_t \mid x, y_{1:t-1})$

Positive values mean the token is being reinforced, while negative values mean it is being suppressed. Since reasoning is sequential, FIPO then accumulates this signal over the future trajectory:

$FutureKL_t = \sum_{k=t}^{T} M_k \cdot \gamma^{k-t} \cdot \Delta \log p_k$
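
The two formulas above can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and since $M_k$ is not defined in this section, the code assumes it is a per-token 0/1 mask. The sum is evaluated with a single reverse scan, using the identity $FutureKL_t = M_t \Delta \log p_t + \gamma \cdot FutureKL_{t+1}$.

```python
def log_prob_shift(new_logprobs, old_logprobs):
    """Per-token signed shift: log pi_theta(y_t) - log pi_old(y_t)."""
    if len(new_logprobs) != len(old_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [new - old for new, old in zip(new_logprobs, old_logprobs)]


def future_kl(delta, mask, gamma):
    """Discounted sum of future shifts for every position t.

    Computes sum_{k >= t} mask[k] * gamma**(k - t) * delta[k] with a
    reverse scan. `mask` is ASSUMED to be a 0/1 per-token mask; `gamma`
    is the discount factor.
    """
    if len(delta) != len(mask):
        raise ValueError("delta and mask must have the same length")
    out = [0.0] * len(delta)
    running = 0.0
    for t in range(len(delta) - 1, -1, -1):
        # FutureKL_t = M_t * delta_t + gamma * FutureKL_{t+1}
        running = mask[t] * delta[t] + gamma * running
        out[t] = running
    return out
```

For example, with shifts `[0.5, -1.0, 0.5]`, a mask of all ones, and an illustrative `gamma = 0.5`, `future_kl` returns `[0.125, -0.75, 0.5]`: the first token's positive shift is partly offset by the suppressed token that follows it.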

FIPO maps this future signal into a bounded influence weight:

$f_t = \text{clip}(\exp(FutureKL_t), 1-\epsilon_{f,low}, 1+\epsilon_{f,high}), \quad \tilde{A}_t = \hat{A}_t \cdot f_t$
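
A minimal sketch of this mapping follows. The function names are hypothetical, and the epsilon values used in the example are illustrative rather than taken from the paper.

```python
import math


def influence_weights(future_kl_values, eps_low, eps_high):
    """f_t = clip(exp(FutureKL_t), 1 - eps_low, 1 + eps_high)."""
    lower, upper = 1.0 - eps_low, 1.0 + eps_high
    return [min(max(math.exp(v), lower), upper) for v in future_kl_values]


def future_aware_advantages(advantages, weights):
    """A~_t = A^_t * f_t, applied element-wise."""
    if len(advantages) != len(weights):
        raise ValueError("advantages and weights must have the same length")
    return [a * f for a, f in zip(advantages, weights)]
```

With `eps_low = 0.2` and `eps_high = 0.3`, a Future-KL value of 0 leaves the advantage unchanged (weight 1.0), while large positive or negative values saturate at 1.3 and 0.8 respectively, so the reweighting can never flip the sign of the advantage.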

The final token-level FIPO loss keeps the standard clipped PPO/DAPO form, but replaces the original advantage with the future-aware one:

$r_t = \frac{\pi_\theta(y_t \mid x, y_{1:t-1})}{\pi_{old}(y_t \mid x, y_{1:t-1})}$

$L_t^{FIPO} = \min(r_t \tilde{A}_t,\; \text{clip}(r_t, 1-\epsilon, 1+\epsilon)\tilde{A}_t)$
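
The per-token term can be sketched as below, using the single clipping range $\epsilon$ from the formula above. The function name is hypothetical. As in standard PPO, this quantity is a surrogate to be maximized, so a training loop would presumably minimize its negative mean over tokens.

```python
import math


def fipo_token_objective(new_logprob, old_logprob, adv_tilde, eps):
    """min(r_t * A~_t, clip(r_t, 1 - eps, 1 + eps) * A~_t) for one token.

    `adv_tilde` is the future-aware advantage A~_t; the ratio r_t is
    recovered from log-probabilities for numerical convenience.
    """
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * adv_tilde, clipped_ratio * adv_tilde)
```

The clipping behaves as in PPO: with a positive advantage, gains from pushing the ratio above `1 + eps` are capped, and with a negative advantage, gains from pushing it below `1 - eps` are capped.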

📊 Results & Figures

Training Dynamics

Under FIPO, the model continues to expand its reasoning budget instead of stalling at the intermediate length plateau (around 4k tokens on Qwen2.5-32B-Base) where standard baselines stagnate. This helps the model use the additional length as genuine reasoning depth.

Figure 2. Dynamics of response length and performance scaling during training. Compared to the DAPO baseline, FIPO significantly increases response length and maintains a strong positive correlation between longer chain-of-thought and higher accuracy.

Main Result

The FIPO objective yields longer responses and a stronger AIME 2024 peak than the DAPO baseline.

Figure 3. Main 32B result. FIPO outperforms reproduced pure-RL baselines on AIME 2024 while also producing substantially longer responses on average.

🎈 Citation

@article{ma2026fipo,
  title={FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization},
  author={Ma, Chiyu and Yang, Shuo and Huang, Kexin and Lu, Jinda and Meng, Haoming and Wang, Shangshang and Ding, Bolin and Vosoughi, Soroush and Wang, Guoyin and Zhou, Jingren},
  journal={arXiv preprint arXiv:2603.19835},
  year={2026}
}

🌻 Acknowledgement

This project builds on top of the VeRL training framework and follows the practical recipe structure introduced by DAPO.

Safetensors

Model size: 33B params
Tensor type: BF16

Model tree for QwenPilot/FIPO_32B

Base model: Qwen/Qwen2.5-32B
Finetuned: this model
