Instructions for using microsoft/phi-2 with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use microsoft/phi-2 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/phi-2")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use microsoft/phi-2 with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "microsoft/phi-2"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/phi-2",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker
docker model run hf.co/microsoft/phi-2
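The curl call above can also be made from Python. Below is a minimal sketch using only the standard library; the endpoint URL and port assume the default `vllm serve` settings shown above, and the `build_payload`/`complete` helper names are illustrative, not part of any library:

```python
import json
import urllib.request

# Default endpoint started by `vllm serve "microsoft/phi-2"` (assumption).
VLLM_URL = "http://localhost:8000/v1/completions"


def build_payload(prompt, max_tokens=512, temperature=0.5):
    """Build the same JSON body used in the curl example."""
    return {
        "model": "microsoft/phi-2",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt):
    """POST a completion request to the OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]


# Example (requires a running server):
# print(complete("Once upon a time,"))
```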
- SGLang
How to use microsoft/phi-2 with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "microsoft/phi-2" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/phi-2",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "microsoft/phi-2" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/phi-2",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
- Docker Model Runner
How to use microsoft/phi-2 with Docker Model Runner:
docker model run hf.co/microsoft/phi-2
New tokens generated with FP16 inference are only exclamation marks "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
The model was working fine until a couple of hours ago, then it started generating a bunch of "!!!!!!!!!!!!!!!!!!!!!" no matter the input text. To my knowledge, this issue is only present with FP16 inference, but even the sample code in your model card reproduces this problem, since torch_dtype="auto" defaults to torch.float16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
torch.set_default_device("cuda")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
inputs = tokenizer('''def print_prime(n):
"""
Print all primes between 1 and n
"""''', return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)
Output:
def print_prime(n):
"""
Print all primes between 1 and n
"""!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Here's another example:
inputs = tokenizer(
'''Write a detailed analogy between mathematics and a lighthouse.\n''',
return_tensors="pt",
return_attention_mask=False
)
outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)
Output:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
All of the newly generated tokens are just a bunch of "!!!!!!!!!!!!!!!!!!!!!!!..."
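A plausible explanation for why the garbage output is specifically "!" (an assumption on my part, not something confirmed in this thread): phi-2 uses a GPT-2-style BPE vocabulary in which token id 0 decodes to "!", and if an FP16 overflow turns the logits into NaN, greedy decoding degenerates to always returning index 0. A minimal numpy sketch of that argmax behaviour:

```python
import numpy as np

# Healthy logits: argmax picks the genuinely largest entry.
logits = np.array([0.1, 2.5, -1.0, 0.3], dtype=np.float32)
healthy_pick = int(np.argmax(logits))  # index 1

# After an FP16 overflow, the logits can end up all-NaN.
# numpy's argmax treats NaN as maximal, so it returns the first index: 0.
nan_logits = np.full(4, np.nan, dtype=np.float32)
nan_pick = int(np.argmax(nan_logits))  # index 0 -> token id 0 -> "!"

print(healthy_pick, nan_pick)
```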
Same here. When used as a pipeline for text completion it still answers, by the way.
Same here. Does someone have sample code where it doesn't print out only "!"?
Thanks!
@eschmitt88, if you already have accelerate installed, you only need to change torch_dtype="auto" to device_map="auto" when loading the model, like so:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
torch.set_default_device("cuda")
# changed torch_dtype="auto" to device_map="auto" in following line
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
inputs = tokenizer('''def print_prime(n):
"""
Print all primes between 1 and n
"""''', return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)
@rasyosef thank you!
Same problem here. Can someone explain the origin of this issue?
Could you please re-try with the latest commit?
Unfortunately, for Phi-2 to work across all use cases, we need to upcast the queries and keys to FP32 and disable autocast in the attention's forward pass.
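To illustrate why the upcast matters, here is a toy numpy sketch (not the actual modeling_phi.py code, and the vector length is arbitrary): FP16 overflows at 65504, so a long query-key dot product can saturate to infinity even when every elementwise product is tiny, while the same accumulation in FP32 stays finite:

```python
import numpy as np

fp16_max = float(np.finfo(np.float16).max)  # 65504.0 -- FP16 overflow threshold

# Toy query/key vectors; each elementwise product (36.0) fits easily in FP16 ...
q = np.full(2560, 6.0, dtype=np.float16)
k = np.full(2560, 6.0, dtype=np.float16)

# ... but accumulating 2560 * 36.0 = 92160 with an FP16 accumulator overflows to inf.
score_fp16 = (q * k).sum(dtype=np.float16)

# Upcasting queries and keys to FP32 before the dot product keeps the score finite.
score_fp32 = (q.astype(np.float32) * k.astype(np.float32)).sum(dtype=np.float32)

print(fp16_max, score_fp16, score_fp32)
```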
@gugarosa, do you think it is necessary to update the readme as well? Mainly to prevent people who are not aware of the new behaviour from running into issues, and to adjust the provided sample code (if needed)?
I don't think we need to update the readme.
The goal is to ensure that the model works with any use case (as it was working prior to the integration with transformers' source code).
I thought we would have to change the torch_dtype="auto" argument to device_map="auto" in the model definition line, as per the @rasyosef post above. In fact, yesterday, after I tried that, it solved the "!!!!!!" response issue for me. In that case, the readme sample code would in fact be outdated.
However, I tested again today, with the new modeling_phi.py, and it is no longer the case.
The readme sample code, with torch_dtype="auto", is working fine again now.
@gugarosa FP16 inference is functioning correctly now, including the sample code from the model card. Closing this issue.
> I thought we would have to change the torch_dtype="auto" argument to device_map="auto" in the model definition line, as per the @rasyosef post above. In fact, yesterday, after I tried that, it solved the "!!!!!!" response issue for me. In that case, the readme sample code would in fact be outdated. However, I tested again today, with the new modeling_phi.py, and it is no longer the case. The readme sample code, with torch_dtype="auto", is working fine again now.
Removing torch_dtype="auto" loads the model's weights in FP32, which does not produce an overflow.
