Instructions for using microsoft/phi-2 with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use microsoft/phi-2 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/phi-2")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use microsoft/phi-2 with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "microsoft/phi-2"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/phi-2",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker
docker model run hf.co/microsoft/phi-2
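The curl call above can also be made from Python. Below is a minimal sketch using only the standard library; the endpoint URL and port assume the default `vllm serve` settings shown above, and the `build_payload`/`complete` helper names are illustrative, not part of any library:

```python
import json
import urllib.request

# Default endpoint started by `vllm serve "microsoft/phi-2"` (assumption).
VLLM_URL = "http://localhost:8000/v1/completions"


def build_payload(prompt, max_tokens=512, temperature=0.5):
    """Build the same JSON body used in the curl example."""
    return {
        "model": "microsoft/phi-2",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt):
    """POST a completion request to the OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]


# Example (requires a running server):
# print(complete("Once upon a time,"))
```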
- SGLang
How to use microsoft/phi-2 with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "microsoft/phi-2" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/phi-2",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "microsoft/phi-2" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "microsoft/phi-2",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
- Docker Model Runner
How to use microsoft/phi-2 with Docker Model Runner:
docker model run hf.co/microsoft/phi-2
New tokens generated with FP16 inference are only exclamation marks "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
The model was working fine until a couple of hours ago, then it started generating a bunch of "!!!!!!!!!!!!!!!!!!!!!" no matter the input text. To my knowledge, this issue is only present with FP16 inference, but even the sample code in your model card reproduces this problem, since torch_dtype="auto" defaults to torch.float16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
torch.set_default_device("cuda")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
inputs = tokenizer('''def print_prime(n):
"""
Print all primes between 1 and n
"""''', return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)
Output:
def print_prime(n):
"""
Print all primes between 1 and n
"""!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Here's another example:
inputs = tokenizer(
'''Write a detailed analogy between mathematics and a lighthouse.\n''',
return_tensors="pt",
return_attention_mask=False
)
outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)
Output:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
All of the newly generated tokens are just a bunch of "!!!!!!!!!!!!!!!!!!!!!!!..."
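A plausible explanation for why the garbage output is specifically "!" (an assumption on my part, not something confirmed in this thread): phi-2 uses a GPT-2-style BPE vocabulary in which token id 0 decodes to "!", and if an FP16 overflow turns the logits into NaN, greedy decoding degenerates to always returning index 0. A minimal numpy sketch of that argmax behaviour:

```python
import numpy as np

# Healthy logits: argmax picks the genuinely largest entry.
logits = np.array([0.1, 2.5, -1.0, 0.3], dtype=np.float32)
healthy_pick = int(np.argmax(logits))  # index 1

# After an FP16 overflow, the logits can end up all-NaN.
# numpy's argmax treats NaN as maximal, so it returns the first index: 0.
nan_logits = np.full(4, np.nan, dtype=np.float32)
nan_pick = int(np.argmax(nan_logits))  # index 0 -> token id 0 -> "!"

print(healthy_pick, nan_pick)
```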
Same here. When used as a pipeline for text completion it still answers, by the way.
Same here. Does someone have sample code where it doesn't print out only "!"?
Thanks!
@eschmitt88, if you already have accelerate installed, you only need to change torch_dtype="auto" to device_map="auto" when loading the model, like so:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
torch.set_default_device("cuda")
# changed torch_dtype="auto" to device_map="auto" in following line
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
inputs = tokenizer('''def print_prime(n):
"""
Print all primes between 1 and n
"""''', return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_length=200)
text = tokenizer.batch_decode(outputs)[0]
print(text)
@rasyosef thank you!
Same problem here. Can someone explain the origin of this issue?
Could you please re-try with the latest commit?
Unfortunately, for Phi-2 to work across all use cases, we need to upcast the queries and keys to FP32 and disable autocast in the attention's forward pass.
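To illustrate why the upcast matters, here is a toy numpy sketch (not the actual modeling_phi.py code, and the vector length is arbitrary): FP16 overflows at 65504, so a long query-key dot product can saturate to infinity even when every elementwise product is tiny, while the same accumulation in FP32 stays finite:

```python
import numpy as np

fp16_max = float(np.finfo(np.float16).max)  # 65504.0 -- FP16 overflow threshold

# Toy query/key vectors; each elementwise product (36.0) fits easily in FP16 ...
q = np.full(2560, 6.0, dtype=np.float16)
k = np.full(2560, 6.0, dtype=np.float16)

# ... but accumulating 2560 * 36.0 = 92160 with an FP16 accumulator overflows to inf.
score_fp16 = (q * k).sum(dtype=np.float16)

# Upcasting queries and keys to FP32 before the dot product keeps the score finite.
score_fp32 = (q.astype(np.float32) * k.astype(np.float32)).sum(dtype=np.float32)

print(fp16_max, score_fp16, score_fp32)
```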
@gugarosa, do you think it is necessary to update the readme as well? Mainly to prevent people who are not aware of the new behaviour from running into issues, and to adjust the provided sample code (if needed)?
I don't think we need to update the readme.
The goal is to ensure that the model works with any use case (as it was working prior to the integration with transformers' source code).
I thought we would have to change the torch_dtype="auto" argument to device_map="auto" in the model definition line, as per the @rasyosef post above. In fact, yesterday, after I tried that, it solved the "!!!!!!" response issue for me. In that case, the readme sample code would in fact be outdated.
However, I tested again today, with the new modeling_phi.py, and it is no longer the case.
The readme sample code, with torch_dtype="auto", is working fine again now.
@gugarosa FP16 inference is functioning correctly now, including the sample code from the model card. Closing this issue.
> I thought we would have to change the torch_dtype="auto" argument to device_map="auto" in the model definition line, as per the @rasyosef post above. In fact, yesterday, after I tried that, it solved the "!!!!!!" response issue for me. In that case, the readme sample code would in fact be outdated. However, I tested again today, with the new modeling_phi.py, and it is no longer the case. The readme sample code, with torch_dtype="auto", is working fine again now.
Removing torch_dtype="auto" loads the model's weights in FP32, which does not produce an overflow.
