Instructions to use bigscience/bloomz with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use bigscience/bloomz with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="bigscience/bloomz")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use bigscience/bloomz with vLLM:
Install from pip and serve the model:
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "bigscience/bloomz"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigscience/bloomz",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker
```shell
docker model run hf.co/bigscience/bloomz
```
- SGLang
How to use bigscience/bloomz with SGLang:
Install from pip and serve the model:
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "bigscience/bloomz" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigscience/bloomz",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "bigscience/bloomz" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigscience/bloomz",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

- Docker Model Runner
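Both vLLM and SGLang expose the same OpenAI-compatible `/v1/completions` endpoint, so the curl call above can also be made from Python. A minimal sketch, assuming one of the servers is already running locally (SGLang listens on port 30000 by default, vLLM on 8000); the `build_completion_request` helper is our own illustration, not part of either library:

```python
import json

def build_completion_request(model, prompt, max_tokens=512, temperature=0.5):
    """Build the JSON body for an OpenAI-compatible /v1/completions call."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    })

body = build_completion_request("bigscience/bloomz", "Once upon a time,")

# Uncomment to POST against a running server:
# import requests
# r = requests.post(
#     "http://localhost:30000/v1/completions",
#     headers={"Content-Type": "application/json"},
#     data=body,
# )
# print(r.json()["choices"][0]["text"])
```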
How to use bigscience/bloomz with Docker Model Runner:
```shell
docker model run hf.co/bigscience/bloomz
```
No longer available, why?
Whyyyyyyyyyyyyyyyyyyyyyyyyyy?
It costs ~30K USD / month to keep up the inference widget, so we decided to turn it off after the first month. Really sorry :(
You can of course still download the model and run it on your own hardware if you have the resources available.
oh no
I like it more than bloom
Same
NOOOOOOOOOO 😭
:(
On the bright side mt0-xxl & mt0-xxl-mt can still be used via the inference widget. 🤗
Definitely share if you find them more / less useful & if so why 🧐
In my experiments I found them better at following instructions requiring short answers & worse at instructions requiring long answers.
Bloomz knows when to stop; Bloom doesn't.
I also found that Bloomz almost stopped too soon. When summarizing text, it ended after a single sentence. And since it only generated one sentence, it was never given the opportunity to follow the prompt. I honestly found Bloom more helpful. It could respond to longer prompts well, especially few shot prompts. But Bloomz seems to only work with short Q and A prompts. I do have hope that if it keeps getting better, Bloomz will become more diverse in capability.
I think it's because of the xP3 dataset. Most of the answers in that dataset are short.
Now you can run inference and fine-tune BLOOMZ (the 176B English version) using the Petals swarm.
You can use BLOOMZ via this Colab notebook to get an inference speed of 1-2 sec/token for a single sequence. Running the notebook on a local machine is also fine; you'd only need 10+ GB of GPU memory or 12+ GB of RAM (though it will be slower without a GPU).
Note: Don't forget to replace bigscience/bloom-petals with bigscience/bloomz-petals in the model name.
As an example, there is a chatbot app running BLOOMZ this way.
Bloomz is back and even stronger than before. You can now do token streaming:
Install the client with `pip install sseclient-py` (do NOT install `sseclient`; be sure to install `sseclient-py`):
```python
import sseclient
import requests

prompt = "Why is the sky blue? Explain in a detailed paragraph."
parameters = {"max_new_tokens": 200, "top_p": 0.9, "seed": 0}
options = {"use_cache": False}
payload = {"inputs": prompt, "stream": True, "parameters": parameters, "options": options}

r = requests.post(
    "https://api-inference.huggingface.co/models/bigscience/bloomz",
    stream=True,
    json=payload,
)
sse_client = sseclient.SSEClient(r)
for i, event in enumerate(sse_client.events()):
    print(i, event.data)
```
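Each streamed event carries a JSON string in `event.data`. As a sketch of how you might reassemble the generated text from those events, here is a standalone example with hard-coded sample payloads; note the `token.text` field layout is an assumption for illustration, not taken from the API docs:

```python
import json

# Hypothetical sample payloads, mimicking the "data: {...}" lines the
# server streams. The real field names may differ.
sample_events = [
    '{"token": {"text": "Hello"}}',
    '{"token": {"text": " world"}}',
]

text = ""
for data in sample_events:
    obj = json.loads(data)       # each event is a JSON object
    text += obj["token"]["text"]  # append the streamed token text

print(text)  # -> Hello world
```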