iisys-hof/olaph-data
How to use iisys-hof/olaph with Transformers:
# Use a pipeline as a high-level helper
# Warning: the "translation" pipeline type is no longer supported in transformers v5.
# Load the model directly (see below) or downgrade to v4.x with:
#   pip install "transformers<5.0.0"
from transformers import pipeline
pipe = pipeline("translation", model="iisys-hof/olaph")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("iisys-hof/olaph")
model = AutoModelForCausalLM.from_pretrained("iisys-hof/olaph")

This model is outdated; its new version can be found here.
OLaPh is a large language model for phonemization, finetuned from GemmaX2-28-2B-v0.1. Its tokenizer was extended with 1,024 phoneme tokens, derived from a BPE tokenizer trained on phoneme sequences generated by the OLaPh framework.
The model was then finetuned for grapheme-to-phoneme conversion on a multilingual dataset (English, German, French, Spanish), created by phonemizing text from HuggingFaceFW/fineweb and HuggingFaceFW/fineweb-2 using the OLaPh framework.
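The tokenizer extension described above was built from a BPE tokenizer trained on phoneme sequences. As a rough, self-contained sketch of how BPE learns merges from such sequences (hypothetical helper name, toy corpus, and not the actual OLaPh training code, which uses a full tokenizer library):

```python
from collections import Counter

def learn_bpe_merges(sequences, num_merges):
    """Learn BPE merges over pre-split phoneme sequences.

    sequences: list of lists of phoneme symbols, e.g. [["b", "ʌ", "t"], ...]
    Returns the list of learned merges (pairs of adjacent symbols).
    """
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all sequences
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol
        merged = best[0] + best[1]
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges

# Toy IPA-ish phoneme corpus: "ð ə" is the most frequent adjacent pair
corpus = [
    ["ð", "ə", " ", "k", "æ", "t"],
    ["ð", "ə", " ", "d", "ɒ", "g"],
    ["ð", "ə"],
]
print(learn_bpe_merges(corpus, 2))
```

Each learned merge becomes a new vocabulary entry; in the real setup, the resulting phoneme tokens were added to the base model's tokenizer before finetuning.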
from transformers import AutoModelForCausalLM, AutoTokenizer

lang = "English"  # or "German", "French", "Spanish"
sentence = "But we are not sorry, for the rain is delightful."
model_id = "iisys-hof/olaph"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda")

# Stop generation at the end-of-sequence token or at the first period
stop_tokens = [tokenizer.eos_token_id, tokenizer.encode(".", add_special_tokens=False)[0]]

# Prompt format expected by the model
prompt = f"Translate this from {lang} to Phones:\n{lang}: "
inputs = tokenizer(f"{prompt}{sentence}\nPhones:", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256, eos_token_id=stop_tokens)

# The generated sequence echoes the prompt; keep only the line after "Phones:"
phonemized = tokenizer.decode(outputs[0], skip_special_tokens=True)
phonemized = phonemized.split("\n")[-1].replace("Phones:", "").strip()
print(phonemized)
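The last two lines of the example depend on the prompt being echoed in the generated sequence. That string handling can be checked in isolation; the decoded text below is a hypothetical illustration, not real model output:

```python
def extract_phones(decoded: str) -> str:
    """Keep only the model's answer line from the decoded sequence.

    The generated text repeats the prompt, so the phoneme output is
    whatever follows the final "Phones:" line.
    """
    return decoded.split("\n")[-1].replace("Phones:", "").strip()

# Hypothetical decoded output (the phoneme string is illustrative only)
decoded = (
    "Translate this from English to Phones:\n"
    "English: But we are not sorry, for the rain is delightful.\n"
    "Phones: bʌt wiː ɑːɹ nɑːt sɑːɹi."
)
print(extract_phones(decoded))
```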
@misc{wirth2026olaphoptimallanguagephonemizer,
  title={OLaPh: Optimal Language Phonemizer},
  author={Johannes Wirth},
  year={2026},
  eprint={2509.20086},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.20086},
}
Base model
google/gemma-2-2b