Bloom's tokenizer vocab looks garbled

#216

by ShaneSue - opened Mar 23, 2023

Discussion

ShaneSue

Mar 23, 2023

Does anyone know how to fix it?

christopher

BigScience Workshop org Mar 23, 2023

The tokenizer operates on UTF-8 bytes rather than on characters, so it's normal for the tokens in the vocab to contain weird characters, especially for non-ASCII text. Nothing is broken. If your goal is to manually inspect individual tokens, you can convert them back to readable strings using the tokenizer's convert_tokens_to_string method.
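To illustrate why the vocab entries look odd, here is a minimal sketch of the byte-to-character table that GPT-2-style byte-level BPE tokenizers use. It assumes BLOOM's vocab follows that same convention; the helper names `to_vocab_form` and `to_readable` are hypothetical and only for illustration, not part of any library.

```python
def bytes_to_unicode():
    """Map every byte value (0-255) to a printable Unicode character.

    Printable bytes map to themselves; the rest are shifted to code
    points starting at 256, so no token contains raw control bytes.
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, (chr(c) for c in code_points)))


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_vocab_form(text):
    """Hypothetical helper: show how text looks as byte-level vocab characters."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def to_readable(tokens):
    """Hypothetical helper: join tokens first, then decode the bytes.

    Joining before decoding matters because a single token may hold
    only part of a multi-byte UTF-8 character.
    """
    raw = bytes(CHAR_TO_BYTE[c] for c in "".join(tokens))
    return raw.decode("utf-8", errors="replace")


print(to_vocab_form(" hello"))       # leading space becomes 'Ġ'
print(to_vocab_form("你好"))          # looks garbled, but is lossless
print(to_readable(["ä½ł", "å¥½"]))   # round-trips back to the original text
```

With the real tokenizer loaded through transformers (for example `AutoTokenizer.from_pretrained("bigscience/bloom")`), `tokenizer.convert_tokens_to_string(tokens)` should give the same kind of result without any hand-written table.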

ShaneSue

Mar 23, 2023

I got it, thanks a lot

christopher changed discussion status to closed Mar 23, 2023
