Notes

End of word character </w>

The end-of-word character (</w>) is appended to a token to mark that it falls at the end of a word. This distinguishes it from the same token at the start or in the middle of a word. Used in the CLIP tokenizer.

Start of word character ▁

The start-of-word character (▁) is prepended to a token to mark that it falls at the start of a word. This distinguishes it from the same token in the middle or at the end of a word. Used in the T5 tokenizer.
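The two marker conventions above can be illustrated with a toy example. This is a simplified sketch: the word split ("token" + "izing") is hypothetical, and real splits come from each model's learned vocabulary, not from these functions.

```python
def mark_clip_style(word_pieces):
    """CLIP-style: append </w> to the piece that ends a word."""
    return word_pieces[:-1] + [word_pieces[-1] + "</w>"]

def mark_t5_style(word_pieces):
    """T5/SentencePiece-style: prepend ▁ to the piece that starts a word."""
    return ["▁" + word_pieces[0]] + word_pieces[1:]

# Hypothetical split of "tokenizing" into "token" + "izing"
print(mark_clip_style(["token", "izing"]))  # ['token', 'izing</w>']
print(mark_t5_style(["token", "izing"]))    # ['▁token', 'izing']
```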

Token highlighting

Tokens have alternating background colors to show how your text is broken up into tokens.

How do we get the tokens?

We use the Hugging Face Tokenizers library to extract the tokens from your text, downloading the tokenizer.json from the relevant text encoder. These are the same tokens used when prompting diffusion models.
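To give a feel for what a vocabulary-driven tokenizer does with the vocab stored in tokenizer.json, here is a deliberately simplified sketch: a greedy longest-prefix match over a toy vocabulary. The real CLIP tokenizer applies learned BPE merge rules instead, and the vocabulary here is invented for illustration.

```python
# Toy vocabulary standing in for the vocab inside a real tokenizer.json.
TOY_VOCAB = {"a", "large", "dog", "lar", "ge", "do", "g"}

def greedy_tokenize_word(word, vocab):
    """Greedy longest-prefix match (real BPE uses learned merge rules)."""
    pieces = []
    while word:
        for end in range(len(word), 0, -1):
            if word[:end] in vocab:
                pieces.append(word[:end])
                word = word[end:]
                break
        else:
            pieces.append(word[0])  # unknown character fallback
            word = word[1:]
    pieces[-1] += "</w>"  # CLIP-style end-of-word marker
    return pieces

print([greedy_tokenize_word(w, TOY_VOCAB) for w in "a large dog".split()])
# [['a</w>'], ['large</w>'], ['dog</w>']]
```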

What is the limit for tokens in Stable Diffusion?

Tokens are limited to 77 in CLIP text encoder models. Two of those are the special start-of-text and end-of-text tokens, leaving 75 tokens for the prompt itself.

To get around this limitation we can split our tokens into smaller chunks. If we have 120 tokens, we can split at 75 tokens and put the remaining 45 tokens in another chunk.

[75 tokens, 45 tokens]
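The split above is simple chunking. A minimal sketch, assuming the prompt has already been converted to a flat list of token ids (the function name is hypothetical):

```python
def chunk_tokens(tokens, limit=75):
    """Split a token list into chunks of at most `limit` tokens each."""
    return [tokens[i:i + limit] for i in range(0, len(tokens), limit)]

chunks = chunk_tokens(list(range(120)))
print([len(c) for c in chunks])  # [75, 45]
```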

Because an automatic split at 75 tokens can land in the middle of a phrase, image generation tools may provide a way to set a BREAK point. This lets you control where the split happens so each chunk is a complete prompt.

Example (a1111 webui):

a large dog BREAK a small cat
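BREAK handling can be sketched as splitting the prompt text before tokenizing, so each piece is tokenized into its own 75-token chunk. This is an illustrative sketch, not a1111's actual implementation:

```python
def split_on_break(prompt):
    """Split a prompt on the BREAK keyword; each piece becomes its own chunk."""
    return [piece.strip() for piece in prompt.split("BREAK")]

print(split_on_break("a large dog BREAK a small cat"))
# ['a large dog', 'a small cat']
```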

What is the limit for tokens using Pixart or Flux models with T5xxl?

Tokens are limited to a maximum of 512 for the T5xxl model. The CLIP component keeps the same 75-token limit as Stable Diffusion.

How is it private? Can anyone see what I'm looking up?

The tokenizer runs entirely in your browser. Your text never leaves your device, and no server connections are made while tokenizing.

How much bandwidth does this use?

Tokenizer files are around 2-3 MB each and are cached by your browser for subsequent runs.

What tokenizers are each model using?

Tokenizers are usually tied to the text encoder model. Below is each model with its associated text encoders, and therefore its tokenizers.

Model — Model source
  • Text Encoder (Text Encoder source)
SD1 runwayml/stable-diffusion-v1-5
  • CLIP ViT/L (openai/clip-vit-large-patch14)
SD2 stabilityai/stable-diffusion-2-1
  • OpenCLIP ViT/H (laion/CLIP-ViT-H-14-laion2B-s32B-b79K)
SDXL stabilityai/stable-diffusion-xl-base-1.0
  • CLIP ViT/L (openai/clip-vit-large-patch14)
  • OpenCLIP ViT/bigG (laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)
SD3 stabilityai/stable-diffusion-3.5-large
  • CLIP ViT/L (openai/clip-vit-large-patch14)
  • OpenCLIP ViT/bigG (laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)
  • T5xxl (google/t5-v1_1-xxl)
Flux black-forest-labs/FLUX.1-schnell
  • CLIP ViT/L (openai/clip-vit-large-patch14)
  • T5xxl (google/t5-v1_1-xxl)
Pixart PixArt-alpha/PixArt-XL-2-1024-MS
  • T5xxl (google/t5-v1_1-xxl)

Some models use a tokenizer from another repository; for example, T5 models refer to google-t5/t5-small for the tokenizer.

By rockerBOO

  • Dave Lage
  • Source: github.com/rockerBOO/sd-tokenizer
  • LoRA Inspector - Inspect your LoRA models with metadata and tensor analysis.
  • Tokenizers source: https://github.com/huggingface/tokenizers