"Freshly baked croissants", Meilisearch splits it into three tokens: freshly, baked, and croissants. These tokens are what Meilisearch stores and matches against when a user performs a search query.
Breaking sentences into smaller chunks requires understanding where one word ends and another begins, making tokenization a highly complex and language-dependent task. Meilisearch’s solution to this problem is a modular tokenizer that follows different processes, called pipelines, based on the language it detects.
This allows Meilisearch to function in several different languages with zero setup.
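To make the idea concrete, here is a minimal sketch of what a tokenizer does for Latin-script text. This is purely illustrative; Meilisearch's real pipelines also normalize, handle scripts without whitespace, and much more.

```python
import re

def tokenize(text: str) -> list[str]:
    """Naive Latin-script tokenizer: lowercase the text, then split on
    any run of non-word characters (a rough stand-in for a real
    tokenization pipeline)."""
    return [t for t in re.split(r"[^\w]+", text.lower()) if t]

print(tokenize("Freshly baked croissants"))
# ['freshly', 'baked', 'croissants']
```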
Deep dive: The Meilisearch tokenizer
Meilisearch uses charabia, an open-source Rust library purpose-built for multilingual tokenization. When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called the tokenizer. The tokenizer is responsible for splitting each field by writing system (for example, Latin alphabet, Chinese hanzi). It then applies the corresponding pipeline to each part of each document field. We can break down the tokenization process like so:
- Crawl the document(s), splitting each field by script
- Go back over the documents part-by-part, running the corresponding tokenization pipeline, if it exists
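The two steps above can be sketched as follows. The script detection and the pipelines here are deliberately crude stand-ins (charabia uses proper Unicode script tables and dedicated segmenters); the point is the split-by-script-then-dispatch structure.

```python
import re
import unicodedata

def char_script(ch: str) -> str:
    """Very rough script detection via Unicode character names;
    illustrative only."""
    return "Cjk" if unicodedata.name(ch, "").startswith("CJK") else "Latin"

def split_by_script(text: str) -> list[tuple[str, str]]:
    """Step 1: split a field into runs that share one writing system.
    Whitespace is attached to the current run."""
    runs: list[tuple[str, str]] = []
    for ch in text:
        if runs and (ch.isspace() or char_script(ch) == runs[-1][0]):
            runs[-1] = (runs[-1][0], runs[-1][1] + ch)
        else:
            runs.append((char_script(ch), ch))
    return runs

# Step 2: one pipeline per script; a real Cjk pipeline would do
# dictionary-based segmentation, so here it falls back to a plain split.
PIPELINES = {
    "Latin": lambda s: [t for t in re.split(r"\W+", s.lower()) if t],
}

def tokenize_field(text: str) -> list[str]:
    tokens = []
    for script, run in split_by_script(text):
        pipeline = PIPELINES.get(script, str.split)  # fallback pipeline
        tokens.extend(pipeline(run))
    return tokens

print(tokenize_field("Hello 世界"))
# ['hello', '世界']
```

Note how the mixed-script field is processed per run: the Latin part goes through one pipeline, the CJK part through another (here, the fallback, which leaves it unsegmented).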
Customizing tokenization behavior
Meilisearch provides three settings that let you control how text is split into tokens.

Separator tokens

By default, Meilisearch uses whitespace and punctuation to determine word boundaries. You can add custom characters or strings as separators using the separator tokens setting. For example, if your dataset uses | as a delimiter within a field, you can add it as a separator token so Meilisearch treats it as a word boundary:
"red|green|blue" is tokenized into red, green, and blue.
Non-separator tokens
Conversely, you can tell Meilisearch to treat certain characters as part of a word rather than as separators using the non-separator tokens setting. This is useful when your data includes special characters that should be searchable. For example, if your dataset contains programming terms like C++ or C#, you can prevent + and # from acting as separators:
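A sketch of the effect, mirroring the separator example above (the default separator set here is hypothetical): listing a character as a non-separator token removes it from the split set, so it stays inside tokens.

```python
import re

def tokenize(text: str, non_separators=()) -> list[str]:
    """'#' and '+' act as separators by default in this sketch;
    listing them as non-separator tokens keeps them inside words."""
    seps = set("#+.,;:!? ")
    for ch in non_separators:
        seps.discard(ch)
    pattern = "[" + "".join(re.escape(c) for c in seps) + "]+"
    return [t for t in re.split(pattern, text) if t]

print(tokenize("C# and C++"))
# ['C', 'and', 'C']  (the symbols are stripped away)
print(tokenize("C# and C++", non_separators="#+"))
# ['C#', 'and', 'C++']
```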
Dictionary
The dictionary setting lets you define custom word boundaries for strings that Meilisearch would not otherwise split correctly. This is particularly useful for compound words or domain-specific terms. For example, if users need to search for “ice cream” and your data contains the compound form “icecream”, you can add it to the dictionary so Meilisearch knows how to handle it.

How tokenization affects search
Tokenization directly determines which queries match which documents. Here are common scenarios to be aware of:
- Compound words: A search for "ice cream" will not match a document containing "icecream" because they produce different tokens. Use the dictionary setting or synonyms to bridge the gap.
- Special characters: By default, characters like @, #, and + act as separators. If your data includes terms like C# or email addresses, configure non-separator tokens so these characters are preserved during tokenization.
- CJK languages: Chinese, Japanese, and Korean do not use whitespace between words. Meilisearch’s dedicated pipelines handle segmentation for these languages automatically, but for best results consider using localized attributes.
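The compound-word scenario can be sketched with a toy dictionary-based segmenter: entries supply the word boundaries the tokenizer cannot infer on its own. This greedy longest-match approach is illustrative only; charabia's actual segmenters are far more sophisticated.

```python
def segment(text: str, dictionary: set[str]) -> list[str]:
    """Greedy longest-match segmentation: cut the text into the longest
    dictionary words possible, left to right."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i:])  # no dictionary match: keep remainder whole
            break
    return tokens

print(segment("icecream", {"ice", "cream"}))
# ['ice', 'cream']
print(segment("sorbet", {"ice", "cream"}))
# ['sorbet']
```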
Next steps
Multilingual datasets
Best practices for indexing and searching content in multiple languages
Separator tokens reference
API reference for configuring custom separator tokens
Non-separator tokens reference
API reference for configuring non-separator tokens
Dictionary reference
API reference for configuring the dictionary setting