Subword tokenization is a technique used in natural language processing (NLP) that breaks words down into smaller units, or "subwords". This approach is particularly useful for keeping vocabulary sizes manageable and for handling out-of-vocabulary (OOV) words.
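To build intuition for how subwords are produced, here is a toy sketch of WordPiece-style greedy longest-match-first tokenization (the scheme BERT-family tokenizers use). The vocabulary below is made up for illustration; real tokenizers learn theirs from a corpus.

```python
# Toy WordPiece-style tokenizer: repeatedly take the longest
# vocabulary piece that matches at the current position.
# Continuation pieces (not at the start of a word) carry a "##" prefix.
def wordpiece_tokenize(word, vocab):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark as a continuation piece
            if piece in vocab:
                match = piece
                break
            end -= 1  # shrink the candidate and try again
        if match is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

vocab = {"token", "##izing", "text", "nl", "##p"}
print(wordpiece_tokenize("tokenizing", vocab))  # ['token', '##izing']
print(wordpiece_tokenize("nlp", vocab))         # ['nl', '##p']
```

Because rare words decompose into known pieces like this, the model never needs a vocabulary entry for every surface form.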
The AutoTokenizer class in Transformers offers a user-friendly way to rapidly load the tokenizer linked to a pretrained model.
from transformers import AutoTokenizer
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
text = "Tokenizing text is a core task of NLP."
encoded_text = tokenizer(text)
The encoded_text would look like this:
{'input_ids': [101, 19204, 6026, 3793, 2003, 1037, 4563, 4708, 1997, 17953, 2361, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
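Here every input ID corresponds to one token, and the attention mask is all ones because nothing is padded. The mask only becomes interesting when sequences of different lengths are batched together; the sketch below (with made-up toy IDs, not real DistilBERT vocabulary values) shows how padding produces zeros in the mask:

```python
# Minimal sketch of batching: pad shorter sequences to the batch
# maximum and record which positions are real tokens (1) vs padding (0).
def pad_batch(batch, pad_id=0):
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 42, 102], [101, 7, 8, 9, 102]])
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The model uses the mask to ignore padded positions during attention.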
We can extract tokens from our encoded_text:
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
tokens contains the text split into subword tokens, plus two special tokens, [CLS] and [SEP], which mark the start and end of the sequence:
['[CLS]', 'token', '##izing', 'text', 'is', 'a', 'core', 'task', 'of', 'nl', '##p', '.', '[SEP]']
The tokenizer has a convert_tokens_to_string method, which converts tokens back into text:
retrieved_text = tokenizer.convert_tokens_to_string(tokens)
retrieved_text will look like this:
[CLS] tokenizing text is a core task of nlp. [SEP]
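Under the hood, for a WordPiece tokenizer this step mostly amounts to gluing "##" continuation pieces back onto the preceding piece and joining with spaces. A rough, simplified sketch (the real method also cleans up spacing around punctuation, which is why "nlp." above has no space before the period):

```python
# Simplified approximation of convert_tokens_to_string for WordPiece:
# merge "##" continuation pieces into the previous token, then join.
def tokens_to_string(tokens):
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # glue continuation onto the previous piece
        else:
            words.append(tok)
    return " ".join(words)

tokens = ["[CLS]", "token", "##izing", "text", "is", "a", "core",
          "task", "of", "nl", "##p", ".", "[SEP]"]
print(tokens_to_string(tokens))
# [CLS] tokenizing text is a core task of nlp . [SEP]
```

For a one-step round trip from IDs to text, Transformers also provides tokenizer.decode(input_ids), which combines both conversions.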
We have now covered the core tokenizer methods in this Hugging Face tutorial.