The most common form of tokenization, in which text is split into individual words. It's useful for tasks where words are the primary unit of meaning, such as machine translation or sentiment analysis.
text = "Tokenizing text is a core task of NLP."
tokenized_text = text.split()

This gives tokenized_text as a list of words:

['Tokenizing', 'text', 'is', 'a', 'core', 'task', 'of', 'NLP.']
A large vocabulary presents a challenge because it demands that neural networks incorporate a vast number of parameters. For example, consider a scenario where there are 1 million unique words, and we aim to reduce the dimensionality of input vectors from 1 million to 1 thousand in the neural network’s initial layer. This reduction is a common procedure in many NLP frameworks. Consequently, the weight matrix for this layer alone would comprise 1 billion weights (1 million times 1 thousand).
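The arithmetic behind that claim can be sketched in a few lines (the variable names here are illustrative, not from any particular framework):

```python
# Parameter count for a first layer that projects a one-hot vector over a
# 1-million-word vocabulary down to 1,000 dimensions.
vocab_size = 1_000_000   # number of unique words (assumed for illustration)
hidden_dim = 1_000       # target dimensionality of the first layer

num_weights = vocab_size * hidden_dim
print(f"{num_weights:,} weights in this layer alone")
```

This prints 1,000,000,000 weights, i.e. one billion parameters before the rest of the network even begins, which is why subword vocabularies of a few tens of thousands of tokens are preferred in practice.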
In some languages, it’s not always clear where one word ends and another begins. Compound words, hyphenated words, and contractions (e.g., “don’t” as “do not”) add complexity.
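A quick check shows how naive whitespace splitting handles these cases, leaving contractions and hyphenated compounds as single tokens with punctuation attached (the example sentence is ours):

```python
# Whitespace splitting does not separate contractions, hyphenated
# compounds, or trailing punctuation from the words they touch.
text = "Don't split state-of-the-art words."
print(text.split())
# ["Don't", 'split', 'state-of-the-art', 'words.']
```

Deciding whether "don't" should be one token or two ("do", "n't") is a policy choice that a bare `str.split()` cannot express, which is one motivation for dedicated tokenizers.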