The most common form of tokenization, in which text is split into individual words. It's useful for tasks where words are the primary unit of meaning, such as machine translation or sentiment analysis.
text = "Tokenizing text is a core task of NLP."
tokenized_text = text.split()

This gives tokenized_text as a list of words:

['Tokenizing', 'text', 'is', 'a', 'core', 'task', 'of', 'NLP.']
A large vocabulary presents a challenge because it demands that neural networks incorporate a vast number of parameters. For example, consider a scenario where there are 1 million unique words, and we aim to reduce the dimensionality of input vectors from 1 million to 1 thousand in the neural network’s initial layer. This reduction is a common procedure in many NLP frameworks. Consequently, the weight matrix for this layer alone would comprise 1 billion weights (1 million times 1 thousand).
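The arithmetic behind that claim can be sketched in a few lines (the variable names here are illustrative, not from any particular framework):

```python
# Parameter count for a first layer that projects a one-hot vector over a
# 1-million-word vocabulary down to 1,000 dimensions.
vocab_size = 1_000_000   # number of unique words (assumed for illustration)
hidden_dim = 1_000       # target dimensionality of the first layer

num_weights = vocab_size * hidden_dim
print(f"{num_weights:,} weights in this layer alone")
```

This prints 1,000,000,000 weights, i.e. one billion parameters before the rest of the network even begins, which is why subword vocabularies of a few tens of thousands of tokens are preferred in practice.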
In some languages, it’s not always clear where one word ends and another begins. Compound words, hyphenated words, and contractions (e.g., “don’t” as “do not”) add complexity.
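A quick check shows how naive whitespace splitting handles these cases, leaving contractions and hyphenated compounds as single tokens with punctuation attached (the example sentence is ours):

```python
# Whitespace splitting does not separate contractions, hyphenated
# compounds, or trailing punctuation from the words they touch.
text = "Don't split state-of-the-art words."
print(text.split())
# ["Don't", 'split', 'state-of-the-art', 'words.']
```

Deciding whether "don't" should be one token or two ("do", "n't") is a policy choice that a bare `str.split()` cannot express, which is one motivation for dedicated tokenizers.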