In character tokenization, text is split into individual characters. This is helpful in tasks where the exact spelling of words matters, such as character-level language modeling.
text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)
This gives tokenized_text as a list of characters:
['T', 'o', 'k', 'e', 'n', 'i', 'z', 'i', 'n', 'g', ' ', 't', 'e', 'x', 't', ' ', 'i', 's', ' ', 'a', ' ', 'c', 'o', 'r', 'e', ' ', 't', 'a', 's', 'k', ' ', 'o', 'f', ' ', 'N', 'L', 'P', '.']
The next step is to map each unique character to a numerical representation:
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
print(token2idx)
token2idx is a dictionary that maps each unique character to a number:
{' ': 0, '.': 1, 'L': 2, 'N': 3, 'P': 4, 'T': 5, 'a': 6, 'c': 7, 'e': 8, 'f': 9, 'g': 10, 'i': 11, 'k': 12, 'n': 13, 'o': 14, 'r': 15, 's': 16, 't': 17, 'x': 18, 'z': 19}
We can use token2idx to convert the tokenized text into a list of integers:
input_ids = [token2idx[token] for token in tokenized_text]
print(input_ids)
input_ids will be:
[5, 14, 12, 8, 13, 11, 19, 11, 13, 10, 0, 17, 8, 18, 17, 0, 11, 16, 0, 6, 0, 7, 14, 15, 8, 0, 17, 6, 16, 12, 0, 14, 9, 0, 3, 2, 4, 1]
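Because the mapping is one-to-one, the encoding is lossless: inverting token2idx recovers the original text from input_ids. A minimal sketch of this round trip (the inverse dictionary idx2token is a helper name introduced here, not part of the original code):

```python
text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
input_ids = [token2idx[token] for token in tokenized_text]

# Invert the mapping (id -> character) and join the decoded characters
idx2token = {idx: ch for ch, idx in token2idx.items()}
decoded = "".join(idx2token[i] for i in input_ids)
print(decoded == text)  # True
```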
The final step of character tokenization is to convert each element of input_ids into a one-hot encoding. PyTorch provides the one_hot() function in torch.nn.functional for this:
import torch
import torch.nn.functional as F
input_ids = torch.tensor(input_ids)
one_hot_encodings = F.one_hot(input_ids, num_classes=len(token2idx))
For each of the 38 input tokens there is a 20-dimensional vector, since we have 20 unique characters in our input.
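We can make the shape concrete by re-running the steps above and inspecting the result; each row contains a single 1 at the position given by the token's id:

```python
import torch
import torch.nn.functional as F

text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
input_ids = torch.tensor([token2idx[token] for token in tokenized_text])

one_hot_encodings = F.one_hot(input_ids, num_classes=len(token2idx))
print(one_hot_encodings.shape)  # torch.Size([38, 20])
# The first token is 'T' (id 5), so row 0 has a 1 at position 5
print(one_hot_encodings[0])
```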