In character tokenization, text is split into individual characters. This is helpful in tasks where the exact spelling of words matters, such as character-level language modeling.
text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)
This gives tokenized_text as a list of characters:
['T', 'o', 'k', 'e', 'n', 'i', 'z', 'i', 'n', 'g', ' ', 't', 'e', 'x', 't', ' ', 'i', 's', ' ', 'a', ' ', 'c', 'o', 'r', 'e', ' ', 't', 'a', 's', 'k', ' ', 'o', 'f', ' ', 'N', 'L', 'P', '.']
The next step is to map each unique character to a numerical representation:
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
print(token2idx)
token2idx is a dictionary that maps each unique character to a number:
{' ': 0, '.': 1, 'L': 2, 'N': 3, 'P': 4, 'T': 5, 'a': 6, 'c': 7, 'e': 8, 'f': 9, 'g': 10, 'i': 11, 'k': 12, 'n': 13, 'o': 14, 'r': 15, 's': 16, 't': 17, 'x': 18, 'z': 19}
We can use token2idx to convert the tokenized text into a list of integers:
input_ids = [token2idx[token] for token in tokenized_text]
print(input_ids)
input_ids will be:
[5, 14, 12, 8, 13, 11, 19, 11, 13, 10, 0, 17, 8, 18, 17, 0, 11, 16, 0, 6, 0, 7, 14, 15, 8, 0, 17, 6, 16, 12, 0, 14, 9, 0, 3, 2, 4, 1]
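Because the mapping is one-to-one, the encoding is lossless: inverting token2idx recovers the original text from input_ids. A minimal sketch of this round trip (the inverse dictionary idx2token is a helper name introduced here, not part of the original code):

```python
text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
input_ids = [token2idx[token] for token in tokenized_text]

# Invert the mapping (id -> character) and join the decoded characters
idx2token = {idx: ch for ch, idx in token2idx.items()}
decoded = "".join(idx2token[i] for i in input_ids)
print(decoded == text)  # True
```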
The final step of character tokenization is to convert each element of input_ids into a one-hot encoding. PyTorch provides the one_hot() function in torch.nn.functional for this:
import torch
import torch.nn.functional as F
input_ids = torch.tensor(input_ids)
one_hot_encodings = F.one_hot(input_ids, num_classes=len(token2idx))
For each of the 38 input tokens there is a 20-dimensional vector, since we have 20 unique characters in our input.
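We can make the shape concrete by re-running the steps above and inspecting the result; each row contains a single 1 at the position given by the token's id:

```python
import torch
import torch.nn.functional as F

text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
input_ids = torch.tensor([token2idx[token] for token in tokenized_text])

one_hot_encodings = F.one_hot(input_ids, num_classes=len(token2idx))
print(one_hot_encodings.shape)  # torch.Size([38, 20])
# The first token is 'T' (id 5), so row 0 has a 1 at position 5
print(one_hot_encodings[0])
```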