Self-attention, a fundamental component of Transformer models, is a mechanism that allows each position in a sequence to attend to all positions within the same sequence. This is crucial for capturing the context and relationships between words in tasks like language modeling, translation, and many others in natural language processing.
The self-attention mechanism in the context of Transformers can be explained more formally using mathematical formulas. Let’s delve into the details of how Queries (Q), Keys (K), and Values (V) interact in this process.
First, let’s define our terms:
For each input token embedding Xi, three vectors are obtained through learned linear projections:

Qi = Xi WQ
Ki = Xi WK
Vi = Xi WV

The weight matrices WQ, WK, and WV are learned during the training process.
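These projections can be sketched in a few lines of NumPy. The dimensions below (sequence length 4, model width 8) are illustrative choices, and the random matrices stand in for weights that would normally be learned:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 4, 8, 8            # illustrative sizes, not from the text
X = rng.standard_normal((seq_len, d_model))  # token embeddings, one row per token

# Random stand-ins for the learned projection matrices WQ, WK, WV
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

# Computing all tokens at once: row i of Q is Qi = Xi @ W_Q, and so on
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

print(Q.shape, K.shape, V.shape)
```

Because the projections are plain matrix multiplications, all positions can be computed in parallel, which is one of the practical advantages of self-attention over recurrent models.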
The self-attention output for a single token i is computed in three steps:

1. Score: the compatibility between Qi and each key Kj is usually computed with the dot product: Score(Qi, Kj) = Qi · Kj.
2. Normalize: the scores are scaled (typically by the square root of the key dimension dk) and passed through a softmax to produce attention weights: αij = softmax_j(Score(Qi, Kj) / √dk).
3. Aggregate: the output for token i is the weighted sum of all Value vectors: Attention(Qi) = Σj αij Vj.

In essence, for each word in the input, self-attention computes a weighted sum of all Value vectors, where the weights are determined by the normalized scores of the dot products between the Query of the current word and the Keys of all words (including itself) in the input. This allows each word to dynamically consider all other words in the sequence when forming its contextual representation.
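The full procedure, from projections to the weighted sum, can be written as one small function. This is a minimal sketch with NumPy and illustrative random weights, not a production implementation; the function name and dimensions are my own choices:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over a full sequence X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # Score(Qi, Kj) = Qi . Kj / sqrt(d_k)
    # Softmax over j (numerically stabilized by subtracting the row max)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)
    return weights @ V                     # each row: weighted sum of Value vectors

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))           # 5 tokens, 16-dimensional embeddings
W_Q, W_K, W_V = (rng.standard_normal((16, 16)) for _ in range(3))

out = self_attention(X, W_Q, W_K, W_V)
print(out.shape)  # one contextual vector per input token
```

Note that the attention weights in each row sum to 1, so every output vector is a convex combination of the Value vectors of all tokens, including the token itself, matching the description above.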