Multi-head attention is an extension of the self-attention mechanism and is a key feature of the Transformer model architecture, which has significantly impacted natural language processing tasks. It allows the model to jointly attend to information from different representation subspaces at different positions.
In self-attention, a single attention mechanism is applied to every word in a sentence: it computes a Query, Key, and Value vector for each word and then uses these to produce an output vector for each word. Multi-head attention, on the other hand, splits this attention into multiple “heads”. Here’s how it works:
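Before the multi-head case, the single-head self-attention just described can be sketched in NumPy. This is a minimal illustration, not a production implementation; the toy dimensions and weight names (`Wq`, `Wk`, `Wv`) are assumptions for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

# Toy example: 4 tokens, d_model = 8.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # one output vector per word: (4, 8)
```

Each row of the softmaxed score matrix is a probability distribution over the sentence, so every word's output is a weighted mix of all the Value vectors.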
Let’s assume our input has $d_{model}$ dimensions, and we use $h$ heads for the multi-head attention. Then, each head will operate on $d_k = \frac{d_{model}}{h}$ dimensions. The process for each head $i$ is as follows:

1. Project the input into head-specific Query, Key, and Value vectors using learned matrices $W_i^Q$, $W_i^K$, and $W_i^V$, each mapping from $d_{model}$ down to $d_k$ dimensions.
2. Compute scaled dot-product attention within that head: $\text{head}_i = \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i$.
3. Concatenate the outputs of all $h$ heads (giving $d_{model}$ dimensions again) and apply a final learned projection $W^O$ to produce the layer’s output.
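The per-head process above can be sketched in NumPy. As is common in practice, the sketch uses single $d_{model} \times d_{model}$ projection matrices and then splits them into $h$ heads, which is equivalent to $h$ separate $d_{model} \times d_k$ projections; all names and sizes here are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    # x: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)
    seq_len, d_model = x.shape
    d_k = d_model // h
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # Split into h heads of d_k dims each: (h, seq_len, d_k).
    split = lambda M: M.reshape(seq_len, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Scaled dot-product attention, computed per head in parallel.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)
    heads = softmax(scores, axis=-1) @ Vh        # (h, seq_len, d_k)
    # Concatenate heads back to (seq_len, d_model), then project with Wo.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Toy example: d_model = 8 split across h = 2 heads of d_k = 4 each.
rng = np.random.default_rng(0)
d_model, h, seq_len = 8, 2, 4
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, Wq, Wk, Wv, Wo, h)
print(out.shape)  # (4, 8): same shape as the input, as required for stacking layers
```

Note that the output has the same shape as the input, which is what lets Transformer blocks stack this layer repeatedly.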
The advantage of multi-head attention is that it allows the model to capture information from different representation subspaces. For instance, different heads can focus on different types of relationships between words – like syntactic and semantic relationships – which might be crucial for understanding the meaning of a sentence. This parallel and diverse processing of information leads to more expressive and powerful models, as evidenced by the success of Transformer-based models in a wide range of language tasks.