Multi-head attention is an extension of the self-attention mechanism and is a key feature of the Transformer model architecture, which has significantly impacted natural language processing tasks. It allows the model to jointly attend to information from different representation subspaces at different positions.
In self-attention, a single attention mechanism is applied to every word in a sentence: it computes a Query, Key, and Value vector for each word and then uses these to produce an output vector for each word. Multi-head attention, on the other hand, splits this attention into multiple “heads”. Here’s how it works:
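Before the multi-head case, the single-head self-attention just described can be sketched in NumPy. This is a minimal illustration, not a production implementation; the toy dimensions and weight names (`Wq`, `Wk`, `Wv`) are assumptions for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

# Toy example: 4 tokens, d_model = 8.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # one output vector per word: (4, 8)
```

Each row of the softmaxed score matrix is a probability distribution over the sentence, so every word's output is a weighted mix of all the Value vectors.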
Let’s assume our input has $d_{model}$ dimensions, and we use $h$ heads for the multi-head attention. Then, each head will operate on $d_k = \frac{d_{model}}{h}$ dimensions. The process for each head $i$ is as follows:

1. Project the input into head-specific Query, Key, and Value vectors using learned matrices $W_i^Q$, $W_i^K$, and $W_i^V$, each mapping from $d_{model}$ down to $d_k$ dimensions.
2. Compute scaled dot-product attention within that head: $\text{head}_i = \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i$.
3. Concatenate the outputs of all $h$ heads (giving $d_{model}$ dimensions again) and apply a final learned projection $W^O$ to produce the layer’s output.
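The per-head process above can be sketched in NumPy. As is common in practice, the sketch uses single $d_{model} \times d_{model}$ projection matrices and then splits them into $h$ heads, which is equivalent to $h$ separate $d_{model} \times d_k$ projections; all names and sizes here are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    # x: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)
    seq_len, d_model = x.shape
    d_k = d_model // h
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # Split into h heads of d_k dims each: (h, seq_len, d_k).
    split = lambda M: M.reshape(seq_len, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Scaled dot-product attention, computed per head in parallel.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)
    heads = softmax(scores, axis=-1) @ Vh        # (h, seq_len, d_k)
    # Concatenate heads back to (seq_len, d_model), then project with Wo.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Toy example: d_model = 8 split across h = 2 heads of d_k = 4 each.
rng = np.random.default_rng(0)
d_model, h, seq_len = 8, 2, 4
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, Wq, Wk, Wv, Wo, h)
print(out.shape)  # (4, 8): same shape as the input, as required for stacking layers
```

Note that the output has the same shape as the input, which is what lets Transformer blocks stack this layer repeatedly.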
The advantage of multi-head attention is that it allows the model to capture information from different representation subspaces. For instance, different heads can focus on different types of relationships between words – like syntactic and semantic relationships – which might be crucial for understanding the meaning of a sentence. This parallel and diverse processing of information leads to more expressive and powerful models, as evidenced by the success of Transformer-based models in a wide range of language tasks.