Analysis of Multi-Headed Attention in Transformer-based Models

Date: 2nd Jul 2021

Time: 03:00 PM

Venue: https://meet.google.com/xzq-dush-xir

PAST EVENT

Details

Within a short span of time, transformer-based models have established themselves as the state of the art for NLP. As a result of their wide adoption across multiple tasks, interest in understanding how and why they work has naturally gained momentum. Multi-headed self-attention is a mainstay of the architecture of transformer-based models, and several works have attributed their high performance to the various attention patterns learnt by the self-attention heads. In this work, we first analyse the attention heads and then propose a simplification that motivates the design of efficient transformer models.

Different methods have been proposed in the literature to classify the role of each attention head based on the relations between tokens that have high pair-wise attention. These roles include syntactic (tokens with some syntactic relation), local (nearby tokens), block (tokens in the same sentence) and delimiter (the special [CLS] and [SEP] tokens). There are two main challenges with existing classification methods: (a) there are no standard scores across studies or across functional roles, and (b) these scores are often averages measured across sentences without capturing statistical significance. Given the large diversity of head roles and the large variability across input sentences, existing methods do not enable a systematic or formal study of the role of attention heads. In this work, we formalize a simple yet effective score that generalizes to all the roles of attention heads and employ hypothesis testing on this score for robust inference. This gives us the right lens to systematically analyze attention heads and confidently answer many commonly posed questions about the BERT model. In particular, we comment on the co-location of multiple functional roles in the same attention head, the distribution of attention heads across layers, and the effect of fine-tuning for specific NLP tasks on these functional roles.
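The abstract does not spell out the score or the test, so the Python sketch below is only illustrative. It assumes a hypothetical "locality" score (the fraction of a head's attention mass that falls within a ±w token window) and a one-sided t-test against a uniform-attention baseline; the function names, the window size w, and the baseline are assumptions for illustration, not the authors' definitions.

# Illustrative sketch only (assumed score, not the authors' definition): the
# "locality" score of a head is the attention mass each token places within a
# +/- w token window; a one-sided t-test checks whether, across sentences, it
# significantly exceeds the mass uniform attention would place in that window.
import numpy as np
from scipy import stats

def locality_score(attn, w=2):
    # attn: (seq_len, seq_len) attention matrix of one head for one sentence.
    n = attn.shape[0]
    idx = np.arange(n)
    window = np.abs(idx[:, None] - idx[None, :]) <= w
    return float((attn * window).sum(axis=-1).mean())

def is_local_head(attn_per_sentence, w=2, alpha=0.01):
    # attn_per_sentence: list of per-sentence attention matrices for one head.
    scores = np.array([locality_score(a, w) for a in attn_per_sentence])
    baselines = np.array([min(1.0, (2 * w + 1) / a.shape[0]) for a in attn_per_sentence])
    # H0: the head is no more local than uniform attention over the sentence.
    _, p_value = stats.ttest_1samp(scores - baselines, 0.0, alternative='greater')
    return p_value < alpha, scores.mean()

An analogous score and test can be written for the other roles (syntactic, block, delimiter) by changing which token pairs count as "related", which is what makes a single generalizable score with hypothesis testing attractive.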

Based on the insights from this analysis, we next propose a simplification of the attention operation: we constrain self-attention to nearby tokens only, instead of all tokens of the sentence. We provide formal statistical and sensitivity analyses to show why attending to local information is sufficient. We then define the notion of effective attention, i.e., the attention obtained by combining the outputs of individual attention heads, and show that this effective attention is often local. We train several configurations of the models with varying percentages and distributions of local attention heads and show that even for the extreme configuration of all local attention heads, the BLEU score drop on MT tasks is negligible and the average accuracy drop on GLUE tasks is 1.9%. Finally, we propose and evaluate parameter sharing as a further optimization and show that the query and key projection matrices of local attention heads can be shared with only small drops in accuracy (0.15 BLEU and 2.5% on GLUE tasks). Our results establish that local attention is an effective model bias for NLP tasks, reducing the computational complexity and the size of Transformer-based models.
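To make the two ideas concrete, here is a minimal NumPy sketch of a single head, assuming the usual scaled dot-product formulation: scores outside a ±w token window are masked before the softmax (local attention), and a flag reuses the query projection for the keys (parameter sharing). The names Wq, Wk, Wv, the window size w, and the masking constant are illustrative assumptions, not the authors' implementation.

# Minimal sketch under assumptions (Wq/Wk/Wv, w, and share_qk are illustrative):
# a self-attention head restricted to a +/- w token neighbourhood, optionally
# reusing one projection matrix for both queries and keys.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_attention_head(x, Wq, Wk, Wv, w=2, share_qk=False):
    # x: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_head) projections.
    q = x @ Wq
    k = x @ (Wq if share_qk else Wk)   # parameter sharing for this head
    v = x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (seq_len, seq_len)
    idx = np.arange(x.shape[0])
    mask = np.abs(idx[:, None] - idx[None, :]) > w    # True outside the window
    scores = np.where(mask, -1e9, scores)             # block non-local tokens
    return softmax(scores, axis=-1) @ v

For clarity the sketch still forms the full score matrix; an efficient implementation would compute only the 2w+1 scores per token, which is where the computational savings of local attention come from, while sharing Wq and Wk removes one projection matrix per shared head.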

Speakers

Madhura Pande

Computer Science and Engineering