Seminars at NAVER LABS Europe are open to the public, but space is limited. Please register.
Date: 30th January 2020, 11:00 AM-12:00 PM
Speaker: Lena Voita, Yandex Research, University of Amsterdam
We will discuss what, how and why Transformers learn by analyzing:
1. the mechanisms the model uses to encode different kinds of information;
2. how the training objective defines the information flow in the model.
First, we will start with an in-depth analysis of multi-head attention. Using attribution methods, we will assess the importance of individual heads and show that the most important heads play interpretable roles. Surprisingly, the remaining heads are largely redundant and, using our novel head-pruning method, can be removed with almost no loss in translation quality.
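To make the pruning idea concrete, here is a minimal sketch of multi-head attention with a learnable scalar gate per head. This is a simplification of the talk's method, which uses a stochastic L0 relaxation of the gates; here each head's output is simply scaled by a sigmoid gate, and heads whose gate collapses toward zero are candidates for pruning. The class and method names are illustrative, not from the talk.

```python
import torch
import torch.nn as nn

class GatedMultiheadAttention(nn.Module):
    """Self-attention with one learnable gate per head (simplified sketch).

    The talk's head-pruning method uses a stochastic L0 relaxation;
    here we use plain deterministic sigmoid gates for illustration.
    """
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One logit per head; sigmoid(logit) is the head's gate.
        self.gate_logits = nn.Parameter(torch.zeros(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):  # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return t.view(batch, seq, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = map(split, (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                                   # per-head outputs
        gates = torch.sigmoid(self.gate_logits).view(1, -1, 1, 1)
        heads = heads * gates                              # gate near 0 silences a head
        merged = heads.transpose(1, 2).reshape(batch, seq, d_model)
        return self.out(merged)

    def prunable_heads(self, threshold: float = 0.1):
        """Indices of heads whose gate has collapsed below the threshold."""
        return (torch.sigmoid(self.gate_logits) < threshold).nonzero().flatten().tolist()
```

Training with a sparsity penalty on the gates pushes unimportant heads toward zero; the surviving heads are typically the interpretable ones the talk describes.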
Then, we will look at how the representations of individual tokens in the Transformer evolve between layers under different learning objectives: MT, LM and MLM (BERT-style). While previous work, mostly using so-called ‘probing tasks’, has made some interesting observations, an explanation of the process behind the observed behavior has been lacking. I will attempt to explain more generally why such behavior is observed by characterizing how the learning objective determines the information flow in the model. Looking at this task from the information bottleneck perspective on learning in neural networks, I will show that the patterns of information flow under these objectives are substantially different. For example, while LMs gradually forget the past when forming predictions about the future, for MLMs the evolution proceeds in two stages: context encoding and token reconstruction.
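The information-bottleneck view tracks quantities like the mutual information between a token's identity and its layer-wise representation. As a toy illustration only — not the estimator used in the talk — here is the simple plug-in (histogram) estimate of I(X; T) for a discrete token sequence X and a discretized (binned) representation T:

```python
import numpy as np

def mutual_information(x_ids: np.ndarray, t_bins: np.ndarray) -> float:
    """Plug-in estimate of I(X; T) in bits from two paired discrete sequences.

    Toy sketch of an information-bottleneck style measurement: x_ids are
    token identities, t_bins are hidden representations discretized into
    bins. Real analyses use more careful MI estimators.
    """
    assert x_ids.shape == t_bins.shape
    n = len(x_ids)
    joint, px, pt = {}, {}, {}
    for x, t in zip(x_ids.tolist(), t_bins.tolist()):
        joint[(x, t)] = joint.get((x, t), 0) + 1
        px[x] = px.get(x, 0) + 1
        pt[t] = pt.get(t, 0) + 1
    mi = 0.0
    for (x, t), c in joint.items():
        p_xt = c / n
        mi += p_xt * np.log2(p_xt / ((px[x] / n) * (pt[t] / n)))
    return mi
```

Computing this per layer shows the trends the talk describes: for an LM, I(token, representation) for past tokens decays with depth (the model "forgets the past"), while for an MLM it first drops during context encoding and then rises again during token reconstruction.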