NAVER LABS Europe seminars are open to the public. This seminar is virtual and requires registration.
Date: 21st November 2024, 11:00 am (CET)
Improving representations for language modeling
About the speaker: Nathan Godey is a final-year PhD student in the ALMAnaCH lab at Inria Paris, advised by Benoît Sagot and Éric de la Clergerie. He was recently a visiting student in Edoardo Ponti’s lab at the University of Edinburgh. He also teaches the Advanced NLP course in the SCIA MSc at EPITA.
Abstract: Generative models (e.g. Llama) have now largely replaced traditional predictive models (e.g. BERT) across a variety of tasks, driving language systems to prioritize broad generative capability over strong feature extraction. As a consequence, recent models tend to be treated as black-box systems, dissected only for explanation or interpretation purposes. In our work, we find that observing high-level characteristics of the representations these models produce can provide insights into the inherent limitations of the LLM paradigm, exposing biases and distortions that emerge both from the nature of the training data and from the inductive biases of model architectures.
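As a rough illustration (not code from the talk), one such high-level characteristic is anisotropy: the tendency of hidden states to cluster in a narrow cone of the embedding space. The minimal sketch below estimates it as the mean pairwise cosine similarity of token representations; the model choice (gpt2) and the example sentences are placeholder assumptions, not material from the seminar.

```python
# Sketch: estimate the anisotropy of an LM's hidden states as the mean
# pairwise cosine similarity of token vectors. Values near 1 indicate a
# degenerate, cone-shaped representation space.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM with hidden states works
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModel.from_pretrained(model_name)
model.eval()

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Language models map tokens to high-dimensional vectors.",
]

with torch.no_grad():
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    hidden = model(**batch).last_hidden_state  # (batch, seq, dim)

# Keep only non-padding token vectors, normalize them, then average the
# off-diagonal cosine similarities (self-similarities are excluded).
mask = batch["attention_mask"].bool()
vecs = torch.nn.functional.normalize(hidden[mask], dim=-1)  # (n_tokens, dim)
sim = vecs @ vecs.T
n = sim.size(0)
anisotropy = (sim.sum() - n) / (n * (n - 1))
print(f"Mean pairwise cosine similarity: {anisotropy:.3f}")
```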
Our work not only reveals key bottlenecks but also motivates alternatives to standard modeling approaches, including a neural tokenization layer that enhances robustness and a contrastive LM objective that improves training efficiency, and it paves the way for compression schemes aimed at more memory-efficient generative modeling. Overall, this presentation shows how representation analysis can shed light on fundamental modeling limitations while inspiring new approaches to overcome them.