Abstract: The Transformer architecture has taken the NLP community by storm, and a large body of work has subsequently focused on understanding how Transformers behave. In this talk, I will focus on the fact that a Transformer embedding can be expressed as a sum of vector factors, owing to the use of residual connections across all sublayers. This view sheds light on seemingly disconnected observations often made in the literature: Why do Transformer embedding spaces exhibit anisotropy? How does BERT's next-sentence prediction objective shape its vector space? Why do lower layers tend to fare better on lexical semantic tasks? How different are Transformer embeddings from pure bag-of-words representations? Are multi-head attention modules the most important components in a Transformer?
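To make the additive view concrete, here is a minimal sketch, assuming a simplified setup in which each sublayer's output is simply added back to the residual stream; layer normalization and the speaker's actual decomposition are deliberately omitted, and the names (additive_decomposition, toy_sublayers) are hypothetical, for illustration only.

# Minimal sketch (not the speaker's exact derivation): with residual connections,
# each sublayer adds its output to a running "residual stream", so the final
# embedding equals the input embedding plus one vector factor per sublayer.
import torch

def additive_decomposition(x, sublayers):
    """x: (seq_len, d_model) input embeddings; sublayers: list of callables
    standing in for attention / feed-forward modules (toy stand-ins here).
    Returns the final embedding and the additive factors it decomposes into."""
    factors = [x]                    # factor 0: the input embedding itself
    h = x
    for sublayer in sublayers:
        delta = sublayer(h)          # this sublayer's contribution
        factors.append(delta)        # one additive vector factor per sublayer
        h = h + delta                # residual connection: plain addition
    return h, factors

# Toy usage: four random linear "sublayers" over a 16-dimensional model.
d = 16
toy_sublayers = [torch.nn.Linear(d, d) for _ in range(4)]
with torch.no_grad():
    emb, parts = additive_decomposition(torch.randn(3, d), toy_sublayers)
    assert torch.allclose(emb, sum(parts))   # embedding == sum of its factors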
Speakers: Timothee Mickus is a postdoc at the University of Helsinki, working with Jörg Tiedemann on the ERC project FoTran. Previously, his PhD research focused on distributional semantics and dictionaries: Do dictionary definitions depict meaning in the same way as neural-network-based word vectors? Can we come up with quantitative ways of measuring how similar these two theories are?
Affiliation: University of Helsinki
Place of Seminar: Kumpula exactum C323 (in person) & Otaniemi, T5 (streaming)