Speaker: Dirk Hovy, post-doctoral researcher, University of Copenhagen, Copenhagen, Denmark
Abstract: The way we express ourselves is heavily influenced by our demographic background and our communicative goals. In NLP, however, we have mostly assumed that a) the goal of language is information, and b) all demographic groups use language the same way. As NLP is applied to more and more domains and text types, these assumptions are challenged.
Sociolinguistics has long investigated the interplay of demographic factors and language use, and it seems likely that the same factors are also present in the data we use to train NLP systems. The resulting bias can harm performance, but can also systematically disadvantage whole demographic groups. As a result, some of the problems we have addressed in domain adaptation might actually require demographic adaptation. In this talk, I will show how we can combine statistical NLP methods and sociolinguistic theories to the benefit of both fields. I present ongoing research into large-scale statistical analysis of demographic language variation, how this variation affects the performance (and fairness) of NLP systems, and how we can incorporate demographic information to address both problems.