Article written on the occasion of the XRCE 20th anniversary celebration.
During the past decade, we have seen an explosion in the amount of digital information. This has led to information overload, making it difficult for humans to make sense of very large document collections, such as emails, digital libraries, news articles or legal documents.
Probabilistic topic models, in part pioneered by Xerox under the trademark Smarter Document Management Technologies, are now routinely used to analyse and explore large sets of documents.[1] Topic models automatically organize documents into semantic clusters or topics, based on the statistical properties of the text they contain. Their tremendous success (more than 6,500 citations in Google Scholar at the time of writing) can be attributed to their simplicity and appealing interpretation. Yet, while successful, current topic models still do not account for fundamental properties of natural language; doing so would lead to more diverse and interpretable topics.
What are probabilistic topic models?
Topic models extract human-intelligible topics from text in an unsupervised way. This means that the clusters of documents are automatically learnt from data without any human intervention. Probabilistic topic models posit a generative process for document collections: they propose a probabilistic model (i.e., a set of interdependent random variables) that describes how documents are generated.
To capture the semantics, the key simplifying assumption made in topic models is that documents can be represented by a mixture of topics, which ignores the word order of the text. While rather basic, this simplifying assumption has proven to be effective in practice when one is only interested in extracting the topics. Providing computers with the capability to recognize the topics of a document enables them to identify documents discussing similar content and then suggest these findings to the human user.
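Figure 1 is an example of four topics and a piece of text, where each word is assigned to one of the four topics (arts, budgets, children, education). Each topic is defined by a list of vocabulary words, each one being assigned a probability. A document is then assumed to be generated from these topics as follows: first, a set of topic proportions is drawn for the document; then, for each word in the document, a topic is drawn according to these proportions and the word itself is drawn from the probabilities that this topic assigns to the vocabulary.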
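As an illustration, the generative process of the standard topic model (latent Dirichlet allocation, reference 1) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the word lists, their probabilities and the Dirichlet parameter `alpha` are made-up toy values, not those of Figure 1, and the function names are hypothetical.

```python
import random

# Toy topics (made-up values): each topic assigns probabilities to a few words.
TOPICS = {
    "arts": {"film": 0.40, "music": 0.35, "theatre": 0.25},
    "budgets": {"tax": 0.50, "million": 0.30, "spending": 0.20},
    "children": {"child": 0.45, "family": 0.35, "parents": 0.20},
    "education": {"school": 0.50, "students": 0.30, "teachers": 0.20},
}


def sample_topic_proportions(n_topics, alpha, rng):
    """Draw from a symmetric Dirichlet by normalising Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, n_words, alpha=0.5, seed=0):
    """Generate (word, topic) pairs; word order carries no meaning."""
    rng = random.Random(seed)
    names = list(topics)
    # Step 1: draw the topic proportions of this document.
    theta = sample_topic_proportions(len(names), alpha, rng)
    document = []
    for _ in range(n_words):
        # Step 2a: draw a topic for this word position.
        topic = rng.choices(names, weights=theta)[0]
        # Step 2b: draw a word from that topic's distribution.
        words = list(topics[topic])
        probs = [topics[topic][w] for w in words]
        word = rng.choices(words, weights=probs)[0]
        document.append((word, topic))
    return theta, document


if __name__ == "__main__":
    theta, doc = generate_document(TOPICS, n_words=10)
    print("topic proportions:", [round(t, 3) for t in theta])
    print("document:", doc)
```

Because each word is drawn independently given the topic proportions, shuffling the generated words would not change their probability, which is the "bag of words" assumption described above.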
This probabilistic model not only proposes an appealing generative model of documents, but it also enjoys a relatively simple inference procedure (a collapsed Gibbs sampler to be precise) based on simple word counts, which is able to handle millions of documents in a couple of minutes.[2] Inference is the process of deciding which topics should be associated with the documents. It is done automatically, based on the data-driven evidence. Knowing the topic association is useful in practice as it enables one, for example, to recover documents that share the same set of topics.
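To make the "simple word counts" concrete, here is a minimal sketch of a collapsed Gibbs sampler in the spirit of reference 2. It assumes documents are given as lists of integer word identifiers; the hyperparameters `alpha` and `beta`, their default values and the function name are illustrative choices, and this toy version makes no attempt at the speed needed for large collections.

```python
import random


def collapsed_gibbs_lda(docs, n_topics, vocab_size,
                        alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """Resample the topic of every word from its conditional distribution."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    n_k = [0] * n_topics                                # total words per topic
    z = []                                              # topic of each word

    # Random initial assignment, recorded in the count tables.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current word from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                # Draw a new topic and put the word back.
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return z, n_dk, n_kw


if __name__ == "__main__":
    toy_docs = [[0, 1, 2, 0], [3, 4, 3], [0, 3, 1, 4, 2]]
    z, n_dk, n_kw = collapsed_gibbs_lda(toy_docs, n_topics=2, vocab_size=5)
    print("topic counts per document:", n_dk)
```

After sampling, the rows of `n_dk` indicate which topics are associated with each document, and documents with similar rows can be retrieved together.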
Figure 1
Figure 1 (reproduced from reference 1). Topics are defined by a list of words. Each column corresponds to a topic and the words in the list are ranked according to their relevance. The words in the boxed text are coloured according to the topics shown at the top. Each word is modelled as being drawn independently from one of these topics, neglecting the sequential structure of text.
Weaknesses of standard probabilistic topic models
A practical issue with topic models is identifying the most likely number of topics describing the data, which is a computationally expensive procedure. When modelling real data, the number of topics is expected to grow logarithmically with the size of the corpus. When the number of documents in the corpus increases, it is reasonable to assume that new topics will appear, but that the increase will not be linear with the number of documents; there will be a saturation effect. The issue can be dealt with in a principled way by considering nonparametric Bayesian extensions, a recent trend in probabilistic machine learning.[3]
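The logarithmic growth can be illustrated with the Chinese restaurant process, the clustering scheme induced by a Dirichlet process. The sketch below is a simplified stand-in for the hierarchical construction of reference 3, not that model itself; the concentration parameter `alpha` and the function names are assumptions made for illustration. Under this process the expected number of clusters after n items is the sum of alpha / (alpha + i) for i from 0 to n - 1, which is approximately alpha log(1 + n / alpha).

```python
import random


def expected_num_clusters(n_items, alpha):
    """Exact expected number of clusters under a Chinese restaurant process."""
    return sum(alpha / (alpha + i) for i in range(n_items))


def simulate_crp(n_items, alpha, seed=0):
    """Simulate cluster sizes: item i joins an existing cluster with
    probability proportional to its size, or opens a new one with
    probability proportional to alpha."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_items):
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)   # a new cluster appears
        else:
            sizes[k] += 1     # an existing cluster grows
    return sizes


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, round(expected_num_clusters(n, 1.0), 2),
              len(simulate_crp(n, 1.0)))
```

Multiplying the number of items by one hundred roughly doubles the expected number of clusters in this toy setting, which is the saturation effect described above.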
A second weakness of topic models is their limited expressiveness. The prevalence of a topic in the corpus is correlated with its prevalence in individual documents. Similarly, the prevalence of a word occurring in the corpus is correlated with its prevalence in the individual topics. These are undesirable properties. For example, a good model should be able to identify that a word characterizes a specific topic irrespective of its frequency in the document collection.
Finally, and perhaps most importantly, the probabilistic model postulated by topic models is inappropriate for modelling real text. Data sampled from the model are distributed differently from real observations. For example, it is well known that modern languages exhibit power-law properties (see Figure 2). This means that human languages have a very heavy tail: few words are extremely frequent, while many words are very infrequent. This is not accounted for in classical topic models.
Figure 2
Figure 2 shows the ordered word frequencies of four benchmark corpora. Let $f_w$ be the frequency of word $w$ in the corpus. It can be observed that the ranked word frequencies follow Zipf’s law, which is an example of a power-law distribution: $p(f_w) \propto f_w^{-c}$, where $c$ is a positive constant. Like many natural phenomena, human languages including English exhibit this property. Intuitively, this means that human languages have a very heavy tail: few words are extremely frequent, while many words are very infrequent.
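A rough way to check for this behaviour in a corpus is to sort the word frequencies and fit a straight line to log frequency against log rank; an approximately linear relation with negative slope is the signature of Zipf's law. The sketch below assumes the word frequencies are already available as a list of positive numbers. The function name is hypothetical, and ordinary least squares on log-log axes is only a crude estimator, used here for illustration.

```python
import math


def rank_frequency_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / s_xx


if __name__ == "__main__":
    # Synthetic frequencies proportional to 1 / rank (idealised Zipf curve).
    synthetic = [1000.0 / rank for rank in range(1, 51)]
    print("fitted slope:", rank_frequency_slope(synthetic))
```

Note that the slope of the rank-frequency curve and the exponent of the distribution of frequencies are related but distinct quantities, so the fitted slope should not be read directly as the power-law constant.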
Using probabilistic topic models with power-law characteristics in an idea management system
The data we observe in practice, such as text, images or social networks, show significant departures from standard distributions encountered in statistics. When our target application is to automatically organize a large set of documents according to topics, we should use models that are able to learn a potentially infinite number of topics and to account for the power-law characteristics of natural language. Moreover, we would like to increase the model expressiveness, either by allowing more diverse topic distributions, or by favouring more specialized topics, while preserving a simple and efficient inference procedure. This can be achieved by basing topic models on a stochastic process called the Indian Buffet Process (IBP).[4],[5]
The generative model for a document resulting from the IBP-based topic model is similar to that of the standard topic model, except that a small subset of topics is selected before assigning them a weight. Similarly, each topic is defined by a relatively small subset of the vocabulary words, which nonetheless follow a power law. The IBP operates as a binary mask on the discrete distributions defining topics and their association to documents. Topics extracted from the corpus are more specific, possibly assigning a large weight to infrequent but informative words, and they are more discriminative. We observed experimentally that fewer topics were associated with each document.
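The idea of a binary mask can be illustrated with a small simulation of the Indian Buffet Process in its usual sequential form: the first document selects a Poisson(alpha) number of topics, and document i selects each existing topic k with probability m_k / i (where m_k is the number of earlier documents using it) before adding a Poisson(alpha / i) number of new topics. This is a simplified sketch, not the model of reference 4; the function names, the parameter `alpha` and the Gamma-based weighting of the selected topics are assumptions made for illustration.

```python
import math
import random


def _sample_poisson(lam, rng):
    """Knuth's method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_ibp(n_rows, alpha, seed=0):
    """Return a binary matrix (list of rows): rows are documents, columns topics."""
    rng = random.Random(seed)
    counts = []   # counts[k]: number of rows so far that selected column k
    rows = []
    for i in range(1, n_rows + 1):
        # Select existing columns in proportion to their popularity.
        row = [1 if rng.random() < m / i else 0 for m in counts]
        for k, bit in enumerate(row):
            counts[k] += bit
        # Add a Poisson number of brand-new columns.
        n_new = _sample_poisson(alpha / i, rng)
        row.extend([1] * n_new)
        counts.extend([1] * n_new)
        rows.append(row)
    n_cols = len(counts)
    return [row + [0] * (n_cols - len(row)) for row in rows]


def masked_topic_weights(mask_row, concentration=1.0, seed=0):
    """Weights are non-zero only where the mask is 1, then normalised."""
    rng = random.Random(seed)
    draws = [rng.gammavariate(concentration, 1.0) if bit else 0.0
             for bit in mask_row]
    total = sum(draws)
    return [d / total for d in draws] if total > 0 else draws


if __name__ == "__main__":
    mask = sample_ibp(n_rows=5, alpha=2.0)
    for row in mask:
        print(row, [round(w, 2) for w in masked_topic_weights(row)])
```

Each row of the mask switches off most topics for a given document, so the remaining weight concentrates on a few of them, in line with the observation that fewer topics end up associated with each document.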
We are currently exploring how this new type of topic model can be integrated into an Idea Management System (IMS), which can be viewed as a collaborative brainstorming system. In its simplest form, an IMS is a so-called suggestion box, where customers and/or employees can submit feedback or make suggestions for product/service improvements. Large companies such as IBM, Dell, Microsoft, Whirlpool, UBS or Starbucks have deployed such systems to better support innovation with the aim of capturing the collective wisdom residing in the employee and/or customer base. Xerox is adapting IMS to other domains, such as urban planning and policy design, facilitating the communication between citizens and political decision makers through an IMS with advanced filtering, browsing and aggregating capabilities.
However, when a large number of ideas are collected, it quickly becomes very time-consuming to identify common themes, as well as overlaps, duplicates and related ideas. The system we are developing aims to facilitate this process for the decision maker by providing him or her with tools to explore, curate and aggregate ideas. Probabilistic topic models with power-law characteristics are very useful in this context as they enable users and curators to find more relevant and targeted topics, increasing the relevance of retrieved documents and improving their browsing experience.
About the author (2014): Cédric Archambeau is Area Manager of the Machine Learning group at Xerox Research Centre Europe. He also holds an Honorary Senior Research Associate position in the Centre for Computational Statistics and Machine Learning at University College London. His research interests include probabilistic machine learning and data science, with applications in natural language processing, relational learning, personalised content creation and data assimilation.
[1] D. M. Blei, A. Y. Ng, M. I. Jordan: Latent Dirichlet allocation. Journal of Machine Learning Research 3 (4–5): 993–1022, 2003.
[2] T. L. Griffiths and M. Steyvers: Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, 2004.
[3] Y. W. Teh, M. I. Jordan, M. J. Beal, D. M. Blei: Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[4] C. Archambeau, B. Lakshminarayanan, G. Bouchard: Latent IBP compound Dirichlet Allocation. To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence.
[5] Z. Ghahramani, T. Griffiths, P. Sollich: Bayesian nonparametric latent feature models (with discussion). Bayesian Statistics 8:201–226, 2007.