35ème Conférence sur la Gestion de Données – Principes, Technologies et Applications (BDA): BDA is the conference of the data management community in France.
09:00-10:30: Tutorial: “Data Pipelines for User Group Analytics.” Behrooz Omidvar-Tehrani, Sihem Amer-Yahia.
User data is becoming increasingly available in domains ranging from the social Web to electronic health records (EHRs). It is characterized by a combination of demographics (e.g., age, gender, life status) and user actions (e.g., posting a tweet, following a diet). Domain experts rely on user data to conduct large-scale population studies. Information consumers, on the other hand, rely on user data for routine tasks such as finding a book club or getting advice from look-alike patients. User data analytics is usually based on identifying group-level behaviors such as “teenage females who watch Titanic” and “old male patients in Paris who suffer from Bronchitis.” In this tutorial, we review data pipelines for User Group Analytics (UGA). These pipelines take raw user data as input and return insights in the form of user groups. We review research on UGA pipelines and discuss approaches and open challenges for discovering, exploring, and visualizing user groups. Throughout the tutorial, we illustrate the discussion with examples from two key domains: “the social Web” and “health-care”.
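As a rough illustration of what a user group such as “teenage females who watch Titanic” could look like in code, the sketch below selects users who match a set of demographic values and have performed a given action. The record layout, the field names (`age_band`, `gender`, `actions`), the sample data, and the helper `user_group` are all hypothetical and are not taken from the tutorial.

```python
# Hypothetical user records: demographics plus a set of user actions.
users = [
    {"id": 1, "age_band": "teen", "gender": "F", "actions": {"watch:Titanic"}},
    {"id": 2, "age_band": "teen", "gender": "M", "actions": {"watch:Titanic"}},
    {"id": 3, "age_band": "teen", "gender": "F", "actions": {"watch:Titanic", "post:tweet"}},
    {"id": 4, "age_band": "adult", "gender": "F", "actions": {"post:tweet"}},
]


def user_group(records, demographics, action):
    """Return ids of users matching every demographic value and the action."""
    return [
        u["id"]
        for u in records
        if all(u.get(key) == value for key, value in demographics.items())
        and action in u["actions"]
    ]


# "Teenage females who watch Titanic" expressed as a group description.
teen_female_titanic = user_group(
    users, {"age_band": "teen", "gender": "F"}, "watch:Titanic"
)
```

A full UGA pipeline would discover such group descriptions from the data rather than take them as input; this sketch only shows how one description maps to its members.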
15:15-15:45: “Cohort Representation and Exploration.” Behrooz Omidvar-Tehrani, Sihem Amer-Yahia, and Laks V.S. Lakshmanan.
The abundant availability of health-care data calls for effective analysis methods that help medical experts gain a better understanding of their data. While the focus has largely been on prediction, “representation” and “exploration” of health-care data have received little attention. In this paper, we introduce CORE, a framework for representing and exploring patient cohorts. Obtaining a readable and succinct representation of a cohort's health data is challenging because cohorts often consist of hundreds of patients whose medical actions are of various types and occur at different points in time. We extend the Needleman-Wunsch algorithm for sequence matching to handle temporal sequences, and propose “trajectory families”, a customized index to efficiently compare and aggregate patient trajectories into a cohort representation. We define cohort exploration as finding cohorts similar to a given cohort. This problem is challenging because the potential number of similar cohorts is huge. We propose a two-stage approach that first limits the search space to “contrast cohorts” and then computes their similarity to the given cohort. To speed up cohort similarity computation, we use “event sets” in the same spirit as the double dictionary encoding proposed for keyword search. We run qualitative and quantitative experiments on real data to explore the efficiency and usefulness of CORE. We show that CORE representations reduce time-to-insight from hours to seconds and help medical experts find insights better than state-of-the-art Visual Analytics tools.
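For readers unfamiliar with the sequence-matching step, the sketch below shows the textbook Needleman-Wunsch global alignment score applied to two sequences of medical actions. It is the classic algorithm only, not CORE's temporal extension, and the scoring values (`match`, `mismatch`, `gap`) and the sample action names are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix with an empty sequence costs one gap per element.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = a[i - 1] == b[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            skip_a = score[i - 1][j] + gap  # a[i-1] aligned with a gap
            skip_b = score[i][j - 1] + gap  # b[j-1] aligned with a gap
            score[i][j] = max(diagonal, skip_a, skip_b)
    return score[n][m]


# Two hypothetical patient trajectories as ordered lists of actions.
patient_1 = ["visit", "scan", "drug"]
patient_2 = ["visit", "drug"]
alignment_score = needleman_wunsch(patient_1, patient_2)
```

Here the best alignment matches `visit` and `drug` and leaves `scan` against a gap. Handling the time at which each action occurs, as the abstract describes, would require changes to the scoring that this sketch does not attempt.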