|Bernard Omidvar-Tehrani, Sihem Amer-Yahia, Laks V.S. Lakshmanan|
|The International Journal on Very Large Data Bases (VLDBJ), Springer, August 2020|
The abundant availability of health-care data calls for effective analysis methods to help medical experts gain a better understanding of their patients and their health. Existing work has focused largely on prediction. In this paper, we introduce Core, a framework for cohort “representation” and “exploration”. Our contributions are twofold. First, we formalize cohort representation as the problem of aggregating the trajectories of its patients. This problem is challenging because cohorts often consist of hundreds of patients who underwent medical actions of various types at different points in time. We prove that producing a representative cohort trajectory is NP-complete by a reduction from the Multiple Sequence Alignment problem. We propose a heuristic that extends the Needleman-Wunsch algorithm for sequence matching to handle temporal sequences. To further improve the efficiency of cohort representation, we introduce “trajectory families” and “stratified sampling”. Our second contribution is formalizing the problem of cohort exploration as finding a set of cohorts that are similar to a cohort of interest and that maximize entropy. This problem is challenging because the potential number of similar cohorts is huge.
We prove that cohort exploration is NP-complete by a reduction from the Maximum Edge Subgraph problem. To address this complexity, we develop a multi-staged approach based on limiting the search space to “contrast cohorts”.
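To make the idea of extending Needleman-Wunsch to temporal sequences concrete, the following is a minimal sketch of a pairwise global alignment between two patient trajectories. It is not the heuristic proposed in the paper. The event representation (an action label paired with a timestamp), the scoring function `event_score`, and the parameters `match`, `mismatch`, `time_weight`, and `gap` are all assumptions made for illustration: matching actions are rewarded, and the reward shrinks as their timestamps drift apart.

```python
from typing import List, Optional, Tuple

# Hypothetical event format: (action label, time in days since first visit).
Event = Tuple[str, float]


def event_score(a: Event, b: Event, match: float = 2.0,
                mismatch: float = -1.0, time_weight: float = 0.25) -> float:
    """Score two events: a mismatch penalty for different actions, otherwise
    a match reward reduced in proportion to the time difference (assumed)."""
    if a[0] != b[0]:
        return mismatch
    return match - time_weight * abs(a[1] - b[1])


def align(s: List[Event], t: List[Event], gap: float = -1.0):
    """Needleman-Wunsch global alignment with the time-aware score above."""
    n, m = len(s), len(t)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Border cells: aligning a prefix against nothing costs one gap per event.
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + event_score(s[i - 1], t[j - 1]),  # pair events
                dp[i - 1][j] + gap,  # event of s left unmatched
                dp[i][j - 1] + gap,  # event of t left unmatched
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs: List[Tuple[Optional[Event], Optional[Event]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + event_score(s[i - 1], t[j - 1]):
            pairs.append((s[i - 1], t[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            pairs.append((s[i - 1], None))
            i -= 1
        else:
            pairs.append((None, t[j - 1]))
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs


patient_a = [("diagnosis", 0.0), ("drug", 3.0), ("surgery", 10.0)]
patient_b = [("diagnosis", 0.0), ("surgery", 12.0)]
score, alignment = align(patient_a, patient_b)
for left, right in alignment:
    print(left, right)
```

In this toy example the two surgeries are paired despite a two-day offset, and the drug event of the first patient is aligned with a gap. Aggregating a whole cohort would require aligning many trajectories at once, which is where the Multiple Sequence Alignment hardness mentioned above comes from.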