Date of Award
Doctor of Philosophy (PhD)
School of Computing
Daniela Stan Raicu, PhD
Jacob Furst, PhD
Thiruvarangan Ramaraj, PhD
Samuel G. Armato, PhD
Hidden stratification represents a phenomenon in which a training dataset contains unlabeled (hidden) subsets of cases that may affect machine learning model performance. Machine learning models that ignore the hidden stratification phenomenon--despite promising overall performance measured as accuracy and sensitivity--often fail at predicting the low prevalence cases, but those cases remain important. In the medical domain, patients with diseases are often less common than healthy patients, and a misdiagnosis of a patient with a disease can have significant clinical impacts. Therefore, to build a robust and trustworthy CAD system and a reliable treatment effect prediction model, we cannot only pursue machine learning models with high overall accuracy, but we also need to discover any hidden stratification in the data and evaluate the proposing machine learning models with respect to both overall performance and the performance on certain subsets (groups) of the data, such as the ‘worst group’.
In this study, I investigated three approaches for data stratification: a novel algorithmic deep learning (DL) approach that learns similarities among cases and two schema completion approaches that utilize domain expert knowledge. I further proposed an innovative way to integrate the discovered latent groups into the loss functions of DL models to allow for better model generalizability under the domain shift scenario caused by the data heterogeneity.
My results on lung nodule Computed Tomography (CT) images and breast cancer histopathology images demonstrate that learning homogeneous groups within heterogeneous data significantly improves the performance of the computer-aided diagnosis (CAD) system, particularly for low-prevalence or worst-performing cases. This study emphasizes the importance of discovering and learning the latent stratification within the data, as it is a critical step towards building ML models that are generalizable and reliable. Ultimately, this discovery can have a profound impact on clinical decision-making, particularly for low-prevalence cases.
Wang, Yiyang, "Towards generalizable machine learning models for computer-aided diagnosis in medicine" (2023). College of Computing and Digital Media Dissertations. 48.