College of Computing and Digital Media Dissertations

Towards generalizable machine learning models for computer-aided diagnosis in medicine

Yiyang Wang, DePaul University

Date of Award

Spring 5-30-2023

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

School

School of Computing

First Advisor

Daniela Stan Raicu, PhD

Second Advisor

Jacob Furst, PhD

Third Advisor

Thiruvarangan Ramaraj, PhD

Fourth Advisor

Samuel G. Armato, PhD

Abstract

Hidden stratification is a phenomenon in which a training dataset contains unlabeled (hidden) subsets of cases that may affect machine learning model performance. Models that ignore hidden stratification, despite promising overall performance as measured by accuracy and sensitivity, often fail to predict the low-prevalence cases, even though those cases remain important. In the medical domain, patients with a disease are often less common than healthy patients, and misdiagnosing a patient with a disease can have significant clinical impact. Therefore, to build a robust and trustworthy computer-aided diagnosis (CAD) system and a reliable treatment effect prediction model, we cannot pursue only machine learning models with high overall accuracy; we also need to discover any hidden stratification in the data and evaluate the proposed machine learning models with respect to both overall performance and performance on certain subsets (groups) of the data, such as the ‘worst group’.
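As an illustration of evaluating on subsets as well as overall, the following is a minimal sketch that reports accuracy per group and identifies the worst group. The function names, the group labels, and the toy data are assumptions made for this example; they are not taken from the dissertation.

```python
from collections import defaultdict


def group_accuracies(y_true, y_pred, groups):
    """Return a dict mapping each group label to its accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}


def worst_group_accuracy(y_true, y_pred, groups):
    """Return (group, accuracy) for the group with the lowest accuracy."""
    accs = group_accuracies(y_true, y_pred, groups)
    worst = min(accs, key=accs.get)
    return worst, accs[worst]


# Toy data (made up): group "c" is a small, hidden subset of the positive class.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
groups = ["a"] * 6 + ["b", "b", "c", "c"]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
worst_group, worst_acc = worst_group_accuracy(y_true, y_pred, groups)
print(f"overall accuracy: {overall:.2f}")
print(f"worst group: {worst_group!r} with accuracy {worst_acc:.2f}")
```

In this toy case the overall accuracy looks acceptable while every case in one small group is misclassified, which is the kind of gap a single aggregate metric can hide.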

In this study, I investigated three approaches for data stratification: a novel algorithmic deep learning (DL) approach that learns similarities among cases and two schema completion approaches that utilize domain expert knowledge. I further proposed an innovative way to integrate the discovered latent groups into the loss functions of DL models to improve model generalizability under the domain shift caused by data heterogeneity.
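The abstract does not specify how the latent groups enter the loss. One generic way to make a loss group-aware, shown here only as a sketch, is to average the loss within each group and then optimize the largest per-group average (a worst-group objective) rather than the mean over all cases. This is a swapped-in illustration and may differ from the dissertation's formulation; the function names and the example values are hypothetical.

```python
import math
from collections import defaultdict


def binary_cross_entropy(prob, label, eps=1e-12):
    """Cross-entropy for one case; prob is the predicted P(label == 1)."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def per_group_mean_loss(probs, labels, groups):
    """Return a dict mapping each group id to its mean cross-entropy."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for prob, label, group in zip(probs, labels, groups):
        sums[group] += binary_cross_entropy(prob, label)
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in counts}


def worst_group_loss(probs, labels, groups):
    """Largest per-group mean loss (assumed objective, for illustration)."""
    return max(per_group_mean_loss(probs, labels, groups).values())


# Made-up predictions: group 1 is rare and poorly predicted.
probs = [0.9, 0.9, 0.4]
labels = [1, 1, 1]
groups = [0, 0, 1]

mean_loss = sum(binary_cross_entropy(p, y) for p, y in zip(probs, labels)) / len(probs)
print(f"mean loss over all cases: {mean_loss:.4f}")
print(f"worst-group loss:         {worst_group_loss(probs, labels, groups):.4f}")
```

The worst-group value is never smaller than the overall mean, so minimizing it puts pressure on the hardest group instead of letting the majority dominate the objective.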

My results on lung nodule Computed Tomography (CT) images and breast cancer histopathology images demonstrate that learning homogeneous groups within heterogeneous data significantly improves the performance of the computer-aided diagnosis (CAD) system, particularly for low-prevalence or worst-performing cases. This study emphasizes the importance of discovering and learning the latent stratification within the data, as it is a critical step towards building ML models that are generalizable and reliable. Ultimately, this discovery can have a profound impact on clinical decision-making, particularly for low-prevalence cases.

Recommended Citation

Wang, Yiyang, "Towards generalizable machine learning models for computer-aided diagnosis in medicine" (2023). College of Computing and Digital Media Dissertations. 48.
https://via.library.depaul.edu/cdm_etd/48

Included in

Biomedical Informatics Commons, Data Science Commons
