Machine Learning Predictions for Cancer Diagnosis

The WiDS Datathon 2024 focused on predicting whether a patient’s diagnosis period is less than 90 days, using a real-world dataset of approximately 39,000 patient records. The anonymized dataset includes patient characteristics such as age, race, BMI, and zip code; diagnosis and treatment information (e.g., breast cancer diagnosis type, treatments); zip-code-level demographic data (e.g., income, education, rent, race, poverty); and toxic air quality data (ozone, PM2.5, NO2), thereby linking health outcomes to environmental conditions.

The primary objective is to determine whether a diagnosis period of less than 90 days can be predicted from these patient, demographic, and environmental characteristics. Feature selection and data cleaning were the most time-consuming tasks in the analysis.
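To give a sense of what the cleaning and feature-screening step can involve, here is a minimal sketch on a few made-up records. It is not the code used in the analysis: the column names (patient_age, bmi, pm25, diag_lt_90d), the values, and the 50% missing-data threshold are all assumptions chosen for illustration. It drops features that are mostly empty and fills the remaining gaps with the column median.

```python
from statistics import median

# Toy records with hypothetical column names; all values are invented.
records = [
    {"patient_age": 54, "bmi": 27.1, "pm25": 7.9, "diag_lt_90d": 1},
    {"patient_age": 61, "bmi": None, "pm25": 8.4, "diag_lt_90d": 1},
    {"patient_age": 47, "bmi": None, "pm25": None, "diag_lt_90d": 0},
    {"patient_age": 70, "bmi": None, "pm25": 6.2, "diag_lt_90d": 1},
]

TARGET = "diag_lt_90d"  # assumed name for the "diagnosed within 90 days" label
MAX_MISSING = 0.5       # assumed threshold: drop features missing in over half the rows


def missing_fraction(rows, column):
    """Share of rows in which `column` is None."""
    return sum(row[column] is None for row in rows) / len(rows)


def clean(rows, target=TARGET, max_missing=MAX_MISSING):
    """Drop sparse features, then fill remaining gaps with the column median."""
    features = [c for c in rows[0] if c != target]
    kept = [c for c in features if missing_fraction(rows, c) <= max_missing]
    medians = {
        c: median(row[c] for row in rows if row[c] is not None) for c in kept
    }
    cleaned = []
    for row in rows:
        new_row = {c: medians[c] if row[c] is None else row[c] for c in kept}
        new_row[target] = row[target]
        cleaned.append(new_row)
    return kept, cleaned


kept_features, cleaned_records = clean(records)
print(kept_features)
```

On the real dataset the same idea would more naturally be expressed with a dataframe library, and the threshold would need to be chosen by inspecting how much each column is actually missing.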

View the entire process in Datalore below: