Data Science

Diabetes Risk Analysis & Population Segmentation

Calibrated two-stage diabetes screening, clustering, and explainability on CDC BRFSS survey data

77%

At-Risk Recall

0.810

Screening ROC-AUC

0.738

Balanced Accuracy

The Problem

Diabetes affects over 37 million Americans, with an estimated 96 million more in a prediabetic state, yet early screening tools remain blunt instruments. The CDC's Behavioral Risk Factor Surveillance System (BRFSS) captures self-reported health indicators from 253,680 respondents (229,781 after deduplication) across demographics, lifestyle, and clinical factors, but the data presents serious modeling challenges. The target variable has three classes (no diabetes, prediabetes, diabetes) with severe imbalance: 83.8% of respondents report no diabetes, while prediabetes accounts for just 2% of observations. Beyond classification, public health practitioners need to understand which risk factors drive predictions and whether distinct population risk segments exist that could inform targeted intervention strategies.
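To make the imbalance concrete, here is a small arithmetic sketch of why plain accuracy is a misleading objective on this target. It assumes the diabetes share is simply the remainder after the two quoted shares (roughly 14.2%), which is an inference and not a figure stated above.

```python
# Class shares quoted in the text; the diabetes share is an assumed remainder.
shares = {"no_diabetes": 0.838, "prediabetes": 0.02}
shares["diabetes"] = round(1.0 - sum(shares.values()), 3)

# A trivial classifier that always predicts the majority class.
majority = max(shares, key=shares.get)

# Its accuracy equals the majority share; its recall is 1 for that class, 0 elsewhere.
accuracy = shares[majority]
recalls = {c: (1.0 if c == majority else 0.0) for c in shares}
balanced_accuracy = sum(recalls.values()) / len(recalls)

print(f"majority class: {majority}")
print(f"accuracy: {accuracy:.3f}, balanced accuracy: {balanced_accuracy:.3f}")
```

Under these assumptions the do-nothing classifier scores about 0.838 accuracy but only one third balanced accuracy, since it never detects either at-risk class.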

Approach

Rather than forcing a single model to handle the full three-class problem, the pipeline uses a two-stage decomposition. Stage 1 is a binary screen (at-risk vs. no diabetes) using L2-regularized logistic regression with RandomUnderSampler to address class imbalance, tuned via Optuna over 60 trials with balanced accuracy as the primary objective. Stage 2 takes the at-risk population and attempts to distinguish prediabetes from diabetes using XGBoost with RUSBoost, though this stage faces the hardest separation in the dataset. Both stages use probability calibration (sigmoid and isotonic) fitted on held-out validation data.

For explainability, four complementary methods are applied: SHAP beeswarm plots for global feature attribution, LIME for local instance-level explanations, permutation importance for model-agnostic feature ranking, and logistic regression coefficients for direct interpretability.

On the unsupervised side, K-Means clustering (k=4, selected via elbow and silhouette analysis) identifies population risk segments from the feature space, validated post-hoc against actual diabetes prevalence rates. DBSCAN and hierarchical clustering serve as comparison methods. Association rule mining via Apriori surfaces co-occurring risk factor patterns.
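The Stage 1 recipe (undersample, fit a regularized logistic regression, calibrate on held-out validation data) can be sketched roughly as follows. This is an illustration on synthetic data, not the project's code: the `undersample` helper is a hand-rolled stand-in for RandomUnderSampler, `C=1.0` is a placeholder and not a tuned value, Optuna tuning is omitted, and only the isotonic variant of calibration is shown.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the survey features, with a minority "at-risk" class.
n = 6000
X = rng.normal(size=(n, 5))
logits = 1.2 * X[:, 0] + 0.8 * X[:, 1] - 2.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Train / validation (for calibration) / test split.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)


def undersample(X, y, rng):
    """Randomly drop majority-class rows until both classes have equal counts."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_keep = min(len(pos), len(neg))
    keep = np.concatenate([
        rng.choice(pos, n_keep, replace=False),
        rng.choice(neg, n_keep, replace=False),
    ])
    return X[keep], y[keep]


X_bal, y_bal = undersample(X_train, y_train, rng)

# LogisticRegression applies an L2 penalty by default; C is a placeholder.
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X_bal, y_bal)

# Isotonic calibration fitted on validation scores at the natural class ratio.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(clf.predict_proba(X_val)[:, 1], y_val)

p_raw = clf.predict_proba(X_test)[:, 1]
p_cal = iso.predict(p_raw)
y_pred = (p_raw >= 0.5).astype(int)

print("balanced accuracy:", round(balanced_accuracy_score(y_test, y_pred), 3))
print("mean raw probability:", round(float(p_raw.mean()), 3))
print("mean calibrated probability:", round(float(p_cal.mean()), 3))
print("test prevalence:", round(float(y_test.mean()), 3))
```

Training on a balanced sample inflates the raw probabilities relative to the true prevalence, which is why the calibration step is fitted on validation data that keeps the original class ratio.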

Results

Stage 1 reached 0.810 ROC-AUC, 0.738 balanced accuracy, and 77% recall for at-risk screening on the held-out test set. K-Means found four population segments with diabetes prevalence ranging from 13.0% to 27.2%, and the cluster assignments were highly stable across random seeds (adjusted Rand index 0.984). SHAP and permutation importance consistently highlighted general health, high blood pressure, high cholesterol, BMI, and age as the strongest risk drivers, with clear directionality in the SHAP beeswarm (e.g., high BMI and older age push predictions toward diabetes). Prediabetes-versus-diabetes separation remained materially harder (Stage 2 balanced accuracy: 0.590; final hard-gated three-class: 0.510, vs. 0.49 for a single-model baseline), which supports positioning the project as a screening and segmentation tool rather than a diagnostic classifier. All metrics are reported on held-out test data with calibration validated separately.
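One plausible reading of the "hard-gated" three-class result is that a respondent reaches Stage 2 only if the Stage 1 score clears a threshold. The sketch below illustrates that reading together with the balanced accuracy metric. The function names, both 0.5 thresholds, and the toy scores are hypothetical and are not taken from the project.

```python
def hard_gate(p_at_risk, p_diabetes_given_at_risk, gate=0.5, split=0.5):
    """Combine two binary stages into one three-class label.

    0 = no diabetes, 1 = prediabetes, 2 = diabetes.
    """
    labels = []
    for p1, p2 in zip(p_at_risk, p_diabetes_given_at_risk):
        if p1 < gate:        # Stage 1 says "not at risk": stop here.
            labels.append(0)
        elif p2 < split:     # Stage 2 leans toward prediabetes.
            labels.append(1)
        else:                # Stage 2 leans toward diabetes.
            labels.append(2)
    return labels


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy scores, made up for illustration only.
p_stage1 = [0.10, 0.30, 0.70, 0.80, 0.90, 0.60]
p_stage2 = [0.50, 0.50, 0.40, 0.45, 0.80, 0.70]
y_true = [0, 0, 1, 2, 2, 0]

y_pred = hard_gate(p_stage1, p_stage2)
print(y_pred)                                      # [0, 0, 1, 1, 2, 2]
print(round(balanced_accuracy(y_true, y_pred), 3))  # 0.722
```

Because the gate is hard, any Stage 1 miss is unrecoverable in the final label, which is one reason a three-class score can sit well below the Stage 1 screening score.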

Figures

Fig. 1. SHAP beeswarm plot for the Diabetes class showing per-feature impact on model output. General health, high blood pressure, BMI category, and high cholesterol are the top drivers, with clear directionality: high BMI and older age push predictions toward diabetes, while higher education and income are protective.

Fig. 2. K-Means clusters (k=4) projected into 2D PCA space (~74.5% variance explained by PC1 + PC2). The four segments separate along the first two components, with cluster assignment validated post-hoc against actual diabetes prevalence.

Fig. 3. Diabetes prevalence by K-Means cluster (post-hoc validation). Cluster 1 has the highest diabetes rate (27.2%), more than double that of Cluster 0 (13.0%), demonstrating that the unsupervised segmentation captures clinically meaningful risk structure.
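The post-hoc validation behind Fig. 3 amounts to grouping respondents by cluster label and computing the share with diabetes in each group, with labels that were never shown to the clustering step. A minimal sketch follows, using a hypothetical helper name and made-up toy labels. Only the final ratio uses figures quoted in the caption.

```python
from collections import defaultdict


def prevalence_by_cluster(cluster_ids, has_diabetes):
    """Return {cluster: share of members with diabetes}, sorted by cluster id."""
    counts = defaultdict(lambda: [0, 0])  # cluster -> [positives, total]
    for c, y in zip(cluster_ids, has_diabetes):
        counts[c][0] += int(y)
        counts[c][1] += 1
    return {c: pos / total for c, (pos, total) in sorted(counts.items())}


# Toy labels, made up for illustration only.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
has_diabetes = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]

rates = prevalence_by_cluster(cluster_ids, has_diabetes)
print(rates)  # {0: 0.25, 1: 0.5, 2: 0.0}

# Ratio between the highest and lowest prevalence quoted in the caption.
ratio = 27.2 / 13.0
print(round(ratio, 2))  # 2.09
```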