Feature Selection Predicts Cancer Information Seeking Behavior

The study uses 2022 Health Information National Trends Survey (HINTS 6) data from 5,505 U.S. adults to compare feature selection approaches for predicting cancer information seeking. Researchers applied Boruta, LASSO, principal component analysis (PCA), and no feature selection as selection strategies, then trained five classifiers: support vector machine (SVM), logistic regression (LR), random forest (RF), k-nearest neighbors (KNN), and XGBoost. The prevalence of cancer information seeking was 47.2%. Boruta and LASSO selected 45 and 55 variables, respectively, with 36 overlapping; PCA produced 21 uncorrelated factors. RF achieved the strongest results, with an area under the curve (AUC) of approximately 0.950 and accuracy around 0.860 using Boruta, LASSO, or no selection. PCA-based inputs lowered AUC to 0.931 with similar accuracy. Stepwise regression validated 21 of the 36 shared predictors, including personal/family cancer history, health information access, education, income, social media use, and smoking status. The results show that model performance is robust to the choice of selection method and provide a ranked predictor set for public health applications.
What happened - The preprint compares feature selection pipelines on the 2022 HINTS 6 dataset of 5,505 U.S. adults to predict whether individuals seek cancer information. The authors evaluated Boruta, LASSO, their combination, PCA, and no selection, and trained five classifiers: SVM, LR, RF, KNN, and XGBoost. Overall prevalence of information seeking was 47.2%. The Boruta and LASSO procedures selected 45 and 55 variables respectively, with 36 features in common; PCA yielded 21 factors.
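The reported overlap between the two selection methods can be sketched with plain set operations. The variable names below are hypothetical stand-ins chosen so the counts match the paper's figures; the actual HINTS 6 item names are not listed in this summary:

```python
# Sketch: quantifying agreement between two feature-selection methods.
# The ranges are arranged so the set sizes mirror the reported counts
# (45 Boruta features, 55 LASSO features, 36 shared); the names are
# placeholders, not real HINTS 6 variables.
boruta_selected = {f"var_{i}" for i in range(45)}       # 45 features (stand-in)
lasso_selected = {f"var_{i}" for i in range(9, 64)}     # 55 features (stand-in)

shared = boruta_selected & lasso_selected
print(len(boruta_selected), len(lasso_selected), len(shared))  # → 45 55 36
```

In the study, this 36-feature intersection is the candidate set that stepwise regression then narrowed to 21 validated predictors.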
Technical details - Modeling used standard supervised classifiers with feature sets derived from each selection method. Key empirical outcomes: RF produced the best discrimination with AUC approximately 0.950 and accuracy near 0.860 for Boruta, LASSO, and no feature selection. Using PCA-derived factors reduced AUC to 0.931 while maintaining similar accuracy (0.853). Stepwise regression confirmed 21 of the 36 shared predictors. The study highlights a stable core feature set that includes personal/family cancer history, measures of health information access, education, income, social media use, and smoking status. The paper does not report extensive ablation experiments on correlated inputs or computational cost comparisons across selection approaches.
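Since AUC is the headline metric, a minimal pure-Python version clarifies what the 0.950 figure measures: the probability that a randomly chosen information seeker receives a higher predicted score than a randomly chosen non-seeker. The labels and scores below are made up for illustration, not taken from the study:

```python
def auc_score(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive outranks a randomly chosen negative
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example with made-up predicted probabilities
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.5]
print(auc_score(y_true, y_score))  # ≈ 0.889
```

An AUC of 0.950, as reported for RF, means the model orders a random seeker/non-seeker pair correctly about 95% of the time, regardless of any classification threshold.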
Context and significance - For practitioners, the central takeaway is that when predicting a well-measured public health outcome on a moderate-sized survey, model choice (here RF) can dominate fine-grained differences among common feature selection pipelines. The near-identical performance of Boruta, LASSO, and no selection suggests RF is robust to noisy or correlated predictors in this setting. PCA reduced dimensionality at modest cost to discrimination, which could still be useful when multicollinearity or privacy-preserving factorization is required.
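The multicollinearity point can be illustrated with a minimal two-variable PCA: when predictors are strongly correlated, one principal component carries almost all of the variance, which is what makes PCA attractive despite the small AUC cost. The data below are made up for the sketch; the survey's 21-factor solution operates on far more variables:

```python
import math

# Two strongly correlated predictors (fabricated for illustration)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]

# Sample covariance matrix entries
n = len(x)
mx, my = sum(x) / n, sum(y) / n
cxx = sum((a - mx) ** 2 for a in x) / (n - 1)
cyy = sum((b - my) ** 2 for b in y) / (n - 1)
cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Closed-form eigenvalues of the 2x2 covariance matrix; these are the
# variances of the two principal components
mean_ev = (cxx + cyy) / 2
spread = math.sqrt(((cxx - cyy) / 2) ** 2 + cxy ** 2)
ev1, ev2 = mean_ev + spread, mean_ev - spread

explained = ev1 / (ev1 + ev2)
print(round(explained, 3))  # first component's share of total variance
```

Here the first component explains well over 95% of the variance, so dropping the second loses little information; at survey scale, the same mechanism let PCA compress the predictor set to 21 uncorrelated factors with only a modest drop in AUC.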
What to watch - Peer review should check for overfitting controls, calibration, class-imbalance handling, and external validation. Next steps include replication on independent surveys, systematic ablation to quantify feature importance, and deployment of the validated predictor set in targeted health-communication interventions.
Scoring Rationale
The study offers a practical, method-focused comparison useful to ML practitioners working with survey data and public-health prediction. It is methodologically relevant but not a frontier-model advance or paradigm shift, hence the mid-to-high notability score.


