Research, parquet, electronic health records, xgboost, multilabel
Parquet Pipeline Improves Clinical Data Processing Efficiency
Relevance Score: 8.1
Yonsei University researchers (JMIR Med Inform, 2026) evaluated a Parquet-based end-to-end pipeline on 13.76 million EHR rows, comparing Parquet, CSV, PostgreSQL, and DuckDB across storage, processing, modeling, and privacy. Parquet reduced disk access time from 940.2 to 44.2 seconds (roughly a 21-fold reduction) and cut feature-transformation and training latencies. Multilabel GPU XGBoost classifier chains preserved predictive performance (P<.001), while membership inference attacks performed at chance (AUC=0.500).
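To make the "at chance (AUC=0.500)" figure concrete, here is a minimal sketch of how an attack AUC can be read: it is the probability that a randomly chosen training-set member receives a higher attack score than a randomly chosen non-member, with ties counted as half. The function name `attack_auc` and the score lists are hypothetical illustrations, not the study's data or its attack implementation.

```python
from typing import Sequence


def attack_auc(member_scores: Sequence[float],
               nonmember_scores: Sequence[float]) -> float:
    """Pairwise AUC: P(member score > non-member score), ties count 0.5."""
    if not member_scores or not nonmember_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical attack scores (assumed values, for illustration only).
# Identical score distributions: the attack cannot tell the groups apart.
uninformative = attack_auc([0.2, 0.5, 0.8], [0.2, 0.5, 0.8])
# Fully separated scores: the attack identifies every member.
perfect = attack_auc([0.9, 0.8], [0.1, 0.2])

print(uninformative, perfect)
```

An AUC near 0.5 means the attacker's scores carry no usable signal about who was in the training data, while values approaching 1.0 would indicate leakage.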



