Parquet Pipeline Improves Clinical Data Processing Efficiency

Yonsei University researchers (JMIR Med Inform, 2026) evaluated a Parquet-based end-to-end pipeline on 13.76 million EHR rows, comparing Parquet, CSV, PostgreSQL, and DuckDB on storage, processing, modeling, and privacy. Parquet reduced disk access time from 940.2 to 44.2 seconds and cut feature-transformation and training latencies. Multilabel GPU XGBoost classifier chains preserved predictive performance (P<.001), and membership inference attacks performed at chance (AUC=0.500).
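
As a quick check on the reported timings, the sketch below works out the relative reduction and the implied speedup factor from the two disk-access figures. The variable and function names are illustrative only, and the summary does not say which baseline format the 940.2-second figure belongs to, so it is labeled generically here.

```python
# Disk-access timings as reported in the summary (seconds).
baseline_seconds = 940.2  # slower configuration; exact baseline format not stated here
parquet_seconds = 44.2    # Parquet-based pipeline


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100.0


def speedup_factor(before: float, after: float) -> float:
    """How many times faster `after` is compared with `before`."""
    return before / after


reduction = percent_reduction(baseline_seconds, parquet_seconds)
speedup = speedup_factor(baseline_seconds, parquet_seconds)

print(f"Reduction: {reduction:.1f}%")
print(f"Speedup:   {speedup:.1f}x")
```

The same two figures can therefore be read either as a roughly 95% cut in disk access time or as a speedup of a little over 21 times.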
Key Points
- Demonstrates that Parquet reduces disk access time from 940.2 s to 44.2 s (a 95.3% reduction) on 13.76M rows
- Shows that predictive performance remains statistically equivalent across metrics (P<.001) using GPU XGBoost
- Indicates that scalable clinical workflows are achievable without increasing privacy risk; membership inference AUC=0.500
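
To make the AUC=0.500 figure concrete, here is a minimal sketch of how a membership inference AUC can be computed and why 0.5 means "no better than guessing". It is not the study's attack or evaluation code. The `membership_auc` helper and all score values are hypothetical, chosen only to illustrate the metric.

```python
def membership_auc(member_scores, nonmember_scores):
    """AUC as the probability that a randomly chosen training member
    receives a higher attack score than a randomly chosen non-member.
    Ties count as half a win (the Mann-Whitney formulation)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical attack scores, not data from the study.
# An attack that cannot tell members from non-members gives both
# groups the same scores, so its AUC is 0.5 (chance level).
uninformative = membership_auc([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

# An attack that perfectly separates the groups would score 1.0.
perfect = membership_auc([0.9, 0.8], [0.1, 0.2])

print(uninformative, perfect)
```

An AUC near 0.5 means the attack ranks members and non-members essentially at random, which is the sense in which the reported result suggests no added privacy risk.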
Scoring Rationale
High practical impact and peer-reviewed validation, with modest novelty limited to engineering-level improvements rather than conceptual advances.


