Phonological ML Identifies Non-Mainstream Sulawesi Vocabulary

An arXiv preprint submitted March 11, 2026, applies rule-based cognate subtraction and an XGBoost phonological classifier to 1,357 basic-vocabulary forms from six Austronesian languages of Sulawesi. The study identifies 438 candidate non-mainstream forms (26.5%) and reports a classifier AUC of 0.763, with 266 high-confidence cases. It finds no coherent substrate word families, but notes higher predicted non-mainstream rates in Sulawesi (mean 0.606) than in western Indonesia (0.393).
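
The two ideas named above, cognate subtraction and AUC, can be illustrated with a minimal standard-library sketch. It is not the preprint's code: the `Form` record, the `subtract_cognates` rule (drop any form whose concept and cognate class appear in a set of inherited classes), and all toy data are hypothetical placeholders. The `auc` function uses the standard pairwise definition of AUC and makes no use of XGBoost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    """One basic-vocabulary entry (hypothetical record layout)."""
    language: str
    concept: str
    cognate_class: str  # assumed cognacy label; placeholder values below


def subtract_cognates(forms, inherited_classes):
    """Keep only forms whose (concept, cognate_class) is NOT marked inherited.

    This exact-match rule is an illustrative assumption, not the paper's rule set.
    """
    return [f for f in forms if (f.concept, f.cognate_class) not in inherited_classes]


def auc(labels, scores):
    """Pairwise AUC: chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy placeholder data, not drawn from any real database.
forms = [
    Form("lang_a", "eye", "1"),
    Form("lang_a", "stone", "7"),
    Form("lang_b", "eye", "1"),
    Form("lang_b", "stone", "2"),
]
inherited = {("eye", "1"), ("stone", "2")}

candidates = subtract_cognates(forms, inherited)
candidate_rate = len(candidates) / len(forms)

# Toy classifier outputs: label 1 = non-mainstream, 0 = inherited.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.3, 0.6, 0.8, 0.1]
toy_auc = auc(labels, scores)

print(f"candidate rate: {candidate_rate:.2f}, toy AUC: {toy_auc:.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so a value such as the reported 0.763 sits between the two.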
Scoring Rationale
Fresh arXiv research applies ML to a sizable historical-linguistic dataset with measurable results (AUC=0.763) and practical methods, giving moderate novelty and actionability. The score is tempered because this is a single preprint that has not been peer-reviewed and addresses a specialist subfield rather than an industry-wide development.
Sources
- [2604.00023] Phonological Fossils: Machine Learning Detection of Non-Mainstream Vocabulary in Sulawesi Basic Lexicon (arxiv.org)


