Machine Learning Selects Optimal Threads for GEMM
In an arXiv preprint submitted Jan 14, 2026, Yufan Xia presents ADSALA, a proof-of-concept library that uses an on-the-fly machine learning model to select the optimal thread count for GEMM (general matrix multiplication). Tests on two-socket Intel Cascade Lake and two-socket AMD Zen 3 nodes report 25–40% speedups over traditional BLAS GEMM for workloads using up to 100 MB of memory. The approach targets the difficulty of tuning thread parallelism on multi-core shared-memory systems.
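To make the idea of model-driven thread selection concrete, here is a minimal Python sketch. It is not the ADSALA implementation, and the preprint's actual model and features are not described in this summary. The sketch assumes a toy analytic cost function (`toy_runtime`) as a stand-in for real benchmark timings, a hypothetical list of candidate thread counts (`THREAD_CHOICES`), and a 1-nearest-neighbour lookup in place of whatever model the library uses. All names and constants are illustrative.

```python
import math

# Hypothetical candidate thread counts; a real library would derive these
# from the machine's core count.
THREAD_CHOICES = [1, 2, 4, 8, 16, 32, 48]


def toy_runtime(m, k, n, threads):
    """Synthetic cost model (an assumption, NOT measured data).

    Compute time shrinks with more threads, while a per-thread
    overhead term grows, so small problems favour few threads.
    """
    flops = 2.0 * m * k * n
    compute = flops / (1e9 * threads)   # idealised parallel compute time
    overhead = 2e-5 * threads           # made-up synchronisation cost
    return compute + overhead


def build_training_set(sizes):
    """Label each (m, k, n) with the thread count that minimises the toy cost."""
    samples = []
    for m, k, n in sizes:
        best = min(THREAD_CHOICES, key=lambda t: toy_runtime(m, k, n, t))
        features = (math.log2(m), math.log2(k), math.log2(n))
        samples.append((features, best))
    return samples


def predict_threads(samples, m, k, n):
    """1-nearest-neighbour prediction in log-size feature space."""
    query = (math.log2(m), math.log2(k), math.log2(n))
    _, best = min(samples, key=lambda s: math.dist(s[0], query))
    return best


if __name__ == "__main__":
    # "Train" on square problems from 16^3 up to 2048^3.
    train = build_training_set([(2**i, 2**i, 2**i) for i in range(4, 12)])
    for shape in [(20, 20, 20), (300, 300, 300), (2048, 2048, 2048)]:
        print(shape, "->", predict_threads(train, *shape), "threads")
```

In a real setting the predicted count would be passed to the BLAS runtime before the GEMM call, and the training labels would come from timed runs on the target machine rather than from a formula.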
Key Points
- Demonstrates ML-based on-the-fly selection of thread counts for GEMM, achieving 25–40% speedups.
- Addresses multi-core shared-memory tuning complexity across architectures (Intel Cascade Lake, AMD Zen 3).
- Enables practitioners to auto-tune thread parallelism for BLAS GEMM on workloads of up to 100 MB.
Scoring Rationale
Strong cross-architecture ML optimization with actionable speedups, limited by validation in a single arXiv preprint and a workload scope of up to 100 MB.