Researcher Uncovers Qualcomm NPU Compiler Internals

Sagnik Bhattacharjee (datavorous) published a June 2026 reverse-engineering analysis of stripped shared-object files from Qualcomm QAIRT v2.46.0.260424, targeting libHtpPrepare.so with Ghidra and empirical testing on Linux. He reports five findings that he states were not previously documented publicly: the HTP compiler formulates VTCM placement as a Mixed Integer Linear Program (MILP) solved by the open-source HiGHS optimizer rather than by heuristics; a Priority BFS scheduler minimises peak on-chip working-set size before placement begins; the compiler silently inserts float32-to-FP16/BF16 downcasts via relaxed_precision_cast during placement, without notifying the user; a compiled metadata field, spillFillBufferSize, can serve as a static fit-or-spill oracle; and HTP contains a hidden analytical simulator called Hextimate that estimates performance with a textbook roofline formula and models resource contention. Empirical tests showed a 33x difference in DDR read traffic between SM8350 and SM8650 running the same model, despite both reporting the same nominal VTCM size.
What happened
Sagnik Bhattacharjee, an edge ML practitioner working under the handle datavorous, published a June 2026 reverse-engineering writeup based on decompilation of stripped *.so files from Qualcomm QAIRT v2.46.0.260424 using Ghidra and empirical parameter sweeping on Linux. The primary analysis target is libHtpPrepare.so (BuildID 63e60947ee8df89fe11592a8af12a30ddedb91cd). Bhattacharjee states all five core findings were not previously documented publicly for this SDK version.
Key findings
- VTCM placement as MILP: HTP writes the tensor placement problem as a formal Mixed Integer Linear Program and passes it to HiGHS, an open-source optimization solver, rather than using heuristics. The objective minimises total bytes spilled to DDR, filled back from DDR, and sent between cores on multi-core chips. The compiler can dump the problem to a .mps file for debugging.
- Priority BFS scheduler: Before placement, HTP uses a Priority BFS traversal to find the computation order that minimises peak VTCM working-set size, using depth-first topological ordering as the base metric with tie-breaking heuristics. The scheduler classifies the outcome as SMALL (no spill) or LARGE (spill expected) based on whether peak usage fits within VTCM capacity.
- Silent precision rewrites: Operations called relaxed_precision_cast convert tensors between float32, FP16, and BF16 during placement to relieve memory pressure. The user receives no notification. Bhattacharjee confirmed these casts are inserted during placement but notes uncertainty on whether they are variables inside the MILP solver or a separate post-pass.
- spillFillBufferSize oracle: The compiled binary carries a metadata field spillFillBufferSize; when 0, model weights fit entirely on-chip. Bhattacharjee proposes this as a static diagnostic for slow edge inference, allowing practitioners to quickly determine whether quantisation is needed for a given target chip.
- Hextimate simulator: HTP contains an undocumented analytical simulator named Hextimate. It runs two passes - one assuming perfect resource overlap, one assuming none - and returns a performance range. The memory-side roofline formula recovered from machine code is: bandwidth = channels * width * efficiency * frequency; time = bytes / bandwidth. Hextimate includes dedicated detectors for FlashAttention, MoE architectures, KV caches, and rotary embeddings.
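The roofline formula quoted for Hextimate can be sketched in a few lines of Python. This is an illustration only, not recovered code: the class and function names, the units, and the sample values are all hypothetical. In particular, treating "perfect overlap" as the maximum of compute and memory time and "no overlap" as their sum is an assumption about how the two passes might bound the estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemorySystem:
    """Hypothetical container for the four roofline inputs named in the writeup."""
    channels: int          # number of memory channels
    width_bytes: float     # bytes moved per channel per cycle (assumed unit)
    efficiency: float      # achieved fraction of peak bandwidth, 0..1
    frequency_hz: float    # memory clock in cycles per second

    def bandwidth(self) -> float:
        # bandwidth = channels * width * efficiency * frequency
        return self.channels * self.width_bytes * self.efficiency * self.frequency_hz


def memory_time(bytes_moved: float, mem: MemorySystem) -> float:
    # time = bytes / bandwidth
    return bytes_moved / mem.bandwidth()


def estimate_range(bytes_moved: float, compute_seconds: float, mem: MemorySystem):
    """Return (best, worst) time bounds under two overlap assumptions.

    Assumption: perfect overlap hides the shorter phase behind the longer one,
    while no overlap runs the two phases back to back.
    """
    t_mem = memory_time(bytes_moved, mem)
    best = max(t_mem, compute_seconds)   # perfect overlap
    worst = t_mem + compute_seconds      # no overlap
    return best, worst


# Illustrative numbers only; they do not describe any real chip.
mem = MemorySystem(channels=4, width_bytes=2, efficiency=0.5, frequency_hz=1e9)
best, worst = estimate_range(bytes_moved=4_000_000, compute_seconds=0.002, mem=mem)
print(f"bandwidth: {mem.bandwidth():.3e} B/s, range: {best:.4f}s to {worst:.4f}s")
```

Reporting a range rather than a single number reflects the uncertainty about how much memory traffic actually hides behind compute on a given workload.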
Empirical observation
Bhattacharjee tested Qwen 0.8B on SM8350 and SM8650 (V75). SM8350 reported 5.46 MB spilled and 33.9 MB filled (37.9 MB total DDR read), while SM8650 produced no spills (1.15 MB DDR). That is a 33x difference in DDR read traffic for the same model, despite both chips reporting the same nominal vtcmSize value in compiler output.
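The 33x figure follows directly from the two reported DDR read totals, and the fit-or-spill check can be expressed as a single comparison. The sketch below is hedged: the writeup does not say how spillFillBufferSize is read out of a compiled binary, so the metadata dictionary and the fits_on_chip helper are hypothetical stand-ins for whatever parsing step a practitioner would use.

```python
# DDR read totals as reported in the writeup, in MB.
SM8350_DDR_READ_MB = 37.9
SM8650_DDR_READ_MB = 1.15

ratio = SM8350_DDR_READ_MB / SM8650_DDR_READ_MB
print(f"DDR read traffic ratio: {ratio:.1f}x")  # prints "DDR read traffic ratio: 33.0x"


def fits_on_chip(metadata: dict) -> bool:
    """Hypothetical oracle: a zero spillFillBufferSize means no spill buffer was needed.

    `metadata` is assumed to be a dict already parsed from the compiled
    artifact; the parsing step itself is not described in the source.
    """
    return metadata["spillFillBufferSize"] == 0


# Placeholder values for illustration, not measurements from either chip.
print(fits_on_chip({"spillFillBufferSize": 0}))     # True
print(fits_on_chip({"spillFillBufferSize": 4096}))  # False
```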
Caveats
The analysis covers one binary version; a different QAIRT release could change any finding. Bhattacharjee notes pending legal review before publishing the full RE methodology. All claims rest on a single researcher's work and have not been independently replicated.
Scoring Rationale
Technically detailed reverse-engineering of undocumented Qualcomm NPU compiler internals with five specific findings directly actionable for edge ML deployment on Qualcomm hardware. Single-source, single-researcher origin without independent corroboration limits the score. Adjusted from 7.0 to 6.2 - solid notable range for edge ML practitioners, appropriate for a deep technical blog without broader industry pickup.
