XProf Unlocks Deep Kernel Profiling for TPUs

In a Google Open Source blog post, Google introduces enhancements to XProf, its accelerator profiler for the JAX/XLA ecosystem, that expose cycle-level visibility and runtime telemetry for custom TPU kernels. The update targets kernels authored with frameworks such as Pallas, Mosaic and Triton, addressing what the post describes as optimization blind spots, where legacy profilers capture custom compilation paths as opaque, single-block traces. Per Google, XProf can now visualize Low-Level Operations (LLO) bundle data - the machine instructions issued to the TPU's functional units each clock cycle - and uses dynamic instrumentation to record exact execution times and block-utilization metrics rather than static compiler estimates. The post does not publish end-to-end speedups or external benchmarks. XProf is open source via the OpenXLA project and supports JAX, PyTorch/XLA and TensorFlow.
What happened
Per a Google Open Source blog post, XProf - Google's accelerator profiler for the JAX/XLA stack - now provides cycle-level profiling and runtime telemetry for custom TPU kernels, including those written with Pallas, Mosaic and Triton. The post says legacy profilers often render custom compilation paths as opaque, single execution blocks, leaving optimization blind spots, and that XProf can now visualize Low-Level Operations (LLO) bundle data - the instructions issued to the TPU's functional units each clock cycle. It does not publish end-to-end speedups or external benchmarks.
Technical details
Custom kernels compiled via Pallas (which lowers to Mosaic on TPU and to Triton on GPU) frequently produce traces that traditional profilers cannot map back to high-level ops. According to Google, XProf uses dynamic instrumentation to insert trace points that fire when a bundle executes. This yields measured execution times and block-utilization metrics instead of static compiler estimates, so developers can check whether the compiler's instruction scheduling matches their intent.
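To make the estimate-versus-measurement distinction concrete, here is a minimal Python sketch of the kind of check a kernel author might run once both numbers are available. It does not use XProf's API or data format: the `BlockSample` record, its field names, the sample values and the 25% tolerance are all hypothetical, chosen only to illustrate the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockSample:
    """Hypothetical per-block record; NOT XProf's real schema."""
    name: str
    estimated_cycles: int  # static compiler estimate (assumed field)
    measured_cycles: int   # cycles recorded at run time (assumed field)
    busy_cycles: int       # measured cycles spent on useful work (assumed field)


def utilization(sample: BlockSample) -> float:
    """Fraction of measured cycles in which the block did useful work."""
    if sample.measured_cycles == 0:
        return 0.0
    return sample.busy_cycles / sample.measured_cycles


def flag_divergent(samples, tolerance=0.25):
    """Names of blocks whose measured cycles deviate from the static
    estimate by more than `tolerance` (relative to the estimate)."""
    flagged = []
    for s in samples:
        if s.estimated_cycles == 0:
            # No usable estimate: flag only if the block actually ran.
            if s.measured_cycles:
                flagged.append(s.name)
            continue
        relative_gap = abs(s.measured_cycles - s.estimated_cycles) / s.estimated_cycles
        if relative_gap > tolerance:
            flagged.append(s.name)
    return flagged


# Made-up illustrative values, not measurements from any real kernel.
samples = [
    BlockSample("load_tile", estimated_cycles=100, measured_cycles=110, busy_cycles=88),
    BlockSample("matmul", estimated_cycles=400, measured_cycles=640, busy_cycles=320),
    BlockSample("store_tile", estimated_cycles=80, measured_cycles=80, busy_cycles=60),
]

if __name__ == "__main__":
    for s in samples:
        print(f"{s.name}: utilization={utilization(s):.2f}")
    print("divergent blocks:", flag_divergent(samples))
```

In this sketch a block whose measured cost is far from the estimate is the place to start looking at the schedule, which is the workflow the post describes in general terms.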
Why it matters (editorial analysis)
Finer-grained, cycle-level telemetry typically shortens the iterate-measure-optimize loop for kernel authors and helps correlate codegen decisions, vectorization and memory-access patterns with per-core stalls. Wider availability of such traces also supports regression testing when new compiler passes land. The audience is narrow - teams writing custom TPU kernels - but for them this removes a recurring source of debugging friction.
What to watch
Indicators to follow include published case studies or benchmarks using XProf on representative workloads, deeper trace-mapping annotations from upstream compiler projects, and integration of XProf telemetry into continuous performance tests for TPU training and inference.
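As a rough illustration of what the last indicator could look like in practice, the Python sketch below gates a build on per-kernel timing regressions. It assumes timings have already been exported from a profiler into plain dictionaries; the kernel names, the cycle counts, the 5% threshold and the `regressions` helper are hypothetical and are not part of XProf.

```python
from statistics import median


def regressions(baseline, current, max_slowdown=0.05):
    """Return {kernel_name: relative_slowdown} for kernels whose median
    measured time grew by more than `max_slowdown` versus the baseline.

    Both arguments map a kernel name to a list of measured durations
    (any consistent unit, e.g. cycles). Kernels missing from `current`
    or with empty samples are skipped rather than guessed at.
    """
    slow = {}
    for name, base_runs in baseline.items():
        runs = current.get(name)
        if not base_runs or not runs:
            continue
        base = median(base_runs)
        now = median(runs)
        if base <= 0:
            continue
        slowdown = (now - base) / base
        if slowdown > max_slowdown:
            slow[name] = slowdown
    return slow


# Made-up durations for two hypothetical kernels.
baseline = {"attention_fwd": [1000, 1010, 990], "layernorm": [200, 205, 198]}
current = {"attention_fwd": [1100, 1120, 1090], "layernorm": [201, 199, 203]}

if __name__ == "__main__":
    found = regressions(baseline, current)
    for name, slowdown in found.items():
        print(f"{name}: {slowdown:.1%} slower than baseline")
    # A CI job could exit non-zero here to fail the build.
    raise SystemExit(1 if found else 0)
```

Using the median of several runs rather than a single sample is one simple way to keep run-to-run noise from tripping the gate; a real pipeline would likely need a more careful statistical test.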
Bottom line
XProf adds deeper, cycle-level kernel telemetry for TPUs, reducing a class of profiler blind spots for custom compilation paths and giving Pallas and Triton developers measured rather than estimated execution data.
Scoring Rationale
Genuine, primary-sourced tooling upgrade (Google Open Source) adding cycle-level and LLO-bundle telemetry to XProf for custom TPU kernels - corroborated by OpenXLA and Google Cloud documentation. Useful and concrete for the narrow audience of Pallas/Mosaic/Triton kernel authors rather than a frontier or industry-wide event, so it sits in the solid niche-tool range. Adjusted from 6.8 to 6.2.
