XProf Unlocks Deep Kernel Profiling for TPUs

June 8, 2026 | By LDS Team

Relevance Score: 6.2

In a Google Open Source blog post, Google introduces enhancements to XProf, its accelerator profiler for the JAX/XLA ecosystem, that expose cycle-level visibility and runtime telemetry for custom TPU kernels. The update targets kernels authored with frameworks such as Pallas, Mosaic and Triton, addressing what the post describes as optimization blind spots, where legacy profilers capture custom compilation paths as opaque, single-block traces. Per Google, XProf can now visualize Low-Level Operations (LLO) bundle data - the machine instructions issued to the TPU's functional units each clock cycle - and uses dynamic instrumentation to record exact execution times and block-utilization metrics rather than static compiler estimates. The post does not publish end-to-end speedups or external benchmarks. XProf is open source via the OpenXLA project and supports JAX, PyTorch/XLA and TensorFlow.

What happened

Per a Google Open Source blog post, XProf - Google's accelerator profiler for the JAX/XLA stack - now provides cycle-level profiling and runtime telemetry for custom TPU kernels, including those written with Pallas, Mosaic and Triton. The post says legacy profilers often render custom compilation paths as opaque, single execution blocks, leaving optimization blind spots, and that XProf can now visualize Low-Level Operations (LLO) bundle data - the instructions issued to the TPU's functional units each clock cycle. It does not publish end-to-end speedups or external benchmarks.

Technical details

Custom kernels compiled via Pallas (which lowers to Mosaic on TPU and to Triton on GPU) frequently produce traces that traditional profilers cannot map back to high-level ops. According to Google, XProf uses dynamic instrumentation to record trace events at the point each bundle executes, yielding measured execution times and block-utilization metrics instead of static compiler estimates, so developers can check whether the compiler's instruction scheduling matches their intent.
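
To make the difference between static estimates and measured timings concrete, here is a minimal Python sketch. It does not use XProf's API or its real trace format: the `BundleSample` record, its fields, the sample numbers and the 1.5x threshold are all assumptions made for illustration. It flags bundles whose measured cycle count diverges from the compiler's estimate, which is the kind of check that measured data makes possible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BundleSample:
    """Hypothetical per-bundle record; NOT XProf's actual trace schema."""
    bundle_id: int
    estimated_cycles: int  # static compiler estimate
    measured_cycles: int   # value recorded at run time


def find_divergent(samples, ratio_threshold=1.5):
    """Return (bundle_id, ratio) pairs where measured/estimated >= threshold."""
    divergent = []
    for s in samples:
        if s.estimated_cycles <= 0:
            continue  # skip records with no usable estimate
        ratio = s.measured_cycles / s.estimated_cycles
        if ratio >= ratio_threshold:
            divergent.append((s.bundle_id, ratio))
    # Worst offenders first
    return sorted(divergent, key=lambda pair: pair[1], reverse=True)


# Made-up numbers for illustration only
samples = [
    BundleSample(0, 4, 4),
    BundleSample(1, 4, 12),
    BundleSample(2, 8, 9),
    BundleSample(3, 2, 5),
]
divergent = find_divergent(samples)
for bundle_id, ratio in divergent:
    print(f"bundle {bundle_id}: measured/estimated = {ratio:.2f}")
```

In a real workflow the samples would come from whatever the profiler exports; the comparison logic itself is independent of that format.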

Why it matters

Finer-grained, cycle-accurate telemetry typically shortens the iterate-measure-optimize loop for kernel authors and helps correlate codegen decisions, vectorization and memory-access patterns with per-core stalls. Wider availability of such traces also supports regression testing for new compiler passes across compiler toolchains. The audience is narrow - teams writing custom TPU kernels - but for them this removes a recurring source of debugging friction.
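
The regression-testing use can be sketched as a simple gate. This is an illustration under stated assumptions: `check_regressions`, the kernel names, the microsecond values and the 5% tolerance are hypothetical, and in practice the timings would come from profiler output and not from hard-coded dictionaries.

```python
def check_regressions(baseline_us, candidate_us, tolerance=0.05):
    """Compare per-kernel timings (microseconds) between two runs.

    Returns a dict mapping kernel name -> fractional slowdown for kernels
    whose candidate time exceeds the baseline by more than `tolerance`.
    """
    regressions = {}
    for name, base in baseline_us.items():
        if name not in candidate_us or base <= 0:
            continue  # kernel missing from candidate run, or unusable baseline
        slowdown = (candidate_us[name] - base) / base
        if slowdown > tolerance:
            regressions[name] = slowdown
    return regressions


# Hypothetical timings before and after a compiler change
baseline = {"attention_fwd": 100.0, "layernorm": 20.0, "matmul_tile": 50.0}
candidate = {"attention_fwd": 120.0, "layernorm": 20.5, "matmul_tile": 45.0}

regressions = check_regressions(baseline, candidate)
for name, slowdown in regressions.items():
    print(f"{name}: {slowdown:.1%} slower than baseline")
```

A relative tolerance is used here so that small run-to-run noise does not fail the check; the right threshold depends on how stable the measurements are.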

What to watch

Indicators to follow include published case studies or benchmarks using XProf on representative workloads, deeper trace-mapping annotations from upstream compiler projects, and integration of XProf telemetry into continuous performance tests for TPU training and inference.

Bottom line

XProf adds deeper, cycle-level kernel telemetry for TPUs, reducing a class of profiler blind spots for custom compilation paths and giving Pallas and Triton developers measured rather than estimated execution data.

Key Points

1. XProf now exposes cycle-level traces and LLO bundle telemetry for custom TPU kernels, addressing profiler blind spots Google reports for Pallas/Mosaic/Triton code.
2. Why it matters: dynamic instrumentation records exact per-bundle execution times and utilization, letting developers verify the compiler's instruction scheduling against intent.
3. So what: finer-grained telemetry shortens the iterate-measure-optimize loop for TPU kernel authors and supports regression testing of new codegen passes.

Scoring Rationale

Genuine, primary-sourced tooling upgrade (Google Open Source) adding cycle-level and LLO-bundle telemetry to XProf for custom TPU kernels - corroborated by OpenXLA and Google Cloud documentation. Useful and concrete for the narrow audience of Pallas/Mosaic/Triton kernel authors rather than a frontier or industry-wide event, so it sits in the solid niche-tool range. Adjusted from 6.8 to 6.2.

Sources

Primary source and supporting public references used for this report.

Primary source: blogger.com, "Unlocking TPU performance: Deep kernel profiling with XProf"
