ThunderKittens Releases 2.0 CUDA DSL and Kernels

ThunderKittens, an open-source CUDA-embedded DSL, has released version 2.0. The release adds MXFP8/NVFP4 support, CLC (cluster launch control) scheduling, tensor memory controllability, PDL (programmatic dependent launch), multi-GPU features, a major internal refactor, and a simplified build system. The project reports BF16, MXFP8, and NVFP4 GEMM kernels that match or exceed cuBLAS on NVIDIA B200 GPUs, and the announcement includes a technical deep dive on memory consistency, tensor-core pipelining, PTX hinting, occupancy, and benchmarking.
Scoring Rationale
Practical performance gains and usable tools; relevance is limited to kernel developers and NVIDIA-focused GPU optimization.

