KEDA Autoscaling Enables GPU Inference on Kubernetes

In a contributed Cloud Native Now article, cloud-native architect Pavan Madduri shows how KEDA can drive event-based autoscaling for GPU-backed AI inference on Kubernetes, replacing CPU-based Horizontal Pod Autoscaling with hardware-aware metrics. He outlines a three-layer design: a telemetry layer exposing GPU SM utilization and VRAM allocation (via exporters such as NVIDIA DCGM), a translation layer implemented as a KEDA external scaler, and an execution layer where the KEDA operator scales inference pods, including scale-to-zero during idle periods. Madduri says he open-sourced an external scaler, keda-gpu-scaler, to feed GPU telemetry into KEDA, and a CNCF community blog covers the same approach. The piece argues standard HPA is a poor fit for GPU-bound inference because it keys on CPU and memory, and that scale-to-zero can cut idle-GPU spend. It is a how-to from a single practitioner, not a vendor product launch.
What happened
In a contributed Cloud Native Now article, cloud-native architect Pavan Madduri demonstrates using KEDA (Kubernetes Event-driven Autoscaling) to autoscale GPU-backed AI inference based on GPU telemetry rather than CPU or memory. He describes a three-layer architecture - a telemetry layer exposing GPU SM utilization and VRAM allocation (for example via NVIDIA DCGM exporters), a translation layer implemented as a KEDA external scaler, and an execution layer where the KEDA operator adjusts replica counts and can scale to zero. Madduri says he open-sourced an external scaler, keda-gpu-scaler, to bridge GPU telemetry into KEDA; a CNCF community blog covers the same external-scaler approach.
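The three layers can be illustrated with a minimal Python sketch. It is not the code of keda-gpu-scaler, and the article's exact metrics and aggregation are not known here. The sketch assumes the telemetry arrives as Prometheus-style text using the dcgm-exporter gauge `DCGM_FI_DEV_GPU_UTIL`, that the translation step simply averages per-GPU readings, and that the execution step follows the proportional rule documented for Kubernetes HPA (ignoring tolerance and stabilization windows). All sample readings, label values, and function names are hypothetical.

```python
import math

# Hypothetical exporter output in the Prometheus text format.
# DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_FB_USED are dcgm-exporter gauges;
# the labels and readings are invented for this example.
SAMPLE_METRICS = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 92
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 88
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="node-a"} 30512
"""


def read_gpu_utilization(text, metric="DCGM_FI_DEV_GPU_UTIL"):
    """Telemetry layer: collect one utilization reading per GPU."""
    readings = []
    for line in text.splitlines():
        # Comment lines start with '#', so they never match the metric prefix.
        if line.startswith(metric + "{") or line.startswith(metric + " "):
            readings.append(float(line.rsplit(" ", 1)[1]))
    return readings


def to_scaler_metric(readings):
    """Translation layer: reduce readings to one value a scaler could report.

    Averaging is an assumption; a real scaler might use max or a percentile.
    """
    return sum(readings) / len(readings) if readings else 0.0


def desired_replicas(current_replicas, metric_value, target_value, max_replicas):
    """Execution layer: desired = ceil(current * metric / target), clamped.

    Covers only the 1..N range; scaling between 0 and 1 is handled separately.
    """
    desired = math.ceil(current_replicas * metric_value / target_value)
    return max(1, min(desired, max_replicas))


readings = read_gpu_utilization(SAMPLE_METRICS)
metric_value = to_scaler_metric(readings)
replicas = desired_replicas(current_replicas=2, metric_value=metric_value,
                            target_value=80.0, max_replicas=8)
print(readings, metric_value, replicas)
```

With two replicas averaging 90 percent utilization against an 80 percent target, the proportional rule asks for three replicas.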
How it works
Per the article, standard Horizontal Pod Autoscaling (HPA) is ineffective for GPU-bound inference because it relies on CPU and memory metrics while inference bottlenecks occur on GPU compute (SM utilization) and VRAM. The external scaler translates hardware telemetry into KEDA metrics, replacing more complex Prometheus-plus-HPA configurations and enabling a ScaledObject to scale, for example, when GPU utilization exceeds 80 percent. The author targets high-performance setups such as Oracle Kubernetes Engine with bare-metal GPU shapes, but notes the pattern ports to other clusters that expose GPU telemetry.
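As a rough illustration of the ScaledObject described above, the following Python sketch builds a manifest with an `external` trigger and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The top-level fields (`scaleTargetRef`, `minReplicaCount`, `maxReplicaCount`, `cooldownPeriod`, and the `scalerAddress` metadata key) are standard KEDA ScaledObject fields. Everything else is an assumption: the resource names, the service address, and the `targetUtilization` key are hypothetical, because the metadata keys that keda-gpu-scaler actually reads are not given in the article.

```python
import json


def build_scaled_object(name, deployment, scaler_address,
                        target_utilization=80, max_replicas=4):
    """Return a KEDA ScaledObject manifest as a plain dict."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            # Zero lets KEDA remove all pods when the scaler reports no activity.
            "minReplicaCount": 0,
            "maxReplicaCount": max_replicas,
            # Seconds to wait after the last active signal before scaling to zero.
            "cooldownPeriod": 300,
            "triggers": [
                {
                    "type": "external",
                    "metadata": {
                        # gRPC endpoint of the external scaler (hypothetical address).
                        "scalerAddress": scaler_address,
                        # Hypothetical key: trigger metadata values must be strings,
                        # and their meaning is defined by the scaler itself.
                        "targetUtilization": str(target_utilization),
                    },
                }
            ],
        },
    }


manifest = build_scaled_object(
    name="llm-inference-gpu",                       # hypothetical
    deployment="llm-inference",                     # hypothetical
    scaler_address="gpu-scaler.keda.svc.cluster.local:9090",  # hypothetical
)
print(json.dumps(manifest, indent=2))
```

Generating the manifest from code rather than hand-writing it is only a convenience here; the same structure can be written directly as YAML.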
Editorial analysis
Industry context
Production LLM and generative-AI inference commonly hits a mismatch between cluster autoscalers and accelerator-level utilization. External scalers that consume hardware telemetry are a recurring cloud-native pattern for cutting idle accelerator cost while preserving latency, and a minReplicaCount of zero lets idle GPU pods terminate entirely, which HPA does not do by default.
What to watch
Watch adoption of keda-gpu-scaler and similar scalers, standardization of GPU telemetry exports, and integration with GPU node-pool autoscaling. Practical signals include community contributions, multi-tenant examples, and documented scale-to-zero behavior where node provisioning time affects request latency.
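The latency concern around scale-to-zero can be made concrete with a small back-of-the-envelope sketch. It assumes the first request after an idle period waits for node provisioning (if no GPU node is running), then image pull, then model load. That breakdown and every number below are hypothetical, not measurements from the article.

```python
def cold_start_seconds(node_provision_s, image_pull_s, model_load_s,
                       node_already_available):
    """Estimate the wait for the first request after scaling from zero."""
    pod_start = image_pull_s + model_load_s
    if node_already_available:
        return pod_start
    return node_provision_s + pod_start


def fits_latency_budget(cold_start_s, budget_s):
    """True if the estimated cold start stays within the allowed wait."""
    return cold_start_s <= budget_s


# Hypothetical timings, in seconds.
warm_node = cold_start_seconds(240, 45, 60, node_already_available=True)
cold_node = cold_start_seconds(240, 45, 60, node_already_available=False)
print(warm_node, fits_latency_budget(warm_node, budget_s=120))
print(cold_node, fits_latency_budget(cold_node, budget_s=120))
```

Under these assumed timings, scaling pods to zero fits a two-minute budget only while a GPU node stays provisioned, which is why node-pool behavior matters as much as pod scaling.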
Key Points
- KEDA can autoscale GPU inference on Kubernetes using GPU telemetry, addressing standard HPA's CPU/memory blind spot for accelerator workloads.
- The author open-sourced keda-gpu-scaler to bridge GPU metrics into KEDA; scale-to-zero can eliminate idle-GPU cost during off-hours.
- This is a single-author contributed how-to (also on a CNCF community blog), useful for platform teams but not a benchmarked or vendor-backed release.
Scoring Rationale
A practical, well-explained pattern for GPU-aware autoscaling with a working open-source scaler, useful to platform and ML-infrastructure engineers running production inference. But it is a single-author contributed article (also cross-posted to a CNCF community blog) and a personal project rather than a benchmarked result or vendor product, so it lands in the solid-but-niche band below the original 6.9.
