Companies Hoard GPUs, Leaving Most Capacity Idle
Cast AI data from roughly 23,000 Kubernetes clusters shows enterprise GPU utilization averages 5%, meaning 95% of provisioned GPU capacity sits idle. CPU and memory are underutilized as well, at 8% and 20% respectively. Firms are overprovisioning GPUs out of fear of missing out rather than in response to sustained workload demand, driving up GPU procurement and cloud spend. The report flags one-time rightsizing and static autoscaler configs as insufficient; continuous, autonomous optimization is required to close the gap between paid-for and actually used capacity. For practitioners, this is a capital and operational inefficiency problem that also increases pressure on premium GPU supply and cloud pricing.
What happened
Cast AI published its 2026 State of Kubernetes Optimization Report based on telemetry from about 23,000 clusters. The headline metric: average GPU utilization is 5%, so roughly 95% of provisioned GPU capacity is idle. CPU utilization averages 8% and memory utilization averages 20%. "Companies are overbuying GPUs out of fear of missing out," said Laurent Gil, CEO of Cast AI.
Technical details
The dataset comes from organizations running Kubernetes with Cast AI agents that observe node, pod, and workload behavior. Key operational failures the report calls out are static rightsizing at deployment, suboptimal autoscaler configurations, poor Spot instance selection, and neglected node lifecycle management. These translate into persistent overprovisioning of expensive accelerator classes, where an idle GPU costs dollars per hour versus cents for an idle CPU. Practitioners should note the specific utilization profile (a measurement sketch follows the list):
- GPU utilization: 5% average across clusters
- CPU utilization: 8% average
- Memory utilization: 20% average
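As a concrete illustration of how a team might check these numbers in its own clusters, the sketch below averages per-GPU utilization reported by NVIDIA's DCGM exporter via the Prometheus HTTP API. The Prometheus endpoint and the 7-day window are assumptions for the example, not details from the report, and this is not how Cast AI collects its telemetry.

```python
import requests

# Assumed setup: a Prometheus server scraping NVIDIA's dcgm-exporter.
# DCGM_FI_DEV_GPU_UTIL reports per-GPU utilization as a 0-100 percentage.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint

def average_gpu_utilization(window: str = "7d") -> float:
    """Average utilization across all GPUs over the given window."""
    query = f"avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[{window}]))"
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=30
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        raise RuntimeError("No DCGM metrics found; is dcgm-exporter running?")
    return float(result[0]["value"][1])

if __name__ == "__main__":
    print(f"Cluster-wide average GPU utilization: {average_gpu_utilization():.1f}%")
```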
Cast AI recommends autonomous, continuous optimization to adapt to shifting workload patterns rather than one-time rightsizing.
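To make "continuous optimization" concrete, here is a deliberately simplified rightsizing loop, not Cast AI's actual algorithm: it compares observed usage against current requests and resizes when the two drift apart. The workload names, the 30% headroom, and the drift threshold are all arbitrary assumptions, and the metrics are simulated.

```python
# Toy continuous-rightsizing pass (illustration only, not Cast AI's algorithm).
HEADROOM = 1.3          # assumed 30% buffer above observed usage
MIN_REQUEST_MCPU = 50   # floor so a request never collapses to zero
DRIFT_THRESHOLD = 0.15  # only act when the request is >15% off target

# Simulated state: current CPU requests (millicores) per hypothetical workload.
requests_mcpu = {"api": 2000, "worker": 4000}

def observed_usage_mcpu(workload: str) -> float:
    """Stand-in for real telemetry, e.g. p95 CPU usage over the last hour."""
    return {"api": 160.0, "worker": 900.0}[workload]

def rightsizing_pass() -> None:
    for workload, current in requests_mcpu.items():
        target = max(observed_usage_mcpu(workload) * HEADROOM, MIN_REQUEST_MCPU)
        if abs(target - current) / current > DRIFT_THRESHOLD:
            print(f"{workload}: resize {current} -> {target:.0f} mCPU")
            requests_mcpu[workload] = target  # real version: patch the workload spec

if __name__ == "__main__":
    # A real controller would run this on a timer; one pass is shown here,
    # which is the "continuous" part that one-time rightsizing lacks.
    rightsizing_pass()
```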
Context and significance
This pattern amplifies two current infrastructure pressures. First, it wastes capital and inflates cloud bills for organizations moving from pilot to production. Second, it tightens demand for premium accelerators, principally units from NVIDIA, exacerbating supply constraints and keeping prices high. The finding also exposes a mismatch between Kubernetes's theoretical efficiency and real-world operations, where configuration drift and infrequent tuning leave expensive resources idle. For SREs and ML platform teams, the core takeaway is that orchestration alone does not equal efficiency; control-plane configuration, autoscaling policies, and workload-aware bin-packing must be continuously managed.
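To illustrate the bin-packing point: placement quality determines how many expensive nodes stay provisioned at all. Below is a minimal first-fit-decreasing sketch, a textbook heuristic rather than the actual Kubernetes scheduler (which also weighs affinity, taints, topology, and priorities), packing hypothetical pod GPU requests onto 8-GPU nodes.

```python
def first_fit_decreasing(pod_gpu_requests, gpus_per_node=8):
    """Pack pod GPU requests onto nodes using the classic FFD heuristic.

    Returns the list of free GPU counts per provisioned node; fewer nodes
    and less free capacity means less idle, paid-for hardware.
    """
    nodes = []  # free GPUs remaining on each provisioned node
    for request in sorted(pod_gpu_requests, reverse=True):
        for i, free in enumerate(nodes):
            if free >= request:
                nodes[i] -= request  # place pod on the first node that fits
                break
        else:
            nodes.append(gpus_per_node - request)  # provision a new node
    return nodes

# Hypothetical pod requests: tighter packing means fewer idle GPUs.
free = first_fit_decreasing([4, 2, 2, 1, 1, 1, 5, 3])
total, idle = len(free) * 8, sum(free)
print(f"nodes={len(free)}, idle GPUs={idle}/{total} ({100 * idle / total:.0f}% idle)")
```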
What to watch
Expect rising interest in platforms that automate continuous rightsizing and workload placement, stronger governance on GPU procurement, and renewed emphasis on Spot/eviction-aware ML training pipelines. Also watch GPU resale and marketplace activity as organizations seek to monetize idle capacity.
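On the Spot/eviction point, the core pattern is trapping the termination signal sent before a node is reclaimed and checkpointing so training can resume elsewhere. A minimal sketch follows, with the checkpoint path, step count, and training step all placeholder assumptions rather than any specific framework's API.

```python
import signal
import sys

CKPT_PATH = "/mnt/shared/checkpoint.pt"  # hypothetical shared-storage path
stop_requested = False

def on_sigterm(signum, frame):
    """Kubernetes sends SIGTERM before evicting a pod (e.g. Spot reclaim)."""
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, on_sigterm)

def save_checkpoint(step: int) -> None:
    """Stand-in: persist model and optimizer state to shared storage."""
    print(f"checkpointed at step {step} -> {CKPT_PATH}")

def train_step(step: int) -> None:
    """Stand-in for one optimization step."""
    pass

step = 0  # a real pipeline would restore step and weights from CKPT_PATH
while step < 10_000:
    train_step(step)
    step += 1
    if stop_requested:           # eviction notice received
        save_checkpoint(step)    # flush state before the grace period ends
        sys.exit(0)
    if step % 500 == 0:
        save_checkpoint(step)    # periodic safety checkpoint
```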
Scoring Rationale
The report surfaces a notable operational problem affecting ML platform cost and GPU supply dynamics. The dataset is broad but the finding is operational rather than a technical breakthrough, so its impact is notable but not industry-shaking.