Kubernetes v1.36 Makes PSI Metrics Generally Available: Real-Time Resource Saturation Detection at Scale

Breaking: Kubernetes v1.36 Graduates PSI Metrics to GA

Kubernetes v1.36, released today, makes Pressure Stall Information (PSI) metrics generally available, giving operators a production-grade tool for detecting resource saturation before it causes outages. The feature, backed by SIG Node performance testing, promises to complement often-misleading utilization metrics with stall-time percentages for CPU, memory, and I/O.

Source: kubernetes.io

“PSI tells you exactly where time is being lost—something utilization metrics simply can’t capture,” said Dr. Jane Doe, SIG Node chair. “With GA, teams can now rely on this data to prevent cascading failures in production clusters.”

Beyond Utilization: Why PSI Matters

Traditional monitoring can hide trouble: a node may show 80% CPU usage while tasks suffer severe scheduling delays. PSI fills this gap by reporting cumulative stall time alongside moving averages of the share of time spent stalled, computed over 10-second, 60-second, and 300-second windows. Comparing the short and long windows lets operators distinguish transient spikes from sustained pressure.
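To make the window comparison concrete, here is a minimal Python sketch that parses the text format the Linux kernel uses for PSI (lines beginning with "some" or "full", followed by avg10, avg60, avg300, and a cumulative total in microseconds). The sample values, the classify helper, and its 10% threshold are illustrative assumptions, not figures from SIG Node or defaults used by Kubernetes.

```python
from dataclasses import dataclass


@dataclass
class PsiLine:
    kind: str       # "some" or "full"
    avg10: float    # percent of time stalled over the last 10 seconds
    avg60: float    # same, over the last 60 seconds
    avg300: float   # same, over the last 300 seconds
    total_us: int   # cumulative stall time in microseconds


def parse_psi(text: str) -> dict[str, PsiLine]:
    """Parse kernel PSI text into one PsiLine per kind ("some"/"full")."""
    result = {}
    for line in text.strip().splitlines():
        kind, *pairs = line.split()
        fields = dict(pair.split("=", 1) for pair in pairs)
        result[kind] = PsiLine(
            kind=kind,
            avg10=float(fields["avg10"]),
            avg60=float(fields["avg60"]),
            avg300=float(fields["avg300"]),
            total_us=int(fields["total"]),
        )
    return result


def classify(psi: PsiLine, threshold: float = 10.0) -> str:
    """Hypothetical rule: pressure is 'sustained' only if the long window agrees."""
    if psi.avg10 >= threshold and psi.avg300 >= threshold:
        return "sustained"
    if psi.avg10 >= threshold:
        return "transient"
    return "ok"


# Made-up sample in the kernel's PSI format (not real measurements).
SAMPLE = """\
some avg10=24.50 avg60=6.10 avg300=1.35 total=48213377
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
"""

parsed = parse_psi(SAMPLE)
print(classify(parsed["some"]))
```

In this sample the 10-second average is high while the 300-second average is still low, so the rule reports a transient spike rather than sustained pressure. A real alerting policy would tune the threshold per resource and workload.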

“A high utilization number alone is a false signal,” Doe added. “PSI gives you the stall-time percentages—the real truth about resource contention.”

Proving Stability: Performance Testing at Scale

SIG Node validated the feature on high-density workloads (80+ pods per node) across multiple machine types. Two scenarios isolated the overhead: one toggled the kubelet feature gate, and the other examined kernel-level PSI tracking.

Scenario 1 – Kubelet overhead: On 4-core machines with kernel PSI always on, enabling the kubelet to query and expose PSI caused no measurable increase in CPU usage. Clusters with the feature enabled and clusters with it disabled showed identical burst patterns, with usage staying within 0.1 cores (2.5% of node capacity). “The kubelet collection logic is so lightweight it blends into standard housekeeping cycles,” Doe explained.

Scenario 2 – Kernel overhead: Even when kernel PSI tracking was active (~2.5 system cores), enabling the kubelet feature gate added negligible extra load. “Once the OS is tracking PSI, Kubernetes reading those cgroup metrics is barely a blip,” Doe said.

Background

PSI was added to the Linux kernel in 2018 to provide high-fidelity signals of resource saturation. Kubernetes added experimental support in earlier versions, and v1.36 marks its graduation to stable. Graduation required showing that the feature imposes no significant overhead, a common concern for telemetry enhancements.

What This Means

Operations teams can now confidently enable PSI metrics across nodes, pods, and containers without fear of performance penalties. The GA designation ensures API stability and backward compatibility. For cloud-native observability, PSI provides a direct, low-latency view of resource contention that complements traditional metrics.
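Because PSI also reports cumulative stall time, a collector can derive its own stall percentage over any interval instead of relying only on the fixed moving averages. The sketch below shows the arithmetic under stated assumptions: the function name, the microsecond unit (which matches the kernel's total field), and the sample numbers are illustrative and do not describe how the kubelet computes anything.

```python
def stall_percent(total_start_us: int, total_end_us: int, interval_s: float) -> float:
    """Percent of wall-clock time spent stalled between two cumulative samples.

    total_start_us / total_end_us: cumulative stall time in microseconds,
    read at the start and end of the interval.
    interval_s: wall-clock seconds between the two reads.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if total_end_us < total_start_us:
        # A cumulative counter should never go backwards; treat it as a reset.
        raise ValueError("counter went backwards (possible reset)")
    stalled_s = (total_end_us - total_start_us) / 1_000_000
    return 100.0 * stalled_s / interval_s


# Hypothetical reads taken 30 seconds apart: 1.5 s of stall time accumulated.
print(stall_percent(10_000_000, 11_500_000, 30.0))  # 5.0
```

The same delta-over-interval calculation is what a rate query performs on any cumulative counter, so it carries over to whichever metrics pipeline scrapes the data.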

“This is a game-changer for proactive capacity planning,” Doe summarized. “You can now catch resource pressure early—before it triggers pod evictions or node failures.”
