How to Monitor Resource Contention with Kubernetes PSI Metrics (v1.36 GA)

By

Introduction

Pressure Stall Information (PSI) has been a game-changer in the Linux kernel since 2018, offering high-fidelity signals to detect resource saturation before it escalates into an outage. Unlike traditional utilization metrics, PSI reveals the actual time tasks spend stalled due to CPU, memory, or I/O contention, expressed as clear percentages. With the graduation of PSI metrics to General Availability in Kubernetes v1.36, cluster operators now have a stable, reliable interface to observe resource pressure at the node, pod, and container levels. This guide will walk you through enabling, interpreting, and leveraging PSI metrics to keep your workloads healthy and responsive.

How to Monitor Resource Contention with Kubernetes PSI Metrics (v1.36 GA)
Source: kubernetes.io

What You Need

Step-by-Step Guide

Step 1: Verify Kernel and Kubelet Support

PSI must be enabled at the Linux kernel level. Most modern kernels have it enabled by default (psi=1). Confirm with:

cat /boot/config-$(uname -r) | grep CONFIG_PSI

If you see CONFIG_PSI=y, you're good. Next, ensure your Kubernetes version is v1.36 or higher. Run:

kubectl version --short

The Kubelet feature gate KubeletPSI is enabled by default in v1.36, but verify via the Kubelet configuration or check the metrics endpoint later.

Step 2: Enable PSI Metrics in the Kubelet

In Kubernetes v1.36, the KubeletPSI feature gate is stable and enabled automatically. No manual action is required unless you upgraded from an older version where it was alpha/beta. Optionally, you can confirm the feature gate status on each node:

kubectl get nodes -o yaml | grep -i psi

After confirming, the Kubelet will start collecting PSI metrics from the cgroup v2 interface (if available) or from /proc/pressure/ for system-wide metrics. The collection overhead is negligible—performance testing during GA validation showed the Kubelet uses about 0.1 cores (2.5% of a 4-core node) even under high-density workloads (80+ pods).

Step 3: Access PSI Metrics via the Metrics Endpoint

Kubelet exposes PSI metrics under the /metrics endpoint. Use port-forwarding to access it:

kubectl port-forward node/<node-name> 10250:10250 -n kube-system

Then curl the endpoint (you may need to provide authentication tokens):

curl -k -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://localhost:10250/metrics | grep psi

You'll see metrics like:

Pod-level and container-level metrics are also available if cgroup v2 is used.

Step 4: Interpret the Metrics

PSI metrics come in two forms:

For example, a 10s average of 20% for CPU pressure means that over the last 10 seconds, tasks were stalled 20% of the time waiting for CPU. This is more informative than plain CPU utilization which might show 80% but hide scheduling delays.

Step 5: Set Up Monitoring and Alerts

Integrate PSI metrics into your monitoring stack. For Prometheus, add a scrape target that includes the Kubelet metrics. Create alerting rules like:

groups:
- name: psi-alerts
  rules:
  - alert: HighCPUPressure
    expr: node_pressure_cpu_waiting_seconds_total_10s > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "CPU pressure > 50% on {{ $labels.instance }}"

Adjust thresholds based on your workload sensitivity.

Step 6: Validate with a Test Workload

To confirm PSI metrics are working and understand their behavior, generate controlled pressure. Install stress-ng on a test node:

kubectl run stress --image=alpine --command -- sh -c "apk add stress-ng && stress-ng --cpu 4 --timeout 60s"

While it runs, watch the PSI metrics. You should see the 10s average spike. This exercise also helps you calibrate baseline vs. unhealthy values for your environment.

Tips for Production Use

By following these steps, you'll be able to proactively detect resource contention before it impacts your users, thanks to the power of PSI metrics now mature in Kubernetes v1.36.

Tags:

Related Articles

Recommended

Discover More

A DNA-Based Revolution in Cholesterol Management: Answers to Key QuestionsNavigating the Google Family Link Call Blocking Bug: Step-by-Step FixesDebunking Common Rooftop Solar Misconceptions: What Homeowners Need to KnowTop Questions About the 2026 Apple Design Awards FinalistsIreland to Sign Artemis Accords: Expanding Global Cooperation in Space Exploration