Kubernetes Node Not Ready - How To Fix It
Good day. It's Wednesday, Aug. 28, and in this issue, we're covering:
Kubernetes Node Not Ready - How To Fix It?
Google Cloud Run now supports GPUs to host your LLMs
Streamline Local Development with Dev Containers and Testcontainers
The Ultimate Docker Cheat Sheet
Talos Kubernetes on Proxmox using OpenTofu
End-to-End DevOps Project: Building, Deploying, and Monitoring a Full-Stack Application
You share. We listen. As always, send us feedback at [email protected]
Before moving ahead... some great news ✋
We partnered with 1440 to bring you this FREE offering.
All your news. None of the bias.
Be the smartest person in the room by reading 1440! Dive into 1440, where 3.5 million readers find their daily, fact-based news fix. We navigate through 100+ sources to deliver a comprehensive roundup from every corner of the internet – politics, global events, business, and culture, all in a quick, 5-minute newsletter. It's completely free and devoid of bias or political influence, ensuring you get the facts straight.
Use Case
Kubernetes Node Not Ready - How To Fix It?
It is common to see a mix of node statuses in a Kubernetes cluster, especially when troubleshooting. Sometimes, nodes are marked as NotReady due to various issues.
Typically it looks like:
techops_examples@master:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready master 51m v1.31.0
node-worker-1 NotReady worker 49m v1.31.0
node-worker-2 Ready worker 47m v1.31.0
Behind the scenes:
The kubelet on each node is responsible for reporting the node's status to the control plane, specifically to the node-lifecycle-controller. The control plane then assesses this data (or the absence of it) to determine the node’s state.
The node’s kubelet sends information about various checks it performs, including:
Whether the network for the container runtime is functional.
If the CSI (Container Storage Interface) provider on the node is fully initialized.
The completeness of the container runtime status checks.
The operational state of the container runtime itself.
The functionality of the pod lifecycle event generator.
Whether the node is in the process of shutting down.
The availability of sufficient CPU, memory, or pod capacity on the node.
This information is then relayed to the node-lifecycle-controller, which uses it to assign the node one of the following statuses:
True: All checks have passed, indicating the node is operational and healthy.
False: One or more checks have failed, showing the node has issues and isn’t functioning correctly.
Unknown: The kubelet hasn’t communicated with the control plane within the expected timeframe, leaving the node's status unclear.
When the status is marked as Unknown, it usually indicates that the node has lost contact with the control plane, possibly due to network problems, kubelet crashes, or other communication failures.
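You can pull the Ready condition straight from the node object to see which of these values the controller has recorded, using the example worker from the cluster above:

# Show the Ready condition, including its status, reason, and last heartbeat
kubectl get node node-worker-1 -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'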
Diagnosis:
1. Node Status Check:
Run → kubectl get nodes and watch out for the status 'NotReady':
techops_examples@master:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready master 51m v1.31.0
node-worker-1 NotReady worker 49m v1.31.0
node-worker-2 Ready worker 47m v1.31.0
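If a node is flapping between Ready and NotReady, it can help to watch the list live; the -w flag streams status changes as they happen:

# Watch node status transitions in real time
kubectl get nodes -w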
2. Node Details and Conditions Check:
To dive deeper into why a node might be NotReady, use the kubectl describe command to get detailed information on the node's conditions, such as:
MemoryPressure: Node is low on memory.
DiskPressure: Node is running out of disk space.
PIDPressure: Node has too many processes running.
techops_examples@master:~$ kubectl describe node node-worker-1
Name: node-worker-1
Roles: worker
Labels: kubernetes.io/hostname=node-worker-1
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
CreationTimestamp: 2024-08-28T09:25:10Z
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False 2024-08-28T10:14:52Z 2024-08-28T09:26:35Z KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False 2024-08-28T10:14:52Z 2024-08-28T09:26:35Z KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False 2024-08-28T10:14:52Z 2024-08-28T09:26:35Z KubeletHasSufficientPID kubelet has sufficient PID available
Ready False 2024-08-28T10:14:52Z 2024-08-28T09:27:45Z KubeletNotReady PLEG is not healthy: pleg was last seen active 5m58.89150698s ago; threshold is 3m
This output shows the node's current conditions and highlights the specific reason (PLEG is not healthy) for the NotReady status, allowing you to take appropriate action.
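Since PLEG (the pod lifecycle event generator) health depends on the container runtime responding in time, a useful next step is scanning the kubelet's own logs on the affected node. A minimal sketch, assuming a systemd host where the kubelet logs to the journal:

# On node-worker-1: look for PLEG and runtime errors in recent kubelet logs
sudo journalctl -u kubelet --since "15 min ago" --no-pager | grep -iE 'pleg|runtime'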
3. Network Misconfiguration Check:
Run → ping <node-IP> to check connectivity between the nodes. If there's packet loss, it indicates a possible network issue that might be causing the node's NotReady status.
techops_examples@master:~$ ping 10.0.0.67
PING 10.0.0.67 (10.0.0.67) 56(84) bytes of data.
--- 10.0.0.67 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3054ms
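ICMP alone doesn't prove the control plane can reach the kubelet's API, so it's worth testing the port directly as well. A sketch, assuming netcat is installed; 10250 is the kubelet's default port:

# From the control plane: verify the kubelet port on the worker is reachable
nc -vz 10.0.0.67 10250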
4. Kubelet Issue Check:
Run → systemctl status kubelet on the node to verify that the kubelet service is running properly. If the kubelet is down, it may be the reason for the node's NotReady status.
techops_examples@node-worker-1:~$ systemctl status kubelet
kubelet.service - Kubernetes Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2024-08-28 09:25:10 UTC; 1h 29min ago
Main PID: 2345 (kubelet)
Tasks: 13 (limit: 4915)
Memory: 150.1M
CPU: 8min 27.345s
CGroup: /system.slice/kubelet.service
└─2345 /usr/bin/kubelet
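If the kubelet is healthy (as above) but the node still reports NotReady, the container runtime is the next suspect. A quick check, assuming containerd is the runtime on this node:

# On node-worker-1: confirm the container runtime service is up
sudo systemctl status containerd
# List containers through the CRI to confirm the kubelet can reach the runtime
sudo crictl ps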
5. Kube-proxy Issue Check:
Run → kubectl get pods -n kube-system -o wide | grep kube-proxy to check the status of the kube-proxy pods on the node. If the kube-proxy pod is in a crash loop or not running, it could cause network issues leading to the NotReady status.
techops_examples@master:~$ kubectl get pods -n kube-system -o wide | grep kube-proxy
kube-proxy-5b7c8dfd9f-lk1bp 1/1 Running 0 1h 10.0.0.67 node-worker-1
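Even when the pod shows Running, its logs can reveal intermittent sync failures; the pod name below comes from the example output above:

# Tail recent kube-proxy logs for the worker's pod
kubectl logs kube-proxy-5b7c8dfd9f-lk1bp -n kube-system --tail=50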
How To Fix:
1. Resolve Lack of Resources:
Increase Resources: Scale up the node or optimize pod resource requests and limits.
Monitor & Clean: Use top or htop to monitor usage, stop non-Kubernetes processes, and check for hardware issues.
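To see pressure from the Kubernetes side before resizing anything, metrics-server (if installed) gives a quick view, and requests/limits can be adjusted in place; the deployment name below is a hypothetical placeholder:

# Cluster-side view of node resource usage (requires metrics-server)
kubectl top nodes
# Adjust requests/limits on a workload; "my-app" is a placeholder name
kubectl set resources deployment my-app --requests=cpu=100m,memory=128Mi --limits=cpu=500m,memory=512Mi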
2. Resolve Kubelet Issues:
Check Status: Run systemctl status kubelet and act on the result:
active (running): Kubelet is fine; the issue might be elsewhere.
active (exited): Restart with sudo systemctl restart kubelet.
inactive (dead): Check logs with sudo cat /var/log/kubelet.log to diagnose.
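A restart sequence that works on most systemd hosts, plus a journal fallback, since many distros send kubelet logs to journald rather than /var/log/kubelet.log:

# Reload unit files in case the kubelet config changed, then restart
sudo systemctl daemon-reload
sudo systemctl restart kubelet
# If /var/log/kubelet.log doesn't exist, check the journal instead
sudo journalctl -u kubelet -n 100 --no-pager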
3. Resolve Kube-proxy Issues:
Check Logs: Use kubectl logs <kube-proxy-pod-name> -n kube-system to review logs.
DaemonSet: Ensure the kube-proxy DaemonSet is configured correctly. If needed, delete the kube-proxy pod to force a restart (see the sketch below).
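Because kube-proxy runs as a DaemonSet, deleting its pod is safe; the controller recreates it immediately. Pod name taken from the earlier example:

# Confirm the DaemonSet is healthy, then force a fresh pod on the node
kubectl get daemonset kube-proxy -n kube-system
kubectl delete pod kube-proxy-5b7c8dfd9f-lk1bp -n kube-system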
4. Checking Connectivity:
Network Setup: Verify the network configuration and ensure the necessary ports are open.
Test Connections: Use ping <node-IP> and traceroute <node-IP> to check network connectivity.
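The same test is worth running from the worker's side to confirm it can reach the API server. A sketch; 6443 is the default API server port, and the control-plane address is a placeholder:

# On the worker: confirm the API server is reachable
nc -vz <control-plane-IP> 6443
traceroute <control-plane-IP>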
I believe the next time you see "NotReady," you'll know the reason and where to begin checking!
P.S. If you think someone you know may like this newsletter, share it with them to join here
Tool Of The Day
k8sGPT - a tool for scanning your Kubernetes clusters, diagnosing, and triaging issues in simple English.
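A quick taste, assuming the CLI is installed and pointed at your cluster (the --explain flag additionally requires an AI backend to be configured):

# Scan the cluster and explain any findings in plain English
k8sgpt analyze --explain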
Did someone forward this email to you? Sign up here
Interested in reaching smart techies?
Our newsletter puts your products and services in front of the right people - engineering leaders and senior engineers - who make important tech decisions and big purchases.