Why Did My Kubernetes Pod Stop Abruptly?
TechOps Examples
Hey — It's Govardhana MK 👋
Along with a use case deep dive, we round up remote job opportunities, top news, tools, and articles from the TechOps industry.
👋 Before we begin... a big thank you to today's sponsor NOTOPS
⚡ Cloud-Native Without the Learning Curve
Skip the months of trial and error. NotOps.io gives you an ideal Kubernetes and AWS setup, so you’re production-ready on day one.
🔄 Automated Day-Two Operations
Focus on innovation, not maintenance. NotOps.io automates patching, upgrades, and updates for EKS control planes, nodes, and tools like Argo CD.
👀 Why NotOps.io?
From stability to security to speed, get results from day one.
IN TODAY'S EDITION
🧠 Use Case
Why Did My Kubernetes Pod Stop Abruptly?
🚀 Top News
👀 Remote Jobs
GR8TECH is hiring a DevOps Engineer
Remote Location: Worldwide
WealthWizards is hiring a Platform Engineer
Remote Location: Worldwide
📚️ Resources
📢 Reddit Threads
Here’s Why Over 4 Million Professionals Read Morning Brew
Business news explained in plain English
Straight facts, zero fluff, & plenty of puns
100% free
🛠️ TOOL OF THE DAY
vanna.ai - Text to SQL Application, the fastest way to get actionable insights from your database just by asking questions.
🧠 USE CASE
Why Did My Kubernetes Pod Stop Abruptly?
We’ve all been there. Your Pod is running along, doing its job, and then suddenly - it stops. No graceful shutdown, no clear reason. It’s frustrating.
The Pod lifecycle immediately flashes through your mind, and most of us are ready for the frequent, obvious failure modes:
📌 Pod stops with 'Evicted' when disk pressure hits the node.
📌 Pod stops with 'OOMKilled' when the node runs out of memory.
📌 Pod stops with 'CrashLoopBackOff' when it keeps failing to start.
📌 Pod stops with 'ImagePullBackOff' when it can’t fetch the container image.
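When a Pod stops for one of these reasons, the status usually says so. A minimal check, assuming a hypothetical Pod named my-app (ImagePullBackOff and CrashLoopBackOff show up under the waiting state instead, which kubectl describe covers):

# Hypothetical Pod name 'my-app': pull the last recorded termination reason
# (e.g. OOMKilled) for the first container in the Pod.
kubectl get pod my-app -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Fuller picture: events, restart counts, and waiting reasons.
kubectl describe pod my-app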
A client reached out a while back for a consultation on exactly this kind of recurring issue.
In a cluster, a critical Pod running a multi-threaded app intermittently failed without clear logs. It vanished as 'Failed' with a blank reason, while other Pods on the node seemed fine - until they weren’t.
So what was actually going on behind the scenes?
The application spawned subprocesses without cleanup, leaving zombie processes behind (a small repro sketch follows this list).
These zombies accumulated, exhausting all available PIDs on the node.
Kubernetes couldn’t allocate PIDs for new Pods, causing abrupt failures.
Basic processes like the pause container couldn’t start, resulting in Pod terminations with unclear logs.
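To make the first point above concrete, here's a tiny repro you can run in a scratch shell - an illustrative sketch, not the client's actual application, and not something to run on a production node:

# Fork a short-lived child, then exec into a process that never calls wait().
# 'sleep 1' exits after a second but lingers in the process table as a zombie
# (state Z, shown as <defunct>) because its parent ('sleep 60') never reaps it.
# Multiply this pattern by thousands and the node runs out of PIDs.
sh -c 'sleep 1 & exec sleep 60' &
parent=$!

sleep 3   # give the child time to exit and turn into a zombie

ps -o pid,ppid,stat,cmd --ppid "$parent"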
Process ID Limits and Reservations is a fantastic guide to help you understand PID exhaustion in Kubernetes.
This wasn’t a straightforward problem, but here’s how we cracked it:
1. Analyzing the Node State
SSH’ed into the node hosting the failing Pods and checked the available PIDs:
cat /proc/sys/kernel/pid_max
This showed a max limit of 32,768 PIDs.
Counting running processes (ps aux | wc -l) revealed that nearly all PIDs were in use.
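For a quicker read on headroom, the same check can be collapsed into a short sketch (ps -eL counts threads too, since every thread consumes an ID from the same pool):

# Rough PID/TID usage on the node versus the configured ceiling.
used=$(ps -eL --no-headers | wc -l)
max=$(cat /proc/sys/kernel/pid_max)
echo "PIDs in use: ${used} of ${max}"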
2. Inspecting Zombie Processes
Looked for zombie processes (processes with stat Z):
ps -e -o pid,ppid,stat,cmd | grep 'Z'
Hundreds of zombie processes were tied to the legacy application.
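At this stage it also helps to group the zombies by their parent, so the process that never calls wait() stands out (a sketch; <PPID> is a placeholder for a PID from the first command's output):

# Group zombie processes by parent PID; the worst offender floats to the top.
ps -e -o stat,ppid | awk '$1 ~ /^Z/ {print $2}' | sort | uniq -c | sort -rn | head

# Then check what that parent actually is.
ps -o pid,ppid,cmd -p <PPID>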
3. Identifying the Offending Pod
Cross-referenced the zombie process PIDs with Pod logs to identify the application responsible for spawning these processes.
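One way to tie a suspect parent PID back to a Pod without guessing from logs is its cgroup path (a sketch; the exact layout varies by container runtime and cgroup version):

# On a Kubernetes node the cgroup path typically sits under a 'kubepods'
# slice that embeds the Pod UID and container ID.
cat /proc/<PID>/cgroup

# List Pod UIDs alongside their names to match against that path.
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.uid}{"\t"}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}'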
4. Correlating with Kubernetes Events
Ran kubectl describe node <node-name> to confirm PIDPressure. Kubernetes marked the node as unhealthy due to PID exhaustion.
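If you only want the relevant slice of that output, the conditions table is where PIDPressure shows up:

# Narrow the describe output down to the node conditions, where PIDPressure
# flips to True once the kubelet sees PIDs running out.
kubectl describe node <node-name> | grep -A 10 'Conditions:'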
The Fix:
Increased the node PID limit temporarily (sysctl -w kernel.pid_max=4194304).
Fixed the application to handle child processes and clean up zombies with s6-overlay.
Isolated the legacy app to a dedicated node pool to protect other workloads.
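If you go the same route, note that the sysctl bump is lost on reboot; a hedged sketch of making it stick (the file name is just a convention):

# Persist the raised PID ceiling across reboots.
echo 'kernel.pid_max = 4194304' | sudo tee /etc/sysctl.d/99-pid-max.conf
sudo sysctl --system

The Process ID Limits and Reservations guide linked above also covers per-Pod PID limits (podPidsLimit in the kubelet configuration), which stop a single misbehaving workload from starving the whole node.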
Of course, this could have been completely avoided.
📌 Use a process supervisor like s6-overlay in containerized environments to manage child processes effectively.
📌 Even low-density nodes can hit PID exhaustion. Keep an eye on the PIDPressure node condition with kubectl describe node, or with the one-liner below.
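That one-liner, as a sketch using kubectl's jsonpath support:

# Print each node along with the current status of its PIDPressure condition.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="PIDPressure")].status}{"\n"}{end}'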
Hope this use case was interesting and equally informative.
As a DevOps Engineer, you don't need to know everything.
When I moved from a release engineer to a DevOps role, I didn't know Python to automate. I learned it on the job.
I was familiar with Perl till then.
Singtel was using Puppet, and I never worked on Ansible for a while.… x.com/i/web/status/1…
— Govardhana Miriyala Kannaiah (@govardhana_mk)
4:24 PM • Dec 14, 2024