Why Did My Kubernetes Pod Stop Abruptly?

TechOps Examples

Hey — It's Govardhana MK 👋

Along with a use case deep dive, we round up remote job opportunities, top news, tools, and articles from across the TechOps industry.

👋 Before we begin... a big thank you to today's sponsor NOTOPS

 Cloud-Native Without the Learning Curve
Skip the months of trial and error. NotOps.io gives you an ideal Kubernetes and AWS setup, so you’re production-ready on day one.

🔄 Automated Day-Two Operations
Focus on innovation, not maintenance. NotOps.io automates patching, upgrades, and updates for EKS control planes, nodes, and tools like Argo CD.

👀 Why NotOps.io?
From stability to security to speed, get results from day one.

IN TODAY'S EDITION

🧠 Use Case
  • Why Did My Kubernetes Pod Stop Abruptly?

🚀 Top News

👀 Remote Jobs

📚️ Resources

📢 Reddit Threads

Here’s Why Over 4 Million Professionals Read Morning Brew

  • Business news explained in plain English

  • Straight facts, zero fluff, & plenty of puns

  • 100% free

🛠️ TOOL OF THE DAY

vanna.ai - a text-to-SQL application and the fastest way to get actionable insights from your database just by asking questions.

🧠 USE CASE

Why Did My Kubernetes Pod Stop Abruptly?

We’ve all been there. Your Pod is running along, doing its job, and then suddenly - it stops. No graceful shutdown, no clear reason. It’s frustrating.

A quick mental run-through of the Pod lifecycle is usually the first instinct.

In fact, most of us are more or less ready for the frequent and obvious ones (a quick status check for these follows the list):

📌 Pod stops with 'Evicted' when disk pressure hits the node.

📌 Pod stops with 'OOMKilled' when a container exceeds its memory limit or the node runs out of memory.

📌 Pod stops with 'CrashLoopBackOff' when its container keeps crashing soon after starting.

📌 Pod stops with 'ImagePullBackOff' when it can’t fetch the container image.
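
For the obvious cases above, the reason is usually right there in the Pod status. A minimal check, assuming a hypothetical Pod named my-app in the current namespace (swap in your own names):

# Human-readable status and reasons (OOMKilled, Evicted, and so on)
kubectl describe pod my-app | grep -iE 'state|reason'

# Or pull the last termination reason straight from the container status
kubectl get pod my-app -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'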

A client reached out a while back for a consultation to solve exactly this kind of recurring issue.

In a cluster, a critical Pod running a multi-threaded app intermittently failed without clear logs. It vanished as 'Failed' with a blank reason, while other Pods on the node seemed fine - until they weren’t.

What happened behind this mess-up?
  • The application spawned subprocesses without reaping them, leaving zombie processes behind (a tiny repro sketch follows this list).

  • These zombies accumulated, exhausting all available PIDs on the node.

  • Kubernetes couldn’t allocate PIDs for new Pods, causing abrupt failures.

  • Basic processes like the pause container couldn’t start, resulting in Pod terminations with unclear logs.
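
To make that failure mode concrete, here is a tiny, purely illustrative repro of how zombies pile up when a parent never calls wait() on its children. This is not the client's code, just a sketch you can run on a test box:

# The parent backgrounds a short-lived child, then execs into a long-running
# process that never reaps it - the exited child lingers in state 'Z'.
bash -c 'sleep 1 & exec sleep 120' &

# A couple of seconds later the defunct child shows up as a zombie
sleep 2
ps -e -o pid,ppid,stat,cmd | awk '$3 ~ /^Z/'

Multiply that pattern across a busy multi-threaded app and the node's PID space drains surprisingly fast.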

Process ID Limits and Reservations is a fantastic guide to help you understand PID exhaustion in Kubernetes.

This wasn’t a straightforward problem, but here’s how we cracked it:

1. Analyzing the Node State

SSH’ed into the node hosting the failing Pods and checked the available PIDs:

cat /proc/sys/kernel/pid_max

This showed a max limit of 32,768 PIDs.

Counting running processes (ps aux | wc -l) showed that nearly all PIDs were in use.
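
One caveat on that count: thread IDs draw from the same pid_max pool as process IDs, so counting threads as well gives a truer picture. A rough sketch (exact numbers will differ on every node):

# Processes vs. threads currently holding PIDs
ps -e -o pid= | wc -l      # processes
ps -eL -o lwp= | wc -l     # threads (each thread ID counts against pid_max)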

2. Inspecting Zombie Processes

Looked for zombie processes (STAT column 'Z'):

ps -e -o pid,ppid,stat,cmd | grep 'Z'

Hundreds of zombie processes were tied to the legacy application.
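
Counting the zombies and grouping them by parent PID points straight at whatever is failing to reap them - a quick sketch using only standard ps and awk:

# Total zombies on the node
ps -e -o stat= | grep -c '^Z'

# Zombies grouped by parent PID - the biggest count is usually the culprit
ps -e -o ppid=,stat= | awk '$2 ~ /^Z/ {print $1}' | sort | uniq -c | sort -rn | head

The stricter '^Z' match also avoids false positives from commands that merely contain the letter Z, which the plain grep above can pick up.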

3. Identifying the Offending Pod

Cross-referenced the zombie process PIDs with Pod logs to identify the application responsible for spawning these processes.
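
When logs alone aren't enough, a zombie's parent PID can usually be mapped back to its Pod through the cgroup hierarchy. A hedged sketch - cgroup paths vary by container runtime and cgroup driver, and PARENT_PID is a hypothetical value taken from the previous step:

# The parent's cgroup path typically embeds the pod UID and container ID
PARENT_PID=12345
cat /proc/$PARENT_PID/cgroup

# Match the pod UID from that path against Pod metadata
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.uid}{"\t"}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}' | grep <pod-uid-from-cgroup>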

4. Correlating with Kubernetes Events

Ran kubectl describe node <node-name> to confirm PIDPressure. Kubernetes marked the node as unhealthy due to PID exhaustion.
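
Beyond describe, the condition can be read directly, which is handy for a quick scan across every node in the cluster - a small sketch:

# PIDPressure condition per node (True means the kubelet is reporting PID exhaustion)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="PIDPressure")].status}{"\n"}{end}'

# Any node-level events that were recorded around the failures
kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=<node-name>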

The Fix:
  • Increased node PID limit temporarily (sysctl -w kernel.pid_max=4194304).

  • Fixed the application to reap its child processes, using s6-overlay to supervise and clean up zombies.

  • Isolated the legacy app to a dedicated node pool to protect other workloads (taint/toleration sketch below).
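
The isolation itself was plain taints and labels. A minimal sketch, assuming a hypothetical dedicated=legacy key (the legacy app's Pod spec then needs a matching toleration and nodeSelector):

# Reserve a node for the legacy workload; anything without the toleration stays off it
kubectl taint nodes <node-name> dedicated=legacy:NoSchedule
kubectl label nodes <node-name> dedicated=legacy

Longer term, the kubelet's podPidsLimit setting (covered in the Process ID Limits and Reservations guide mentioned above) caps PIDs per Pod, so one misbehaving app can't starve the whole node.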

Of course, this could have been avoided entirely.

📌 Use a process supervisor like s6-overlay in containerized environments to manage child processes effectively.

📌 Even low-density nodes can hit PID exhaustion. Monitor the PIDPressure node condition, for example with
kubectl describe node <node-name> | grep PIDPressure

Hope this use case was interesting and equally informative.

Looking to promote your company, product, service, or event to 26,000+ TechOps Professionals? Let's work together.