How Karpenter Feature Gates Helped on Black Friday

TechOps Examples

Hey, it's Govardhana MK 👋

Along with a use case deep dive, we cover remote job opportunities, top news, tools, and articles from the TechOps industry.

👋 Before we begin... a big thank you to today's sponsor NOTOPS

IN TODAY'S EDITION

🧠 Use Case

  • How Karpenter Feature Gates Helped on Black Friday

🚀 Top News

👀 Remote Jobs

📚 Resources

🛠️ TOOL OF THE DAY

Pixie - An open-source, Kubernetes-native application observability tool that gives an instant view of the high-level state of your cluster (service maps, cluster resources, application traffic).

🧠 USE CASE

How Karpenter Feature Gates Helped on Black Friday

My Black Friday experience this year was with an e-commerce platform and the usual objective: scale quickly under heavy traffic while keeping costs low.

Karpenter is an open-source Kubernetes node autoscaler that provisions nodes tailored to the requirements of your pending workloads.

Today I want to talk about Karpenter feature gates and how they helped us get through the event.

These were the challenges staring at me:

  • Traffic could skyrocket at any moment, requiring immediate node scaling without delays.

  • We heavily relied on Spot Instances. These instances could be reclaimed at any time, creating potential downtime.

  • Over time, some nodes in the cluster could drift from their intended configurations, resulting in wasted resources.

  • High workloads could stress some nodes to failure, requiring quick detection and repairs.

To tackle these challenges, I enabled three powerful Karpenter Feature Gates:

  • SpotToSpotConsolidation:
    Lets consolidation replace Spot nodes with cheaper Spot nodes. Without this gate, Karpenter only consolidates Spot nodes by deleting them.

  • Drift:
    Automatically detects nodes that no longer match the configuration they were provisioned from and replaces them.

  • NodeRepair:
    Automatically detects unhealthy nodes and replaces them without manual intervention.

Implementation:

1. Enable Feature Gates

Helm Chart Configuration to update and deploy:

    settings:
      featureGates:
        SpotToSpotConsolidation: true
        Drift: true
        NodeRepair: true
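
Feature gate names and availability have changed between Karpenter releases, so the exact keys depend on the chart version you deploy. The fragment below is a sketch, not the values used for this event: it assumes a recent chart in which the keys under settings.featureGates are written in lowerCamelCase, and it leaves out Drift on the assumption that drift detection is always on from Karpenter v1.0 and no longer has its own gate. Check the values.yaml that ships with your chart version before copying it.

```yaml
# values.yaml fragment (sketch; key names are assumed from recent Karpenter charts)
settings:
  featureGates:
    # Allow consolidation to replace Spot nodes with cheaper Spot nodes
    spotToSpotConsolidation: true
    # Replace nodes that report unhealthy conditions
    nodeRepair: true
```

Either way, the chart passes these settings to the controller as a single FEATURE_GATES environment variable, so inspecting that variable on the running Karpenter pod is a quick way to confirm which gates actually took effect.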

2. Configure Provisioner

    apiVersion: karpenter.sh/v1alpha5
    kind: Provisioner
    metadata:
      name: techops-provisioner
    spec:
      requirements:
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot"]
      provider:
        instanceTypes: ["m5.large", "m5.xlarge"]
      ttlSecondsAfterEmpty: 30
      consolidation:
        enabled: true

This configuration ensures:

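  • Only Spot capacity is requested, which keeps costs down.

  • Nodes are limited to the m5.large and m5.xlarge instance types chosen for the workload.

  • Empty nodes are terminated after 30 seconds.

  • Consolidation is turned on through the consolidation.enabled parameter; the SpotToSpotConsolidation feature gate then lets it replace Spot nodes with other Spot nodes.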

Note: We already configured and deployed Helm with the Drift and NodeRepair feature gates enabled. These features work automatically in the background and don't need to be added to the Provisioner file.
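
Newer Karpenter releases replace the v1alpha5 Provisioner with the NodePool API, and the v1alpha5 documentation describes ttlSecondsAfterEmpty and consolidation.enabled as mutually exclusive. The fragment below is a rough sketch of how the same intent might be written as a karpenter.sh/v1 NodePool. It is not the configuration used for this event. The names techops-nodepool and techops-nodeclass are hypothetical, and the EC2NodeClass it points to is assumed to exist already.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: techops-nodepool              # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: techops-nodeclass       # hypothetical; must already exist
      requirements:
        # Request Spot capacity only
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        # Instance types carried over from the Provisioner above
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.large", "m5.xlarge"]
  disruption:
    # One policy covers both empty and underutilized nodes
    consolidationPolicy: WhenEmptyOrUnderutilized
    # Wait 30 seconds before acting on a consolidation candidate
    consolidateAfter: 30s
```

One design point worth checking against the Karpenter docs: replacing a single Spot node with another Spot node is only attempted when the NodePool allows a wide range of instance types (the documentation mentions a minimum of 15), so a list as short as this one would mostly see deletions or many-to-one consolidations. Loosening the instance-type requirement gives the feature more room to work.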

3. Monitoring and Observability

SpotToSpotConsolidation Logs:

{"level":"info","msg":"Migrating workload from spot node techops1 to more stable node techops2"}

Drift Detection Logs:

{"level":"info","msg":"Drift detected on node techops1. Marking for termination and replacement."}

NodeRepair Logs:

{"level":"info","msg":"Node repair initiated for unhealthy node techops1"}
{"level":"info","msg":"Node replaced successfully"}

Final Results:

  • 30% cost savings

  • Lean, healthy infrastructure

  • Zero downtime throughout the event
