
Multi Cluster Batch Job Scheduling Now a Reality with Kueue

TechOps Examples

Hey — It's Govardhana MK 👋

Along with a use case deep dive, we round up remote job opportunities, top news, tools, and articles from across the TechOps industry.

👋 Before we begin... a big thank you to today's sponsor Pinata

Add file uploads instantly with Pinata’s developer-friendly File API

As a developer, your time is valuable. That’s why Pinata’s File API is built to simplify file management, letting you add uploads and retrieval quickly and effortlessly. Forget the headache of complex setups—our API integrates in minutes, so you can spend more time coding and less time on configurations. With secure, scalable storage and easy-to-use endpoints, Pinata takes the stress out of file handling, giving you a streamlined experience to focus on what really matters: building your app.

IN TODAY'S EDITION

🧠 Use Case

  • Multi Cluster Batch Job Scheduling Now a Reality with Kueue

🚀 Top News

👀 Remote Jobs

  • GR8TECH is hiring a DevOps Lead

    Remote Location: Worldwide

📚️ Resources

🛠️ TOOL OF THE DAY

Sonic Screwdriver - A multi-functional tool for managing AWS infrastructure.

  • SSH with instance IDs or ECS service names, even via a bastion.

🧠 USE CASE

Multi Cluster Batch Job Scheduling Now a Reality with Kueue

At the recent KubeCon + CloudNativeCon North America 2024, one session really caught my eye.

Ricardo Rocha, Lead Platforms Infrastructure at CERN, and Marcin Wielgus, Staff Software Engineer at Google, talked about Kueue and its new beta feature, MultiKueue.

Kueue is a Kubernetes-native batch scheduler that can now send jobs across clusters, regions, and even clouds. Pretty cool, right?

If you're curious, you can check out the keynote here.

Let’s imagine you’re running a Kubernetes batch job to train a machine learning model.

It’s a resource-hungry task needing multiple GPUs and hours of processing.

Here’s the YAML you might use:

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-intensive-job
spec:
  parallelism: 3
  completions: 3
  template:
    spec:
      containers:
      - name: gpu-job
        image: tensorflow/tensorflow:latest-gpu
        resources:
          limits:
            nvidia.com/gpu: 1
        command: ["python", "/app/train_model.py"]
      restartPolicy: OnFailure

Sounds simple, right? You submit the job, expecting Kubernetes to magically handle the rest.

But then reality hits:

  1. If the local cluster runs out of GPU nodes, the job is stuck in a Pending state.

  2. You have idle GPU nodes in another cluster, but Kubernetes doesn’t know how to use them.

Frustrating, isn’t it? This is where Kueue comes in handy.

Instead of letting jobs flounder when resources are tight, Kueue offers:

  • Jobs run only when all their required resources are available.

  • Kueue knows the specifics of your workloads (like GPU needs) and places jobs intelligently.

  • With the beta MultiKueue feature, Kueue can dispatch jobs to remote clusters if local resources aren’t enough.

Kueue Workflow

Did I mention it’s fully open source? You can install it on any vanilla Kubernetes setup.

Here’s what sets Kueue apart:

📌 ResourceFlavors: Model the different kinds of capacity in your clusters, like GPU nodes or spot instances, so Kueue can place jobs where they fit best (see the sketch after this list).

📌 Admission Checks: Only valid jobs get through - no rogue workloads sneaking in.

📌 Topology Aware Scheduling: Jobs are scheduled intelligently based on cluster and resource topology.

📌 MultiKueue (Beta): Schedule jobs across clusters, regions, and clouds seamlessly.
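
For concreteness, here is a minimal sketch of the objects behind those concepts, assuming the kueue.x-k8s.io/v1beta1 API. The flavor name and GPU quota are illustrative; the queue names match the ones used in the example below.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-flavor          # illustrative flavor representing GPU nodes
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue   # cluster-wide queue referenced later in this issue
spec:
  namespaceSelector: {}     # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: gpu-flavor
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 8     # illustrative quota: 8 GPUs shared via this queue
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: gpu-workload-queue  # namespaced queue that jobs point at
  namespace: default
spec:
  clusterQueue: gpu-cluster-queue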

Let’s rewrite the same job from earlier using Kueue:

apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: gpu-intensive-job
spec:
  queueName: gpu-workload-queue
  podSets:
  - name: main
    count: 3
    template:
      spec:
        containers:
        - name: gpu-job
          image: tensorflow/tensorflow:latest-gpu
          resources:
            limits:
              nvidia.com/gpu: 1
          command: ["python", "/app/train_model.py"]
        restartPolicy: OnFailure
  admission:
    clusterQueue: gpu-cluster-queue

What’s happening here?

  1. The job is added to a specific queue (gpu-workload-queue) for prioritization.

  2. Admission Checks ensure the job has enough resources available in gpu-cluster-queue before it starts.

  3. If the local cluster lacks resources, MultiKueue can move the job to another cluster with available GPUs.
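
A practical note: you rarely create Workload objects by hand. The more common pattern from the Kueue docs is to submit an ordinary Job that points at a LocalQueue through the kueue.x-k8s.io/queue-name label; Kueue then generates and manages the Workload for you. A minimal sketch of our GPU job written that way:

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-intensive-job
  labels:
    kueue.x-k8s.io/queue-name: gpu-workload-queue  # hands the Job to Kueue
spec:
  suspend: true        # created suspended; Kueue unsuspends it once admitted
  parallelism: 3
  completions: 3
  template:
    spec:
      containers:
      - name: gpu-job
        image: tensorflow/tensorflow:latest-gpu
        resources:
          limits:
            nvidia.com/gpu: 1
        command: ["python", "/app/train_model.py"]
      restartPolicy: OnFailure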

This level of automation and scalability is a game-changer for organizations managing hybrid, multi-cloud, or HPC environments.
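
If you want to experiment with the multi-cluster piece, the wiring looks roughly like this (a sketch based on recent Kueue releases; the MultiKueue API group and versions have shifted between releases, and all names here are hypothetical): an AdmissionCheck delegates admission to the MultiKueue controller, a MultiKueueConfig lists candidate clusters, and each MultiKueueCluster points at a kubeconfig Secret for a remote cluster.

apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: multikueue-check
spec:
  controllerName: kueue.x-k8s.io/multikueue  # delegate to the MultiKueue controller
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: MultiKueueConfig
    name: gpu-clusters
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: MultiKueueConfig
metadata:
  name: gpu-clusters          # hypothetical config listing remote clusters
spec:
  clusters:
  - remote-gpu-cluster
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: MultiKueueCluster
metadata:
  name: remote-gpu-cluster
spec:
  kubeConfig:
    locationType: Secret
    location: remote-gpu-cluster-kubeconfig  # Secret holding the remote kubeconfig

To activate it, the ClusterQueue (gpu-cluster-queue above) would reference the check via spec.admissionChecks: ["multikueue-check"].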

Kueue is just getting started. The next steps include tighter integrations with schedulers like Slurm and ML platforms like Kubeflow.

As Ricardo Rocha summed it up in his keynote:

"The idea is to submit jobs and not care where they run."

With Kueue, that vision is becoming a reality.

Looking to promote your company, product, service, or event to 22,000+ TechOps Professionals? Let's work together.