Multi Cluster Batch Job Scheduling Now a Reality with Kueue
TechOps Examples
Hey — It's Govardhana MK 👋
Along with a use case deep dive, we cover remote job opportunities, top news, tools, and articles in the TechOps industry.
👋 Before we begin... a big thank you to today's sponsor Pinata
Add file uploads instantly with Pinata’s developer-friendly File API
As a developer, your time is valuable. That’s why Pinata’s File API is built to simplify file management, letting you add uploads and retrieval quickly and effortlessly. Forget the headache of complex setups—our API integrates in minutes, so you can spend more time coding and less time on configurations. With secure, scalable storage and easy-to-use endpoints, Pinata takes the stress out of file handling, giving you a streamlined experience to focus on what really matters: building your app.
IN TODAY'S EDITION
🧠 Use Case
Multi Cluster Batch Job Scheduling Now a Reality with Kueue
🚀 Top News
Amazon EC2 Auto Scaling introduces highly responsive scaling policies
Supports sub-minute CloudWatch metrics for faster scaling and self-tuning Target Tracking policies that optimize cost and performance using historical data. Configurations can be managed via Console, CLI, SDKs, and CloudFormation.
👀 Remote Jobs
GR8TECH is hiring a DevOps Lead
Remote Location: Worldwide
Gigster is hiring a SRE Support Engineer
Remote Location: Worldwide
📚️ Resources
How to Simplify Your Git Commands with Git Aliases
Learn setup via config files or CLI, create parameterized aliases, and explore examples for efficient usage.
Curated list of resources on HashiCorp's Terraform and OpenTofu
Featuring 200+ tools (e.g., tflint, terragrunt), 50+ modules (e.g., AWS, Azure), beginner guides, books, CI integrations, videos, and tutorials.
How to Deploy Preview Environments on Kubernetes with GitHub Actions
Automate builds, manage ephemeral deployments, and enable fast feedback loops. This setup enhances collaboration, streamlines workflows, and reduces resource usage.
🛠️ TOOL OF THE DAY
Sonic Screwdriver - A multifunctional tool to manage AWS infrastructure.
SSH with instance IDs or ECS service names, even via a bastion.
🧠 USE CASE
Multi Cluster Batch Job Scheduling Now a Reality with Kueue
At the recent KubeCon + CloudNativeCon North America 2024, one session really caught my eye.
Ricardo Rocha, Lead Platforms Infrastructure at CERN, and Marcin Wielgus, Staff Software Engineer at Google, talked about Kueue and its new beta feature, MultiKueue.
It’s a Kubernetes native batch scheduler that can now send jobs across clusters, regions, and even clouds. Pretty cool, right?
If you're curious, you can check out the keynote here.
Let’s imagine, you’re running a Kubernetes batch job to train a machine learning model.
It’s a resource-hungry task needing multiple GPUs and hours of processing.
Here’s the YAML you might use:
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-intensive-job
spec:
  parallelism: 3
  completions: 3
  template:
    spec:
      containers:
        - name: gpu-job
          image: tensorflow/tensorflow:latest-gpu
          command: ["python", "/app/train_model.py"]
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: OnFailure
Sounds simple, right? You submit the job, expecting Kubernetes to magically handle the rest.
But then reality hits:
If the local cluster runs out of GPU nodes, the job is stuck in a Pending state.
You have idle GPU nodes in another cluster, but Kubernetes doesn’t know how to use them.
Frustrating, isn’t it? This is where Kueue comes in handy.
Instead of letting jobs flounder when resources are tight, Kueue offers:
Your job will only run when all its required resources are available.
Kueue knows the specifics of your workloads (like GPU needs) and places jobs intelligently.
With the beta MultiKueue feature, Kueue can dispatch jobs to remote clusters if local resources aren’t enough.
Kueue Workflow
Did I mention it’s fully open source? You can install it on any vanilla Kubernetes setup.
Here’s what sets Kueue apart:
📌 ResourceFlavors: Define your job's resource needs, whether it’s a GPU node or a spot instance. Kueue places jobs where they fit best.
📌 Admission Checks: Only valid jobs get through - no rogue workloads sneaking in.
📌 Topology Aware Scheduling: Jobs are scheduled intelligently based on cluster and resource topology.
📌 MultiKueue (Beta): Schedule jobs across clusters, regions, and clouds seamlessly.
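To make this concrete, here is a minimal sketch of the queueing objects a Workload would reference. The names match the queues used in the example below, but the flavor name and quota numbers are hypothetical, and field details can vary between Kueue versions:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-nodes            # hypothetical flavor representing GPU nodes
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue
spec:
  namespaceSelector: {}      # admit Workloads from any namespace
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: gpu-nodes
          resources:
            - name: "cpu"
              nominalQuota: 16      # example quotas, not a recommendation
            - name: "memory"
              nominalQuota: 64Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 4
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: gpu-workload-queue
  namespace: default
spec:
  clusterQueue: gpu-cluster-queue   # points the namespaced queue at the cluster-wide one

The ClusterQueue holds the cluster-wide quota, while the LocalQueue is the namespaced entry point that jobs submit to.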
Let’s rewrite the same job from earlier using Kueue:
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: gpu-intensive-job
spec:
  queueName: gpu-workload-queue
  podSets:
    - name: main
      count: 3
      template:
        spec:
          containers:
            - name: gpu-job
              image: tensorflow/tensorflow:latest-gpu
              command: ["python", "/app/train_model.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
          restartPolicy: OnFailure
  admission:
    clusterQueue: gpu-cluster-queue
What’s happening here?
The job is added to a specific queue (gpu-workload-queue) for prioritization.
Admission checks ensure the job has enough resources available in gpu-cluster-queue before it starts.
If the local cluster lacks resources, MultiKueue can move the job to another cluster with available GPUs.
This level of automation and scalability is a game-changer for organizations managing hybrid, multi-cloud, or HPC environments.
Kueue is just getting started. The next steps include tighter integrations with schedulers like Slurm and advanced tools like Kubeflow.
As Ricardo Rocha summed it up in his keynote:
"The idea is to submit jobs and not care where they run."
With Kueue, that vision is becoming a reality.
I started using Lazydocker for container management a while ago during some client projects, and I’m not going back.
✅ All services in one simple view.
✅ Logs are clean and easy to follow.
✅ Restart or rebuild with just a keypress.
✅ Managing Docker finally feels stress…
— Govardhana Miriyala Kannaiah (@govardhana_mk)
7:35 AM • Nov 25, 2024