How MxPlayer Scales with EKS SPOT Instances
TechOps Examples
Hey — It's Govardhana MK 👋
Along with a use case deep dive, we round up remote job opportunities, top news, tools, and articles in the TechOps industry.
👋 Before we begin... a big thank you to today's sponsor PERFECTSCALE
Join Patrick Debois (the DevOps guy) and Anton Weiss for a no-fluff walkthrough of:
→ Code gen models
→ Context prompts & RAG
→ Function calling & MCP
→ How AI tools use tools
You’ll walk away with the knowledge to build or evaluate GenAI tools with confidence.
IN TODAY'S EDITION
🧠 Use Case
How MxPlayer Scales with EKS SPOT Instances
🚀 Top News
👀 Remote Jobs
Scroll is hiring a Senior Site Reliability Engineer
Remote Location: Worldwide
Consensys is hiring a Senior DevOps Engineer - Tech Operations
Remote Location: Worldwide
📚️ Resources
📢 Reddit Threads
🛠️ TOOL OF THE DAY
KYE (Know Your Enemies) - Check external access on your AWS account
Analyzes IAM Role trust policies to identify who can assume your roles
Checks S3 bucket policies to identify who has access to your data
Identifies IAM roles vulnerable to the confused deputy problem (missing ExternalId condition)
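To give a feel for what that trust-policy check involves, here's a minimal boto3 sketch (not KYE's actual code) that flags roles another AWS account can assume without an sts:ExternalId condition:

```python
import boto3

iam = boto3.client("iam")

def roles_missing_external_id():
    """Yield (role_name, principal) pairs where a cross-account principal
    can assume the role with no sts:ExternalId condition attached."""
    for page in iam.get_paginator("list_roles").paginate():
        for role in page["Roles"]:
            statements = role["AssumeRolePolicyDocument"].get("Statement", [])
            if isinstance(statements, dict):  # a single statement may not be wrapped in a list
                statements = [statements]
            for stmt in statements:
                if stmt.get("Effect") != "Allow":
                    continue
                principal = stmt.get("Principal", {}).get("AWS")
                if not principal:
                    continue  # skip service principals and the like
                conditions = stmt.get("Condition", {})
                has_external_id = any("sts:ExternalId" in c for c in conditions.values())
                if not has_external_id:
                    yield role["RoleName"], principal

for name, who in roles_missing_external_id():
    print(f"{name}: assumable by {who} without ExternalId")
```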
🧠 USE CASE
How MxPlayer Scales with EKS SPOT Instances
MX Player, one of India’s largest OTT platforms, delivers more than 200,000 hours of video to over 300 million users.
That kind of reach demands serious backend power. And they’ve nailed it by building a cost-efficient, high-scale transcoding pipeline using EKS Spot Instances.
First, what is transcoding?
It’s the process of converting raw video into streamable formats. Think:
Changing video resolution from 4K to 720p
Compressing large files for mobile users
Converting formats to work across devices
Adding audio, subtitles, and metadata
Without this step, videos would be heavy, incompatible, and slow to stream.
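As a purely illustrative example, a 4K-to-720p step might look like this with ffmpeg driven from Python. File names and encoder settings are placeholders, and ffmpeg is assumed to be installed:

```python
import subprocess

def transcode_to_720p(src: str, dst: str) -> None:
    subprocess.run(
        [
            "ffmpeg", "-i", src,
            "-vf", "scale=-2:720",          # downscale to 720p, keep aspect ratio
            "-c:v", "libx264",              # widely compatible video codec
            "-crf", "23",                   # quality/size trade-off
            "-c:a", "aac", "-b:a", "128k",  # compress audio for mobile viewers
            dst,
        ],
        check=True,
    )

transcode_to_720p("raw_master_4k.mp4", "stream_720p.mp4")
```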
Now, imagine doing this for 100,000 videos in one hour.
What MX Player built?

Note: This design is based on the last publicly available information. The current live setup and audience volume may have evolved since then. This is for educational purposes only.
Download a high-resolution copy of this diagram here for future reference.
Their architecture leans on stateless workers, SQS queues, and Spot instances on Amazon EKS. The flow looks like this:
Vendors upload raw video files
Files are ingested and pushed into SQS
Pre-transcoding workers check the video, extract subtitles, and validate audio
Transcoding workers compress the video and apply H.266 encoding (an ultra-efficient codec)
Processed content is stored in S3
Delivered to users via CloudFront
Coordination is handled with ElastiCache for fast lookups and tagging
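To make the worker side concrete, here's a minimal sketch of what one stateless SQS consumer could look like under this kind of design. The queue URL, message fields, and process_video helper are assumptions, not MX Player's actual code:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.ap-south-1.amazonaws.com/123456789012/transcode-jobs"  # hypothetical

def process_video(s3_key: str) -> None:
    """Placeholder for the real work: download, transcode, upload to S3."""

def worker_loop() -> None:
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,       # long polling keeps idle workers cheap
            VisibilityTimeout=900,    # long enough to finish one transcode
        )
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            process_video(job["s3_key"])
            # Delete only after success: if a Spot node is reclaimed mid-job,
            # the message becomes visible again and another worker picks it up.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```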
The smart use of Spot fleets allows them to scale aggressively without breaking the bank.
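The write-up doesn't say exactly how the Spot capacity is provisioned; one common pattern on EKS is a managed node group with capacityType SPOT and diversified instance types. Every name, size, subnet, and ARN below is a placeholder:

```python
import boto3

eks = boto3.client("eks")

eks.create_nodegroup(
    clusterName="transcode-cluster",
    nodegroupName="transcode-spot-workers",
    capacityType="SPOT",
    instanceTypes=["c5.2xlarge", "c5a.2xlarge", "c6i.2xlarge"],  # diversify to reduce interruptions
    scalingConfig={"minSize": 0, "maxSize": 3000, "desiredSize": 0},  # scale to zero when the queue is empty
    subnets=["subnet-aaaa", "subnet-bbbb"],
    nodeRole="arn:aws:iam::123456789012:role/transcodeNodeRole",
)
```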
At Peak Load
100,000 videos ingested
3,000 Spot workers in action
All transcoded in under 1 hour
This isn’t theory. It’s a production system running at one of the highest scales in the country.
What I’d do to take this even further
1. Tiered Transcoding Queues
Not all content needs the same speed. I’d break the workload into tiers: flag premium or trending content to be processed first, while archive or long-tail videos wait in lower-priority queues.
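A rough sketch of how a worker could honor those tiers: poll the premium queue first and only fall back to the backlog queue when it's empty (queue URLs are hypothetical):

```python
import boto3

sqs = boto3.client("sqs")
PREMIUM_QUEUE = "https://sqs.ap-south-1.amazonaws.com/123456789012/transcode-premium"
BACKLOG_QUEUE = "https://sqs.ap-south-1.amazonaws.com/123456789012/transcode-backlog"

def next_job():
    """Return (queue_url, message), preferring premium content."""
    for queue in (PREMIUM_QUEUE, BACKLOG_QUEUE):
        resp = sqs.receive_message(QueueUrl=queue, MaxNumberOfMessages=1, WaitTimeSeconds=1)
        msgs = resp.get("Messages", [])
        if msgs:
            return queue, msgs[0]
    return None, None
```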
ElastiCache as a Coordination Layer
ElastiCache is already in the mix, but I’d go deeper, using it to assign unique transcoding tokens and prevent double processing when the queue gets backed up. This helps avoid duplicate work during traffic bursts.
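A minimal sketch of that dedupe idea with Redis (ElastiCache): SET with NX succeeds only for the first worker, so a redelivered message isn't transcoded twice. The endpoint and key format are assumptions:

```python
import redis

r = redis.Redis(host="my-elasticache-endpoint", port=6379)

def claim_job(video_id: str, worker_id: str, ttl_seconds: int = 3600) -> bool:
    """Return True only if this worker won the right to process the video."""
    # nx=True -> only set if the key doesn't exist; ex -> expire so crashed
    # workers don't lock a job forever.
    return bool(r.set(f"transcode:claim:{video_id}", worker_id, nx=True, ex=ttl_seconds))

if claim_job("video-123", "worker-7"):
    print("claimed, safe to transcode")
else:
    print("another worker already owns this job, skipping")
```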
3. Shift S3 from Passive Store to Active Trigger
Right now, ingestion is app-driven. I’d use S3 event triggers to automatically invoke pre-transcoding as soon as a file is uploaded. It makes the pipeline even more reactive.
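A sketch of that wiring with boto3: S3 pushes an event to the pre-transcoding queue on every upload. The bucket, prefix, and queue ARN are placeholders, and the queue policy would also need to allow S3 to send messages:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="raw-video-uploads",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:ap-south-1:123456789012:pre-transcode-jobs",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "incoming/"}]}
                },
            }
        ]
    },
)
```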
4. Job Status Dashboards for Content Teams
Give non engineering teams visibility into job status. A simple Grafana view showing "pending", "processing", and "completed" per asset helps reduce back and forth and improves alignment across teams.
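One lightweight way to feed such a dashboard is to expose per-status counts as Prometheus metrics for Grafana to chart. count_jobs_by_status below is an assumed helper with made-up numbers:

```python
import time
from prometheus_client import Gauge, start_http_server

JOBS = Gauge("transcode_jobs", "Transcode jobs by status", ["status"])

def count_jobs_by_status() -> dict:
    # Placeholder: in practice this would query the job store (e.g. ElastiCache).
    return {"pending": 42, "processing": 7, "completed": 1351}

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes this port; Grafana charts the series
    while True:
        for status, count in count_jobs_by_status().items():
            JOBS.labels(status=status).set(count)
        time.sleep(30)
```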
5. Transcoding Profiles Optimized for Regions
If the audience is global, create region-aware profiles. For example, compress more heavily in low-bandwidth regions and maintain quality in urban areas. That saves bandwidth while preserving the experience.
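A trivial sketch of what region-aware profile selection could look like; the region buckets and bitrates are made-up, illustrative numbers:

```python
# Hypothetical encoding profiles keyed by audience region.
REGION_PROFILES = {
    "low_bandwidth": {"max_height": 480,  "video_bitrate": "800k",  "audio_bitrate": "64k"},
    "default":       {"max_height": 720,  "video_bitrate": "2500k", "audio_bitrate": "128k"},
    "urban":         {"max_height": 1080, "video_bitrate": "5000k", "audio_bitrate": "192k"},
}

def profile_for(region: str) -> dict:
    return REGION_PROFILES.get(region, REGION_PROFILES["default"])
```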
If you're in media, publishing, or data processing, this is a real world case of how to marry scale with savings and still leave room to improve.
We're hosting a live workshop on AI in Dev & Ops: what works, what fails, and why, featuring industry pioneer Patrick Debois and PerfectScale’s Chief Storyteller Ant Weiss.
DevOps official language has to be 'YAML'
- Helm uses YAML
- GitHub uses YAML
- Ansible uses YAML
- Argo CD uses YAML
- Kubernetes uses YAML
- Azure DevOps uses YAML
- Docker Compose uses YAML
and more ... All the sophisticated toolsets you aspire to learn and adopt run on
— TechOps Examples (@techopsexamples)
4:36 AM • Apr 22, 2025
If you’re interested in starting a newsletter like this, try out beehiiv (it’s what I use).
You get a 30 day free trial + 20% OFF for 3 months when you sign up using the link below.
Looking to promote your company, product, service, or event to 45,000+ Cloud Native Professionals? Let's work together.
Partner Disclosure: Please note that some of the links in this post are affiliate links, which means if you click on them and make a purchase, I may receive a small commission at no extra cost to you. This helps support my work and allows me to continue to provide valuable content. I only recommend products that I use and love. Thank you for your support!