How MX Player Scales with EKS Spot Instances

TechOps Examples

Hey — It's Govardhana MK 👋

Along with a use case deep dive, we round up remote job opportunities, top news, tools, and articles from the TechOps industry.

👋 Before we begin... a big thank you to today's sponsor PERFECTSCALE

Join Patrick Debois (the DevOps guy) and Anton Weiss for a no-fluff walkthrough of:
→ Code gen models
→ Context prompts & RAG
→ Function calling & MCP
→ How AI tools use tools

You’ll walk away with the knowledge to build or evaluate GenAI tools with confidence.

IN TODAY'S EDITION

🧠 Use Case
  • How MX Player Scales with EKS Spot Instances

🚀 Top News
👀 Remote Jobs

📚️ Resources

📢 Reddit Threads

🛠️ TOOL OF THE DAY

KYE (Know Your Enemies) - Check external access on your AWS account

  • Analyzes IAM Role trust policies to identify who can assume your roles

  • Checks S3 bucket policies to identify who has access to your data

  • Identifies IAM roles vulnerable to the confused deputy problem (missing ExternalId condition)
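For context, the "confused deputy" check refers to trust policies that allow cross-account sts:AssumeRole without an ExternalId condition. A hardened trust policy looks roughly like this (the account ID and ExternalId value are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "example-external-id" }
      }
    }
  ]
}
```

Without that Condition block, any principal the third party controls could ask it to assume your role on someone else's behalf.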

🧠 USE CASE

How MX Player Scales with EKS Spot Instances

MX Player, one of India’s largest OTT platforms, delivers more than 200,000 hours of video to over 300 million users.

That kind of reach demands serious backend power. And they’ve nailed it by building a cost-efficient, high-scale transcoding pipeline using Spot Instances on Amazon EKS.

First, what is transcoding?

It’s the process of converting raw video into streamable formats. Think:

  • Changing video resolution from 4K to 720p

  • Compressing large files for mobile users

  • Converting formats to work across devices

  • Adding audio, subtitles, and metadata

Without this step, videos would be heavy, incompatible, and slow to stream.
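As a rough illustration, the first two bullets (downscaling and compressing) map to a single ffmpeg invocation. The sketch below only builds the command rather than running it; the file names and CRF value are illustrative, and ffmpeg itself is assumed to be available in a real pipeline:

```python
# Sketch: build an ffmpeg command for one common transcoding step
# (a 4K source down to a 720p, mobile-friendly H.264 rendition).

def build_transcode_cmd(src: str, dst: str, height: int = 720, crf: int = 23) -> list[str]:
    """Return an ffmpeg argv list that downscales and compresses a video."""
    return [
        "ffmpeg", "-i", src,
        "-vf", f"scale=-2:{height}",          # keep aspect ratio, target height
        "-c:v", "libx264", "-crf", str(crf),  # quality-based compression
        "-c:a", "aac", "-b:a", "128k",        # re-encode audio for compatibility
        dst,
    ]

cmd = build_transcode_cmd("raw_4k.mp4", "out_720p.mp4")
print(" ".join(cmd))
```

At MX Player's scale, the interesting part isn't one command like this; it's running hundreds of thousands of them reliably.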

Now, imagine doing this for 100,000 videos in one hour.

What MX Player built

Note: This design is based on the latest publicly available information. The current live setup and audience volume may have evolved since then. This is for educational purposes only.


Their architecture leans on stateless workers, SQS queues, and Spot instances on Amazon EKS. The flow looks like this:

  1. Vendors upload raw video files

  2. Files are ingested and pushed into SQS

  3. Pre-transcoding workers check the video, extract subtitles, and validate audio

  4. Transcoding workers compress the video and apply H.266 encoding (an ultra-efficient codec)

  5. Processed content is stored in S3

  6. Delivered to users via CloudFront

  7. Coordination is handled with ElastiCache for fast lookups and tagging
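Steps 2 through 4 hinge on the workers being stateless: any worker can pick up any message, which is exactly what makes Spot interruptions tolerable. A minimal sketch of that loop, with an in-memory queue standing in for SQS (a real worker would long-poll via boto3 and delete messages only after success):

```python
import queue

def process(job: dict) -> dict:
    """Placeholder for the actual transcoding work on one video."""
    return {"video_id": job["video_id"], "status": "transcoded"}

def run_worker(jobs: "queue.Queue[dict]", results: list) -> None:
    """Drain the queue; each message is independent, so any worker
    (or a replacement Spot instance) can take any job."""
    while True:
        try:
            job = jobs.get_nowait()    # real code: sqs.receive_message(WaitTimeSeconds=20)
        except queue.Empty:
            break
        results.append(process(job))   # real code: delete the message only after success

jobs = queue.Queue()
for i in range(3):
    jobs.put({"video_id": f"vid-{i}"})
results: list = []
run_worker(jobs, results)
print(results)
```

Because the queue, not the worker, owns the state, losing a Spot node mid-job just means the message becomes visible again for another worker.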

The smart use of Spot fleets allows them to scale aggressively without breaking the bank.
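For reference, a Spot-backed worker pool on EKS can be declared with eksctl. This is a hypothetical config, not MX Player's actual setup: the cluster name, region, instance types, and sizes are all illustrative.

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: transcoding-cluster
  region: ap-south-1
managedNodeGroups:
  - name: transcode-spot-workers
    spot: true
    # Diversifying instance types reduces the odds of mass interruption
    instanceTypes: ["c5.2xlarge", "c5a.2xlarge", "c6i.2xlarge"]
    minSize: 0
    maxSize: 3000
    labels:
      workload: transcoding
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule
```

The taint keeps non-transcoding pods off the interruptible nodes, while scaling from zero means you pay nothing when the ingest queue is empty.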

At Peak Load

  • 100,000 videos ingested

  • 3,000 Spot workers in action

  • All transcoded in under 1 hour

This isn’t theory. It’s a production system running at one of the highest scales in the country.

What I’d do to take this even further

1. Tiered Transcoding Queues
Not all content needs the same speed. I’d break the workload into tiers: flag premium or trending content to be processed first, while archive or long-tail videos wait in lower-priority queues.
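A minimal sketch of that tiering, with deques standing in for the per-tier SQS queues (the job payloads are made up):

```python
from collections import deque
from typing import Optional

def next_job(tiers: list) -> Optional[dict]:
    """Return the next job from the highest-priority non-empty tier."""
    for tier in tiers:
        if tier:
            return tier.popleft()
    return None

premium = deque([{"video": "trending-show-ep1"}])
long_tail = deque([{"video": "archive-2014-clip"}])

order = []
while (job := next_job([premium, long_tail])) is not None:
    order.append(job["video"])
print(order)  # premium content drains first
```

In SQS terms this is simply two queues and a worker that polls the premium queue first, falling back to the long-tail queue only when the premium one is empty.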

2. ElastiCache as a Coordination Layer
ElastiCache is already in the mix, but I’d go deeper, using it to assign unique transcoding tokens and prevent double processing when the queue gets backed up. This helps avoid duplicate work during traffic bursts.
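One way to implement those tokens is Redis's SET with the NX and EX options, which redis-py exposes as client.set(..., nx=True, ex=ttl). The sketch below uses a tiny in-memory stand-in for the client so the claim logic is visible and testable:

```python
class FakeRedis:
    """Minimal stand-in implementing the one call the sketch needs."""
    def __init__(self):
        self.store = {}
    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.store:
            return None          # redis-py returns None when NX fails
        self.store[key] = value
        return True

def claim_job(client, video_id: str, worker_id: str, ttl: int = 3600) -> bool:
    """Atomically claim a transcoding job; False means another worker has it.
    The TTL ensures a crashed worker's claim eventually expires."""
    return client.set(f"transcode:{video_id}", worker_id, nx=True, ex=ttl) is True

r = FakeRedis()
print(claim_job(r, "vid-42", "worker-a"))  # first claim wins
print(claim_job(r, "vid-42", "worker-b"))  # duplicate suppressed
```

The key name format and TTL here are illustrative; the important property is that the claim is a single atomic operation, so two workers can never both "win" the same video.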

3. Shift S3 from Passive Store to Active Trigger
Right now, ingestion is app driven. I’d use S3 event triggers to automatically invoke pre-transcoding as soon as a file is uploaded. It makes the pipeline even more reactive.
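A sketch of that trigger's consumer: a Lambda-style handler that turns each S3 ObjectCreated record into a pre-transcoding job. The bucket name and key are made up, and the SQS send is stubbed out so the parsing stays self-contained:

```python
def handler(event: dict, context=None) -> list:
    """Extract one pre-transcoding job per S3 event record."""
    jobs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        jobs.append({
            "bucket": s3["bucket"]["name"],
            "key": s3["object"]["key"],
            "action": "pre-transcode",
        })
        # real code: sqs.send_message(QueueUrl=..., MessageBody=json.dumps(jobs[-1]))
    return jobs

sample_event = {"Records": [{"s3": {"bucket": {"name": "raw-uploads"},
                                    "object": {"key": "vendor1/movie.mp4"}}}]}
print(handler(sample_event))
```

The nested Records/s3/bucket/object shape matches the S3 event notification format, so the same handler works whether the trigger invokes Lambda directly or fans out through SNS first (after unwrapping the envelope).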

4. Job Status Dashboards for Content Teams
Give non engineering teams visibility into job status. A simple Grafana view showing "pending", "processing", and "completed" per asset helps reduce back and forth and improves alignment across teams.

5. Transcoding Profiles Optimized for Regions
If the audience is global, create region-aware profiles. For example, compress more aggressively in low-bandwidth regions and preserve quality in well-connected urban areas. This saves bandwidth while preserving the viewing experience.
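A sketch of region-aware profile selection; the region buckets, resolutions, and bitrates here are entirely illustrative:

```python
PROFILES = {
    "low_bandwidth": {"height": 480, "video_kbps": 800},
    "default":       {"height": 720, "video_kbps": 2500},
    "high_quality":  {"height": 1080, "video_kbps": 5000},
}

LOW_BANDWIDTH_REGIONS = {"region-x", "region-y"}   # hypothetical groupings
HIGH_QUALITY_REGIONS = {"metro-a"}

def profile_for(region: str) -> dict:
    """Pick an encoding ladder rung based on where the viewer is."""
    if region in LOW_BANDWIDTH_REGIONS:
        return PROFILES["low_bandwidth"]
    if region in HIGH_QUALITY_REGIONS:
        return PROFILES["high_quality"]
    return PROFILES["default"]

print(profile_for("region-x"))
```

In practice this lookup would feed the transcoding workers' encoder settings, so the pipeline produces a different rendition mix per region instead of one global ladder.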

If you're in media, publishing, or data processing, this is a real-world case of how to marry scale with savings and still leave room to improve.

We are hosting a live workshop on AI in Dev & Ops: what works, what fails, and why, featuring industry pioneer Patrick Debois and PerfectScale’s Chief Storyteller Ant Weiss.

If you’re interested in starting a newsletter like this, try out beehiiv (it’s what I use).

You get a 30 day free trial + 20% OFF for 3 months when you sign up using the link below.

Looking to promote your company, product, service, or event to 45,000+ Cloud Native Professionals? Let's work together.

Partner Disclosure: Please note that some of the links in this post are affiliate links, which means if you click on them and make a purchase, I may receive a small commission at no extra cost to you. This helps support my work and allows me to continue to provide valuable content. I only recommend products that I use and love. Thank you for your support!