How GitHub's Database Self-Destructed in 43 Seconds
Hey — It's Govardhana MK 👋
Along with a use case deep dive, we identify the top news, tools, videos, and articles in the TechOps industry.
Before we begin... a big thank you to today's sponsor
Auto-pilot meeting bookings with AI Agent.
Try Salesforge - Send unique emails at scale in any language.
Scale your outreach, not your team.
IN TODAY'S EDITION
🧠 Use Case
How GitHub's Database Self-Destructed in 43 Seconds
🚀 Top News
Splunk Free Training - self-paced, train anytime from any location
📽️ Videos
📚️ Resources
Collection of Azure labs coded in Terraform
SSH vs. VPN: what’s the difference, and which is more secure?
🛠️ TOOL OF THE DAY
kubetrim - Trim 📏 your KUBECONFIG automatically.
It tidies up old and broken cluster and context entries from your kubeconfig file.
🧠 USE CASE
How GitHub's Database Self-Destructed in 43 Seconds
Before we get to those curious, destructive 43 seconds, let’s take a moment to look at GitHub’s architecture.
GitHub ran a distributed infrastructure with data centers on the U.S. East and West Coasts, supported by global POPs and cloud regions, and stored all of its metadata (issues, pull requests, comments, notifications) in MySQL.
This setup balances performance and availability, using a MySQL primary-replica topology with an East Coast primary for writes and replicas on both coasts for read resilience.
Some West Coast replicas acted as intermediate primaries, meaning replicas can have their own replicas.
This setup reduced cross-data-center traffic and costs by limiting the cross-data-center hop to a single instance, and it eased load on the primary, which only had to replicate to its direct replicas.
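As a rough mental model (the instance names and structure below are hypothetical, not GitHub’s actual tooling), the topology can be pictured as a tree in which only the intermediate primary replicates across coasts:

```python
# Hypothetical sketch of a primary -> intermediate-primary -> replica topology,
# counting the replication streams that cross data centers.

TOPOLOGY = {
    # child instance: the parent it replicates from
    "east-replica-1": "east-primary",
    "east-replica-2": "east-primary",
    "west-intermediate": "east-primary",   # the single cross-DC hop
    "west-replica-1": "west-intermediate",
    "west-replica-2": "west-intermediate",
}

DATACENTER = {
    "east-primary": "east",
    "east-replica-1": "east",
    "east-replica-2": "east",
    "west-intermediate": "west",
    "west-replica-1": "west",
    "west-replica-2": "west",
}

def cross_dc_links(topology: dict, dc: dict) -> int:
    """Count replication links whose endpoints sit in different data centers."""
    return sum(1 for child, parent in topology.items() if dc[child] != dc[parent])

print(cross_dc_links(TOPOLOGY, DATACENTER))  # -> 1: only the intermediate primary crosses coasts
```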
Chapter 1: The 43 Seconds That Changed Everything
On Oct 21, 2018, routine maintenance at GitHub’s East Coast data center led to an accidental misconfiguration—possibly a flipped switch or unplugged cable—that disconnected the East Coast from the network.
This disconnection lasted just 43 seconds, but it caused Orchestrator, GitHub’s automated failover tool, to promote a West Coast replica as the new primary.
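For intuition, here is a minimal sketch of what an automated promotion decision can look like; it is not Orchestrator’s real algorithm, and the node names and GTID counts are invented. Note that nothing in it asks whether the old primary is merely partitioned rather than dead:

```python
# Minimal, hypothetical failover sketch -- not Orchestrator's real logic.
# If the primary is unreachable, promote the reachable replica with the most
# advanced replication position. There is no quorum check here, which is
# exactly what makes a brief network partition dangerous.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    reachable: bool
    applied_gtid_count: int  # stand-in for "how far replication has applied"

def elect_new_primary(primary: Node, replicas: list[Node]) -> Node:
    if primary.reachable:
        return primary  # nothing to do
    candidates = [r for r in replicas if r.reachable]
    if not candidates:
        raise RuntimeError("no reachable replica to promote")
    return max(candidates, key=lambda r: r.applied_gtid_count)

east = Node("east-primary", reachable=False, applied_gtid_count=1_000)
west = Node("west-replica", reachable=True, applied_gtid_count=990)  # slightly behind
print(elect_new_primary(east, [west]).name)  # -> west-replica, despite missing writes
```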
Because the connection broke before the latest East Coast writes reached the West, the newly promoted West Coast primary was missing data that existed only in the East.
Writes against the old primary timed out, traffic failed over to the West Coast, and the West began accepting writes of its own, so the two data centers accumulated distinct data sets.
Even after the 43-second disconnection ended, the primaries stayed on the West Coast while the East Coast sat idle with its unreplicated writes, like two diverging branches in Git history.
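In MySQL terms, those “diverging branches” show up as executed-GTID sets where each side contains transactions the other has never applied. A hedged illustration with made-up GTID values (real MySQL exposes this via gtid_executed and functions like GTID_SUBTRACT; plain Python sets stand in here):

```python
# Hypothetical illustration of split-brain divergence using simplified GTID sets.

east_gtids = {f"east-uuid:{i}" for i in range(1, 101)}   # 100 txns committed on the East primary
west_gtids = {f"east-uuid:{i}" for i in range(1, 91)}    # the West had replicated only 90 of them...
west_gtids |= {f"west-uuid:{i}" for i in range(1, 41)}   # ...then accepted 40 writes of its own

only_on_east = east_gtids - west_gtids   # writes the West never replicated
only_on_west = west_gtids - east_gtids   # writes the East never saw

print(len(only_on_east), len(only_on_west))  # -> 10 40: neither history is a prefix of the other
```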
Shortly afterward, Downdetector and Reddit users began reporting issues.
Once GitHub identified the cross-data-center failover, the root cause became clear.
Chapter 2: Data Dilemma
Once the network connection was restored, GitHub faced a difficult choice.
Reverting to the East Coast primary would mean discarding up to 40 minutes of new West Coast data.
However, leaving the West Coast as primary caused severe latency for services that relied on East Coast infrastructure, making GitHub slow and mostly read-only.
To preserve data integrity, GitHub decided to proceed with the West Coast as primary and work to synchronize the East Coast over time.
This “fail-forward” approach meant maintaining a degraded state but ensured no data loss.
Chapter 3: The Long Road to Recovery
Synchronizing required rolling back the East Coast cluster, restarting replication, and saving unreplicated writes for later.
MySQL backups (taken every 4 hours and stored in the cloud) took hours to decompress, prepare, and load.
To speed things up, GitHub restored from the West Coast, syncing several clusters in 6 hours.
Eight hours in, GitHub published a blog post explaining the situation.
Three hours later, the East Coast clusters were ready to fail back, though some replicas still lagged and were returning outdated data.
The full sync took longer because of rising traffic, completing five hours later and restoring the original setup.
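That replica lag is exactly the kind of thing a fail-back gate has to check explicitly. A hedged sketch, assuming mysql-connector-python and the classic SHOW SLAVE STATUS output (the hostnames and threshold are made up; newer MySQL versions use SHOW REPLICA STATUS / Seconds_Behind_Source):

```python
# Hypothetical replica-lag gate before failing traffic back -- illustrative only.
import mysql.connector

MAX_LAG_SECONDS = 5  # made-up threshold for this sketch

def replica_is_caught_up(host: str, user: str, password: str) -> bool:
    conn = mysql.connector.connect(host=host, user=user, password=password)
    try:
        cur = conn.cursor(dictionary=True)
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
        if status is None:
            return False  # not configured as a replica at all
        lag = status["Seconds_Behind_Master"]
        return lag is not None and lag <= MAX_LAG_SECONDS
    finally:
        conn.close()

# Usage (hypothetical host): only fail back once every replica passes this check.
# print(replica_is_caught_up("east-replica-1.internal", "monitor", "secret"))
```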
Processing the backlog of over 5 million webhooks and 80,000 Pages builds took more time; full operations resumed after about 24 hours, with GitHub keeping affected users updated along the way.
Throughout this recovery, GitHub kept the site operational in a limited mode, disabling features like webhooks and GitHub Pages builds to ease the database load.
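The mechanics behind a “limited mode” like this are usually just feature flags checked before non-essential work is enqueued. A minimal sketch with invented flag names (not GitHub’s internal system):

```python
# Minimal, hypothetical "degraded mode" gate -- not GitHub's actual feature flags.
# Non-essential work (webhook delivery, Pages builds) is deferred while the
# database catches up, so it adds no extra load during recovery.

DEGRADED_MODE_FLAGS = {
    "webhooks_enabled": False,       # deliver later from a backlog
    "pages_builds_enabled": False,
}

deferred_backlog: list = []  # replayed once the flags are switched back on
live_queue: list = []

def maybe_enqueue(job_type: str, payload: dict) -> None:
    """Enqueue work only if its feature flag is on; otherwise defer it."""
    if DEGRADED_MODE_FLAGS.get(f"{job_type}_enabled", True):
        live_queue.append((job_type, payload))
    else:
        deferred_backlog.append((job_type, payload))

maybe_enqueue("webhooks", {"event": "push"})
print(len(live_queue), len(deferred_backlog))  # -> 0 1: deferred during recovery
```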
Some key insights we can derive from this incident:
Design for resilience to prevent split-brain, using quorum-based systems and consensus algorithms (see the sketch after this list).
Establish robust failover protocols to ensure data consistency when the primary is isolated.
Automate multi-step recovery like restores and re-syncs to handle high data volumes smoothly.
Implement throttling for non-essential services during recovery to ease database load.
Regularly test cross-region failover to uncover latency and topology issues proactively.
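On the first insight, a quorum check is the piece that would distinguish “the primary is gone” from “we briefly lost sight of it”. A hedged sketch, far simpler than a real consensus protocol such as Raft or Paxos:

```python
# Hypothetical quorum check before failover -- a simplification, not a full
# consensus implementation. Only promote a new primary if a strict majority of
# observers, spread across regions, agree the old primary is unreachable.

def primary_confirmed_down(observer_votes: dict[str, bool]) -> bool:
    """observer_votes maps observer name -> 'I cannot reach the primary'."""
    votes_down = sum(observer_votes.values())
    return votes_down > len(observer_votes) / 2  # strict majority required

# Hypothetical observers in three locations:
votes = {
    "observer-east": True,    # lost its local network, so it sees the primary as down
    "observer-west": False,
    "observer-cloud": False,
}
print(primary_confirmed_down(votes))  # -> False: no failover on a partial partition
```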
I hope this use case has been helpful in your learning journey.
You may also like:
Fully Automated Email Outreach With AI Agent Frank
Agent Frank was designed to fully take care of prospecting, emailing and booking meetings for you, so you can focus on closing deals!
With Agent Frank you won’t need 20+ different tools anymore - just set up your Agent in 4 steps and get started.