- TechOps Examples
- Posts
- Cloud Disaster Recovery Strategies
Cloud Disaster Recovery Strategies
TechOps Examples
Hey — It's Govardhana MK 👋
Along with a use case deep dive, we identify the top news, tools, videos, and articles in the TechOps industry.
IN TODAY'S EDITION
🧠 Use Case
Cloud Disaster Recovery Strategies
🚀 Top News
Amazon DynamoDB reduces on-demand prices by 50% and Global Tables prices by up to 67%
📽️ Videos
📚️ Resources
A curated list of Docker resources and projects
The Architecture of Dropbox's Load Balancing Service
🛠️ TOOL OF THE DAY
AcSecurity - A Python module designed to scan applications for common security vulnerabilities.
It checks for hardcoded secrets, dependency vulnerabilities, and code quality issues.
🧠 USE CASE
Cloud Disaster Recovery Strategies
Disaster Recovery Planning and implementation cannot be an afterthought, as many assume.
A lack of DR impacts not only availability but can also damage reputation and incur hefty penalties.
Unbelievable? Check out these real-world incidents.
British Airways lost £80 million due to a data center failure, stranding thousands of passengers because both primary and backup systems failed.
Fire at an OVH Data Center in Europe takes down thousands of sites, including WP Rocket and Imagify.
Any DR strategy starts with finalizing - RTO and RPO.
RTO (Recovery Time Objective):
How much downtime one can accept?
A tighter RTO means less tolerance for downtime, indicating that systems need to be restored quickly to avoid prolonged interruptions.
RPO (Recovery Point Objective):
How much data loss one can accept?
A tighter RPO means more frequent backups or data syncs to minimize data loss, which is especially crucial for data-intensive operations.
Most prominent DR strategies are:
Data and configurations are backed up at intervals, typically stored on separate media or cloud storage, and restored manually or through scripts when needed.
Typical RTO: Several hours to days
Typical RPO: Can vary from several hours to the last successful backup
Cost: Low cost, suitable for small budgets, as it primarily involves storage for backups rather than live infra.
Caveats: Recovery can be slow, leading to extended downtime and data loss, as it relies on the last available backup.
Best Suited For: Small businesses, personal projects, and non-critical applications.
A minimal version of core services is always active in a secondary location (e.g., core database running, app servers shut down), allowing faster ramp-up in emergencies while saving on idle resources.
Typical RTO: Minutes to a few hours
Typical RPO: Depends on synchronization frequency (e.g., hourly or daily)
Cost: Moderate, as minimal resources are running until needed, but requires investment in a standby environment.
Caveats: Scaling up during a disaster can take time, and RPO may still be significant based on sync intervals.
Best Suited For: Critical applications where quick recovery is needed but cost constraints prevent full redundancy.
A secondary environment with near-production-level resources is actively synced with the primary, enabling quick scaling with a load balancer redirect.
Typical RTO: Minutes to a few hours
Typical RPO: Within the last few minutes or hours
Cost: Higher than Pilot Light, as more resources are actively running and up-to-date, but still cheaper than a fully redundant setup.
Caveats: While faster than Backup and Restore, it can still result in downtime and data loss, though on a smaller scale.
Best Suited For: Applications needing faster recovery, such as ecommerce websites with high traffic but limited budget for full redundancy.
Both sites are fully operational, handling real-time data synchronization across locations with failover mechanisms, minimizing downtime to near-zero; this setup is best for mission-critical, latency-sensitive applications.
Typical RTO: Near-zero to a few minutes
Typical RPO: Very minimal, often within the last few seconds
Cost: Very high, as it requires full duplication of the production environment, essentially doubling infrastructure costs.
Caveats: Costly to implement and maintain, often feasible only for large enterprises with critical, real-time operations.
Best Suited For: Highly critical applications where downtime is unacceptable, like financial trading platforms or mission-critical services (e.g., banking systems, healthcare applications).
Remeber, the worst part of any disaster is ‘surprise’ - you never know when it happens, untill it happens.
Better be ready !