How Canva Processed Billions of Events with OLAP Migration

TechOps Examples

Hey — It's Govardhana MK 👋

Along with a use case deep dive, we identify the top news, tools, videos, and articles in the TechOps industry.

IN TODAY'S EDITION

🧠 Use Case

  • How Canva Processed Billions of Events with OLAP Migration

🚀 Top News

📽️ Videos

📚️ Resources

🛠️ TOOL OF THE DAY

cloud_enum - Multi-cloud OSINT tool. Enumerate public resources in AWS, Azure, and Google Cloud.

🧠 USE CASE

How Canva Processed Billions of Events with OLAP Migration

When Canva first built this architecture, it used MySQL and separated major components into worker services, storing multiple layers of reusable intermediate output.

Ref: Canva

The deduplication worker scanned and matched event types, updating records before aggregating results into counters. While this setup worked initially, Canva encountered three key challenges:

  • Processing scalability

  • Incident handling complexity

  • Rapid storage consumption

Processing scalability:

The deduplication scan used a single-threaded process with a pointer tracking the latest record, making it easy to verify fixes but not scalable.

Each record needed a database round trip, resulting in O(N) queries.

Batching helped but didn’t fully solve the issue. Multi-threading added complexity, and errors delayed the entire pipeline.

Even incrementing the total counters required additional database round trips.
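The single-threaded scan described above can be sketched as follows. The event shape, the in-memory list standing in for the MySQL table, and the pointer handling are illustrative assumptions, not Canva's actual code:

```python
# Minimal sketch of a single-threaded deduplication scan with a pointer.
# Each record fetch stands in for one database round trip, so the cost
# of a full scan is O(N) queries.

def dedup_scan(events, pointer=0):
    """Scan events from `pointer`, dropping duplicates by event id."""
    seen = set()
    counters = {}  # usage counts per event type
    for i in range(pointer, len(events)):  # one record per iteration -> O(N)
        event = events[i]                  # stands in for a per-record query
        pointer = i + 1                    # advance the scan pointer
        if event["id"] in seen:            # duplicate delivery -> skip
            continue
        seen.add(event["id"])
        counters[event["type"]] = counters.get(event["type"], 0) + 1
    return counters, pointer

events = [
    {"id": "e1", "type": "template_use"},
    {"id": "e1", "type": "template_use"},  # duplicate delivery
    {"id": "e2", "type": "template_use"},
    {"id": "e3", "type": "brand_use"},
]
counts, ptr = dedup_scan(events)
# counts == {"template_use": 2, "brand_use": 1}; ptr == 4
```

The pointer makes a fix easy to verify (re-run from a known offset), but the loop is inherently serial, which is exactly the scalability limit described above.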

Incident handling complexity:

Incident handling was complex, requiring manual fixes in the database. The key types of incidents were:

  • Overcounting: New usage types mistakenly included, needing pipeline pauses, data removal, and table corrections.

  • Undercounting: Missing event types required retrieving backup data, causing delays due to processing limits.

  • Misclassification: Usage events categorized incorrectly, needing code fixes and full recalculation of deduplication and aggregation data.

  • Processing delays: Sequential scans or unexpected data slowed down the pipeline, delaying aggregation.

Rapid storage consumption:

MySQL on RDS couldn’t scale horizontally, so the instance size had to be doubled every 8-10 months as storage was consumed rapidly.

As the database grew to several TBs, maintenance became complex, with downtime risks for critical features.

Regular upgrades without downtime added complexity. A database split with sweepers helped clean old data but wasn’t sustainable long-term.
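The sweeper idea above amounts to purging intermediary rows older than a retention window. A minimal sketch, where the table is modeled as a list of dicts and the 90-day window is an assumed parameter, not Canva's actual value:

```python
# Hypothetical "sweeper" that drops rows older than a retention cutoff.
from datetime import date, timedelta

def sweep(rows, today, retention_days=90):
    """Return only rows newer than the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return [r for r in rows if r["day"] >= cutoff]

rows = [
    {"day": date(2024, 1, 1), "usage_count": 5},   # old, swept away
    {"day": date(2024, 6, 1), "usage_count": 7},   # recent, kept
]
kept = sweep(rows, today=date(2024, 6, 10))
# only the June row survives the 90-day window
```

This buys back storage, but each sweep is itself a scan over the table, which is why it wasn't sustainable long-term.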

Migrate data to DynamoDB

Lessons learned drove pipeline changes, moving raw usage events to DynamoDB to ease storage pressure.

However, the full migration was halted: it improved storage scalability but didn’t solve the processing-scalability problems caused by per-record database round trips.
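Moving raw events to a key-value store like DynamoDB shifts deduplication toward idempotent, key-based writes (in a real implementation, a `put_item` with a `ConditionExpression` such as `attribute_not_exists(event_id)`). An in-memory sketch of that idea, with assumed attribute names:

```python
# In-memory stand-in for a DynamoDB table keyed by event id, showing why
# key-based storage eases dedup: a conditional put makes writes idempotent.
# Attribute names (event_id, payload) are illustrative assumptions.

class EventTable:
    def __init__(self):
        self._items = {}  # partition key -> item

    def put_if_absent(self, event_id, payload):
        """Mimics put_item with attribute_not_exists(event_id)."""
        if event_id in self._items:
            return False  # duplicate delivery, write rejected
        self._items[event_id] = payload
        return True

table = EventTable()
first = table.put_if_absent("e1", {"type": "template_use"})   # stored
second = table.put_if_absent("e1", {"type": "template_use"})  # rejected dup
```

This removes the scan-and-match step for storage, but aggregating counts still required reading the data back, hence the round-trip problem remained.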

Simplify using OLAP and ELT

Canva switched to end-to-end calculations, processing entire months of data using Snowflake for large-scale analysis.

Usage data was extracted via a data replication pipeline, transformed with scheduled SQL jobs (using DBT), and aggregated with queries like:

```sql
select
    day_id,
    template_brand,
    sum(usage_count) as usage_count
from
group by
    day_id,
    template_brand
```

Key Steps Involved: 

  • Extracted JSON data into optimized SQL tables.

  • Deduplicated usage events.

  • Aggregated totals using GROUP BY queries.
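The three steps above amount to a compact ELT pass. A sketch over in-memory JSON records, with field names assumed for illustration:

```python
import json
from collections import Counter

# Raw JSON usage events as they might land from the replication pipeline
# (field names are illustrative assumptions).
raw = '''[
  {"event_id": "e1", "day_id": "2024-06-01", "template_brand": "acme", "usage_count": 1},
  {"event_id": "e1", "day_id": "2024-06-01", "template_brand": "acme", "usage_count": 1},
  {"event_id": "e2", "day_id": "2024-06-01", "template_brand": "acme", "usage_count": 1}
]'''

# 1. Extract: parse JSON into rows (Snowflake would land these in a table).
rows = json.loads(raw)

# 2. Deduplicate: keep one row per event_id.
deduped = list({r["event_id"]: r for r in rows}.values())

# 3. Aggregate: GROUP BY (day_id, template_brand), SUM(usage_count).
totals = Counter()
for r in deduped:
    totals[(r["day_id"], r["template_brand"])] += r["usage_count"]

# totals == {("2024-06-01", "acme"): 2}
```

The key difference from the old pipeline: each step operates on whole datasets in one pass rather than issuing a query per record.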

These changes improved performance and scalability and reduced operational complexity; Canva now tracks billions of content usages monthly.

New Architecture

Canva’s core tracking functionality is now built as a counting pipeline, divided into three stages:

  • Data collection: Usage events are gathered from various sources, validated, and filtered.

  • Deduplication: Duplicate events are removed, and classification rules are applied to track distinct usages.

  • Aggregation: The total deduplicated usages are calculated and grouped by dimensions like design template or brand.
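The three stages can be chained as one small pipeline sketch. The validation rule and the grouping dimension are assumptions for illustration, not Canva's actual logic:

```python
# Toy end-to-end counting pipeline: collect -> deduplicate -> aggregate.
from collections import Counter

VALID_TYPES = {"template_use", "brand_use"}  # assumed classification rules

def collect(events):
    """Gather, validate, and filter raw usage events."""
    return [e for e in events if e.get("id") and e.get("type") in VALID_TYPES]

def deduplicate(events):
    """Keep one event per id to track distinct usages."""
    return list({e["id"]: e for e in events}.values())

def aggregate(events, dimension="type"):
    """Count distinct usages grouped by a dimension (e.g. template or brand)."""
    return Counter(e[dimension] for e in events)

events = [
    {"id": "e1", "type": "template_use"},
    {"id": "e1", "type": "template_use"},   # duplicate delivery
    {"id": "e2", "type": "brand_use"},
    {"id": None, "type": "template_use"},   # fails validation
]
totals = aggregate(deduplicate(collect(events)))
# totals == {"template_use": 1, "brand_use": 1}
```

Each stage has a single responsibility, which is what makes incidents easier to isolate than in the old scan-and-update worker.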

Hope this was an insightful use case for your learning!

Learn AI in 5 Minutes a Day

AI Tool Report is one of the fastest-growing and most respected newsletters in the world, with over 550,000 readers from companies like OpenAI, Nvidia, Meta, Microsoft, and more.

Our research team spends hundreds of hours a week summarizing the latest news, and finding you the best opportunities to save time and earn more using AI.

Looking to promote your company, product, service, or event to 16,000+ TechOps Professionals? Let's work together.