
How Canva Scaled Its Media Service: From MySQL Limits to DynamoDB Migration

Canva’s media service, which handles billions of assets, evolved from a MySQL‑based microservice architecture to a DynamoDB‑backed one by incrementally migrating metadata, sharding tables, replicating changes in real time through an SQS‑driven worker pipeline, and executing a zero‑downtime cut‑over, dramatically improving latency and scalability.


Canva’s Media Service Architecture

Canva’s design platform stores billions of photos and graphics. The media service manages each media item’s ID, owner, library flag, external resource info, status, rich metadata (title, creator, tags, colors), and storage location. Reads far outnumber writes, and most media are rarely updated after creation.
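To make the shape of such a record concrete, here is an illustrative media item built from the fields listed above. The field names, structure, and values are all invented for illustration; this is not Canva's actual schema.

```python
# Illustrative media record based on the fields described above.
# Names and values are assumptions, not Canva's actual schema.
media_item = {
    "id": "MABCdef1234",
    "owner_id": "user-42",
    "in_library": True,                  # library flag
    "external_source": None,             # external resource info, if imported
    "status": "ACTIVE",
    "metadata": {                        # rich metadata
        "title": "Sunset over harbour",
        "creator": "user-42",
        "tags": ["sunset", "harbour"],
        "colors": ["#FF7043", "#263238"],
    },
    "storage": {"bucket": "example-media-bucket", "key": "u/user-42/MABCdef1234"},
}
```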

MySQL at Canva: Growing Pains

Initially, most resource‑oriented microservices used MySQL on AWS RDS. Canva first scaled vertically with larger instances, then horizontally with read‑only replicas. As the media table grew, schema changes took days, and online DDL caused such severe performance drops that the service could not handle production traffic while altering the schema.

GitHub’s online migration tool gh‑ost (https://github.com/github/gh-ost) allowed safe online schema changes, but new problems emerged:

MySQL 5.6 replication speed capped the write throughput that read replicas could keep up with.

Even with gh‑ost, schema changes could take up to six weeks.

RDS MySQL EBS volume size approached its 16 TB limit.

Each increase in EBS size added slight I/O latency, hurting tail latency.

A hot buffer pool was required to handle production traffic, limiting instance restarts.

The ext3 filesystem on older RDS instances limited table file size to 2 TB.

Exploring Alternative Solutions

By mid‑2017, media objects neared one billion and were growing exponentially. Canva evaluated several options and chose a gradual migration path rather than a risky big‑bang switch to a single new technology. In the interim, a series of optimizations bought more headroom on MySQL:

Migrate frequently modified metadata (e.g., titles, tags) to a JSON column managed by the media service.

Denormalize certain tables to reduce lock contention and joins.

Remove duplicate data (e.g., shorten S3 bucket names).

Drop foreign‑key constraints.

Change media import workflow to reduce metadata update frequency.

A simple sharding scheme was added to bypass the 2 TB ext3 limit and the replication throughput ceiling. It optimized for the common lookup‑by‑ID request while serving the less common list‑media queries with scatter‑gather reads across shards, as sketched below.
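A minimal sketch of what such an ID‑based shard router might look like, assuming numeric media IDs and a fixed shard count; all names here are hypothetical, not Canva's implementation:

```python
# Hypothetical ID-based shard router; not Canva's actual code.
SHARD_COUNT = 4  # assumption: the number of shards is fixed up front

def shard_for(media_id: int) -> str:
    """The common lookup-by-ID request touches exactly one shard."""
    return f"media_shard_{media_id % SHARD_COUNT}"

def shards_for_list_query() -> list[str]:
    """Less common list-media queries scatter across every shard."""
    return [f"media_shard_{i}" for i in range(SHARD_COUNT)]
```

Modulo routing keeps single‑item lookups cheap at the cost of fanning the rarer list queries out to all shards.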

Real‑Time Migration to DynamoDB

To move all existing media to DynamoDB, keep newly created and updated media in sync, and relieve load on MySQL, Canva considered three approaches:

Dual writes at request time to both MySQL and DynamoDB.

Build and replay an ordered log of all create/update operations.

Use AWS Database Migration Service (DMS).

Canva settled on a design that gave full control over data mapping, allowed incremental real‑time migration, and prioritized recent, frequently accessed media. The pipeline, sketched in code after this list, worked as follows:

Send messages to an AWS SQS queue for each media create, update, or read event (without the payload).

Worker instances consume messages, fetch the current state from the MySQL primary, and write to DynamoDB as needed.

High‑priority queue handles writes; low‑priority queue handles reads.
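Under these assumptions, a worker of this kind could look like the following boto3 sketch. The queue URL, table name, message format, and the MySQL fetch helper are all placeholders, not Canva's actual code:

```python
import json
from typing import Optional

import boto3

sqs = boto3.client("sqs")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("media")  # assumed table name

# Placeholder queue URL; a second, low-priority queue would handle read events.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/media-high-priority"

def fetch_from_mysql_primary(media_id: str) -> Optional[dict]:
    """Assumed helper: read the item's current state from the MySQL primary."""
    raise NotImplementedError

def run_worker() -> None:
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            # Events carry only the media ID, never the payload, so the worker
            # always copies the latest state rather than a possibly stale one.
            media_id = json.loads(msg["Body"])["media_id"]
            item = fetch_from_mysql_primary(media_id)
            if item is not None:
                table.put_item(Item=item)  # idempotent upsert: replays are safe
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Because each worker re‑reads the primary before writing, reprocessing a message is harmless, which makes SQS's at‑least‑once delivery safe.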

Testing in Production

Before fully switching to DynamoDB for consistent reads, Canva ran dual‑read comparisons, matching MySQL results against the new DynamoDB service. After fixing replication issues, they gradually served individual media from DynamoDB while falling back to MySQL for media not yet migrated.
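A minimal sketch of such a dual‑read comparison follows; the two reader functions are supplied by the caller and are assumptions, not Canva's API:

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("media-dual-read")
Reader = Callable[[str], Optional[dict]]

def get_media(media_id: str, read_dynamodb: Reader, read_mysql: Reader) -> Optional[dict]:
    new = read_dynamodb(media_id)
    old = read_mysql(media_id)
    if new is not None and old is not None and new != old:
        # Mismatches are logged for investigation rather than failing the request.
        log.warning("dual-read mismatch for media %s", media_id)
    # Serve from DynamoDB once migrated; fall back to MySQL otherwise.
    return new if new is not None else old
```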

Because media were copied one by one, full‑catalog queries (e.g., listing all of a user’s media) could not be served from DynamoDB until a back‑fill scan had copied every remaining item from MySQL. Once that scan completed, all reads were served from DynamoDB.

Zero‑Downtime Cut‑Over and Rollback Strategy

The riskiest step was switching all writes to DynamoDB. Canva mitigated risk with several safeguards:

Integration tests covering both mixed (MySQL + DynamoDB) and pure DynamoDB write paths.

Ported other integration tests to the DynamoDB implementation.

Local development testing of the new code.

End‑to‑end test suite validation.

Created a run‑book with a tagging system to revert reads to MySQL within seconds if needed (see the sketch after this list).

Conducted run‑book rehearsals in staging before production rollout.
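One plausible shape for that seconds‑fast revert is a runtime flag consulted on every read; the flag name and mechanism here are assumptions, not Canva's actual tooling:

```python
# Hypothetical runtime switch for the read path; flipping the flag reverts
# reads to MySQL without a redeploy or restart.
READ_BACKEND_FLAG = "media.reads.backend"  # assumed flag name

def read_backend(flags: dict) -> str:
    """Return which store serves reads; defaults to DynamoDB after cut-over."""
    return flags.get(READ_BACKEND_FLAG, "dynamodb")
```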

The production switch proceeded without downtime or errors, and media‑service latency improved significantly.

Lessons Learned

Prioritize lazy migration based on access patterns; move hot data first.

Run and test the migration directly in production whenever possible; real traffic surfaces issues that staging environments cannot.

Is DynamoDB the Right Choice?

Since the migration, Canva’s monthly active users have more than doubled, and DynamoDB has scaled automatically at lower cost than the previous MySQL setup. Trade‑offs include writing custom parallel‑scan code for back‑fills (sketched below) and losing ad‑hoc SQL queries against replicas, which change‑data‑capture (CDC) pipelines into the data warehouse now cover.
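DynamoDB's segmented scan is the standard building block for such back‑fills. A boto3 sketch follows; the table name, segment count, and per‑item work are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("media")  # assumed table name
TOTAL_SEGMENTS = 8               # one worker thread per segment

def process(item: dict) -> None:
    raise NotImplementedError  # placeholder for the actual back-fill work

def scan_segment(segment: int) -> None:
    # Each segment covers a disjoint slice of the table, so workers never overlap.
    kwargs = {"Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
    while True:
        page = table.scan(**kwargs)
        for item in page["Items"]:
            process(item)
        if "LastEvaluatedKey" not in page:
            break  # this segment is exhausted
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    # list() forces iteration so any worker exception propagates.
    list(pool.map(scan_segment, range(TOTAL_SEGMENTS)))
```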

Global secondary indexes are required to support existing access patterns and are often built by manually concatenating attributes into composite keys. Fortunately, core media metadata is now stable, and new access patterns are rare.
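A concatenated‑key GSI typically looks like the following; the attribute names and encoding are illustrative, not Canva's schema:

```python
def owner_sort_key(in_library: bool, created_at: str) -> str:
    # Concatenating two attributes lets a single GSI answer queries like
    # "a user's library media, newest first" with one Query call.
    return f"{'LIB' if in_library else 'UPLOAD'}#{created_at}"

item = {
    "id": "MABCdef1234",
    "owner_id": "user-42",                                       # GSI partition key
    "owner_sort": owner_sort_key(True, "2018-06-01T12:00:00Z"),  # GSI sort key
}
```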

Canva now stores over 250 billion uploaded media items, with 50 million uploads per day.

Original article: https://www.canva.dev/blog/engineering/from-zero-to-50-million-uploads-per-day-scaling-media-at-canva/

Tags: Microservices, Scalability, MySQL, Database Migration, Media Service, DynamoDB, Canva
Written by ITPUB, the official ITPUB account sharing technical insights, community news, and exciting events.
