
Mastering Vitess: Scaling MySQL with Cloud‑Native Sharding and Resharding

This article introduces Vitess, a cloud‑native MySQL sharding middleware, explains its key features, architecture, core concepts such as cells, keyspaces, shards and vindexes, and provides a step‑by‑step guide to performing a reshard from two to four shards while highlighting operational challenges and practical recommendations.

dbaplus Community

Introduction

Vitess is an open‑source MySQL‑compatible database middleware that provides horizontal scalability similar to NoSQL systems. It originated in 2010, joined the CNCF in 2018 and graduated in 2019. Vitess is used in production at large‑scale services such as YouTube, Slack, Square and Pinterest.

Key Features

Scalability : Vitess performs sharding itself, so shards can be added as data grows with few or no application‑level changes.

Performance : Reduces MySQL connection memory usage through connection pooling, handles thousands of concurrent connections, rewrites expensive queries, and lets identical in‑flight queries share one result to avoid duplicate backend hits.

Operations : Automatic primary failover, backup support and a distributed metadata service hide topology changes from applications.

Cloud‑native : Fully containerized, dynamically orchestrated and designed for micro‑service environments, making it a natural fit for Kubernetes.
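The "avoid duplicate backend hits" point under Performance can be illustrated with a small sketch. This is not Vitess code: the `Consolidator` class, the `slow_backend` function and the `hot_row` table name are hypothetical, and the sketch only shows the general idea of letting identical queries that overlap in time share a single backend call.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class Consolidator:
    """Let identical queries that overlap in time share one backend call."""

    def __init__(self, backend: Callable[[str], Awaitable[str]]) -> None:
        self._backend = backend
        self._in_flight: Dict[str, asyncio.Task] = {}
        self.backend_calls = 0  # counter kept only for the demo

    async def execute(self, sql: str) -> str:
        task = self._in_flight.get(sql)
        if task is None:
            # First caller for this SQL text: start the real backend call.
            self.backend_calls += 1
            task = asyncio.create_task(self._run(sql))
            self._in_flight[sql] = task
        # Later callers simply wait for the call that is already running.
        return await task

    async def _run(self, sql: str) -> str:
        try:
            return await self._backend(sql)
        finally:
            # Once finished, the next identical query goes to the backend again.
            self._in_flight.pop(sql, None)


async def slow_backend(sql: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a MySQL round trip
    return f"rows for: {sql}"


async def demo():
    consolidator = Consolidator(slow_backend)
    results = await asyncio.gather(
        *(consolidator.execute("SELECT * FROM hot_row") for _ in range(5))
    )
    return results, consolidator.backend_calls


results, backend_calls = asyncio.run(demo())
print(backend_calls, "backend call(s) for", len(results), "identical queries")
```

Nothing is kept after a query completes, so a later identical query still reaches the backend. That keeps the sketch free of cache-invalidation concerns.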

Architecture

A typical deployment runs the following containers in Kubernetes (or any other orchestrator):

vttablet : Wraps a MySQL instance and manages its primary/replica topology.

Topology server : Stores Vitess metadata (etcd, ZooKeeper or Consul).

vtgate : Stateless proxy that routes queries to the correct shard; it can be scaled horizontally.

vtctld : Web UI for inspecting metadata and managing workflows.

Vitess architecture diagram

Core Concepts

Cell : A network‑isolated region (data‑center, availability zone or a Kubernetes cluster) that provides fault isolation.

Keyspace : Logical database. In an unsharded deployment it maps to a single MySQL cluster; when sharded it maps to a set of identical MySQL clusters.

Keyspace ID : Numeric identifier derived from row data; determines the shard that stores the row.

Shard : A half‑open range of Keyspace IDs, [Begin, End), served by one primary and multiple replicas, possibly spanning several cells.

Vindex : Function that maps column values to Keyspace IDs. Defined by a sharding column and a sharding function (e.g., hash).

Sharding functions : Built‑in (hash, range, lookup) or custom functions used by vindexes to compute Keyspace IDs.
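The relationship between a vindex, a Keyspace ID and a shard can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the Vitess implementation: `keyspace_id` uses a truncated SHA-256 as a stand-in for the built-in hash function, the `Shard` type and `find_shard` helper are hypothetical, and the upper bound "FF" from this article's notation is treated as "no upper bound" (Vitess itself writes such ranges as `-80` and `80-`).

```python
import hashlib
from typing import List, NamedTuple, Optional


class Shard(NamedTuple):
    name: str
    begin: bytes          # inclusive lower bound
    end: Optional[bytes]  # exclusive upper bound; None means "no upper bound"


def keyspace_id(value: int) -> bytes:
    """Stand-in vindex: map a sharding-column value to an 8-byte Keyspace ID.

    Assumption: truncated SHA-256, chosen only for illustration.
    """
    return hashlib.sha256(str(value).encode()).digest()[:8]


def find_shard(ksid: bytes, shards: List[Shard]) -> Shard:
    """Return the shard whose [begin, end) range contains the Keyspace ID."""
    for shard in shards:
        # bytes compare lexicographically, so b"\x80" <= b"\x80\x00..." holds.
        if shard.begin <= ksid and (shard.end is None or ksid < shard.end):
            return shard
    raise LookupError(f"no shard covers keyspace id {ksid.hex()}")


TWO_SHARDS = [
    Shard("00-80", b"", bytes.fromhex("80")),
    Shard("80-FF", bytes.fromhex("80"), None),
]

for user_id in (1, 2, 3):
    ksid = keyspace_id(user_id)
    print(user_id, ksid.hex(), find_shard(ksid, TWO_SHARDS).name)
```

Because the ranges are half-open and adjacent, every Keyspace ID falls into exactly one shard, which is what lets a proxy such as vtgate pick a single target for a row.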

Resharding Process (2 → 4 shards)

Start with a 2‑shard keyspace (e.g., 00‑80 and 80‑FF) and add a replica to each existing shard.

Provision the four new shards (e.g., 00‑40, 40‑80, 80‑C0, C0‑FF) and stop replication on the replicas added to the old shards so that they hold a static snapshot for the data copy.

Copy static data from the old shards to the new shards, routing each row according to its Keyspace ID (e.g., rows in 00‑80 are split between 00‑40 and 40‑80).

Start filtered binlog replication from the point where the static copy finished; the filter continues to route rows by Keyspace ID to the appropriate new shard.

Switch traffic: first redirect read traffic to the replicas of the new shards, then move write traffic to the new shards' primaries.

After a monitoring period, decommission the old shard resources.

Resharding step 1
Resharding step 2
Resharding step 3
Resharding step 4
Resharding step 5
Resharding step 6
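The copy and catch-up phases of the resharding steps can be simulated in memory to make the data flow concrete. This is a toy model under stated assumptions: shards are plain dictionaries, the binlog is a hypothetical list of `(operation, primary key, row)` tuples, `keyspace_id` is a truncated SHA-256 stand-in for the real hash function, and "FF" is treated as an open upper bound.

```python
import hashlib
from typing import Dict, Optional, Tuple

Range = Tuple[bytes, Optional[bytes]]  # [begin, end); end=None means unbounded

OLD: Dict[str, Range] = {"00-80": (b"", b"\x80"), "80-FF": (b"\x80", None)}
NEW: Dict[str, Range] = {
    "00-40": (b"", b"\x40"), "40-80": (b"\x40", b"\x80"),
    "80-C0": (b"\x80", b"\xc0"), "C0-FF": (b"\xc0", None),
}


def keyspace_id(pk: int) -> bytes:
    # Assumption: stand-in hash, not the function Vitess uses.
    return hashlib.sha256(str(pk).encode()).digest()[:8]


def owner(ksid: bytes, ranges: Dict[str, Range]) -> str:
    for name, (begin, end) in ranges.items():
        if begin <= ksid and (end is None or ksid < end):
            return name
    raise LookupError(ksid.hex())


def apply_event(rows: Dict[int, str], op: str, pk: int, row: Optional[str]) -> None:
    if op == "delete":
        rows.pop(pk, None)
    else:  # insert or update
        rows[pk] = row


# Existing 2-shard keyspace with 1000 rows.
old_data: Dict[str, Dict[int, str]] = {name: {} for name in OLD}
for pk in range(1000):
    old_data[owner(keyspace_id(pk), OLD)][pk] = f"row-{pk}"

# Static copy: route every snapshot row to a new shard by Keyspace ID.
new_data: Dict[str, Dict[int, str]] = {name: {} for name in NEW}
for rows in old_data.values():
    for pk, row in rows.items():
        new_data[owner(keyspace_id(pk), NEW)][pk] = row

# Writes that reach the old shards after the snapshot was taken.
binlog = [("insert", 1000, "row-1000"), ("update", 7, "row-7-v2"), ("delete", 8, None)]
for op, pk, row in binlog:
    apply_event(old_data[owner(keyspace_id(pk), OLD)], op, pk, row)

# Filtered replication: replay the same events, again routed by Keyspace ID.
for op, pk, row in binlog:
    apply_event(new_data[owner(keyspace_id(pk), NEW)], op, pk, row)


def merged(shards: Dict[str, Dict[int, str]]) -> Dict[int, str]:
    out: Dict[int, str] = {}
    for rows in shards.values():
        out.update(rows)
    return out


print("old and new keyspaces match:", merged(old_data) == merged(new_data))
```

Once the two sides match and replication lag is small, the traffic switch becomes a routing change rather than a data movement step.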

Production Deployment Tips

Management tooling must be able to operate both Vitess resources (vtgate, vttablet, vtctld) and the underlying Kubernetes objects.

Migration utilities should copy data from a vanilla MySQL cluster into Vitess and include verification steps (e.g., checksum comparison).

Deploy a binlog‑capture service such as Binlake to stream changes to downstream systems (Kafka, Pulsar) without exposing internal topology.
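The checksum comparison mentioned for migration utilities could look roughly like the following. It is a sketch only: the row layout `(id, name, email)`, the chunk size and the function names are assumptions, and a real tool would read rows from the source MySQL cluster and from Vitess instead of from in-memory lists.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, str, str]  # hypothetical layout: (primary key, name, email)


def chunk_checksums(rows: Iterable[Row], chunk_size: int) -> Dict[int, str]:
    """Checksum rows in primary-key chunks so a mismatch can be localized."""
    hashers: Dict[int, "hashlib._Hash"] = {}
    for row in sorted(rows):  # sort so that row order does not matter
        chunk = row[0] // chunk_size
        hasher = hashers.get(chunk)
        if hasher is None:
            hasher = hashers[chunk] = hashlib.sha256()
        hasher.update(repr(row).encode())
    return {chunk: h.hexdigest() for chunk, h in hashers.items()}


def mismatched_chunks(source: Iterable[Row], target: Iterable[Row],
                      chunk_size: int = 100) -> List[int]:
    """Return the chunk numbers whose checksums differ between both sides."""
    src = chunk_checksums(source, chunk_size)
    dst = chunk_checksums(target, chunk_size)
    return sorted(c for c in src.keys() | dst.keys() if src.get(c) != dst.get(c))


source = [(i, f"user{i}", f"user{i}@example.com") for i in range(250)]

target = list(source)
target[120] = (120, "user120", "corrupted")  # a damaged row in chunk 1
del target[249]                              # a missing row in chunk 2

print("chunks to re-check:", mismatched_chunks(source, target))
```

Checksumming per chunk instead of per table means a failed verification points to a narrow primary-key range that can be re-copied or inspected row by row.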

Challenges and Recommendations

Rolling upgrades of vtgate : Update the container image, adjust pod labels so the Service selector skips old pods, and let the ReplicaSet create new pods before terminating the old ones.

Complex SQL support : Validate that joins, prepared statements and stored procedures work as expected; some edge‑cases may require query rewriting rules.

High‑throughput workloads : Use dedicated vtgate pods and physical isolation (separate node pools) to avoid contention.

etcd stability : Store large VSchema values outside etcd and move cell‑level VSchema handling out of the hot path to avoid out‑of‑memory (OOM) failures.

Observability : Instrument every Vitess role (vtgate, vttablet, vtctld) with metrics (Prometheus) and logs (ELK) and set alerts for latency, replication lag and topology changes.

Resharding familiarity : Practice the resharding workflow in a staging environment; know how to locate and fix data‑routing bugs.

Scheduler reliability : Leverage Kubernetes (or Nomad) for robust pod scheduling, health‑checking and automatic restarts.

Start with a pilot migration to gain hands‑on experience before scaling to production.
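For the recommendation about locating data-routing bugs, one simple check is to audit each shard for rows whose Keyspace ID falls outside the shard's range. The sketch below assumes the Keyspace IDs stored on each shard are already available as byte strings. The `misrouted_rows` helper and the sample data are hypothetical, and "FF" is treated as an open upper bound.

```python
from typing import Dict, List, Optional, Tuple

Range = Tuple[bytes, Optional[bytes]]  # [begin, end); end=None means unbounded

RANGES: Dict[str, Range] = {
    "00-40": (b"", b"\x40"), "40-80": (b"\x40", b"\x80"),
    "80-C0": (b"\x80", b"\xc0"), "C0-FF": (b"\xc0", None),
}


def in_range(ksid: bytes, key_range: Range) -> bool:
    begin, end = key_range
    return begin <= ksid and (end is None or ksid < end)


def misrouted_rows(found: Dict[str, List[bytes]],
                   ranges: Dict[str, Range]) -> List[Tuple[str, bytes]]:
    """List (shard, keyspace id) pairs stored on a shard that does not own them."""
    return [
        (shard, ksid)
        for shard, ksids in found.items()
        for ksid in ksids
        if not in_range(ksid, ranges[shard])
    ]


# Keyspace IDs as they might be read back from each shard after a reshard.
found = {
    "00-40": [bytes.fromhex("1f00000000000000"), bytes.fromhex("3fffffffffffffff")],
    "40-80": [bytes.fromhex("4000000000000000"), bytes.fromhex("9a00000000000000")],
    "80-C0": [bytes.fromhex("bfffffffffffffff")],
    "C0-FF": [bytes.fromhex("ffffffffffffffff")],
}

for shard, ksid in misrouted_rows(found, RANGES):
    print(f"shard {shard} holds keyspace id {ksid.hex()} outside its range")
```

Running a check like this in staging after each practice reshard gives a concrete signal before any traffic is switched.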
