Alibaba’s Secret to Scaling GitLab: Distributed Sharding and Performance Boosts

This article details how Alibaba Group transformed its GitLab deployment from a single‑node bottleneck into a horizontally scalable, sharded architecture that handles millions of daily requests with high availability, improved performance, and robust data safety.

21CTO

Background

Code is the starting point of DevOps, and a code hosting platform must keep that code safe and available while providing basic collaboration services such as merge requests (MR) and issues.

Alibaba Group’s GitLab deployment, based on GitLab Community Edition 8.3, supports tens of thousands of developers and hundreds of thousands of projects. It serves tens of millions of requests per day and stores terabytes of data, far beyond what a single node running the upstream version can handle.

Challenges

GitLab’s design stores repository data on the local file system and its core components (libgit2, git, grit) operate directly on that file system, making horizontal scaling difficult.

In early 2015 the single‑node load surged. The initial mitigation copied all repositories to several machines, which reduced load but did not solve storage limits.

First Refactor Attempt

The team tried to replace the local file system with network‑shared storage, but performance suffered and the required C/C++ changes were costly.

New Refactor Strategy

The new strategy shards repositories by their “namespace_path/repo_path” identifier and routes each repository, along with every request for it, to a dedicated machine.

[Figure: architecture diagram]

Key Components

1. Sharding‑Proxy‑Api records the mapping between repositories and target machines.

2. Proxy uses the API to route requests to the correct node.

3. Git Cluster consists of groups of three nodes (master, mirror, backup). Master handles write requests, mirror handles reads, backup serves as hot standby.
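
The routing described by these three components can be sketched in a few lines. This is a minimal illustration in Python, not the production code: the `SHARD_MAP` table, the host names, and the `route` function are hypothetical, and treating `git-receive-pack` (the Git service behind a push) as the only write operation is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardGroup:
    """One three-node group from the Git Cluster."""
    master: str  # handles write requests
    mirror: str  # handles read requests
    backup: str  # hot standby, never chosen by route()


# Hypothetical mapping, standing in for what Sharding-Proxy-Api records.
SHARD_MAP = {
    "team-a/webapp": ShardGroup("git-01-master", "git-01-mirror", "git-01-backup"),
    "team-b/tools": ShardGroup("git-02-master", "git-02-mirror", "git-02-backup"),
}

# Assumption: a push (git-receive-pack) is the only write; everything else is a read.
WRITE_OPERATIONS = {"git-receive-pack"}


def route(repo_id: str, operation: str) -> str:
    """Return the host that should serve `operation` for `repo_id`.

    `repo_id` is the "namespace_path/repo_path" identifier.
    """
    group = SHARD_MAP.get(repo_id)
    if group is None:
        raise KeyError(f"no shard recorded for {repo_id!r}")
    return group.master if operation in WRITE_OPERATIONS else group.mirror
```

With this table, a push to `team-a/webapp` is sent to `git-01-master`, while a clone or fetch (`git-upload-pack`) goes to `git-01-mirror`.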

Ensuring Reliability

Sharding‑Proxy‑Api, built on the Martini framework, receives GitLab notifications in real time to keep the mapping accurate; its round‑trip time is typically under 5 ms.

Sharding uses weighted hashing of namespace_path and repository size to balance load and storage across nodes.
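
The article does not say which weighted hashing scheme is used. As one plausible illustration, the sketch below uses weighted rendezvous (highest‑random‑weight) hashing, a technique swapped in here for demonstration. The function names are hypothetical, and deriving each node's weight from its free storage is an assumption about how repository size could feed into placement.

```python
import hashlib
import math


def _unit_hash(key: str, node: str) -> float:
    """Map (key, node) to a deterministic float strictly between 0 and 1."""
    digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
    value = int.from_bytes(digest[:8], "big") >> 12  # keep 52 bits
    return (value + 0.5) / 2**52


def pick_node(namespace_path: str, weights: dict) -> str:
    """Pick a node for a namespace using weighted rendezvous hashing.

    `weights` maps node name to a positive weight, e.g. free storage in GB
    (an assumption made for this sketch). A node with twice the weight
    receives roughly twice as many namespaces.
    """
    if not weights:
        raise ValueError("at least one node is required")

    def score(node: str) -> float:
        # log of a value in (0, 1) is negative, so the score is positive.
        return -weights[node] / math.log(_unit_hash(namespace_path, node))

    return max(sorted(weights), key=score)
```

One property that makes this family of schemes attractive for sharding is that changing one node's weight only moves namespaces to or from that node; the rest of the mapping stays put.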

Cross‑shard operations (project transfer, fork, merge request) are handled by fetching required data from other nodes via SSH/HTTP, with a future plan to use RPC.

Performance Improvements

The SSH service was reimplemented in Go and deployed on the proxy and GitLab nodes. This reduced server load, eliminated bugs, and keeps SSH access working even when the native sshd fails.

[Figure: CPU usage after deploying the new SSH service]

Other high‑traffic requests (authentication, SSH‑key lookup) were optimized with MD5 hashing and indexing, with plans to rewrite these paths in Go or Java.
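
The article gives no detail on how MD5 hashing and indexing were applied. One common pattern, shown here purely as an assumption, is to index SSH public keys by a fixed‑length MD5 fingerprint of the key blob so that a lookup is a single indexed equality match rather than a comparison of full key text. The `KeyIndex` class and `md5_fingerprint` function are hypothetical, and the dictionary stands in for an indexed database column.

```python
import base64
import hashlib


def md5_fingerprint(public_key_line: str) -> str:
    """Return the hex MD5 digest of the key blob in an OpenSSH public key line.

    The line has the form "<type> <base64 blob> [comment]"; only the decoded
    blob is hashed, so the comment does not affect the fingerprint.
    """
    parts = public_key_line.split()
    if len(parts) < 2:
        raise ValueError("expected '<type> <base64 blob> [comment]'")
    blob = base64.b64decode(parts[1], validate=True)
    return hashlib.md5(blob).hexdigest()


class KeyIndex:
    """In-memory stand-in for a database index on the fingerprint column."""

    def __init__(self) -> None:
        self._user_by_fingerprint = {}

    def add(self, user: str, public_key_line: str) -> None:
        self._user_by_fingerprint[md5_fingerprint(public_key_line)] = user

    def lookup(self, public_key_line: str):
        """Return the owning user, or None if the key is unknown."""
        return self._user_by_fingerprint.get(md5_fingerprint(public_key_line))
```

MD5 serves here only as a short lookup key, not as a security control; a real lookup would still confirm the full key after the indexed match.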

Data Safety

Each shard has one master and two replicas (the mirror and the backup), so data can be recovered even if one or two of the three machines fail.

Cross‑data‑center backup uses system hooks, a hook‑receiver service, Alibaba Cloud MNS, and a Go‑implemented RPC service based on grpc‑go and protobuf to achieve near‑real‑time synchronization with 99.9‑99.99 % consistency.
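
The article does not explain how the consistency figure is measured. A minimal sketch of one way such a number could be computed, assuming each data center can report every repository's refs and the commit they point to, is shown below. The `consistency_ratio` function and the shape of its inputs are hypothetical.

```python
def consistency_ratio(primary: dict, replica: dict) -> float:
    """Return the fraction of primary repositories whose refs match the replica.

    Both arguments map a repository identifier to a dict of
    ref name -> commit SHA. A repository counts as consistent only if
    the replica holds exactly the same refs at the same commits.
    """
    if not primary:
        return 1.0  # nothing to synchronize, so nothing is out of sync
    matching = sum(
        1 for repo_id, refs in primary.items() if replica.get(repo_id) == refs
    )
    return matching / len(primary)
```

For example, if the primary holds two repositories and the replica lags behind on one of them, the function returns 0.5. Repositories reported as mismatched would then be candidates for resynchronization.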

[Figure: backup data center architecture]

Availability Enhancements

Log inspection with Alibaba’s monitoring tools alerts on 5xx errors; issues are categorized (sharding bugs, high‑concurrency faults, database errors).

Comprehensive server monitoring includes CPU, memory, load, network checks, port health, message‑queue length, DB connections, error request rates, and data‑consistency checks.

Automatic failover swaps master/backup roles within five minutes when a node is detected down.
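
The role swap itself can be sketched as follows. This is an illustration under stated assumptions, not the actual failover logic: the `ShardGroup` type, the `fail_over` function, and the `is_healthy` callback are hypothetical, and detection, data catch‑up, and the five‑minute window are outside the scope of the sketch.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class ShardGroup:
    master: str  # handles write requests
    mirror: str  # handles read requests
    backup: str  # hot standby


def fail_over(group: ShardGroup, is_healthy: Callable[[str], bool]) -> ShardGroup:
    """Swap the master and backup roles when the master is down.

    `is_healthy` is a hypothetical health probe supplied by monitoring.
    Returns the group unchanged if the master is healthy.
    """
    if is_healthy(group.master):
        return group
    if not is_healthy(group.backup):
        raise RuntimeError("master is down and the backup is not available")
    # The failed master becomes the standby and rejoins once repaired.
    return replace(group, master=group.backup, backup=group.master)
```

Returning a new `ShardGroup` rather than mutating the old one makes it straightforward to publish the updated roles to the mapping service in one step.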

Modular Deployment

Adopting a cell‑based architecture isolates each shard’s services, allowing seamless data‑center failover and easier integration with Alibaba Cloud services or acquisitions.

Future Work

Address occasional massive cache releases that cause temporary performance drops.

Automate release and scaling processes for the growing number of machines.

Complete the RPC replacement to separate web‑service resource consumption from Git operations.

Conclusion

Monitoring shows a four‑fold increase in request volume, 130 % growth in projects, and 56 % growth in users, while success rates rose from 99.5 % to over 99.99 %.

Alibaba’s GitLab architecture now supports a million‑scale user base and will continue to evolve to serve cloud developers.
