Scaling Alibaba Group GitLab: Distributed Architecture, Sharding, and Operational Practices
This article describes how Alibaba Group transformed its GitLab platform from a single‑node deployment into a distributed, sharded architecture, adding proxy services, performance optimizations, multi‑datacenter backup, and automated operations to serve tens of thousands of developers and millions of daily requests at much higher throughput.
Background – Code is the foundation of DevOps, and Alibaba's GitLab, built on GitLab Community Edition 8.3, serves tens of thousands of developers with millions of daily requests and terabytes of storage, far exceeding the single‑node limits of the upstream project.
Challenges – The monolithic design stores repositories on local disks, which rules out horizontal scaling. Early attempts to sync repositories across multiple machines reduced request load but not storage pressure, because each machine still had to hold the synchronized repositories, and the synchronization itself added overhead.
Transformation Strategy – Starting in 2015, the team moved away from local‑disk dependence toward a distributed solution based on sharding and slicing. Repository names (namespace_path/repo_path) are used to route requests to specific machines, enabling horizontal scaling.
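The routing idea can be sketched in Go as follows. This is a minimal illustration, not Alibaba's implementation: the URL layout (`/namespace_path/repo_path.git/...`), the `shardTable` type, and the shard names are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// repoKey extracts "namespace_path/repo_path" from a Git-over-HTTP URL path
// such as "/group/project.git/info/refs". The URL layout is an assumption.
func repoKey(urlPath string) (string, error) {
	parts := strings.SplitN(strings.TrimPrefix(urlPath, "/"), "/", 3)
	if len(parts) < 2 || parts[0] == "" {
		return "", fmt.Errorf("cannot parse repository from %q", urlPath)
	}
	repo := strings.TrimSuffix(parts[1], ".git")
	if repo == "" {
		return "", fmt.Errorf("empty repository name in %q", urlPath)
	}
	return parts[0] + "/" + repo, nil
}

// shardTable is a hypothetical in-memory mapping from repository key to shard.
type shardTable map[string]string

// route resolves the shard that should serve the request for urlPath.
func (t shardTable) route(urlPath string) (string, error) {
	key, err := repoKey(urlPath)
	if err != nil {
		return "", err
	}
	shard, ok := t[key]
	if !ok {
		return "", fmt.Errorf("no shard recorded for %q", key)
	}
	return shard, nil
}

func main() {
	table := shardTable{"group/project": "shard-01"}
	shard, err := table.route("/group/project.git/info/refs")
	fmt.Println(shard, err)
}
```

Because the key is derived only from the repository name, every request for the same repository resolves to the same shard, and capacity can be added by introducing new shards and mapping new repositories to them.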
Architecture Overview – The new design introduces a Sharding‑Proxy‑Api that maps repositories to target nodes, a Proxy that forwards requests based on that mapping, and a Git Cluster composed of master, mirror, and backup nodes per shard. Master handles writes, mirror handles reads, and backup provides hot standby.
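The read/write split inside a shard might look like the sketch below. The `shard` struct and host names are hypothetical; the only facts relied on are the roles named above and Git's standard service names, where `git-receive-pack` serves pushes and `git-upload-pack` serves clones and fetches.

```go
package main

import "fmt"

// shard lists the nodes of one shard by role (host names are placeholders).
type shard struct {
	Master string // handles writes
	Mirror string // handles reads
	Backup string // hot standby, not used for normal traffic
}

// pickNode returns the node that should serve the given Git service.
func pickNode(s shard, service string) (string, error) {
	switch service {
	case "git-receive-pack": // push: a write, so it must reach the master
		return s.Master, nil
	case "git-upload-pack", "git-upload-archive": // clone, fetch, archive: reads
		return s.Mirror, nil
	default:
		return "", fmt.Errorf("unknown git service %q", service)
	}
}

func main() {
	s := shard{Master: "git-01-master", Mirror: "git-01-mirror", Backup: "git-01-backup"}
	for _, svc := range []string{"git-upload-pack", "git-receive-pack"} {
		node, err := pickNode(s, svc)
		fmt.Println(svc, "->", node, err)
	}
}
```

The backup node is deliberately never returned here, since in the described design it only takes over when the master fails.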
Ensuring Correctness – The Sharding‑Proxy‑Api, built on the Martini framework, answers mapping lookups in under 5 ms and receives real‑time GitLab notifications to keep its mapping data accurate. Weighted sharding balances storage and request load across shards. Cross‑shard operations (e.g., project transfer, fork) are currently handled via SSH/HTTP calls; an RPC‑based replacement is in progress.
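The article does not say how the weights are applied. One plausible reading, sketched below under that assumption, is to place each new repository on the shard with the lowest ratio of repositories to weight. The `shardInfo` fields and the placement rule are illustrative only.

```go
package main

import (
	"errors"
	"fmt"
)

// shardInfo describes one shard for placement purposes (hypothetical fields).
type shardInfo struct {
	Name   string
	Weight int // relative capacity; zero or negative means "do not place here"
	Repos  int // repositories already placed on this shard
}

// pickShard returns the index of the shard with the lowest Repos/Weight
// ratio. Ties keep the earlier shard in the slice.
func pickShard(shards []shardInfo) (int, error) {
	best := -1
	for i, s := range shards {
		if s.Weight <= 0 {
			continue
		}
		if best < 0 {
			best = i
			continue
		}
		b := shards[best]
		// s.Repos/s.Weight < b.Repos/b.Weight, cross-multiplied to stay in integers.
		if s.Repos*b.Weight < b.Repos*s.Weight {
			best = i
		}
	}
	if best < 0 {
		return -1, errors.New("no shard with positive weight")
	}
	return best, nil
}

func main() {
	shards := []shardInfo{
		{Name: "shard-a", Weight: 1},
		{Name: "shard-b", Weight: 3},
	}
	for i := 0; i < 4; i++ {
		idx, err := pickShard(shards)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		shards[idx].Repos++
	}
	for _, s := range shards {
		fmt.Printf("%s holds %d repositories\n", s.Name, s.Repos)
	}
}
```

With weights 1 and 3, four placements end up split one to three, which is the proportional behaviour a weight is meant to express. A production system would likely weigh storage and request rate rather than a plain repository count.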
Performance Improvements – The native SSH service was rewritten in Go and deployed on proxy and cluster nodes, eliminating high‑CPU bugs and reducing load. Additional high‑traffic endpoints (authentication, SSH‑key lookup) were optimized with MD5 hashing and indexing, with plans for further rewrites in Go or Java.
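The source does not detail the MD5 optimization. A common technique that fits the description, shown here as an assumption, is to store the MD5 fingerprint of each public key in an indexed column and look keys up by fingerprint instead of comparing full key text. The map below stands in for that database index, and the key blob in `main` is a placeholder, not a valid SSH key.

```go
package main

import (
	"crypto/md5"
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// md5Fingerprint computes the colon-separated MD5 fingerprint of a public key
// given in authorized_keys form: "<type> <base64 blob> [comment]".
func md5Fingerprint(authorizedKey string) (string, error) {
	fields := strings.Fields(authorizedKey)
	if len(fields) < 2 {
		return "", errors.New("malformed public key line")
	}
	blob, err := base64.StdEncoding.DecodeString(fields[1])
	if err != nil {
		return "", fmt.Errorf("decode key blob: %w", err)
	}
	sum := md5.Sum(blob)
	parts := make([]string, len(sum))
	for i, b := range sum {
		parts[i] = fmt.Sprintf("%02x", b)
	}
	return strings.Join(parts, ":"), nil
}

func main() {
	// fingerprint -> user ID; stands in for an indexed database column.
	index := map[string]int64{}

	line := "ssh-rsa YWJj dev@example" // placeholder blob for illustration only
	fp, err := md5Fingerprint(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	index[fp] = 42

	userID, found := index[fp]
	fmt.Println(fp, userID, found)
}
```

The fingerprint has a fixed, short length, so an index on it stays compact and a lookup during SSH authentication becomes a single equality match.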
Data Safety – Each shard maintains a master‑mirror‑backup trio (one‑primary‑three‑replicas) and supports cross‑datacenter backup. A disaster‑recovery drill demonstrated alerting in under a minute and traffic switchover in under five minutes. Synchronization is near real time and relies on GitLab system hooks, internal hook receivers, Alibaba Cloud MNS, and gRPC‑based RPC services.
Operational Reliability – Comprehensive logging, metric collection, and alerting detect 5xx errors, cross‑shard bugs, and database issues. Automated health checks trigger role swaps between master and backup within five minutes of a failure, and automated datacenter failover is already in place.
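A role swap driven by health checks could be sketched as below. The consecutive-failure threshold and the `failoverMonitor` type are assumptions; the article only states that the swap happens within five minutes of a failure.

```go
package main

import "fmt"

// shardRoles records which node currently plays which role (placeholder names).
type shardRoles struct {
	Master string
	Backup string
}

// failoverMonitor swaps master and backup after a run of failed health checks.
type failoverMonitor struct {
	roles     shardRoles
	threshold int // consecutive failures required before swapping (assumed policy)
	failures  int
}

// observe records one health-check result for the current master and reports
// whether this observation triggered a role swap.
func (m *failoverMonitor) observe(masterHealthy bool) bool {
	if masterHealthy {
		m.failures = 0
		return false
	}
	m.failures++
	if m.failures < m.threshold {
		return false
	}
	m.roles.Master, m.roles.Backup = m.roles.Backup, m.roles.Master
	m.failures = 0
	return true
}

func main() {
	m := &failoverMonitor{
		roles:     shardRoles{Master: "git-01-master", Backup: "git-01-backup"},
		threshold: 3,
	}
	for _, healthy := range []bool{true, false, false, false} {
		if m.observe(healthy) {
			fmt.Println("swapped roles; new master:", m.roles.Master)
		}
	}
}
```

Requiring several consecutive failures avoids swapping on a single slow probe. With a check interval of, say, 30 seconds and a threshold of 3, detection would take about 90 seconds, well inside the five-minute window; both numbers are illustrative.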
Unit‑Based Deployment – Inspired by parallel computing, the system adopts a cell (unit) model where each shard is a self‑contained unit, facilitating isolated deployment and simplifying multi‑datacenter operations.
Future Work – Ongoing efforts include mitigating large cache releases that cause temporary performance drops, automating release and scaling processes, and fully implementing a global RPC layer to separate web‑service load from Git operations.
Conclusion – Over the past year, Alibaba GitLab’s request volume grew fourfold, projects increased by 130 %, and user count by 56 %, while request success rates rose from 99.5 % to over 99.99 %, proving the scalability and reliability of the new architecture.