
What It Was Really Like Working at GitLab: Lessons on Scaling, Performance, and Culture

The author recounts six years at GitLab, detailing the challenges of scaling a rapidly growing remote company, building performance monitoring tools and a database load balancer, confronting cultural and management issues, and sharing hard‑won lessons on scalability, deployment speed, product strategy, and remote work dynamics.


Background

From October 2015 to December 2021 the author worked at GitLab (employee #28). The role focused on improving the performance of a large-scale Ruby on Rails application, with 20% of the time allocated to the Rubinius VM project.

Performance‑monitoring tooling

GitLab required all internal tools to be open‑source and self‑hostable, which led to the creation of two key components:

GitLab Performance Monitoring – a built‑in dashboard that aggregates request latency, database query times, and system‑level metrics across 15‑20 servers. It replaced a trial New Relic setup that could only monitor one or two hosts.

Sherlock – a heavyweight analyzer used in development environments to profile Ruby code, detect N+1 queries, and generate per‑request call graphs.

These tools were later incorporated into the official GitLab product and formed the basis of a dedicated “performance” team.
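
How such request-level metrics can be captured is easiest to see in code. The sketch below is a minimal illustration for a Rails application, not GitLab's actual implementation; it uses Rails' built-in ActiveSupport::Notifications events, and the log line stands in for writes to a time-series backend (GitLab Performance Monitoring originally shipped metrics to InfluxDB).

```ruby
# Minimal sketch, assuming a Rails app: capture per-action latency and
# count SQL statements. A real system would ship these values to a
# time-series store instead of the application log.
require "active_support/notifications"

sql_counts = Hash.new(0)

# Count every uncached SQL statement as it executes.
ActiveSupport::Notifications.subscribe("sql.active_record") do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  sql_counts[event.payload[:name]] += 1 unless event.payload[:cached]
end

# Record total and database-side duration for each controller action.
ActiveSupport::Notifications.subscribe("process_action.action_controller") do |*args|
  event   = ActiveSupport::Notifications::Event.new(*args)
  payload = event.payload

  Rails.logger.info(
    "timing action=#{payload[:controller]}##{payload[:action]} " \
    "duration_ms=#{event.duration.round(1)} db_ms=#{payload[:db_runtime]&.round(1)}"
  )
end
```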

Database load balancer

A custom load‑balancing layer was added to route read‑only queries to replica databases and write queries to the primary. The balancer implements “sticky” sessions: after a write, subsequent reads are forced to the primary until replication lag falls below a configurable threshold. This functionality is exposed as a Rails‑compatible module that can be added to any GitLab installation with a single configuration change, eliminating the need for developers to manually manage connection pools.
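
The routing rule is easier to see as code. Below is a minimal sketch of the sticky-session idea; the class and method names are hypothetical (connection objects are assumed to respond to transaction and to report replication lag in seconds), and GitLab's real balancer is considerably more involved.

```ruby
# Hypothetical sketch of sticky read/write routing: writes pin the session
# to the primary; reads return to the replicas only once replication lag
# drops back under a configurable threshold.
class StickyLoadBalancer
  MAX_REPLICATION_LAG = 0.1 # seconds; the configurable threshold

  def initialize(primary:, replicas:)
    @primary       = primary
    @replicas      = replicas
    @needs_primary = false
  end

  # Writes always go to the primary and stick subsequent reads to it.
  def write(&block)
    @needs_primary = true
    @primary.transaction(&block)
  end

  # Reads use a replica unless the session is stuck to the primary.
  def read(&block)
    replica = @replicas.sample

    if @needs_primary
      return @primary.transaction(&block) if replica.replication_lag > MAX_REPLICATION_LAG

      @needs_primary = false # replicas caught up; unstick the session
    end

    replica.transaction(&block)
  end
end
```

The essential property is the one described above: a write flips the session to the primary, and reads only move back to the replicas once lag is under the threshold, so a user never reads stale versions of data they have just written.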

Production incident (31 January 2017)

During a performance-tuning session the production database's data was accidentally deleted. Contributing factors:

No recent backups – scheduled backups had been failing silently, and the broken backup-notification system meant nobody was alerted.

The only usable restore point was a temporary copy of the database made six hours earlier.

Recovery meant restoring from that temporary copy, a process that itself took roughly 24 hours and left approximately six hours of writes permanently lost. The incident also underscored the risks of introducing sharding into this workload: with a read-to-write ratio of roughly 10:1, sharding adds operational complexity without clear performance benefits.
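
The first contributing factor points at an easy safeguard: verify backup freshness directly rather than trusting success notifications. A minimal sketch, with a hypothetical backup directory and alerting reduced to a stderr message:

```ruby
# Hypothetical freshness check: fail loudly when the newest backup
# is older than a day. Path and alerting are placeholders.
BACKUP_DIR = "/var/opt/backups"
MAX_AGE    = 24 * 60 * 60 # one day, in seconds

newest = Dir.glob(File.join(BACKUP_DIR, "*.tar")).max_by { |f| File.mtime(f) }

if newest.nil? || Time.now - File.mtime(newest) > MAX_AGE
  # In production this would page someone rather than print.
  warn "ALERT: no backup newer than 24h in #{BACKUP_DIR}"
  exit 1
end

puts "OK: latest backup #{File.basename(newest)} from #{File.mtime(newest)}"
```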

Deployment pipeline improvements (2018‑2021)

The author helped streamline the release process:

Measured the end-to-end time from a commit landing on master to its deployment on GitLab.com, finding a typical range of several days to three weeks (a rough way to take this measurement is sketched after this list).

Identified the split between Community Edition (CE) and Enterprise Edition (EE) repositories as a major bottleneck; merged the two codebases into a single repository, reducing manual merge conflicts and synchronization effort.

Implemented automated CI pipelines that reduced the longest deployment window from three weeks to a single day, and later to a few hours for most changes.
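
As a rough illustration of the measurement in the first item, the script below computes how long a commit waited before it was first deployed. It assumes each deployment is recorded as a git tag matching deploy-*; that convention is an assumption for illustration, not GitLab's actual release tooling.

```ruby
#!/usr/bin/env ruby
# Rough sketch: commit-to-deploy lead time, assuming "deploy-*" tags.
commit = ARGV.fetch(0) # SHA to measure

# When was the commit created?
committed_at = Time.at(`git show -s --format=%ct #{commit}`.to_i)

# Earliest deploy tag whose history contains the commit.
first_deploy = `git tag --contains #{commit} --list 'deploy-*' --sort=creatordate`
               .lines.first&.strip

if first_deploy
  deployed_at = Time.at(`git log -1 --format=%ct #{first_deploy}`.to_i)
  days        = ((deployed_at - committed_at) / 86_400.0).round(1)
  puts "#{commit} took #{days} days from commit to first deploy (#{first_deploy})"
else
  puts "#{commit} has not been deployed yet"
end
```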

These changes enabled a more predictable release cadence and laid the groundwork for the goal of deploying any change within one hour.

Key technical takeaways

Scalability must be a cultural priority. Without explicit focus, performance work is deprioritized despite frequent user complaints.

Data‑driven decision making. Metrics from the performance monitoring stack are essential to define a “minimum viable change”.

Separate SaaS and self‑hosted paths. The differing load patterns (high‑read SaaS vs. smaller self‑hosted instances) make a single code path inefficient.

Ruby on Rails can hide N+1 queries. Proactive use of tools like Sherlock and explicit eager-loading patterns are required to keep query counts low; see the example after this list.

Rapid, reliable deployments. Aim for sub‑hour deployment cycles; this reduces incident impact and improves developer feedback loops.
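
To make the N+1 takeaway concrete, here is the classic ActiveRecord pattern, with GitLab-flavoured model names used purely for illustration:

```ruby
# N+1: one query for the projects, then one more per project to fetch
# its namespace (model names are illustrative).
projects = Project.limit(20)
projects.each { |project| puts project.namespace.name } # 1 + 20 queries

# Eager loading collapses the namespace lookups into one extra query.
projects = Project.includes(:namespace).limit(20)
projects.each { |project| puts project.namespace.name } # 2 queries total
```

Nothing at the call site distinguishes the two loops, which is why tooling like Sherlock matters: the extra queries are invisible in the code itself.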

Tags: Performance, GitLab, Product Management, Remote Work
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
