Databases 25 min read

What GitLab’s 34k‑Line Postgres Schema Reveals About Scalable Database Design

This article examines GitLab’s extensive PostgreSQL schema, explaining why the project uses a pure‑SQL structure file, how primary‑key choices, internal versus external IDs, naming conventions, timestamps, text types, partitioning, GIN indexes, JSONB columns, foreign‑key strategies and other data‑type decisions impact performance, maintainability, and scalability.

ITPUB

May 5, 2024

What GitLab’s 34k‑Line Postgres Schema Reveals About Scalable Database Design

In 2022 the author analyzed GitLab’s PostgreSQL schema, which is stored in structure.sql (over 34,000 lines) instead of the usual Rails schema.rb to allow PostgreSQL‑specific features such as triggers, partitioning, and materialized views.

Primary‑Key Types

GitLab uses bigserial for 380 of its 573 tables, serial for 170 tables, and composite keys for the remaining 23. Experiments show that switching a table from integer to bigint on a 10‑million‑row table takes about 11 seconds, while altering a varchar length from 300 to 100 can take over 35 seconds, illustrating the storage and speed impact of key type choices.

Internal vs. External IDs

Many tables have both an internal id (never exposed) and an external iid (shared with users). The issues table, for example, stores the public identifier in iid while keeping id internal, and a unique index index_issues_on_project_id_and_iid enforces uniqueness per project.

Naming Conventions

Table names are plural (e.g., issues, projects).

Modules prefix tables ( merge_request_*).

Columns use snake_case.

Boolean columns follow one of three patterns: feature switches ( create_issue), entity status ( archived), or is_ / has_ prefixes.

Indexes follow index_#{table}_on_#{column}_and_#{column}_#{condition} (e.g., index_services_on_type_and_id_and_template_when_active).

Timestamps

GitLab stores both timezone‑aware ( timestamp with time zone) and timezone‑agnostic ( timestamp without time zone) columns. System‑generated actions typically use the latter, while user‑visible events (e.g., closed_at) use the former.

Text Types and Check Constraints

Most textual columns are text with explicit CHECK length constraints (e.g., char_length(entity_path) <= 5500). This avoids the lock‑escalation issues of varchar(n) while still limiting size.

Partitioning

Large tables are partitioned using:

Range partitioning (e.g., audit_events by created_at).

List partitioning (e.g., loose_foreign_keys_deleted_records).

Hash partitioning (e.g., product_analytics_events_experimental).

Further details are in the PostgreSQL documentation.

GIN Trigram Indexes for LIKE Searches

GitLab creates GIN indexes with gin_trgm_ops on columns that require fast pattern matching, such as issues.title and issues.description. The article demonstrates the performance gain by comparing a sequential scan with a bitmap heap scan after adding the trigram index.

JSONB Columns

Several tables store flexible data in jsonb, primarily for request payloads and extensible fields. Examples include error_tracking_error_events.payload, operations_strategies.parameters, and vulnerability_finding_evidences.data. JSONB allows efficient querying while preserving schema flexibility.

Foreign‑Key Constraints

Most tables enforce foreign‑key constraints with actions such as ON DELETE CASCADE, RESTRICT, or SET NULL. Immutable tables (e.g., audit_events) omit foreign keys to avoid performance penalties on massive inserts.

Other Data Types

inet

is used for IP addresses in tables like audit_events and authentication_events. bytea stores binary data such as hashes and encrypted tokens.

Arrays (e.g., integer[]) hold multi‑value fields when a separate table would be overkill.

Enumerations are stored as smallint to save space.

Optimistic locking uses a lock_version integer column in tables such as ci_builds.

Conclusion

GitLab’s schema demonstrates that a one‑size‑fits‑all approach is unsuitable for large, evolving systems. Each table’s design reflects its specific access patterns, data volume, and functional requirements, offering practical lessons for anyone designing PostgreSQL schemas at scale.

References

GitLab schema structure.sql: https://gitlab.com/gitlab-org/gitlab/-/blob/master/db/structure.sql

Issue 29465 – Use structure.sql instead of schema.rb: https://gitlab.com/gitlab-org/gitlab/-/issues/29465

Choosing Primary Key Type in Postgres: https://shekhargulati.com/2022/06/23/choosing-a-primary-key-type-in-postgres/

GitHub’s Path to 128M public repositories: https://towardsdatascience.com/githubs-path-to-128m-public-repositories-f6f656ab56b1

PostgreSQL Character Types Documentation: https://www.postgresql.org/docs/current/datatype-character.html

Difference between text and varchar: https://stackoverflow.com/questions/4848964/difference-between-text-and-varchar-character-varying

CHAR(x) vs. VARCHAR(x) vs. VARCHAR vs. TEXT: https://www.depesz.com/2010/03/02/charx-vs-varcharx-vs-varchar-vs-text/

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

PostgreSQL indexes schema design partitioning JSONB Primary Keys

Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.