Unveiling Git’s Hidden Architecture: Objects, Packfiles, and Storage Models Explained

This article systematically explores Git’s underlying architecture, detailing the lifecycle of objects, core data structures, packfile formats, indexing mechanisms, and the algorithms used for object retrieval, while providing practical command examples and visual diagrams to demystify Git’s storage and retrieval models.

Alibaba Cloud Developer

May 20, 2020

State Model

Git objects reside in different storage locations throughout their lifecycle. Commands move objects between the workspace, the index (staging area), the local repository, and the remote repository.

Workspace – the files you see in your local directory. When the workspace is clean, its contents match the index; after modifications, the workspace diverges until changes are staged.

Index (staging area) – a temporary cache where files are placed before a commit. All staged files are committed together to the local repository.

Local repository – a complete Git repository stored on your machine, enabling fully offline operations such as log, history, commit, and diff.

Remote repository – a centralized repository that synchronizes with the local repository, allowing sharing and collaboration.
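
As a rough illustration of the first three states, the sketch below drives the git CLI from Python and records `git status --porcelain` after each transition. It assumes git is installed and on the PATH; the file name a.txt, the demo identity, and the helper names `git` and `trace_states` are made up for this example, and the remote repository is left out.

```python
import os
import shutil
import subprocess
import tempfile


def git(repo, *args):
    """Run a git command inside `repo` and return its stdout."""
    cmd = ["git", "-c", "user.name=Demo", "-c", "user.email=demo@example.com",
           "-c", "commit.gpgsign=false", *args]
    return subprocess.run(cmd, cwd=repo, check=True,
                          capture_output=True, text=True).stdout


def trace_states():
    """Return the porcelain status after each move between storage locations."""
    with tempfile.TemporaryDirectory() as repo:
        git(repo, "init", "-q")
        with open(os.path.join(repo, "a.txt"), "w") as fh:
            fh.write("hello\n")
        states = [git(repo, "status", "--porcelain")]      # workspace only
        git(repo, "add", "a.txt")
        states.append(git(repo, "status", "--porcelain"))  # staged in the index
        git(repo, "commit", "-q", "-m", "first commit")
        states.append(git(repo, "status", "--porcelain"))  # in the local repository
        return states


if shutil.which("git"):  # skip quietly when git is not available
    for status in trace_states():
        print(repr(status))
```

The untracked file shows up as `??`, the staged file as `A`, and the status is empty once the commit has moved the content into the local repository.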

[Figure: storage locations of a file at different stages]

Object Model

Git stores four primary object types, all identified by SHA‑1 hashes:

Blob – stores the raw content of a file (binary data) without metadata such as filename.

Tree – represents a directory. Each entry records a file mode, a name, and the SHA-1 of a blob (file) or sub-tree (sub-directory).

Commit – a snapshot of the entire project at a point in time, linking to a tree object and optionally parent commits; it records author, committer, and message.

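Tag – an annotated tag object, which records the target object (usually a commit), the tag name, the tagger, and a message. A lightweight tag is not an object at all: it is just a ref under refs/tags that points directly at a commit.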
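
To make "identified by SHA-1 hashes" concrete: Git hashes a short header (`<type> <size>` followed by a NUL byte) together with the content. The following is a minimal sketch of that computation; the function name `git_object_id` is an illustrative choice, not part of Git.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """SHA-1 over '<type> <size>' + NUL + content, the form Git hashes."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# The same content always yields the same ID, whatever the file is called.
print(git_object_id("blob", b""))                # the well-known empty blob
print(git_object_id("blob", b"test content\n"))
```

The result should match `git hash-object` run on a file with the same bytes, which is why two files with identical content share one blob.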

Common commands to inspect objects:

git cat-file -t <sha1>   # show object type
git cat-file -p <sha1>   # pretty‑print object content

Example output for a tree object:

100644 blob 36a982c504eb92330573aa901c7482f7e7c9d2e6    .cise.yml
100644 blob c439a8da9e9cca4e7b29ee260aea008964a00e9a    .eslintignore
100644 blob 245b35b9162bec4ef798eb05b533e6c98633af5c    .eslintrc
100644 blob 10123778ec5206edcd6e8500cc78b77e79285f6d    .gitignore
100644 blob 1a48aa945106d7591b6342585b1c29998e486bf6    README.md
100644 blob 514f7cb2645f44dd9b66a87f869d42902174fe40    abc.json
040000 tree 8955f46834e3e35d74766639d740af922dcaccd3    cli_list
... (additional entries omitted for brevity)
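
The pretty-printed listing hides the raw layout of a tree: each entry is stored as `<mode> <name>`, a NUL byte, and the 20-byte binary SHA-1, with directory modes written as 40000 (no leading zero). A small parsing sketch follows; `parse_tree` is a hypothetical helper, and the sample payload is rebuilt by hand from two entries of the listing above rather than read from a repository.

```python
def parse_tree(raw: bytes):
    """Yield (mode, name, sha1_hex) for each entry of a raw tree payload."""
    pos = 0
    while pos < len(raw):
        space = raw.index(b" ", pos)
        nul = raw.index(b"\x00", space)
        mode = raw[pos:space].decode()
        name = raw[space + 1:nul].decode()
        sha = raw[nul + 1:nul + 21].hex()   # 20 raw bytes, not hex text
        yield mode, name, sha
        pos = nul + 21


# Two entries from the listing, re-encoded in the raw on-disk form.
raw_tree = (
    b"100644 README.md\x00"
    + bytes.fromhex("1a48aa945106d7591b6342585b1c29998e486bf6")
    + b"40000 cli_list\x00"
    + bytes.fromhex("8955f46834e3e35d74766639d740af922dcaccd3")
)

KINDS = {"40000": "tree", "160000": "commit"}  # everything else is a blob
for mode, name, sha in parse_tree(raw_tree):
    print(f"{mode.zfill(6)} {KINDS.get(mode, 'blob')} {sha}\t{name}")
```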

Storage Model

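Git’s on-disk storage consists of loose objects and packed objects. A loose object is a single zlib-compressed file under .git/objects, in a subdirectory named after the first two hex digits of its SHA-1. Packed objects live in packfiles under .git/objects/pack: compressed collections of objects designed to reduce size and improve transfer efficiency.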
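
The loose-object half of this model is simple enough to sketch: the object header and content are zlib-compressed and written to a path derived from the SHA-1. The code below is an illustration only; it writes into a temporary directory standing in for .git/objects, and `write_loose` and `read_loose` are hypothetical names.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose(objects_dir, obj_type, content):
    """Store one object the way a loose object is laid out; return its ID."""
    store = f"{obj_type} {len(content)}".encode() + b"\x00" + content
    oid = hashlib.sha1(store).hexdigest()
    path = os.path.join(objects_dir, oid[:2], oid[2:])  # e.g. d6/70460b...
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fh:
        fh.write(zlib.compress(store))
    return oid


def read_loose(objects_dir, oid):
    """Inflate a loose object and split the header from the content."""
    with open(os.path.join(objects_dir, oid[:2], oid[2:]), "rb") as fh:
        store = zlib.decompress(fh.read())
    header, _, content = store.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    assert int(size) == len(content)
    return obj_type, content


with tempfile.TemporaryDirectory() as objects_dir:  # stand-in for .git/objects
    oid = write_loose(objects_dir, "blob", b"test content\n")
    print(oid, read_loose(objects_dir, oid))
```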

Packfile structure (version 2) includes three parts:

Header – 4‑byte signature "PACK", 4‑byte version number, and 4‑byte object count.

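Body – a sequence of object entries. Each entry starts with a variable-length header that encodes the object type (commit, tree, blob, tag, or one of the two delta types, ofs-delta and ref-delta) and the size of the uncompressed data, followed by the zlib-compressed content. A delta entry also identifies its base object (by relative offset or by SHA-1) before the compressed delta data.

Trailer – a 20-byte SHA-1 checksum of everything that precedes it in the packfile. The pair of packfile and index checksums belongs to the trailer of the .idx file.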
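
The three parts can be exercised with a toy packfile built in memory. This is a sketch under simplifying assumptions: it only handles non-delta entries (delta entries carry extra base information before the compressed data), and all function names are invented for the example. The type numbers and the variable-length size encoding follow Git's pack format.

```python
import hashlib
import struct
import zlib

TYPE_NAMES = {1: "commit", 2: "tree", 3: "blob", 4: "tag",
              6: "ofs_delta", 7: "ref_delta"}


def encode_entry_header(type_id, size):
    """First byte: 3 type bits + low 4 size bits; then 7 size bits per byte."""
    current = (type_id << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(current | 0x80)   # high bit set: more size bytes follow
        current = size & 0x7F
        size >>= 7
    out.append(current)
    return bytes(out)


def decode_entry_header(data, pos):
    """Return (type_id, uncompressed_size, position after the header)."""
    byte = data[pos]
    pos += 1
    type_id, size, shift = (byte >> 4) & 0x07, byte & 0x0F, 4
    while byte & 0x80:
        byte = data[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return type_id, size, pos


def build_pack(objects):
    """Assemble header, entries, and SHA-1 trailer for (type_id, content) pairs."""
    body = b"PACK" + struct.pack(">II", 2, len(objects))
    for type_id, content in objects:
        body += encode_entry_header(type_id, len(content)) + zlib.compress(content)
    return body + hashlib.sha1(body).digest()


def parse_pack(pack):
    """Check header and trailer, then inflate every (non-delta) entry."""
    signature, version, count = struct.unpack(">4sII", pack[:12])
    assert signature == b"PACK" and version == 2
    assert hashlib.sha1(pack[:-20]).digest() == pack[-20:]
    pos, entries = 12, []
    for _ in range(count):
        type_id, size, pos = decode_entry_header(pack, pos)
        inflater = zlib.decompressobj()
        content = inflater.decompress(pack[pos:-20])
        assert len(content) == size
        # unused_data is whatever followed this entry's zlib stream
        pos = len(pack) - 20 - len(inflater.unused_data)
        entries.append((TYPE_NAMES[type_id], content))
    return entries


pack = build_pack([(3, b"test content\n"), (3, b"x" * 300)])
print(pack[:12], parse_pack(pack))
```

Because entries carry no compressed length, the parser has to inflate each one to find where the next begins; the .idx file exists to avoid that sequential walk.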

[Figure: packfile header layout]

Index Model

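Each packfile has a corresponding .idx file that enables fast object lookup. A version 2 index starts with an 8-byte header (a 4-byte magic number and a 4-byte version) and is then layered as follows:

Fanout Table – 256 four-byte entries. Entry N holds the number of objects whose SHA-1 starts with a byte less than or equal to N, so two adjacent entries bound the search range for a given first byte, and the last entry is the total object count.

SHA Layer – sorted list of all object SHA-1 values (20 bytes each) for binary search.

CRC Layer – one 4-byte CRC-32 per object, computed over the object's packed data, for integrity verification.

Offset Layer – one 4-byte offset per object, locating it within the packfile. If the high bit is set, the remaining 31 bits are an index into the Big File Offset layer, which stores 8-byte offsets for objects located at or beyond 2 GB.

Trailer – the 20-byte SHA-1 checksum of the packfile, followed by the 20-byte SHA-1 checksum of the index contents that precede it.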
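
A short sketch of how these layers fit together, assuming a version 2 index with an 8-byte header in front of the fanout table. The helper names are hypothetical, and the object IDs are borrowed from the tree listing earlier in the article purely as sample data.

```python
import struct

IDX_HEADER_SIZE = 8   # 4-byte magic number + 4-byte version


def build_fanout(sorted_ids):
    """fanout[b] = number of object IDs whose first byte is <= b."""
    counts = [0] * 256
    for oid in sorted_ids:
        counts[oid[0]] += 1
    fanout, running = [], 0
    for count in counts:
        running += count
        fanout.append(running)
    return fanout


def layer_positions(object_count):
    """Byte position where each layer starts for a given object count."""
    sha_at = IDX_HEADER_SIZE + 256 * 4
    crc_at = sha_at + 20 * object_count
    offset_at = crc_at + 4 * object_count
    big_offset_at = offset_at + 4 * object_count
    return {"sha": sha_at, "crc": crc_at,
            "offset": offset_at, "big_offset": big_offset_at}


def resolve_offset(entry, big_offsets):
    """Follow the high-bit indirection into the 8-byte offset layer."""
    if entry & 0x80000000:
        return big_offsets[entry & 0x7FFFFFFF]
    return entry


ids = sorted(bytes.fromhex(h) for h in [
    "36a982c504eb92330573aa901c7482f7e7c9d2e6",
    "c439a8da9e9cca4e7b29ee260aea008964a00e9a",
    "245b35b9162bec4ef798eb05b533e6c98633af5c",
    "10123778ec5206edcd6e8500cc78b77e79285f6d",
    "1a48aa945106d7591b6342585b1c29998e486bf6",
    "514f7cb2645f44dd9b66a87f869d42902174fe40",
    "8955f46834e3e35d74766639d740af922dcaccd3",
])
fanout = build_fanout(ids)
fanout_bytes = struct.pack(">256I", *fanout)   # big-endian, as stored on disk
print(fanout[0x0F], fanout[0x10], fanout[0x24], fanout[0xFF])
print(layer_positions(len(ids)))
```

Since the fanout table is cumulative, the count for any first byte is the difference between two neighbouring entries, and no separate object-count field is needed.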

[Figure: index lookup process]

Retrieval Algorithm

To locate an object:

Use the first byte of the target SHA-1 to index the fanout table and find the range of SHA layer entries that can contain it.

Binary‑search the SHA layer to obtain the index of the target SHA‑1.

Read the corresponding offset (or big‑file offset) from the offset layer.

Seek to that position in the packfile, read the object header, and decompress the body.

If the object is stored as a delta, recursively retrieve the base object and apply the delta data.
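
The lookup and delta steps can be sketched over in-memory data. This is an illustration, not Git's implementation: the index layers are plain Python lists, the offsets are invented, the function names are hypothetical, and `apply_delta` handles a single delta level (a base that is itself a delta would be resolved first by the same procedure). The copy and insert opcodes follow Git's delta encoding.

```python
import bisect


def find_offset(oid, fanout, sorted_ids, offsets):
    """Steps 1-3: narrow by first byte, binary-search, read the offset."""
    first = oid[0]
    lo = fanout[first - 1] if first else 0
    hi = fanout[first]
    i = bisect.bisect_left(sorted_ids, oid, lo, hi)
    return offsets[i] if i < hi and sorted_ids[i] == oid else None


def read_varint(data, pos):
    """Little-endian base-128 integer used for the two delta sizes."""
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def apply_delta(base, delta):
    """Step 5: rebuild an object from its base and inflated delta data."""
    base_size, pos = read_varint(delta, 0)
    result_size, pos = read_varint(delta, pos)
    assert base_size == len(base)
    out = bytearray()
    while pos < len(delta):
        op = delta[pos]
        pos += 1
        if op & 0x80:                      # copy a range out of the base
            offset = size = 0
            for bit in range(4):           # optional offset bytes
                if op & (1 << bit):
                    offset |= delta[pos] << (8 * bit)
                    pos += 1
            for bit in range(3):           # optional size bytes
                if op & (0x10 << bit):
                    size |= delta[pos] << (8 * bit)
                    pos += 1
            size = size or 0x10000         # size 0 means 64 KiB
            out += base[offset:offset + size]
        elif op:                           # insert the next `op` literal bytes
            out += delta[pos:pos + op]
            pos += op
        else:
            raise ValueError("reserved delta opcode 0")
    assert len(out) == result_size
    return bytes(out)


ids = sorted(bytes.fromhex(h) for h in [
    "10123778ec5206edcd6e8500cc78b77e79285f6d",
    "1a48aa945106d7591b6342585b1c29998e486bf6",
    "514f7cb2645f44dd9b66a87f869d42902174fe40",
])
fanout = [sum(1 for oid in ids if oid[0] <= b) for b in range(256)]
offsets = [12, 150, 420]                   # made-up packfile offsets
print(find_offset(ids[1], fanout, ids, offsets))

base = b"hello world\n"
# sizes 12 -> 16; copy base[0:6], insert b"git ", copy base[6:12]
delta = bytes([0x0C, 0x10, 0x90, 0x06, 0x04]) + b"git " + bytes([0x91, 0x06, 0x06])
print(apply_delta(base, delta))
```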

[Figure: retrieval steps]
