Understanding Git’s Core Data Structures and Architecture
The article explains Git’s fundamental architecture, describing how trees, blobs, and commits are stored as objects identified by hashes, how version history forms a linked list, and how optimizations such as directory sharding and garbage collection fit into this design.
Having long used Git’s basic commands as a black box, the author examined Git’s underlying data model to understand how a long patch set can be managed efficiently.
At its core, Git is a version management tool for file trees: it must persist a directory tree, and every version of that tree, on disk. The basic data structures are tree objects for directories and blob objects for file contents, both of which need fast lookup by index.
Version history is represented as a linked list in which each node is a commit that points to its predecessor, so storing it is essentially another indexing problem.
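The central object management system stores each object (file or directory) as a file whose name is the SHA‑1 hash of its contents, prefixed with a short header giving the object’s type and size. Because it needs nothing more than named files, the system is portable across any file‑storage platform.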
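As a minimal sketch of that naming rule, the following Python computes the ID Git would assign to a blob. The function name `git_object_id` is chosen here for illustration and is not part of Git.

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Return the SHA-1 object ID for a payload of the given object type."""
    # Git hashes a header "<type> <size>\0" followed by the raw content.
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


if __name__ == "__main__":
    # Should match what `git hash-object` reports for the same bytes.
    print(git_object_id("blob", b"test content\n"))
    # -> d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

Two files with identical bytes therefore map to the same object, which is why unchanged content is stored only once.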
Example directory structure:
./dir
./dir/file
Recorded in Git as:
040000 tree e1b8ecbb1f19709f3a4867a0ffe08bb2e07acf19 dir
100644 blob 9daeafb9864cf43055ae93beb0afd6c7d144bfa4 file
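The listing above is the human-readable form of tree objects. A rough sketch of the binary layout behind it follows: each entry is the mode, a space, the name, a NUL byte, and the 20 raw bytes of the referenced object's ID. The helper names are hypothetical, the two IDs are copied from the example without being recomputed, and the sort is simplified (Git orders directory names as if they ended in "/").

```python
import hashlib


def object_id(obj_type: str, payload: bytes) -> str:
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def tree_payload(entries) -> bytes:
    """Serialize (mode, name, hex_id) tuples into a tree object's body."""
    out = b""
    # Simplified ordering: plain byte-wise sort on the entry name.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode("utf-8")):
        out += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\x00"
        out += bytes.fromhex(hex_id)  # 20 raw bytes, not 40 hex characters
    return out


def parse_tree(payload: bytes):
    """Inverse of tree_payload: recover the (mode, name, hex_id) tuples."""
    entries, pos = [], 0
    while pos < len(payload):
        space = payload.index(b" ", pos)
        nul = payload.index(b"\x00", space)
        mode = payload[pos:space].decode("ascii")
        name = payload[space + 1:nul].decode("utf-8")
        entries.append((mode, name, payload[nul + 1:nul + 21].hex()))
        pos = nul + 21
    return entries


if __name__ == "__main__":
    # Directories are stored with mode "40000"; tools pad it to "040000".
    root = [("40000", "dir", "e1b8ecbb1f19709f3a4867a0ffe08bb2e07acf19")]
    sub = [("100644", "file", "9daeafb9864cf43055ae93beb0afd6c7d144bfa4")]
    for tree in (root, sub):
        for mode, name, hex_id in parse_tree(tree_payload(tree)):
            kind = "tree" if mode == "40000" else "blob"
            print(f"{mode:0>6} {kind} {hex_id} {name}")
```

Because a tree's ID is computed from its entries, a change to any blob ID inside it changes the tree's ID too.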
When a file’s content changes, Git creates a new blob with a new hash and leaves the old blob in place. The changed blob hash produces a new tree object for every directory above the file, and the next commit records the new root tree in a new commit object. Advanced users can construct these objects manually with commands such as git hash-object, git update-index, git write-tree, and git commit-tree.
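To make the copy-on-change behaviour concrete, here is a toy in-memory store. It is a sketch under simplifying assumptions (a single flat directory, every file with mode 100644), and `ObjectStore` and `write_tree` are hypothetical names rather than Git APIs.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store; not Git's actual implementation."""

    def __init__(self):
        self.objects = {}

    def put(self, obj_type: str, payload: bytes) -> str:
        data = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00" + payload
        oid = hashlib.sha1(data).hexdigest()
        self.objects[oid] = data  # storing identical content again is a no-op
        return oid


def write_tree(store: ObjectStore, files: dict) -> str:
    """Store one flat directory (name -> bytes) and return its tree ID."""
    payload = b""
    for name in sorted(files):
        blob_id = store.put("blob", files[name])
        payload += b"100644 " + name.encode("utf-8") + b"\x00" + bytes.fromhex(blob_id)
    return store.put("tree", payload)


store = ObjectStore()
v1 = write_tree(store, {"a.txt": b"one\n", "b.txt": b"same\n"})
v2 = write_tree(store, {"a.txt": b"two\n", "b.txt": b"same\n"})

# Version 1 wrote 2 blobs + 1 tree; version 2 added only 1 blob + 1 tree,
# because b.txt is unchanged and its blob is reused.
print(len(store.objects))  # 5
print(v1 != v2)            # True
```

Both tree IDs stay valid after the change, so either version of the directory can still be read back in full.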
The architecture emphasizes preserving the entire modification chain rather than merely storing deltas; optimizations are secondary to this core requirement.
Commit objects link to their tree, which links to other trees and blobs, forming a complete, immutable history.
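The chain of commits can be sketched as a simple linked structure. The IDs below (`c1`, `t1`, and so on) are placeholders rather than real hashes, and the sketch follows a single parent per commit as the text describes; real Git commits can have several parents after a merge.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass(frozen=True)
class Commit:
    tree: str               # ID of the root tree snapshot for this version
    parent: Optional[str]   # ID of the previous commit, None for the first
    message: str


def history(commits: Dict[str, Commit], start: str) -> Iterator[str]:
    """Yield commit IDs from `start` back to the first commit."""
    current: Optional[str] = start
    while current is not None:
        yield current
        current = commits[current].parent


commits = {
    "c1": Commit(tree="t1", parent=None, message="initial import"),
    "c2": Commit(tree="t2", parent="c1", message="edit file"),
    "c3": Commit(tree="t3", parent="c2", message="add dir"),
}

print(list(history(commits, "c3")))  # ['c3', 'c2', 'c1']
```

Since each commit is frozen and only points backwards, adding a new commit never modifies the existing ones.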
The hash algorithm is an interchangeable detail: the design works with any suitable hash function, not just SHA‑1.
Two common optimizations are:
Splitting objects into directories based on the first two characters of the hash to reduce filesystem lookup overhead.
Implementing a garbage‑collection (gc) mechanism that packs related objects into pack files, where similar objects are stored as deltas so that duplicated data is compressed away.
These optimizations are independent of the core logical model.
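The first optimization can be sketched in a few lines. The function names are hypothetical and `root` stands in for the repository's object directory parent. The zlib compression of loose objects is how Git stores them on disk, though the text above does not discuss it.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def loose_object_path(root: Path, oid: str) -> Path:
    """Shard by the first two hex characters: objects/ab/cdef..."""
    return root / "objects" / oid[:2] / oid[2:]


def write_loose_object(root: Path, obj_type: str, payload: bytes) -> str:
    data = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00" + payload
    oid = hashlib.sha1(data).hexdigest()
    path = loose_object_path(root, oid)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(data))  # loose objects are zlib-compressed
    return oid


def read_loose_object(root: Path, oid: str) -> bytes:
    """Return the payload, with the "<type> <size>\\0" header stripped."""
    data = zlib.decompress(loose_object_path(root, oid).read_bytes())
    return data[data.index(b"\x00") + 1:]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        oid = write_loose_object(Path(tmp), "blob", b"test content\n")
        print(loose_object_path(Path("."), oid))
        print(read_loose_object(Path(tmp), oid))
```

The sharding only changes where a file lives on disk. The object ID, and everything that refers to it, is unaffected.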
With this foundation, higher‑level features such as git checkout <sha1>, branches, tags, and HEAD pointers become straightforward mappings to the underlying objects.
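As a rough illustration of that mapping, references can be modeled as a small table of names pointing at commit IDs. The commit IDs here are placeholders, and `resolve` is a hypothetical helper. The "ref: " prefix mirrors the format Git uses in its HEAD file.

```python
def resolve(refs: dict, name: str) -> str:
    """Follow symbolic refs ("ref: <target>") until a commit ID is reached."""
    value = refs[name]
    while value.startswith("ref: "):
        value = refs[value[len("ref: "):]]
    return value


refs = {
    "HEAD": "ref: refs/heads/main",  # HEAD names the current branch
    "refs/heads/main": "c3",         # a branch is a movable pointer to a commit
    "refs/tags/v1.0": "c1",          # a tag is a fixed name for a commit
}

print(resolve(refs, "HEAD"))            # c3
print(resolve(refs, "refs/tags/v1.0"))  # c1

# Committing on the branch moves only the branch pointer; HEAD follows it.
refs["refs/heads/main"] = "c4"
print(resolve(refs, "HEAD"))            # c4

# Checking out a commit directly stores its ID in HEAD itself.
refs["HEAD"] = "c1"
print(resolve(refs, "HEAD"))            # c1
```

None of these operations touches the object store, which is what keeps the higher-level features cheap.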
Good architectural discipline keeps the design simple, extensible, and resilient to future complexity. It avoids mixing unrelated concerns too early, for example pushing thread‑safety logic into modules that have nothing to do with it.
Source: https://zhuanlan.zhihu.com/p/38245039
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
