How Git Actually Stores Your Code: The Hidden Architecture Behind Every Commit

On April 7, 2005, Linus Torvalds made the first commit to a new version control system. He had started coding it just four days earlier, on April 3rd, after the proprietary tool he had been using for Linux kernel development became unavailable. The kernel community needed something fast, distributed, and capable of handling thousands of contributors. What Torvalds built in those frantic days wasn’t just another version control system—it was a content-addressable filesystem disguised as one.

That distinction matters. Most developers use Git daily without understanding what’s actually happening when they type git commit. Beneath the familiar commands lies an elegant data structure where every piece of content—every file, every directory listing, every commit message—is addressed by its cryptographic hash. Understanding this architecture transforms Git from a mysterious black box into a logical, debuggable system.

The Content-Addressable Foundation

Git’s object database operates on a deceptively simple principle: store everything as objects, and identify each by its SHA-1 hash. When you add a file to Git, it doesn’t just copy the file somewhere. Instead, it computes the SHA-1 hash of the file’s content prefixed with a short header (the object type and size), uses that 40-character hexadecimal string as the object’s name, and stores the zlib-compressed content in the .git/objects directory.

$ echo "Hello, world!" | git hash-object -w --stdin
af5626b4a114abcb82d63db7c8082c3c4756e51b

The resulting object lives at .git/objects/af/5626b4a114abcb82d63db7c8082c3c4756e51b. The first two characters become a subdirectory name, and the remaining 38 characters become the filename. This isn’t arbitrary—grouping objects by their first two characters prevents any single directory from becoming too large, which would slow down filesystem operations.

This approach has profound implications. Identical content always produces the same hash, so Git never stores duplicates. If you commit the same file ten times, only one blob object exists in the database, referenced ten different ways. The hash also serves as integrity verification: change one byte, and the hash changes completely, so corruption that would otherwise go unnoticed shows up as a hash mismatch when the object is checked.
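
The hashing scheme is small enough to reproduce by hand. The following is a minimal sketch in Python, assuming the header has the form "blob <size>" followed by a NUL byte; the function names are illustrative and are not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content."""
    # Git hashes a header ("blob <size>\0") followed by the raw bytes.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Map an object ID to its loose-object path under .git/objects."""
    # The first two hex characters name the subdirectory, the other 38 the file.
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


if __name__ == "__main__":
    # echo appends a trailing newline, so it is part of the hashed content.
    oid = git_blob_id(b"Hello, world!\n")
    print(oid)
    print(loose_object_path(oid))
```

If the header assumption holds, the printed ID should match the one git hash-object reported above, which is a quick way to check the sketch against a real Git installation.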

Four Object Types, One Unified Model

Git recognizes four object types, each building on the others.

Blobs store raw file content. A blob contains no filename, no permissions, no metadata—just bytes. The filename “readme.md” lives elsewhere, as we’ll see. This design choice means that if you rename a file without changing its contents, Git doesn’t need to store a new blob; it just updates the reference.

Trees represent directory snapshots. A tree object contains a list of entries, each with a mode (permissions), type, SHA-1 hash, and name. An entry might point to a blob (a file) or another tree (a subdirectory). When you examine a tree object:

$ git cat-file -p main^{tree}
100644 blob a906cb2a4a904a152e80877d4088654daad0c859    README.md
100644 blob 8f94139338f9404f26296befa88755fc2598c289    Rakefile
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0    lib

The tree shows that the root directory contains two files and one subdirectory. The subdirectory entry points to another tree object, creating a recursive structure that can represent arbitrarily complex directory hierarchies.
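
A tree is hashed the same way as a blob, only with a binary body. The sketch below assumes the on-disk layout of one entry is "<mode> <name>", a NUL byte, then the 20 raw bytes of the hash, with subdirectory modes written as 40000 (git cat-file -p pads it to 040000 for display). The function name tree_id is illustrative.

```python
import hashlib


def tree_id(entries):
    """Compute a tree's object ID from (mode, name, hex_object_id) tuples."""

    def sort_key(entry):
        mode, name, _ = entry
        # Git orders entries by name, comparing subtrees as if they ended in "/".
        return name.encode() + (b"/" if mode == "40000" else b"")

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        # One entry: "<mode> <name>\0" followed by the raw 20-byte hash.
        body += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(hex_id)
    header = b"tree %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    # The three entries from the listing above.
    listing = [
        ("100644", "README.md", "a906cb2a4a904a152e80877d4088654daad0c859"),
        ("100644", "Rakefile", "8f94139338f9404f26296befa88755fc2598c289"),
        ("40000", "lib", "99f1a6d12cb4b6f19c8655fca46c3ecf317074e0"),
    ]
    print(tree_id(listing))
```

Because entries are sorted before hashing, the same directory contents always yield the same tree ID regardless of the order in which files were added.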

Commits wrap trees with metadata. A commit object contains:

A reference to the root tree (the project snapshot)
Zero or more parent commits (one for normal commits, two or more for merges, none for the initial commit)
Author information (name, email, timestamp)
Committer information (often the same as author)
The commit message

$ git cat-file -p HEAD
tree d8329fc1cc938780ffdd9f94e0d364e0ea74f579
parent 1a410efbd13591db07496601ebc7a059dd55cfe9
author Linus Torvalds <[email protected]> 1243040974 -0700
committer Linus Torvalds <[email protected]> 1243040974 -0700

The commit message goes here

The commit hash is computed from all this data, which is why changing anything—even the commit timestamp—produces a completely different hash.
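
To see why a one-second change in the timestamp produces a new hash, it helps to build the commit text by hand. This is a rough sketch that assumes the layout shown by git cat-file -p above, wrapped in a "commit <size>" header; the identity string is a hypothetical placeholder.

```python
import hashlib


def commit_id(tree, parents, author, committer, message):
    """Compute a commit's object ID from its fields."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = ("\n".join(lines) + "\n").encode()
    header = b"commit %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    tree = "d8329fc1cc938780ffdd9f94e0d364e0ea74f579"
    parent = "1a410efbd13591db07496601ebc7a059dd55cfe9"
    # Hypothetical identity; only the timestamp differs between the two commits.
    first = "A U Thor <author@example.com> 1243040974 -0700"
    later = "A U Thor <author@example.com> 1243040975 -0700"
    print(commit_id(tree, [parent], first, first, "The commit message goes here"))
    print(commit_id(tree, [parent], later, later, "The commit message goes here"))
```

The two printed IDs differ even though the tree, parent, and message are identical, and any commit built on top of either one would inherit that difference through its parent line.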

Tags give a specific object, usually a commit, a human-readable name. An annotated tag is a full object containing the hash and type of the tagged object, the tag name, tagger information, a message, and optionally a cryptographic signature. A lightweight tag is simply a reference file containing a commit hash.

Figure: Git object model showing how commits reference trees, which reference blobs and other trees. Image source: Pro Git Book, Git Objects.

The Snapshot Model

Here’s where Git diverges from earlier version control systems like CVS and Subversion. Those systems stored each revision as a delta—a set of changes from the previous version. Git stores complete snapshots.

Every commit contains a full tree object, which references every file in your project at that moment. If you have 10,000 files and change only one line in one file, the new commit still references all 10,000 files. The changed file gets a new blob; the other 9,999 files reference the same blobs as the previous commit.

This sounds wasteful until you consider the implications. Checking out any commit is simply loading its tree structure into the working directory—no need to apply a chain of deltas. Comparing two commits means comparing their trees directly. Branching costs almost nothing: a branch is just a 41-byte file containing a commit hash.
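
The 10,000-file scenario can be simulated with a dictionary standing in for the object database. This is a toy model, not Git's storage code: it keeps a flat path-to-ID listing instead of nested trees, and the names snapshot and store are illustrative.

```python
import hashlib


def blob_id(content: bytes) -> str:
    return hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()


def snapshot(files, store):
    """Store every file as a blob keyed by its ID; return a path -> ID listing."""
    listing = {}
    for path, content in files.items():
        oid = blob_id(content)
        store.setdefault(oid, content)  # identical content is stored only once
        listing[path] = oid
    return listing


if __name__ == "__main__":
    store = {}
    files = {f"src/file_{i}.txt": f"contents of file {i}\n".encode() for i in range(10_000)}
    first = snapshot(files, store)
    blobs_after_first = len(store)

    files["src/file_42.txt"] = b"one changed line\n"
    second = snapshot(files, store)

    shared = sum(1 for path in first if first[path] == second[path])
    print("new blobs for second snapshot:", len(store) - blobs_after_first)
    print("files sharing a blob with the first snapshot:", shared)
```

Both snapshots list all 10,000 paths, yet the second one adds a single blob to the store.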

The Efficiency Paradox: Packfiles and Deltas

If Git stores complete snapshots, why doesn’t repository size explode? The answer lies in packfiles.

Loose objects—individual files in .git/objects—are fine for small repositories. But as the object count grows, Git consolidates them into packfiles stored in .git/objects/pack/. A packfile is a single binary file containing multiple objects, compressed both individually and against each other.

This is where delta compression enters the picture. Within a packfile, Git can store an object as a delta against another object: “start with blob X, remove bytes 10-15, insert ’new text’ at position 20.” This dramatically reduces storage for files that change incrementally, like source code.

Figure: Packfile structure showing objects with delta compression. Image source: GitHub Blog, Git’s database internals.

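The key insight is that deltas are purely a storage optimization. Logically, Git still works with snapshots. Deltification happens whenever Git builds a packfile, for example during git gc or git repack, or when it prepares objects for a push or fetch, and it is transparent to every other Git operation. You can even turn a packfile back into loose objects by moving it out of .git/objects/pack and feeding it to git unpack-objects; Git will still function correctly, just less efficiently. Simply deleting the packfiles, by contrast, would destroy the objects stored in them.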
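
The delta idea from this section can be illustrated with two instructions: copy a range of bytes from the base object, or insert literal bytes. The sketch below uses plain Python tuples for readability; it is not Git's binary packfile encoding, and the function name apply_delta is illustrative.

```python
def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild a target object from a base object and a list of delta ops.

    Each op is ("copy", offset, length) or ("insert", literal_bytes).
    """
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        elif op[0] == "insert":
            out += op[1]
        else:
            raise ValueError(f"unknown delta op: {op[0]!r}")
    return bytes(out)


if __name__ == "__main__":
    base = b"print('hello')\n"
    start = base.index(b"hello")
    delta = [
        ("copy", 0, start),                               # keep everything before 'hello'
        ("insert", b"goodbye"),                           # new text
        ("copy", start + 5, len(base) - start - 5),       # keep the rest of the line
    ]
    print(apply_delta(base, delta))
```

Whoever reads the object only ever sees the reconstructed bytes, which is why the rest of Git can keep treating every object as a full snapshot.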

References: Naming the Unnameable

SHA-1 hashes are precise but impractical. Nobody memorizes 40-character hex strings. Git’s reference system provides human-readable names that point to commits.

Branches live in .git/refs/heads/. The main branch is simply a file at .git/refs/heads/main containing a commit hash:

$ cat .git/refs/heads/main
378b51993aa022c432b23b7f1bafd921b7c43835

Tags live in .git/refs/tags/. Remote-tracking branches (like origin/main) live in .git/refs/remotes/. When you run git branch feature, Git creates a new file in refs/heads/. When you commit, Git updates the current branch’s file to point to the new commit. Git may also consolidate refs into a single .git/packed-refs file, so a ref does not always exist as its own file.

The HEAD file determines your current position:

$ cat .git/HEAD
ref: refs/heads/main

HEAD usually contains a symbolic reference to a branch. When you make a commit, Git resolves HEAD to find the branch file, then updates that file. In a “detached HEAD” state, HEAD contains a raw commit hash instead:

$ cat .git/HEAD
378b51993aa022c432b23b7f1bafd921b7c43835
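
Resolving HEAD by hand takes only a few lines. The following is a simplified sketch, assuming refs are either loose files or lines of the form "<hash> <refname>" in .git/packed-refs; it ignores worktrees and other ref storage backends, and resolve_head is an illustrative name.

```python
from pathlib import Path


def resolve_head(git_dir):
    """Return (ref_name_or_None, commit_id_or_None) for a repository's HEAD."""
    git_dir = Path(git_dir)
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        return None, head  # detached HEAD: the file holds a raw commit hash

    ref = head[len("ref: "):]
    ref_file = git_dir / ref
    if ref_file.exists():
        return ref, ref_file.read_text().strip()

    # Assumption: a ref with no loose file may still be listed in packed-refs.
    packed = git_dir / "packed-refs"
    if packed.exists():
        for line in packed.read_text().splitlines():
            if line.startswith(("#", "^")):
                continue  # header comment or peeled-tag line
            oid, _, name = line.partition(" ")
            if name == ref:
                return ref, oid
    return ref, None  # branch exists in name only (no commits yet)
```

Calling resolve_head(".git") in a repository on main would return the branch ref together with its commit hash, while a detached HEAD returns None in place of the ref name.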

The Index: Where Commits Are Born

Between your working directory and the repository sits the index (also called the staging area). This is .git/index, a binary file that represents what your next commit will contain.

When you run git add file.txt, Git:

Creates a blob object for the file’s current content
Updates the index entry for file.txt to reference that blob
Records the file’s mode and path

The index doesn’t store file content; it stores references to blobs, along with stat information for quick comparison. Running git status compares the index against the HEAD commit and the working directory against the index, which yields three groups: changes staged for commit, changes not yet staged, and untracked files.

Running git commit reads the index, creates tree objects for the staged state, creates a commit object pointing to the root tree, and updates the current branch. The index is what allows you to carefully craft commits, staging some changes while leaving others for later.
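
The step from a flat index to nested tree objects can be sketched as a recursive walk over paths. This is a simplified model: the index here is just a dictionary from path to blob ID, every file is assumed to be a regular non-executable file (mode 100644), and write_tree is an illustrative function rather than Git's implementation.

```python
import hashlib


def write_tree(index, store, prefix=""):
    """Build tree objects for every directory under `prefix`.

    `index` maps full paths to hex blob IDs; `store` collects the raw tree
    objects by ID. Returns the ID of the tree for `prefix`.
    """
    entries = {}  # name -> (mode, hex object ID)
    for path, blob_id in index.items():
        if not path.startswith(prefix):
            continue
        name, slash, _ = path[len(prefix):].partition("/")
        if slash and name not in entries:
            # A subdirectory: recurse once and record the resulting subtree.
            entries[name] = ("40000", write_tree(index, store, prefix + name + "/"))
        elif not slash:
            # Assumption: every file is a regular, non-executable file.
            entries[name] = ("100644", blob_id)

    def sort_key(item):
        name, (mode, _) = item
        return name.encode() + (b"/" if mode == "40000" else b"")

    body = b"".join(
        mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(oid)
        for name, (mode, oid) in sorted(entries.items(), key=sort_key)
    )
    data = b"tree %d\x00" % len(body) + body
    tree_id = hashlib.sha1(data).hexdigest()
    store[tree_id] = data
    return tree_id
```

An index holding README.md, lib/a.py, and lib/util/b.py produces three tree objects (root, lib, and lib/util), and restaging a single file changes only the trees on the path from that file up to the root.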

The Reflog: Your Safety Net

Every time HEAD moves, whether through commits, checkouts, rebases, or resets, Git appends an entry to the reflog recording where HEAD was and where it went:

$ git reflog
378b519 HEAD@{0}: commit: Add new feature
1a410ef HEAD@{1}: checkout: moving from feature to main
fdf4fc3 HEAD@{2}: commit: Fix bug

The reflog is local to your repository; it’s never pushed or fetched. By default, Git keeps entries for 90 days (30 days for entries that are no longer reachable from the current tip), providing a recovery mechanism for “lost” commits. Even if you run git reset --hard and seemingly destroy work, the commits remain accessible through the reflog until those entries expire and the objects are garbage collected.
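
The output of git reflog is rendered from plain text files such as .git/logs/HEAD. The sketch below parses one line of that file, assuming the layout "<old hash> <new hash> <identity> <timestamp> <timezone>", a tab, then the message. The ReflogEntry type and the identity in the sample line are illustrative.

```python
from typing import NamedTuple


class ReflogEntry(NamedTuple):
    old_id: str
    new_id: str
    identity: str
    timestamp: int
    tz: str
    message: str


def parse_reflog_line(line: str) -> ReflogEntry:
    """Split one line of a reflog file into its fields."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old_id, new_id, rest = meta.split(" ", 2)
    # The identity may contain spaces, so peel the last two fields off the right.
    identity, timestamp, tz = rest.rsplit(" ", 2)
    return ReflogEntry(old_id, new_id, identity, int(timestamp), tz, message)


if __name__ == "__main__":
    sample = (
        "1a410efbd13591db07496601ebc7a059dd55cfe9 "
        "378b51993aa022c432b23b7f1bafd921b7c43835 "
        "A U Thor <author@example.com> 1243040974 -0700"
        "\tcommit: Add new feature\n"
    )
    print(parse_reflog_line(sample))
```

Each entry carries both the old and the new hash, which is what makes it possible to walk back to a commit that no branch points at any more.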

Garbage Collection: The Cleanup Crew

Git never immediately deletes objects. When you amend a commit, rebase a branch, or delete a branch, the old commits are no longer reachable from any branch or tag. The reflog usually still references them for a while, and even after those entries expire, the objects remain in the object database until git gc prunes them.

Garbage collection performs several tasks:

Expires old reflog entries
Removes unreachable objects (after a grace period, defaulting to two weeks)
Consolidates loose objects into packfiles
Repacks existing packfiles for better compression

The two-week grace period is a safety mechanism. If you accidentally delete a branch, you have time to recover it. Objects with no references that are older than the grace period are permanently deleted.
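
Deciding what to prune is a reachability question followed by an age check. The following is a toy sketch over an in-memory graph, assuming a mapping from each object to the objects it references and a last-modified time per object; the names reachable and prunable are illustrative and this is not how git gc is implemented.

```python
def reachable(roots, edges):
    """Return every object reachable from the given root IDs."""
    seen = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(edges.get(oid, ()))
    return seen


def prunable(objects, roots, edges, now, grace_seconds=14 * 24 * 3600):
    """Return unreachable objects whose age exceeds the grace period.

    `objects` maps object ID -> last-modified time in seconds.
    """
    live = reachable(roots, edges)
    return {
        oid for oid, mtime in objects.items()
        if oid not in live and now - mtime > grace_seconds
    }


if __name__ == "__main__":
    day = 24 * 3600
    edges = {"c2": ["c1", "t2"], "c1": ["t1"], "t1": ["b1"], "t2": ["b1", "b2"],
             "c3": ["c1", "t3"], "t3": ["b3"]}
    # c3, t3, and b3 belong to an amended commit that no ref points at.
    objects = {"c1": 0, "c2": 0, "t1": 0, "t2": 0, "b1": 0, "b2": 0,
               "c3": 0, "t3": 0, "b3": 29 * day}
    print(sorted(prunable(objects, roots=["c2"], edges=edges, now=30 * day)))
```

In this model the roots would be every branch, tag, and unexpired reflog entry; the recent object b3 survives even though nothing points at it, because it is still inside the grace period.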

The Distributed Database

Git’s architecture becomes more impressive when you consider its distributed nature. By default, every clone is a complete repository with full history (shallow and partial clones are opt-in exceptions). There’s no central database that must be consulted for every operation.

When you run git clone, Git copies all reachable objects from the remote, creates remote-tracking branches for the remote’s branches, creates and checks out one local branch that tracks the remote’s default branch, and configures a “remote” named origin. The remote configuration tells Git where to push and fetch from:

$ cat .git/config
[remote "origin"]
    url = https://github.com/user/repo.git
    fetch = +refs/heads/*:refs/remotes/origin/*

The fetch refspec says: “download all references matching refs/heads/* on the remote, and store them in refs/remotes/origin/* locally.” This is why origin/main tracks the remote’s main branch.
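
The mapping a refspec describes is a simple pattern substitution. Here is a minimal sketch that handles a single "*" wildcard on each side and treats a leading "+" as a flag to be stripped; the function name map_ref is illustrative, and real refspecs support more forms than this.

```python
def map_ref(refspec: str, remote_ref: str):
    """Return the local ref a remote ref maps to under `refspec`, or None."""
    if refspec.startswith("+"):
        refspec = refspec[1:]  # "+" permits non-fast-forward updates; irrelevant to naming
    src, _, dst = refspec.partition(":")

    if "*" not in src:
        return dst if remote_ref == src else None

    prefix, _, suffix = src.partition("*")
    if len(remote_ref) < len(prefix) + len(suffix):
        return None
    if not (remote_ref.startswith(prefix) and remote_ref.endswith(suffix)):
        return None
    matched = remote_ref[len(prefix):len(remote_ref) - len(suffix)]
    return dst.replace("*", matched, 1)


if __name__ == "__main__":
    spec = "+refs/heads/*:refs/remotes/origin/*"
    print(map_ref(spec, "refs/heads/main"))       # refs/remotes/origin/main
    print(map_ref(spec, "refs/heads/feature/x"))  # refs/remotes/origin/feature/x
    print(map_ref(spec, "refs/tags/v1.0"))        # None
```

Tags fall outside the refs/heads/* pattern, which is why this particular refspec never places them under refs/remotes/origin/.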

Push and fetch operations transfer packfiles containing the objects the other side doesn’t have. Git computes the objects reachable from your refs, the remote computes its reachable objects, and they negotiate what needs to be transferred. This exchange uses a sophisticated protocol that minimizes bandwidth by leveraging the fact that both sides share most of the same objects.

Plumbing Commands: Working with Objects Directly

Git exposes low-level “plumbing” commands that let you interact with objects directly:

# Create a blob from stdin
$ echo "content" | git hash-object -w --stdin

# Examine any object
$ git cat-file -t <hash>  # type
$ git cat-file -s <hash>  # size
$ git cat-file -p <hash>  # pretty-print

# Create a tree from the index
$ git write-tree

# Create a commit from a tree (the message comes from -m, or from stdin if omitted)
$ git commit-tree <tree> -p <parent> -m "<message>"

These commands are what Git’s own high-level commands use internally. Understanding them demystifies Git’s behavior and enables debugging when things go wrong.
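
What hash-object -w and cat-file do to a loose object can be imitated in a few lines. The sketch below assumes a loose object is the zlib-compressed concatenation of a "<type> <size>" header, a NUL byte, and the content; it works on any directory you pass in, and the function names are illustrative.

```python
import hashlib
import zlib
from pathlib import Path


def write_object(objects_dir, obj_type: str, content: bytes) -> str:
    """Store an object as a loose file and return its ID (like hash-object -w)."""
    data = b"%s %d\x00" % (obj_type.encode(), len(content)) + content
    oid = hashlib.sha1(data).hexdigest()
    path = Path(objects_dir) / oid[:2] / oid[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(data))
    return oid


def read_object(objects_dir, oid: str):
    """Return (type, content) for a loose object (like cat-file -t and -p)."""
    raw = zlib.decompress((Path(objects_dir) / oid[:2] / oid[2:]).read_bytes())
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.split(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type.decode(), content
```

Pointing read_object at the .git/objects directory of a small repository is a safe, read-only way to explore; write_object is better tried against a scratch directory.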

Implications for Developers

Understanding Git’s object model changes how you think about common operations. That branch you created and deleted? It cost almost nothing—just a 41-byte file that’s now gone. That file you renamed? Git detected it through content similarity, not by tracking renames explicitly. That rebase you’re nervous about? It’s just creating new commits with different parents; the old commits remain until garbage collection.

The content-addressable nature also enables interesting workflows. You can verify repository integrity with git fsck. You can search for specific content across all history with git log -S. You can construct commits programmatically by building objects directly.

Git’s design reflects Torvalds’ priorities: speed, data integrity, and distributed workflows. Every decision, from the SHA-1 identifiers to the snapshot model to the reference structure, serves these goals. Twenty years later, as Git enters its third decade, that architecture has proven remarkably robust. The .git directory in your project isn’t just configuration; it’s a complete, self-contained database, a cryptographically verified history of everything you’ve built.

References

Torvalds, L. (2005). Git: Fast Version Control System. Initial release April 7, 2005.
Chacon, S. & Straub, B. (2014). Pro Git, 2nd Edition. Apress. Git Internals - Git Objects
Derrick Stolee. (2022). Git’s database internals I: packed object store. GitHub Blog
Julia Evans. (2024). How HEAD works in git. jvns.ca
Ken Muse. (2025). Understanding How Git Stores Data. kenmuse.com
Git Documentation. Transfer Protocols
Git Documentation. Git References
Stack Overflow. Are Git’s pack files deltas rather than snapshots?
The Linux Foundation. (2015). 10 Years of Git: An Interview with Linus Torvalds.
GitHub Blog. (2025). Git turns 20: A Q&A with Linus Torvalds.
