Possible License(s): GPL-3.0
== Secrets Revealed ==

We take a peek under the hood and explain how Git performs its miracles. I will skimp on details. For in-depth descriptions, refer to the user manual.

=== Invisibility ===

How can Git be so unobtrusive? Aside from occasional commits and merges, you can work as if you were unaware that version control exists. That is, until you need it, and that's when you're glad Git was watching over you the whole time.

Other version control systems force you to constantly struggle with red tape and bureaucracy. Permissions of files may be read-only unless you explicitly tell a central server which files you intend to edit. The most basic commands may slow to a crawl as the number of users increases. Work grinds to a halt when the network or the central server goes down.

In contrast, Git simply keeps the history of your project in the `.git` directory in your working directory. This is your own copy of the history, so you can stay offline until you want to communicate with others. You have total control over the fate of your files because Git can easily recreate a saved state from `.git` at any time.

=== Integrity ===

Most people associate cryptography with keeping information secret, but another equally important goal is keeping information safe. Proper use of cryptographic hash functions can prevent accidental or malicious data corruption.

A SHA1 hash can be thought of as a unique 160-bit ID number for every string of bytes you'll encounter in your life. Actually more than that: every string of bytes that any human will ever use over many lifetimes.

As a SHA1 hash is itself a string of bytes, we can hash strings of bytes containing other hashes. This simple observation is surprisingly useful: look up 'hash chains'. We'll later see how Git uses it to efficiently guarantee data integrity.

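
To make the idea of hashing hashes concrete, here is a minimal Python sketch of a hash chain. It is only an illustration: the function name `chain` and the sample records are made up, and this is not the layout Git itself uses.

```python
import hashlib

def chain(records):
    """Return one SHA1 hex digest per record, each also covering the previous digest."""
    digests = []
    previous = b""  # the first link has no predecessor
    for record in records:
        digest = hashlib.sha1(previous + record).hexdigest()
        digests.append(digest)
        previous = digest.encode("ascii")
    return digests

original = chain([b"first", b"second", b"third"])
tampered = chain([b"first", b"SECOND", b"third"])

# The link before the change is unaffected; every link after it differs.
print(original[0] == tampered[0])
print(original[1] == tampered[1])
print(original[2] == tampered[2])
```

Checking only the last digest is therefore enough to notice a change anywhere earlier in the chain.
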
Briefly, Git keeps your data in the `.git/objects` subdirectory, where instead of normal filenames, you'll find only IDs. By using IDs as filenames, as well as a few lockfiles and timestamping tricks, Git transforms any humble filesystem into an efficient and robust database.

=== Intelligence ===

How does Git know you renamed a file, even though you never mentioned the fact explicitly? Sure, you may have run *git mv*, but that is exactly the same as a *git rm* followed by a *git add*.

Git heuristically ferrets out renames and copies between successive versions. In fact, it can detect chunks of code being moved or copied around between files! Though it cannot cover all cases, it does a decent job, and this feature is always improving. If it fails to work for you, try options that enable more expensive copy detection, such as *--find-copies-harder*, and consider upgrading.

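
The flavour of such a heuristic can be sketched in a few lines of Python. This is emphatically not Git's algorithm: it swaps in the standard library's `difflib` to score similarity, and the function name `guess_renames`, the file names and the 0.5 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def guess_renames(removed, added, threshold=0.5):
    """Pair each removed path with the most similar added path, if similar enough."""
    renames = []
    for old_path, old_text in removed.items():
        best_path, best_score = None, 0.0
        for new_path, new_text in added.items():
            score = SequenceMatcher(None, old_text, new_text).ratio()
            if score > best_score:
                best_path, best_score = new_path, score
        if best_path is not None and best_score >= threshold:
            renames.append((old_path, best_path, best_score))
    return renames

# Files that vanished and files that appeared between two versions.
removed = {"rose.txt": "a rose by any other name\nwould smell as sweet\n"}
added = {
    "flower.txt": "a rose by any other name\nwould smell as sweet!\n",
    "notes.txt": "buy milk\n",
}
for old, new, score in guess_renames(removed, added):
    print(f"{old} -> {new} ({score:.0%} similar)")
```

Note that nothing here was told about a rename: the pairing is inferred purely from the contents of the two versions, which is the point of the section above.
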
=== Indexing ===

For every tracked file, Git records information such as its size, ctime (when its inode last changed) and last modification time in a file known as the 'index'. To determine whether a file has changed, Git compares its current stats with those cached in the index. If they match, then Git can skip reading the file again.

Since stat calls are considerably faster than file reads, if you only edit a
few files, Git can update its state in almost no time.

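
The stat trick can be sketched as follows. This is a simplified model, not the real index format: the helper names `stat_signature` and `changed_paths` are invented, and the cache is a plain dictionary that tracks only three stat fields.

```python
import os
import tempfile

def stat_signature(path):
    """Return cheap-to-obtain fields that hint at whether a file changed."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns, st.st_ctime_ns)

def changed_paths(cache, paths):
    """Return the paths whose current stats differ from the cached ones."""
    return [p for p in paths if cache.get(p) != stat_signature(p)]

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "rose")
    with open(path, "w") as f:
        f.write("sweet\n")
    cache = {path: stat_signature(path)}  # what adding the file would record
    unchanged = changed_paths(cache, [path])
    with open(path, "a") as f:
        f.write("sweeter\n")              # the size grows, so the stats differ
    modified = changed_paths(cache, [path])

print(unchanged)
print(len(modified))
```

Only files flagged by the stat comparison need to be opened and rehashed; everything else is skipped without reading a single byte of content.
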
We stated earlier that the index is a staging area. Why is a bunch of file
stats a staging area? Because the add command puts files into Git's database
and updates these stats, while the commit command, without options, creates a
commit based only on these stats and the files already in the database.

=== Git's Origins ===

A Linux Kernel Mailing List post describes the chain of events that led to Git. The entire thread is a fascinating archaeological site for Git historians.

=== The Object Database ===

Every version of your data is kept in the 'object database', which lives in the
subdirectory `.git/objects`; the other residents of `.git/` hold lesser data:
the index, branch names, tags, configuration options, logs, the current
location of the head commit, and so on. The object database is elementary yet
elegant, and the source of Git's power.

Each file within `.git/objects` is an 'object'. There are 3 kinds of objects
that concern us: 'blob' objects, 'tree' objects, and 'commit' objects.

=== Blobs ===

First, a magic trick. Pick a filename, any filename. In an empty directory:

 $ echo sweet > YOUR_FILENAME
 $ git init
 $ git add .
 $ find .git/objects -type f

You'll see +.git/objects/aa/823728ea7d592acc69b36875a482cdf3fd5c8d+.

How do I know this without knowing the filename? It's because the
SHA1 hash of:

 "blob" SP "6" NUL "sweet" LF

is aa823728ea7d592acc69b36875a482cdf3fd5c8d,
where SP is a space, NUL is a zero byte and LF is a linefeed. You can verify
this by typing:

 $ printf "blob 6\000sweet\n" | sha1sum

Git is 'content-addressable': files are not stored according to their filename,
but rather by the hash of the data they contain, in a file we call a 'blob
object'. We can think of the hash as a unique ID for a file's contents, so
in a sense we are addressing files by their content. The initial `blob 6` is
merely a header consisting of the object type and its length in bytes; it
simplifies internal bookkeeping.

Thus I could easily predict what you would see. The file's name is irrelevant:
only the data inside is used to construct the blob object.

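
The same calculation in Python, as a small sketch using only the standard library. The function name `blob_id` is made up for this example; the header layout is the one described above.

```python
import hashlib

def blob_id(data):
    """Hash file contents the way a blob is named: header, NUL, then the data."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()

# "sweet" plus a linefeed is 6 bytes, hence the "blob 6" header.
print(blob_id(b"sweet\n"))
```

The filename never enters the calculation, so this should print the ID quoted above no matter what you called your file.
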
You may be wondering what happens to identical files. Try adding copies of
your file, with any filenames whatsoever. The contents of +.git/objects+ stay
the same no matter how many you add. Git only stores the data once.

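
A toy model shows why duplicates cost nothing. The class `ObjectStore` below is hypothetical and keeps everything in a dictionary rather than on disk, but it is keyed the same way: by the hash of the contents.

```python
import hashlib

class ObjectStore:
    """A dictionary keyed by content hash, standing in for .git/objects."""

    def __init__(self):
        self.objects = {}

    def add(self, data):
        raw = b"blob %d\x00" % len(data) + data
        object_id = hashlib.sha1(raw).hexdigest()
        self.objects.setdefault(object_id, raw)  # an existing entry is left alone
        return object_id

store = ObjectStore()
ids = {name: store.add(b"sweet\n") for name in ("rose", "tulip", "daisy")}

# Three filenames, one distinct ID, one stored object.
print(len(ids), len(set(ids.values())), len(store.objects))
```
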
By the way, the files within +.git/objects+ are compressed with zlib so you
should not stare at them directly. Filter them through
zpipe -d (a small example program from the zlib distribution), or type:

 $ git cat-file -p aa823728ea7d592acc69b36875a482cdf3fd5c8d

which pretty-prints the given object.

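
If you have neither tool at hand, Python's `zlib` module can do the decompression. The sketch below round-trips an object in memory; the helper names are invented, and it covers only loose objects like the ones in this chapter.

```python
import zlib

def encode_loose_object(kind, body):
    """Prefix the body with its header and compress the result."""
    raw = kind + b" " + str(len(body)).encode("ascii") + b"\x00" + body
    return zlib.compress(raw)

def decode_loose_object(compressed):
    """Undo the compression and split the header from the body."""
    raw = zlib.decompress(compressed)
    header, _, body = raw.partition(b"\x00")
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind.decode("ascii"), body

stored = encode_loose_object(b"blob", b"sweet\n")
kind, body = decode_loose_object(stored)
print(kind, body)
```

To inspect a real object, read the file in binary mode, for instance `open(".git/objects/aa/823728ea7d592acc69b36875a482cdf3fd5c8d", "rb").read()`, and pass the bytes to `decode_loose_object`.
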
=== Trees ===

But where are the filenames? They must be stored somewhere at some stage.
Git gets around to the filenames during a commit:

 $ git commit  # Type some message.
 $ find .git/objects -type f

You should now see 3 objects. This time I cannot tell you what the 2 new files are, as it partly depends on the filename you picked. We'll proceed assuming you chose ``rose''. If you didn't, you can rewrite history to make it look like you did:

 $ git filter-branch --tree-filter 'mv YOUR_FILENAME rose'
 $ find .git/objects -type f

Now you should see the file
+.git/objects/05/b217bb859794d08bb9e4f7f04cbda4b207fbe9+, because this is the
SHA1 hash of its contents:

 "tree" SP "32" NUL "100644 rose" NUL 0xaa823728ea7d592acc69b36875a482cdf3fd5c8d

Check this file does indeed contain the above by typing:

 $ echo 05b217bb859794d08bb9e4f7f04cbda4b207fbe9 | git cat-file --batch

With zpipe, it's easy to verify the hash:

 $ zpipe -d < .git/objects/05/b217bb859794d08bb9e4f7f04cbda4b207fbe9 | sha1sum

Hash verification is trickier via cat-file because its output contains more
than the raw uncompressed object file.

This file is a 'tree' object: a list of tuples consisting of a file
type, a filename, and a hash. In our example, the file type is 100644, which
means `rose` is a normal file, and the hash is the blob object that contains
the contents of `rose`. Other possible file types are executables, symlinks or
directories. In the last case, the hash points to a tree object.

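
Here is a sketch that assembles the same bytes in Python. The function name `tree_object` is made up, and sorting entries by plain filename is a simplifying assumption that is adequate for a single file. Note that the hash inside a tree is stored as 20 raw bytes, not as 40 hexadecimal digits.

```python
import hashlib

def tree_object(entries):
    """Build the uncompressed bytes of a tree from (mode, name, hex_id) tuples."""
    body = b""
    for mode, name, hex_id in sorted(entries, key=lambda entry: entry[1]):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\x00"
        body += bytes.fromhex(hex_id)  # 20 raw bytes
    header = b"tree " + str(len(body)).encode("ascii") + b"\x00"
    return header + body

raw = tree_object([("100644", "rose", "aa823728ea7d592acc69b36875a482cdf3fd5c8d")])
print(raw[:8])                       # the header, ending in the NUL byte
print(hashlib.sha1(raw).hexdigest())
```

The body is 11 bytes of `100644 rose`, one NUL and 20 bytes of hash, which accounts for the 32 in the header. If the layout above is right, the printed hash should be the ID of the tree object you found in +.git/objects+.
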
If you ran filter-branch, you'll have old objects you no longer need. Although
they will be jettisoned automatically once the grace period expires, we'll
delete them now to make our toy example easier to follow:

 $ rm -r .git/refs/original
 $ git reflog expire --expire=now --all
 $ git prune

For real projects you should typically avoid commands like this, as you are
destroying backups. If you want a clean repository, it is usually best to make
a fresh clone. Also, take care when directly manipulating +.git+: what if a Git
command is running at the same time, or a sudden power outage occurs?
In general, refs should be deleted with *git update-ref -d*,
though usually it's safe to remove +refs/original+ by hand.

=== Commits ===

We've explained 2 of the 3 objects. The third is a 'commit' object. Its
contents depend on the commit message as well as the date and time it was
created. To match what we have here, we'll have to tweak it a little:

 $ git commit --amend -m Shakespeare  # Change the commit message.
 $ git filter-branch --env-filter '
     export GIT_AUTHOR_DATE="Fri 13 Feb 2009 15:31:30 -0800"
     export GIT_AUTHOR_NAME="Alice"
     export GIT_AUTHOR_EMAIL=""
     export GIT_COMMITTER_DATE="Fri 13 Feb 2009 15:31:30 -0800"
     export GIT_COMMITTER_NAME="Bob"
     export GIT_COMMITTER_EMAIL=""'  # Rig timestamps and authors.
 $ find .git/objects -type f

You should now see a new commit object, whose filename is the SHA1 hash of its
contents:

 "commit 158" NUL
 "tree 05b217bb859794d08bb9e4f7f04cbda4b207fbe9" LF
 "author Alice <> 1234567890 -0800" LF
 "committer Bob <> 1234567890 -0800" LF
 LF
 "Shakespeare" LF

As before, you can run zpipe or cat-file to see for yourself.

This is the first commit, so there are no parent commits, but later commits
will always contain at least one line identifying a parent commit.

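
The layout of a commit can be sketched the same way as the other objects. The function `commit_object` below is a made-up helper, not part of Git; it places any parent lines between the tree line and the author line, and computes the size in the header from the body it is given.

```python
import hashlib

def commit_object(tree, author, committer, message, parents=()):
    """Build the uncompressed bytes of a commit object."""
    lines = ["tree " + tree]
    lines += ["parent " + parent for parent in parents]
    lines += ["author " + author, "committer " + committer, "", message]
    body = ("\n".join(lines) + "\n").encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\x00"
    return header + body

raw = commit_object(
    tree="05b217bb859794d08bb9e4f7f04cbda4b207fbe9",
    author="Alice <> 1234567890 -0800",
    committer="Bob <> 1234567890 -0800",
    message="Shakespeare",
)
print(hashlib.sha1(raw).hexdigest())
```

Change a single character of the message, a name or a timestamp and the printed hash changes completely, which is why we had to rig the author and date to get matching results.
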
=== Indistinguishable From Magic ===

Git's secrets seem too simple. It looks like you could mix together a few shell scripts and add a dash of C code to cook it up in a matter of hours: a melange of basic filesystem operations and SHA1 hashing, garnished with lock files and fsyncs for robustness. In fact, this accurately describes the earliest versions of Git. Nonetheless, apart from ingenious packing tricks to save space, and ingenious indexing tricks to save time, we now know how Git deftly changes a filesystem into a database perfect for version control.

For example, if any file within the object database is corrupted by a disk
error, then its hash will no longer match, alerting us to the problem. By
hashing hashes of other objects, we maintain integrity at all levels. Commits
are atomic, that is, a commit can never only partially record changes: we can
only compute the hash of a commit and store it in the database after we already
have stored all relevant trees, blobs and parent commits. The object
database is immune to unexpected interruptions such as power outages.

We defeat even the most devious adversaries. Suppose somebody attempts to
stealthily modify the contents of a file in an ancient version of a project. To
keep the object database looking healthy, they must also change the hash of the
corresponding blob object since it's now a different string of bytes. This
means they'll have to change the hash of any tree object referencing the file,
and in turn change the hash of all commit objects involving such a tree, in
addition to the hashes of all the descendants of these commits. This implies the
hash of the official head differs from that of the bad repository. By
following the trail of mismatching hashes we can pinpoint the mutilated file,
as well as the commit where it was first corrupted.

In short, so long as the 20 bytes representing the last commit are safe,
nobody can tamper with a Git repository unnoticed, short of finding SHA1
collisions.

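
The cascade is easy to reproduce in miniature. In this sketch the helper names are invented and the commits are stripped down (no author or committer lines), so the hashes are not ones a real repository would contain; the point is only how a change propagates upward.

```python
import hashlib

def object_id(kind, body):
    """Hash a header plus body, returning the 20 raw bytes."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).digest()

def history_head(old_contents):
    """Return the hex ID of the newest of two commits sharing one file."""
    blob = object_id("blob", old_contents)
    tree = object_id("tree", b"100644 rose\x00" + blob)
    tree_line = b"tree " + tree.hex().encode("ascii") + b"\n"
    first = object_id("commit", tree_line + b"\nfirst\n")
    parent_line = b"parent " + first.hex().encode("ascii") + b"\n"
    second = object_id("commit", tree_line + parent_line + b"\nsecond\n")
    return second.hex()

trusted = history_head(b"sweet\n")
forged = history_head(b"sour\n")  # the same history with the old file altered

print(trusted)
print(forged)
print(trusted == forged)
```

Altering the file changes the blob, which changes the tree, which changes the first commit, which changes its descendant, so comparing the two 20-byte head IDs exposes the forgery.
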
What about Git's famous features? Branching? Merging? Tags?
Mere details. The current head is kept in the file +.git/HEAD+,
which normally names the current branch, and otherwise contains the hash of a
commit object directly. The hash gets updated during a commit
as well as many other commands. Branches are almost the same: they are files in
+.git/refs/heads+, each holding the hash of a commit. Tags too: they live in
+.git/refs/tags+ but they are updated by a different set of commands.

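
Following the head down to a commit takes only a few lines. In most repositories +.git/HEAD+ holds a line such as `ref: refs/heads/master` rather than a raw hash, so the sketch below handles both cases. The function name `resolve_head` is made up, the commit ID is a placeholder, and the sketch ignores refs that Git has packed into a single file.

```python
import os
import tempfile

def resolve_head(git_dir):
    """Follow HEAD through any 'ref: ...' indirection to a commit hash."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        value = f.read().strip()
    while value.startswith("ref: "):
        ref_path = os.path.join(git_dir, *value[len("ref: "):].split("/"))
        with open(ref_path) as f:
            value = f.read().strip()
    return value

# Build a fake .git directory so the example runs anywhere.
placeholder_commit = "1234567890abcdef1234567890abcdef12345678"
with tempfile.TemporaryDirectory() as git_dir:
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
        f.write(placeholder_commit + "\n")
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/master\n")
    resolved = resolve_head(git_dir)

print(resolved == placeholder_commit)
```

On a real repository, prefer *git rev-parse HEAD*, which also copes with the cases this sketch leaves out.
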