/blog/2012/06/18/finding-oldest-firefox-code/index.html
HTML | 256 lines | 215 code | 30 blank | 11 comment | 0 complexity | 7545e3e5fa27d376c0e139f6d25e88ac MD5 | raw file
- <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
- "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
- <!--
- Design by Free CSS Templates
- http://www.freecsstemplates.org
- Released for free under a Creative Commons Attribution 2.5 License
- Name : Pollinating
- Description: A two-column, fixed-width design with dark color scheme.
- Version : 1.0
- Released : 20101114
- -->
- <html xmlns="http://www.w3.org/1999/xhtml">
- <head>
- <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
-
- <title>Gregory Szorc's Digital Home
- | Finding Oldest Firefox Code
- </title>
- <link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="/blog/feed" />
- <link rel="alternate" type="application/atom+xml" title="Atom 1.0"
- href="/blog/feed/atom" />
- <link rel="stylesheet" href="/style/style.css" type="text/css" />
- <link rel="stylesheet" href="/css/pygments_murphy.css" type="text/css" />
- </head>
- <body>
- <div id="wrapper">
-
- <div id="menu">
- <ul>
- <li><a href="/">Home</a></li>
- <li><a href="/blog/">Blog</a></li>
- <li><a href="/notes">Notes</a></li>
- <li><a href="/work.html">Work</a></li>
- <li><a href="/skills.html">Skills</a></li>
- <li><a href="/thoughts.html">Thoughts</a></li>
- <li><a href="/resume.pdf">Resume</a></li>
- </ul>
- </div>
- <div id="page">
- <div id="page-bgtop">
- <div id="page-bgbtm">
- <div id="content">
-
- <div class="blog_post">
- <a name="finding-oldest-firefox-code"></a>
- <h2 class="blog_post_title"><a href="/blog/2012/06/18/finding-oldest-firefox-code" rel="bookmark" title="Permanent Link to Finding Oldest Firefox Code">Finding Oldest Firefox Code</a></h2>
- <small>June 18, 2012 at 10:10 AM | categories:
- <a href='/blog/category/mozilla'>Mozilla</a>
- </small><p/>
- <div class="post_prose">
-
- <p>On Twitter the other night, Justin Dolske
- <a href="https://twitter.com/dolske/status/214241105179451393">posed a question</a>:</p>
- <pre><code>Weekend challenge: what is the oldest line of code still shipping
- in Firefox (tiebreaker: largest contiguous chunk)?
- </code></pre>
- <p>Good question and good challenge!</p>
- <h2>Technical Approach</h2>
- <p>To solve this problem, I decided my first task would be to produce a
- database holding line-by-line metadata for files <em>currently</em> in the
- Firefox repository. This sounded like a difficult problem at first,
- especially considering the Mercurial repository doesn't contain CVS
- history and this would be needed to identify code older than Mercurial
- that is <em>still</em> active.</p>
- <p>Fortunately, there exists a
- <a href="https://github.com/mozilla/mozilla-central/">Git repository</a> with full
- history of mozilla-central, including CVS and Mercurial! Armed with a
- clone of this repository, I wrote a quick shell <em>one-liner</em> to ascertain
- the history of every line in the repository:</p>
- <pre><code>for f in `git ls-tree -r --name-only HEAD`; do \
- echo "BEGIN_RECORD $f"; \
- git blame -l -t -M -C -n -w -p $f; \
- echo "END_RECORD $f"; \
- done
- </code></pre>
- <p>The <em>git ls-tree</em> command prints the names of every file in
- the current tree. This is basically doing <em>find . -type f</em> except for
- files under version control by Git. <em>git blame</em> attempts to ascertain
- the history of each line in a file. It is worth pointing out arguments
- <em>-M</em> and <em>-C</em>. These attempt to find moves/copies of the line from
- within the same commit. If these are omitted, simple refactoring such as
- renaming a file or reordering code within a file would result in a
- improper commit attribution. Basically, Git would associate the line
- with the commit that changed it. With these flags, Git attempts to
- complete the chain and find the true origin of the line (to some
- degree).</p>
- <p>Now, something I thought was really cool is <em>git blame</em>'s porcelain
- output format (<em>-p</em>). Not only does it allow for relatively simple machine
- readability of the output (yay), but it also compacts the output
- so adjacent lines sharing the same history metadata share the
- same metadata/header block. In other words, it solves Dolske's <em>largest
- contiguous chunk</em> tiebreaker for free! Thanks, Git!</p>
- <p>I should also say that <em>git blame</em> isn't perfect for attributing code.
- But I think it is good enough to solve a Twitter challenge.</p>
- <p>I piped the output of the above command into a file so I could have the
- original data available to process. After all, this data is idempotent,
- so it makes no sense to <em>not</em> save it. After running for a while, I noticed
- things were running slower than I'd like. I think it took about 2 hours
- to obtain info for ~5000 files. No good. I played around a bit and
- realized the <em>-M</em> and <em>-C</em> flags were slowing things down. This is
- expected. But, I really wanted this data for a comprehensive data
- analysis.</p>
- <p>I re-discovered <a href="https://www.gnu.org/software/parallel/">GNU Parallel</a>
- and modified my one-liner to use <strong>all the cores</strong>:</p>
- <pre><code>git ls-tree -r --name-only HEAD | \
- parallel 'echo "BEGIN_RECORD {}"; git blame -l -t -M -C -n -w -p {}; echo "END_RECORD {}"'
- </code></pre>
- <p>This made things run <em>substantially</em> faster since I was now running on
- all 8 cores, not just 1. With GNU Parallel, this simple kind of
- parallelism is almost too easy. Now, I say <em>substantially</em> faster, but
- overall execution is still slow. How slow? Well, on my Sandy Bridge Macbook
- Pro:</p>
- <pre><code>real 525m49.149s
- user 3592m15.862s
- sys 201m4.482s
- </code></pre>
- <p>8:45 wall time and nearly 60 hours of CPU time. Yeah, I'm surprised
- my laptop didn't explode too! The output file was 1,099,071,448 bytes
- uncompressed and 155,354,423 bzipped.</p>
- <p>While Git was obtaining data, I set about writing a consumer and data
- processor. I'm sure the wheel of parsing this <em>porcelain</em> output format
- has been invented before. But, I hadn't coded any Perl in a while and
- figured this was as good of an excuse as any!</p>
- <p>The Perl script to do the parsing and data analysis is available at
- <a href="https://gist.github.com/2945604">https://gist.github.com/2945604</a>.
- The core parsing function simply calls a supplied callback whenever a
- new <em>block</em> of code from the same commit/file is encountered.</p>
- <p>I implemented a function that records a mapping of commit times to
- blocks. Finally, I wrote a simple function to write the results.</p>
- <h2>Results</h2>
- <p>What did 60 hours of CPU time tell us? Well, the oldest recorded line
- dates from 1998-03-28. This is actually the <em>Free the Lizard</em> commit -
- the first commit of open source Gecko to CVS. From this commit (Git
- commit 781c48087175615674 for those playing at home), a number of lines
- linger, including code for <em>mkdepend</em> and <em>nsinstall</em>.</p>
- <p>But, Dolske's question was about <em>shipping</em> code. Well, as far as I can
- tell, the oldest shipping code in the tree honor is shared by the
- following:</p>
- <ul>
- <li>js/jsd/jsd1640.rc (pretty much the whole file)</li>
- <li>js/jsd/jsd3240.rc (ditto)</li>
- <li>js/jsd/jsd_atom.c:47-73</li>
- <li>js/src/jsfriendapi.cpp:388-400</li>
- <li>js/src/yarr/YarrParser.h:295-523</li>
- <li>media/libjpeg/jdarith.c:21-336</li>
- <li>media/libjpeg/jctrans.c:20-178</li>
- <li>media/libjpeg (misc files all have large blocks)</li>
- <li>xpcom/components/nsComponentManager.cpp:897-901</li>
- <li>gfx/src/nsTransform2D.h (a couple 3-4 line chunks)</li>
- <li>toolkit/library/nsDllMain.cpp:22-31</li>
- <li>widget/windows/nsUXThemeData.cpp:213-257</li>
- <li>widget/windows/nsWindowGfx.cpp:756-815</li>
- <li>xpcom/ds/nsUnicharBuffer.cpp:14-33</li>
- <li>xpcom/glue/nsCRTGlue.cpp:128-174</li>
- </ul>
- <p>There are also a few small chunks of 1 or 2 lines in a couple dozen
- other files from that initial commit.</p>
- <h2>Further Analysis</h2>
- <p>If anyone is interested in performing additional analysis, you can just
- take my Gist and install your own <em>onBlock</em> and output formatting
- functions! Of course, you'll need a data set. My code is written so it
- will work with any Git repository. If you want to analyze Firefox, it
- will take hours of CPU time to extract the data from Git. To save some
- time, you can obtain a copy of the raw data from commit
- <a href="http://people.mozilla.org/~gszorc/mc-blame-0b961fb702a9576cb456.bz2">0b961fb702a9576cb456809410209adbbb956bc8</a>.</p>
- <p>There is certainly no shortage of interesting analysis that
- can be performed over this data set. Some that come to mind are a
- <em>scoreboard</em> for most lines of code in the current tree (friendly
- competition, anyone?) and a breakdown of active lines by the period they
- were introduced.</p>
- <p>I'll leave more advanced analysis as an exercise for the reader.</p>
- </div>
- </div>
- </div>
-
- <div id="sidebar">
- <ul>
- <li>
- <h2>Categories</h2>
- <ul>
- <li><a href="/blog/category/apple">Apple</a></li>
- <li><a href="/blog/category/bugzilla">Bugzilla</a></li>
- <li><a href="/blog/category/ci">CI</a></li>
- <li><a href="/blog/category/clang">Clang</a></li>
- <li><a href="/blog/category/docker">Docker</a></li>
- <li><a href="/blog/category/firefox">Firefox</a></li>
- <li><a href="/blog/category/git">Git</a></li>
- <li><a href="/blog/category/javascript">JavaScript</a></li>
- <li><a href="/blog/category/mercurial">Mercurial</a></li>
- <li><a href="/blog/category/mozreview">MozReview</a></li>
- <li><a href="/blog/category/mozilla">Mozilla</a></li>
- <li><a href="/blog/category/personal">Personal</a></li>
- <li><a href="/blog/category/programming">Programming</a></li>
- <li><a href="/blog/category/puppet">Puppet</a></li>
- <li><a href="/blog/category/pyoxidizer">PyOxidizer</a></li>
- <li><a href="/blog/category/python">Python</a></li>
- <li><a href="/blog/category/review-board">Review Board</a></li>
- <li><a href="/blog/category/rust">Rust</a></li>
- <li><a href="/blog/category/sync">Sync</a></li>
- <li><a href="/blog/category/browsers">browsers</a></li>
- <li><a href="/blog/category/build-system">build system</a></li>
- <li><a href="/blog/category/code-review">code review</a></li>
- <li><a href="/blog/category/compilers">compilers</a></li>
- <li><a href="/blog/category/internet">internet</a></li>
- <li><a href="/blog/category/logging">logging</a></li>
- <li><a href="/blog/category/mach">mach</a></li>
- <li><a href="/blog/category/make">make</a></li>
- <li><a href="/blog/category/misc">misc</a></li>
- <li><a href="/blog/category/movies">movies</a></li>
- <li><a href="/blog/category/packaging">packaging</a></li>
- <li><a href="/blog/category/pymake">pymake</a></li>
- <li><a href="/blog/category/security">security</a></li>
- <li><a href="/blog/category/sysadmin">sysadmin</a></li>
- <li><a href="/blog/category/testing">testing</a></li>
- </ul>
- </li>
- </ul>
- </div>
- <div style="clear: both;"> </div>
- </div>
- </div>
- </div>
- <div id="footer">
-
- <hr/>
- <p>Copyright (c) 2012- Gregory Szorc. All rights reserved. Design by <a href="http://www.freecsstemplates.org/"> CSS Templates</a>.</p>
- </div>
- </div>
- </body>
- </html>