/blog/2012/06/18/finding-oldest-firefox-code/index.html

https://github.com/indygreg/indygreg.github.com
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!--
Design by Free CSS Templates
http://www.freecsstemplates.org
Released for free under a Creative Commons Attribution 2.5 License
Name : Pollinating
Description: A two-column, fixed-width design with dark color scheme.
Version : 1.0
Released : 20101114
-->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Gregory Szorc's Digital Home
| Finding Oldest Firefox Code
</title>
<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="/blog/feed" />
<link rel="alternate" type="application/atom+xml" title="Atom 1.0"
href="/blog/feed/atom" />
<link rel="stylesheet" href="/style/style.css" type="text/css" />
<link rel="stylesheet" href="/css/pygments_murphy.css" type="text/css" />
</head>
<body>
<div id="wrapper">
<div id="menu">
<ul>
<li><a href="/">Home</a></li>
<li><a href="/blog/">Blog</a></li>
<li><a href="/notes">Notes</a></li>
<li><a href="/work.html">Work</a></li>
<li><a href="/skills.html">Skills</a></li>
<li><a href="/thoughts.html">Thoughts</a></li>
<li><a href="/resume.pdf">Resume</a></li>
</ul>
</div>
<div id="page">
<div id="page-bgtop">
<div id="page-bgbtm">
<div id="content">
<div class="blog_post">
<a name="finding-oldest-firefox-code"></a>
<h2 class="blog_post_title"><a href="/blog/2012/06/18/finding-oldest-firefox-code" rel="bookmark" title="Permanent Link to Finding Oldest Firefox Code">Finding Oldest Firefox Code</a></h2>
<small>June 18, 2012 at 10:10 AM | categories:
<a href='/blog/category/mozilla'>Mozilla</a>
</small><p/>
<div class="post_prose">
<p>On Twitter the other night, Justin Dolske
<a href="https://twitter.com/dolske/status/214241105179451393">posed a question</a>:</p>
<pre><code>Weekend challenge: what is the oldest line of code still shipping
in Firefox (tiebreaker: largest contiguous chunk)?
</code></pre>
<p>Good question and good challenge!</p>
<h2>Technical Approach</h2>
<p>To solve this problem, I decided my first task would be to produce a
database holding line-by-line metadata for files <em>currently</em> in the
Firefox repository. This sounded like a difficult problem at first,
especially considering the Mercurial repository doesn't contain CVS
history and this would be needed to identify code older than Mercurial
that is <em>still</em> active.</p>
<p>Fortunately, there exists a
<a href="https://github.com/mozilla/mozilla-central/">Git repository</a> with full
history of mozilla-central, including CVS and Mercurial! Armed with a
clone of this repository, I wrote a quick shell <em>one-liner</em> to ascertain
the history of every line in the repository:</p>
  66. <pre><code>for f in `git ls-tree -r --name-only HEAD`; do \
  67. echo "BEGIN_RECORD $f"; \
  68. git blame -l -t -M -C -n -w -p $f; \
  69. echo "END_RECORD $f"; \
  70. done
  71. </code></pre>
<p>The <em>git ls-tree</em> command prints the name of every file in
the current tree. This is basically doing <em>find . -type f</em>, except
limited to files under version control by Git. <em>git blame</em> attempts to
ascertain the history of each line in a file. It is worth pointing out
arguments <em>-M</em> and <em>-C</em>. <em>-M</em> detects lines moved or copied
within a file, and <em>-C</em> additionally detects lines moved or copied from
other files modified in the same commit. If these are omitted, simple
refactoring such as reordering code within a file or moving code between
files would result in an improper commit attribution: Git would associate
the line with the commit that moved it rather than the commit that wrote
it. With these flags, Git attempts to complete the chain and find the true
origin of the line (to some degree).</p>
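<p>The <em>BEGIN_RECORD</em> and <em>END_RECORD</em> markers printed around each
<em>git blame</em> call make the combined output easy to split back into
per-file records. Here is a minimal Python sketch of that split. The function
name <em>iter_records</em> is hypothetical and is not taken from the Perl
script used for this post.</p>

```python
def iter_records(lines):
    """Yield (path, blame_lines) for each BEGIN_RECORD/END_RECORD pair.

    `lines` is any iterable of text lines, such as an open file object.
    """
    path = None
    body = []
    for raw in lines:
        line = raw.rstrip("\n")
        if path is None:
            # Outside a record: wait for the next opening marker.
            if line.startswith("BEGIN_RECORD "):
                path = line[len("BEGIN_RECORD "):]
                body = []
        elif line == "END_RECORD " + path:
            # The closing marker repeats the path, so match it exactly.
            yield path, body
            path = None
        else:
            body.append(line)
```

<p>Because it is a generator, the sketch holds only one file's blame output in
memory at a time, which matters when the combined output is large.</p>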
<p>Now, something I thought was really cool is <em>git blame</em>'s porcelain
output format (<em>-p</em>). Not only does it allow for relatively simple machine
readability of the output (yay), but it also compacts the output
so adjacent lines sharing the same history metadata share the
same metadata/header block. In other words, it solves Dolske's <em>largest
contiguous chunk</em> tiebreaker for free! Thanks, Git!</p>
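<p>To make the porcelain layout concrete, here is a rough Python sketch of a
reader for it. It assumes the layout described in the <em>git blame</em>
documentation: each line of the file gets a header of the form
<em>sha original-line final-line</em>, the first line of a group adds a fourth
field holding the number of lines in the group, commit metadata follows the
header the first time a commit appears, and the source line itself is prefixed
with a tab. The name <em>parse_porcelain</em> is hypothetical.</p>

```python
import re

# Header: <40-hex sha> <original line> <final line> [<lines in group>]
HEADER = re.compile(r"^([0-9a-f]{40}) (\d+) (\d+)(?: (\d+))?$")


def parse_porcelain(lines):
    """Return (groups, commits) parsed from `git blame -p` output lines."""
    groups = []   # one (sha, first_final_line, line_count) per contiguous chunk
    commits = {}  # sha -> {"author-time": "...", "summary": "...", ...}
    sha = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("\t"):
            continue  # the source line itself; not needed for attribution
        match = HEADER.match(line)
        if match:
            sha = match.group(1)
            commits.setdefault(sha, {})
            if match.group(4) is not None:
                # Only the first line of a group carries the line count.
                groups.append((sha, int(match.group(3)), int(match.group(4))))
        elif sha is not None and line:
            # Metadata such as "author-time 891043200" for the current commit.
            key, _, value = line.partition(" ")
            commits[sha][key] = value
    return groups, commits
```

<p>Each entry in <em>groups</em> is already a contiguous chunk, so the
tiebreaker reduces to comparing the third field of each tuple.</p>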
<p>I should also say that <em>git blame</em> isn't perfect for attributing code.
But I think it is good enough to solve a Twitter challenge.</p>
<p>I piped the output of the above command into a file so I could have the
original data available to process. After all, the blame output for a given
commit never changes, so it makes no sense to <em>not</em> save it. After
running for a while, I noticed things were running slower than I'd like. I
think it took about 2 hours to obtain info for ~5000 files. No good. I played
around a bit and realized the <em>-M</em> and <em>-C</em> flags were slowing
things down. This is expected. But, I really wanted this data for a
comprehensive data analysis.</p>
<p>I re-discovered <a href="https://www.gnu.org/software/parallel/">GNU Parallel</a>
and modified my one-liner to use <strong>all the cores</strong>:</p>
<pre><code>git ls-tree -r --name-only HEAD | \
parallel 'echo "BEGIN_RECORD {}"; git blame -l -t -M -C -n -w -p {}; echo "END_RECORD {}"'
</code></pre>
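<p>For comparison, the same fan-out can be sketched in Python with a thread
pool. This is an alternative to GNU Parallel, not what was run for this post.
Threads are enough here because the real work happens in the <em>git</em>
child processes. The names <em>blame_file</em> and <em>blame_all</em> and the
default of 8 workers are assumptions for illustration.</p>

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Same flags as the shell one-liner; "--" separates options from the path.
BLAME = ["git", "blame", "-l", "-t", "-M", "-C", "-n", "-w", "-p", "--"]


def blame_file(path):
    """Run git blame on one path and return its record as text."""
    result = subprocess.run(
        BLAME + [path],
        capture_output=True,
        text=True,
        errors="replace",  # tolerate files that are not valid UTF-8
        check=True,
    )
    return f"BEGIN_RECORD {path}\n{result.stdout}END_RECORD {path}\n"


def blame_all(paths, workers=8, blame=blame_file):
    """Blame many paths concurrently; records come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(blame, paths))
```

<p>Passing a different <em>blame</em> callable makes the pool easy to try out
without a repository at hand.</p>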
<p>This made things run <em>substantially</em> faster since I was now running on
all 8 cores, not just 1. With GNU Parallel, this simple kind of
parallelism is almost too easy. Now, I say <em>substantially</em> faster, but
overall execution is still slow. How slow? Well, on my Sandy Bridge MacBook
Pro:</p>
<pre><code>real 525m49.149s
user 3592m15.862s
sys 201m4.482s
</code></pre>
<p>That is 8 hours 45 minutes of wall time and nearly 60 hours of user CPU
time (about 63 hours once system time is included). Yeah, I'm surprised
my laptop didn't explode too! The output file was 1,099,071,448 bytes
uncompressed and 155,354,423 bytes after bzip2 compression.</p>
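<p>The arithmetic behind those figures can be checked with a few lines of
Python. This is only a worked conversion of the <em>time</em> output quoted
above. The helper name <em>to_seconds</em> is made up for the example.</p>

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '525m49.149s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)


real = to_seconds("525m49.149s")
user = to_seconds("3592m15.862s")
sys_time = to_seconds("201m4.482s")

print(f"wall time:      {real / 3600:.2f} hours")
print(f"user CPU time:  {user / 3600:.2f} hours")
print(f"total CPU time: {(user + sys_time) / 3600:.2f} hours")
# Total CPU time divided by wall time gives the average number of busy cores.
print(f"average busy cores: {(user + sys_time) / real:.2f}")
```

<p>The last figure comes out at roughly 7.2, so the 8 cores were kept busy
for most of the run.</p>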
<p>While Git was obtaining data, I set about writing a consumer and data
processor. I'm sure the wheel of parsing this <em>porcelain</em> output format
has been invented before. But, I hadn't coded any Perl in a while and
figured this was as good an excuse as any!</p>
<p>The Perl script to do the parsing and data analysis is available at
<a href="https://gist.github.com/2945604">https://gist.github.com/2945604</a>.
The core parsing function simply calls a supplied callback whenever a
new <em>block</em> of code from the same commit/file is encountered.</p>
<p>I implemented a function that records a mapping of commit times to
blocks. Finally, I wrote a simple function to write the results.</p>
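<p>The callback-plus-mapping design can be sketched in Python as follows. This
is not a port of the Perl script: the class name <em>OldestBlocks</em>, the
<em>on_block</em> signature, and the choice to treat commit times as Unix
timestamps in UTC are all assumptions made for illustration.</p>

```python
from collections import defaultdict
from datetime import datetime, timezone


class OldestBlocks:
    """Collect blocks keyed by commit time, then report the oldest ones."""

    def __init__(self):
        self.by_time = defaultdict(list)  # commit time -> list of blocks

    def on_block(self, path, first_line, line_count, commit_time):
        """Callback a parser would invoke once per contiguous block."""
        self.by_time[commit_time].append((path, first_line, line_count))

    def oldest(self):
        """Return (ISO date, blocks) for the earliest commit time seen.

        Blocks are sorted largest first, matching the tiebreaker.
        """
        when = min(self.by_time)
        blocks = sorted(self.by_time[when], key=lambda b: b[2], reverse=True)
        day = datetime.fromtimestamp(when, tz=timezone.utc).date().isoformat()
        return day, blocks
```

<p>Keeping the mapping separate from the parser means a different analysis
only needs a different callback, not a second pass over the raw data.</p>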
<h2>Results</h2>
<p>What did 60 hours of CPU time tell us? Well, the oldest recorded line
dates from 1998-03-28. This is actually the <em>Free the Lizard</em> commit,
the first commit of open source Gecko to CVS. From this commit (Git
commit 781c48087175615674 for those playing at home), a number of lines
linger, including code for <em>mkdepend</em> and <em>nsinstall</em>.</p>
<p>But, Dolske's question was about <em>shipping</em> code. Well, as far as I can
tell, the honor of oldest shipping code in the tree is shared by the
following:</p>
<ul>
<li>js/jsd/jsd1640.rc (pretty much the whole file)</li>
<li>js/jsd/jsd3240.rc (ditto)</li>
<li>js/jsd/jsd_atom.c:47-73</li>
<li>js/src/jsfriendapi.cpp:388-400</li>
<li>js/src/yarr/YarrParser.h:295-523</li>
<li>media/libjpeg/jdarith.c:21-336</li>
<li>media/libjpeg/jctrans.c:20-178</li>
<li>media/libjpeg (misc files all have large blocks)</li>
<li>xpcom/components/nsComponentManager.cpp:897-901</li>
<li>gfx/src/nsTransform2D.h (a couple 3-4 line chunks)</li>
<li>toolkit/library/nsDllMain.cpp:22-31</li>
<li>widget/windows/nsUXThemeData.cpp:213-257</li>
<li>widget/windows/nsWindowGfx.cpp:756-815</li>
<li>xpcom/ds/nsUnicharBuffer.cpp:14-33</li>
<li>xpcom/glue/nsCRTGlue.cpp:128-174</li>
</ul>
<p>There are also a few small chunks of 1 or 2 lines in a couple dozen
other files from that initial commit.</p>
<h2>Further Analysis</h2>
<p>If anyone is interested in performing additional analysis, you can just
take my Gist and install your own <em>onBlock</em> and output formatting
functions! Of course, you'll need a data set. My code is written so it
will work with any Git repository. If you want to analyze Firefox, it
will take hours of CPU time to extract the data from Git. To save some
time, you can obtain a copy of the raw data from commit
<a href="http://people.mozilla.org/~gszorc/mc-blame-0b961fb702a9576cb456.bz2">0b961fb702a9576cb456809410209adbbb956bc8</a>.</p>
<p>There is certainly no shortage of interesting analysis that
can be performed over this data set. Some that come to mind are a
<em>scoreboard</em> for most lines of code in the current tree (friendly
competition, anyone?) and a breakdown of active lines by the period they
were introduced.</p>
<p>I'll leave more advanced analysis as an exercise for the reader.</p>
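<p>As a starting point, both suggested analyses reduce to counting lines per
key. The Python sketch below assumes the blame data has already been reduced
to <em>(author, commit_time, line_count)</em> tuples, with commit times as Unix
timestamps. The function name <em>summarize</em> and that input shape are
assumptions for illustration, not part of the Gist.</p>

```python
from collections import Counter
from datetime import datetime, timezone


def summarize(blocks):
    """Return (scoreboard, lines_by_year) for (author, time, count) tuples."""
    by_author = Counter()
    by_year = Counter()
    for author, commit_time, line_count in blocks:
        by_author[author] += line_count
        year = datetime.fromtimestamp(commit_time, tz=timezone.utc).year
        by_year[year] += line_count
    # most_common() sorts authors by surviving line count, largest first.
    return by_author.most_common(), sorted(by_year.items())
```

<p>Grouping by year is only one choice of period. Swapping the
<em>year</em> expression for a release or quarter lookup would give a finer
breakdown.</p>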
</div>
</div>
</div>
<div id="sidebar">
<ul>
<li>
<h2>Categories</h2>
<ul>
<li><a href="/blog/category/apple">Apple</a></li>
<li><a href="/blog/category/bugzilla">Bugzilla</a></li>
<li><a href="/blog/category/ci">CI</a></li>
<li><a href="/blog/category/clang">Clang</a></li>
<li><a href="/blog/category/docker">Docker</a></li>
<li><a href="/blog/category/firefox">Firefox</a></li>
<li><a href="/blog/category/git">Git</a></li>
<li><a href="/blog/category/javascript">JavaScript</a></li>
<li><a href="/blog/category/mercurial">Mercurial</a></li>
<li><a href="/blog/category/mozreview">MozReview</a></li>
<li><a href="/blog/category/mozilla">Mozilla</a></li>
<li><a href="/blog/category/personal">Personal</a></li>
<li><a href="/blog/category/programming">Programming</a></li>
<li><a href="/blog/category/puppet">Puppet</a></li>
<li><a href="/blog/category/pyoxidizer">PyOxidizer</a></li>
<li><a href="/blog/category/python">Python</a></li>
<li><a href="/blog/category/review-board">Review Board</a></li>
<li><a href="/blog/category/rust">Rust</a></li>
<li><a href="/blog/category/sync">Sync</a></li>
<li><a href="/blog/category/browsers">browsers</a></li>
<li><a href="/blog/category/build-system">build system</a></li>
<li><a href="/blog/category/code-review">code review</a></li>
<li><a href="/blog/category/compilers">compilers</a></li>
<li><a href="/blog/category/internet">internet</a></li>
<li><a href="/blog/category/logging">logging</a></li>
<li><a href="/blog/category/mach">mach</a></li>
<li><a href="/blog/category/make">make</a></li>
<li><a href="/blog/category/misc">misc</a></li>
<li><a href="/blog/category/movies">movies</a></li>
<li><a href="/blog/category/packaging">packaging</a></li>
<li><a href="/blog/category/pymake">pymake</a></li>
<li><a href="/blog/category/security">security</a></li>
<li><a href="/blog/category/sysadmin">sysadmin</a></li>
<li><a href="/blog/category/testing">testing</a></li>
</ul>
</li>
</ul>
</div>
<div style="clear: both;">&nbsp;</div>
</div>
</div>
</div>
<div id="footer">
<hr/>
<p>Copyright (c) 2012- Gregory Szorc. All rights reserved. Design by <a href="http://www.freecsstemplates.org/"> CSS Templates</a>.</p>
</div>
</div>
</body>
</html>