<br><br>
<h1>Developer Documentation</h1>
FAQ is <a href=/admin.html>here</a>.
<br><br>
A work-in-progress <a href=/compare.html>comparison to SOLR</a>.
<br><br>
<b>CAUTION: This documentation is old and a lot of it is out of date. -- Matt, May 2014</b>
<br><br>
<!--- TABLE OF CONTENTS-->
<h1>Table of Contents</h1>
<table cellpadding=3 border=0>
<tr><td>I.</td><td><a href="#started">Getting Started</a> - Setting up your PC for Gigablast development.</td></tr>
<!--subtable-->
<tr><td></td><td><table cellpadding=3>
<tr><td valign=top>A.</td><td><a href="#ssh">SSH</a> - A brief introduction to setting up ssh.</td></tr>
</table></td></tr>
<tr><td>II.</td><td><a href="#shell">Using the Shell</a> - Some common shell commands.</td></tr>
<tr><td>III.</td><td><a href="#git">Using GIT</a> - Source code management.</td></tr>
<tr><td>IV.</td><td><a href="#hardware">Hardware Administration</a> - Gigablast hardware resources.</td></tr>
<tr><td>V.</td><td><a href="#dirstruct">Directory Structure</a> - How files are laid out.</td></tr>
<tr><td>VI.</td><td><a href="#kernels">Kernels</a> - Kernels used by Gigablast.</td></tr>
<tr><td>VII.</td><td><a href="#coding">Coding Conventions</a> - The coding style used at Gigablast.</td></tr>
<tr><td>VIII.</td><td><a href="#debug">Debugging Gigablast</a> - How to debug gb.</td></tr>
<tr><td>IX.</td><td><a href="#code">Code Overview</a> - Basic layers of the code.</td></tr>
<!--subtable1-->
<tr><td></td><td><table cellpadding=3>
<tr><td valign=top>A.</td><td><a href="#loop">The Control Loop Layer</a><br>
<tr><td valign=top>B.</td><td><a href="#database">The Database Layer</a><br>
<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#dbclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#rdb">Rdb</a> - The core database class.</td></tr>
<!--subtable3-->
<tr><td></td><td><table cellpadding=3>
<tr><td></td><td>a.</td><td><a href="#addingrecs">Adding Records</a></td></tr>
<tr><td></td><td>b.</td><td><a href="#mergingfiles">Merging Rdb Files</a></td></tr>
<tr><td></td><td>c.</td><td><a href="#deletingrecs">Deleting Records</a></td></tr>
<tr><td></td><td>d.</td><td><a href="#readingrecs">Reading Records</a></td></tr>
<tr><td></td><td>e.</td><td><a href="#errorcorrection">Error Correction</a> - How Rdb deals with corrupted data on disk.</td></tr>
<tr><td></td><td>f.</td><td><a href="#netadds">Adding Records over the Net</a></td></tr>
<tr><td></td><td>g.</td><td><a href="#netreads">Reading Records over the Net</a></td></tr>
<tr><td></td><td>h.</td><td><a href="#varkeysize">Variable Key Sizes</a></td></tr>
<tr><td></td><td>i.</td><td><a href="#rdbcache">Caching Records</a></td></tr>
</table></td></tr>
<!--subtable3end-->
<tr><td>3. a.</td><td><a href="#posdb">Posdb</a> - The new word-position storing search index Rdb.</td></tr>
<tr><td>3. b.</td><td><a href="#indexdb">Indexdb</a> - The retired/unused tf/idf search index Rdb.</td></tr>
<tr><td>4.</td><td><a href="#datedb">Datedb</a> - For returning search results constrained or sorted by date.</td></tr>
<tr><td>5.</td><td><a href="#titledb">Titledb</a> - Holds the cached web pages.</td></tr>
<tr><td>6.</td><td><a href="#spiderdb">Spiderdb</a> - Holds the urls to be spidered, sorted by spider date.</td></tr>
<!--<tr><td>7.</td><td><a href="#urldb">Urldb</a> - Tells us if a url is in spiderdb or where it is in titledb.</td></tr>-->
<tr><td>8.</td><td><a href="#checksumdb">Checksumdb</a> - Each record is a hash of the document content. Used for deduping.</td></tr>
<tr><td>9.</td><td><a href="#sitedb">Sitedb</a> - Used for mapping a url or site to a ruleset.</td></tr>
<tr><td>10.</td><td><a href="#clusterdb">Clusterdb</a> - Used for doing site clustering and duplicate search result removal.</td></tr>
<tr><td>11.</td><td><a href="#catdb">Catdb</a> - Used to hold DMOZ.</td></tr>
<tr><td>12.</td><td><a href="#robotdb">Robotdb</a> - Just a cache for robots.txt files.</td></tr>
</table></td></tr>
<!--subtable2end-->
<tr><td valign=top>C.</td><td><a href="#networklayer">The Network Layer</a><br>
<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#netclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#udpserver">Udp server</a> - Used for internal communication.</td></tr>
<!--subtable3-->
<tr><td></td><td><table cellpadding=3>
<tr><td></td><td>a.</td><td><a href="#multicasting">Multicasting</a></td></tr>
<tr><td></td><td>b.</td><td><a href="#msgclasses">Message Classes</a></td></tr>
</table></td></tr>
<!--subtable3end-->
<tr><td>3.</td><td><a href="#tcpserver">TCP server</a> - Used by the HTTP server.</td></tr>
<tr><td>4.</td><td><a href="#tcpserver">HTTP server</a> - The web server.</td></tr>
<tr><td>5.</td><td><a href="#dns">DNS Resolver</a></td></tr>
</table></td></tr>
<!--subtable2end-->
<tr><td valign=top>D.</td><td><a href="#buildlayer">The Build Layer</a><br>
<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#buildclasses">Associated Files</a></td></tr>
<tr><td>2.</td><td><a href="#spiderloop">Spiderdb.cpp</a> - Most of the spider code.</td></tr>
<!--
<tr><td>3.</td><td><a href="#spidercache">The Spider Cache</a> - Caches urls from spiderdb.</td></tr>
<tr><td>4.</td><td><a href="#msg14">Msg14</a> - Indexes a url.</td></tr>
<tr><td>5.</td><td><a href="#msg15">Msg15</a> - Sets the old IndexList.</td></tr>
<tr><td>6.</td><td><a href="#msg16">Msg16</a> - Sets the new IndexList.</td></tr>
<tr><td>7.</td><td><a href="#doc">The Doc Class</a> - Converts document to an IndexList.</td></tr>
-->
<tr><td>3.</td><td><a href="#linkanalysis">Link Analysis</a></td></tr>
<!--<tr><td>9.</td><td><a href="#robotdb">Robots.txt</a></td></tr>
<tr><td>10.</td><td><a href="#termtable">The TermTable</a></td></tr>
<tr><td>11.</td><td><a href="#indexlist">The IndexList</a></td></tr>-->
<tr><td>4.</td><td><a href="#samplevector">The Sample Vector</a> - Used for deduping at spider time.</td></tr>
<tr><td>5.</td><td><a href="#summaryvector">The Summary Vector</a> - Used for removing similar search results.</td></tr>
<tr><td>6.</td><td><a href="#gigabitvector">The Gigabit Vector</a> - Used for clustering results by topic.</td></tr>
<tr><td>7.</td><td><a href="#deduping">Deduping</a> - Preventing the indexing of duplicate pages.</td></tr>
<tr><td>8.</td><td><a href="#indexcode">Indexing Error Codes</a> - Reasons why a document failed to be indexed.</td></tr>
<tr><td>9.</td><td><a href="#docids">DocIds</a> - How DocIds are assigned.</td></tr>
<tr><td>10.</td><td><a href="#docidscaling">Scaling To Many DocIds</a> - How we scale to many DocIds.</td></tr>
</table></td></tr>
<!--subtable2end-->
<tr><td valign=top>E.</td><td><a href="#searchresultslayer">The Search Results Layer</a><br>
<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#queryclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#query">The Query Parser</a></td></tr>
<tr><td>3.</td><td>Getting the Termlists</td></tr>
<tr><td>4.</td><td>Intersecting the Termlists</td></tr>
<tr><td>5.</td><td><a href="#raiding">Raiding Termlist Intersections</a></td></tr>
<tr><td>6.</td><td>The Search Results Cache</td></tr>
<tr><td>7.</td><td>Site Clustering</td></tr>
<tr><td>8.</td><td>Deduping</td></tr>
<tr><td>9.</td><td>Family Filter</td></tr>
<tr><td>10.</td><td>Gigabits</td></tr>
<tr><td>11.</td><td>Topic Clustering</td></tr>
<tr><td>12.</td><td>Related and Reference Pages</td></tr>
<tr><td>13.</td><td><a href="#spellchecker">The Spell Checker</a></td></tr>
</table></td></tr>
<!--subtable2end-->
<tr><td valign=top>F.</td><td><a href="#adminlayer">The Administration Layer</a><br>
<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#adminclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#collections">Collections</a></td></tr>
<tr><td>3.</td><td><a href="#parms">Parameters and Controls</a></td></tr>
<tr><td>4.</td><td><a href="#log">The Log System</a></td></tr>
<tr><td>5.</td><td><a href="#clusterswitch">Changing Live Clusters</a></td></tr>
</table></td></tr>
<tr><td valign=top>G.</td><td><a href="#corelayer">The Core Functions Layer</a><br>
<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#coreclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#files">Files</a></td></tr>
<tr><td>3.</td><td><a href="#threads">Threads</a></td></tr>
</table></td></tr>
<tr><td valign=top>H.</td><td><a href="#filelist">File List</a> - List of all source code files with brief descriptions.<br>
</table></td></tr>
<tr><td>XIV.</td><td><a href="#spam">Fighting Spam</a> - Search engine spam identification.</td></tr>
</table>
<br><br>
<a name="started"></a>
<hr><h1>Getting Started</h1>
Welcome to Gigablast. Thanks for supporting this open source software.
<a name="ssh"></a>
<br><br>
<h2>SSH - A Brief Introduction</h2>
Gigablast uses SSH extensively to administer and develop its software. SSH provides a secure shell
connection that helps protect sensitive information over an insecure network. Internally, we
access our development machines with SSH, but being prompted
for a password when logging in between machines becomes tiresome, and development gb
processes will not run unless you can ssh without a password. This is where the authorized
keys file comes in. In your user directory you have a .ssh directory, and inside that directory
are an authorized_keys and a known_hosts file. First, generate a
key for your user on the current machine. Remember to press &lt;Enter&gt; when asked for a
passphrase, otherwise you will still have to log in.<br>
<code><i> ssh-keygen -t dsa </i></code><br>
That should have generated an id_dsa.pub and an id_dsa file. Now that you have a key, we need to
let all of our trusted hosts know it. Luckily, ssh provides a simple way to copy your key into
the appropriate files. When you run this command the machine will ask you to log in.<br>
<code><i> ssh-copy-id -i ~/.ssh/id_dsa.pub destHost</i></code><br>
Run this for all of our development machines, replacing destHost with the appropriate
machine name. Once you have done this, try sshing into each machine. You should no longer be
prompted for a password. If you were asked for a password or passphrase, go through this section again,
but DO NOT ENTER A PASSPHRASE when ssh-keygen requests one. <b>NOTE: Make sure to copy the id to the
same machine you generated the key on. This is important for running Gigablast since it requires the
ability to ssh freely.</b>
<br>
<br>
<i>Host Key Has Changed and Now I Can Not Access the Machine</i><br>
This is easily remedied. If a machine happens to crash and the OS needs to be replaced for whatever
reason, the SSH host key for that machine will change, but your known_hosts file will still have the
old host key. This is to prevent man-in-the-middle attacks, but when we know it is a rebuilt machine
we can simply correct the problem. Just open the ~/.ssh/known_hosts file with your favorite editor.
Find the line with the name or IP address of the offending machine and delete the entire line (the
known_hosts file separates host keys by newlines).<br>
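Alternatively, most OpenSSH installs ship a shortcut for this: <code><i>ssh-keygen -R destHost</i></code> removes the offending entry for you (worth verifying against the ssh-keygen man page on your version).<br>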
<br><br>
<a name="shell"></a>
<hr><h1>Using the Shell</h1>
<!-- I'd like to nominate to remove a majority of this section - if you need help with bash at this point, you probably
shouldn't be poking around in the code... Maybe just keep the dsh and export command descriptions? SCC -->
Everyone uses the bash shell. The navigation keystrokes work in emacs, too. Here are the common commands:
<br><br>
<table cellpadding=3 border=1>
<tr><td>Command</td><td>Description</td></tr>
<tr><td><nobr>export LD_LIBRARY_PATH=/home/mwells/dynamicslibs/</nobr></td><td>Export the LD_LIBRARY_PATH variable, used to tell the OS where to look for dynamic libraries.</td></tr>
<tr><td>Ctrl+p</td><td>Show the previously executed command.</td></tr>
<tr><td>Ctrl+n</td><td>Show the next executed command.</td></tr>
<tr><td>Ctrl+f</td><td>Move cursor forward.</td></tr>
<tr><td>Ctrl+b</td><td>Move cursor backward.</td></tr>
<tr><td>Ctrl+a</td><td>Move cursor to start of line.</td></tr>
<tr><td>Ctrl+e</td><td>Move cursor to end of line.</td></tr>
<tr><td>Ctrl+k</td><td>Cut buffer from cursor forward.</td></tr>
<tr><td>Ctrl+y</td><td>Yank (paste) buffer at cursor location.</td></tr>
<tr><td>Ctrl+Shift+-</td><td>Undo last keystrokes.</td></tr>
<tr><td>history</td><td>Show list of last commands executed. Edit /home/username/.bashrc to change the number of commands stored in the history. All are stored in the /home/username/.history file.</td></tr>
<tr><td>!xxx</td><td>Execute command #xxx, where xxx is a number shown by the 'history' command.</td></tr>
<tr><td>ps auxww</td><td>Show all processes.</td></tr>
<tr><td>ls -latr</td><td>Show all files reverse sorted by time.</td></tr>
<tr><td>ls -larS</td><td>Show all files reverse sorted by size.</td></tr>
<tr><td>ln -s &lt;x&gt; &lt;y&gt;</td><td>Make directory or file y a symbolic link to x.</td></tr>
<tr><td>cat xxx | awk -F":" '{print $1}'</td><td>Show contents of file xxx, but for each line, use : as a delimiter and print out the first token.</td></tr>
<tr><td>dsh -c -f hosts 'cat /proc/scsi/scsi'</td><td>Show all hard drives on all machines listed in the file <i>hosts</i>. -c means to execute this command concurrently on all those machines. dsh must be installed with <i>apt-get install dsh</i> for this to work. You can use double quotes in a single quoted dsh command without problems, so you can grep for a phrase, for instance.</td></tr>
<tr><td>apt-cache search xxx</td><td>Search for a package to install. xxx is a space separated list of keywords. Debian only.</td></tr>
<tr><td>apt-cache show xxx</td><td>Show details of the package named xxx. Debian only.</td></tr>
<tr><td>apt-get install xxx</td><td>Installs a package named xxx. Must be root to do this. Debian only.</td></tr>
<tr><td>adduser xxx</td><td>Add a new user to the system with username xxx.</td></tr>
</table>
<br>
<a name="git"></a>
<hr><h1>Using Git</h1>
Git is what we use to do source code control. Git allows us to have many engineers making changes to a common set of files. The basic commands are the following:<br><br>
<table border=1 cellpadding=3>
<tr>
<td><nobr>git clone &lt;srcDir&gt; &lt;destDir&gt;</nobr></td>
<td>This will copy the git repository to the destination directory.</td>
</tr>
</table>
<p>More information is available at <a href=http://www.github.com/>github.com</a>.</p>
<br/>
<h3>Setting up GitHub on a Linux Box</h3>
To make things easier, you can set up GitHub access via ssh. Here's a quick list of commands to run
(assuming Ubuntu/Debian):
<pre>
sudo apt-get install git-core git-doc
git config --global user.name "Your Name"
git config --global user.email "your@email.com"
git config --global color.ui true
ssh-keygen -t rsa -C "your@email.com" -f ~/.ssh/git_rsa
cat ~/.ssh/git_rsa.pub
</pre>
Copy and paste the ssh-rsa output from the above command into your GitHub profile's list of SSH Keys.
<pre>
ssh-add ~/.ssh/git_rsa
</pre>
If that gives you an error about being unable to connect to the ssh agent, run:
<pre>
eval `ssh-agent`
</pre>
Then test and clone!
<pre>
ssh -T git@github.com
git clone git@github.com:gigablast/open-source-search-engine
</pre>
<br><br>
<br><br>
<a name="hardware"></a>
<hr><h1>Hardware Administration</h1>
<h2>Hardware Failures</h2>
If a machine has a bunch of I/O errors in the log or the gb process is at a standstill, login and type "dmesg" to see the kernel's ring buffer. The error that means the hard drive is fried is something like this:<br>
<pre>
mwells@gf36:/a$ dmesg | tail -3
scsi5: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 13 89 68 82 00 00 18 00
Info fld=0x1389688b, Current sd08:34: sense key Medium Error
I/O error: dev 08:34, sector 323627528
</pre>
If you do a <i>cat /proc/scsi/scsi</i> you can see what type of hard drives are in the server:<br>
<pre>
mwells@gf36:/a$ cat /proc/scsi/scsi
Attached devices:
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi Model: HDS724040KLSA80 Rev: KFAO
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi Model: HDS724040KLSA80 Rev: KFAO
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi4 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi Model: HDS724040KLSA80 Rev: KFAO
  Type: Direct-Access ANSI SCSI revision: 03
Host: scsi5 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi Model: HDS724040KLSA80 Rev: KFAO
  Type: Direct-Access ANSI SCSI revision: 03
</pre>
<!-- Should most of the following be kept internally at GB? Not sure it would help with open-source users... SCC -->
So for this error we should replace the <b>rightmost</b> hard drive with a spare hard drive. Usually we have spare hard drives floating around; you can see where all the equipment is stored by looking at colo.html.
<br><br>
If a drive is not showing up in /proc/scsi/scsi when it should be, it may be a bad SATA channel on the motherboard, but sometimes you can see it again if you reboot a few times.
<br><br>
Often, if a motherboard fails, or a machine just stops responding to pings and cannot be revived, we designate it as a junk machine and replace its good hard drives with bad ones from other machines. Then we just need to fix the junk machine.
<br><br>
At some point we email John Wang from TwinMicro, or Nick from ACME Micro, to fix/RMA the gb/g/gf machines and gk machines respectively. We have to make sure there are plenty of spares to deal with spikes in the hardware failure rate.
<br><br>
Sometimes you cannot ssh to a machine but you can rsh to it. This is usually because a drive is bad and ssh cannot access some files it needs to log you in. And sometimes you can ping a machine but cannot rsh or ssh to it, for the same reason. Recently we had some problems with some machines not coming back up after reboot; that was due to a flaky 48-port NetGear switch.
<br><br>
<h2>Replacing a Hard Drive</h2>
When a hard drive goes bad, make sure it is labelled as such so we don't end up reusing it. If it is the leftmost drive slot (drive #0) then the OS will need to be reinstalled using a boot cd. Otherwise, we can just <b>umount /dev/md0</b> as root and rebuild ext2 using the command in /home/mwells/gbsetup2, <b>mke2fs -b4096 -m0 -N20000 -R stride=32 /dev/md0</b>. And to test the hard drives thoroughly, after the mke2fs, <i>nohup badblocks -b4096 -p0 -c10000 -s -w -o/home/mwells/badblocks /dev/md0 &gt;&amp; /home/mwells/bbout </i>. Sometimes you might even need to rebuild the software raid with <b>mkraid -c /etc/raidtab --really-force /dev/md0</b> before you do that. Once the raid is repaired, mount /dev/md0 and copy over the data from its twin. On the newly repaired host use <b>nohup rcp -r gbXX:/a/* /a/ &amp;</b> where gbXX is its twin that still has all the data. DO NOT ERASE THE TWIN'S DATA. You will also have to do <b>rcp -r gbXX:/a/.antiword /a/</b> because rcp does not pick that directory up.
<br><br>
<h2>Measuring Drive Performance</h2>
Slow drives are sometimes worse than broken drives. Run <b>./gb thrutest &lt;directoryToWriteTo&gt; 10000000000</b> to make Gigablast write out 10GB of files and then read from them. This is also a good test for checking a drive for errors. Each drive should have a write speed of about 39MB/s for a Hitachi, so the whole raid should be doing 150MB/s for writing, and close to 200MB/s for reading. Fuller drives will not do so well because of fragmentation. If the performance is less than it should be, try mounting each drive individually and seeing what speed you get, to isolate the culprit. Sometimes you can have the drive re-seated (just pulling it out and plugging it back in) and that does the trick. A soft reboot does not seem to work. I don't know yet whether a power cycle works.
<br><br>
<h2>Measuring Memory Performance</h2>
Run <b>membustest 50000000 50</b> to read 50MB from main memory 50 times to see how fast the memory bus is. It should be at least 2000MB/sec, and 2500MB/sec for the faster servers. Partap also wrote <b>gb memtest</b> to see the maximum amount of memory that can be allocated, in addition to testing memory bus speeds. The max memory that can be allocated seems to be about 3GB on our 2.4.31 kernel.
<br><br>
<h2>Measuring Network Performance</h2>
Use <b>thunder 50000000</b> on the server with IP address a.b.c.d and <b>thunder 50000000 a.b.c.d</b> on the client to see how fast the client can read UDP packets from the server. Gigablast needs about a Gigabit between all machines, and they must all be on a Gigabit switch. Sometimes the 5013CM-T resets the Gigabit to 100Mbps and causes Gigablast to basically come to a halt. You can detect that by doing a <b>dmesg | grep e1000 | tail -1</b> on each machine to see what speed its network interface is running at.
<br><br>
<a name="kernels"></a>
<hr><h1>Kernels</h1>
Gigablast runs well on the Linux 2.4.31 kernel, which resides in /gb/kernels/2.4.31/. The .config file is in that directory and named <b>dotconfig</b>. It has the Marvell patch, mvsata340_2.4.patch.bz2, applied so Linux can see the Marvell chip that drives the 4 SATA devices on our SuperMicro 5013CM-T 1U servers. In addition, we comment out the 3 'offline' statements in drivers/scsi/scsi_error.c in the kernel source, and raised the 3 TIMEOUT values to 30. These changes allow the kernel to deal with the 'drive offlined' messages that the Hitachi drives' SCSI interface gives every now and then. We still set the 'ghost lun' thing in the kernel source to 10, but I am not sure whether that actually helps.
<br><br>
Older kernels had problems with unnecessary swapping and also with crashing, most likely due to the e1000 driver, which drives the 5013CM-T's Gigabit ethernet functionality. All issues seem to be worked out at this point, so any kernel crash is likely due to a hardware failure, like bad memory or disk.
<br><br>
I would like to have larger block sizes on the ext2 filesystem, because 4KB is just not good for fragmentation purposes. We use mostly huge files on our partitions, raid-0'ed across four 400GB SATA drives, so it would help if we could use a block size of even a few megabytes. I've contacted Theodore Ts'o, the guy who wrote e2fs, and he may do it for a decent chunk of cash.
<br><br>
<a name="coding"></a>
<hr><h1>Coding Conventions</h1>
When editing the source please do not wrap lines past 80 columns. Also,
please make an effort to avoid excessive nesting; it makes for hard-to-read
code and can almost always be avoided.
<br><br>
Classes and files are almost exclusively named using capitalized letters for
the beginning of each word, like MyClass.cpp. Global variables start with g_,
static variables start with s_ and member variables start with m_. In order
to save vertical space and make things more readable, left curly brackets ({'s)
are always placed on the same line as the class, function or loop declaration.
In pointer declarations, the * sticks to the variable name, not the type, as
in void *ptr. All variables are to be named in camel case, as in lowerCamelCase.
Underscores in variable names are forbidden, with the exception of the g_, s_ and m_ prefixes, of course.
<br><br>
And try to avoid lots of nested curly brackets. Usually you do not have to nest
more than 3 levels deep if you write the code correctly.
<br><br>
Using the *p, p++ and p &lt; pend mechanisms is generally preferred to using the
p[i], i++ and i &lt; n mechanisms. So stick to that whenever possible.
<br><br>
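For example, a loop over a buffer follows this pattern (a minimal sketch; <i>buf</i> and <i>bufSize</i> are hypothetical names):
<pre>
// preferred pointer-based iteration, per the convention above
char *p    = buf;
char *pend = buf + bufSize;
for ( ; p &lt; pend ; p++ ) {
	// operate on *p here
}
</pre>
<br><br>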
Using ?'s in the code, like a = b ? d : e;, is also forbidden because it is too cryptic and easy to mess up.
<br><br>
Upon error, g_errno is set and the conditions causing the error are logged.
Whenever possible, Gigablast should recover after an error and continue running
gracefully.
<br><br>
Gigablast uses many non-blocking functions that return false if they block,
or true if they do not block. These types of functions will often set the
g_errno global error variable on error. These types of functions almost always
take a pointer to the callback function and the callback state as arguments.
The callback function will be called upon completion of the function.
<br><br>
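As a rough illustration of this convention (a sketch only; the function and variable names below are hypothetical, not from the codebase):
<pre>
// a non-blocking function takes the callback and its state as arguments
bool readSomething ( void *state , void (*callback)(void *state) ) {
	// ... start the I/O operation here ...
	bool completedRightAway = false;
	// if we blocked, "callback(state)" gets called on completion,
	// and our caller should typically return false as well
	if ( ! completedRightAway ) return false;
	// completed without blocking; caller must check g_errno
	return true;
}
</pre>
<br><br>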
Gigablast no longer uses Xavier Leroy's pthread library (LinuxThreads) because
it was buggy. Instead, the Linux clone() system call is used. That function
is pretty much exclusively Linux, and was the basis for pthreads anyway. It
uses an older version of libc which does not contain thread support for the
errno variable, so, like the old pthreads library, Gigablast's Threads.cpp
class has implemented the __errno_location() function to work around that. This
function provides a pointer to a separate thread's (process's) own errno
location so we do not contend for the same errno.
<br><br>
We also make an effort to avoid deep subroutine calls. Having to trace the
call stack down several functions is time-consuming and should be avoided.
<br><br>
Gigablast also tries to make a consistent distinction between size and length.
Length usually refers to a string length (like strlen()) and does not include
the NULL terminator, if any. Size is the overall size of a data blob,
and includes everything.
<br><br>
When making comments please do not insert the "//" at the very first character
of the line in the first column of the screen. It should be indented just
like any other line.
<br><br>
Whenever possible, please mimic the currently used styles and conventions,
and please avoid using third party libraries. Even the standard template
library has serious performance/scalability issues and should be avoided.
<br><br>
<a name="debug"></a>
<hr><h1>Debugging Gigablast</h1>
<h2>General Procedure</h2>
When Gigablast cores you generally want to use gdb to look at the core and see where it happened. You can use BitKeeper's <b>bk revtool &lt;filename&gt;</b> to see if the class or file that cored was recently edited. You may be looking at a bug that was already fixed and only need to update to the latest version.
<h2>Using GDB</h2>
<p>We generally use gdb to debug Gigablast. If you are running gb, the gigablast process, under gdb, and it receives a signal, gdb will break and you have to tell it to ignore the signal from now on by typing <b>handle SIG39 nostop noprint</b> for signal 39, at least, and then continue execution by typing 'c' and enter. When debugging a core on a customer's machine you might have to copy your version of gdb over to it, if they don't have one installed.</p>
<p>There is also a /gb/bin/gdbserver that you can use to debug a remote gb process, although no one really uses this except Partap, who used it a few times.</p>
<p>The most common way to use gdb is:
<ul>
<li><b>gdb ./gb</b> to start up gdb.
<li><b>run -c ./hosts.conf 0</b> to run gb under gdb.
<li><b>Ctrl-C</b> to break gb.
<li><b>where</b> to make gdb print out the call stack.
<li><b>break MyClass.cpp:123</b> to break gb at line 123 in MyClass.cpp.
<li><b>print &lt;variableName&gt;</b> to print out the value of a variable.
<li><b>watch &lt;variableName&gt;</b> to set a watchpoint so gdb breaks if the variable changes value.
<li><b>set &lt;variableName&gt; &lt;value&gt;</b> to set the value of a variable.
<li><b>set print elements 100000</b> to tell gdb to print out the first 100000 bytes when printing strings, as opposed to limiting it to the default 200 or so bytes.
</ul>
</p>
<p>You can also use gdb to do <i>poor man's profiling</i> by repeatedly attaching gdb to the gb pid like <b>gdb ./gb &lt;pid&gt;</b> and seeing where it is spending its time. This is a fairly effective random sampling technique.</p>
<p>If a gb process goes into an infinite loop you can get it to save its in-memory data by attaching gdb to its pid and typing <b>print mainShutdown(1)</b>, which will tell gdb to run that function, which will save all gb's data to disk so you don't end up losing data.</p>
<p>To debug a core type <b>gdb ./gb &lt;coreFilename&gt;</b>. Then you can examine why gb core dumped. Please copy the gb binary and move the core to another filename if you want to preserve the core for another engineer to look at. You need to use the exact gb that produced the core in order to analyze it properly.
</p>
<p>It is useful to have the following .gdbinit file in your home directory:
<pre>
set print elements 100000
handle SIG32 nostop noprint
handle SIG35 nostop noprint
handle SIG39 nostop noprint
set overload-resolution off
</pre>
The overload-resolution gets in the way when trying to print the return value of some functions, like uccDebug() for instance.
</p>
<h2>Using Git to Help Debug</h2>
<a href="#git">Git</a> can be a useful debugging aid. By doing <b>git log</b> you can see what the latest changes made to gb were. This may allow you to isolate the bug. <b>git diff HEAD~1 HEAD~0</b> will actually list the files that were changed by the last commit, instead of just the changesets. Substituting '1' with '2' will show the changes from the last two commits, etc.
<h2>Memory Leaks</h2>
All memory allocations, malloc, calloc, realloc and new, go through Gigablast's Mem.cpp class. That class has a hash table of pointers to any outstanding allocated memory, including the size of the allocation. When Gigablast allocates memory it will allocate a little more than requested so it can pad the memory with special bytes, called magic characters, that are verified when the memory is freed to see if the buffer was overwritten or underwritten. In addition to doing these boundary checks, Gigablast will also print out all the unfreed memory at exit. This allows you to detect memory leaks. The only way we can leak memory without detection now is if it is leaked in a third party library which Gigablast links to.
<h2>Random Memory Writes</h2>
This is the hardest problem to track down. Where did that random memory write come from? Usually, RdbTree.cpp uses the most memory, so if the write happens it will corrupt the tree. To fight this you can <i>protect</i> the RdbTree by forcing RdbTree::m_useProtection to true in RdbTree::set(). This will make Gigablast protect the memory when it is not being accessed explicitly by RdbTree.cpp code. It does this by calling mprotect(), which can slow things down quite a bit.
<h2>Corrupted Data In Memory</h2>
Try to reproduce it under gdb immediately before the corruption occurs, then set a watchpoint on some memory that was getting corrupted. Try to determine the start of the corruption in memory. This may lead you to a buffer that was overflowing. Oftentimes a gb process will core in RdbTree.cpp, which never happens unless the machine has bad RAM. Bad RAM also causes gb to go into an infinite loop in Msg3.cpp's RdbMap logic. In these cases try replacing the memory or using another machine.
<h2>Common Core Dumps</h2>
<table>
<tr><td>Problem</td><td>Cause</td></tr>
<tr><td>Core in RdbTree</td><td>Probably bad RAM.</td></tr>
</table>
<h2>Common Coding Mistakes</h2>
<table>
<tr><td>1.</td><td>Calling atoip() twice or more in the same printf() statement. atoip() outputs into a single static buffer and cannot be shared this way.</td></tr>
<tr><td>2.</td><td>Not calling class constructors and destructors when mallocing/freeing class objects. You need to allow the class to initialize properly after allocation and free any allocated memory before it is freed.</td></tr>
</table>
<br><br>
<a name="sa"></a>
<hr><h1>System Administration</h1>
<a name="code"></a>
<hr><h1>Code Overview</h1>
Gigablast consists of about 500,000 lines of C++ source. The latest gb version being developed as of this writing is /gb/datil/. Gigablast is a single executable, gb, that contains all the functionality needed to run a search engine. Gigablast attempts to uniformly distribute every function amongst all hosts in the network. Please see <a href="overview.html">overview.html</a> to get a general overview of the search engine from a USER perspective.
<br><br>
Matt started writing Gigablast in 2000, with the intention of making it the most efficient search engine in the world. A lot of attention was paid to making every piece of the code fast and scalable, while still readable. In July 2013, the majority of the search engine source code was open sourced. Gigablast continues to develop and expand the software, and also offers consulting services for companies wishing to run their own search engine.
<br><br>
<hr>
<a name="loop"></a>
<h1>The Control Loop Layer</h1>
Gigablast uses a <b>signal-based</b> architecture, as opposed to a thread-based architecture. At the heart of the Gigablast process is the I/O control loop, contained in the <a href="../latest/src/Loop.cpp">Loop.cpp</a> file. Gigablast sits in an idle state until it receives a signal on a file or socket descriptor, or a "thread" exits and sends a signal to be cleaned up. Threads are only used when necessary, which is currently for performing disk I/O, intersecting long lists of docIds to generate search results (which can block the CPU for a while) and merging database data from multiple files into a single list.
<br><br>
<a href="#threads">Threads</a> are not only hard to debug, but can make everything more latent. If you have 10 requests coming in, each requiring 1 second of CPU time, and you thread them all, it will take 10 seconds until any query is satisfied and its results returned; but if you handle them FIFO (one at a time) then the first request is satisfied in 1 second, the second in two seconds, etc. The Threads.cpp class handles the use of threads and uses the Linux clone() call. Xavier Leroy's pthreads were found to be too buggy at the time.
<br><br>
Loop.cpp also allows other classes to call its registerCallback() function, which takes a file or socket descriptor and will call a given callback when data is ready for reading or writing on that file descriptor. Gigablast accomplishes this with Linux's fcntl() call, which you can use to tell the kernel to connect a particular file descriptor to a signal. The heart of gb is really sigtimedwait(), which sits there idle waiting for signals. When one arrives it breaks out and the callback responsible for the associated file descriptor is called.
<br><br>
In the event that the queue of ready file descriptors is full, the kernel will deliver a SIGIO to the gb process, which performs a poll over all the registered file descriptors.
<br><br>
Gigablast will also catch other signals, like segmentation faults and HUP signals, and save the data to disk before exiting. That way we do not lose data because of a core dump or an accidental HUP signal.
<br><br>
Using a signal based architecture means that we have to work with callbacks a lot. Everything is triggered by an <i>event</i>, similar to GUIs, but our events are not mouse related; they are disk and network events. This naturally leads to the concept of a callback, which is a function to be called when a task is completed.
So you might call a function like <b>readData ( myCallback , myState , readbuf , bytesToRead )</b> and Gigablast will perform the requested read operation and call <b>myCallback ( myState )</b> when it is done. This means you have to worry about state, whereas in a thread-based architecture you do not.
<br><br>
Furthermore, all callbacks must be expressed as pointers to <b>C</b> functions, not C++ functions. So we call these C functions wrapper functions and have them immediately call their C++ counterparts. You'll see these wrapper functions clearly illustrated throughout the code.
<br><br>
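A typical wrapper looks something like this sketch (the class and function names are hypothetical, not from the source):
<pre>
// plain C-style function whose address can be passed as a callback
void gotListWrapper ( void *state ) {
	MyClass *THIS = (MyClass *)state; // recover the C++ object
	THIS-&gt;gotList ( );              // resume the C++ logic
}
</pre>
<br><br>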
Gigablast also has support for asynchronous signals. You can tell the kernel to deliver you an asynchronous signal on an I/O event, and that signal, also known as an interrupt, will interrupt whatever the CPU is doing and call gb's signal handler, sigHandlerRT(). Be careful: you cannot allocate memory when in a real-time/asynchronous signal handler. You should only call function names ending in _ass, which stands for asynchronous-signal safe. The global instance of the UdpServer called g_udpServer2 is based on these asynchronous signals, mostly set up for doing fast pings without having to worry about what the CPU is doing at the time, but it really does not interrupt like it should, most likely due to kernel issues. It doesn't seem any faster than the non-asynchronous UDP server, g_udpServer.
<br><br>This signal-based architecture also makes a lot of functions return false if they block, meaning they never completed because they have to wait on I/O, or return true and set g_errno on error. So if a function returns false on you, generally, you should return false, too.
<br><br>
The Loop class is primarily responsible for calling callbacks when an event
occurs on a file/socket descriptor. Loop.cpp also calls callbacks that were
registered with it via registerSleepCallback(). Those sleep callbacks
are called every X milliseconds, or more than X milliseconds if load is
extreme. Loop also responds to SIGCHLD signals which are delivered when a thread
exits. The only socket descriptor related callbacks are for TcpServer and
for UdpServer. UdpServer and TcpServer each have their own routine for calling
callbacks for an associated socket descriptor. But when the system is under
heavy load we must be careful with which callbacks we call. We should call the
highest priority (lowest niceness) callbacks before calling any of the
lowest priority (highest niceness) callbacks.
To solve this problem we require that every callback registered with Loop.cpp
return after 2 ms or more have elapsed, if possible. If they still need to
be called by Loop.cpp to do more work they should return true at that time,
otherwise they should return false. We do this in order to give higher priority
callbacks a chance to be called. This is similar to the low latency patch in
the Linux kernel. Actually, low priority callbacks will try to break out after
2 ms or more, but high priority callbacks will not. Loop::doPoll() just calls
low priority callbacks.
<a name="database"></a>
<hr><h1>The Database Layer</h1>
Gigablast has the following databases which each contain an instance of the
Rdb class:
<pre>
Indexdb    - Used to hold the index.
Datedb     - Like indexdb, but its scores are dates.
Titledb    - Used to hold cached web pages.
Spiderdb   - Used to hold urls sorted by their scheduled time to be spidered.
Checksumdb - Used for preventing spidering of duplicate pages.
Sitedb     - Used for classifying webpages. Maps webpages to rulesets.
Clusterdb  - Used to hold the site hash, family filter bit, language id of a document.
Catdb      - Used to classify a document using DMOZ.
</pre>
<a name="rdb"></a>
<h2>Rdb</h2>
<b>Rdb.cpp</b> is a record-based database. Each record has a key, an optional
blob of data and an optional long (4 bytes) that holds the size of the blob of
data. If present, the size of the blob is stored right after the key. The
key is 12 bytes and of type key_t, as defined in the types.h file.
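Conceptually, a record is laid out like this (a sketch based on the description above, not a struct from the source):
<pre>
// conceptual layout of a single Rdb record
//   key_t key;      // 12-byte key, defined in types.h
//   long  dataSize; // 4 bytes; present only for Rdbs whose records carry data
//   char  data[];   // dataSize bytes of data, if any
</pre>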
<a name="dbclasses"></a>
<h3>Associated Classes (.cpp and .h files)</h3>
<table cellpadding=3 border=1>
<tr><td>Checksumdb</td><td>DB</td><td>Rdb that maps a docId to a checksum for an indexed document. Used to dedup same content from the same hostname at build time.</td></tr>
<tr><td>Clusterdb</td><td>DB</td><td>Rdb that maps a docId to the hash of a site and its family filter bit and, optionally, a sample vector used for deduping search results. Used for site clustering, family filtering and deduping at query time.</td></tr>
<tr><td>Datedb</td><td>DB</td><td>Like indexdb, but its <i>scores</i> are 4-byte dates.</td></tr>
<tr><td>Indexdb</td><td>DB</td><td>Rdb that maps a termId to a score and docId pair. The search index is stored in Indexdb.</td></tr>
<tr><td>MemPool</td><td>DB</td><td>Used by RdbTree to add new records to the tree without having to do an individual malloc.</td></tr>
<tr><td>MemPoolTree</td><td>DB</td><td>Unused. Was our own malloc routine.</td></tr>
<tr><td>Msg0</td><td>DB</td><td>Fetches an RdbList from across the network.</td></tr>
<tr><td>Msg1</td><td>DB</td><td>Adds all the records in an RdbList to various hosts in the network.</td></tr>
<tr><td>Msg3</td><td>DB</td><td>Reads an RdbList from several consecutive files in a particular Rdb.</td></tr>
<!--<tr><td>Msg34</td><td>DB</td><td>Determines least loaded host in a group (shard) of hosts.</td></tr>
<tr><td>Msg35</td><td>DB</td><td>Merge token management functions. Currently does not work.</td></tr>-->
<tr><td>Msg5</td><td>DB</td><td>Uses Msg3 to read RdbLists from multiple files and then merges those lists into a single RdbList. Does corruption detection and repair. Integrates the list from the RdbTree into the single RdbList.</td></tr>
<tr><td>MsgB</td><td>DB</td><td>Unused. A distributed cache for caching anything.</td></tr>
<tr><td>Rdb</td><td>DB</td><td>The core database class from which all are derived.</td></tr>
<tr><td>RdbBase</td><td>DB</td><td>Each Rdb has an array of RdbBases, one for each collection. Each RdbBase has an array of BigFiles for that collection.</td></tr>
<tr><td>RdbCache</td><td>DB</td><td>Can cache RdbLists or individual Rdb records.</td></tr>
<tr><td>RdbDump</td><td>DB</td><td>Dumps the RdbTree to an Rdb file. Also used by RdbMerge to dump the merged RdbList to a file.</td></tr>
<tr><td>RdbList</td><td>DB</td><td>A list of Rdb records.</td></tr>
<tr><td>RdbMap</td><td>DB</td><td>Maps an Rdb key to an offset into an RdbFile.</td></tr>
<tr><td>RdbMem</td><td>DB</td><td>Memory manager for RdbTree so it does not have to allocate space for every record in the tree.</td></tr>
<tr><td>RdbMerge</td><td>DB</td><td>Merges multiple Rdb files into one Rdb file. Uses Msg5 and RdbDump to do reading and writing respectively.</td></tr>
<tr><td>RdbScan</td><td>DB</td><td>Reads an RdbList from an RdbFile, used by Msg3.</td></tr>
<tr><td>RdbTree</td><td>DB</td><td>A binary tree of Rdb records. All collections share a single RdbTree, so the collection number is specified for each node in the tree.</td></tr>
<tr><td>SiteRec</td><td>DB</td><td>A record in Sitedb.</td></tr>
<tr><td>Sitedb</td><td>DB</td><td>An Rdb that maps a url to a Sitedb record which contains a ruleset to be used to parse and index that url.</td></tr>
<tr><td>SpiderRec</td><td>DB</td><td>A record in Spiderdb.</td></tr>
<tr><td>Spiderdb</td><td>DB</td><td>An Rdb whose records are urls sorted by the times they should be spidered. The key contains other information, like whether the url is <i>old</i> or <i>new</i> to the index, and the priority of the url, currently from 0 to 7.</td></tr>
<tr><td>TitleRec</td><td>DB</td><td>A record in Titledb.</td></tr>
<tr><td>Titledb</td><td>DB</td><td>An Rdb where the records are basically compressed web pages, along with other info like the quality of the page. Contains an instance of the LinkInfo class.</td></tr>
<!--<tr><td>Urldb</td><td>DB</td><td>An Rdb whose records indicate if a url is in spiderdb or what particular Titledb BigFile contains the url.</td></tr>-->
</table>
<br><br>
<a name="addingrecs"></a>
<h3>Adding a Record to Rdb</h3>
<a name="rdbtree"></a>
When a record is added to an Rdb it is housed in a binary tree,
<b>RdbTree.cpp</b>,
whose size is configurable via gb.conf. For example, &lt;indexdbMaxTreeSize&gt;
specifies how much memory in bytes the tree for Indexdb can use. When the
tree is 90% full it is dumped to a file on disk. The records are dumped in
order of their keys. When enough files have accrued on disk, a merge is
performed to keep the number of files down and responsiveness up.
<br><br>
Each file, called a BigFile and defined by BigFile.h, can actually consist of
multiple physical files, each limited in size to 512MB. In this manner
Gigablast overcomes the 2GB file limit imposed by some kernels or disk systems.
Each physical file in a BigFile, after the first file, has a ".partX"
extension added to its filename, where X ranges from 1 to infinity. Throughout
this document, "BigFile" is used interchangeably with "file".
<br><br>
If the tree cannot accommodate a record add, it will return an ETRYAGAIN error.
Typically, most Gigablast routines will wait one second before retrying the
add.
<br>
<h3>Dumping an Rdb Tree</h3>
The RdbDump class is responsible for dumping the RdbTree class to disk. It
dumps a little bit of the tree at a time because it can take a few hundred
milliseconds to gather a large list of records from the tree,
especially when the records are dataless (just keys). Since RdbTree::getList()
is not threaded it is important that RdbDump gets a little bit at a time so
it does not block other operations.
<br>
<a name="mergingfiles"></a>
<a name="rdbmerge"></a>
<h3>Merging Rdb Files</h3>
Rdb merges just enough files so as to keep the number of files at or below the
threshold specified in gb.conf, which is &lt;indexdbMaxFilesToMerge&gt; for
Indexdb, for example. <b>RdbMerge.cpp</b> is used to control the merge logic.
The merge chooses which files to merge so as to minimize
the amount of disk writing for the long term. It is important to keep the
number of files low, so that any time a key range of records is requested, the
number of disk seeks will be low and the database will be responsive.
<br><br>
When too many files have accumulated on disk, the database will enter "urgent
merge mode". It will note this in the log when it happens. When that happens,
Gigablast will not dump the corresponding RdbTree to disk because
that would create too many files. If the RdbTree is full then any
attempts to add data to it while Gigablast is in "urgent merge mode" will fail
with an ETRYAGAIN error. These error replies are counted on a per host basis
and displayed in the Hosts table. At this point the host is considered to be
a spider bottleneck, and most spiders will be stuck waiting to add data to
the host.
<br><br>
If Gigablast is configured to "use merge tokens" (which no longer works for some reason and has since been disabled), then any file merge operation
may be postponed if another instance of Gigablast is performing a merge on the
same computer's IDE bus, or if the twin of the host is merging. This is done
mostly for performance reasons. File merge operations tend to increase disk
access times, so having a handy twin and not bogging down an IDE bus allows
Gigablast's load balancing algorithms to redirect requests to hosts that are
not involved with a big file merge.
<br>
<a name="deletingrecs"></a>
<h3>Deleting Rdb Records</h3>
The last bit of the 12-byte key, key_t, is called the delbit. It is 0 if the
key is a "negative key", and 1 if the key is a "positive key". When
performing a merge, a negative key may collide with its positive counterpart,
thus annihilating one another. Therefore, deletes are performed by taking the
key of the record you want to delete, changing the low bit from a 1 to a 0,
and then adding that key to the database. Negative keys are always dataless.
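In code, that amounts to something like this sketch (assuming the delbit is the lowest bit of key_t's low 8-byte component, n0):
<pre>
// form the delete (negative) counterpart of a positive key
key_t delKey = key;
delKey.n0 &amp;= ~0x01ULL; // clear the delbit: positive -&gt; negative key
// now add delKey, dataless, to the Rdb to delete the record
</pre>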
<br>
<a name="readingrecs"></a>
<h3>Reading a List of Rdb Records</h3>
When a user requests a list of records, Gigablast will read the records
in the key range from each file. If no more than X bytes of records are
requested, then Gigablast will read no more than X bytes from each file. After
the reading, it merges the lists from each file into a final list. During this
phase it will also annihilate positive keys with their negative counterparts.
<br><br>
<a name="rdbmap"></a>
In order to determine where to start reading in each file for the database,
Gigablast uses a "map file". The map file, encompassed by <b>RdbMap.cpp</b>,
records the key of the first record
that occurs on each disk page. Each disk page is PAGE_SIZE bytes, currently
16k. In addition to recording the key of the first record that starts on
each page, it records a 2-byte offset of that key into the page. The map file
is represented by the RdbMap class.
<br><br>
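A sketch of how the map bounds a read (the array names here are hypothetical):
<pre>
// find the last page whose first key is still &lt;= startKey
long page = 0;
while ( page + 1 &lt; numPages &amp;&amp; firstKeyOnPage[page+1] &lt;= startKey )
	page++;
// byte offset to begin scanning at; stop once past endKey
long offset = page * PAGE_SIZE + keyOffsetInPage[page];
</pre>
<br><br>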
The Msg3::readList() routine is used to retrieve a list of Rdb records from
a local Rdb. Like most Msg classes, it allows you to specify a callback
function and callback data in case it blocks. The Msg5 class contains the Msg3
class and extends it with the error correction (discussed below) and caching
capabilities. Calling Msg5::getList() also allows you to incorporate the
lists from the RdbTree in order to get realtime data. Msg3 is used mostly
by the file merge operation.
<br>
<a name="errorcorrection"></a>
<h3>Rdb Error Correction</h3>
Every time a list of records is read, either for answering a query or for
doing a file merge operation, it is checked for corruption. Right now
just the order of the keys of the records, and whether those keys are out of the
requested key range, are checked. Later, checksums may
be implemented as well. If some keys are found to be out of order or out
of the requested key range, the request
for the list is forwarded to that host's twin. The twin is a mirror image.
If the twin has the data intact, it returns its data across the network,
but the twin will return all keys it has in that range, not necessarily just from a single file, thus we can end up patching the corrupted data with a list that is hundreds of times bigger.
Since this
procedure also happens during a file merge operation, merging is a way of
correcting the data.
<br><br>
If there is no twin available, or the twin's data is corrupt as well, then
Gigablast attempts to excise the smallest amount of data so as to regain
integrity. Any time corruption is encountered it is noted in the log with
a message like:
<pre>
1145413139583 81 WARN db [31956] Key out of order in list of records.
1145413139583 81 WARN db [31956] Corrupt filename is indexdb1215.dat.
1145413139583 81 WARN db [31956] startKey.n1=af6da55 n0=14a9fe4f69cd6d46 endKey.n1=b14e5d0 n0=8d4cfb0deeb52cc3
1145413139729 81 WARN db [31956] Removed 0 bytes of data from list to make it sane.
1145413139729 81 WARN db [31956] Removed 6 recs to fix out of order problem.
1145413139729 81 WARN db [31956] Removed 12153 recs to fix out of range problem.
1145413139975 81 WARN net Encountered a corrupt list.
1145413139975 81 WARN net Getting remote list from twin instead.
1145413139471 81 WARN net Received good list from twin. Requested 5000000 bytes and got 5000010.
startKey.n1=af6cc0e n0=ae7dfec68a44a788 endKey.n1=ffffffffffffffff n0=ffffffffffffffff
</pre>
<br>
<a name="netreads"></a>
<a name="msg0"></a>
<h3>Getting a List of Records from Rdb in a Network</h3>
<b>Msg0.cpp</b> is used to retrieve a list of Rdb records from an arbitrary
host. Indexdb is currently the only database where lists of records are
needed. All the other databases pretty much just do single key lookups. Each
database chooses how to partition its records across the
many hosts in a network based on the key. For the most part, the most significant bits of the
key are used to determine which host is responsible for storing that record.
This is why Gigablast is currently limited to running on 2, 4, 8, 16 ... or
2^n machines. Msg0 should normally be used to get records, not Msg5 or Msg3.
<br>
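A sketch of that partitioning (hypothetical names; assumes 2^n host groups, per the text above):
<pre>
// the high bits of the key's top 32-bit component, n1, pick the group
long groupBits = 3;                            // e.g. 8 host groups
long group     = key.n1 &gt;&gt; ( 32 - groupBits );
</pre>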
<a name="netadds"></a>
<a name="msg1"></a>
<h3>Adding a Record to Rdb in a Network</h3>
You can add a record to a local instance of Rdb by using Rdb::addRecord() and
you can use Rdb::deleteRecord() to delete records. But in order to add a
record to a remote host you must use <b>Msg1.cpp</b>. This class
will use the key of the record to determine which host or hosts in the network
are responsible for storing the record. It will continue to send the add
request to each host until it receives a positive reply. If the host receiving
the record is down, the add operation will not complete, and will keep
retrying forever.
<br>
  664. <a name="varkeysize"></a>
  665. <h3>Variable Key Size</h3>
  666. As of June 2005, Rdb supports a variable key size. Before, the key was always of class <i>key_t</i> as defined in types.h as being 12 bytes. Now the key can be 16 bytes as well. The key size, either 12 or 16 bytes, is passed into the Rdb::init() function. MAX_KEY_BYTES is currently #define'd as being 16, so if you ever use a key bigger than 16 bytes that must be updated. Furthermore, the functions for operating on key_t's, while still available, have been generalized to accept a key size parameter. These functions are in types.h as well, and the more popular ones are KEYSET() and KEYCMP() for setting and comparing variable sized keys.
  667. <br><br>
  668. To convert the Rdb files from a fixed-size key to a variable-sized one required modifying the files of the Associated Classes (listed above). Essentially, all variables of type key_t were changed to character pointers (char *), and all key assignment and comparison operators (and others) were changed to use the more general KEYSET() and KEYCMP() functions in types.h. To maintain backwards compatibility and ease migration, all Rdb associated classes still accept key_t parameters, but now they also accept character pointer parameters for passing in keys.
  669. <br><br>
Some of the more common bugs from this change: since the keys are now character pointers, data owned by one class often got overwritten by another, so you have to remember to copy the key with KEYSET() rather than just operating on the key the pointer points to. Another common mistake is using the KEYCMP() function without a comparison operator immediately following it, such as &lt;, &gt;, == or !=. Passing the variable key size, m_ks, around and keeping it consistent everywhere is another weak point. There were some cases of leaving a statement like <i>(char *)&key</i> alone when it should have been changed to just <i>key</i>, since it had been made a <i>(char *)</i> instead of a <i>key_t</i>. And there was a case of not handling the 6-byte compression correctly, like replacing <i>6</i> with <i>m_ks-6</i> when we should not have, as in RdbList::constrain_r(). Whenever possible, the original code was left intact and simply commented out to aid in future debugging.
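<br><br>
As a rough illustration, here is how the variable-size key helpers behave, assuming plain byte-wise semantics (the real versions in types.h may differ, for example in byte order):
<pre>
#include &lt;cstring&gt;

// copy a 12- or 16-byte key; ks is the key size in bytes
inline void KEYSET ( char *dst , const char *src , int ks ) {
	memcpy ( dst , src , ks );
}

// compare two keys; returns &lt;0, 0 or &gt;0, like memcmp()
inline int KEYCMP ( const char *a , const char *b , int ks ) {
	return memcmp ( a , b , ks );
}

bool sameKey ( const char *k1 , const char *k2 , int ks ) {
	char copy[16]; // MAX_KEY_BYTES
	// copy the key bytes rather than aliasing the pointer, and
	// always follow KEYCMP() with an explicit comparison operator
	KEYSET ( copy , k1 , ks );
	return KEYCMP ( copy , k2 , ks ) == 0;
}
</pre>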
  671. <br>
<a name="posdb"></a>
  673. <h2>Posdb</h2>
Indexdb was replaced with Posdb in 2012 in order to store <b>word position</b> information as well as what <b>field</b> the word was contained in. Word position information is basically the position of the word in the document, starting at position <i>0</i>. An alphanumeric word counts as one position, a sequence of whitespace counts as one, and a sequence of punctuation containing a comma or other separator counts as two. So in the sentence "The quick, brown" the word <i>brown</i> would have a word position of 5. The <b>field</b> of the word in the document could be the title, a heading, a meta tag, the text of an inlink or just the plain document body.
  675. <br><br>
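The following toy sketch illustrates those counting rules on that example sentence; the real tokenizer lives in Words.cpp and is more involved:
<pre>
#include &lt;cctype&gt;
#include &lt;cstdio&gt;
#include &lt;cstring&gt;

int main ( ) {
	const char *s = "The quick, brown";
	int n   = strlen ( s );
	int pos = 0;
	int i   = 0;
	while ( i &lt; n ) {
		if ( isalnum ( (unsigned char)s[i] ) ) {
			int start = i;
			while ( i &lt; n &amp;&amp; isalnum((unsigned char)s[i]) ) i++;
			printf ( "%.*s at position %i\n", i-start, s+start, pos );
			pos += 1; // an alphanumeric word counts as one
		}
		else {
			bool hasPunct = false;
			while ( i &lt; n &amp;&amp; ! isalnum((unsigned char)s[i]) ) {
				if ( ! isspace((unsigned char)s[i]) ) hasPunct = true;
				i++;
			}
			// whitespace counts as one; punctuation with a comma
			// or other separator counts as two
			pos += hasPunct ? 2 : 1;
		}
	}
	return 0;
}
</pre>
Running it prints <i>The</i> at position 0, <i>quick</i> at position 2 and <i>brown</i> at position 5.
<br><br>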
The 18-byte key for a Posdb record has the following bitmap:
<pre>
tttttttt tttttttt tttttttt tttttttt  t = termId (48bits)
tttttttt tttttttt dddddddd dddddddd  d = docId (38 bits)
dddddddd dddddddd dddddd0r rrrggggg  r = siterank, g = langid
wwwwwwww wwwwwwww wwGGGGss ssvvvvFF  w = word position, s = wordspamrank
pppppb1M MMMMLZZD                    v = diversityrank, p = densityrank
                                     M = unused, b = in outlink text
                                     L = langIdShiftBit (upper bit for langid)
                                     Z = compression bits. can compress to
                                         12 or 6 byte keys.
G: 0 = body
   1 = intitletag
   2 = inheading
   3 = inlist
   4 = inmetatag
   5 = inlinktext
   6 = tag
   7 = inneighborhood
   8 = internalinlinktext
   9 = inurl
F: 0 = original term
   1 = conjugate/sing/plural
   2 = synonym
   3 = hyponym
</pre>
  702. <br>
Posdb.cpp tries to rank documents highest that have the query terms closest together. If most terms are close together in the body, but one term is in the title, then there is a slight penalty. This penalty, as well as the weights applied to the different density ranks, siteranks, etc., are defined in Posdb.h and Posdb.cpp.
  704. <br><br>
  705. <a name="indexdb"></a>
  706. <a name="indexlist"></a>
  707. <h2>Indexdb</h2>
  708. Indexdb has been replaced by <a href="#posdb">Posdb</a>, but the key for an Indexdb record has the following bitmap:
<pre>
tttttttt tttttttt tttttttt tttttttt  t = termid (48bits)
tttttttt tttttttt ssssssss dddddddd  s = ~score
dddddddd dddddddd dddddddd dddddd0Z  d = docId (38 bits), Z = delbit
</pre>
When Rdb::m_useHalfKeys is on and the preceding key has the same 6 bytes as
the following key, then the following key, called a half key, only requires
6 bytes and therefore has the following bitmap:
<pre>
ssssssss dddddddd dddddddd dddddddd  d = docId, s = ~score
dddddddd dddddd1Z                    Z = delbit
</pre>
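Below is a toy sketch of the half-key idea, assuming the shared 6 bytes come first in the in-memory layout (the real on-disk layout and bit twiddling differ):
<pre>
#include &lt;cstring&gt;

// Append a 12-byte key to a list buffer, writing only the low 6 bytes
// when the high 6 bytes (the termid) match the previous key's.
// Returns the number of bytes written: 12 for a full key, 6 for a half key.
int appendKey ( char *list , int off , const char key[12] ,
                char prevHigh[6] , bool *havePrev ) {
	if ( *havePrev &amp;&amp; memcmp ( key , prevHigh , 6 ) == 0 ) {
		memcpy ( list + off , key + 6 , 6 );  // 6-byte half key
		return 6;
	}
	memcpy ( list + off , key , 12 );             // full 12-byte key
	memcpy ( prevHigh , key , 6 );
	*havePrev = true;
	return 12;
}
</pre>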
Every term that Gigablast indexes, be it a word or phrase, is hashed using
the hash64() routine in hash.h. This is a very fast and effective hashing
function. The resulting hash of the term is called the termid. It is
constrained to 48 bits.
  725. <br><br>
  726. Next, the score of that term is computed. Generally, the more times the term
  727. occurs in the document the higher the score. Other factors, like incoming
  728. links and link text and quality of the document, may contribute to the score
  729. of the term. The score is 4 bytes but is logarithmically mapped into 8 bits,
  730. and complemented, so documents with higher scores appear above those with
  731. lower scores, since the Rdb is sorted in ascending order.
  732. <br><br>
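A minimal sketch of that logarithmic map-and-complement, assuming a simple base-2 curve (the real constants in the source may differ):
<pre>
#include &lt;cmath&gt;
#include &lt;cstdint&gt;

// squash a 32-bit score into 8 bits logarithmically, then complement it
// so higher-scoring records sort first in an ascending-order Rdb
uint8_t makeKeyScore ( uint32_t score ) {
	double mapped = log2 ( (double)score + 1.0 ) * 255.0 / 32.0;
	if ( mapped &gt; 255.0 ) mapped = 255.0;
	return (uint8_t)( 255 - (int)mapped );  // the "~score" in the bitmap
}
</pre>
Complementing means a bigger score produces a smaller key byte, so it sorts earlier.
<br><br>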
The &lt;indexdbTruncationLimit&gt; parameter in gb.conf, reflected as
Conf::m_indexdbTruncationLimit, allows you to specify the maximum
number of docids allowed per termid. This allows you to save disk space. Right
now it is set to 5 million, a fairly hefty number. The merge routine,
RdbList::indexMerge_r(), will ensure that no returned Indexdb list violates
this truncation limit. This routine is used for reading data when resolving
queries as well as when doing file merge operations.
A list of Indexdb records, all with the same termid, is called a termlist or
indexlist.
  742. <br><br>
  743. <a name="datedb"></a>
  744. <h2>Datedb</h2>
The key for a Datedb record has the following 16-byte bitmap:
<pre>
tttttttt tttttttt tttttttt tttttttt  t = termId (48bits)
tttttttt tttttttt DDDDDDDD DDDDDDDD  D = ~date
DDDDDDDD DDDDDDDD ssssssss dddddddd  s = ~score
dddddddd dddddddd dddddddd dddddd0Z  d = docId (38 bits)

And, similar to indexdb, datedb also has a half bit for compression
down to 10 bytes:

DDDDDDDD DDDDDDDD                    D = ~date
DDDDDDDD DDDDDDDD ssssssss dddddddd  s = ~score
dddddddd dddddddd dddddddd dddddd1Z  d = docId (38 bits)
</pre>
  756. Datedb was added along with variable-sized keys (mentioned above). It is basically the same as indexdb, but has a 4-byte date field inserted. IndexTable.cpp was modified slightly to treat dates as scores in order to provide sort by date functionality. By setting the sdate=1 cgi parameter, Gigablast should limit the termlist lookups to Datedb. Using date1=X and date2=Y cgi parameters will tell Gigablast to constrain the termlist by those dates. date1 and date2 are currently seconds since the epoch. Gigablast will search for the date of a document in this order, stopping at the first non-zero date value:<br>
  757. 1. &lt;meta name=datenum content=123456789&gt;<br>
  758. 2. &lt;meta name=date content="Dec 1 2006"&gt;<br>
  759. 3. The last modified date in the HTTP mime that the web server returns.
  760. <br><br>
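For example, a date-bounded query for pages from calendar year 2006 might look like the following; the host, port and query are placeholders, and date1/date2 are seconds since the epoch (Jan 1 2006 00:00:00 UTC through Dec 31 2006 23:59:59 UTC):
<pre>
http://host:8000/search?q=gigablast&amp;sdate=1&amp;date1=1136073600&amp;date2=1167609599
</pre>
<br><br>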
  761. <a name="titledb"></a>
  762. <a name="titlerec"></a>
  763. <h2>Titledb</h2>
  764. Titledb holds a cached copy of every web page in the index. The TitleRec
  765. class is used to serialize and deserialize the information from an Rdb record
  766. into something more useful. It also uses the freely available zlib library
  767. to compress web pages to about 1/5th of their size. Depending on the word
  768. density on a page and the truncation limit, Titledb is usually slightly bigger
  769. than Indexdb. See TitleRec.h to see what kind of information is contained
  770. in a TitleRec.
  771. <br><br>
  772. The key of a Titledb record, a TitleRec, has the following bitmap:
<pre>
dddddddd dddddddd dddddddd dddddddd  d = docId
dddddddd hhhhhhhh hhhhhhhh hhhhhhhh  h = hash of site name
hhcccccc cccccccc cccccccc cccccccD  c = content hash, D = delbit
</pre>
  778. The low bits of the top 31 bits of the docId are used to determine which
  779. host in the network stores the Titledb record. See Titledb::getGroupId().
  780. <br><br>
  781. The hash of the sitename is used for doing site clustering. The hash of the
  782. content is used for deduping identical pages from the search results.
  783. <br><br>
<b>IMPORTANT:</b> Any time you change TITLEREC_VERSION in Titledb.h or make
any parsing change to TitleRec.cpp, you must also change it in all parent bk
repositories to keep things backwards compatible. We need to ensure that
newer versions of gb can parse the Titledb records of older versions. If
you increment TITLEREC_VERSION in the newer repository and someone else
increments it for a different reason in the parent repository, you end up
with a conflict over what the latest version number actually stands for.
So version changes need to be propagated both ways immediately to avoid this.
  792. <a name="spiderdb"></a>
  793. <h2>Spiderdb</h2>
  794. NOTE: this description of spiderdb is very outdated! - matt Aug 2013.
  795. <br><br>
  796. Spiderdb is responsible for the spidering schedule. Every url in spiderdb is
either old or new. If it is already in the index, it is old; otherwise it is
new. The urls in Spiderdb are further categorized by a priority from 0 to 7.
  799. If there are urls ready to be spidered in the priority 7 queue, then they
  800. take precedence over urls in the priority 6 queue.
  801. <br><br>
All the urls in a Spiderdb priority queue are sorted by the date they are
scheduled to be spidered. Typically, a url will not be spidered before its
time, but you can tweak that window in the Spider Controls page. Furthermore,
  805. you can turn the spidering of new or old urls on and off, in addition to
  806. disabling spidering based on the priority level. Link harvesting can also be
  807. toggled based on the spider queue. This way you can harvest links only from
  808. higher priority spider queues.
  809. <br><br>
  810. <a name="pagereindex"></a>
  811. Some spiderdb records are docid based only and will not have an associated url.
  812. These records were usually added from the <b>PageReindex.cpp</b> tool which
  813. allows the user to store the docids of a bunch of search results directly into
  814. spiderdb for respidering.
  815. <br><br>
  816. The bitmap of a key in Spiderdb:
<pre>
00000000 00000000 00000000 pppNtttt  t = time to spider, p = ~ of priority
tttttttt tttttttt tttttttt ttttRRf0  R = retry #, f = forced?
dddddddd dddddddd dddddddd dddddddD  d = top 32 bits of docId, D = delbit
                                     N = 1 iff url not in titledb (isNew)
</pre>
Each Spiderdb record also records the number of times the url was tried and
failed. In the Spider Controls you can specify how many failures are allowed
before Gigablast gives up and deletes the url from Spiderdb, and possibly
from the other databases if it was indexed.
  827. <br><br>
If a url was "forced", then it was added to Spiderdb even though it was
detected as already being in there. Urldb lets us know if a url is in Spiderdb
or not. So if a url is added to Spiderdb even though it is already in there,
it is called a "forced" url. Once a forced url is spidered, that Spiderdb
record is discarded. Otherwise, if the url was spidered and was not forced,
it is rescheduled for a future spidering: the old Spiderdb record is
deleted and a new one is added.
  835. <br><br>
Gigablast attempts to determine if the url changed since its last spidering.
If it has, then it will try to decrease the time between spiderings; otherwise
it will try to increase that time. The min and max respider times can be
specified in the Spider Controls page, so you can ensure Gigablast waits
long enough between respiderings, but not too long.
  841. <br><br>
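A toy sketch of that adaptive interval, assuming a simple halve/double policy clamped to the configured bounds (the real policy in the source may differ):
<pre>
#include &lt;cstdint&gt;

// pick the next respider wait, in seconds
uint32_t getNextWait ( uint32_t currentWait , bool pageChanged ,
                       uint32_t minWait , uint32_t maxWait ) {
	// changed since last spidering? respider sooner. otherwise back off.
	uint32_t w = pageChanged ? currentWait / 2 : currentWait * 2;
	if ( w &lt; minWait ) w = minWait;
	if ( w &gt; maxWait ) w = maxWait;
	return w;
}
</pre>
<br><br>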
  842. For more about the spider process see <a href="#buildlayer">The Build Layer</a>.
  843. <br><br>
  844. <!--
  845. <a name="urldb"></a>
  846. <h2>Urldb</h2>
  847. Urldb is like a map file. A record in urldb has this bitmap:
  848. <pre>
  849. 00000000 00000000 00000000 00000000 H = half bit
  850. 00000000 dddddddd dddddddd dddddddd d = 36 bit docId
  851. dddddddd ddddddee eeeeefff fffffCHD e = url hash ext, f = title file # (tfn)
  852. C = clean bit , D = delbit
  853. </pre>
  854. When a caller requests the cached web page for a docid they are essentially requesting the TitleRec for that docid. We have to read the TitleRec from a Titledb file. Because there can be hundreds of titledb files, checking
  855. each one is a burden. Therefore, we ask urldb which one has it. If the title
  856. file number (tfn) is 255, that means the url is just in the Spiderdb OR the TitleRec is in Titledb's RdbTree. The tfn is a secondary numeric identifier present in the actual filename of each titledb file. Only titledb files have this secondary id and it is the number immediately following the hyphen.
  857. <br><br>
  858. The purpose of urldb originally grew from the fact that looking up an old titlerec for re-spidering purposes would require looking in a few files. This was a big bottleneck for the spidering process and also made it harder to handle large query volumes while spidering in the background. Furthermore, much merging was required to keep the number of titledb big files down to a respectable few. Urldb solves this problem by mapping each docid directly to a single titledb file.
  859. <br><br>
  860. Urldb also greatly complicates things. It has to remain in perfect sync with titledb. Every time titledb files are merged the tfn of each TitleRec involved is changed. So right after we dump out the titledb list we add a urldb list to urldb that has the new tfns for the titledbs in that list. This is done in RdbDump::updateUrldbLoop(). Furthermore, since we often merge titledb files in a somewhat random nature, we can not just override the urldb rec for each TitleRec involved in that merge, because a more up-to-date TitleRec for the same docid may be in another titledb file. Therefore, we must also read the corresponding urldb lists for the titledb lists involved in the merge, and lookup the up-to-date tfn of each TitleRec. If it is indeed a match for the file being merged, then we can do the urldb override.
  861. <br><br>
  862. When a new url is added to the spider queue, via Msg10, we first check urldb to see if it's already in there. We extend the probable docid of the url with the 7-bit hash extension (the number of bits is now #defined as URLDB_EXTBITS in Urldb.h so we can use 23 bits for massive collections) to help reduce collisions and/or false positives. We lookup a range of urldb records, corresponding to all the possible actual docids that that url might have taken if it was indexed earlier. This list is looked up in Msg10::addUrlLoop(). The keys defining the list are uk1 and uk2. If any urldb record in that list has the same hash extension as the url being added then we consider the url to already be in the index or in a spider queue, even though it may not. The probability is low.
  863. <br><br>
  864. If the url is determined not to be in urldb then it is added to spiderdb and also to urldb. We use the same probable docid and hash extension when adding it to urldb, and we use a tfn of 255 to indicate that it is in a spider queue only. (NOTE: this can also mean it is indexed, and its TitleRec is in the tree in memory as well). When we finally index a new url we actually do not add its urldb record to urldb since Msg22 will check the RdbTree first for any titleRec before even bothering with urldb. However, if the actual docid of the new url turns out to be different because it collided with the actual docid of an existing, indexed url, then we remove its old record from urldb, and we do add the new record with the actual docid (TODO). This logic is contained in Msg14::addUrlRecs().
  865. <br><br>
  866. If reindexing an old url, we do not re-add the urldb record with a tfn of 255 because it is not efficient and because Msg22 will check the RdbTree for the titleRec before consulting Titledb.
  867. <br><br>
  868. When merging urldb lists in RdbList::indexMerge_r() we ignore the tfn bits. That allows the tfns of newer urldb recs to override those of older urldb recs. The tfns are really not meant to be part of the key, they are just data that we never need to sort by, but we cram them into the key to save space.
  869. <br><br>
We try to keep all of the Urldb records in memory. Not necessarily in the RdbTree, but in a disk page cache. Therefore, try to keep urldbMaxDiskPageCacheMem big enough to hold the entire Urldb in memory. This speeds up things a good bit. Urldb also takes heavy advantage of using half keys for compression, as described above under Indexdb. Urldb's disk page cache is also biased, that is, lower docids are looked up on one host, and higher docids are looked up on that host's twin. This splits, or biases, the cache, making it twice as effective. If more than two hosts are in a group (shard), then it will split the cache across all of them equally. This logic is in Msg22.cpp. (search for "bias" in that file).
  871. <br><br>
  872. Every time Titledb files are merged, the affected docids must be updated in
  873. Urldb, but because Urldb is so much smaller than Titledb, it does not inhibit
  874. performance.
  875. <br><br>
  876. Msg14::addUrlRecs() will update urldb directly via Msg1 when Msg14 deletes, adds or updates a titledb record.
  877. <br><br>
  878. In conclusion, Urldb is for the performance benefit of Msg22 (used to lookup a TitleRec from a docid or url) and Msg10 (used to add a url to a spider queue). Without it, things would be *much* slower. Hooray for Urldb.
  879. <br><br>
  880. -->
  881. <a name="checksumdb"></a>
  882. <h2>Checksumdb</h2>
<pre>
cccccccc hhhhhhhh hhhhhhhh cccccccc  h = host name hash
cccccccc cccccccc cccccccc cddddddd  c = content, collection and host hash
dddddddd dddddddd dddddddd dddddddD  d = docId, D = delbit
</pre>
  888. <a name="sitedb"></a>
  889. <a name="siterec"></a>
  890. <a name="msg8"></a>
  891. <a name="msg9"></a>
  892. <h2>Sitedb</h2>
<pre>
dddddddd dddddddd dddddddd dddddddd  d = domain hash (w/ collection)
uuuuuuuu uuuuuuuu uuuuuuuu uuuuuuuu  u = special url hash
uuuuuuuu uuuuuuuu uuuuuuuu uuuuuuuu
</pre>
  898. Sitedb maps a url to a site file number (sfn). The site file is now called
  899. a <a href="overview.html#ruleset">ruleset</a> file and has the name sitedbN.xml in the
  900. working directory. All rulesets must be archived in the Bitkeeper repository at /gb/conf/. Therefore, all gb clusters share the same ruleset name space. <b>Msg8.cpp</b> and <b>Msg9.cpp</b> are used to
  901. respectively get and set sitedb records.
  902. <a name="catdb"></a>
  903. <h2>Catdb</h2>
<pre>
dddddddd dddddddd dddddddd dddddddd  d = domain hash
uuuuuuuu uuuuuuuu uuuuuuuu uuuuuuuu  u = special url hash
uuuuuuuu uuuuuuuu uuuuuuuu uuuuuuuu

Data Block:
. number of catids  (1 byte)
. list of catids    (4 bytes each)
. sitedb file #     (3 bytes)
. sitedb version #  (1 byte)
. siteUrl           (remaining bytes)
</pre>
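A toy parser for that data block might look like the following; the byte order of the multi-byte fields is an assumption, and the real parsing code differs:
<pre>
#include &lt;cstdint&gt;
#include &lt;cstring&gt;

// parsed form of the variable-length data block of a Catdb record
struct CatdbData {
	int32_t     catids [ 256 ];    // at most 255 (1-byte count)
	int32_t     numCatids;
	uint32_t    sitedbFileNum;     // 3 bytes on disk
	uint8_t     sitedbVersion;
	const char *siteUrl;           // points into the block
	int32_t     siteUrlLen;
};

void parseCatdbData ( const uint8_t *p , int32_t len , CatdbData *d ) {
	d-&gt;numCatids = p[0];
	int32_t off = 1;
	for ( int32_t i = 0 ; i &lt; d-&gt;numCatids ; i++ , off += 4 )
		memcpy ( &amp;d-&gt;catids[i] , p + off , 4 );
	// assume little-endian for the 3-byte sitedb file number
	d-&gt;sitedbFileNum = p[off] | (p[off+1] &lt;&lt; 8) | (p[off+2] &lt;&lt; 16);
	off += 3;
	d-&gt;sitedbVersion = p[off++];
	d-&gt;siteUrl    = (const char *)(p + off);
	d-&gt;siteUrlLen = len - off;
}
</pre>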
  915. Catdb is a special implementation of Sitedb. While it is similar in that
  916. a single record is stored per url and the keys are created using the same
  917. hashes, the record stores additional category information about the url.
  918. This includes how many categories the url is in and which categories it is in
  919. (their ids). Like sitedb, Msg8 and Msg9 are used to get and set catdb. Msg2a
  920. is used to generate a full catdb using directory information and calling Msg9.
  921. <br><br>
  922. Catdb is only used at spider time for the spider to lookup urls
  923. and see if they are in the directory and which categories they are under.
  924. A url's category information is stored in its TitleRec and is also indexed
  925. using special terms (see below).
  926. <h3>dmozparse</h3>
  927. The program "dmozparse" is implemented individually and allows for the
  928. creation of proprietary data files using the RDF data available from DMOZ.
  929. Dmozparse will create two main data files, gbdmoz.content.dat and
  930. gbdmoz.structure.dat. gbdmoz.content.dat stores a list of url strings and
  931. their associated category IDs. This is used to populate catdb.
gbdmoz.structure.dat stores a list of category names, their IDs, their
  933. parent IDs, their offsets into the original RDF files, and the number of urls
  934. present in each category. The offsets are used to lookup more complex
  935. information about the categories, such as sub-category lists and titles and
  936. summaries for urls.
  937. See the Overview for proper use of dmozparse.
  938. <h3>Msg2a</h3>
  939. <a name="msg2a"></a>
  940. Catdb is generated using Msg2a, which reads the gbdmoz.content.dat file
  941. generated by dmozparse and sets the Catdb records accordingly with Msg9.
  942. Msg2a is also used to update Catdb when new data is available.
  943. See the Overview for proper building and updating of Catdb.
  944. <h3>Categories</h3>
  945. The Categories class is used to store the Directory's hierarchy and provide
  946. general functionality when dealing with the directory. It is instanced
  947. as the global "g_categories". The hierarchy is loaded from the
  948. gbdmoz.structure.dat file when Gigablast starts. Categories provides an
  949. interface to lookup and print category names based on their ID. Titles
  950. and Summaries for a given url in a category may also be looked up. Categories
also provides the ability to generate a list of sub-categories for a given
category, which is used by Msg2b (see below).
  953. <a name="msg2b"></a>
  954. <h3>Msg2b</h3>
  955. Msg2b is used to generate and sort a category's directory listing. This
  956. includes the sub-categories, related categories, and other languages of the
  957. given category. Input is given as the category ID to generate the directory
  958. listing for. Msg2b will store the listing internally. This is primarily used
  959. by PageResults to make a DMOZ style directory which can be browsed by users.
  960. <h3>Indexed Catids, Direct and Indirect</h3>
  961. When a url that is in the directory gets spidered, the IDs of the categories
  962. the url is in are indexed. For the exact categories the url appears in,
  963. the prefix "gbdcat" is hashed with the category id. For all parent
categories of these categories, the prefix "gbpdcat" is used. The base
categories the url is in will also be indexed with "gbpdcat". These are
considered "direct" catids.
  967. <br><br>
  968. All urls which are under a sub-url that is in the directory will have
  969. category IDs indexed as "indirect" catids. All of the direct catids
  970. associated with the sub-url will be indexed using the prefixes "gbicat"
  971. and "gbpicat".
  972. <br><br>
  973. Indexing the direct catids allows for a fast lookup of all the urls under a
  974. category. Searches can also be restricted to single categories, either the
base category itself or the inclusion of all sub-categories. Indirect
catids allow pages under those listed in a category to be searched as well.
  977. <hr>
  978. <a name="networklayer"></a>
  979. <h1>The Network Layer</h1>
  980. The network code consists of 5 parts:
  981. <br>
  982. 1. The UDP Server<br>
  983. 2. The Multicast class<br>
  984. 3. The Message Classes<br>
  985. 4. The TCP Server<br>
  986. 5. The HTTP Server<br>
  987. <br>
  988. <a name="udpserver"></a>
  989. <a name="udpprotocol"></a>
  990. <a name="udpslot"></a>
  991. <br><br>
  992. <a name="netclasses"></a>
  993. <h2>Associated Classes (.cpp and .h files)</h2>
  994. <table cellpadding=3 border=1>
  995. <tr><td>Dns</td><td>Net</td><td>A DNS client built on top of the UdpServer class.</td></tr>
  996. <tr><td>DnsProtocol</td><td>Net</td><td>Uses UdpServer to make a protocol for talking to DNS servers. Used by Dns class.</td></tr>
  997. <tr><td>HttpMime</td><td>Net</td><td>Creates and parses an HTTP MIME header.</td></tr>
  998. <tr><td>HttpRequest</td><td>Net</td><td>Creates and parses an HTTP request.</td></tr>
  999. <tr><td>HttpServer</td><td>Net</td><td>Gigablast's highly efficient web server, contains a TcpServer class.</td></tr>
  1000. <tr><td>Multicast</td><td>Net</td><td>Used to reroute a request if it fails to be answered in time. Also used to send a request to multiple hosts in the cluster, usually to a group (shard) for data storage purposes.</td></tr>
  1001. <tr><td>TcpServer</td><td>Net</td><td>A TCP server which contains an array of TcpSockets.</td></tr>
<tr><td>TcpSockets</td><td>Net</td><td>A C++ wrapper for a TCP socket.</td></tr>
<tr><td>UdpServer</td><td>Net</td><td>A reliable UDP server that uses non-blocking sockets and calls handlers when receiving a message. The handler called depends on that message's type. The handler is UdpServer::m_handlers[msgType].</td></tr>
<tr><td>UdpSlot</td><td>Net</td><td>Basically a "socket" for the UdpServer. The UdpServer contains an array of a few thousand of these. When none are available to receive a request, the dgram is dropped and will later be resent by the requester in a back-off fashion.</td></tr>
  1004. </table>
  1005. <br><br>
  1006. <h2>The UDP Server</h2>
  1007. The <b>UdpServer.cpp</b>, <b>UdpSlot.cpp</b> and <b>UdpProtocol.h</b>
  1008. classes constitute the framework
  1009. for Gigablast's UDP server, a fast and reliable method for transmitting data
  1010. between hosts in a Gigablast network.
  1011. <br><br>
Gigablast uses two instances of the UdpServer class. One of them,
g_udpServer2, uses asynchronous signals to reply to received dgrams by
interrupting the current process; g_udpServer is not asynchronous. The asynchronous instance is not really used any more, since the 2.4.31 kernel does not seem to support it well: ping replies were not instant under heavy load.
  1015. <br><br>
  1016. The way it works is that the caller calls g_udpServer.sendRequest ( ... )
  1017. to send a message. The receiving host will get a signal that the udp socket
  1018. descriptor is ready for reading. It will then peek at the datagram using the
  1019. read ( ... , MSG_PEEK , ... ) call to get the transid (transaction id) from
  1020. the datagram. The transid is either associated with an existing UdpSlot or a
  1021. new UdpSlot is created to handle the transaction. The UdpSlots are stored in
  1022. a fixed-size array, so if you run out of them, the incoming datagram will
  1023. be silently dropped and will be re-sent later.
  1024. <br><br>
  1025. When a UdpSlot is selected to receive a datagram, it then calls
  1026. UdpSlot::sendAck() to send back an ACK datagram packet. When the sender
  1027. receives the ACK it sets a bit in UdpSlot::m_readAckBits[] to indicate it has
  1028. done so. Likewise, UdpSlot::m_sendAckBits[] are updated on the machine
  1029. receiving the message, and m_readBits[] and m_sendBits[] are used to keep track
  1030. of what non-ACK datagrams have been read or sent respectively.
  1031. <br><br>
  1032. Gigablast will send up to ACK_WINDOW_SIZE (defined in UdpSlot.cpp as 12)
  1033. non-ACK datagrams before it requires an ACK to send more. If doing a local send
  1034. then Gigablast will send up to ACK_WINDOW_SIZE_LB (loopback) datagrams.
  1035. <br><br>
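A minimal sketch of that window check (the names here are illustrative, not the actual member names):
<pre>
// allow up to ACK_WINDOW_SIZE un-ACKed datagrams in flight
#define ACK_WINDOW_SIZE 12

bool canSendMore ( int numDgramsSent , int numAcksReceived ) {
	return ( numDgramsSent - numAcksReceived ) &lt; ACK_WINDOW_SIZE;
}
</pre>
<br><br>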
If the sending host has not received an ACK datagram within RESEND_0
(defined to be 33 in UdpSlot.cpp) milliseconds, it will resend the datagrams
that have not been ACKed. If the host is sending on g_udpServer, then RESEND_1
(defined to be 100 in UdpSlot.cpp) milliseconds is used. If sending a short
message (under one datagram in size) then RESEND_0_SHORT (33 ms) is used. The
resends are taken care of in UdpServer::timePoll(), which is called every
20ms for g_udpServer2 (defined in main.cpp) and every 60ms for g_udpServer.
  1043. <br><br>
  1044. Each datagram can be up to DGRAM_SIZE bytes. You can make this up to 64k
  1045. and let the kernel deal with chopping the datagrams up into IP packets that
  1046. fit under the MTU, but unfortunately, the new 2.6 kernel does not allow you
  1047. to send datagrams larger than the MTU over the loopback device. Datagrams
  1048. sent over the loopback device can be up to DGRAM_SIZE_LB bytes, and datagrams
  1049. used by the Dns class can be up to DGRAM_SIZE_DNS bytes.
  1050. <br><br>
The UdpProtocol class is used to define and parse the header used by each
datagram. The Dns class derives its DnsProtocol class from UdpProtocol in
order to use the proper DNS headers.
  1054. <br><br>
  1055. The bitmap of the header used by the UdpProtocol class is the following:
<pre>
REtttttt ACNnnnnn nnnnnnnn nnnnnnnn  R = is Reply?, E = hadError?, N = nice
iiiiiiii iiiiiiii iiiiiiii iiiiiiii  t = msgType, A = isAck?, n = dgram #
ssssssss ssssssss ssssssss ssssssss  i = transId, C = cancelTransAck
dddddddd dddddddd dddddddd dddddddd  s = msgSize (iff !ack) (w/o hdrs!)
dddddddd ........ ........ ........  d = msg content ...
</pre>
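A toy decoder for that header, assuming the bit order as drawn above and wire byte order for the multi-byte fields (the real accessors live in UdpProtocol.h):
<pre>
#include &lt;cstdint&gt;
#include &lt;cstring&gt;

struct DgramHeader {
	bool     isReply , hadError , isAck , cancelTransAck , nice;
	uint8_t  msgType;
	uint32_t dgramNum;   // 21 bits on the wire
	uint32_t transId;
	uint32_t msgSize;    // w/o headers, only meaningful if not an ACK
};

void parseDgramHeader ( const uint8_t *p , DgramHeader *h ) {
	h-&gt;isReply        = p[0] &amp; 0x80;
	h-&gt;hadError       = p[0] &amp; 0x40;
	h-&gt;msgType        = p[0] &amp; 0x3f;
	h-&gt;isAck          = p[1] &amp; 0x80;
	h-&gt;cancelTransAck = p[1] &amp; 0x40;
	h-&gt;nice           = p[1] &amp; 0x20;
	h-&gt;dgramNum = ((p[1] &amp; 0x1f) &lt;&lt; 16) | (p[2] &lt;&lt; 8) | p[3];
	memcpy ( &amp;h-&gt;transId , p + 4 , 4 );
	memcpy ( &amp;h-&gt;msgSize , p + 8 , 4 );
}
</pre>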
  1063. The niceness (N bit) of a datagram can be either 0 or 1. If it is 0 (not nice)
  1064. then it will take priority over a datagram with a niceness of 1. This just
  1065. means that we will call the handlers for it first. It's not a very big deal.
  1066. <br><br>
  1067. The datagram number (n bits) are used to number the datagrams so we can
  1068. re-assemble them in the correct order.
  1069. <br><br>
  1070. The message type corresponds to all of the Msg*.cpp classes you see in the
  1071. source tree. The message type also maps to a pointer to a function, called the
  1072. handler function, which generates and sends back a reply. This map (array) of
  1073. pointers to handler functions is UdpServer::m_handlers[]. Each Msg*.cpp class
  1074. must call g_udpServer2.registerHandler() to register its handler function for
  1075. its message type. Its handler function will be called when a complete request of its message type (msgType) is received.
  1076. <br><br>
The handler function must set the UdpSlot's (UDP socket) m_requestBuf to NULL if it does not want UdpServer to free it. More likely, however, when receiving a <b>reply</b> from the UdpServer you will want to set UdpSlot::m_readBuf to NULL so the UdpServer doesn't free it; but then, of course, you are responsible for freeing it.
  1078. <br><br>
  1079. <a name="multicasting"></a>
  1080. <a name="multicast"></a>
  1081. <h2>Multicasting</h2>
  1082. The <b>Multicast.cpp</b> class is also considered part of the network layer.
  1083. It takes a request and broadcasts it to a set of twin hosts. It can do this in
  1084. one of two ways. One way is to send the request to a set of twin hosts and
  1085. not complete until all hosts have sent back a reply. This method
  1086. is used when adding data. Gigablast needs to add data to a host and all of
  1087. its mirrors for the add to be considered a success. So if one host crashes
  1088. and is not available to receive the data, the requester will keep re-sending
  1089. the request to that host until it comes back online.
  1090. <br><br>
  1091. The second broadcast method employed by Multicast.cpp is used mainly for
  1092. reading data. If m_loadBalancing is true, it will send an asynchronous
  1093. request to each host (using Msg34.cpp) in a set of twin hosts to determine the load on each host.
  1094. Once it knows the load of each host, the multicast will forward the request to
  1095. the least loaded host. If, for some reason, the selected host does not reply
  1096. within a set amount of time (depends on the message type -- see Multicast.cpp)
  1097. then the request will be re-routed to another host in the set of twins.
  1098. <a name="msgclasses"></a>
  1099. <h2>The Msg classes</h2>
  1100. All of the Msg*.cpp files correspond to network messages. Most of them are
  1101. requests. See the classes that start with Msg* for explanations in the
  1102. <a href="#filelist">File List</a>.
  1103. <br><br>
  1104. <a name="tcpserver"></a>
  1105. <h2>The TCP Server</h2>
  1106. <br>
  1107. <a name="httpserver"></a>
  1108. <h2>The HTTP Server</h2>
  1109. <br>
  1110. <a name="dns"></a>
  1111. <h2>The DNS Resolver</h2>
  1112. <br>
  1113. Originally Dns.cpp was built to interface with servers running the bind9
  1114. DNS program. A query containing the hostname would be sent to a bind9 process
  1115. (which was selected based on the hash of the hostname)
  1116. and the bind9 process would
  1117. perform the recursive DNS lookups and return the final answer, the IP address
  1118. of that hostname.
  1119. <br><br>
However, since the advent of the 256 node cluster, we've had to run multiple
instances of bind9, about 10 of them, just to support the spider load.
Rather than worry about whether those processes or servers go down, get
terminated, or contend too much for memory or cpu with
the Gigablast processes they share the server with, we decided to just replace
bind9's recursive lookup functionality with about 400 lines of additional
code in Dns.cpp. This also serves to make the code base more independent.
  1127. <br><br>Dns::getIp() is the entry point to Dns.cpp's functionality. It will
  1128. create a UDP datagram query and send it to a nameserver (DNS server). It
  1129. will expect to get back the IP address of the hostname in the query or receive
  1130. an error packet, or timeout. Now, since bind9 is out of the loop, we have
  1131. to direct the request to a root DNS server. The root servers will refer us
  1132. to TLD root servers which will then refer us to others, etc. Sometimes
  1133. the nameservers we are referred to do not have their IPs listed in the reply
  1134. and so we have to look those up on the side. Also, we often are referred to
  1135. a list of servers to ask, and so if any of those timeout we have to move to
  1136. the next in the list. This explains the nature of the DnsState
  1137. class which is used to keep our information persistent as we wait for replies.
  1138. <br><br>
DnsState has an m_depth member which is 0 when we ask the root servers,
1 when we ask the TLD root servers, etc. m_depth is used to offset us into
an array of DNS IPs, m_dnsIps[], and an array of DNS names, m_dnsNames[], which
unfortunately do not have IPs included and must be looked up should we need
to ask them. Furthermore, at each depth we are limited to MAX_DNS_IPS IP
addresses and MAX_DNS_IPS DNS names. So if we get referred to more than
that many IP addresses we will truncate the list. This value is a hefty 32 right
now, and that is per depth level. MAX_DEPTH is about 7. That is how many times
we can be referred to other nameservers before giving up and returning
ETIMEDOUT. We also avoid asking the same IP address twice by keeping a list
of m_triedIps in the DnsState. And DnsState has an m_buf buffer that is used
for allocating extra DnsStates for doing IP lookups on nameservers we are
referred to but that have no IPs given in the response. We can use m_buf to
recursively create up to 3 DnsStates before giving up, for instance when we
are trying to get the IP of a nameserver and are referred to 2 more
nameservers, again without IPs, and so on.
  1156. <br><br>
To avoid duplicate parallel IP lookups we use s_table, a hashtable
built on the HashTableT.cpp class in which each bucket is a CallbackEntry,
consisting of a callback function and a state. Dns::getIp() uses the
hostname you want an IP for as the key into this hash table. If there
is already a CallbackEntry for that key then you must wait in line, because
someone has already sent out a request for that hostname. The line is kept in
the form of a linked list: each CallbackEntry has an m_nextKey member which
refers to the key of the next CallbackEntry in s_table waiting on that same
hostname. The entries that are not first in line use s_bogus for
their key, since all keys in the hash table must be unique. The s_bogus key
is a long long that is incremented each time it is used, so it should never
wrap; a long long is absolutely huge.
  1169. <br><br>
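The idea, in miniature, using illustrative names rather than the actual structures:
<pre>
#include &lt;functional&gt;
#include &lt;string&gt;
#include &lt;unordered_map&gt;
#include &lt;vector&gt;

typedef std::function&lt;void(long ip)&gt; Callback;

// hostname -&gt; the line of callbacks waiting on an in-flight lookup
static std::unordered_map&lt;std::string,std::vector&lt;Callback&gt; &gt; s_waiters;

// returns true if we were first in line and must send the query
bool getIp ( const std::string &amp;host , Callback cb ) {
	std::vector&lt;Callback&gt; &amp;line = s_waiters[host];
	line.push_back ( cb );
	if ( line.size() &gt; 1 ) return false; // someone already asked; wait
	// ... first in line: send the UDP query to a nameserver here ...
	return true;
}

// called when the reply (or error) for a hostname comes back
void onDnsReply ( const std::string &amp;host , long ip ) {
	for ( Callback &amp;cb : s_waiters[host] ) cb ( ip ); // wake the line
	s_waiters.erase ( host );
}
</pre>
<br><br>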
  1170. We also cache timeouts into the RdbCache for about a day. That way we do not
  1171. bottleneck on them forever. Other errors were always cached, but not timeouts
  1172. until now.
We now also support a TTL in the cache. Every DNS reply has a TTL
which specifies how long the IP is good for, and we put that into the
cache. We actually limit it to MAX_DNS_CACHE_AGE, which is currently set to
2 days (it is in seconds).
  1177. We may want to cache some resource records, so, for instance, we do
  1178. not have to keep asking the root servers for the TLD servers if we know what
  1179. they are.
And along the lines of future performance enhancements,
we may want to consider asking
multiple nameservers, launching the requests within about a second of each
other. And if a reply comes back for a slot that has already timed out, we
should probably still entertain it as a late reply.
  1185. <br><br>
DNS debug messages can be turned on at any time from the log tab. It is often
helpful to use dig to assist you. For example,
<i>dig @1.2.3.4 xyz.com +norecurse</i> will show you DNS server 1.2.3.4's
response to a request for the IP of xyz.com. And <i>dig xyz.com +trace</i>
will show you all the servers and replies dig communes with to resolve a
hostname to its IP address. dig also handles IPv6 addresses, which
the Gigablast resolver does not work with. This may be something to support
later.
  1194. <a name="layer3"></a>
  1195. <h2></h2>
  1196. <a name="layer4"></a>
  1197. <h2></h2>
  1198. <a name="layer5"></a>
  1199. <h2></h2>
  1200. <a name="layer6"></a>
  1201. <h2></h2>
  1202. <a name="layer7"></a>
  1203. <h2></h2>
  1204. <br><br>
  1205. <a name="buildlayer"></a>
  1206. <a name="pageinject"></a>
  1207. <hr><h1>The Build Layer</h1>
  1208. This section describes how Gigablast downloads, parses, scores and indexes documents.
  1209. <!--<b>SpiderLoop.cpp</b>'s spiderUrl() routine is called every 30ms or so by Loop.cpp to spider a new url. If spiders are deactivated then it just returns, otherwise, it tries to get a url from <a href="#spiderdb">spiderdb</a> to spider. If successful, it allocates a new Msg14 class and passes the url to that. Msg14 downloads, parses and scores the document, and ultimately adds records to indexdb, titledb, clusterdb and checksumdb for that document. Alternately, urls and their corresponding content can be directly injected using the injection interface, <b>PageInject.cpp</b>, which calls a Msg14 directly.
  1210. <br><br>
  1211. This table illustrates the path of control:
  1212. <br><br>
  1213. <table cellpadding=3 border=1>
  1214. <td><td>1.</td><td>Loop.cpp calls SpiderLoop::spiderUrl() every ~30ms.</td></tr>
  1215. <td><td>2.</td><td>SpiderLoop::spiderUrl() calls SpiderCache::getNextSpiderRec() to get a <a href="#spiderdb">spiderdb</a> record which contains a url to spider.</td></tr>
  1216. <td><td>3.</td><td>SpiderLoop allocates a new Msg14 and stores the pointer in the array of Msg14 pointers.
  1217. </td></tr>
  1218. <td><td>4.</td><td>SpiderLoop calls Msg14::spiderUrl() with a pointer to that Spiderdb record.</td></tr>
  1219. <td><td>5.</td><td>Msg14 downloads, parses and scores the document.</td></tr>
  1220. <td><td>6.</td><td>Msg14 adds/deletes records to/from <a href="#indexdb">Indexdb</a>, <a href="#titledb">Titledb</a>, <a href="#checksumdb">Checksumdb</a> and <a href="#clusterdb">Clusterdb</a>.</td></tr>
  1221. <td><td>7.</td><td>Msg14 calls Msg10 to add links to spiderdb.</td></tr>
  1222. <td><td>8.</td><td>Msg14 calls SpiderLoop::doneSpider().</td></tr>
  1223. <td><td>9.</td><td>This loop is repeated starting at step 1.</td></tr>
  1224. </table>
  1225. <br><br>
  1226. -->
  1227. <a name="buildclasses"></a>
  1228. <h2>Associated Files</h2>
  1229. <table cellpadding=3 border=1>
  1230. <tr><td>AdultBit.cpp</td><td>Build</td><td>Used to detect if document content is naughty.</td></tr>
  1231. <tr><td>Bits.cpp</td><td>Build</td><td>Sets descriptor bits for each word in a Words class.</td></tr>
  1232. <tr><td>Categories.cpp</td><td>Build</td><td>Stores DMOZ categories in a hierarchy.</td></tr>
  1233. <!--<tr><td>DateParse</td><td>Build</td><td>Extracts the publish date from a document.</td></tr>-->
  1234. <tr><td>Lang.cpp</td><td>Build</td><td>Unused.</td></tr>
  1235. <tr><td>Language.cpp</td><td>Build</td><td>Enumerates the various languages supported by Gigablast's language detector.</td></tr>
  1236. <tr><td>LangList.cpp</td><td>Build</td><td>Interface to the language-specific dictionaries used for language identification by XmlDoc::getLanguage().</td></tr>
  1237. <tr><td>Linkdb.cpp</td><td>Build</td><td>Functions to perform link analysis on a docid/url. Computes a LinkInfo class for the docId. LinkInfo class contains Inlink classes serialized into it for each non-spammy inlink detected. Also contains a Links class that parses out all the outlinks in a document.</td></tr>
  1238. <!--<tr><td>Msg10</td><td>Build</td><td>Adds a list of urls to spiderdb for spidering.</td></tr>
  1239. <tr><td>Msg13</td><td>Build</td><td>Tells a server to download robots.txt (or get from cache) and report if Gigabot has permission to download it.</td></tr>
  1240. <tr><td>Msg14</td><td>Build</td><td>The core class for indexing a document.</td></tr>
  1241. <tr><td>Msg15</td><td>Build</td><td>Called by Msg14 to set the Doc class from the previously indexed TitleRec.</td></tr>
  1242. <tr><td>Msg16</td><td>Build</td><td>Called by Msg14 to download the document and create a new titleRec to set the Doc class with.</td></tr>
  1243. <tr><td>Msg18</td><td>Build</td><td>Unused. Was used for supporting soft banning.</td></tr>
  1244. <tr><td>Msg19</td><td>Build</td><td>Determine if a document is a duplicate of a document already indexed from that same hostname.</td></tr>
  1245. <tr><td>Msg23</td><td>Build</td><td>Get the link text in a document that links to a specified url. Also returns other info besides that link text.</td></tr>
  1246. <tr><td>Msg8</td><td>Build</td><td>Gets the Sitedb record given a url.</td></tr><tr><td>Msg9</td><td>Build</td><td>Adds a Sitedb record to Sitedb for a given site/url.</td></tr>
  1247. -->
  1248. <tr><td>PageAddUrl.cpp</td><td>Build</td><td>HTML page to add a url or file of urls to spiderdb.</td></tr>
  1249. <tr><td>PageInject.cpp</td><td>Build</td><td>HTML page to inject a page directly into the index.</td></tr>
  1250. <tr><td>Phrases.cpp</td><td>Build</td><td>Generates phrases for every word in a Words class. Uses the Bits class.</td></tr>
  1251. <tr><td>Pops.cpp</td><td>Build</td><td>Computes popularity for each word in a Words class. Uses the dictionary files in the dict subdirectory.</td></tr>
  1252. <tr><td>Pos.cpp</td><td>Build</td><td>Computes the character position of each word in a Words class. HTML entities count as a single character. So do back-to-back spaces.</td></tr>
  1253. <!--<tr><td>Robotdb.cpp</td><td>Build</td><td>Caches and parses robots.txt files. Used by Msg13.</td></tr>-->
  1254. <!--<tr><td>Scores</td><td>Build</td><td>Computes the score of each word in a Words class. Used to weight the final score of a term being indexed in TermTable::hash().</td></tr>-->
  1255. <tr><td>Spam</td><td>Build</td><td>Computes the probability a word is spam for every word in a Words class.</td></tr>
  1256. <!--<tr><td>SpamContainer</td><td>Build</td><td>Used to remove spam from the index using Msg1c and Msg1d.</td></tr>-->
  1257. <tr><td>Spider.cpp</td><td>Build</td><td>Has most of the code used by the spidering process. SpiderLoop is a class in there that is the heart of the spider. It is the control loop that launches spiders.</td></tr>
  1258. <!--<tr><td>Stemmer</td><td>Build</td><td>Unused. Given a word, computes its stem.</td></tr>-->
  1259. <tr><td>StopWords.cpp</td><td>Build</td><td>A table of stop words, used by Bits to see if a word is a stop word.</td></tr>
  1260. <!--<tr><td>TermTable</td><td>Build</td><td>A hash table of terms from a document. Consists of termIds and scores. Used to accumulate scores. TermTable::hash() is arguably a heart of the build process.</td></tr>-->
  1261. <!--<tr><td>Url2</td><td>Build</td><td>For hashing/indexing a url.</td></tr>-->
  1262. <tr><td>Words.cpp</td><td>Build</td><td>Breaks a document up into "words", where each word is a sequence of alphanumeric characters, a sequence of non-alphanumeric characters, or a single HTML/XML tag. A heart of the build process.</td></tr>
  1263. <tr><td>Xml.cpp</td><td>Build</td><td>Breaks a document up into XmlNodes where each XmlNode is a tag or a sequence of characters which are not a tag.</td></tr>
  1264. <tr><td>XmlDoc.cpp</td><td>Build</td><td>The main document parsing class. A huge file, pretty much does all the parsing.</td></tr>
<tr><td>XmlNode</td><td>Build</td><td>The Xml class has an array of these. Each is either a tag or a sequence of characters between tags (or at the beginning/end of the document).</td></tr>
  1266. <tr><td>dmozparse</td><td>Build</td><td>Creates the necessary dmoz files Gigablast needs from those files downloadable from DMOZ.</td></tr>
  1267. </table>
  1268. <a name="spiderloop"></a>
  1269. <h2>The Spider Loop</h2>
<b>SpiderLoop.cpp</b> decides which URLs in <a href="#spiderdb">spiderdb</a> to spider next, based on the URL filters table that you can specify in the admin interface. It then calls XmlDoc::index() to download and index each URL.
  1271. <br><br>
  1272. <!--
  1273. <a name="spidercache"></a>
  1274. <a name="spidercollection"></a>
  1275. <a name="spiderqueue"></a>
  1276. <h2>The Spider Cache</h2>
  1277. To avoid doing a disk seek into spiderdb in an attempt to get a url to spider,
  1278. Gigablast preloads the urls from spiderdb into its Spider Cache, as defined
  1279. in <b>SpiderCache.cpp</b>. It uses a global instance, g_spiderCache, for all
  1280. collections. The Spider Cache will store the spider records it
  1281. loads in an instance of <a href="#rdbtree">RdbTree</a> called s_tree.
  1282. It basically uses the same
  1283. key in s_tree as it does in spiderdb, but it changes the timestamp in the key
  1284. so as to
  1285. fill the tree with as diverse a selection as domains as possible, to avoid
  1286. one domain from dominating the tree, and thereby blocking urls from the other
  1287. domains from getting indexed. A count of each domain added to s_tree is
  1288. recorded in a hash table, and that count is multiplied by the
  1289. <i>sameDomainWait</i> parm to come up with a new timestamp for the s_tree key.
  1290. Other than that, the s_tree key is the same as the spiderdb key.
  1291. <br><br>
  1292. <i>sameIpWait</i> is similar to <i>sameDomainWait</i> but is based on IP,
  1293. and is mainly used to prevent Gigabot from hammering websites with multiple
  1294. domains hounsed under a single IP address. Since IP addresses are not
  1295. currently stored
  1296. in a Spiderdb record, Gigablast will perform the IP lookup at spider time
  1297. and cache it in SpiderCache::s_localDnsCache so it can quickly check that
  1298. IP cache when it is loading the urls from disk. Additionally, SpiderCache
  1299. uses s_domWaitTable and s_ipWaitTable to record the last download times
  1300. of a url from a particular domain or IP.
  1301. <br><br>
  1302. SpiderCache is divided into a SpiderCollection class for every collection and
  1303. SpiderCollection is divided into 16 SpiderQueue classes, one for each possible
  1304. spider queue. Remember, there are 8 spider priorities and a spider queue can
  1305. contain either old or new urls. SpiderCache::sync() is called to sync up the
  1306. SpiderCollections with the actual collections should a collection be added
  1307. or deleted. SpiderCache::addSpiderRec() is called to try to add a Spiderdb
  1308. record to the cache, actually an RdbTree called s_tree.
  1309. SpiderCache::getNextSpiderRec() is called to a get a Spiderdb record for
  1310. spidering. And SpiderCache::doneSpidering() is called when a Spiderdb record
  1311. is done being spidered and should be removed from the cache.
  1312. SpiderCache::loadLoop() loads records from Spiderdb on disk for spider queues
  1313. that are active. It can only be loading one spider queue at a time; this is
  1314. ensured by the s_gettingList flag.
  1315. <br><br>
  1316. SpiderCache uses s_tree, an RdbTree, to hold all of the Spiderdb records that
  1317. it contains. SpiderCache does not use a record's Spiderdb key as the key
  1318. for s_tree,
  1319. rather it uses a specially computed cache key. This cache key is essentially
  1320. the same as the spiderdb record key, but the time stamp is changed to reflect a
  1321. score.
  1322. This score is essentially linear, starting with 0, but it is increased by
  1323. 1000 for every successive url it loads from the same domain or IP address.
  1324. This makes getting the next url to spider faster. Also, if a url is not
  1325. slated to be spidered until the future then the score is the scheduled time
  1326. in seconds since the epoch, the same value that was in the Spiderdb record.
  1327. The score is visible in the Spider Queue control.
  1328. <br><br>
  1329. When looking at the Spider Queue control in Gigablast's web GUI you will note
  1330. that there is a "in cache" or "not in cache" control that you can toggle.
  1331. Select it to "in cache" and then click on a <i>priority</i> and you will see
  1332. all of the urls in s_tree for that
  1333. spider queue. Once there you will note some parameters in the
  1334. first row of the table. One is the <i>water</i> parameter. That is how many
  1335. urls, max, were loaded into the cache for that spider queue during the last
  1336. load time. The <i>loading</i> flag is 1 if Gigablast is currently loading
  1337. urls for that spider queue, 0 otherwise. The <i>scanned</i> parameter is
  1338. how many bytes have been read trying to load urls for that spider queue, and
  1339. <i>elapsed</i> is how much time as elapsed since the start or end of the load,
  1340. depending on if the load is in progress or not. <i>bubbles</i> represents
  1341. how many attempts were made to get a url to spider from that spider queue,
  1342. but a url was not available from that spider queue. Typically, bubbles are bad
  1343. because you always want to give the spider something to do. When a spider queue
  1344. bubbles it is often because there are not enough urls available, a load is
  1345. going on and needs some more time to load more urls or most of the urls that
  1346. are available are all from the same few domains or IPs and Gigablast does not
  1347. want to hammer any particular server because of the "same domain wait" and
  1348. "same IP wait" controls. <i>hits</i> is the counterpart of <i>bubbles</i> and
  1349. is how many urls were returned for spidering from that spider queue. And
  1350. <i>cached</i> is how many urls are current in s_tree for that spider queue.
  1351. <br><br>
  1352. Gigablast will only attempt to reload a spider queue when it bubbles and is
  1353. unable to return a url to spider through a call to
  1354. SpiderQueue::getNextSpiderRec(). It will never attempt the reload that
  1355. spider queue if less
  1356. than SPIDER_RELOAD_RATE seconds have elapsed since the last reload time.
  1357. Currently this is #defined to be 8 minutes. But it can reload without
  1358. waiting SPIDER_RELOAD_RATE seconds if the number of urls currently in the
  1359. cache for that spider queue are less than half of the water mark for that
  1360. spider queue. These
  1361. criteria prevent Gigablast from constantly trying to reload a spider queue
  1362. just because it consists of, say, 100,000 urls from the same domain and is
  1363. constantly bubbling.
  1364. <br><br>
  1365. When loading spider records, SpiderCache::addSpiderRec() is called for every
  1366. spider record loaded. All of the filtering logic is in there. Each spider queue
  1367. can have up to MAX_CACHED_URLS_PER_QUEUE urls in its part of the s_tree cache.
  1368. Currently this is #defined to be 50000. If you try to add a Spiderdb record
  1369. to a spider queue cache that is at this limit, then Gigablast will score that
  1370. spider record and attempt to kick out a lower scoring spider record if
  1371. possible. When links are spidered, Msg10 will often call
  1372. g_spiderCache::addSpiderRec() to attempt
  1373. to add the new links to the appropriate cached spider queue and take advantage
  1374. of this kick-out logic.
  1375. <br><br>
  1376. When loading urls from spiderdb into s_tree,
  1377. Gigablast will scan all the urls in spiderdb
  1378. for that spider queue in order to get the best possible sampling of domains.
  1379. When adding new records to s_tree it may kick out old ones that have a lower
  1380. key in order to stay below the 30,000 max url limit per spider priority.
  1381. <br><br>
  1382. SpiderCache is responsible for adhering to the Conf::m_maxIncomingKbps
  1383. and Conf::m_maxPagesPerSecond parameters reflected in the gb.conf file and
  1384. the Master Controls web page. It keeps track of the average page size so
  1385. that urls that are being fetched, if they haven't already been downloaded,
  1386. are assumed to be downloading a page with that average page size. In this
  1387. manner Gigablast does a very good job of obeying bandwidth limitations. These
  1388. two constraints are also settable on a per collection basis as well.
  1389. <br><br>
  1390. SpiderCache juggles between various collections. When multiple collections
  1391. are being spidered SpiderCache will use the SpiderCache::getRank() function
  1392. to determine which collection should provide the next url. This is based on
  1393. how long it has been since the collection downloaded a url and on that
  1394. collection's CollectionRec::m_maxKbps and CollectionRec::m_maxPagesPerSecond
  1395. parameters. If either of these parameters is -2 it means it is unbounded.
  1396. <br><br>
  1397. -->
  1398. <!--
  1399. <a name="msg14"></a>
  1400. <h2>Msg14</h2>
  1401. After the Spider Cache provides a url to spider, Spider Loop allocates a new
  1402. Msg14 and passes the url to it for indexing.
  1403. The Spider Loop will call either Msg14::spiderUrl() or Msg14::injectUrl() to
  1404. kick off the indexing process. These routines are basically different start
  1405. points for the same process, they just initialize a few parameters differently.
  1406. <br><br>
  1407. <a name="msg15"></a>
  1408. <h2>Msg15</h2>
  1409. Next Msg14 will call Msg15 which calls Msg22::getTitleRec() which will
  1410. attempt to load the old <a href="#titledb">titledb</a> record
  1411. from the provided url or docid.
  1412. The titledb record, defined by <b>TitleRec.cpp</b>, is contained in
  1413. Msg14::m_oldDoc::m_titleRec.
  1414. Msg15 will also set Msg14::m_oldDoc::m_siteRec, a <a href="#sitedb">sitedb</a>
  1415. record, from the site file number contained in that titledb record.
  1416. The site file number, also known as the ruleset number,
  1417. corresponds to the <a href="overview.html#ruleset">ruleset</a> that was used to index the
  1418. document the last time. The same ruleset number from any gigablast cluster should always correspond to the same sitedb*.xml file in order to avoid namespace collisions. We keep all sitedb*.xml files in /gb/conf/ under Bitkeeper control.
  1419. <br><br>
  1420. If the <a href="#spiderdb">spiderdb</a> record of the url being spidered
  1421. is marked as "old" but
  1422. Msg15 failed to load the old titledb record, then Msg14 will complain in the
1423. log, but keep going regardless. Similarly, it will complain if the url was
1424. marked as "new" in the spiderdb record but was found in titledb.
  1425. For new urls, urls that are being spidered and are not already in the index,
  1426. Msg22 will contain the new actual docid for the url, as described in the
  1427. <a href="#docids">DocIds</a> section. Ultimately, Msg15 will set
  1428. Msg14::m_oldDoc, an instance of the <a href="#doc">Doc</a> class,
  1429. which is central to the parser.
  1430. <br><br>
  1431. <a name="msg16"></a>
  1432. <h2>Msg16</h2>
  1433. In a similar manner, Msg16 will set Msg14::m_newDoc, but Msg16 will actually
  1434. download the document, not take it from titledb, unless <i>recycle content</i>
  1435. is marked as true on the Spider Controls page. If Msg16 fails to download the
  1436. document because of a timeout or other similar error, Msg14 will delete the
  1437. document from the index and from spiderdb after <i>max retries</i> times, where
  1438. <i>max retries</i> is set from the Spider Controls page and can be as big as
  1439. three.
  1440. <br><br>
  1441. Msg16 may end up using a different ruleset to parse the document and set
1442. the Doc class, ending up with an IndexList quite different from the one
1443. produced under Msg15's ruleset.
  1444. A ruleset is an XML document describing how to parse and score a document,
  1445. as further described in the
  1446. <a href="overview.html#ruleset">Overview</a> document.
  1447. <br><br>
  1448. Msg16 is also responsible for setting the <a href="#linkinfo">LinkInfo.cpp</a> class. That class is set by Msg25 and is used to increase the link-adjusted
  1449. quality of a document based on the number of incoming linkers, as can be seen
1450. on the <a href="#pageparser">Page Parser</a> page.
  1451. <br><br>
  1452. <a name="doc"></a>
  1453. <h2>The Doc Class</h2>
  1454. Msg15 and Msg16 both set a different instance of the <b>Doc</b> class.
  1455. The idea behind the Doc class is that it is ultimately a way to convert an
  1456. arbitrary document into an IndexList, defined by <b>IndexList.cpp</b>,
  1457. which is an RdbList of Indexdb records. Indexlist will convert a
  1458. <a href="#termtable">TermTable</a> class into an <a href="#rdblist">RdbList</a>
  1459. format. The TermTable is just a hash
1460. table containing all of the terms from the document that are to be indexed.
  1461. <br><br>
  1462. -->
  1463. <a name="linkanalysis"></a>
  1464. <a name="linkinfo"></a>
  1465. <a name="linktext"></a>
  1466. <a name="msg25"></a>
  1467. <a name="msg23"></a>
  1468. <a name="pageparser"></a>
  1469. <h2>Link Analysis</h2>
  1470. XmlDoc calls <b>Msg25</b> to set the LinkInfo class for a document. The
  1471. LinkInfo class is stored in the titledb record (the TitleRec) and can be
  1472. recycled if <i>recycle link info</i> is enabled in the Spider Controls.
  1473. LinkInfo contains an array of <b>LinkText</b> classes.
  1474. Each LinkText class corresponds
  1475. to a document that links to the url being spidered, and has an IP address,
  1476. a link-adjusted quality and docid of the linker, optional hyperlink text,
1477. a link-spam bit, and the total number of outgoing links the linker has.
  1478. Exactly how the incoming link text is indexed and how the linkers affect the
  1479. link-adjusted quality of the url being indexed can be easily seen on the
  1480. Page Parser tool available through the administrative section or by clicking
  1481. on the [analyze] link next to a search result.
  1482. <br><br>
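A rough sketch of the per-inlinker record described above; the field names here are illustrative, see LinkInfo.cpp for the real layout:
<pre>
#include &lt;stdint.h&gt;

// One entry per document linking to the url being indexed.
struct LinkTextEntry {
        int32_t  m_ip;            // IP address of the linker
        int64_t  m_docId;         // docid of the linker
        int32_t  m_quality;       // linker's link-adjusted quality
        int32_t  m_numOutlinks;   // total outgoing links on the linker
        bool     m_isLinkSpam;    // link-spam bit
        char    *m_linkText;      // optional hyperlink text
        int32_t  m_linkTextLen;
};
</pre>
<br><br>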
  1483. Msg25 uses
  1484. <a href="#msg20">Msg20</a> to get information about an inlinker.
  1485. It sends a Msg20 request to the machine that has the linking document in its
  1486. <a href="#titledb">titledb</a>. The handler loads the titledb record using
  1487. Msg22, and then extracts the relevant link text using the <b>Links.cpp</b>
  1488. class, and packages that up with the quality and docid of the linker in the
  1489. reply.
  1490. <br><br>
  1491. <!--
  1492. <a name="msg18"></a>
  1493. <a name="robotdb"></a>
  1494. <h2>Robots.txt</h2>
  1495. Msg16 uses <b>Msg18</b> to get the robots.txt page for a url. This request
  1496. uses a
  1497. distributed cache, so the request is actually forwarded to a host based on
  1498. the hash of the hostname of the url. <b>Robotdb.cpp</b> is used to handle the
  1499. caching logic, and the parsing of the robots.txt page. We recently added
  1500. support for the Crawl-Delay: directive.
  1501. <br><br>
  1502. <a name="termtable"></a>
  1503. <h2>The TermTable</h2>
  1504. This is a big hash table that contains all the terms to be indexed. It is
  1505. generated by XmlDoc::set() which essentially takes a titledb record (TitleRec)
  1506. and a sitedb record (SiteRec) as input and sets the TermTable as output. Msg14
  1507. uses the XmlDoc classes contained in Msg14::m_oldDoc and Msg14::m_newDoc
  1508. respectively, setting each with the appropriate TitleRec.
  1509. <br><br>
  1510. An IndexList can be set from the old and new TermTables using
  1511. IndexList::set(). It essentially subtracts
  1512. the two if <i>incremental indexing</i> is enabled in the Master Controls or
  1513. gb.conf. That way it will only add new terms or terms that have different
  1514. scores compared to the old TitleRec.
  1515. -->
  1516. <a name="samplevector"></a>
  1517. <h2>The Sample Vector</h2>
  1518. After downloading the page, XmlDoc::getPageSampleVector() generates a sample
  1519. vector for the document. This vector is formed by getting the top 32 or so
  1520. terms from the new TermTable sorted by their termId, a hash of each term.
  1521. This vector is stored in the TitleRec and used for deduping.
  1522. <br><br>
  1523. <a name="summaryvector"></a>
  1524. <h2>The Summary Vector</h2>
  1525. Deduping search results at query time is based partly on how similar the summaries are. We use XmlDoc::getSummaryVector() which takes two sample
  1526. vectors as input. The similarity threshold can be controlled from the
  1527. Search Controls' <i>percent similar dedup default</i> and it can also be
  1528. explicitly specified using the <i>psc</i> cgi parameter.
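<br><br>
A minimal sketch of both ideas, assuming a vector is just the document's top N termIds stored in ascending order (the scoring used to pick the "top" terms is omitted here):
<pre>
#include &lt;stdint.h&gt;
#include &lt;algorithm&gt;
#include &lt;iterator&gt;
#include &lt;vector&gt;

// Percent similarity of two termId vectors, each sorted ascending:
// size of the intersection over the size of the smaller vector.
int percentSimilar(const std::vector&lt;int64_t&gt; &amp;a,
                   const std::vector&lt;int64_t&gt; &amp;b) {
        std::vector&lt;int64_t&gt; both;
        std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                              std::back_inserter(both));
        size_t n = std::min(a.size(), b.size());
        return n ? (int)(100 * both.size() / n) : 0;
}
</pre>
Two results whose vectors are, say, 85% similar would then be deduped when <i>percent similar dedup default</i> is 85 or lower.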
  1529. <a name="gigabitvector"></a>
  1530. <h2>The Gigabit Vector</h2>
  1531. Like the Sample Vector, the Gigabit Vector is a sample of the terms in a
  1532. document, but the terms are sorted by scores which are based on the popularity
  1533. of the term and on the frequency of the term in the document.
  1534. XmlDoc::getGigabitVector() computes that vector for a TitleRec and makes use
  1535. of the <a href="#msg24">Msg24</a> class which gets the gigabits for a document.
  1536. The resulting vector is stored in the TitleRec and used to do topic clustering.
  1537. <i>cluster by topic</i> can be enabled in the Search Controls, and
  1538. <i>percent similar topic default</i> can be used to tune the clustering
  1539. sensitivity so that documents in the search results are only clustered together
  1540. if their Gigabit Vectors are X% similar or more.
  1541. <a href="#msg38">Msg38</a> calls Clusterdb::getGigabitSimilarity()
  1542. to compute the similarity of two Gigabit Vectors.
  1543. <br><br>
  1544. <a name="deduping"></a>
  1545. <h2>Deduping at Spider Time</h2>
  1546. <i>Last Updated 1/29/2014 by MDW</i>
  1547. <br><br>
1548. Many URLs contain essentially the same content as other URLs. Therefore we need to dedup URLs and prevent them from entering the index. Deduping in this fashion is disabled by default, so you will need to enable it in the Spider Controls for your collection. You will note such duplicate urls in the logs with the "Doc is a dup" (<a href=#indexcode>EDOCDUP</a>) message. Those urls are not indexed, and if they were already indexed, they will be removed.
  1549. <br><br>
  1550. When a page is indexed we store a hash of its content into a single "posdb" key.
  1551. We set the "sharded by termid" bit for that posdb key, so it is not sharded by docid which is the norm. Hostdb::getShard() checks for that bit and shards by termid when it is present. In that way all the pages with the same content hash will be together on disk in posdb.
  1552. <br><br>
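For illustration, here is a minimal sketch of that sharding decision. The flag-bit position, shard count and arguments are assumptions made up for this example, not the actual posdb key layout:
<pre>
#include &lt;stdint.h&gt;

const uint32_t NUM_SHARDS        = 128;  // hypothetical cluster size
const uint64_t SHARDED_BY_TERMID = 1;    // hypothetical flag bit in the key

// Sketch of Hostdb::getShard()-style logic: dup-checksum keys are
// sharded by termid so all docs sharing a content hash land on the
// same shard; everything else is sharded by docid as usual.
uint32_t getShard(uint64_t key, uint64_t termId, uint64_t docId) {
        if (key &amp; SHARDED_BY_TERMID)
                return (uint32_t)(termId % NUM_SHARDS);
        return (uint32_t)(docId % NUM_SHARDS);
}
</pre>
<br><br>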
1553. We perform a lookup of all other docids that share the same content hash as the document we are indexing. If another docid has the same content hash and its site rank is the same as ours or higher, then we are a duplicate docid, we do not index our docid, and getIndexCode() returns <a href=#indexcode>EDOCDUP</a>. Our document will be deleted if it was indexed. If another docid with the same content hash as us has a site rank lower than ours, then he is the dup and he will be removed with the <a href=#indexcode>EDOCDUP</a> error the next time he is reindexed. If another docid with the same content hash has the SAME site rank as us, then we assume we are the dup; we take a first-come, first-served approach in that scenario, but otherwise site rank trumps.
  1554. <br><br>
1555. We use the getContentHashExact64() function to generate the 64-bit dup checksum. It hashes the document pretty much verbatim for safety; it does, however, treat sequences of white space as a single space character.
  1556. <br><br>
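For concreteness, here is a minimal sketch of an exact content hash that collapses runs of white space into a single space, in the spirit of getContentHashExact64(); the FNV-style 64-bit mix below is illustrative, not the mixing Gigablast actually uses:
<pre>
#include &lt;ctype.h&gt;
#include &lt;stdint.h&gt;

// Hash the content nearly verbatim, but treat any run of white space
// as a single space character.
uint64_t contentHashExact64(const char *p, int32_t len) {
        const uint64_t PRIME = 1099511628211ULL; // 64-bit FNV prime
        uint64_t h = 0;
        bool inSpace = false;
        for (int32_t i = 0; i &lt; len; i++) {
                unsigned char c = p[i];
                if (isspace(c)) { inSpace = true; continue; }
                if (inSpace) { h = h * PRIME + ' '; inSpace = false; }
                h = h * PRIME + c;
        }
        return h;
}
</pre>
<br><br>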
1557. Other hash functions exist that hash all the month names, day-of-week names and ALL DIGITS to a specific number in order to ignore the clocks one finds on web pages. Additionally, tags can be ignored unless they are a frame or iframe tag. Pure text is hashed as lower-case alnum. Punctuation and spaces can be skipped over. This algorithm is currently unused but is available in the getLooseContentHash() function.
  1558. <br><br>
1559. If we see a &lt;link href=<i>xxx</i> rel=canonical&gt; tag in the page and our url is different from <i>xxx</i>, then we do not index the url and use the <a href=#indexcode>EDOCNONCANONICAL</a> error, but we do insert a spider request for the referenced canonical url so it can get picked up. The error when downloading such urls shows up as "Url was dup of canonical page" in the logs. The canonical URL we insert should inherit the same SpiderRequest::m_isInjecting or SpiderRequest::m_isAddUrl bits of the "dup" url it came from, similar to how simplified meta redirects work.
  1560. <br><br>
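A minimal sketch of that decision; normalizeUrl() here is a deliberately trivial stand-in for real url normalization:
<pre>
#include &lt;algorithm&gt;
#include &lt;cctype&gt;
#include &lt;string&gt;

// Trivial normalization for this sketch: lower-case the url and drop
// a trailing slash. Real normalization is far more involved.
static std::string normalizeUrl(std::string u) {
        std::transform(u.begin(), u.end(), u.begin(), ::tolower);
        if (!u.empty() &amp;&amp; u.back() == '/') u.pop_back();
        return u;
}

// True if the page names a canonical url other than its own, in which
// case the caller skips indexing, returns the not-canonical error and
// enqueues a SpiderRequest for the canonical url, inheriting the
// m_isInjecting/m_isAddUrl bits of the current request.
bool isNonCanonical(const std::string &amp;pageUrl,
                    const std::string &amp;canonicalHref) {
        if (canonicalHref.empty()) return false; // no canonical tag
        return normalizeUrl(pageUrl) != normalizeUrl(canonicalHref);
}
</pre>
<br><br>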
  1561. <a name="indexcode"></a>
  1562. <h2>Indexing Error Codes</h2>
1563. When trying to index a document we call XmlDoc::getIndexCode() to see if there was a problem. 0 means everything is good. Typically an error means we do not index the document, or sometimes even remove it if it was already indexed. Sometimes an error code will do nothing except add a SpiderReply to Spiderdb indicating the error so that the document might be tried again in the future. Sometimes, as in the case of simplified meta redirects, we will add the new url's SpiderRequest to spiderdb as well. So we handle all errors on an individual basis.
  1564. <br><br>
1565. These error
  1566. codes are all in <a href="#errno">Errno.h</a>.
  1567. <br><br>
  1568. <table cellpadding=3 border=1>
  1569. <tr><td>EBADTITLEREC</td><td>Document's TitleRec is corrupted and can not be read.</td></tr>
  1570. <tr><td>EURLHASNOIP</td><td>Url has no ip.</td></tr>
  1571. <tr><td>EDOCCGI</td><td>Document has CGI parms in url and <i>allowCgiUrls</i> is specified as false in the ruleset.</td></tr>
  1572. <tr><td>EDOCURLIP</td><td>Document is an IP-based url and <i>allowIpUrls</i> is specified as false in the ruleset.</td></tr>
  1573. <tr><td>EDOCBANNED</td><td>Document's ruleset has <i>banned</i> set to true.</td></tr>
1574. <tr><td>EDOCDISALLOWED</td><td>Robots.txt forbids this document to be indexed. But if it has incoming link text, Gigablast will index it anyway, using just the link text.</td></tr>
1575. <tr><td>EDOCURLSPAM</td><td>The url itself contains naughty words and <i>do url sporn checking</i> is enabled in the Spider Controls.</td></tr>
1576. <tr><td>EDOCQUOTABREECH</td><td>The quota for this site has been exceeded. Quotas are based on the quality of the url. See the quota section in the <a href="overview.html#quotas">Overview</a> file.</td></tr>
  1577. <tr><td>EDOCBADCONTENTTYPE</td><td>Content type, as returned in the mime reply and parsed out by HttpMime.cpp, is not supported for indexing</td></tr>
1578. <tr><td>EDOCBADHTTPSTATUS</td><td>Http status was 404 or some other bad status.</td></tr>
  1579. <tr><td>EDOCNOTMODIFIED</td><td>Spider Controls have <i>use IfModifiedSince</i> enabled and document was not modified since the last time we indexed it.</td></tr>
  1580. <tr><td>EDOCREDIRECTSTOSELF</td><td>The mime redirects to itself.</td></tr>
  1581. <tr><td>EDOCTOOMANYREDIRECTS</td><td>Url had more than 6 redirects.</td></tr>
  1582. <tr><td>EDOCBADREDIRECTURL</td><td>The redirect url was empty.</td></tr>
1583. <tr><td>EDOCSIMPLIFIEDREDIR</td><td>The document redirected to a simpler url, one which had fewer path components, did not have cgi, or for whatever reason was prettier to look at. The current url will be discarded and the redirect url will be added to spiderdb.</td></tr>
1584. <tr><td>EDOCNONCANONICAL</td><td>Doc has a canonical link referencing a url that is not itself. Used for deduping at spider time.</td></tr>
  1585. <tr><td>EDOCNODOLLAR</td><td>Document did not contain a dollar sign followed by a price. Used for building shopping indexes.</td></tr>
  1586. <tr><td>EDOCHASBADRSS</td><td>This should not happen.</td></tr>
  1587. <tr><td>EDOCISANCHORRSS</td><td>This should not happen.</td></tr>
  1588. <tr><td>EDOCHASRSSFEED</td><td><i>only index documents from rss feeds</i> is true in the Spider Controls, and the document indicates it is part of an RSS feed, and does not currently have an RSS feed linking to it in the index. Gigablast will discard the document, and add the url of the RSS feed to spiderdb. When that is spidered the url should be picked up again.</td></tr>
  1589. <tr><td>EDOCNOTRSS</td><td>If the Spider Controls specify <i>only index articles from rss feeds</i> as true and the document is not part of an RSS feed.</td></tr>
  1590. <tr><td>EDOCDUP</td><td>According to checksumdb, a document already exists from this hostname with the same checksumdb hash. See <a href="#deduping">Deduping</a> section.</td></tr>
  1591. <!--<tr><td>EDOCDUPWWW</td><td>According to urldb, this url already exists in the index but with a "www." prepended to it. This prevents us from indexing both http://mysite.com/ and http://www.mysite.com/ because they are almost always the same thing.</td></tr>-->
  1592. <tr><td>EDOCTOOOLD</td><td>Document's last modified date is before the <i>maxLastModifiedDate</i> specified in the ruleset.</td></tr>
  1593. <tr><td>EDOCLANG</td><td>The document does not match the language given in the Spider Controls.</td></tr>
  1594. <tr><td>EDOCADULT</td><td>Document was detected as adult and adult documents are forbidden in the Spider Controls.</td></tr>
  1595. <tr><td>EDOCNOINDEX</td><td>Document has a noindex meta tag.</td></tr>
  1596. <tr><td>EDOCNOINDEX2</td><td>Document's ruleset (SiteRec) says not to index it using the &lt;indexDoc&gt; tag. Probably used to just harvest links then.</td></tr>
  1597. <tr><td>EDOCBINARY</td><td>Document is detected as a binary file.</td></tr>
1598. <tr><td>EDOCTOONEW</td><td>Document's last modified date is after the <i>minLastModifiedDate</i> specified in the ruleset.</td></tr>
  1599. <tr><td>EDOCTOOBIG</td><td>Document size is bigger than <i>maxDocSize</i> specified in the ruleset.</td></tr>
  1600. <tr><td>EDOCTOOSMALL</td><td>Document size is smaller than <i>minDocSize</i> specified in the ruleset.</td></tr>
  1601. </table>
  1602. <br><br>
1603. The following error codes just set g_errno; they indicate an internet
1604. error or a problem on our side. These errors are also in Errno.h.
  1605. <br><br>
  1606. <table cellpadding=3 border=1>
  1607. <tr><td>ETTRYAGAIN</td><td>&nbsp;</td></tr>
  1608. <tr><td>ENOMEM</td><td>We ran out of memory.</td></tr>
  1609. <tr><td>ENOSLOTS</td><td>We ran out of UDP sockets.</td></tr>
  1610. <tr><td>ECANCELLED</td><td>An administrator disabled spidering in the Master Controls thereby cancelling all outstanding spiders.</td></tr>
  1611. <tr><td>EBADIP</td><td>Unable to get IP address of url.</td></tr>
  1612. <tr><td>EBADENGINEER</td><td>&nbsp;</td></tr>
  1613. <tr><td>EIPHAMMER</td><td>We would hit the IP address too hard, violating <i>sameIpWait</i> in the Spider Controls if we were to download this document.</td></tr>
  1614. <tr><td>ETIMEDOUT</td><td>If we timed out downloading the document.</td></tr>
  1615. <tr><td>EDNSTIMEDOUT</td><td>If we timed out looking up the IP of the url.</td></tr>
  1616. <tr><td>EBADREPLY</td><td>DNS server sent us a bad reply.</td></tr>
1617. <tr><td>EDNSDEAD</td><td>DNS server was dead.</td></tr>
  1618. </table>
  1619. <br><br>
  1620. <a name="xmldoc"></a>
  1621. <h2>XmlDoc: The Heart of the Parser</h2>
  1622. Spider.cpp ultimately calls XmlDoc::index() to download and index the specified URL.
  1623. <br><br>
  1624. <a name="xml"></a>
  1625. <h2>Xml</h2>
  1626. The Xml class is Gigablast's XML parsing class. You can set it by passing it
  1627. a pointer to HTML or XML content along with a content length. If you pass it
  1628. some strange character set, it will convert it to UTF-16 and use that to set
  1629. its nodes. Each node is a tag or a non-tag. It uses the XmlNode
  1630. class to tokenize the content into these nodes.
  1631. <br><br>
  1632. <a name="links"></a>
  1633. <h2>Links</h2>
  1634. The Links class is set by XmlDoc and gets all the links in a document.
  1635. Links::hash(), called by XmlDoc::hash(), will index all the link:, ilink: or
  1636. links: terms.
  1637. It is also used to get link text by the <a href="#linktext">LinkText</a> class.
  1638. <br><br>
  1639. <a name="words"></a>
  1640. <h2>Words</h2>
  1641. All text documents can be broken up into "words". Gigablast's definition of a
1642. word is slightly different than normal. A word is defined as a sequence of
1643. alphanumeric characters or a sequence of non-alphanumeric characters. An HTML
  1644. or XML tag is also considered to be an individual word. Words for Japanese
  1645. or other similar languages are determined by a tokenizer Partap integrated.
  1646. <br><br>
  1647. The Phrases, Bits, Spam and Scores classes all contain arrays which are 1-1
  1648. with the Words class they were set from.
  1649. <br><br>
  1650. The Words class is primarily used by TermTable::set() which takes a string,
1651. breaks it down into words and phrases, and hashes the word and phrase ids into
  1652. the hash table.
  1653. <br><br>
  1654. <a name="phrases"></a>
  1655. <h2>Phrases</h2>
  1656. Gigablast implements phrase searching by stringing words together and hashing
  1657. them as a single unit. It uses the Phrases and Bits class to generate phrases
  1658. from a Words class. Basically, every pair of words is considered a phrase. So if it sees "cd rom" in the document it will index the bigram "cdrom". It can easily be modified to index trigrams and quadgrams as well, but that will bloat the index somewhat.
  1659. <br><br>
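A minimal sketch of the bigram idea; real phrase generation consults the <a href="#bits">Bits</a> class to decide which adjacent words may be joined, whereas here every adjacent pair qualifies:
<pre>
#include &lt;string&gt;
#include &lt;vector&gt;

// Given the alphanumeric words of a document in order, emit the
// concatenated bigrams to be hashed and indexed; "cd","rom" yields
// "cdrom".
std::vector&lt;std::string&gt; makeBigrams(const std::vector&lt;std::string&gt; &amp;w) {
        std::vector&lt;std::string&gt; out;
        for (size_t i = 0; i + 1 &lt; w.size(); i++)
                out.push_back(w[i] + w[i + 1]);
        return out;
}
</pre>
<br><br>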
  1660. <a name="bits"></a>
  1661. <h2>Bits</h2>
1662. Before phrases can be generated from a Words class, Gigablast must set
  1663. descriptor bits. Each word has a corresponding character in the Bits class
  1664. that sets flags depending on different properties of the word. These bits
  1665. are used for generating <a href="#phrases">phrases</a>.
  1666. <br><br>
  1667. <!--
  1668. <a name="scores"></a>
  1669. <h2>Scores</h2>
  1670. A more recent addition to Gigablast is the Scores class. It was written to
  1671. extract the beefy content of a page, and exclude, or reduce the impact of,
  1672. menu text. Words with scores of 0 or less are not indexed. The ruleset controls
  1673. if and how the Scores class is used to index documents.
  1674. <br><br>
  1675. -->
  1676. <!--
  1677. <a name="msg10"></a>
  1678. <h2>Msg10</h2>
  1679. Msg14 calls Msg10 after updating all the Rdbs. Msg10 adds the links harvested
  1680. from the page. It does this using the <a href="#msg1">Msg1</a> class, but it
  1681. must also check to ensure that the links are not already in spiderdb by
  1682. checking to see if they are in <a href="#urldb">Urldb</a>.
1683. Unless a link is <i>forced</i>, it
  1684. will not be added to spiderdb if it is already in Urldb.
  1685. Msg10 is also used by PageAddUrl to add submitted links or files that contain
  1686. a bunch of links to be added.
  1687. <br><br>
  1688. -->
  1689. <a name="gbfilter"></a>
  1690. <h2>Content Filtering</h2>
  1691. Gigablast supports multiple document types by saving the downloaded mime and
1692. content to a file, then calling the <i>gbfilter</i> program (gbfilter.cpp)
1693. which, based on the content type in the mime, will call pdftohtml, antiword,
1694. pstotext, etc. and send the html/text content back to gb via stdout.
  1695. <br><br>
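A skeletal version of that dispatch; the converters are the ones named above, but the exact flags gbfilter passes are not shown here, so treat them as illustrative:
<pre>
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

// Pick a converter from the Content-Type, run it on the saved file
// and let its html/text output flow back to gb on stdout.
int filterToStdout(const char *contentType, const char *file) {
        char cmd[1024];
        if      (strstr(contentType, "application/pdf"))
                snprintf(cmd, sizeof(cmd), "pdftohtml -stdout %s", file);
        else if (strstr(contentType, "application/msword"))
                snprintf(cmd, sizeof(cmd), "antiword %s", file);
        else if (strstr(contentType, "application/postscript"))
                snprintf(cmd, sizeof(cmd), "pstotext %s", file);
        else
                return -1; // unsupported content type
        return system(cmd);
}
</pre>
<br><br>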
  1696. <a name="docids"></a>
  1697. <a name="msg22"></a>
  1698. <h2>DocIds</h2>
1699. Every document indexed by Gigablast has a unique document id, called a docid. The docid is usually just the hash of the url it represents. Titledb::getProbableDocId() takes the url as a parameter and returns the probable docid for that url. The reason the docid is <i>probable</i> is that it may collide with the docid of an existing url and therefore will have to be changed. When we index a document, Msg14, the Msg class which is primarily responsible for indexing a document, calls Msg15::getOldDoc() to retrieve the TitleRec (record in Titledb) for the supplied url (the url being indexed). Msg15 will then call Msg22::getTitleRec() to actually get the TitleRec for that url. If a TitleRec is not found for the given url then <b>Msg22.cpp</b> will set its m_availDocId so Msg14 will know which available docid it can use for that url.
  1700. <br><br>
1701. Msg22 will actually load a list of TitleRecs from titledb corresponding to all the possible actual docids for the url. When the probable docid for a new url collides with the docid of an existing url, as evidenced by the TitleRec list, then Gigablast increments the new url's docid by one. Gigablast will only change the last six bits (bits 0-5) of a docid for purposes of collision resolution and it will wrap the docid if it tops out (see Msg22::gotUrlListWrapper). Titledb::getFirstProbableDocId() and Titledb::getLastProbableDocId() define the range of actual docids available given a probable docid by masking out or adding bits 0-5. Bits 7 and up of the docid are used to determine which host in the network is responsible for storing the TitleRec for that docid, as illustrated in Titledb::getGroupId(). That is why collision resolution is limited to the lower six bits: we don't want collision resolution to resolve a TitleRec to another host; we want to keep it local. We also do not want the high bits of the docid being used for determining the group (shard) which stores that docId because when reading the termlist of linkers to a particular url, they are sorted by docId, which means we end up hitting one particular host in the network much harder than the rest when we attempt to extract the link text. If we have to increase these 6 bits in the future, then we may consider using some higher bits, like maybe bits 12 and up, to avoid messing with the bits that control which host stores the TitleRec.
  1702. <br><br>
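A sketch of that collision resolution, using the bit widths from the text above; the helper names mirror, but simplify, the Titledb functions, and isTaken() stands in for the scan of the TitleRec list:
<pre>
#include &lt;stdint.h&gt;

const uint64_t BUCKET_MASK = 0x3fULL; // bits 0-5 are the changeable bits

// Range of actual docids a probable docid may map to.
uint64_t firstProbableDocId(uint64_t d) { return d &amp; ~BUCKET_MASK; }
uint64_t lastProbableDocId (uint64_t d) { return d |  BUCKET_MASK; }

// Walk the 64-docid bucket, wrapping at the end like
// Msg22::gotUrlListWrapper(), until we find an unoccupied docid.
uint64_t findAvailDocId(uint64_t probable, bool (*isTaken)(uint64_t)) {
        uint64_t first = firstProbableDocId(probable);
        for (int i = 0; i &lt; 64; i++) {
                uint64_t d = first + ((probable - first + i) &amp; BUCKET_MASK);
                if (!isTaken(d)) return d;
        }
        return 0; // bucket full -- url cannot be indexed
}
</pre>
Keeping the changes inside bits 0-5 guarantees the actual docid stays on the same host as the probable docid, since bits 7 and up determine which group stores the TitleRec.
<br><br>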
  1703. When Msg15 asks Msg22 for the TitleRec of a url being indexed, Msg22 will either return the TitleRec or, if the TitleRec does not exist for that url, it will set Msg22::m_availDocId to an available docid for that url to use. When looking up a TitleRec from a url, the first thing Msg22 does is define a startKey and endKey for a fetch of records from Urldb. Urldb consists of a bunch of dataless keys. Each of these keys consist of a docid, a hash extension and a file number.
  1704. <br><br>
1705. If the file number in a Urldb record <b>is 255</b> then the corresponding docid is a <i>probable</i> docid (just a hash of the url as computed in Titledb::getProbableDocId()) and the TitleRec is either in Titledb's RdbTree or the TitleRec does not exist, but that url is slated for spidering and is contained in Spiderdb. This makes it easy to avoid adding duplicate urls to Spiderdb. Because different urls in Spiderdb may have the same probable docid, each Urldb record contains a 7-bit hash extension of the url used to reduce the chance of collision. The 38-bit docid hash plus this 7-bit extension gives an effective docid of 45 bits. Unfortunately, if a url has the same 45-bit hash as another different url it will not be permitted into Spiderdb and therefore will not be indexed. We have to rely on the extended hash because we cannot store the actual url in every Urldb record, since keeping Urldb mostly in memory is very important for performance reasons.
  1706. <br><br>
  1707. If the file number in a Urldb record <b>is not 255</b>, then it refers to the particular Titledb file that contains the TitleRec for the corresponding docid. In this case, the docid in the Urldb record is the <i>actual</i> docid of the url, not the probable docid. In most cases the actual docid is the same as the probable docid, but when two different urls hash to the same probable docid, Msg22 will increment the actual docid of the url being added until it finds an unused actual docid.
  1708. <br><br>
  1709. The startKey and endKey for this particular Urldb request are constructed using the lowest and highest <i>actual</i> docids possible for the url. Because we only know the probable docid for that url, we have to assume that its lower six bits were changed in order to form its actual docid. Once we get back the list of records from Urldb we compute the extended hash of the url we are looking up and we scan through the list of records to see if any have the same extended hash. If a Urldb record in the list has the same extended hash, then we can extract the file number and consult the corresponding Titledb file to get the TitleRec. If the file number is 255 then we check for the TitleRec in Titledb's tree. If not in the tree, then url is probably slated for spidering but not yet in the index, so we conclude the TitleRec does not exist.
  1710. <br><br>
  1711. If we do not find a TitleRec for the url then we must return an available actual docid for the url, so the caller can index the url. In this case, in Msg22::gotUrlListWrapper() [line 292 in Msg22.cpp], we assume the probable docid is the actual docid. This assumption is a bug, because the probable docid may have been present in one of the Urldb records in the list, but the extended hash may not have matched the extended hash of the url. So this needs to be fixed.
  1712. <br><br>
  1713. SIDE NOTE: To avoid having to lookup urls associated with each docid, a Query Reindex request will populate Spiderdb with docid-based records, rather than the typical url-based records. In this case, Msg15 can pass the exact docid to Msg22, so Msg22 does not have to worry about collisions and having to scan through a list of TitleRecs.
  1714. <br><br>
  1715. <a name="docidscaling">
  1716. <h2>Scaling to Many Docids</h2></a>
1717. Currently, we are using 38-bit docids, which allows for over 270 billion docids. Using the lower 6 bits of each docid for the chaining purposes described above means that there will be a problem if a probable docid cannot be turned into an available actual docid by only changing those lower 6 bits, because all the possible actual docids are occupied. These 6 bits correspond to 64 different values, therefore if we think of the range of possible actual docids as being divided up into buckets, where each bucket is 64 consecutive docids, we might ask: what number of <i>full</i> buckets can we expect? Once a bucket is full we will have to turn away any url that hashes into that bucket. It turns out that even when we have 128 billion documents the expected number of full buckets is vanishingly small.
  1718. <br><br>
1719. If we have 128 billion pages out of a possible 256 billion, each docid has a 50% probability of being used. Therefore, the probability of 64 consecutive used docids is (.5)^64, and with about 4 billion buckets the expected number of full buckets is (.5)^64 * 4 billion, which is roughly 2.3e-10. If we have 16 billion pages out of the 256 billion the number of expected full buckets is smaller still. This is all assuming we have a uniformly distributed hash function.
  1720. <br><br>
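The expected-full-buckets arithmetic can be sanity checked with a few lines (this just restates the math above):
<pre>
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

int main() {
        double buckets = pow(2, 38) / 64; // ~4.3 billion 64-docid buckets
        double pUsed   = 0.5;             // 128B used of ~256B docids
        double pFull   = pow(pUsed, 64);  // all 64 docids in a bucket used
        printf("expected full buckets: %g\n", buckets * pFull);
        return 0;
}
</pre>
<br><br>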
1721. The next scaling problem is that of urldb. Urldb enhances the probable docid with a 7-bit extension. The probability that a url's probable docid collides with another url's actual docid is fairly high in a 16 billion page index: it is 1/16. But then we multiply by 128 to get 1 in 2048, which is lousy. Therefore we should grow the extended bits by 16, so it is at least 1 in 134 million for a 16 billion page index, which is ok because that means only about 100 urls will not be able to be indexed because of a collision in urldb. And with this 16-bit growth, we should still get decent compression for large indexes. That is, if we have 130 million pages per server, we'd have 1<<24 distinct docid prefixes in the upper 6 bytes of the urldb key. That means we'd expect about 7 keys on average to share the same 6 bytes (130,000,000/(1<<24) = 7.74). That would still constitute about a 12% growth over current urldb sizes, though.
  1722. <br><br>
  1723. <hr>
  1724. <a name="searchresultslayer"></a>
  1725. <h1>The Search Results Layer</h1>
  1726. This layer is responsible for taking a query url as input and returning
1727. search results. Gigablast is capable of doing a few queries per second for
1728. every server in the cluster, so if you double the number of servers you double
1729. the query throughput.
  1730. <br><br>
  1731. Gigablast uses the <a href="#query">Query</a> class to break a query up into
  1732. termIds. Each termId is a hash of a word or phrase in the query. Then the <a href="#indexlist">IndexList</a> for each termId is loaded and they are all
  1733. intersected (by docId) in a hash table, and their scores are mapped and
  1734. accumulated as well. Msg39 computes the top X docIds and then uses Msg38 to
  1735. get the site cluster hash and possibly the sample vector of each top docId.
  1736. Clustered
  1737. and duplicate results (docIds) are removed and if the total remaining is less
  1738. than what
  1739. was requested, Msg39 redoes the intersection to try to get more docIds to
  1740. compensate. The final docIds are returned to Msg40 which uses Msg20 to fetch
1741. a summary for each docId. Then PageResults displays the final search results.
  1742. <br><br>
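A toy sketch of that docId intersection; the real IndexTable handles tiers, term weights and phrase terms, whereas here every term must match and scores simply accumulate:
<pre>
#include &lt;stdint.h&gt;
#include &lt;map&gt;
#include &lt;vector&gt;

struct Posting { int64_t docId; int32_t score; };

// Intersect one termlist per query term: keep only docIds present in
// every list (each docId appears at most once per termlist) and
// accumulate their scores.
std::map&lt;int64_t,int&gt; intersect(
                const std::vector&lt;std::vector&lt;Posting&gt; &gt; &amp;lists) {
        std::map&lt;int64_t,int&gt; score, hits;
        for (const auto &amp;list : lists)
                for (const auto &amp;p : list) {
                        score[p.docId] += p.score;
                        hits[p.docId]  += 1;
                }
        std::map&lt;int64_t,int&gt; out;
        for (const auto &amp;s : score)
                if (hits[s.first] == (int)lists.size())
                        out.insert(s);
        return out;
}
</pre>
<br><br>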
  1743. <a name="queryclasses"></a>
  1744. <h2>Associated Classes (.cpp and .h files)</h2>
  1745. <table cellpadding=3 border=1>
1746. <tr><td><b>Class</b></td><td><b>Layer</b></td><td><b>Description</b></td></tr>
<tr><td>Ads</td><td>Search</td><td>Interface to third party ad server.</td></tr>
<tr><td>Highlight</td><td>Search</td><td>Highlights query terms in a document or summary.</td></tr>
  1747. <tr><td>IndexList</td><td>Search</td><td>Derived from RdbList. Used specifically for processing Indexdb RdbLists.</td></tr>
  1748. <tr><td>IndexReadInfo</td><td>Search</td><td>Tells Gigablast how much of what IndexLists to read from Indexdb to satisfy a query.</td></tr>
  1749. <tr><td>IndexTable</td><td>Search</td><td>Intersects IndexLists to get the final docIds to satisfy a query.</td></tr>
  1750. <tr><td>Matches</td><td>Search</td><td>Identifies words in a document or string that match supplied query terms. Used by Highlight.</td></tr>
  1751. <tr><td>Msg17</td><td>Search</td><td>Used by Msg40 for distributed caching of search result pages.</td></tr>
  1752. <tr><td>Msg1a</td><td>Search</td><td>Get the reference pages from a set of search results.</td></tr>
  1753. <tr><td>Msg1b</td><td>Search</td><td>Get the related pages from a set of search results and reference pages.</td></tr>
  1754. <tr><td>Msg2</td><td>Search</td><td>Given a list of termIds, download their respective IndexLists.</td></tr>
1755. <tr><td>Msg20</td><td>Search</td><td>Given a docId and query, return a summary or document excerpt. Used by Msg40.</td></tr>
  1756. <tr><td>Msg33</td><td>Search</td><td>Unused. Did raid stuff.</td></tr>
  1757. <tr><td>Msg36</td><td>Search</td><td>Gets the length of an IndexList for determining query term weights.</td></tr>
  1758. <tr><td>Msg37</td><td>Search</td><td>Calls a Msg36 for each term in the query.</td></tr>
1759. <tr><td>Msg38</td><td>Search</td><td>Returns the Clusterdb record for a docId. May also get the Titledb record if its key is in the RdbMap.</td></tr>
  1760. <tr><td>Msg39</td><td>Search</td><td>Intersects In