
/html/developer.html

https://github.com/gigablast/open-source-search-engine
Possible License(s): Apache-2.0



<br><br>

<h1>Developer Documentation</h1>

FAQ is <a href=/admin.html>here</a>.
<br><br>

A work-in-progress <a href=/compare.html>comparison to SOLR</a>.
<br><br>

<b>CAUTION: This documentation is old and a lot of it is out of date. -- Matt, May 2014</b>
<br><br>

<!--- TABLE OF CONTENTS-->

<h1>Table of Contents</h1>

<table cellpadding=3 border=0>

<tr><td>I.</td><td><a href="#started">Getting Started</a> - Setting up your PC for Gigablast development.</td></tr>
<!--subtable-->
<tr><td></td><td><table cellpadding=3>
<tr><td valign=top>A.</td><td><a href="#ssh">SSH</a> - A brief introduction to setting up ssh.</td></tr>
</table></td></tr>

<tr><td>II.</td><td><a href="#shell">Using the Shell</a> - Some common shell commands.</td></tr>

<tr><td>III.</td><td><a href="#git">Using Git</a> - Source code management.</td></tr>

<tr><td>IV.</td><td><a href="#hardware">Hardware Administration</a> - Gigablast hardware resources.</td></tr>

<tr><td>V.</td><td><a href="#dirstruct">Directory Structure</a> - How files are laid out.</td></tr>

<tr><td>VI.</td><td><a href="#kernels">Kernels</a> - Kernels used by Gigablast.</td></tr>

<tr><td>VII.</td><td><a href="#coding">Coding Conventions</a> - The coding style used at Gigablast.</td></tr>

<tr><td>VIII.</td><td><a href="#debug">Debugging Gigablast</a> - How to debug gb.</td></tr>

<tr><td>IX.</td><td><a href="#code">Code Overview</a> - Basic layers of the code.</td></tr>

<!--subtable1-->
<tr><td></td><td><table cellpadding=3>

<tr><td valign=top>A.</td><td><a href="#loop">The Control Loop Layer</a></td></tr>

<tr><td valign=top>B.</td><td><a href="#database">The Database Layer</a></td></tr>

<!--subtable2-->
<tr><td></td><td><table cellpadding=3>

<tr><td>1.</td><td><a href="#dbclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#database">Rdb</a> - The core database class.</td></tr>
<!--subtable3-->
<tr><td></td><td><table cellpadding=3>
<tr><td>a.</td><td><a href="#addingrecs">Adding Records</a></td></tr>
<tr><td>b.</td><td><a href="#mergingfiles">Merging Rdb Files</a></td></tr>
<tr><td>c.</td><td><a href="#deletingrecs">Deleting Records</a></td></tr>
<tr><td>d.</td><td><a href="#readingrecs">Reading Records</a></td></tr>
<tr><td>e.</td><td><a href="#errorcorrection">Error Correction</a> - How Rdb deals with corrupted data on disk.</td></tr>
<tr><td>f.</td><td><a href="#netadds">Adding Records over the Net</a></td></tr>
<tr><td>g.</td><td><a href="#netreads">Reading Records over the Net</a></td></tr>
<tr><td>h.</td><td><a href="#varkeysize">Variable Key Sizes</a></td></tr>
<tr><td>i.</td><td><a href="#rdbcache">Caching Records</a></td></tr>
</table></td></tr>
<!--subtable3end-->

<tr><td>3. a.</td><td><a href="#posdb">Posdb</a> - The new word-position storing search index Rdb.</td></tr>
<tr><td>3. b.</td><td><a href="#indexdb">Indexdb</a> - The retired/unused tf/idf search index Rdb.</td></tr>
<tr><td>4.</td><td><a href="#datedb">Datedb</a> - For returning search results constrained or sorted by date.</td></tr>
<tr><td>5.</td><td><a href="#titledb">Titledb</a> - Holds the cached web pages.</td></tr>
<tr><td>6.</td><td><a href="#spiderdb">Spiderdb</a> - Holds the urls to be spidered, sorted by spider date.</td></tr>
<!--<tr><td>7.</td><td><a href="#urldb">Urldb</a> - Tells us if a url is in spiderdb or where it is in titledb.</td></tr>-->
<tr><td>8.</td><td><a href="#checksumdb">Checksumdb</a> - Each record is a hash of the document content. Used for deduping.</td></tr>
<tr><td>9.</td><td><a href="#sitedb">Sitedb</a> - Used for mapping a url or site to a ruleset.</td></tr>
<tr><td>10.</td><td><a href="#clusterdb">Clusterdb</a> - Used for doing site clustering and duplicate search result removal.</td></tr>
<tr><td>11.</td><td><a href="#catdb">Catdb</a> - Used to hold DMOZ.</td></tr>
<tr><td>12.</td><td><a href="#robotdb">Robotdb</a> - Just a cache for robots.txt files.</td></tr>

</table></td></tr>
<!--subtable2end-->

<tr><td valign=top>C.</td><td><a href="#networklayer">The Network Layer</a></td></tr>

<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#netclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#udpserver">Udp server</a> - Used for internal communication.</td></tr>

<!--subtable3-->
<tr><td></td><td><table cellpadding=3>
<tr><td>a.</td><td><a href="#multicasting">Multicasting</a></td></tr>
<tr><td>b.</td><td><a href="#msgclasses">Message Classes</a></td></tr>
</table></td></tr>
<!--subtable3end-->

<tr><td>3.</td><td><a href="#tcpserver">TCP server</a> - Used by the HTTP server.</td></tr>
<tr><td>4.</td><td><a href="#tcpserver">HTTP server</a> - The web server.</td></tr>
<tr><td>5.</td><td><a href="#dns">DNS Resolver</a></td></tr>

</table></td></tr>
<!--subtable2end-->

<tr><td valign=top>D.</td><td><a href="#buildlayer">The Build Layer</a></td></tr>

<!--subtable2-->
<tr><td></td><td><table cellpadding=3>

<tr><td>1.</td><td><a href="#buildclasses">Associated Files</a></td></tr>
<tr><td>2.</td><td><a href="#spiderloop">Spiderdb.cpp</a> - Most of the spider code.</td></tr>
<!--
<tr><td>3.</td><td><a href="#spidercache">The Spider Cache</a> - Caches urls from spiderdb.</td></tr>
<tr><td>4.</td><td><a href="#msg14">Msg14</a> - Indexes a url.</td></tr>
<tr><td>5.</td><td><a href="#msg15">Msg15</a> - Sets the old IndexList.</td></tr>
<tr><td>6.</td><td><a href="#msg16">Msg16</a> - Sets the new IndexList.</td></tr>
<tr><td>7.</td><td><a href="#doc">The Doc Class</a> - Converts document to an IndexList.</td></tr>
-->

<tr><td>3.</td><td><a href="#linkanalysis">Link Analysis</a></td></tr>
<!--<tr><td>9.</td><td><a href="#robotdb">Robots.txt</a></td></tr>
<tr><td>10.</td><td><a href="#termtable">The TermTable</a></td></tr>
<tr><td>11.</td><td><a href="#indexlist">The IndexList</a></td></tr>-->
<tr><td>4.</td><td><a href="#samplevector">The Sample Vector</a> - Used for deduping at spider time.</td></tr>
<tr><td>5.</td><td><a href="#summaryvector">The Summary Vector</a> - Used for removing similar search results.</td></tr>
<tr><td>6.</td><td><a href="#gigabitvector">The Gigabit Vector</a> - Used for clustering results by topic.</td></tr>
<tr><td>7.</td><td><a href="#deduping">Deduping</a> - Preventing the indexing of duplicate pages.</td></tr>
<tr><td>8.</td><td><a href="#indexcode">Indexing Error Codes</a> - Reasons why a document failed to be indexed.</td></tr>
<tr><td>9.</td><td><a href="#docids">DocIds</a> - How DocIds are assigned.</td></tr>
<tr><td>10.</td><td><a href="#docidscaling">Scaling To Many DocIds</a> - How we scale to many DocIds.</td></tr>

</table></td></tr>
<!--subtable2end-->

<tr><td valign=top>E.</td><td><a href="#searchresultslayer">The Search Results Layer</a></td></tr>

<!--subtable2-->
<tr><td></td><td><table cellpadding=3>

<tr><td>1.</td><td><a href="#queryclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#query">The Query Parser</a></td></tr>
<tr><td>3.</td><td>Getting the Termlists</td></tr>
<tr><td>4.</td><td>Intersecting the Termlists</td></tr>
<tr><td>5.</td><td><a href="#raiding">Raiding Termlist Intersections</a></td></tr>
<tr><td>6.</td><td>The Search Results Cache</td></tr>
<tr><td>7.</td><td>Site Clustering</td></tr>
<tr><td>8.</td><td>Deduping</td></tr>
<tr><td>9.</td><td>Family Filter</td></tr>
<tr><td>10.</td><td>Gigabits</td></tr>
<tr><td>11.</td><td>Topic Clustering</td></tr>
<tr><td>12.</td><td>Related and Reference Pages</td></tr>
<tr><td>13.</td><td><a href="#spellchecker">The Spell Checker</a></td></tr>

</table></td></tr>
<!--subtable2end-->

<tr><td valign=top>F.</td><td><a href="#adminlayer">The Administration Layer</a></td></tr>

<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#adminclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#collections">Collections</a></td></tr>
<tr><td>3.</td><td><a href="#parms">Parameters and Controls</a></td></tr>
<tr><td>4.</td><td><a href="#log">The Log System</a></td></tr>
<tr><td>5.</td><td><a href="#clusterswitch">Changing Live Clusters</a></td></tr>
</table></td></tr>

<tr><td valign=top>G.</td><td><a href="#corelayer">The Core Functions Layer</a></td></tr>
<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#coreclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#files">Files</a></td></tr>
<tr><td>3.</td><td><a href="#threads">Threads</a></td></tr>
</table></td></tr>

<tr><td valign=top>H.</td><td><a href="#filelist">File List</a> - List of all source code files with brief descriptions.</td></tr>

</table></td></tr>

<tr><td>XIV.</td><td><a href="#spam">Fighting Spam</a> - Search engine spam identification.</td></tr>

</table>

<br><br>


<a name="started"></a>
<hr><h1>Getting Started</h1>

Welcome to Gigablast. Thanks for supporting this open source software.

<a name="ssh"></a>
<br><br>
<h2>SSH - A Brief Introduction</h2>
Gigablast uses SSH extensively to administer and develop its software. SSH is a secure shell
connection that helps protect sensitive information over an insecure network. Internally, we
access our development machines with SSH, but being prompted
for a password when logging in between machines can become tiresome, and development gb
processes will not execute unless you can ssh without a password. This is where the authorized
keys file comes in. In your user directory you have a .ssh directory, and inside that directory
lie an authorized_keys file and a known_hosts file. The first thing to do is generate a
key for your user on the current machine. Remember to press &lt;Enter&gt; when asked for a
passphrase, otherwise you will still have to log in.<br>
<code><i> ssh-keygen -t dsa </i></code><br>
That should have generated an id_dsa.pub file and an id_dsa file. Now that you have a key, we need to
let all of our trusted hosts know it. Luckily, ssh provides a simple way to copy your key into
the appropriate files. When you run this command the machine will ask you to log in.<br>
<code><i> ssh-copy-id -i ~/.ssh/id_dsa.pub destHost</i></code><br>
Perform this operation on all of our development machines, replacing destHost with the appropriate
machine name. Once you have done this, try sshing into each machine. You should no longer be
prompted for a password. If you were asked for a password or passphrase, go through this section again,
but DO NOT ENTER A PASSPHRASE when ssh-keygen requests one. <b>NOTE: Make sure to copy the id to the
same machine you generated the key on. This is important for running Gigablast, since it needs to ssh
freely.</b>
<br>
<br>
<i>Host Key Has Changed and Now I Can Not Access the Machine</i><br>
This is easily remedied. If a machine crashes and the OS needs to be reinstalled for whatever
reason, the SSH host key for that machine will change, but your known_hosts file will still have the
old host key. This guards against man-in-the-middle attacks, but when we know the machine was rebuilt
we can simply correct the problem. Just open the ~/.ssh/known_hosts file with your favorite editor.
Find the line with the name or IP address of the offending machine and delete the entire line (the
known_hosts file separates host keys by newlines).<br>
<br><br>
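If you prefer not to hand-edit the file, the stale line can be removed from the command line. A sketch (the host name gf36 and key strings are made up for the demo; newer OpenSSH can also do this in one step with <code>ssh-keygen -R gf36</code>):

```shell
# Demo on a throwaway copy; point sed at ~/.ssh/known_hosts in real use.
printf 'gf36 ssh-rsa AAAAB3demo\ngf37 ssh-rsa AAAAC4demo\n' > /tmp/known_hosts.demo
sed -i '/^gf36[ ,]/d' /tmp/known_hosts.demo   # delete every key line for gf36
cat /tmp/known_hosts.demo                     # only the gf37 line remains
```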

<a name="shell"></a>
<hr><h1>Using the Shell</h1>
<!-- I'd like to nominate to remove a majority of this section - if you need help with bash at this point, you probably
   	shouldn't be poking around in the code... Maybe just keep the dsh and export command descriptions? SCC -->
Everyone uses the bash shell. The navigation keystrokes work in emacs, too. Here are the common commands:
<br><br>

<table cellpadding=3 border=1>
<tr><td>Command</td><td>Description</td></tr>
<tr><td><nobr>export LD_LIBRARY_PATH=/home/mwells/dynamicslibs/</nobr></td><td>Export the LD_LIBRARY_PATH variable, used to tell the OS where to look for dynamic libraries.</td></tr>
<tr><td>Ctrl+p</td><td>Show the previously executed command.</td></tr>
<tr><td>Ctrl+n</td><td>Show the next executed command.</td></tr>
<tr><td>Ctrl+f</td><td>Move the cursor forward.</td></tr>
<tr><td>Ctrl+b</td><td>Move the cursor backward.</td></tr>
<tr><td>Ctrl+a</td><td>Move the cursor to the start of the line.</td></tr>
<tr><td>Ctrl+e</td><td>Move the cursor to the end of the line.</td></tr>
<tr><td>Ctrl+k</td><td>Cut the buffer from the cursor forward.</td></tr>
<tr><td>Ctrl+y</td><td>Yank (paste) the buffer at the cursor location.</td></tr>
<tr><td>Ctrl+Shift+-</td><td>Undo the last keystrokes.</td></tr>
<tr><td>history</td><td>Show the list of recently executed commands. Edit /home/username/.bashrc to change the number of commands stored in the history. All are stored in the /home/username/.bash_history file.</td></tr>
<tr><td>!xxx</td><td>Execute command #xxx, where xxx is a number shown by the 'history' command.</td></tr>
<tr><td>ps auxww</td><td>Show all processes.</td></tr>
<tr><td>ls -latr</td><td>Show all files, reverse sorted by time.</td></tr>
<tr><td>ls -larS</td><td>Show all files, reverse sorted by size.</td></tr>
<tr><td>ln -s &lt;x&gt; &lt;y&gt;</td><td>Make directory or file y a symbolic link to x.</td></tr>
<tr><td>cat xxx | awk -F":" '{print $1}'</td><td>Show the contents of file xxx, but for each line, use : as a delimiter and print out the first token.</td></tr>
<tr><td>dsh -c -f hosts 'cat /proc/scsi/scsi'</td><td>Show all hard drives on all machines listed in the file <i>hosts</i>. -c means to execute this command concurrently on all those machines. dsh must be installed with <i>apt-get install dsh</i> for this to work. You can use double quotes in a single-quoted dsh command without problems, so you can grep for a phrase, for instance.</td></tr>
<tr><td>apt-cache search xxx</td><td>Search for a package to install. xxx is a space-separated list of keywords. Debian only.</td></tr>
<tr><td>apt-cache show xxx</td><td>Show details of the package named xxx. Debian only.</td></tr>
<tr><td>apt-get install xxx</td><td>Install a package named xxx. Must be root to do this. Debian only.</td></tr>
<tr><td>adduser xxx</td><td>Add a new user to the system with username xxx.</td></tr>
</table>
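A quick demonstration of the awk recipe from the table, run against a small sample file (the file name and its contents are made up for illustration):

```shell
# Build a tiny ':'-delimited sample file, then print field 1 of each line.
printf 'root:x:0:0:/root\nmwells:x:1000:1000:/home/mwells\n' > /tmp/awkdemo.txt
cat /tmp/awkdemo.txt | awk -F":" '{print $1}'
# prints:
# root
# mwells
```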

<br>


<a name="git"></a>
<hr><h1>Using Git</h1>

Git is what we use to do source code control. Git allows us to have many engineers making changes to a common set of files. The basic commands are the following:<br><br>
<table border=1 cellpadding=3>
<tr>
<td><nobr>git clone &lt;srcDir&gt; &lt;destDir&gt;</nobr></td>
<td>This will copy the git repository to the destination directory.</td>
</tr>
</table>

<p>More information is available at <a href=http://www.github.com/>github.com</a>.</p>
<br/>
<h3>Setting up GitHub on a Linux Box</h3>
To make things easier, you can set up GitHub access via ssh. Here's a quick list of commands to run
(assuming Ubuntu/Debian):
<pre>
sudo apt-get install git-core git-doc
git config --global user.name "Your Name"
git config --global user.email "your@email.com"
git config --global color.ui true
ssh-keygen -t rsa -C "your@email.com" -f ~/.ssh/git_rsa
cat ~/.ssh/git_rsa.pub
</pre>
Copy and paste the ssh-rsa output from the above command into your GitHub profile's list of SSH keys.
<pre>
ssh-add ~/.ssh/git_rsa
</pre>
If that gives you an error about being unable to connect to the ssh agent, run:
<pre>
eval `ssh-agent`
</pre>
Then test and clone!
<pre>
ssh -T git@github.com
git clone git@github.com:gigablast/open-source-search-engine
</pre>
<br><br>

<a name="hardware"></a>
<hr><h1>Hardware Administration</h1>

<h2>Hardware Failures</h2>
If a machine has a bunch of I/O errors in the log or the gb process is at a standstill, log in and type "dmesg" to see the kernel's ring buffer. The error that means the hard drive is fried is something like this:<br>
<pre>
mwells@gf36:/a$ dmesg | tail -3
scsi5: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 13 89 68 82 00 00 18 00 
Info fld=0x1389688b, Current sd08:34: sense key Medium Error
 I/O error: dev 08:34, sector 323627528
</pre>
If you do a <i>cat /proc/scsi/scsi</i> you can see what type of hard drives are in the server:<br>
<pre>
mwells@gf36:/a$ cat /proc/scsi/scsi 
Attached devices: 
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi  Model: HDS724040KLSA80  Rev: KFAO
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi  Model: HDS724040KLSA80  Rev: KFAO
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi4 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi  Model: HDS724040KLSA80  Rev: KFAO
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi5 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi  Model: HDS724040KLSA80  Rev: KFAO
  Type:   Direct-Access                    ANSI SCSI revision: 03
</pre>
<!-- Should most of the following be kept internally at GB? Not sure it would help with open-source users... SCC -->
So for this error we should replace the <b>rightmost</b> hard drive with a spare hard drive. Usually we have spare hard drives floating around. You will know by looking at colo.html where all the equipment is stored.
<br><br>
If a drive is not showing up in /proc/scsi/scsi when it should be, it may be a bad sata channel on the motherboard, but sometimes you can see it again if you reboot a few times.
<br><br>
Often, if a motherboard fails, or a machine just stops responding to pings and cannot be revived, we designate it as a junk machine and replace its good hard drives with bad ones from other machines. Then we just need to fix the junk machine.
<br><br>
At some point we email John Wang from TwinMicro, or Nick from ACME Micro, to fix/RMA the gb/g/gf machines and gk machines respectively. We have to make sure there are plenty of spares to deal with spikes in the hardware failure rate.
<br><br>
Sometimes you can not ssh to a machine but you can rsh to it. This is usually because a drive is bad and ssh can not access some files it needs to log you in. And sometimes you can ping a machine but cannot rsh or ssh to it, for the same reason. Recently we had some problems with some machines not coming back up after reboot; that was due to a flaky 48-port NetGear switch.
<br><br>
<h2>Replacing a Hard Drive</h2>
When a hard drive goes bad, make sure it is labelled as such so we don't end up reusing it. If it is the leftmost drive slot (drive #0) then the OS will need to be reinstalled using a boot cd. Otherwise, we can just <b>umount /dev/md0</b> as root and rebuild ext2 using the command in /home/mwells/gbsetup2, <b>mke2fs -b4096 -m0 -N20000 -R stride=32 /dev/md0</b>. And to test the hard drives thoroughly, after the mke2fs, run <i>nohup badblocks -b4096 -p0 -c10000 -s -w -o/home/mwells/badblocks /dev/md0 >& /home/mwells/bbout </i>. Sometimes you might even need to rebuild the software raid with <b>mkraid -c /etc/raidtab --really-force /dev/md0</b> before you do that. Once the raid is repaired, mount /dev/md0 and copy over the data from its twin. On the newly repaired host use <b>nohup rcp -r gbXX:/a/* /a/ &</b> where gbXX is its twin that still has all the data. DO NOT ERASE THE TWIN'S DATA. You will also have to do <b>rcp -r gbXX:/a/.antiword /a/</b> because rcp does not pick that directory up.
<br><br>
<h2>Measuring Drive Performance</h2>
Slow drives are sometimes worse than broken drives. Run <b>./gb thrutest &lt;directoryToWriteTo&gt; 10000000000</b> to make gigablast write out 10GB of files and then read from them. This is also a good test for checking a drive for errors. Each drive should have a write speed of about 39MB/s for a HITACHI, so the whole raid should be doing 150MB/s writing, and close to 200MB/s reading. Fuller drives will not do so well because of fragmentation. If the performance is less than it should be, then try mounting each drive individually and seeing what speed you get, to isolate the culprit. Sometimes you can have the drive re-seated (just pulling it out and plugging it back in) and that does the trick. A soft reboot does not seem to work. I don't know if a power cycle works yet or not.
<br><br>
<h2>Measuring Memory Performance</h2>
Run <b>membustest 50000000 50</b> to read 50MB from main memory 50 times to see how fast the memory bus is. It should be at least 2000MB/sec, and 2500MB/sec for the faster servers. Partap also wrote <b>gb memtest</b> to see the maximum amount of memory that can be allocated, in addition to testing memory bus speeds. The max memory that can be allocated seems to be about 3GB on our 2.4.31 kernel.
<br><br>
<h2>Measuring Network Performance</h2>
Use <b>thunder 50000000</b> on the server with ip address a.b.c.d and <b>thunder 50000000 a.b.c.d</b> on the client to see how fast the client can read UDP packets from the server. Gigablast needs about a Gigabit between all machines, and they must all be on a Gigabit switch. Sometimes the 5013CM-T resets the Gigabit to 100Mbps and causes Gigablast to basically come to a halt. You can detect that by doing a <b>dmesg | grep e1000 | tail -1</b> on each machine to see what speed its network interface is running at.
<br><br>

<a name="kernels"></a>
<hr><h1>Kernels</h1>
Gigablast runs well on the Linux 2.4.31 kernel, which resides in /gb/kernels/2.4.31/. The .config file is in that directory and named <b>dotconfig</b>. It does have the marvel patch, mvsata340_2.4.patch.bz2, applied so Linux can see the Marvel chip that drives the 4 sata devices on our SuperMicro 5013CM-T 1U servers. In addition we comment out the 3 'offline' statements in drivers/scsi/scsi_error.c in the kernel source, and raised the 3 TIMEOUT values to 30. These changes allow the kernel to deal with the 'drive offlined' messages that the Hitachi drives' SCSI interface gives every now and then. We still set the 'ghost lun' thing in the kernel source to 10, but I am not sure if this actually helps, though.
<br><br>
Older kernels had problems with unnecessary swapping and also with crashing, most likely due to the e1000 driver, which drives the 5013CM-T's Gigabit ethernet functionality. All issues seem to be worked out at this point, so any kernel crash is likely due to a hardware failure, like bad memory or disk.
<br><br>
I would like to have larger block sizes on the ext2fs filesystem, because 4KB is just not good for fragmentation purposes. We use mostly huge files on our partitions, raid-0'ed across 4 400GB Sata drives, so it would help if we could use a block size of a few megabytes even. I've contacted Theodore Ts'o, the guy who wrote the e2fs tools, and he may do it for a decent chunk of cash.
<br><br>


<a name="coding"></a>
<hr><h1>Coding Conventions</h1>

When editing the source please do not wrap lines past 80 columns. Also,
please make an effort to avoid excessive nesting; it makes for hard-to-read
code and can almost always be avoided.
<br><br>
Classes and files are almost exclusively named using capitalized letters for
the beginning of each word, like MyClass.cpp. Global variables start with g_,
static variables start with s_ and member variables start with m_. In order
to save vertical space and make things more readable, left curly brackets ({'s)
are always placed on the same line as the class, function or loop declaration.
In pointer declarations, the * sticks to the variable name, not the type, as
in void *ptr. All variables are to be named in camel case, as in lowerCamelCase.
Underscores in variable names are forbidden, with the exception of the g_, s_ and m_ prefixes, of course.
<br><br>
And try to avoid lots of nested curly brackets. Usually you do not have to nest
more than 3 levels deep if you write the code correctly.
<br><br>
Using the *p, p++ and p &lt; pend mechanisms is generally preferred to using the
p[i], i++ and i &lt; n mechanisms. So stick to that whenever possible.
<br><br>
Using ?'s in the code, like a = b ? d : e;, is also forbidden because it is too cryptic and easy to mess up.
<br><br>
Upon error, g_errno is set and the conditions causing the error are logged.
Whenever possible, Gigablast should recover after an error and continue running
gracefully.
<br><br>
Gigablast uses many non-blocking functions that return false if they block,
or true if they do not block. These types of functions will often set the
g_errno global error variable on error. These types of functions almost always
take a pointer to the callback function and the callback state as arguments.
The callback function will be called upon completion of the function.
<br><br>
Gigablast no longer uses Xavier Leroy's pthread library (LinuxThreads) because
it was buggy. Instead, the Linux clone() system call is used. That function
is pretty much exclusively Linux, and was the basis for pthreads anyway. It
uses an older version of libc which does not contain thread support for the
errno variable, so, like the old pthreads library, Gigablast's Threads.cpp
class has implemented the _errno_location() function to work around that. This
function provides a pointer to a separate thread's (process's) own errno
location so we do not contend for the same errno.
<br><br>
We also make an effort to avoid deep subroutine calls. Having to trace the
call stack down several functions is time-consuming and should be avoided.
<br><br>
Gigablast also tries to make a consistent distinction between size and length.
Length usually refers to a string length (like strlen()) and does not include
the NULL terminator, if any. Size is the overall size of a data blob,
and includes everything.
<br><br>
When making comments please do not insert the "//" at the very first character
of the line in the first column of the screen. It should be indented just
like any other line.
<br><br>
Whenever possible, please mimic the currently used styles and conventions,
and please avoid using third party libraries. Even the standard template
library has serious performance/scalability issues and should be avoided.
<br><br>

<a name="debug"></a>
<hr><h1>Debugging Gigablast</h1>

<h2>General Procedure</h2>
When Gigablast cores you generally want to use gdb to look at the core and see where it happened. You can use <b>git log -- &lt;filename&gt;</b> to see if the class or file that cored was recently edited. You may be looking at a bug that was already fixed and only need to update to the latest version.

<h2>Using GDB</h2>

<p>We generally use gdb to debug Gigablast. If you are running gb, the gigablast process, under gdb and it receives a signal, gdb will break, and you have to tell it to ignore the signal from now on by typing <b>handle SIG39 nostop noprint</b> for signal 39, at least, and then continue execution by typing 'c' and enter. When debugging a core on a customer's machine you might have to copy your version of gdb over to it, if they don't have one installed.</p>

<p>There is also a /gb/bin/gdbserver that you can use to debug a remote gb process, although no one really uses this except Partap, who used it a few times.</p>

<p>The most common way to use gdb is:
<ul>
<li><b>gdb ./gb</b> to start up gdb.
<li><b>run -c ./hosts.conf 0</b> to run gb under gdb.
<li><b>Ctrl-C</b> to break gb.
<li><b>where</b> to make gdb print out the call stack.
<li><b>break MyClass.cpp:123</b> to break gb at line 123 in MyClass.cpp.
<li><b>print &lt;variableName&gt;</b> to print out the value of a variable.
<li><b>watch &lt;variableName&gt;</b> to set a watchpoint so gdb breaks if the variable changes value.
<li><b>set &lt;variableName&gt; &lt;value&gt;</b> to set the value of a variable.
<li><b>set print elements 100000</b> to tell gdb to print out the first 100000 bytes when printing strings, as opposed to limiting it to the default 200 or so bytes.
</ul>
</p>

<p>You can also use gdb to do <i>poor man's profiling</i> by repeatedly attaching gdb to the gb pid like <b>gdb ./gb &lt;pid&gt;</b> and seeing where it is spending its time. This is a fairly effective random sampling technique.</p>

<p>If a gb process goes into an infinite loop you can get it to save its in-memory data by attaching gdb to its pid and typing <b>print mainShutdown(1)</b>, which tells gdb to run that function, which saves all of gb's data to disk so you don't end up losing data.</p>

<p>To debug a core, type <b>gdb ./gb &lt;coreFilename&gt;</b>. Then you can examine why gb core dumped. Please copy the gb binary and move the core to another filename if you want to preserve the core for another engineer to look at. You need to use the exact gb binary that produced the core in order to analyze it properly.
</p>

<p>It is useful to have the following .gdbinit file in your home directory:
<pre>
set print elements 100000
handle SIG32 nostop noprint
handle SIG35 nostop noprint
handle SIG39 nostop noprint
set overload-resolution off
</pre>
The overload-resolution setting gets in the way when trying to print the return value of some functions, like uccDebug() for instance.
</p>

<h2>Using Git to Help Debug</h2>
<a href="#git">Git</a> can be a useful debugging aid. By doing <b>git log</b> you can see the latest changes made to gb. This may allow you to isolate the bug. <b>git diff HEAD~1 HEAD</b> will list the actual changes made by the last commit, instead of just the commit messages. Substituting '1' with '2' will show the changes from the last two commits, etc.
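For example, in a scratch repository (the file name Msg3.cpp and the commit messages are invented for the demo):

```shell
# Set up a throwaway repo so the commands can be run anywhere.
cd "$(mktemp -d)" && git init -q .
git -c user.name=demo -c user.email=d@e commit -q --allow-empty -m "first"
echo "// fix" > Msg3.cpp && git add Msg3.cpp
git -c user.name=demo -c user.email=d@e commit -q -m "second"

git log -2 --oneline          # the last two commit messages
git diff --stat HEAD~1 HEAD   # files changed by the last commit
```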

<h2>Memory Leaks</h2>
All memory allocations, malloc, calloc, realloc and new, go through Gigablast's Mem.cpp class. That class has a hash table of pointers to any outstanding allocated memory, including the size of each allocation. When Gigablast allocates memory it will allocate a little more than requested so it can pad the memory with special bytes, called magic characters, that are verified when the memory is freed to see if the buffer was overwritten or underwritten. In addition to doing these boundary checks, Gigablast will also print out all the unfreed memory at exit. This allows you to detect memory leaks. The only way we can leak memory without detection now is if it is leaked in a third party library which Gigablast links to.

<h2>Random Memory Writes</h2>
This is the hardest problem to track down. Where did that random memory write come from? Usually RdbTree.cpp uses the most memory, so if a stray write happens it will corrupt the tree. To fight this you can protect the RdbTree by forcing RdbTree::m_useProtection to true in RdbTree::set(). This will make Gigablast protect the memory when it is not being accessed explicitly by RdbTree.cpp code. It does this by calling mprotect(), which can slow things down quite a bit.

<h2>Corrupted Data In Memory</h2>
Try to reproduce the corruption under gdb immediately before it occurs, then set a watchpoint on the memory that was getting corrupted. Try to determine the start of the corruption in memory; this may lead you to a buffer that was overflowing. Oftentimes a gb process will core in RdbTree.cpp, which never happens unless the machine has bad RAM. Bad RAM also causes gb to go into an infinite loop in Msg3.cpp's RdbMap logic. In these cases try replacing the memory or using another machine.

<h2>Common Core Dumps</h2>
<table>
<tr><td>Problem</td><td>Cause</td></tr>
<tr><td>Core in RdbTree</td><td>Probably bad RAM</td></tr>
</table>

<h2>Common Coding Mistakes</h2>
<table>
<tr><td>1.</td><td>Calling atoip() twice or more in the same printf() statement. atoip() outputs into a single static buffer and cannot be shared this way.</td></tr>
<tr><td>2.</td><td>Not calling class constructors and destructors when mallocing/freeing class objects. A class needs to initialize properly after allocation and free any allocated memory before it is freed.</td></tr>
</table>



<br><br>
<a name="sa"></a>
<hr><h1>System Administration</h1>




<a name="code"></a>
<hr><h1>Code Overview</h1>
Gigablast consists of about 500,000 lines of C++ source. The latest gb version being developed as of this writing is /gb/datil/. Gigablast is a single executable, gb, that contains all the functionality needed to run a search engine. Gigablast attempts to uniformly distribute every function amongst all hosts in the network. Please see <a href="overview.html">overview.html</a> for a general overview of the search engine from a USER perspective.
<br><br>
Matt started writing Gigablast in 2000 with the intention of making it the most efficient search engine in the world. A lot of attention was paid to making every piece of the code fast and scalable, while still readable. In July 2013, the majority of the search engine source code was open sourced. Gigablast continues to develop and expand the software, as well as offer consulting services for companies wishing to run their own search engine.
<br><br>
<hr>

<a name="loop"></a>
<h1>The Control Loop Layer</h1>
Gigablast uses a <b>signal-based</b> architecture, as opposed to a thread-based architecture. At the heart of the Gigablast process is the I/O control loop, contained in the <a href="../latest/src/Loop.cpp">Loop.cpp</a> file. Gigablast sits in an idle state until it receives a signal on a file or socket descriptor, or until a "thread" exits and sends a signal to be cleaned up. Threads are only used when necessary, which is currently for performing disk I/O, for intersecting long lists of docIds to generate search results (which can block the CPU for a while) and for merging database data from multiple files into a single list.
<br><br>
<a href="#threads">Threads</a> are not only hard to debug, but can make everything more latent. Suppose you have 10 requests coming in, each one requiring 1 second of CPU time. If you thread them all, it will take 10 seconds before any query can be satisfied and its results returned, but if you handle them FIFO (one at a time) then the first request is satisfied in 1 second, the second in 2 seconds, etc. The Threads.cpp class handles the use of threads and uses the Linux clone() call. Xavier Leroy's pthreads were found to be too buggy at the time.
<br><br>
Loop.cpp also allows other classes to call its registerCallback() function, which takes a file or socket descriptor and a callback, and will call that callback when data is ready for reading or writing on that descriptor. Gigablast accomplishes this with Linux's fcntl() call, which you can use to tell the kernel to connect a particular file descriptor to a signal. The heart of gb is really sigtimedwait(), which sits there idle waiting for signals. When one arrives it breaks out, and the callback responsible for the associated file descriptor is called.
<br><br>
In the event that the kernel's queue of ready file descriptors is full, the kernel will deliver a SIGIO to the gb process, which then performs a poll over all the registered file descriptors.
<br><br>
Gigablast will also catch other signals, like segmentation faults and HUP signals, and save its data to disk before exiting. That way we do not lose data because of a core dump or an accidental HUP signal.
<br><br>
Using a signal-based architecture means that we have to work with callbacks a lot. Everything is triggered by an <i>event</i>, similar to GUIs, but our events are not mouse related; they are disk and network related. This naturally leads to the concept of a callback, which is a function to be called when a task is completed.
So you might call a function like <b>readData ( myCallback , myState , readbuf , bytesToRead )</b> and Gigablast will perform the requested read operation and call <b>myCallback ( myState )</b> when it is done. This means you have to worry about state, whereas in a thread-based architecture you do not.
<br><br>
Furthermore, all callbacks must be expressed as pointers to <b>C</b> functions, not C++ functions. So we call these C functions wrapper functions and have them immediately call their C++ counterparts. You'll see these wrapper functions clearly illustrated throughout the code.
<br><br>
Gigablast also has support for asynchronous signals. You can tell the kernel to deliver you an asynchronous signal on an I/O event, and that signal, also known as an interrupt, will interrupt whatever the CPU is doing and call gb's signal handler, sigHandlerRT(). Be careful: you cannot allocate memory when in a real-time/asynchronous signal handler. You should only call functions whose names end in _ass, which stands for asynchronous-signal safe. The global instance of the UdpServer called g_udpServer2 is based on these asynchronous signals, mostly set up for doing fast pings without having to worry about what the CPU is doing at the time, but it really does not interrupt like it should, most likely due to kernel issues. It doesn't seem any faster than the non-asynchronous UDP server, g_udpServer.
<br><br>This signal-based architecture also means that a lot of functions return false if they block, meaning they never completed because they have to wait on I/O, or return true and set g_errno on error. So if a function returns false on you, generally, you should return false, too.
<br><br>
The Loop class is primarily responsible for calling callbacks when an event
occurs on a file/socket descriptor. Loop.cpp also calls callbacks that
registered with it by calling registerSleepCallback(). Those sleep callbacks
are called every X milliseconds, or more than X milliseconds if load is 
extreme. Loop also responds to SIGCHLD signals, which are delivered when a thread
exits. The only socket descriptor related callbacks are for TcpServer and
for UdpServer. UdpServer and TcpServer each have their own routine for calling
callbacks for an associated socket descriptor. But when the system is under
heavy load we must be careful with which callbacks we call. We should call the
highest priority (lowest niceness) callbacks before calling any of the 
lowest priority (highest niceness) callbacks. 
To solve this problem we require that every callback registered with Loop.cpp
return after 2 ms or more have elapsed, if possible. If a callback still needs to
be called by Loop.cpp to do more work it should return true at that time,
otherwise it should return false. We do this in order to give higher priority
callbacks a chance to be called. This is similar to the low latency patch in
the Linux kernel. Actually, low priority callbacks will try to break out after 
2 ms or more, but high priority callbacks will not. Loop::doPoll() just calls
low priority callbacks.



<a name="database"></a>
<hr><h1>The Database Layer</h1>
Gigablast has the following databases, each of which contains an instance of the 
Rdb class: 
<pre>
Indexdb     - Used to hold the index.
Datedb      - Like indexdb, but its scores are dates.
Titledb     - Used to hold cached web pages.
Spiderdb    - Used to hold urls sorted by their scheduled time to be spidered.
Checksumdb  - Used for preventing spidering of duplicate pages.
Sitedb      - Used for classifying webpages. Maps webpages to rulesets.
Clusterdb   - Used to hold the site hash, family filter bit, language id of a document.
Catdb       - Used to classify a document using DMOZ.
</pre>

<a name="rdb"></a>
<h2>Rdb</h2>
<b>Rdb.cpp</b> implements a record-based database. Each record has a key, an optional 
blob of data, and an optional long (4 bytes) that holds the size of the blob of 
data. If present, the size of the blob is specified right after the key. The 
key is 12 bytes and of type key_t, as defined in the types.h file.

<a name="dbclasses"></a>
<h3>Associated Classes (.cpp and .h files)</h3>

<table cellpadding=3 border=1>
<tr><td>Checksumdb</td><td>DB</td><td>Rdb that maps a docId to a checksum for an indexed document. Used to dedup same content from the same hostname at build time.</td></tr>
<tr><td>Clusterdb</td><td>DB</td><td>Rdb that maps a docId to the hash of a site and its family filter bit and, optionally, a sample vector used for deduping search results. Used for site clustering, family filtering and deduping at query time.</td></tr>
<tr><td>Datedb</td><td>DB</td><td>Like indexdb, but its <i>scores</i> are 4-byte dates.</td></tr>
<tr><td>Indexdb</td><td>DB</td><td>Rdb that maps a termId to a score and docId pair. The search index is stored in Indexdb.</td></tr>
<tr><td>MemPool</td><td>DB</td><td>Used by RdbTree to add new records to the tree without having to do an individual malloc.</td></tr>
<tr><td>MemPoolTree</td><td>DB</td><td>Unused. Was our own malloc routine.</td></tr>
<tr><td>Msg0</td><td>DB</td><td>Fetches an RdbList from across the network.</td></tr>
<tr><td>Msg1</td><td>DB</td><td>Adds all the records in an RdbList to various hosts in the network.</td></tr>
<tr><td>Msg3</td><td>DB</td><td>Reads an RdbList from several consecutive files in a particular Rdb.</td></tr>
<!--<tr><td>Msg34</td><td>DB</td><td>Determines least loaded host in a group (shard) of hosts.</td></tr>
<tr><td>Msg35</td><td>DB</td><td>Merge token management functions. Currently does not work.</td></tr>-->
<tr><td>Msg5</td><td>DB</td><td>Uses Msg3 to read RdbLists from multiple files and then merges those lists into a single RdbList. Does corruption detection and repair. Integrates the list from the RdbTree into the single RdbList.</td></tr>
<tr><td>MsgB</td><td>DB</td><td>Unused. A distributed cache for caching anything.</td></tr>
<tr><td>Rdb</td><td>DB</td><td>The core database class from which all are derived.</td></tr>
<tr><td>RdbBase</td><td>DB</td><td>Each Rdb has an array of RdbBases, one for each collection. Each RdbBase has an array of BigFiles for that collection.</td></tr>
<tr><td>RdbCache</td><td>DB</td><td>Can cache RdbLists or individual Rdb records.</td></tr>
<tr><td>RdbDump</td><td>DB</td><td>Dumps the RdbTree to an Rdb file. Also used by RdbMerge to dump the merged RdbList to a file.</td></tr>
<tr><td>RdbList</td><td>DB</td><td>A list of Rdb records.</td></tr>
<tr><td>RdbMap</td><td>DB</td><td>Maps an Rdb key to an offset into an RdbFile.</td></tr>
<tr><td>RdbMem</td><td>DB</td><td>Memory manager for RdbTree so it does not have to allocate space for every record in the tree.</td></tr>
<tr><td>RdbMerge</td><td>DB</td><td>Merges multiple Rdb files into one Rdb file. Uses Msg5 and RdbDump to do the reading and writing respectively.</td></tr>
<tr><td>RdbScan</td><td>DB</td><td>Reads an RdbList from an RdbFile; used by Msg3.</td></tr>
<tr><td>RdbTree</td><td>DB</td><td>A binary tree of Rdb records. All collections share a single RdbTree, so the collection number is specified for each node in the tree.</td></tr>
<tr><td>SiteRec</td><td>DB</td><td>A record in Sitedb.</td></tr>
<tr><td>Sitedb</td><td>DB</td><td>An Rdb that maps a url to a Sitedb record which contains a ruleset to be used to parse and index that url.</td></tr>
<tr><td>SpiderRec</td><td>DB</td><td>A record in Spiderdb.</td></tr>
<tr><td>Spiderdb</td><td>DB</td><td>An Rdb whose records are urls sorted by the times they should be spidered. The key contains other information, like whether the url is <i>old</i> or <i>new</i> to the index, and the priority of the url, currently from 0 to 7.</td></tr>
<tr><td>TitleRec</td><td>DB</td><td>A record in Titledb.</td></tr>
<tr><td>Titledb</td><td>DB</td><td>An Rdb where the records are basically compressed web pages, along with other info like the quality of the page. Contains an instance of the LinkInfo class.</td></tr>
<!--<tr><td>Urldb</td><td>DB</td><td>An Rdb whose records indicate if a url is in spiderdb or what particular Titledb BigFile contains the url.</td></tr>-->
</table>

<br><br>

<a name="addingrecs"></a>
<h3>Adding a Record to Rdb</h3>
<a name="rdbtree"></a>
When a record is added to an Rdb it is housed in a binary tree, 
<b>RdbTree.cpp</b>,
whose size is configurable via gb.conf. For example, &lt;indexdbMaxTreeSize&gt;
specifies how much memory in bytes the tree for Indexdb can use. When the 
tree is 90% full it is dumped to a file on disk. The records are dumped in 
order of their keys. When enough files have accrued on disk, a merge is 
performed to keep the number of files down and responsiveness up. 
<br><br>
Each file, called a BigFile and defined by BigFile.h, can actually consist of 
multiple physical files, each limited in size to 512MB. In this manner 
Gigablast overcomes the 2GB file limit imposed by some kernels or disk systems.
Each physical file in a BigFile, after the first file, has a ".partX" 
extension added to its filename, where X ranges from 1 to infinity. Throughout
this document, "BigFile" is used interchangeably with "file".
<br><br>
If the tree cannot accommodate a record add, it will return an ETRYAGAIN error.
Typically, most Gigablast routines will wait one second before retrying the
add.
<br>

<h3>Dumping an Rdb Tree</h3>
The RdbDump class is responsible for dumping the RdbTree class to disk. It
dumps a little bit of the tree at a time because it can take a few hundred
milliseconds to gather a large list of records from the tree, 
especially when the records are dataless (just keys). Since RdbTree::getList() 
is not threaded it is important that RdbDump gets a little bit at a time so 
it does not block other operations.
<br>

<a name="mergingfiles"></a>
<a name="rdbmerge"></a>
<h3>Merging Rdb Files</h3>
Rdb merges just enough files to keep the number of files at or below the 
threshold specified in gb.conf, which is &lt;indexdbMaxFilesToMerge&gt; for 
Indexdb, for example. <b>RdbMerge.cpp</b> is used to control the merge logic. 
The merge chooses which files to merge so as to minimize 
the amount of disk writing over the long term. It is important to keep the 
number of files low, so that any time a key range of records is requested, the
number of disk seeks will be low and the database will be responsive.
<br><br>
When too many files have accumulated on disk, the database will enter "urgent 
merge mode". It will note this in the log when it happens. When that happens, 
Gigablast will not dump the corresponding RdbTree to disk because
that would create too many files. If the RdbTree is full then any
attempt to add data to it while Gigablast is in "urgent merge mode" will fail
with an ETRYAGAIN error. These error replies are counted on a per-host basis
and displayed in the Hosts table. At this point the host is considered to be
a spider bottleneck, and most spiders will be stuck waiting to add data to
the host.
<br><br>
If Gigablast is configured to "use merge tokens" (which no longer works for some reason and has since been disabled), then any file merge operation
may be postponed if another instance of Gigablast is performing a merge on the
same computer's IDE bus, or if the twin of the host is merging. This is done
mostly for performance reasons. File merge operations tend to increase disk
access times, so having a handy twin and not bogging down an IDE bus allows
Gigablast's load balancing algorithms to redirect requests to hosts that are
not involved with a big file merge.
<br>

<a name="deletingrecs"></a>
<h3>Deleting Rdb Records</h3>
The low bit of the 12-byte key, key_t, is called the delbit. It is 0 if the
key is a "negative key"; it is 1 if the key is a "positive key". When 
performing a merge, a negative key may collide with its positive counterpart,
the two annihilating one another. Therefore, a delete is performed by taking the
key of the record you want to delete, changing the low bit from a 1 to a 0,
and then adding that key to the database. Negative keys are always dataless.
<br>

<a name="readingrecs"></a>
<h3>Reading a List of Rdb Records</h3>
When a user requests a list of records, Gigablast will read the records
in the key range from each file. If no more than X bytes of records are
requested, then Gigablast will read no more than X bytes from each file. After
the reading, it merges the lists from each file into a final list. During this
phase it will also annihilate positive keys with their negative counterparts.
<br><br>
<a name="rdbmap"></a>
In order to determine where to start reading in each file for the database,
Gigablast uses a "map file". The map file, encompassed by <b>RdbMap.cpp</b>,  
records the key of the first record
that occurs on each disk page. Each disk page is PAGE_SIZE bytes, currently
16k. In addition to recording the key of the first record that starts on
each page, it records a 2-byte offset of the key into that page. The map file
is represented by the RdbMap class.
<br><br>
The Msg3::readList() routine is used to retrieve a list of Rdb records from 
a local Rdb. Like most Msg classes, it allows you to specify a callback
function and callback data in case it blocks. The Msg5 class contains the Msg3
class and extends it with the error correction (discussed below) and caching
capabilities. Calling Msg5::getList() also allows you to incorporate the
lists from the RdbTree in order to get realtime data. Msg3 is used mostly
by the file merge operation.
<br>

<a name="errorcorrection"></a>
<h3>Rdb Error Correction</h3>
Every time a list of records is read, either for answering a query or for
doing a file merge operation, it is checked for corruption. Right now
just the order of the keys of the records, and whether those keys are out of the
requested key range, are checked. Later, checksums may
be implemented as well. If some keys are found to be out of order or out
of the requested key range, the request
for the list is forwarded to that host's twin. The twin is a mirror image. 
If the twin has the data intact, it returns its data across the network.
However, the twin will return all the keys it has in that range, not necessarily just those from a single file, so we can end up patching the corrupted data with a list that is hundreds of times bigger.
Since this
procedure also happens during a file merge operation, merging is a way of
correcting the data.
<br><br>
If there is no twin available, or the twin's data is corrupt as well, then
Gigablast attempts to excise the smallest amount of data so as to regain
integrity. Any time corruption is encountered it is noted in the log with
a message like:
<pre>
1145413139583 81 WARN  db     [31956]  Key out of order in list of records.
1145413139583 81 WARN  db     [31956]  Corrupt filename is indexdb1215.dat.
1145413139583 81 WARN  db     [31956]  startKey.n1=af6da55 n0=14a9fe4f69cd6d46 endKey.n1=b14e5d0 n0=8d4cfb0deeb52cc3
1145413139729 81 WARN  db     [31956]  Removed 0 bytes of data…
