
/html/developer.html

https://github.com/gigablast/open-source-search-engine
Possible License(s): Apache-2.0



<br><br>

<h1>Developer Documentation</h1>

FAQ is <a href=/admin.html>here</a>.
<br><br>

A work-in-progress <a href=/compare.html>comparison to SOLR</a>.
<br><br>

<b>CAUTION: This documentation is old and a lot of it is out of date. -- Matt, May 2014</b>
<br><br>

<!--- TABLE OF CONTENTS-->

<h1>Table of Contents</h1>

<table cellpadding=3 border=0>

<tr><td>I.</td><td><a href="#started">Getting Started</a> - Setting up your PC for Gigablast development.</td></tr>
<!--subtable-->
<tr><td></td><td><table cellpadding=3>
<tr><td valign=top>A.</td><td><a href="#ssh">SSH</a> - A brief introduction to setting up ssh.</td></tr>
</table></td></tr>

<tr><td>II.</td><td><a href="#shell">Using the Shell</a> - Some common shell commands.</td></tr>

<tr><td>III.</td><td><a href="#git">Using Git</a> - Source code management.</td></tr>

<tr><td>IV.</td><td><a href="#hardware">Hardware Administration</a> - Gigablast hardware resources.</td></tr>

<tr><td>V.</td><td><a href="#dirstruct">Directory Structure</a> - How files are laid out.</td></tr>

<tr><td>VI.</td><td><a href="#kernels">Kernels</a> - Kernels used by Gigablast.</td></tr>

<tr><td>VII.</td><td><a href="#coding">Coding Conventions</a> - The coding style used at Gigablast.</td></tr>

<tr><td>VIII.</td><td><a href="#debug">Debugging Gigablast</a> - How to debug gb.</td></tr>

<tr><td>IX.</td><td><a href="#code">Code Overview</a> - Basic layers of the code.</td></tr>

<!--subtable1-->
<tr><td></td><td><table cellpadding=3>

<tr><td valign=top>A.</td><td><a href="#loop">The Control Loop Layer</a></td></tr>

<tr><td valign=top>B.</td><td><a href="#database">The Database Layer</a></td></tr>

<!--subtable2-->
<tr><td></td><td><table cellpadding=3>

<tr><td>1.</td><td><a href="#dbclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#database">Rdb</a> - The core database class.</td></tr>
<!--subtable3-->
<tr><td></td><td><table cellpadding=3>
<tr><td>a.</td><td><a href="#addingrecs">Adding Records</a></td></tr>
<tr><td>b.</td><td><a href="#mergingfiles">Merging Rdb Files</a></td></tr>
<tr><td>c.</td><td><a href="#deletingrecs">Deleting Records</a></td></tr>
<tr><td>d.</td><td><a href="#readingrecs">Reading Records</a></td></tr>
<tr><td>e.</td><td><a href="#errorcorrection">Error Correction</a> - How Rdb deals with corrupted data on disk.</td></tr>
<tr><td>f.</td><td><a href="#netadds">Adding Records over the Net</a></td></tr>
<tr><td>g.</td><td><a href="#netreads">Reading Records over the Net</a></td></tr>
<tr><td>h.</td><td><a href="#varkeysize">Variable Key Sizes</a></td></tr>
<tr><td>i.</td><td><a href="#rdbcache">Caching Records</a></td></tr>
</table></td></tr>
<!--subtable3end-->

<tr><td>3. a.</td><td><a href="#posdb">Posdb</a> - The new word-position storing search index Rdb.</td></tr>
<tr><td>3. b.</td><td><a href="#indexdb">Indexdb</a> - The retired/unused tf/idf search index Rdb.</td></tr>
<tr><td>4.</td><td><a href="#datedb">Datedb</a> - For returning search results constrained or sorted by date.</td></tr>
<tr><td>5.</td><td><a href="#titledb">Titledb</a> - Holds the cached web pages.</td></tr>
<tr><td>6.</td><td><a href="#spiderdb">Spiderdb</a> - Holds the urls to be spidered, sorted by spider date.</td></tr>
<!--<tr><td>7.</td><td><a href="#urldb">Urldb</a> - Tells us if a url is in spiderdb or where it is in titledb.</td></tr>-->
<tr><td>8.</td><td><a href="#checksumdb">Checksumdb</a> - Each record is a hash of the document content. Used for deduping.</td></tr>
<tr><td>9.</td><td><a href="#sitedb">Sitedb</a> - Used for mapping a url or site to a ruleset.</td></tr>
<tr><td>10.</td><td><a href="#clusterdb">Clusterdb</a> - Used for doing site clustering and duplicate search result removal.</td></tr>
<tr><td>11.</td><td><a href="#catdb">Catdb</a> - Used to hold DMOZ.</td></tr>
<tr><td>12.</td><td><a href="#robotdb">Robotdb</a> - Just a cache for robots.txt files.</td></tr>

</table></td></tr>
<!--subtable2end-->

<tr><td valign=top>C.</td><td><a href="#networklayer">The Network Layer</a></td></tr>

<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#netclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#udpserver">Udp server</a> - Used for internal communication.</td></tr>

<!--subtable3-->
<tr><td></td><td><table cellpadding=3>
<tr><td>a.</td><td><a href="#multicasting">Multicasting</a></td></tr>
<tr><td>b.</td><td><a href="#msgclasses">Message Classes</a></td></tr>
</table></td></tr>
<!--subtable3end-->

<tr><td>3.</td><td><a href="#tcpserver">TCP server</a> - Used by the HTTP server.</td></tr>
<tr><td>4.</td><td><a href="#tcpserver">HTTP server</a> - The web server.</td></tr>
<tr><td>5.</td><td><a href="#dns">DNS Resolver</a></td></tr>

</table></td></tr>
<!--subtable2end-->

<tr><td valign=top>D.</td><td><a href="#buildlayer">The Build Layer</a></td></tr>

<!--subtable2-->
<tr><td></td><td><table cellpadding=3>

<tr><td>1.</td><td><a href="#buildclasses">Associated Files</a></td></tr>
<tr><td>2.</td><td><a href="#spiderloop">Spiderdb.cpp</a> - Most of the spider code.</td></tr>
<!--
<tr><td>3.</td><td><a href="#spidercache">The Spider Cache</a> - Caches urls from spiderdb.</td></tr>
<tr><td>4.</td><td><a href="#msg14">Msg14</a> - Indexes a url.</td></tr>
<tr><td>5.</td><td><a href="#msg15">Msg15</a> - Sets the old IndexList.</td></tr>
<tr><td>6.</td><td><a href="#msg16">Msg16</a> - Sets the new IndexList.</td></tr>
<tr><td>7.</td><td><a href="#doc">The Doc Class</a> - Converts document to an IndexList.</td></tr>
-->

<tr><td>3.</td><td><a href="#linkanalysis">Link Analysis</a></td></tr>
<!--<tr><td>9.</td><td><a href="#robotdb">Robots.txt</a></td></tr>
<tr><td>10.</td><td><a href="#termtable">The TermTable</a></td></tr>
<tr><td>11.</td><td><a href="#indexlist">The IndexList</a></td></tr>-->
<tr><td>4.</td><td><a href="#samplevector">The Sample Vector</a> - Used for deduping at spider time.</td></tr>
<tr><td>5.</td><td><a href="#summaryvector">The Summary Vector</a> - Used for removing similar search results.</td></tr>
<tr><td>6.</td><td><a href="#gigabitvector">The Gigabit Vector</a> - Used for clustering results by topic.</td></tr>
<tr><td>7.</td><td><a href="#deduping">Deduping</a> - Preventing the indexing of duplicate pages.</td></tr>
<tr><td>8.</td><td><a href="#indexcode">Indexing Error Codes</a> - Reasons why a document failed to be indexed.</td></tr>
<tr><td>9.</td><td><a href="#docids">DocIds</a> - How DocIds are assigned.</td></tr>
<tr><td>10.</td><td><a href="#docidscaling">Scaling To Many DocIds</a> - How we scale to many DocIds.</td></tr>

</table></td></tr>
<!--subtable2end-->

<tr><td valign=top>E.</td><td><a href="#searchresultslayer">The Search Results Layer</a></td></tr>

<!--subtable2-->
<tr><td></td><td><table cellpadding=3>

<tr><td>1.</td><td><a href="#queryclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#query">The Query Parser</a></td></tr>
<tr><td>3.</td><td>Getting the Termlists</td></tr>
<tr><td>4.</td><td>Intersecting the Termlists</td></tr>
<tr><td>5.</td><td><a href="#raiding">Raiding Termlist Intersections</a></td></tr>
<tr><td>6.</td><td>The Search Results Cache</td></tr>
<tr><td>7.</td><td>Site Clustering</td></tr>
<tr><td>8.</td><td>Deduping</td></tr>
<tr><td>9.</td><td>Family Filter</td></tr>
<tr><td>10.</td><td>Gigabits</td></tr>
<tr><td>11.</td><td>Topic Clustering</td></tr>
<tr><td>12.</td><td>Related and Reference Pages</td></tr>
<tr><td>13.</td><td><a href="#spellchecker">The Spell Checker</a></td></tr>

</table></td></tr>
<!--subtable2end-->

<tr><td valign=top>F.</td><td><a href="#adminlayer">The Administration Layer</a></td></tr>

<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#adminclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#collections">Collections</a></td></tr>
<tr><td>3.</td><td><a href="#parms">Parameters and Controls</a></td></tr>
<tr><td>4.</td><td><a href="#log">The Log System</a></td></tr>
<tr><td>5.</td><td><a href="#clusterswitch">Changing Live Clusters</a></td></tr>
</table></td></tr>

<tr><td valign=top>G.</td><td><a href="#corelayer">The Core Functions Layer</a></td></tr>
<!--subtable2-->
<tr><td></td><td><table cellpadding=3>
<tr><td>1.</td><td><a href="#coreclasses">Associated Classes</a></td></tr>
<tr><td>2.</td><td><a href="#files">Files</a></td></tr>
<tr><td>3.</td><td><a href="#threads">Threads</a></td></tr>
</table></td></tr>

<tr><td valign=top>H.</td><td><a href="#filelist">File List</a> - List of all source code files with brief descriptions.</td></tr>

</table></td></tr>

<tr><td>XIV.</td><td><a href="#spam">Fighting Spam</a> - Search engine spam identification.</td></tr>

</table>

<br><br>


<a name="started"></a>
<hr><h1>Getting Started</h1>

Welcome to Gigablast. Thanks for supporting this open source software.

<a name="ssh"></a>
<br><br>
<h2>SSH - A Brief Introduction</h2>
Gigablast uses SSH extensively to administer and develop its software. SSH is a secure shell
connection that helps protect sensitive information over an insecure network. Internally, we
access our development machines with SSH, but being prompted
for a password when logging in between machines can become tiresome, and development gb
processes will not execute unless you can ssh without a password. This is where the authorized
keys file comes in. In your user directory you have a .ssh directory, and inside that directory
lie an authorized_keys file and a known_hosts file. The first thing to do is generate a
key for your user on the current machine. Remember to press &lt;Enter&gt; when asked for a
passphrase, otherwise you will still have to log in.<br>
<code><i> ssh-keygen -t dsa </i></code><br>
That should have generated an id_dsa.pub file and an id_dsa file. Now that you have a key, we need to
let all of our trusted hosts know it. Luckily, ssh provides a simple way to copy your key into
the appropriate files. When you run this command the machine will ask you to log in.<br>
<code><i> ssh-copy-id -i ~/.ssh/id_dsa.pub destHost</i></code><br>
Perform this operation on all of our development machines, replacing destHost with the appropriate
machine name. Once you have done this, try sshing into each machine. You should no longer be
prompted for a password. If you were asked for a password or passphrase, go through this section again,
but DO NOT ENTER A PASSPHRASE when ssh-keygen requests one. <b>NOTE: Make sure to copy the id to the
same machine you generated the key on. This is important for running Gigablast, since it needs to ssh
freely.</b>
<br>
<br>
<i>Host Key Has Changed and Now I Can Not Access the Machine</i><br>
This is easily remedied. If a machine crashes and the OS needs to be reinstalled for whatever
reason, the SSH host key for that machine will change, but your known_hosts file will still have the
old host key. This guards against man-in-the-middle attacks, but when we know the machine was rebuilt
we can simply correct the problem. Just open the ~/.ssh/known_hosts file with your favorite editor.
Find the line with the name or IP address of the offending machine and delete the entire line (the
known_hosts file separates host keys by newlines).<br>
<br><br>
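If you prefer not to hand-edit the file, the stale line can be removed from the command line. A sketch (the host name gf36 and key strings are made up for the demo; newer OpenSSH can also do this in one step with <code>ssh-keygen -R gf36</code>):

```shell
# Demo on a throwaway copy; point sed at ~/.ssh/known_hosts in real use.
printf 'gf36 ssh-rsa AAAAB3demo\ngf37 ssh-rsa AAAAC4demo\n' > /tmp/known_hosts.demo
sed -i '/^gf36[ ,]/d' /tmp/known_hosts.demo   # delete every key line for gf36
cat /tmp/known_hosts.demo                     # only the gf37 line remains
```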

<a name="shell"></a>
<hr><h1>Using the Shell</h1>
<!-- I'd like to nominate to remove a majority of this section - if you need help with bash at this point, you probably
   	shouldn't be poking around in the code... Maybe just keep the dsh and export command descriptions? SCC -->
Everyone uses the bash shell. The navigation keystrokes work in emacs, too. Here are the common commands:
<br><br>

<table cellpadding=3 border=1>
<tr><td>Command</td><td>Description</td></tr>
<tr><td><nobr>export LD_LIBRARY_PATH=/home/mwells/dynamicslibs/</nobr></td><td>Export the LD_LIBRARY_PATH variable, used to tell the OS where to look for dynamic libraries.</td></tr>
<tr><td>Ctrl+p</td><td>Show the previously executed command.</td></tr>
<tr><td>Ctrl+n</td><td>Show the next executed command.</td></tr>
<tr><td>Ctrl+f</td><td>Move the cursor forward.</td></tr>
<tr><td>Ctrl+b</td><td>Move the cursor backward.</td></tr>
<tr><td>Ctrl+a</td><td>Move the cursor to the start of the line.</td></tr>
<tr><td>Ctrl+e</td><td>Move the cursor to the end of the line.</td></tr>
<tr><td>Ctrl+k</td><td>Cut the buffer from the cursor forward.</td></tr>
<tr><td>Ctrl+y</td><td>Yank (paste) the buffer at the cursor location.</td></tr>
<tr><td>Ctrl+Shift+-</td><td>Undo the last keystrokes.</td></tr>
<tr><td>history</td><td>Show the list of recently executed commands. Edit /home/username/.bashrc to change the number of commands stored in the history. All are stored in the /home/username/.bash_history file.</td></tr>
<tr><td>!xxx</td><td>Execute command #xxx, where xxx is a number shown by the 'history' command.</td></tr>
<tr><td>ps auxww</td><td>Show all processes.</td></tr>
<tr><td>ls -latr</td><td>Show all files, reverse sorted by time.</td></tr>
<tr><td>ls -larS</td><td>Show all files, reverse sorted by size.</td></tr>
<tr><td>ln -s &lt;x&gt; &lt;y&gt;</td><td>Make directory or file y a symbolic link to x.</td></tr>
<tr><td>cat xxx | awk -F":" '{print $1}'</td><td>Show the contents of file xxx, but for each line, use : as a delimiter and print out the first token.</td></tr>
<tr><td>dsh -c -f hosts 'cat /proc/scsi/scsi'</td><td>Show all hard drives on all machines listed in the file <i>hosts</i>. -c means to execute this command concurrently on all those machines. dsh must be installed with <i>apt-get install dsh</i> for this to work. You can use double quotes in a single-quoted dsh command without problems, so you can grep for a phrase, for instance.</td></tr>
<tr><td>apt-cache search xxx</td><td>Search for a package to install. xxx is a space-separated list of keywords. Debian only.</td></tr>
<tr><td>apt-cache show xxx</td><td>Show details of the package named xxx. Debian only.</td></tr>
<tr><td>apt-get install xxx</td><td>Install a package named xxx. Must be root to do this. Debian only.</td></tr>
<tr><td>adduser xxx</td><td>Add a new user to the system with username xxx.</td></tr>
</table>
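A quick demonstration of the awk recipe from the table, run against a small sample file (the file name and its contents are made up for illustration):

```shell
# Build a tiny ':'-delimited sample file, then print field 1 of each line.
printf 'root:x:0:0:/root\nmwells:x:1000:1000:/home/mwells\n' > /tmp/awkdemo.txt
cat /tmp/awkdemo.txt | awk -F":" '{print $1}'
# prints:
# root
# mwells
```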

<br>


<a name="git"></a>
<hr><h1>Using Git</h1>

Git is what we use to do source code control. Git allows us to have many engineers making changes to a common set of files. The basic commands are the following:<br><br>
<table border=1 cellpadding=3>
<tr>
<td><nobr>git clone &lt;srcDir&gt; &lt;destDir&gt;</nobr></td>
<td>This will copy the git repository to the destination directory.</td>
</tr>
</table>

<p>More information is available at <a href=http://www.github.com/>github.com</a>.</p>
<br/>
<h3>Setting up GitHub on a Linux Box</h3>
To make things easier, you can set up GitHub access via ssh. Here's a quick list of commands to run
(assuming Ubuntu/Debian):
<pre>
sudo apt-get install git-core git-doc
git config --global user.name "Your Name"
git config --global user.email "your@email.com"
git config --global color.ui true
ssh-keygen -t rsa -C "your@email.com" -f ~/.ssh/git_rsa
cat ~/.ssh/git_rsa.pub
</pre>
Copy and paste the ssh-rsa output from the above command into your GitHub profile's list of SSH keys.
<pre>
ssh-add ~/.ssh/git_rsa
</pre>
If that gives you an error about being unable to connect to the ssh agent, run:
<pre>
eval `ssh-agent`
</pre>
Then test and clone!
<pre>
ssh -T git@github.com
git clone git@github.com:gigablast/open-source-search-engine
</pre>
<br><br>

<a name="hardware"></a>
<hr><h1>Hardware Administration</h1>

<h2>Hardware Failures</h2>
If a machine has a bunch of I/O errors in the log or the gb process is at a standstill, log in and type "dmesg" to see the kernel's ring buffer. The error that means the hard drive is fried is something like this:<br>
<pre>
mwells@gf36:/a$ dmesg | tail -3
scsi5: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 13 89 68 82 00 00 18 00 
Info fld=0x1389688b, Current sd08:34: sense key Medium Error
 I/O error: dev 08:34, sector 323627528
</pre>
If you do a <i>cat /proc/scsi/scsi</i> you can see what type of hard drives are in the server:<br>
<pre>
mwells@gf36:/a$ cat /proc/scsi/scsi 
Attached devices: 
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi  Model: HDS724040KLSA80  Rev: KFAO
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi  Model: HDS724040KLSA80  Rev: KFAO
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi4 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi  Model: HDS724040KLSA80  Rev: KFAO
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi5 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi  Model: HDS724040KLSA80  Rev: KFAO
  Type:   Direct-Access                    ANSI SCSI revision: 03
</pre>
<!-- Should most of the following be kept internally at GB? Not sure it would help with open-source users... SCC -->
So for this error we should replace the <b>rightmost</b> hard drive with a spare hard drive. Usually we have spare hard drives floating around. You will know by looking at colo.html where all the equipment is stored.
<br><br>
If a drive is not showing up in /proc/scsi/scsi when it should be, it may be a bad sata channel on the motherboard, but sometimes you can see it again if you reboot a few times.
<br><br>
Often, if a motherboard fails, or a machine just stops responding to pings and cannot be revived, we designate it as a junk machine and replace its good hard drives with bad ones from other machines. Then we just need to fix the junk machine.
<br><br>
At some point we email John Wang from TwinMicro, or Nick from ACME Micro, to fix/RMA the gb/g/gf machines and gk machines respectively. We have to make sure there are plenty of spares to deal with spikes in the hardware failure rate.
<br><br>
Sometimes you can not ssh to a machine but you can rsh to it. This is usually because a drive is bad and ssh can not access some files it needs to log you in. And sometimes you can ping a machine but cannot rsh or ssh to it, for the same reason. Recently we had some problems with some machines not coming back up after reboot; that was due to a flaky 48-port NetGear switch.
<br><br>
<h2>Replacing a Hard Drive</h2>
When a hard drive goes bad, make sure it is labelled as such so we don't end up reusing it. If it is the leftmost drive slot (drive #0) then the OS will need to be reinstalled using a boot cd. Otherwise, we can just <b>umount /dev/md0</b> as root and rebuild ext2 using the command in /home/mwells/gbsetup2, <b>mke2fs -b4096 -m0 -N20000 -R stride=32 /dev/md0</b>. And to test the hard drives thoroughly, after the mke2fs, run <i>nohup badblocks -b4096 -p0 -c10000 -s -w -o/home/mwells/badblocks /dev/md0 >& /home/mwells/bbout </i>. Sometimes you might even need to rebuild the software raid with <b>mkraid -c /etc/raidtab --really-force /dev/md0</b> before you do that. Once the raid is repaired, mount /dev/md0 and copy over the data from its twin. On the newly repaired host use <b>nohup rcp -r gbXX:/a/* /a/ &</b> where gbXX is its twin that still has all the data. DO NOT ERASE THE TWIN'S DATA. You will also have to do <b>rcp -r gbXX:/a/.antiword /a/</b> because rcp does not pick that directory up.
<br><br>
<h2>Measuring Drive Performance</h2>
Slow drives are sometimes worse than broken drives. Run <b>./gb thrutest &lt;directoryToWriteTo&gt; 10000000000</b> to make gigablast write out 10GB of files and then read from them. This is also a good test for checking a drive for errors. Each drive should have a write speed of about 39MB/s for a HITACHI, so the whole raid should be doing 150MB/s writing, and close to 200MB/s reading. Fuller drives will not do so well because of fragmentation. If the performance is less than it should be, then try mounting each drive individually and seeing what speed you get, to isolate the culprit. Sometimes you can have the drive re-seated (just pulling it out and plugging it back in) and that does the trick. A soft reboot does not seem to work. I don't know if a power cycle works yet or not.
<br><br>
<h2>Measuring Memory Performance</h2>
Run <b>membustest 50000000 50</b> to read 50MB from main memory 50 times to see how fast the memory bus is. It should be at least 2000MB/sec, and 2500MB/sec for the faster servers. Partap also wrote <b>gb memtest</b> to see the maximum amount of memory that can be allocated, in addition to testing memory bus speeds. The max memory that can be allocated seems to be about 3GB on our 2.4.31 kernel.
<br><br>
<h2>Measuring Network Performance</h2>
Use <b>thunder 50000000</b> on the server with ip address a.b.c.d and <b>thunder 50000000 a.b.c.d</b> on the client to see how fast the client can read UDP packets from the server. Gigablast needs about a Gigabit between all machines, and they must all be on a Gigabit switch. Sometimes the 5013CM-T resets the Gigabit to 100Mbps and causes Gigablast to basically come to a halt. You can detect that by doing a <b>dmesg | grep e1000 | tail -1</b> on each machine to see what speed its network interface is running at.
<br><br>

<a name="kernels"></a>
<hr><h1>Kernels</h1>
Gigablast runs well on the Linux 2.4.31 kernel, which resides in /gb/kernels/2.4.31/. The .config file is in that directory and named <b>dotconfig</b>. It does have the marvel patch, mvsata340_2.4.patch.bz2, applied so Linux can see the Marvel chip that drives the 4 sata devices on our SuperMicro 5013CM-T 1U servers. In addition we comment out the 3 'offline' statements in drivers/scsi/scsi_error.c in the kernel source, and raised the 3 TIMEOUT values to 30. These changes allow the kernel to deal with the 'drive offlined' messages that the Hitachi drives' SCSI interface gives every now and then. We still set the 'ghost lun' thing in the kernel source to 10, but I am not sure if this actually helps, though.
<br><br>
Older kernels had problems with unnecessary swapping and also with crashing, most likely due to the e1000 driver, which drives the 5013CM-T's Gigabit ethernet functionality. All issues seem to be worked out at this point, so any kernel crash is likely due to a hardware failure, like bad memory or disk.
<br><br>
I would like to have larger block sizes on the ext2fs filesystem, because 4KB is just not good for fragmentation purposes. We use mostly huge files on our partitions, raid-0'ed across 4 400GB Sata drives, so it would help if we could use a block size of a few megabytes even. I've contacted Theodore Ts'o, the guy who wrote the e2fs tools, and he may do it for a decent chunk of cash.
<br><br>


<a name="coding"></a>
<hr><h1>Coding Conventions</h1>

When editing the source please do not wrap lines past 80 columns. Also,
please make an effort to avoid excessive nesting; it makes for hard-to-read
code and can almost always be avoided.
<br><br>
Classes and files are almost exclusively named using capitalized letters for
the beginning of each word, like MyClass.cpp. Global variables start with g_,
static variables start with s_ and member variables start with m_. In order
to save vertical space and make things more readable, left curly brackets ({'s)
are always placed on the same line as the class, function or loop declaration.
In pointer declarations, the * sticks to the variable name, not the type, as
in void *ptr. All variables are to be named in camel case, as in lowerCamelCase.
Underscores in variable names are forbidden, with the exception of the g_, s_ and m_ prefixes, of course.
<br><br>
And try to avoid lots of nested curly brackets. Usually you do not have to nest
more than 3 levels deep if you write the code correctly.
<br><br>
Using the *p, p++ and p &lt; pend mechanisms is generally preferred to using the
p[i], i++ and i &lt; n mechanisms. So stick to that whenever possible.
<br><br>
Using ?'s in the code, like a = b ? d : e;, is also forbidden because it is too cryptic and easy to mess up.
<br><br>
Upon error, g_errno is set and the conditions causing the error are logged.
Whenever possible, Gigablast should recover after an error and continue running
gracefully.
<br><br>
Gigablast uses many non-blocking functions that return false if they block,
or true if they do not block. These types of functions will often set the
g_errno global error variable on error. These types of functions almost always
take a pointer to the callback function and the callback state as arguments.
The callback function will be called upon completion of the function.
<br><br>
Gigablast no longer uses Xavier Leroy's pthread library (LinuxThreads) because
it was buggy. Instead, the Linux clone() system call is used. That function
is pretty much exclusively Linux, and was the basis for pthreads anyway. It
uses an older version of libc which does not contain thread support for the
errno variable, so, like the old pthreads library, Gigablast's Threads.cpp
class has implemented the _errno_location() function to work around that. This
function provides a pointer to a separate thread's (process's) own errno
location so we do not contend for the same errno.
<br><br>
We also make an effort to avoid deep subroutine calls. Having to trace the
call stack down several functions is time-consuming and should be avoided.
<br><br>
Gigablast also tries to make a consistent distinction between size and length.
Length usually refers to a string length (like strlen()) and does not include
the NULL terminator, if any. Size is the overall size of a data blob,
and includes everything.
<br><br>
When making comments please do not insert the "//" at the very first character
of the line in the first column of the screen. It should be indented just
like any other line.
<br><br>
Whenever possible, please mimic the currently used styles and conventions,
and please avoid using third party libraries. Even the standard template
library has serious performance/scalability issues and should be avoided.
<br><br>

<a name="debug"></a>
<hr><h1>Debugging Gigablast</h1>

<h2>General Procedure</h2>
When Gigablast cores you generally want to use gdb to look at the core and see where it happened. You can use <b>git log -- &lt;filename&gt;</b> to see if the class or file that cored was recently edited. You may be looking at a bug that was already fixed and only need to update to the latest version.

<h2>Using GDB</h2>

<p>We generally use gdb to debug Gigablast. If you are running gb, the gigablast process, under gdb and it receives a signal, gdb will break, and you have to tell it to ignore the signal from now on by typing <b>handle SIG39 nostop noprint</b> for signal 39, at least, and then continue execution by typing 'c' and enter. When debugging a core on a customer's machine you might have to copy your version of gdb over to it, if they don't have one installed.</p>

<p>There is also a /gb/bin/gdbserver that you can use to debug a remote gb process, although no one really uses this except Partap, who used it a few times.</p>

<p>The most common way to use gdb is:
<ul>
<li><b>gdb ./gb</b> to start up gdb.
<li><b>run -c ./hosts.conf 0</b> to run gb under gdb.
<li><b>Ctrl-C</b> to break gb.
<li><b>where</b> to make gdb print out the call stack.
<li><b>break MyClass.cpp:123</b> to break gb at line 123 in MyClass.cpp.
<li><b>print &lt;variableName&gt;</b> to print out the value of a variable.
<li><b>watch &lt;variableName&gt;</b> to set a watchpoint so gdb breaks if the variable changes value.
<li><b>set &lt;variableName&gt; &lt;value&gt;</b> to set the value of a variable.
<li><b>set print elements 100000</b> to tell gdb to print out the first 100000 bytes when printing strings, as opposed to limiting it to the default 200 or so bytes.
</ul>
</p>

<p>You can also use gdb to do <i>poor man's profiling</i> by repeatedly attaching gdb to the gb pid like <b>gdb ./gb &lt;pid&gt;</b> and seeing where it is spending its time. This is a fairly effective random sampling technique.</p>

<p>If a gb process goes into an infinite loop you can get it to save its in-memory data by attaching gdb to its pid and typing <b>print mainShutdown(1)</b>, which tells gdb to run that function, which saves all of gb's data to disk so you don't end up losing data.</p>

<p>To debug a core, type <b>gdb ./gb &lt;coreFilename&gt;</b>. Then you can examine why gb core dumped. Please copy the gb binary and move the core to another filename if you want to preserve the core for another engineer to look at. You need to use the exact gb binary that produced the core in order to analyze it properly.
</p>

<p>It is useful to have the following .gdbinit file in your home directory:
<pre>
set print elements 100000
handle SIG32 nostop noprint
handle SIG35 nostop noprint
handle SIG39 nostop noprint
set overload-resolution off
</pre>
The overload-resolution setting gets in the way when trying to print the return value of some functions, like uccDebug() for instance.
</p>

<h2>Using Git to Help Debug</h2>
<a href="#git">Git</a> can be a useful debugging aid. By doing <b>git log</b> you can see the latest changes made to gb. This may allow you to isolate the bug. <b>git diff HEAD~1 HEAD</b> will list the actual changes made by the last commit, instead of just the commit messages. Substituting '1' with '2' will show the changes from the last two commits, etc.
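For example, in a scratch repository (the file name Msg3.cpp and the commit messages are invented for the demo):

```shell
# Set up a throwaway repo so the commands can be run anywhere.
cd "$(mktemp -d)" && git init -q .
git -c user.name=demo -c user.email=d@e commit -q --allow-empty -m "first"
echo "// fix" > Msg3.cpp && git add Msg3.cpp
git -c user.name=demo -c user.email=d@e commit -q -m "second"

git log -2 --oneline          # the last two commit messages
git diff --stat HEAD~1 HEAD   # files changed by the last commit
```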

<h2>Memory Leaks</h2>
All memory allocations, malloc, calloc, realloc and new, go through Gigablast's Mem.cpp class. That class has a hash table of pointers to any outstanding allocated memory, including the size of each allocation. When Gigablast allocates memory it will allocate a little more than requested so it can pad the memory with special bytes, called magic characters, that are verified when the memory is freed to see if the buffer was overwritten or underwritten. In addition to doing these boundary checks, Gigablast will also print out all the unfreed memory at exit. This allows you to detect memory leaks. The only way we can leak memory without detection now is if it is leaked in a third party library which Gigablast links to.

<h2>Random Memory Writes</h2>
This is the hardest problem to track down. Where did that random memory write come from? Usually RdbTree.cpp uses the most memory, so if a stray write happens it will corrupt the tree. To fight this you can protect the RdbTree by forcing RdbTree::m_useProtection to true in RdbTree::set(). This will make Gigablast protect the memory when it is not being accessed explicitly by RdbTree.cpp code. It does this by calling mprotect(), which can slow things down quite a bit.

<h2>Corrupted Data In Memory</h2>
Try to reproduce the corruption under gdb immediately before it occurs, then set a watchpoint on the memory that was getting corrupted. Try to determine the start of the corruption in memory; this may lead you to a buffer that was overflowing. Oftentimes a gb process will core in RdbTree.cpp, which never happens unless the machine has bad RAM. Bad RAM also causes gb to go into an infinite loop in Msg3.cpp's RdbMap logic. In these cases try replacing the memory or using another machine.

<h2>Common Core Dumps</h2>
<table>
<tr><td>Problem</td><td>Cause</td></tr>
<tr><td>Core in RdbTree</td><td>Probably bad RAM</td></tr>
</table>

<h2>Common Coding Mistakes</h2>
<table>
<tr><td>1.</td><td>Calling atoip() twice or more in the same printf() statement. atoip() outputs into a single static buffer and cannot be shared this way.</td></tr>
<tr><td>2.</td><td>Not calling class constructors and destructors when mallocing/freeing class objects. A class needs to initialize properly after allocation and free any allocated memory before it is freed.</td></tr>
</table>



<br><br>
<a name="sa"></a>
<hr><h1>System Administration</h1>




<a name="code"></a>
<hr><h1>Code Overview</h1>
Gigablast consists of about 500,000 lines of C++ source. The latest gb version being developed as of this writing is /gb/datil/. Gigablast is a single executable, gb, that contains all the functionality needed to run a search engine. Gigablast attempts to uniformly distribute every function amongst all hosts in the network. Please see <a href="overview.html">overview.html</a> for a general overview of the search engine from a USER perspective.
<br><br>
Matt started writing Gigablast in 2000 with the intention of making it the most efficient search engine in the world. A lot of attention was paid to making every piece of the code fast and scalable, while still readable. In July 2013, the majority of the search engine source code was open sourced. Gigablast continues to develop and expand the software, as well as offer consulting services for companies wishing to run their own search engine.
<br><br>
<hr>

<a name="loop"></a>
<h1>The Control Loop Layer</h1>
Gigablast uses a <b>signal-based</b> architecture, as opposed to a thread-based architecture. At the heart of the Gigablast process is the I/O control loop, contained in the <a href="../latest/src/Loop.cpp">Loop.cpp</a> file. Gigablast sits in an idle state until it receives a signal on a file or socket descriptor, or until a "thread" exits and sends a signal to be cleaned up. Threads are only used when necessary, which is currently for performing disk I/O, for intersecting long lists of docIds to generate search results (which can block the CPU for a while) and for merging database data from multiple files into a single list.
<br><br>
<a href="#threads">Threads</a> are not only hard to debug, but can make everything more latent. Suppose you have 10 requests coming in, each one requiring 1 second of CPU time. If you thread them all, it will take 10 seconds before any query can be satisfied and its results returned, but if you handle them FIFO (one at a time) then the first request is satisfied in 1 second, the second in 2 seconds, etc. The Threads.cpp class handles the use of threads and uses the Linux clone() call. Xavier Leroy's pthreads were found to be too buggy at the time.
<br><br>
Loop.cpp also allows other classes to call its registerCallback() function, which takes a file or socket descriptor and a callback, and will call that callback when data is ready for reading or writing on that descriptor. Gigablast accomplishes this with Linux's fcntl() call, which you can use to tell the kernel to connect a particular file descriptor to a signal. The heart of gb is really sigtimedwait(), which sits there idle waiting for signals. When one arrives it breaks out, and the callback responsible for the associated file descriptor is called.
<br><br>
In the event that the kernel's queue of ready file descriptors is full, the kernel will deliver a SIGIO to the gb process, which then performs a poll over all the registered file descriptors.
<br><br>
Gigablast will also catch other signals, like segmentation faults and HUP signals, and save its data to disk before exiting. That way we do not lose data because of a core dump or an accidental HUP signal.
<br><br>
Using a signal-based architecture means that we have to work with callbacks a lot. Everything is triggered by an <i>event</i>, similar to GUIs, but our events are not mouse related; they are disk and network related. This naturally leads to the concept of a callback, which is a function to be called when a task is completed.
So you might call a function like <b>readData ( myCallback , myState , readbuf , bytesToRead )</b> and Gigablast will perform the requested read operation and call <b>myCallback ( myState )</b> when it is done. This means you have to worry about state, whereas in a thread-based architecture you do not.
<br><br>
Furthermore, all callbacks must be expressed as pointers to <b>C</b> functions, not C++ functions. So we call these C functions wrapper functions and have them immediately call their C++ counterparts. You'll see these wrapper functions clearly illustrated throughout the code.
<br><br>
Gigablast also has support for asynchronous signals. You can tell the kernel to deliver you an asynchronous signal on an I/O event, and that signal, also known as an interrupt, will interrupt whatever the CPU is doing and call gb's signal handler, sigHandlerRT(). Be careful: you cannot allocate memory when in a real-time/asynchronous signal handler. You should only call functions whose names end in _ass, which stands for asynchronous-signal safe. The global instance of the UdpServer called g_udpServer2 is based on these asynchronous signals, mostly set up for doing fast pings without having to worry about what the CPU is doing at the time, but it really does not interrupt like it should, most likely due to kernel issues. It doesn't seem any faster than the non-asynchronous UDP server, g_udpServer.
<br><br>This signal-based architecture also means that a lot of functions return false if they block, meaning they never completed because they have to wait on I/O, or return true and set g_errno on error. So if a function returns false on you, generally, you should return false, too.
<br><br>
The Loop class is primarily responsible for calling callbacks when an event
occurs on a file/socket descriptor. Loop.cpp also calls callbacks that
registered with it by calling registerSleepCallback(). Those sleep callbacks
are called every X milliseconds, or more than X milliseconds if load is 
extreme. Loop also responds to SIGCHLD signals, which are delivered when a thread
exits. The only socket descriptor related callbacks are for TcpServer and
for UdpServer. UdpServer and TcpServer each have their own routine for calling
callbacks for an associated socket descriptor. But when the system is under
heavy load we must be careful with which callbacks we call. We should call the
highest priority (lowest niceness) callbacks before calling any of the 
lowest priority (highest niceness) callbacks. 
To solve this problem we require that every callback registered with Loop.cpp
return after 2 ms or more have elapsed, if possible. If a callback still needs to
be called by Loop.cpp to do more work it should return true at that time,
otherwise it should return false. We do this in order to give higher priority
callbacks a chance to be called. This is similar to the low latency patch in
the Linux kernel. Actually, low priority callbacks will try to break out after 
2 ms or more, but high priority callbacks will not. Loop::doPoll() just calls
low priority callbacks.



<a name="database"></a>
<hr><h1>The Database Layer</h1>
Gigablast has the following databases, each of which contains an instance of the 
Rdb class: 
<pre>
Indexdb     - Used to hold the index.
Datedb      - Like indexdb, but its scores are dates.
Titledb     - Used to hold cached web pages.
Spiderdb    - Used to hold urls sorted by their scheduled time to be spidered.
Checksumdb  - Used for preventing spidering of duplicate pages.
Sitedb      - Used for classifying webpages. Maps webpages to rulesets.
Clusterdb   - Used to hold the site hash, family filter bit, language id of a document.
Catdb       - Used to classify a document using DMOZ.
</pre>

<a name="rdb"></a>
<h2>Rdb</h2>
<b>Rdb.cpp</b> implements a record-based database. Each record has a key, an optional 
blob of data, and an optional long (4 bytes) that holds the size of the blob of 
data. If present, the size of the blob is specified right after the key. The 
key is 12 bytes and of type key_t, as defined in the types.h file.

<a name="dbclasses"></a>
<h3>Associated Classes (.cpp and .h files)</h3>

<table cellpadding=3 border=1>
<tr><td>Checksumdb</td><td>DB</td><td>Rdb that maps a docId to a checksum for an indexed document. Used to dedup same content from the same hostname at build time.</td></tr>
<tr><td>Clusterdb</td><td>DB</td><td>Rdb that maps a docId to the hash of a site and its family filter bit and, optionally, a sample vector used for deduping search results. Used for site clustering, family filtering and deduping at query time.</td></tr>
<tr><td>Datedb</td><td>DB</td><td>Like indexdb, but its <i>scores</i> are 4-byte dates.</td></tr>
<tr><td>Indexdb</td><td>DB</td><td>Rdb that maps a termId to a score and docId pair. The search index is stored in Indexdb.</td></tr>
<tr><td>MemPool</td><td>DB</td><td>Used by RdbTree to add new records to the tree without having to do an individual malloc.</td></tr>
<tr><td>MemPoolTree</td><td>DB</td><td>Unused. Was our own malloc routine.</td></tr>
<tr><td>Msg0</td><td>DB</td><td>Fetches an RdbList from across the network.</td></tr>
<tr><td>Msg1</td><td>DB</td><td>Adds all the records in an RdbList to various hosts in the network.</td></tr>
<tr><td>Msg3</td><td>DB</td><td>Reads an RdbList from several consecutive files in a particular Rdb.</td></tr>
<!--<tr><td>Msg34</td><td>DB</td><td>Determines least loaded host in a group (shard) of hosts.</td></tr>
<tr><td>Msg35</td><td>DB</td><td>Merge token management functions. Currently does not work.</td></tr>-->
<tr><td>Msg5</td><td>DB</td><td>Uses Msg3 to read RdbLists from multiple files and then merges those lists into a single RdbList. Does corruption detection and repair. Integrates the list from the RdbTree into the single RdbList.</td></tr>
<tr><td>MsgB</td><td>DB</td><td>Unused. A distributed cache for caching anything.</td></tr>
<tr><td>Rdb</td><td>DB</td><td>The core database class from which all are derived.</td></tr>
<tr><td>RdbBase</td><td>DB</td><td>Each Rdb has an array of RdbBases, one for each collection. Each RdbBase has an array of BigFiles for that collection.</td></tr>
<tr><td>RdbCache</td><td>DB</td><td>Can cache RdbLists or individual Rdb records.</td></tr>
<tr><td>RdbDump</td><td>DB</td><td>Dumps the RdbTree to an Rdb file. Also used by RdbMerge to dump the merged RdbList to a file.</td></tr>
<tr><td>RdbList</td><td>DB</td><td>A list of Rdb records.</td></tr>
<tr><td>RdbMap</td><td>DB</td><td>Maps an Rdb key to an offset into an RdbFile.</td></tr>
<tr><td>RdbMem</td><td>DB</td><td>Memory manager for RdbTree so it does not have to allocate space for every record in the tree.</td></tr>
<tr><td>RdbMerge</td><td>DB</td><td>Merges multiple Rdb files into one Rdb file. Uses Msg5 and RdbDump to do the reading and writing respectively.</td></tr>
<tr><td>RdbScan</td><td>DB</td><td>Reads an RdbList from an RdbFile; used by Msg3.</td></tr>
<tr><td>RdbTree</td><td>DB</td><td>A binary tree of Rdb records. All collections share a single RdbTree, so the collection number is specified for each node in the tree.</td></tr>
<tr><td>SiteRec</td><td>DB</td><td>A record in Sitedb.</td></tr>
<tr><td>Sitedb</td><td>DB</td><td>An Rdb that maps a url to a Sitedb record which contains a ruleset to be used to parse and index that url.</td></tr>
<tr><td>SpiderRec</td><td>DB</td><td>A record in Spiderdb.</td></tr>
<tr><td>Spiderdb</td><td>DB</td><td>An Rdb whose records are urls sorted by the times they should be spidered. The key contains other information, like whether the url is <i>old</i> or <i>new</i> to the index, and the priority of the url, currently from 0 to 7.</td></tr>
<tr><td>TitleRec</td><td>DB</td><td>A record in Titledb.</td></tr>
<tr><td>Titledb</td><td>DB</td><td>An Rdb where the records are basically compressed web pages, along with other info like the quality of the page. Contains an instance of the LinkInfo class.</td></tr>
<!--<tr><td>Urldb</td><td>DB</td><td>An Rdb whose records indicate if a url is in spiderdb or what particular Titledb BigFile contains the url.</td></tr>-->
</table>

<br><br>

<a name="addingrecs"></a>
<h3>Adding a Record to Rdb</h3>
<a name="rdbtree"></a>
When a record is added to an Rdb it is housed in a binary tree, 
<b>RdbTree.cpp</b>,
whose size is configurable via gb.conf. For example, &lt;indexdbMaxTreeSize&gt;
specifies how much memory in bytes the tree for Indexdb can use. When the 
tree is 90% full it is dumped to a file on disk. The records are dumped in 
order of their keys. When enough files have accrued on disk, a merge is 
performed to keep the number of files down and responsiveness up. 
<br><br>
Each file, called a BigFile and defined by BigFile.h, can actually consist of 
multiple physical files, each limited in size to 512MB. In this manner 
Gigablast overcomes the 2GB file limit imposed by some kernels or disk systems.
Each physical file in a BigFile, after the first file, has a ".partX" 
extension added to its filename, where X ranges from 1 to infinity. Throughout
this document, "BigFile" is used interchangeably with "file".
<br><br>
If the tree cannot accommodate a record add, it will return an ETRYAGAIN error.
Typically, most Gigablast routines will wait one second before retrying the
add.
<br>

<h3>Dumping an Rdb Tree</h3>
The RdbDump class is responsible for dumping the RdbTree class to disk. It
dumps a little bit of the tree at a time because it can take a few hundred
milliseconds to gather a large list of records from the tree, 
especially when the records are dataless (just keys). Since RdbTree::getList() 
is not threaded it is important that RdbDump gets a little bit at a time so 
it does not block other operations.
<br>

<a name="mergingfiles"></a>
<a name="rdbmerge"></a>
<h3>Merging Rdb Files</h3>
Rdb merges just enough files to keep the number of files at or below the 
threshold specified in gb.conf, which is &lt;indexdbMaxFilesToMerge&gt; for 
Indexdb, for example. <b>RdbMerge.cpp</b> is used to control the merge logic. 
The merge chooses which files to merge so as to minimize 
the amount of disk writing over the long term. It is important to keep the 
number of files low, so that any time a key range of records is requested, the
number of disk seeks will be low and the database will be responsive.
<br><br>
When too many files have accumulated on disk, the database will enter "urgent 
merge mode". It will note this in the log when it happens. When that happens, 
Gigablast will not dump the corresponding RdbTree to disk because
that would create too many files. If the RdbTree is full then any
attempt to add data to it while Gigablast is in "urgent merge mode" will fail
with an ETRYAGAIN error. These error replies are counted on a per-host basis
and displayed in the Hosts table. At this point the host is considered to be
a spider bottleneck, and most spiders will be stuck waiting to add data to
the host.
<br><br>
If Gigablast is configured to "use merge tokens" (which no longer works for some reason and has since been disabled), then any file merge operation
may be postponed if another instance of Gigablast is performing a merge on the
same computer's IDE bus, or if the twin of the host is merging. This is done
mostly for performance reasons. File merge operations tend to increase disk
access times, so having a handy twin and not bogging down an IDE bus allows
Gigablast's load balancing algorithms to redirect requests to hosts that are
not involved with a big file merge.
<br>

<a name="deletingrecs"></a>
<h3>Deleting Rdb Records</h3>
The low bit of the 12-byte key, key_t, is called the delbit. It is 0 if the
key is a "negative key"; it is 1 if the key is a "positive key". When 
performing a merge, a negative key may collide with its positive counterpart,
the two annihilating one another. Therefore, a delete is performed by taking the
key of the record you want to delete, changing the low bit from a 1 to a 0,
and then adding that key to the database. Negative keys are always dataless.
<br>

<a name="readingrecs"></a>
<h3>Reading a List of Rdb Records</h3>
When a user requests a list of records, Gigablast will read the records
in the key range from each file. If no more than X bytes of records are
requested, then Gigablast will read no more than X bytes from each file. After
the reading, it merges the lists from each file into a final list. During this
phase it will also annihilate positive keys with their negative counterparts.
<br><br>
<a name="rdbmap"></a>
In order to determine where to start reading in each file for the database,
Gigablast uses a "map file". The map file, encompassed by <b>RdbMap.cpp</b>,  
records the key of the first record
that occurs on each disk page. Each disk page is PAGE_SIZE bytes, currently
16k. In addition to recording the key of the first record that starts on
each page, it records a 2-byte offset of the key into that page. The map file
is represented by the RdbMap class.
<br><br>
The Msg3::readList() routine is used to retrieve a list of Rdb records from 
a local Rdb. Like most Msg classes, it allows you to specify a callback
function and callback data in case it blocks. The Msg5 class contains the Msg3
class and extends it with the error correction (discussed below) and caching
capabilities. Calling Msg5::getList() also allows you to incorporate the
lists from the RdbTree in order to get realtime data. Msg3 is used mostly
by the file merge operation.
<br>

<a name="errorcorrection"></a>
<h3>Rdb Error Correction</h3>
Every time a list of records is read, either for answering a query or for
doing a file merge operation, it is checked for corruption. Right now
just the order of the keys of the records, and whether those keys are out of the
requested key range, are checked. Later, checksums may
be implemented as well. If some keys are found to be out of order or out
of the requested key range, the request
for the list is forwarded to that host's twin. The twin is a mirror image. 
If the twin has the data intact, it returns its data across the network.
However, the twin will return all the keys it has in that range, not necessarily just those from a single file, so we can end up patching the corrupted data with a list that is hundreds of times bigger.
Since this
procedure also happens during a file merge operation, merging is a way of
correcting the data.
<br><br>
If there is no twin available, or the twin's data is corrupt as well, then
Gigablast attempts to excise the smallest amount of data so as to regain
integrity. Any time corruption is encountered it is noted in the log with
a message like:
<pre>
1145413139583 81 WARN  db     [31956]  Key out of order in list of records.
1145413139583 81 WARN  db     [31956]  Corrupt filename is indexdb1215.dat.
1145413139583 81 WARN  db     [31956]  startKey.n1=af6da55 n0=14a9fe4f69cd6d46 endKey.n1=b14e5d0 n0=8d4cfb0deeb52cc3
1145413139729 81 WARN  db     [31956]  Removed 0 bytes of data…
