=== LOGS ===

On nyarlathotep we generate logs from packet captures of SQL hitting the
database. These files look like this:

    1237237351.064750	10.2.231.65:40784	sim	QueryStart
    SELECT owner_id, is_owner_group FROM parcel WHERE
    parcel_id='9d50d6eb-a623-2a30-a8e0-840189fabff7'
    **************************************
    1237237351.064759	10.0.0.172:56714	web	Quit
    Quit
    **************************************
    1237237351.064861	10.2.231.65:40784	sim	QueryResponse
    SELECT owner_id, is_owner_group FROM parcel WHERE
    parcel_id='9d50d6eb-a623-2a30-a8e0-840189fabff7'
    **************************************
    1237237351.065393	10.6.6.97:39706	sim	QueryStart

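Each record is a tab-separated header line (timestamp, client host:port,
connection class, event type) followed by the event body, with records
separated by a line of asterisks. A minimal sketch of a reader for this
format (the function and names here are illustrative, not the actual
apiary code):

    import sys

    def read_events(stream):
        """Yield (timestamp, client, source, event, body) tuples from a log."""
        header, body = None, []
        for line in stream:
            line = line.rstrip('\n')
            if line.startswith('****'):
                # Record separator: emit the event accumulated so far.
                if header is not None:
                    time, client, source, event = header.split('\t')
                    yield float(time), client, source, event, '\n'.join(body)
                header, body = None, []
            elif header is None:
                header = line
            else:
                body.append(line)
        if header is not None:
            # The last record may not be followed by a separator line.
            time, client, source, event = header.split('\t')
            yield float(time), client, source, event, '\n'.join(body)

    if __name__ == '__main__':
        for time, client, source, event, body in read_events(sys.stdin):
            print('%f  %s  %s' % (time, client, event))
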
These logs allow us to replay sequences of SQL by connection.  However, the
"Quit" indicator that shows when a sequence ends isn't always present. Once
you have a log, you need to generate the missing ends (assumed to be just
after the last SQL on a connection). To do this, run something like:

    gzcat query.log.21.gz | python sqllog_genends.py > query.log.21-ends

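Conceptually, generating ends is one pass that remembers the last event on
each connection and, for any connection that never sent a Quit, emits a
synthetic Quit just after that last event. A rough sketch of the idea (not
the actual sqllog_genends.py; the one-microsecond offset and output format
are assumptions):

    import sys

    last_seen = {}    # client host:port -> (last timestamp, source)
    quit_seen = set()

    for line in sys.stdin:
        parts = line.rstrip('\n').split('\t')
        if len(parts) != 4:
            continue  # body or separator line, not an event header
        time, client, source, event = parts
        last_seen[client] = (float(time), source)
        if event == 'Quit':
            quit_seen.add(client)

    # Emit synthetic ends, sorted by time so the output is itself mergeable.
    for client, (time, source) in sorted(last_seen.items(), key=lambda kv: kv[1][0]):
        if client not in quit_seen:
            print('%f\t%s\t%s\tQuit' % (time + 0.000001, client, source))
            print('Quit')
            print('*' * 38)

A float timestamp is precision-limited at this scale, which is presumably
why the repo carries a dedicated timestamp module (seconds + microseconds,
see MISC below).
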
The log and the end files need to be merged for any operation; however, the
scripts all take multiple input files and will do a sorted merge as they
process.  Since the source is almost always gzip'd, you can use a dash to
make stdin one of the merged inputs:

    gzcat query.log.21.gz | python sqllog_stats.py - query.log.21-ends

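The sorted merge these scripts do is a classic k-way merge of streams that
are already sorted by timestamp; heapq.merge from the standard library
expresses the same thing. A sketch of the idea, shown for header lines only
(mergetools in this repo is the general version; names here are
illustrative):

    import gzip
    import heapq
    import sys

    def headers(path):
        """Yield (timestamp, header line) pairs from one time-sorted log."""
        if path == '-':
            stream = sys.stdin
        elif path.endswith('.gz'):
            stream = gzip.open(path, 'rt')
        else:
            stream = open(path)
        for line in stream:
            parts = line.rstrip('\n').split('\t')
            if len(parts) == 4:
                yield float(parts[0]), line.rstrip('\n')

    # heapq.merge assumes each input is already sorted, which these logs are.
    for time, header in heapq.merge(*(headers(p) for p in sys.argv[1:])):
        print(header)
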
The above also shows that you can generate statistics about a stream of events
(or several streams merged) with the sqllog_stats.py script.

Warning: These log files are VERY large.


=== HIVE ===
You can play around with a simple test of hive (one that does no work but
passes around strings):

    python hive_stats.py --central --workers 5

You can play around with settings there and explore the hive framework.
This test program could be improved, I suppose, and might help us determine
the proper thread, fork, and machine counts (see below).

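The shape of that test, a central process handing strings to N workers that
do nothing with them, can be approximated with the standard library. A toy
stand-in (this is not the hive API, just the same shape):

    import multiprocessing

    def worker(job):
        # Do no work: hand the string straight back, like the smoke test.
        return job

    if __name__ == '__main__':
        jobs = ['job-%d' % i for i in range(100)]
        with multiprocessing.Pool(processes=5) as pool:
            results = pool.map(worker, jobs)
        print('passed %d strings through 5 workers' % len(results))
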

=== MISC ===

There are some generally useful python modules here:

mergetools - iterate over a merge of multiple sorted sequences in order
stattools - a statistics-gathering counter
timestamp - a representation of time in seconds + microseconds format
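
For a sense of the flavor, a minimal statistics-gathering counter in the
spirit of stattools (the real module's API may differ):

    class StatCounter:
        """Accumulate count, sum, min, max, and mean of numeric samples."""

        def __init__(self):
            self.count = 0
            self.total = 0.0
            self.min = None
            self.max = None

        def add(self, value):
            self.count += 1
            self.total += value
            self.min = value if self.min is None else min(self.min, value)
            self.max = value if self.max is None else max(self.max, value)

        def mean(self):
            return self.total / self.count if self.count else 0.0

    # Example: response-time samples in seconds.
    times = StatCounter()
    for sample in (0.000111, 0.000240, 0.000633):
        times.add(sample)
    print(times.count, times.min, times.max, times.mean())
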
There are unit tests; run them all with:

    for f in *_test.py; do echo === $f ===; python $f; done