
/historical/TODO

https://bitbucket.org/lindenlab/apiary/
=== SQL Database Prep ===

1. Run a slave synced to mysql.agni - but stop it at 2pm on Sunday.
2. Grab the logs off of nyar for 2pm-3pm and 3pm-4pm.
3. Generate ends for these files... something like:

    gzcat query.log.98.gz query.log.99.gz \
        | python sqllog_genends.py > query.log.98-99-ends

    this will take a long time

4. Generate a short test file -- say 100k lines:

    gzcat query.log.98.gz | head -100000 > query.log.98-short

5. Generate ends for that:

    python sqllog_genends.py query.log.98-short > query.log.98-short-ends

6. Generate stats for both of the above sets:

    gzcat query.log.98.gz query.log.99.gz \
        | python sqllog_stats.py - query.log.98-99-ends > stats-big

    python sqllog_stats.py query.log.98-short query.log.98-short-ends > stats-small

    The big stats will tell us how many workers we need....
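    (The worker count presumably comes out of sqllog_stats.py itself; as an
    illustration of the idea, the peak concurrency over a set of query
    (start, end) intervals can be found with a simple sweep. The function
    name and interval representation here are illustrative, not the actual
    sqllog_stats.py internals.)

    ```python
    def max_concurrency(intervals):
        """Peak number of simultaneously open (start, end) intervals."""
        events = []
        for start, end in intervals:
            events.append((start, 1))    # a query begins
            events.append((end, -1))     # a query ends
        # Sort by time; at equal timestamps, process ends before starts.
        events.sort(key=lambda e: (e[0], e[1]))
        current = peak = 0
        for _, delta in events:
            current += delta
            peak = max(peak, current)
        return peak
    ```

    e.g. max_concurrency([(0, 5), (1, 3), (2, 6)]) gives 3.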


=== Software To Do ===

The forking code has not been tested; the threading code has. The fork bit, though, is pretty boilerplate. The forks should detach correctly.
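"Detach correctly" here is the classic Unix double-fork; a minimal sketch of what the fork path should end up doing (the real hive code may structure this differently):

```python
import os

def detach():
    """Double-fork so a worker fully detaches from its parent
    and controlling terminal (sketch, not the actual hive code)."""
    if os.fork() > 0:
        os._exit(0)              # original process returns immediately
    os.setsid()                  # become session leader, drop the tty
    if os.fork() > 0:
        os._exit(0)              # first child exits; grandchild survives
    os.chdir('/')                # don't pin any mounted filesystem
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):         # point stdio at /dev/null
        os.dup2(devnull, fd)
```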
QueryResponse events need to be ignored when replayed, but not for end generation.

Need to add logic to guess the schema on the first query.
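One plausible shape for that guess: look for a `USE <db>` statement or a qualified `db.table` reference in the first statement seen. This is a sketch under those assumptions; the real query logs may need more cases.

```python
import re

_USE_RE = re.compile(r'^\s*USE\s+`?(\w+)`?', re.IGNORECASE)
_QUALIFIED_RE = re.compile(r'\b(\w+)\.\w+\b')   # naive: also matches numbers like 1.5

def guess_schema(sql):
    """Return a guessed schema name from one SQL statement, or None."""
    m = _USE_RE.match(sql)
    if m:
        return m.group(1)
    m = _QUALIFIED_RE.search(sql)
    if m:
        return m.group(1)
    return None
```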

Hive needs a better way to build up option parsers, and pass them around
with defaults.
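One way to do it: a shared builder that defines the common options once and lets each tool override the defaults. The function and option names below are illustrative, not the actual hive API.

```python
from optparse import OptionParser

def base_parser(**default_overrides):
    """Build the shared option parser; callers override defaults
    via keyword arguments (e.g. base_parser(workers=4))."""
    parser = OptionParser()
    parser.add_option('--workers', type='int', default=1,
                      help='worker threads per process')
    parser.add_option('--fork', type='int', default=0,
                      help='number of worker processes to fork')
    parser.set_defaults(**default_overrides)
    return parser
```

A tool then does `opts, args = base_parser(workers=4).parse_args()` and only adds its own extra options.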


=== Running ===

We are going to need the following procedure either in a script, or at least
easy to run:

1. Reset the test mysql instance to a known state.

2. Start n worker threads, on m processes, on k machines:

    dsh to k machines:
        python hive_mysql.py --fork m --workers n
    ... assuming all those machines have the software in the right place...

    Yup - I don't know what n, m, or k should be.... Only that we'll
    need n*m*k to be about the max concurrency given in the stats (from above).
    Whether threads are good enough, or forks are better... who knows...
    Nor do I know how many per machine..... Thoughts?


3. Start a central:

    python hive_mysql.py --central query.log.98-short query.log.98-short-ends

    -or-

    gzcat query.log.98.gz query.log.99.gz \
        | python hive_mysql.py --central - query.log.98-99-ends

That should print out stats at the very end -- in particular, how long it took to run.

At the same time as we are running this, we should, I suppose, be gathering
the IO and CPU stats on the mysql db machine.
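(vmstat/iostat on the db box would do; as a rough Linux-only sketch of the CPU side, sampling aggregate jiffies from /proc/stat over the run gives a busy fraction. Names here are illustrative.)

```python
import time

def cpu_jiffies():
    """Read the aggregate 'cpu' line from /proc/stat (Linux only)."""
    with open('/proc/stat') as f:
        fields = f.readline().split()   # "cpu user nice system idle ..."
    return [int(v) for v in fields[1:]]

def cpu_busy_fraction(interval=1.0):
    """Fraction of CPU time spent non-idle over the sample interval."""
    before = cpu_jiffies()
    time.sleep(interval)
    after = cpu_jiffies()
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    idle = deltas[3]                    # 4th /proc/stat field is idle
    return (total - idle) / float(total) if total else 0.0
```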


=== The Experiment ===

Once the above is all done.... THEN we can start the experiments,
in which we filter the event stream in hive_mysql.py's MySQLCentral
to not have various tables, etc.... -- and then re-run and look at times.
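The kind of filter that needs is something like the sketch below: drop any event whose SQL mentions one of the excluded tables. The event representation inside MySQLCentral is assumed here, not taken from the actual code.

```python
import re

def make_table_filter(excluded_tables):
    """Return a predicate keep(sql) that is False for statements
    touching any excluded table (naive word-boundary match)."""
    pattern = re.compile(
        r'\b(' + '|'.join(re.escape(t) for t in excluded_tables) + r')\b',
        re.IGNORECASE)
    def keep(sql):
        return pattern.search(sql) is None
    return keep
```

The central would apply `keep` to each event's SQL before dispatching it to workers, then the timed runs can be compared with and without each table.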