=== SQL Database Prep ===

1. run a slave sync'd to mysql.agni -- but stop it at 2pm, Sunday

2. grab the logs off of nyar for 2pm-3pm and 3pm-4pm

3. generate ends for these files... something like:

       gzcat query.log.98.gz query.log.99.gz \
           | python sqllog_genends.py > query.log.98-99-ends

   this will take a long time

4. generate a short test file -- say 100k lines:

       gzcat query.log.98.gz | head -100000 > query.log.98-short

5. generate ends for that:

       python sqllog_genends.py query.log.98-short > query.log.98-short-ends

6. generate stats for both of the above sets:

       gzcat query.log.98.gz query.log.99.gz \
           | python sqllog_stats.py - query.log.98-99-ends > stats-big

       python sqllog_stats.py query.log.98-short query.log.98-short-ends > stats-small

The big stats will tell us how many workers we need.


=== Software To Do ===

The forking code has not been tested; the threading code has. The fork
path, though, is pretty boilerplate. The forks should detach correctly.

QueryResponse events need to be ignored when replayed, but not during end
generation.

Need to add logic to guess the schema on the first query.

Hive needs a better way to build up option parsers and pass them around
with defaults.


=== Running ===

We are going to need the following procedure either in a script, or at
least easy to run:

1. reset the test mysql instance to a known state

2. start n worker threads, in m processes, on k machines:

       dsh to k machines:
           python hive_mysql.py --fork m --workers n

   ... assuming all those machines have the software in the right place.

   I don't know what n, m, or k should be -- only that n*m*k needs to be
   about the max concurrency given in the stats (from above). Whether
   threads are good enough or forks are better, and how many workers fit
   on one machine, are open questions. Thoughts?

3. start a central:

       python hive_mysql.py --central query.log.98-short query.log.98-short-ends

   -or-

       gzcat query.log.98.gz query.log.99.gz \
           | python hive_mysql.py --central - query.log.98-99-ends

   That should print stats at the very end -- in particular, how long the
   run took.

While this is running, we should also gather IO and CPU stats on the
mysql db machine.


=== The Experiment ===

Once all of the above is done, THEN we can start the experiments: filter
the event stream in hive_mysql.py's MySQLCentral to drop various tables,
etc., then re-run and compare times.
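The "how many workers" question above boils down to the peak number of
queries in flight at once. Assuming the ends data reduces to (start, end)
timestamp pairs per query (an assumption -- the actual sqllog_stats.py
output format isn't specified here), the max-concurrency number could be
computed with a simple sweep line, something like:

```python
# Sketch: peak concurrency from (start, end) query intervals via a sweep
# line. The (start, end) pair input format is hypothetical, not the real
# sqllog_stats.py output.

def max_concurrency(intervals):
    """Return the peak number of simultaneously open intervals."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # query opens
        events.append((end, -1))    # query closes
    # Sort closes before opens at the same timestamp, so back-to-back
    # queries don't count as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak

if __name__ == "__main__":
    # Three queries; at most two are ever open at once.
    print(max_concurrency([(0.0, 2.0), (1.0, 3.0), (2.0, 4.0)]))  # -> 2
```

That peak is roughly the n*m*k target for the worker fleet.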
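For the experiment, the filtering in MySQLCentral could start as crude as
dropping any event whose SQL mentions an excluded table. A minimal sketch
-- the function names and the "events are SQL strings" shape are
illustrative, not existing hive code:

```python
import re

# Sketch of the experiment's event-stream filter. Treating each event as
# a bare SQL string is an assumption for illustration; the real event
# objects in hive_mysql.py will differ.

def mentions_table(sql, table):
    """Crude check: does this statement reference the given table?

    Matches the bare table name as a word. Good enough for a first
    experiment, though it misses quoted and aliased forms.
    """
    return re.search(r"\b%s\b" % re.escape(table), sql, re.IGNORECASE) is not None

def filter_events(events, excluded_tables):
    """Yield only events that touch none of the excluded tables."""
    for event in events:
        if not any(mentions_table(event, t) for t in excluded_tables):
            yield event

if __name__ == "__main__":
    stream = ["SELECT * FROM users", "UPDATE sessions SET x=1", "SELECT 1"]
    print(list(filter_events(stream, ["sessions"])))
    # -> ['SELECT * FROM users', 'SELECT 1']
```

Re-running the replay with different excluded-table sets, and comparing
the end-of-run timings, is then the whole experiment loop.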