
/historical/TODO

https://bitbucket.org/lindenlab/apiary/
=== SQL Database Prep ===

1. run a slave sync'd to mysql.agni - but stop it at 2pm, Sunday

2. grab the logs off of nyar for 2pm-3pm and 3pm-4pm

3. generate ends for these files... something like:

       gzcat query.log.98.gz query.log.99.gz \
           | python sqllog_genends.py > query.log.98-99-ends

   this will take a long time

4. generate a short test file -- say 100k lines:

       gzcat query.log.98.gz | head -100000 > query.log.98-short

5. generate ends for that:

       python sqllog_genends.py query.log.98-short > query.log.98-short-ends

6. generate stats for both of the above sets:

       gzcat query.log.98.gz query.log.99.gz \
           | python sqllog_stats.py - query.log.98-99-ends > stats-big

       python sqllog_stats.py query.log.98-short query.log.98-short-ends > stats-small

The big stats will tell us how many workers we need....
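
For intuition, here is a minimal sketch of how that number falls out of the
data -- not the actual sqllog_stats.py logic, just the idea: once every query
has a start and an end timestamp (which is what the ends files supply), peak
concurrency is a sweep over the merged open/close events.

    # Illustrative only: given (start, end) timestamps per query, find
    # the peak number of queries open at the same instant.
    def max_concurrency(intervals):
        events = []
        for start, end in intervals:
            events.append((start, 1))    # query opens
            events.append((end, -1))     # query closes
        # sort by time; at equal timestamps the close (-1) sorts first,
        # so back-to-back queries don't count as overlapping
        events.sort()
        current = peak = 0
        for _, delta in events:
            current += delta
            if current > peak:
                peak = current
        return peak

    print(max_concurrency([(0.0, 5.0), (1.0, 3.0), (2.0, 8.0)]))  # -> 3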

=== Software To Do ===

The forking code has not been tested; the threading code has. The fork bit,
though, is pretty boilerplate. The forks should detach correctly (see the
sketch below).
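
For comparison with whatever the fork code currently does, a minimal sketch
of the textbook Unix double-fork detach -- not necessarily hive's exact code:

    import os, sys

    def detach():
        # standard double fork: the grandchild is reparented to init
        # and has no controlling terminal
        if os.fork() > 0:
            sys.exit(0)         # original parent exits
        os.setsid()             # new session, drop the terminal
        if os.fork() > 0:
            sys.exit(0)         # first child exits
        os.chdir('/')
        devnull = os.open(os.devnull, os.O_RDWR)
        for fd in (0, 1, 2):    # point stdio at /dev/null
            os.dup2(devnull, fd)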

QueryResponse events need to be ignored when replayed, but not for end
generation.
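
Something along these lines, assuming events carry a state field naming
their type (the attribute name here is a guess):

    def replayable(events):
        # drop response events on the replay path only; end generation
        # should still consume the raw, unfiltered stream
        for event in events:
            if getattr(event, 'state', None) == 'QueryResponse':
                continue
            yield event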

Need to add logic to guess the schema on the first query.
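
A possible first cut, assuming the schema shows up either in an explicit
USE or in a db-qualified table name (the regexes are illustrative, not
exhaustive):

    import re

    _USE_RE = re.compile(r'^\s*use\s+`?(\w+)`?', re.IGNORECASE)
    _QUALIFIED_RE = re.compile(r'`?(\w+)`?\.`?\w+`?')

    def guess_schema(sql):
        m = _USE_RE.match(sql)
        if m:
            return m.group(1)
        m = _QUALIFIED_RE.search(sql)
        if m:
            return m.group(1)
        return None

    print(guess_schema('USE agni; SELECT 1'))        # -> agni
    print(guess_schema('SELECT * FROM agni.users'))  # -> agni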

Hive needs a better way to build up option parsers, and pass them around
with defaults.
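
One possible shape, reusing the flags that already appear in this file
(--fork, --workers, --central); the shared-defaults dict is hypothetical:

    from optparse import OptionParser

    def add_hive_options(parser, defaults=None):
        # one builder that every entry point calls, so options and
        # their defaults live in a single place
        defaults = defaults or {}
        parser.add_option('--workers', type='int',
                          default=defaults.get('workers', 1),
                          help='worker threads per process')
        parser.add_option('--fork', type='int', dest='forks',
                          default=defaults.get('forks', 1),
                          help='worker processes per machine')
        parser.add_option('--central', action='store_true', default=False,
                          help='run as the central dispatcher')
        return parser

    parser = add_hive_options(OptionParser())
    options, args = parser.parse_args()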

=== Running ===

We are going to need the following procedure either in a script, or at
least easy to run:

1. reset the test mysql instance to a known state

2. start n worker threads, in m processes, on k machines (a sketch of the
   fork/thread split follows this procedure); dsh to the k machines:

       python hive_mysql.py --fork m --workers n

   ... assuming all those machines have the software in the right place...

   yup - i don't know what n, m, or k should be.... Only that we'll need
   n*m*k to be about the max concurrency given in the stats (from above).
   Whether threads are good enough, or forks are better... who knows...
   nor do I know how many per machine..... thoughts?

3. start a central:

       python hive_mysql.py --central query.log.98-short query.log.98-short-ends

   -or-

       gzcat query.log.98.gz query.log.99.gz \
           | python hive_mysql.py --central - query.log.98-99-ends

   That should print out stats at the very end -- in particular, how long
   the run took.

While this is running, we should also be gathering the IO stats and CPU
stats on the mysql db machine.
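
The fork/thread split from step 2, as a minimal sketch (worker_loop is a
stub standing in for the real replay worker):

    import os
    import threading

    def worker_loop():
        pass    # stub: the real worker would replay queries here

    def start_processes(m, n):
        # fork m children; each child runs n worker threads
        for _ in range(m):
            if os.fork() == 0:                  # in the child
                threads = [threading.Thread(target=worker_loop)
                           for _ in range(n)]
                for t in threads:
                    t.start()
                for t in threads:
                    t.join()
                os._exit(0)
        for _ in range(m):                      # parent reaps children
            os.wait()

    start_processes(2, 4)   # e.g. 2 processes x 4 threads = 8 workers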

=== The Experiment ===

Once the above is all done.... THEN we can start the experiments, in which
we filter the event stream in hive_mysql.py's MySQLCentral to leave out
various tables, etc.... -- and then re-run and look at the times.
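
As a starting point, the sort of filter we'd drop into MySQLCentral -- the
excluded table names and the regex here are purely illustrative:

    import re

    EXCLUDED = set(['some_table', 'another_table'])   # hypothetical names

    _TABLE_RE = re.compile(r'\b(?:from|into|update|join)\s+`?(\w+)`?',
                           re.IGNORECASE)

    def keep_event(sql, excluded=EXCLUDED):
        # drop any event whose SQL touches an excluded table
        for table in _TABLE_RE.findall(sql):
            if table.lower() in excluded:
                return False
        return True

    print(keep_event('SELECT * FROM some_table WHERE id=1'))  # -> False
    print(keep_event('SELECT 1'))                             # -> True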