
/historical/README

https://bitbucket.org/lindenlab/apiary/
=== LOGS ===

On nyarlathotep we generate logs from packet captures of SQL hitting the
database. These files look like this:

1237237351.064750 10.2.231.65:40784 sim QueryStart
SELECT owner_id, is_owner_group FROM parcel WHERE
parcel_id='9d50d6eb-a623-2a30-a8e0-840189fabff7'
**************************************
1237237351.064759 10.0.0.172:56714 web Quit
Quit
**************************************
1237237351.064861 10.2.231.65:40784 sim QueryResponse
SELECT owner_id, is_owner_group FROM parcel WHERE
parcel_id='9d50d6eb-a623-2a30-a8e0-840189fabff7'
**************************************
1237237351.065393 10.6.6.97:39706 sim QueryStart
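Each record starts with a header line of the form "seconds.microseconds
ip:port source event", followed by the SQL (or a bare "Quit") and a row of
asterisks. A minimal parser sketch for those header lines, with field names
that are my own labels inferred from the sample above (not from the actual
scripts):

```python
import re

# Header lines look like: "<seconds.microseconds> <ip>:<port> <source> <event>".
# The group names below are assumptions inferred from the sample log records.
HEADER = re.compile(
    r'^(?P<time>\d+\.\d+) (?P<host>[\d.]+):(?P<port>\d+) '
    r'(?P<source>\S+) (?P<event>\S+)$')

def parse_header(line):
    """Return (time, connection, source, event), or None for SQL body
    lines and separator rows, which do not match the header shape."""
    m = HEADER.match(line)
    if m is None:
        return None
    return (float(m.group('time')),
            '%s:%s' % (m.group('host'), m.group('port')),
            m.group('source'),
            m.group('event'))
```

The ip:port pair identifies a connection, which is what lets the scripts
below group and replay SQL per connection.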
These logs allow us to replay sequences of SQL by connection. However, the
"Quit" indicator that shows when a sequence ends isn't always present. Once
you have a log, you need to generate the missing ends (assumed to fall just
after the last SQL on a connection). To do this, run something like:

gzcat query.log.21.gz | python sqllog_genends.py > query.log.21-ends
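The idea behind generating the missing ends can be sketched as follows:
remember the last event on each connection, drop connections that quit
cleanly, and stamp a synthetic "Quit" just after the last SQL on the rest.
This is an illustration of the concept, not the actual sqllog_genends.py
implementation:

```python
SEP = '**************************************'

def generate_ends(lines, epsilon=0.000001):
    """Yield synthetic Quit records for connections that never sent one.

    A sketch of the idea (assumed, not the real script): track the last
    header seen per connection, and emit an end just after its timestamp.
    """
    last_seen = {}  # connection -> (timestamp, source) of its last event
    for line in lines:
        parts = line.split(None, 3)
        if len(parts) != 4:
            continue  # bare "Quit" lines and separator rows
        try:
            t = float(parts[0])
        except ValueError:
            continue  # SQL body lines that happen to have several words
        conn, source, event = parts[1], parts[2], parts[3].strip()
        if event == 'Quit':
            last_seen.pop(conn, None)  # this connection ended cleanly
        else:
            last_seen[conn] = (t, source)
    for conn, (t, source) in sorted(last_seen.items()):
        yield '%f %s %s Quit\nQuit\n%s\n' % (t + epsilon, conn, source, SEP)
```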
The log and the end files need to be merged for any operation; however, the
scripts all take multiple input files and will do a sorted merge as they
process. Since the source is almost always gzip'd, you can use a dash to
make stdin one of the merged inputs:

gzcat query.log.21.gz | python sqllog_stats.py - query.log.21-ends

The above also shows that you can generate statistics about a stream of events
(or several streams merged) with the sqllog_stats.py script.
Warning: These log files are VERY large.
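The sorted merge of several time-ordered inputs can be illustrated with the
standard library's heapq.merge; the real scripts' merging surely differs in
detail, and this sketch assumes each input yields only header lines, already
sorted by time:

```python
import heapq

def timestamped(lines):
    """Yield (timestamp, line) pairs for one stream of header lines.

    Assumes header lines only, in time order; real logs interleave SQL
    body lines and separators with their headers.
    """
    for line in lines:
        yield (float(line.split(None, 1)[0]), line)

def merge_streams(*streams):
    """Merge several time-sorted event streams into one, ordered by time."""
    for t, line in heapq.merge(*[timestamped(s) for s in streams]):
        yield line
```

heapq.merge never holds more than one pending item per input, which is what
makes a sorted merge practical over very large files.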
=== HIVE ===

You can play around with a simple test of hive (that does no work but
passes strings around):

python hive_stats.py --central --workers 5

You can play around with settings there and explore the hive framework.
This test program could be improved, I suppose, and might help us determine
the proper thread, fork, and machine counts (see below).
=== MISC ===

There are some generally useful python modules here:

mergetools - iterate over a merge of multiple sorted sequences in order
stattools  - a statistics gathering counter
timestamp  - a representation of time in seconds + microseconds format
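To give a feel for what a statistics gathering counter does, here is a
minimal running-statistics class in the same spirit; the real stattools
module's API is not shown here, so names and methods below are illustrative
assumptions:

```python
import math

class StatCounter(object):
    """Running count/min/max/mean/stddev over a stream of samples.

    An illustrative sketch only; the actual stattools API may differ.
    """
    def __init__(self):
        self.n = 0
        self.min = None
        self.max = None
        self._sum = 0.0
        self._sumsq = 0.0

    def add(self, x):
        self.n += 1
        self.min = x if self.min is None else min(self.min, x)
        self.max = x if self.max is None else max(self.max, x)
        self._sum += x
        self._sumsq += x * x

    def mean(self):
        return self._sum / self.n if self.n else 0.0

    def stddev(self):
        """Sample standard deviation, computed from running sums."""
        if self.n < 2:
            return 0.0
        var = (self._sumsq - self._sum * self._sum / self.n) / (self.n - 1)
        return math.sqrt(max(var, 0.0))
```

Keeping only running sums means the counter uses constant memory no matter
how many events it sees, which matters given the size of these logs.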
There are unit tests. Run them all with:

for f in *_test.py; do echo === $f ===; python $f; done