PageRenderTime 21ms CodeModel.GetById 14ms app.highlight 3ms RepoModel.GetById 1ms app.codeStats 0ms

/mysql_watcher/README

https://bitbucket.org/lindenlab/apiary/
#! | 222 lines | 170 code | 52 blank | 0 comment | 0 complexity | 3f21ad68397db3a95faed2d990efdb37 MD5 | raw file
  1This documentation was pulled from Linden Lab's internal wiki, and needs some exhaustive cleanup. - Doug
  2
  3mysql_watcher contains a set of tools which monitor, analyze and log
  4database traffic on MySQL databases by reading a tcpdump packet stream
  5and parsing it.  It is particularly useful as a low-impact means of
  6profiling load on high-volume MySQL hosts in real-time, providing
  7statistical information on queries that tools such as mytop and
  8innotop lack.
  9
 10== Tools ==
 11=== "Current" Tools ===
 12To use them you should be able to check out the directory and run them on any host.  They should NOT be run from the database host that is being monitored.
 13* mysql_watcher - connects to a live mysql host and processes the incoming query packet stream in real time, displaying it through a curses UI.
 14* mysql_logger - connects to a live mysql host and processes the incoming query stream, putting the resulting stream of query information into a set of log files.
 15* log_watcher - replays a set of logs created by mysql_logger and generates statistics (in theory - it's definitely buggy).
 16* browse_queries - script that allows you to view the summarized output of the watcher scripts in curses.
 17* db_info - tiny script that gets statistics about tables and databases on a database host.
 18* table_info - hacky script that reads in summary information created by the watcher scripts, and generates statistics and other output based on table names.
 19
 20The following tools are not "modern", and should not be used as examples for future development, except at a very basic level.
 21=== "Legacy" Tools ===
 22These are tools that haven't been migrated from the "old" location for scaling scripts at svn+ssh://svn.lindenlab.com/svn/release/scripts/scaling.  It would be good to get these migrated if possible.
 23* traffic_watcher - generically watches traffic for a host, breaks down by protocol and "Linden" destinations
 24* message_watcher - watches the packet stream for a host, and identifies Linden-related traffic by protocol (and in the case of UDP, message name).
 25
 26== Libraries ==
 27=== Current ===
 28** /dblibs - Database utility libraries are located here.
 29*** /dblibs/dbmonitor.py - utility library to do "safe" batch operations on production databases with slaves.
 30*** /dblibs/dbutil.py - library for processing and gathering statistics on mysql TCP streams
 31*** /dblibs/dbbrowser.py - curses-based query browsing tools
 32
 33
 34== Applications ==
 35** mysql_watcher - real-time MySQL query stats collector/browser
 36** mysql_logger - log all SQL queries to rotating 36 hour compresssed log
 37** browse_queries - browse/edit query metadata
 38** query_info.llsd - query metadata file
 39== Basic Implementation ==
 40* The various tools spawn off an ssh to the target hosts that runs tcpdump and pipes the results using netcat over the network to a server socket.
 41* LLQueryStream is a class that reads a tcpdump stream (as a file), and generates events that correspond to queries/other events (QueryStart, QueryResponse, Quit).  All the tools hook up the incoming tcpdump stream to the LLQueryStream class.
 42* The server sockets run a while loop around LLQueryStream.getNextEvent(), and shove the results into a query stats class LLQueryStatMap which collects statistics.
 43* Every once in a while, LLQueryBrowser is used to handle keyboard input and redraw the screen.
 44== Bugs/Improvements ==
 45=== Core Libraries ===
 46* Make conform to Python coding standards
 47* Improved query generalizer/parser
 48* Handle non-UTF8 queries better (affects dumping LLSD files)
 49* Improve performance (high CPU load can sometimes cause dropped packets in the stream)
 50* Handle dropped packets better (can cause stats)
 51* Improve long-term running (accumulated bad connections/queries causes memory/performance bloat)
 52* Track actual response data transmission time, not just initial response time
 53=== Curses/Tools ===
 54* Make conform to Python coding standards
 55* No help to speak of
 56* Handle resize
 57* Multi-line editor sucks
 58* Display more statistical data (total queries/sec, time/sec)
 59== New Features/Tools ==
 60* Allow metadata editing while watching live stream
 61* Watch and flag unknown queries crossing the wire (intrusion detection)
 62* Look for non-Unicode data in the query stream
 63* Store historical query traffic in a meaningful way
 64* Real visualization tools
 65* Drill-downs to watch specific queries
 66
 67
 68
 69= mysql_watcher =
 70
 71
 72mysql_watcher is a script written in python which monitors database traffic on databases by reading a tcpdump packet stream.  It collects various statistical and logging information, and displays the information using curses.
 73
 74See [[MySQL Monitoring Tools]] For the toolkit that I used to implement this.  There are also other related database and network monitoring tools.
 75
 76== Usage ==
 77
 78=== Running ===
 79'''DO NOT run this on the database host itself.  It talks to the database over the network, so you should run it on a fast, non-critical production host, preferably in the same colo facility.'''
 80
 81Arguments:
 82* --host=<mysql_host> - the mysql host you wish to monitor.  You will need to make sure you have root access (it logs in as root so it can run tcpdump without sudoing).
 83
 84So, to monitor mysql.agni, I run "./mysql_watcher --host=mysql.agni".
 85
 86You will want to use a fairly large terminal window - I haven't taken the time to make curses work 100% properly yet, and if your window is too small there's a good chance that it will crash when attempting to display some of the longer strings.
 87
 88Notes:
 89* If you're running against mysql.agni (or another similarly high-traffic database), I'd recommend a class 3 or higher.  This currently burns 60-70% of a CPU on a class 3 running against mysql.agni
 90* The query logging can take a LOT of disk space - possibly a megabyte a minute against mysql.agni.
 91* You almost definitely do not want to be running more than one of these against a particular DB at any time - it uses about 4% of a CPU between tcpdump and netcat.
 92
 93=== Output ===
 94<pre>
 95Total
 96QPS: 1956.19    Elapsed: 4642.90
 97 TotalSec  AvgSec   Count   QPS Host      Query                                                       Notes
 98----------------------------------------------------------------------------------------------------------------
 99 11827.02  0.0120  983071 211.7 sim       SELECT u.agent_id as agent_id, u.im_via_email as im_via_ema store_instant_messa
100 10708.46  0.0084 1269642 273.5 sim       SELECT count(*) FROM script_email_message WHERE script_id = script email delive
101  8286.53  0.1054   78625  16.9 sim       SELECT g.name, g.charter, g.member_title, hex(g.member_powe Group profile infor
102  7268.40  0.0335  216983  46.7 sim       SELECT g.group_id FROM groups g, user_groups_map ugm WHERE  get_groups_list: Gr
103  7206.51  0.0332  216916  46.7 sim       SELECT g.group_id as group_id, g.name as group_name, hex(ug get_groups_data, fr
104  5201.93  0.5598    9292   2.0 sim       SELECT s.region_id, s.grid_x, s.grid_y, s.sim_name, s.sim_c simulator startup
105  4641.38  0.4464   10397   2.2 web       SELECT * FROM user_last_name WHERE (availability = *string* none
106  4073.26  0.0549   74254  16.0 login     SELECT r.agent_related AS buddy_id,r.agent_id_rights as bud Friends list on log
107  3029.89  0.0344   88053  19.0 sim       SELECT g.group_id as group_id, g.name as group_name, hex(ug
108  2621.76  0.2095   12513   2.7 web       select count(*) as total from simulator s, estate e where s
109  2576.30  0.0128  201496  43.4 sim       SELECT u.agent_id, u.limited_to_estate, u.god_level, v.desc
110  2439.34  1.1517    2118   0.5 web       SELECT DISTINCT p.parcel_id, p.name, s.sim_name, p.grid_x,
111  2231.35 14.9755     149   0.0 vega      DELETE FROM indra.money_transaction WHERE transaction_time
112  2157.60  0.0058  372079  80.1 sim       SELECT u.money, u.square_meters_credit, IFNULL(SUM(p.billab query_balance: get
113  1913.40  0.3638    5260   1.1 web       SELECT IFNULL(SUM(cot.sell_net_usd * *num*),*num*) as total Lindex query?
114  1913.07  0.0258   74238  16.0 login     SELECT g.group_id AS group_id, g.name AS group_name, hex(ug none
115  1885.45  0.0298   63326  13.6 login     REPLACE INTO known_address SET agent_id = *uuid*, ip_string known address looku
116  1884.30  0.0639   29501   6.4 sim       SELECT m.agent_id as agent_id, m.donated_square_meters as d none
117  1832.59  0.0084  216925  46.7 sim       SELECT image_id, about, want_to_mask, want_to, skills_mask, get_avatar_properti
118  1785.39  0.8422    2120   0.5 web       select customer_id from customer_activation where activatio
119
120Clean Query: sim
121SELECT u.agent_id as agent_id, u.im_via_email as im_via_email, u.email as email, u.enabled as enabled, CONCAT(u.username, *string*, l.name) as nam
122e, u.limited_to_estate
123
124Sample Query: sim
125SELECT u.agent_id as agent_id, u.im_via_email as im_via_email, u.email as email, u.enabled as enabled, CONCAT(u.username, ' ', l.name) as name, u.
126limited_to_estate as limited_to_estate, u.god_level as god_level, u.inventory_host_name as inventory_host_name FROM user u, user_last_name l  WHER
127E u.last_name_id = l.last_name_id AND u.agent_id in ('6f53dc02-d75b-4542-b606-767275596735', '7ab025d4-eb5c-4113-a859-22039839dacb')
128
129
130
131
132
133Tables:   ['user', 'user_last_name']
134Columns:
135Notes:
136store_instant_message: user info lookup for IM delivery
137</pre>
138=== Keyboard shortcuts ===
139* ''<space>'' - toggles between the last 5 second view versus all stats collected since the run started.
140* ''<up, down, pg up, pg dn>'' - navigates the selected query up and down the screen.
141* ''s'' - changes which column is being sorted by.
142* ''q'' - quit the application (and dump data)
143
144==== On-disk output for mysql_logger ====
145* <output_path>/query.log - a log of every query that is being run on the database.  The output format is:
146<pre>
147<unixtime> <hostname>
148<query string>
149**********************
150<unixtime> <hostname>
151<query string>
152**********************
153.
154.
155.
156</pre>
157* <output_path>/query_timing.txt - detailed version of the stdout stats, dumped every hour.
158* <output_path>/query_counts.llsd - LLSD dump used for merge_tables.
159'''Currently <output_path> is hardcoded as ./<hostname>/ - sorry.'''
160== Old documentation ==
161Update: mysql_watcher now has a companion script merge_table that merges useful metadata.
162
163<pre>
164Locations of scripts:
165
166All of the database scripts that I use reside here:
167svn+ssh://svn.lindenlab.com/linden/release/scripts/scaling
168
169mysql_watcher is the script that you want to use to gather real-time data.  You'll need the root ssh key added to run it (so it can run tcpdump on the database machine).  You'll also want to run it on a fairly beefy box - the stations don't have quite enough juice to run it when mysql.agni is going full tilt.  You may have to export PYTHONPATH if you're not running on a box with a code deploy.
170
171Run it using the command "mysql_watcher mysql.agni.lindenlab.com".  It does some stuff with popen, so it seems to misbehave sometimes when the output is piped, and I haven't had time to fix it yet.
172
173When you break out, it generates raw output in two files:
174- query_info_dump.llsd
175- query_counts.llsd
176
177The next step is to merge this count data with the metadata in query_info.llsd using merge_tables - which generates text and HTML output.
178
179query_info.llsd is an LLDS document that contains a list of all known queries, and currently has a map with metadata (human-editable) such as the host which called it, notes, and the tables accessed.  When merge_tables is run, it takes this metadata and merges it with the counts so that you don't have to do it by hand every time.  Check this file into svn whenever you make updates or tweaks to it.
180
181Known issues:
182- If a query comes from multiple classes of hosts, the merge_tables script will currently ignore the data from the host which generated less queries.  This could result in query counts which are less than expected.
183
184- merge_tables currently attempts to scp the HTML version into my public_html folder in my homedir, which obviously won't work for you.
185
186- Because this isn't a heavily tested script, merge_tables won't automatically generate a new version of query_info.llsd when it encounters new queries it hasn't seen before.  You'll want to edit the script to set the merge_query_info variable to True.
187
188The cool thing that I want to do which I haven't done yet to these scripts is to put metadata for how expensive a query is into the query_info.llsd file - that way we can generate good weightings for how much individual queries are affecting the DB, and even do lookups in the metadata to figure out what's hurting us in real time.
189
190Let me know if you have more questions.
191
192- Doug
193</pre>
194
195mysql_watcher is a script that parses the results of tcpdump on a mysql host, and generates real-time statistical output of what queries are running on the host.  It currently lives in svn in /linden/scripts/scaling/mysql_watcher.  The following are the docs that I wrote at the top of the script:
196
197<pre>
198Reads the input of a tcpdump stream using -s 0 -A, and generates either real-time information
199or summary information.
200
201On the server (NOT the database, but something like a station) (which will be reading the stream), run:
202
203doug@station0$ nc -l -p 9999 | /usr/sbin/tcpdump -r - -A -s 0 | ./mysql_watcher
204
205This listens to data using netcat on port 9999, and feeds it through tcpdump which parses it into a format
206that mysql_watcher likes.
207
208On the client (on the mysql database host), run:
209
210doug@mysql.agni$ sudo /usr/sbin/tcpdump -n -s 0 -w - -i eth0 port 3306 and dst host db2c6 | nc station0 9999
211
212This takes the tcpdump raw packet output and shoves it to the server on station0, running on port 9999.
213
214On the station0 host it will output every 5 seconds the top queries (currently 50 of them).  On breaking out
215of the server, it will dump a summary of ALL of the queries.
216
217At some point in the future this will be made more glamorous, and less finicky, but this works.
218</pre>
219
220Further improvements include adding command line parameters, and using curses to allow you to interact with and tweak the output.  Also, using popen and/or llspawner to automatically open up all of the pipes to make it all easier.
221
222'''Note:  You will probably want to verify that nobody else is running this at the time.  It only seems to generate around 3% load on mysql.agni running near peak, but...'''