/mysql_watcher/README
#! | 222 lines | 170 code | 52 blank | 0 comment | 0 complexity | 3f21ad68397db3a95faed2d990efdb37 MD5 | raw file
1This documentation was pulled from Linden Lab's internal wiki, and needs some exhaustive cleanup. - Doug
2
3mysql_watcher contains a set of tools which monitor, analyze and log
4database traffic on MySQL databases by reading a tcpdump packet stream
5and parsing it. It is particularly useful as a low-impact means of
6profiling load on high-volume MySQL hosts in real-time, providing
7statistical information on queries that tools such as mytop and
8innotop lack.
9
10== Tools ==
11=== "Current" Tools ===
12To use them you should be able to check out the directory and run them on any host. They should NOT be run from the database host that is being monitored.
13* mysql_watcher - connects to a live mysql host and processes the incoming query packet stream in real time, displaying it through a curses UI.
14* mysql_logger - connects to a live mysql host and processes the incoming query stream, putting the resulting stream of query information into a set of log files.
15* log_watcher - replays a set of logs created by mysql_logger and generates statistics (in theory - it's definitely buggy).
16* browse_queries - script that allows you to view the summarized output of the watcher scripts in curses.
17* db_info - tiny script that gets statistics about tables and databases on a database host.
18* table_info - hacky script that reads in summary information created by the watcher scripts, and generates statistics and other output based on table names.
19
20The following tools are not "modern", and should not be used as examples for future development, except at a very basic level.
21=== "Legacy" Tools ===
22These are tools that haven't been migrated from the "old" location for scaling scripts at svn+ssh://svn.lindenlab.com/svn/release/scripts/scaling. It would be good to get these migrated if possible.
23* traffic_watcher - generically watches traffic for a host, breaks down by protocol and "Linden" destinations
24* message_watcher - watches the packet stream for a host, and identifies Linden-related traffic by protocol (and in the case of UDP, message name).
25
26== Libraries ==
27=== Current ===
28** /dblibs - Database utility libraries are located here.
29*** /dblibs/dbmonitor.py - utility library to do "safe" batch operations on production databases with slaves.
30*** /dblibs/dbutil.py - library for processing and gathering statistics on mysql TCP streams
31*** /dblibs/dbbrowser.py - curses-based query browsing tools
32
33
34== Applications ==
35** mysql_watcher - real-time MySQL query stats collector/browser
36** mysql_logger - log all SQL queries to rotating 36 hour compresssed log
37** browse_queries - browse/edit query metadata
38** query_info.llsd - query metadata file
39== Basic Implementation ==
40* The various tools spawn off an ssh to the target hosts that runs tcpdump and pipes the results using netcat over the network to a server socket.
41* LLQueryStream is a class that reads a tcpdump stream (as a file), and generates events that correspond to queries/other events (QueryStart, QueryResponse, Quit). All the tools hook up the incoming tcpdump stream to the LLQueryStream class.
42* The server sockets run a while loop around LLQueryStream.getNextEvent(), and shove the results into a query stats class LLQueryStatMap which collects statistics.
43* Every once in a while, LLQueryBrowser is used to handle keyboard input and redraw the screen.
44== Bugs/Improvements ==
45=== Core Libraries ===
46* Make conform to Python coding standards
47* Improved query generalizer/parser
48* Handle non-UTF8 queries better (affects dumping LLSD files)
49* Improve performance (high CPU load can sometimes cause dropped packets in the stream)
50* Handle dropped packets better (can cause stats)
51* Improve long-term running (accumulated bad connections/queries causes memory/performance bloat)
52* Track actual response data transmission time, not just initial response time
53=== Curses/Tools ===
54* Make conform to Python coding standards
55* No help to speak of
56* Handle resize
57* Multi-line editor sucks
58* Display more statistical data (total queries/sec, time/sec)
59== New Features/Tools ==
60* Allow metadata editing while watching live stream
61* Watch and flag unknown queries crossing the wire (intrusion detection)
62* Look for non-Unicode data in the query stream
63* Store historical query traffic in a meaningful way
64* Real visualization tools
65* Drill-downs to watch specific queries
66
67
68
69= mysql_watcher =
70
71
72mysql_watcher is a script written in python which monitors database traffic on databases by reading a tcpdump packet stream. It collects various statistical and logging information, and displays the information using curses.
73
74See [[MySQL Monitoring Tools]] For the toolkit that I used to implement this. There are also other related database and network monitoring tools.
75
76== Usage ==
77
78=== Running ===
79'''DO NOT run this on the database host itself. It talks to the database over the network, so you should run it on a fast, non-critical production host, preferably in the same colo facility.'''
80
81Arguments:
82* --host=<mysql_host> - the mysql host you wish to monitor. You will need to make sure you have root access (it logs in as root so it can run tcpdump without sudoing).
83
84So, to monitor mysql.agni, I run "./mysql_watcher --host=mysql.agni".
85
86You will want to use a fairly large terminal window - I haven't taken the time to make curses work 100% properly yet, and if your window is too small there's a good chance that it will crash when attempting to display some of the longer strings.
87
88Notes:
89* If you're running against mysql.agni (or another similarly high-traffic database), I'd recommend a class 3 or higher. This currently burns 60-70% of a CPU on a class 3 running against mysql.agni
90* The query logging can take a LOT of disk space - possibly a megabyte a minute against mysql.agni.
91* You almost definitely do not want to be running more than one of these against a particular DB at any time - it uses about 4% of a CPU between tcpdump and netcat.
92
93=== Output ===
94<pre>
95Total
96QPS: 1956.19 Elapsed: 4642.90
97 TotalSec AvgSec Count QPS Host Query Notes
98----------------------------------------------------------------------------------------------------------------
99 11827.02 0.0120 983071 211.7 sim SELECT u.agent_id as agent_id, u.im_via_email as im_via_ema store_instant_messa
100 10708.46 0.0084 1269642 273.5 sim SELECT count(*) FROM script_email_message WHERE script_id = script email delive
101 8286.53 0.1054 78625 16.9 sim SELECT g.name, g.charter, g.member_title, hex(g.member_powe Group profile infor
102 7268.40 0.0335 216983 46.7 sim SELECT g.group_id FROM groups g, user_groups_map ugm WHERE get_groups_list: Gr
103 7206.51 0.0332 216916 46.7 sim SELECT g.group_id as group_id, g.name as group_name, hex(ug get_groups_data, fr
104 5201.93 0.5598 9292 2.0 sim SELECT s.region_id, s.grid_x, s.grid_y, s.sim_name, s.sim_c simulator startup
105 4641.38 0.4464 10397 2.2 web SELECT * FROM user_last_name WHERE (availability = *string* none
106 4073.26 0.0549 74254 16.0 login SELECT r.agent_related AS buddy_id,r.agent_id_rights as bud Friends list on log
107 3029.89 0.0344 88053 19.0 sim SELECT g.group_id as group_id, g.name as group_name, hex(ug
108 2621.76 0.2095 12513 2.7 web select count(*) as total from simulator s, estate e where s
109 2576.30 0.0128 201496 43.4 sim SELECT u.agent_id, u.limited_to_estate, u.god_level, v.desc
110 2439.34 1.1517 2118 0.5 web SELECT DISTINCT p.parcel_id, p.name, s.sim_name, p.grid_x,
111 2231.35 14.9755 149 0.0 vega DELETE FROM indra.money_transaction WHERE transaction_time
112 2157.60 0.0058 372079 80.1 sim SELECT u.money, u.square_meters_credit, IFNULL(SUM(p.billab query_balance: get
113 1913.40 0.3638 5260 1.1 web SELECT IFNULL(SUM(cot.sell_net_usd * *num*),*num*) as total Lindex query?
114 1913.07 0.0258 74238 16.0 login SELECT g.group_id AS group_id, g.name AS group_name, hex(ug none
115 1885.45 0.0298 63326 13.6 login REPLACE INTO known_address SET agent_id = *uuid*, ip_string known address looku
116 1884.30 0.0639 29501 6.4 sim SELECT m.agent_id as agent_id, m.donated_square_meters as d none
117 1832.59 0.0084 216925 46.7 sim SELECT image_id, about, want_to_mask, want_to, skills_mask, get_avatar_properti
118 1785.39 0.8422 2120 0.5 web select customer_id from customer_activation where activatio
119
120Clean Query: sim
121SELECT u.agent_id as agent_id, u.im_via_email as im_via_email, u.email as email, u.enabled as enabled, CONCAT(u.username, *string*, l.name) as nam
122e, u.limited_to_estate
123
124Sample Query: sim
125SELECT u.agent_id as agent_id, u.im_via_email as im_via_email, u.email as email, u.enabled as enabled, CONCAT(u.username, ' ', l.name) as name, u.
126limited_to_estate as limited_to_estate, u.god_level as god_level, u.inventory_host_name as inventory_host_name FROM user u, user_last_name l WHER
127E u.last_name_id = l.last_name_id AND u.agent_id in ('6f53dc02-d75b-4542-b606-767275596735', '7ab025d4-eb5c-4113-a859-22039839dacb')
128
129
130
131
132
133Tables: ['user', 'user_last_name']
134Columns:
135Notes:
136store_instant_message: user info lookup for IM delivery
137</pre>
138=== Keyboard shortcuts ===
139* ''<space>'' - toggles between the last 5 second view versus all stats collected since the run started.
140* ''<up, down, pg up, pg dn>'' - navigates the selected query up and down the screen.
141* ''s'' - changes which column is being sorted by.
142* ''q'' - quit the application (and dump data)
143
144==== On-disk output for mysql_logger ====
145* <output_path>/query.log - a log of every query that is being run on the database. The output format is:
146<pre>
147<unixtime> <hostname>
148<query string>
149**********************
150<unixtime> <hostname>
151<query string>
152**********************
153.
154.
155.
156</pre>
157* <output_path>/query_timing.txt - detailed version of the stdout stats, dumped every hour.
158* <output_path>/query_counts.llsd - LLSD dump used for merge_tables.
159'''Currently <output_path> is hardcoded as ./<hostname>/ - sorry.'''
160== Old documentation ==
161Update: mysql_watcher now has a companion script merge_table that merges useful metadata.
162
163<pre>
164Locations of scripts:
165
166All of the database scripts that I use reside here:
167svn+ssh://svn.lindenlab.com/linden/release/scripts/scaling
168
169mysql_watcher is the script that you want to use to gather real-time data. You'll need the root ssh key added to run it (so it can run tcpdump on the database machine). You'll also want to run it on a fairly beefy box - the stations don't have quite enough juice to run it when mysql.agni is going full tilt. You may have to export PYTHONPATH if you're not running on a box with a code deploy.
170
171Run it using the command "mysql_watcher mysql.agni.lindenlab.com". It does some stuff with popen, so it seems to misbehave sometimes when the output is piped, and I haven't had time to fix it yet.
172
173When you break out, it generates raw output in two files:
174- query_info_dump.llsd
175- query_counts.llsd
176
177The next step is to merge this count data with the metadata in query_info.llsd using merge_tables - which generates text and HTML output.
178
179query_info.llsd is an LLDS document that contains a list of all known queries, and currently has a map with metadata (human-editable) such as the host which called it, notes, and the tables accessed. When merge_tables is run, it takes this metadata and merges it with the counts so that you don't have to do it by hand every time. Check this file into svn whenever you make updates or tweaks to it.
180
181Known issues:
182- If a query comes from multiple classes of hosts, the merge_tables script will currently ignore the data from the host which generated less queries. This could result in query counts which are less than expected.
183
184- merge_tables currently attempts to scp the HTML version into my public_html folder in my homedir, which obviously won't work for you.
185
186- Because this isn't a heavily tested script, merge_tables won't automatically generate a new version of query_info.llsd when it encounters new queries it hasn't seen before. You'll want to edit the script to set the merge_query_info variable to True.
187
188The cool thing that I want to do which I haven't done yet to these scripts is to put metadata for how expensive a query is into the query_info.llsd file - that way we can generate good weightings for how much individual queries are affecting the DB, and even do lookups in the metadata to figure out what's hurting us in real time.
189
190Let me know if you have more questions.
191
192- Doug
193</pre>
194
195mysql_watcher is a script that parses the results of tcpdump on a mysql host, and generates real-time statistical output of what queries are running on the host. It currently lives in svn in /linden/scripts/scaling/mysql_watcher. The following are the docs that I wrote at the top of the script:
196
197<pre>
198Reads the input of a tcpdump stream using -s 0 -A, and generates either real-time information
199or summary information.
200
201On the server (NOT the database, but something like a station) (which will be reading the stream), run:
202
203doug@station0$ nc -l -p 9999 | /usr/sbin/tcpdump -r - -A -s 0 | ./mysql_watcher
204
205This listens to data using netcat on port 9999, and feeds it through tcpdump which parses it into a format
206that mysql_watcher likes.
207
208On the client (on the mysql database host), run:
209
210doug@mysql.agni$ sudo /usr/sbin/tcpdump -n -s 0 -w - -i eth0 port 3306 and dst host db2c6 | nc station0 9999
211
212This takes the tcpdump raw packet output and shoves it to the server on station0, running on port 9999.
213
214On the station0 host it will output every 5 seconds the top queries (currently 50 of them). On breaking out
215of the server, it will dump a summary of ALL of the queries.
216
217At some point in the future this will be made more glamorous, and less finicky, but this works.
218</pre>
219
220Further improvements include adding command line parameters, and using curses to allow you to interact with and tweak the output. Also, using popen and/or llspawner to automatically open up all of the pipes to make it all easier.
221
222'''Note: You will probably want to verify that nobody else is running this at the time. It only seems to generate around 3% load on mysql.agni running near peak, but...'''