# Piwik Server Log Analytics: Import your server logs in Piwik!

## Requirements

* Python 2.6 or 2.7. Python 3.x is not supported.
* Update to Piwik 1.11
* OrderedDict is optional (see https://pypi.python.org/pypi/ordereddict for more details).

## How to use this script?

The simplest way to import your logs is to run:

    ./import_logs.py --url=piwik.example.com /path/to/access.log

You must specify your Piwik URL with the `--url` argument.
The script will automatically read your config.ini.php file to get the authentication
token and communicate with your Piwik install to import the lines.

The default mode will try to mimic the JavaScript tracker as much as possible,
and will not track bots, static files, or error requests.

If you wish to track all requests, use the following command:

    python /path/to/piwik/misc/log-analytics/import_logs.py --url=http://mysite/piwik/ --idsite=1234 --recorders=4 --enable-http-errors --enable-http-redirects --enable-static --enable-bots access.log

### Format Specific Details

* If you are importing Netscaler log files, make sure to specify the **--iis-time-taken-secs** option. Netscaler stores
  the time-taken field in seconds while most other formats use milliseconds. Using this option will ensure that the
  log importer interprets the field correctly.

## How to import your logs automatically every day?

You must first make sure your logs are automatically rotated every day. The most
popular ways to implement this are:

* logrotate: http://www.linuxcommand.org/man_pages/logrotate8.html
  It works with any HTTP daemon.
* rotatelogs: http://httpd.apache.org/docs/2.0/programs/rotatelogs.html
  It only works with Apache.
* let us know what else is useful and we will add it to the list

Your logs should be automatically rotated and stored on your webserver, for instance in daily logs
`/var/log/apache/access-%Y-%m-%d.log` (where %Y, %m and %d represent the year,
month and day).

You can then import your logs automatically each day (at 01:00). Set up a cron job with the command:

    0 1 * * * /path/to/piwik/misc/log-analytics/import_logs.py -u piwik.example.com `date --date=yesterday +/var/log/apache/access-\%Y-\%m-\%d.log`
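
The `date --date=yesterday` expansion in the cron command resolves to the path of the previous day's log. As a rough illustration only, the same computation can be sketched in Python. The function name `yesterdays_log` is hypothetical, and the path pattern is simply the example one used above.

```python
from datetime import date, timedelta

# Assumed daily log naming scheme, copied from the example path above.
LOG_PATTERN = "/var/log/apache/access-%Y-%m-%d.log"


def yesterdays_log(today=None, pattern=LOG_PATTERN):
    """Return the path of the log file written on the day before `today`."""
    today = today or date.today()
    return (today - timedelta(days=1)).strftime(pattern)


# Run on 1 September 2014, the cron job would import the log of 31 August.
print(yesterdays_log(date(2014, 9, 1)))
```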

## Performance

With an Intel Core i5-2400 @ 3.10GHz (2 cores, 4 virtual cores with Hyper-threading),
running Piwik and its MySQL database, between 250 and 300 records were imported per second.

The import_logs.py script needs CPU to read and parse the log files, but it is actually
the Piwik server itself (i.e. PHP/MySQL) which will use more CPU during data import.

To improve performance:

1. By default, the script uses one thread to parse and import log lines.
   You can use the `--recorders` option to specify the number of parallel threads which will
   import hits into Piwik. We recommend setting `--recorders=N`, where N is the number of CPU cores
   of the server hosting Piwik. The parsing will still be single-threaded,
   but several hits will be tracked in Piwik at the same time.
2. The script will issue hundreds of requests to piwik.php. To improve the Piwik webserver performance,
   you can disable server access logging for these requests.
   Each Piwik webserver (Apache, Nginx, IIS) can also be tweaked a bit to handle more req/sec.
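
As a small sketch of the first recommendation, the `--recorders` value can be derived from the core count. This assumes the snippet runs on the same machine that hosts Piwik. The helper name `recorders_option` is hypothetical and is not part of import_logs.py.

```python
import multiprocessing


def recorders_option(cores=None):
    """Build the --recorders option for a given number of CPU cores."""
    if cores is None:
        # Assumption: this code runs on the server hosting Piwik.
        cores = multiprocessing.cpu_count()
    return "--recorders=%d" % max(1, cores)


print(recorders_option(4))
```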

## Setup Apache CustomLog that directly imports in Piwik

Since Apache CustomLog directives can send log data to a script, it is possible to import hits into Piwik server-side in real time rather than processing a logfile each day.

This approach has many advantages, including real-time data being available on your Piwik site, using real log files instead of relying on client-side JavaScript, and not having a surge of CPU/RAM usage during log processing.

The disadvantage is that if Piwik is unavailable, logging data will be lost. Therefore we recommend also logging to a standard log file. Bear in mind also that Apache processes will wait until a request is logged before processing a new request, so if Piwik runs slowly, so does your site. It is therefore important to tune `--recorders` to the right level.

In the most basic setup, you might have in your main config section:

```
# Set up your log format as a normal extended format, with hostname at the start
LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" myLogFormat
# Log to a file as usual
CustomLog /path/to/logfile myLogFormat
# Log to piwik as well
CustomLog "|/path/to/import_logs.py --option1 --option2 ... -" myLogFormat
```

Note: on Debian/Ubuntu, the default configuration defines the vhost_combined format. You
can use it instead of defining myLogFormat.

Useful options here are:

* `--add-sites-new-hosts` (creates new websites in Piwik based on %v in the LogFormat)
* `--output=/path/to/piwik.log` (puts any output into a log file for reference/debugging later)
* `--recorders=4` (use whatever value seems sensible for you; higher traffic sites will need more recorders to keep up)
* `-` so it reads straight from /dev/stdin

You can have as many CustomLog statements as you like. However, if you define any CustomLog directives within a `<VirtualHost>` block, all CustomLogs in the main config will be overridden. Therefore, if you require custom logging for particular VirtualHosts, it is recommended to use mod_macro to make configuration more maintainable.

## Advanced Log Analytics use case: Apache vhost, custom logs, automatic website creation

As a rather extreme example of what you can do, here is an Apache config with:

* standard logging in the main config area for the majority of VirtualHosts
* customised logging in a particular VirtualHost to change the hostname (for instance, if a particular VirtualHost should be logged as if it were a different site)
* customised logging in another VirtualHost which creates new websites in Piwik for subsites (e.g. to have domain.com/subsite1 as a whole website in its own right). This requires setting up a custom `--log-format-regex` to allow "/" in the hostname section (NB the escaping necessary for Apache to pass the regex through to Piwik properly), and also multiple CustomLog directives so the subsite gets logged to both the domain.com and domain.com/subsite1 websites in Piwik
* mod_rewrite rules that set environment variables, so that if you have multiple subsites with the same format, e.g. /subsite1, /subsite2, etc., you can automatically create a new Piwik website for each one without having to configure them manually

NB use of mod_macro to ensure consistency and maintainability.

## Apache configuration source code:

```
# Set up macro with the options
# * $vhost (this will be used as the piwik website name),
# * $logname (the name of the LogFormat we're using),
# * $output (which logfile to save import_logs.py output to),
# * $env (CustomLog can be set only to fire if an environment variable is set - this contains that environment variable, so subsites only log when it's set)
# NB the --log-format-regex line is exactly the same regex as import_logs.py's own 'common_vhost' format, but with "\/" added in the "host" section's allowed characters
<Macro piwiklog $vhost $logname $output $env>
LogFormat "$vhost %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" $logname
CustomLog "|/path/to/piwik/misc/log-analytics/import_logs.py \
--add-sites-new-hosts \
--config=/path/to/piwik/config/config.ini.php \
--url='http://your.piwik.install/' \
--recorders=4 \
--log-format-regex='(?P<host>[\\\\w\\\\-\\\\.\\\\/]*)(?::\\\\d+)? (?P<ip>\\\\S+) \\\\S+ \\\\S+ \\\\[(?P<date>.*?) (?P<timezone>.*?)\\\\] \\\"\\\\S+ (?P<path>.*?) \\\\S+\\\" (?P<status>\\\\S+) (?P<length>\\\\S+) \\\"(?P<referrer>.*?)\\\" \\\"(?P<user_agent>.*?)\\\"' \
--output=/var/log/piwik/$output.log \
-" \
$logname \
$env
</Macro>

# Set up main apache logging, with:
# * normal %v as hostname,
# * vhost_common as logformat name,
# * /var/log/piwik/main.log as the logfile,
# * no env variable needed since we always want to trigger
Use piwiklog %v vhost_common main " "

<VirtualHost>
ServerName example.com
# Set this host to log to piwik with a different hostname (and using a different output file, /var/log/piwik/example_com.log)
Use piwiklog "another-host.com" vhost_common example_com " "
</VirtualHost>

<VirtualHost>
ServerName domain.com
# We want to log this normally, so repeat the CustomLog from the main section
# (if this is omitted, our other CustomLogs below will override the one in the main section, so the main site won't be logged)
Use piwiklog %v vhost_common main " "

# Now set up mod_rewrite to detect our subsites and set up new piwik websites to track just hits to these (this is a bit like profiles in Google Analytics).
# We want to match domain.com/anothersubsite and domain.com/subsite[0-9]+

# First to be on the safe side, unset the env we'll use to test if we're in a subsite:
UnsetEnv vhostLogName

# Subsite definitions. NB check for both URI and REFERER (some files used in a page, or downloads linked from a page, may not reside within our subsite directory):
# Do the one-off subsite first:
RewriteCond %{REQUEST_URI} ^/anothersubsite(/|$) [OR]
RewriteCond %{HTTP_REFERER} domain\.com/anothersubsite(/|$)
RewriteRule ^/.* - [E=vhostLogName:anothersubsite]

# Subsite of the form /subsite[0-9]+. NB the capture brackets in the RewriteCond rules which get mapped to %1 in the RewriteRule
RewriteCond %{REQUEST_URI} ^/(subsite[0-9]+)(/|$) [OR]
RewriteCond %{HTTP_REFERER} domain\.com/(subsite[0-9]+)(/|$)
RewriteRule ^/.* - [E=vhostLogName:subsite%1]

# Now set the logging to piwik setting:
# * the hostname to domain.com/<subsitename>
# * the logformat to vhost_domain_com_subsites (can be anything so long as it's unique)
# * the output to go to /var/log/piwik/domain_com_subsites.log (again, can be anything)
# * triggering only when the env variable is set, so requests to other URIs on this domain don't call this logging rule
Use piwiklog domain.com/%{vhostLogName}e vhost_domain_com_subsites domain_com_subsites env=vhostLogName
</VirtualHost>
```
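
The heavy escaping in the `--log-format-regex` line can hide what the pattern actually matches. The sketch below assumes the escaping layers reduce each `\\\\` to a single backslash, and checks the resulting pattern against a made-up log line whose host field contains a `/`. The sample line and variable names are hypothetical.

```python
import re

# The macro's regex with the Apache-level escaping removed (assumed result).
VHOST_REGEX = re.compile(
    r'(?P<host>[\w\-\.\/]*)(?::\d+)? (?P<ip>\S+) \S+ \S+ '
    r'\[(?P<date>.*?) (?P<timezone>.*?)\] '
    r'"\S+ (?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) '
    r'"(?P<referrer>.*?)" "(?P<user_agent>.*?)"'
)

# Hypothetical line as the subsite LogFormat would write it.
SAMPLE = (
    'domain.com/subsite1 1.2.3.4 - - [31/Aug/2014:23:59:59 +0200] '
    '"GET /subsite1/index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"'
)

fields = VHOST_REGEX.match(SAMPLE).groupdict()
# The host group keeps the subsite suffix, which is what lets
# --add-sites-new-hosts treat it as a separate website.
print(fields["host"], fields["path"], fields["status"])
```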

## Nginx Virtual Host Log Format

This log format can be specified for nginx access logs to capture multiple virtual hosts:

* `log_format vhosts '$host $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"';`
* `access_log /PATH/TO/access.log vhosts;`

When executing import_logs.py, specify the "common_complete" format.

## Import Page Speed Metric from logs

In the Piwik > Actions > Page URLs and Page Title reports, Piwik reports the Avg. generation time as an indicator of your website speed.
This metric works by default when using the JavaScript tracker, but you can use it with log files as well.

Apache can log the generation time in microseconds using %D in the LogFormat.
This metric can be imported using a custom log format in this script.
On the command line, add a `--log-format-regex` parameter that contains the group generation_time_micro.

Here's an example:

    Apache LogFormat "%h %l %u %t \"%r\" %>s %b %D"
    --log-format-regex="(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] \"\S+ (?P<path>.*?) \S+\" (?P<status>\S+) (?P<length>\S+) (?P<generation_time_micro>\S+)"

Note: the group `generation_time_milli` is also available if your server logs generation time in milliseconds rather than microseconds.
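
Before running an import, it can help to check a custom regex against one line of your own log. The following sketch applies the pattern from the example above (with the shell-level `\"` escaping removed) to a hypothetical log line in the matching LogFormat. It does not show how import_logs.py itself processes the group.

```python
import re

# Same pattern as the --log-format-regex example, written as a Python raw string.
PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] '
    r'"\S+ (?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) '
    r'(?P<generation_time_micro>\S+)'
)

# Hypothetical line produced by LogFormat "%h %l %u %t \"%r\" %>s %b %D".
SAMPLE = '1.2.3.4 - - [31/Aug/2014:23:59:59 +0200] "GET /index.php HTTP/1.0" 200 3838 365020'

fields = PATTERN.match(SAMPLE).groupdict()
generation_micro = int(fields["generation_time_micro"])
# %D is in microseconds; integer-divide by 1000 for whole milliseconds.
generation_milli = generation_micro // 1000
print(fields["path"], generation_micro, generation_milli)
```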

## Set up Nginx to import directly into Piwik via syslog

With the syslog patch from http://wiki.nginx.org/3rdPartyModules, which is compiled into dotdeb's release, you can log to syslog and import the logs live into Piwik.
Path: Nginx -> syslog -> (syslog central server) -> this script -> piwik

You can use any log format that this script can handle, such as Apache Combined or the JSON format, which needs less processing.

### Setup Nginx logs

```
http {
    ...
    log_format piwik '{"ip": "$remote_addr",'
                     '"host": "$host",'
                     '"path": "$request_uri",'
                     '"status": "$status",'
                     '"referrer": "$http_referer",'
                     '"user_agent": "$http_user_agent",'
                     '"length": $bytes_sent,'
                     '"generation_time_milli": $request_time,'
                     '"date": "$time_iso8601"}';
    ...
    server {
        ...
        access_log syslog:info piwik;
        ...
    }
}
```
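
Each line written with this `log_format` is a single JSON object, which is why it needs less processing than a regex-based format. As an illustration only, the sketch below parses a hypothetical line of that shape with Python's `json` module. Note that nginx expresses `$request_time` in seconds with millisecond resolution, so the conversion shown here only illustrates the units and is not taken from import_logs.py.

```python
import json

# Hypothetical line in the shape the `piwik` log_format above would emit.
SAMPLE = (
    '{"ip": "1.2.3.4",'
    '"host": "www.example.com",'
    '"path": "/index.php",'
    '"status": "200",'
    '"referrer": "-",'
    '"user_agent": "Mozilla/5.0",'
    '"length": 3838,'
    '"generation_time_milli": 0.250,'
    '"date": "2014-08-31T23:59:59+02:00"}'
)

hit = json.loads(SAMPLE)
# $request_time is in seconds, so multiply by 1000 to get milliseconds.
generation_ms = int(round(hit["generation_time_milli"] * 1000))
print(hit["host"], hit["path"], hit["status"], generation_ms)
```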

### Setup syslog-ng

This is the configuration for the central syslog server, if you have one. If not, you can use the same configuration on the server running Nginx.

```
options {
    stats_freq(600); stats_level(1);
    log_fifo_size(1280000);
    log_msg_size(8192);
};
source s_nginx { udp(); };
destination d_piwik {
    program("/usr/local/piwik/piwik.sh" template("$MSG\n"));
};
log { source(s_nginx); filter(f_info); destination(d_piwik); };
```

### piwik.sh

This wrapper script is only needed to set the parameters for import_logs.py:

```
#!/bin/sh
exec python /path/to/misc/log-analytics/import_logs.py \
    --url=http://localhost/ --token-auth=<your_auth_token> \
    --idsite=1 --recorders=4 --enable-http-errors --enable-http-redirects --enable-static --enable-bots \
    --log-format-name=nginx_json -
```

## Regex example for syslog format (centralized logs)

### Log format example

```
Aug 31 23:59:59 tt-srv-name www.tt.com: 1.1.1.1 - - [31/Aug/2014:23:59:59 +0200] "GET /index.php HTTP/1.0" 200 3838 "http://www.tt.com/index.php" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0" 365020 www.tt.com
```

### Corresponding regex

```
--log-format-regex='.* ((?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ (?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)").*'
```
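
To see how the leading `.* ` skips the syslog prefix, the regex can be tried against the sample line with Python's `re` module. This is only a quick check of the pattern itself, not a description of how import_logs.py applies it.

```python
import re

# The regex from above, unchanged, as a Python raw string.
SYSLOG_REGEX = re.compile(
    r'.* ((?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] '
    r'"\S+ (?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) '
    r'"(?P<referrer>.*?)" "(?P<user_agent>.*?)").*'
)

# The sample syslog line from the log format example above.
SAMPLE = (
    'Aug 31 23:59:59 tt-srv-name www.tt.com: 1.1.1.1 - - '
    '[31/Aug/2014:23:59:59 +0200] "GET /index.php HTTP/1.0" 200 3838 '
    '"http://www.tt.com/index.php" '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0" '
    '365020 www.tt.com'
)

fields = SYSLOG_REGEX.match(SAMPLE).groupdict()
# The syslog timestamp, server name and tag are consumed by the leading ".* ".
print(fields["ip"], fields["date"], fields["path"], fields["status"])
```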

And that's all!