===============
Scrape the Web
===============

* Asheesh Laroia
* http://pycon09.asheesh.org/

Things to remember
===================

* Scrapists are people who scrape web pages.
* Ignore terms of service at your own peril.
* DO NOT BECOME AN EVIL COMMENT SPAMMER
* Hidden form fields make traps for automated web scrapers that look for emails
* If it's on the web, you can scrape it

  - Now you have an API for everything

Why scrape the web?
--------------------

* You can maintain interoperability with unmaintained systems
* "Rogue interoperability"

Three perspectives
---------------------

* Be a maverick
* Use the old standby XML tools
* "Use the web tools against the web"

Choosing a parser
------------------

* Performance
* Ease of use

----

Some tools
===========

* BeautifulSoup is old and not maintained anymore
* html5lib

  - builds BeautifulSoup objects
  - builds ElementTrees

* lxml provides XPath (see the parsing sketch after this list)
* mechanize (so much better than urllib2!)
* pyquery
* DOMForm is a Python module for web scraping, but it is not maintained
* python-spidermonkey is a bridge between Python and SpiderMonkey

  - lets you run JavaScript inside of Python!

* Firefox via Selenium RC. Really powerful. The downside is that Selenium instances are not cheap.
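
A minimal Python sketch of the lxml + XPath approach; the HTML snippet and
expressions here are illustrative, not from the talk::

    from lxml import html

    # Real pages would come from mechanize or urllib2 responses
    page = html.fromstring("""
    <html><body>
      <a href="/one">One</a>
      <a href="/two">Two</a>
    </body></html>""")

    # One XPath expression pulls out every link target
    for href in page.xpath('//a/@href'):
        print(href)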

Mechanize
-----------

* You may need to turn off robots control (as in the sketch below)

  - mechanize.Browser().set_handle_robots(False)

* See some other tricks
* Handles hidden fields and cookies
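
A minimal Python 2-era sketch of driving mechanize with robots handling turned
off; the URL and form field names are placeholders, not from the talk::

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)   # skip robots.txt checks -- at your own peril
    br.addheaders = [('User-agent', 'Mozilla/5.0')]  # look like a real browser

    br.open('http://example.com/login')   # placeholder URL
    br.select_form(nr=0)                  # grab the first form on the page
    br['username'] = 'alice'              # hidden fields ride along automatically
    response = br.submit()                # cookies persist across requests
    print(response.read()[:200])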

HTTP Headers for urllib
------------------------

* You need to set headers yourself when using urllib2 (sketch below)
* 2xx: Success
* 3xx: Redirection
* 4xx: Error
* JavaScript behavior does not mirror Firefox
* Image download behavior
* Cookie behavior
* Invalid HTML handling behavior
* Not emulating web browsers can cause all kinds of fun and random things to happen
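
A minimal Python 2-era sketch of sending a custom User-Agent with urllib2 (on
Python 3 the same API lives in urllib.request); the header value is
illustrative::

    import urllib2

    request = urllib2.Request(
        'http://example.com/',
        headers={'User-Agent': 'Mozilla/5.0'},  # many sites block urllib2's default agent
    )
    response = urllib2.urlopen(request)  # 3xx redirects are followed; 4xx raises HTTPError
    print(response.getcode())            # 2xx on success
    print(response.read()[:200])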

----

HTTP Methods
=============

* GET is for requesting a page
* POST is for submitting a form to a server
* PUT is for uploading files in WebDAV
* BREW is because POST is trademarked by Post.
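
In urllib2, passing a data argument is what turns a request into a POST; a
minimal sketch with a placeholder form field::

    import urllib
    import urllib2

    # GET: no body, parameters ride in the URL
    urllib2.urlopen('http://example.com/search?q=scraping')

    # POST: urlencoded body, like a browser form submission
    data = urllib.urlencode({'q': 'scraping'})
    urllib2.urlopen('http://example.com/search', data)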

robots.txt
-----------

* Check out http://www.robotstxt.org
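
The standard library can check robots.txt rules for you; a minimal Python
sketch using the robotparser module (urllib.robotparser on Python 3)::

    import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url('http://example.com/robots.txt')
    rp.read()   # fetch and parse the file

    # True if the rules allow this user agent to fetch this path
    print(rp.can_fetch('*', 'http://example.com/secret/page'))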

Getting around IP address limits
================================

* ssh -D to borrow the IP of a machine you can log in to
* ssh -D 1080 asheesh.org

  - Binds port 1080 on your local machine
  - Sets up a SOCKS proxy to fake out people wondering who you are
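
One way to route urllib2 through that tunnel: a sketch assuming the
third-party SocksiPy module (imported as ``socks``); the module and addresses
are assumptions, not from the talk::

    import socket
    import socks     # SocksiPy, third-party
    import urllib2

    # Point all new sockets at the proxy that ``ssh -D 1080 asheesh.org`` opened
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, 'localhost', 1080)
    socket.socket = socks.socksocket

    # This request now leaves from asheesh.org's IP address
    print(urllib2.urlopen('http://example.com/').read()[:200])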

----

Handling CAPTCHA
=================

* Sometimes people only have a limited set of images
* Audio analysis
* Use Selenium RC so a human can do the CAPTCHA step by hand (sketch below)
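
A Python sketch of the old Selenium RC client driving a real Firefox and
pausing for a human to solve the CAPTCHA; host, port, URL, and locators are
placeholders::

    from selenium import selenium   # the Selenium RC client, not WebDriver

    sel = selenium('localhost', 4444, '*firefox', 'http://example.com/')
    sel.start()
    sel.open('/signup')   # a real browser, so the CAPTCHA renders normally

    raw_input('Solve the CAPTCHA in the Firefox window, then press Enter...')

    sel.type('id=username', 'alice')   # placeholder locator and value
    sel.click('id=submit')
    sel.wait_for_page_to_load('30000')
    sel.stop()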