
:mod:`robotparser` ---  Parser for robots.txt
=============================================
.. module:: robotparser
   :synopsis: Loads a robots.txt file and answers questions about
              fetchability of other URLs.
.. sectionauthor:: Skip Montanaro <skip@pobox.com>

.. index::
   single: WWW
   single: World Wide Web
   single: URL
   single: robots.txt
.. note::
   The :mod:`robotparser` module has been renamed to :mod:`urllib.robotparser`
   in Python 3.0.  The :term:`2to3` tool will automatically adapt imports when
   converting your sources to 3.0.
This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether or not a particular user agent can fetch a URL on the
Web site that published the :file:`robots.txt` file.  For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.
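In the original syntax described there, a :file:`robots.txt` file consists of
records made up of ``User-agent`` lines naming a robot (or ``*`` for all
robots) followed by ``Disallow`` lines giving path prefixes that robot may not
fetch.  For illustration, a minimal hypothetical file might look like this::

   User-agent: *
   Disallow: /cgi-bin/
   Disallow: /private/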
.. class:: RobotFileParser()

   This class provides a set of methods to read, parse and answer questions
   about a single :file:`robots.txt` file.
   .. method:: set_url(url)

      Sets the URL referring to a :file:`robots.txt` file.
   .. method:: read()

      Reads the :file:`robots.txt` URL and feeds it to the parser.
   .. method:: parse(lines)

      Parses the given list of lines from a :file:`robots.txt` file.
   .. method:: can_fetch(useragent, url)

      Returns ``True`` if the *useragent* is allowed to fetch the *url*
      according to the rules contained in the parsed :file:`robots.txt`
      file.
   .. method:: mtime()

      Returns the time the ``robots.txt`` file was last fetched.  This is
      useful for long-running web spiders that need to check for new
      ``robots.txt`` files periodically.
   .. method:: modified()

      Sets the time the ``robots.txt`` file was last fetched to the current
      time.
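Rules can also be applied without any network access by passing the file's
lines to :meth:`parse` directly instead of calling :meth:`read`.  The sketch
below uses a hypothetical set of rules (the ``example.com`` URLs are
illustrative only); it also calls :meth:`modified` so that :meth:`mtime`
reports a fetch time:

```python
try:
    import robotparser              # Python 2
except ImportError:
    from urllib import robotparser  # renamed urllib.robotparser in Python 3.0

rp = robotparser.RobotFileParser()

# Hypothetical robots.txt content, fed to the parser as a list of lines
# instead of being fetched over the network with read().
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
])

# Record the current time as the moment the rules were (notionally) fetched.
rp.modified()

print(rp.can_fetch("*", "http://example.com/cgi-bin/search"))  # False
print(rp.can_fetch("*", "http://example.com/index.html"))      # True
print(rp.mtime() > 0)                                          # True
```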
The following example demonstrates basic use of the :class:`RobotFileParser`
class::

   >>> import robotparser
   >>> rp = robotparser.RobotFileParser()
   >>> rp.set_url("http://www.musi-cal.com/robots.txt")
   >>> rp.read()
   >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
   False
   >>> rp.can_fetch("*", "http://www.musi-cal.com/")
   True