.. _topics-selectors:

===============
XPath Selectors
===============

When you're scraping web pages, the most common task you need to perform is
to extract data from the HTML source. There are several libraries available to
achieve this:

* `BeautifulSoup`_ is a very popular screen scraping library among Python
  programmers. It constructs a Python object based on the structure of the
  HTML code and deals with bad markup reasonably well, but it has one
  drawback: it's slow.

* `lxml`_ is an XML parsing library (which also parses HTML) with a Pythonic
  API based on `ElementTree`_. Unlike ElementTree, lxml is not part of the
  Python standard library.

Scrapy comes with its own mechanism for extracting data, called XPath
selectors (or just "selectors", for short) because they "select" certain parts
of the HTML document specified by `XPath`_ expressions.

`XPath`_ is a language for selecting nodes in XML documents, which can also be
used with HTML.

Both `lxml`_ and Scrapy Selectors are built over the `libxml2`_ library, which
means they're very similar in speed and parsing accuracy.

This page explains how selectors work and describes their API, which is very
small and simple. The `lxml`_ API is much bigger because the `lxml`_ library
can be used for many other tasks besides selecting markup documents.

For a complete reference of the selectors API see the :ref:`XPath selector
reference <topics-selectors-ref>`.

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
.. _lxml: http://codespeak.net/lxml/
.. _ElementTree: http://docs.python.org/library/xml.etree.elementtree.html
.. _libxml2: http://xmlsoft.org/
.. _XPath: http://www.w3.org/TR/xpath

Using selectors
===============

Constructing selectors
----------------------

There are two types of selectors bundled with Scrapy:

* :class:`~scrapy.selector.HtmlXPathSelector` - for working with HTML documents

* :class:`~scrapy.selector.XmlXPathSelector` - for working with XML documents

.. highlight:: python

Both share the same selector API, and are constructed with a Response object as
their first parameter. This is the Response they're going to be "selecting".

Example::

    from scrapy.selector import HtmlXPathSelector, XmlXPathSelector

    hxs = HtmlXPathSelector(response) # an HTML selector
    xxs = XmlXPathSelector(response) # an XML selector

Using selectors with XPaths
---------------------------

To explain how to use the selectors we'll use the `Scrapy shell` (which
provides interactive testing) and an example page located in the Scrapy
documentation server:

    http://doc.scrapy.org/_static/selectors-sample1.html

.. _topics-selectors-htmlcode:

Here's its HTML code:

.. literalinclude:: ../_static/selectors-sample1.html
   :language: html

.. highlight:: sh

First, let's open the shell::

    scrapy shell http://doc.scrapy.org/_static/selectors-sample1.html

Then, after the shell loads, you'll have some selectors already instantiated and
ready to use.

Since we're dealing with HTML, we'll be using the
:class:`~scrapy.selector.HtmlXPathSelector` object which is found, by default, in
the ``hxs`` shell variable.

.. highlight:: python

So, by looking at the :ref:`HTML code <topics-selectors-htmlcode>` of that page,
let's construct an XPath (using an HTML selector) for selecting the text inside
the title tag::

    >>> hxs.select('//title/text()')
    [<HtmlXPathSelector (text) xpath=//title/text()>]

As you can see, the ``select()`` method returns an ``XPathSelectorList``, which
is a list of new selectors. This API can be used to quickly extract nested
data.

To actually extract the textual data, you must call the selector's
``extract()`` method, as follows::

    >>> hxs.select('//title/text()').extract()
    [u'Example website']

Now we're going to get the base URL and some image links::

    >>> hxs.select('//base/@href').extract()
    [u'http://example.com/']

    >>> hxs.select('//a[contains(@href, "image")]/@href').extract()
    [u'image1.html',
     u'image2.html',
     u'image3.html',
     u'image4.html',
     u'image5.html']

    >>> hxs.select('//a[contains(@href, "image")]/img/@src').extract()
    [u'image1_thumb.jpg',
     u'image2_thumb.jpg',
     u'image3_thumb.jpg',
     u'image4_thumb.jpg',
     u'image5_thumb.jpg']

Using selectors with regular expressions
----------------------------------------

Selectors also have a ``re()`` method for extracting data using regular
expressions. However, unlike the ``select()`` method, the ``re()`` method
does not return a list of :class:`~scrapy.selector.XPathSelector` objects. It
returns a list of unicode strings, so you can't construct nested ``.re()``
calls.

Here's an example used to extract image names from the :ref:`HTML code
<topics-selectors-htmlcode>` above::

    >>> hxs.select('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
    [u'My image 1',
     u'My image 2',
     u'My image 3',
     u'My image 4',
     u'My image 5']

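
The idea behind ``re()`` can be sketched with Python's standard ``re`` module
alone. This is only an illustration of "apply a regex to each text node and
collect the captured groups as plain strings", not Scrapy's implementation.
The ``texts`` list is made-up sample data standing in for the link texts.

```python
import re

# Hypothetical text nodes, modelled on the link texts of the sample page.
texts = ["Name: My image 1", "Name: My image 2", "no match here"]

pattern = re.compile(r"Name:\s*(.*)")

# Collect the captured group of every match, flattened into a single list.
names = [match.group(1) for text in texts for match in pattern.finditer(text)]
print(names)
```

The result is a plain list of strings, which is why there is nothing left to
call ``select()`` or ``re()`` on afterwards.
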
.. _topics-selectors-nesting-selectors:

Nesting selectors
-----------------

The ``select()`` selector method returns a list of selectors, so you can call
``select()`` on those selectors too. Here's an example::

    >>> links = hxs.select('//a[contains(@href, "image")]')
    >>> links.extract()
    [u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
     u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
     u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
     u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
     u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

    >>> for index, link in enumerate(links):
    ...     args = (index, link.select('@href').extract(), link.select('img/@src').extract())
    ...     print 'Link number %d points to url %s and image %s' % args
    ...
    Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
    Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
    Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
    Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
    Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']

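
The same two-step pattern (query once for the links, then query again inside
each link) can be tried without Scrapy. The sketch below uses the standard
library's ``xml.etree.ElementTree``, which supports only a limited subset of
XPath and needs well-formed markup. The ``snippet`` string is hypothetical
sample data, not the actual sample page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet modelled on the sample page.
snippet = """
<div>
  <a href="image1.html">Name: My image 1 <br/><img src="image1_thumb.jpg"/></a>
  <a href="image2.html">Name: My image 2 <br/><img src="image2_thumb.jpg"/></a>
</div>
"""
root = ET.fromstring(snippet.strip())

# First query: find the links. Second query: look inside each link.
rows = []
for index, link in enumerate(root.findall(".//a")):
    href = link.get("href")
    src = link.find("img").get("src")
    rows.append((index, href, src))
    print("Link number %d points to url %s and image %s" % (index, href, src))
```
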
.. _topics-selectors-relative-xpaths:

Working with relative XPaths
----------------------------

Keep in mind that if you are nesting XPathSelectors and use an XPath that
starts with ``/``, that XPath will be absolute to the document and not relative
to the ``XPathSelector`` you're calling it from.

For example, suppose you want to extract all ``<p>`` elements inside ``<div>``
elements. First, you would get all ``<div>`` elements::

    >>> divs = hxs.select('//div')

At first, you may be tempted to use the following approach, which is wrong, as
it actually extracts all ``<p>`` elements from the document, not only those
inside ``<div>`` elements::

    >>> for p in divs.select('//p'):  # this is wrong - gets all <p> from the whole document
    ...     print p.extract()

This is the proper way to do it (note the dot prefixing the ``.//p`` XPath)::

    >>> for p in divs.select('.//p'):  # extracts all <p> inside
    ...     print p.extract()

Another common case would be to extract all direct ``<p>`` children::

    >>> for p in divs.select('p'):
    ...     print p.extract()

For more details about relative XPaths see the `Location Paths`_ section in the
XPath specification.

.. _Location Paths: http://www.w3.org/TR/xpath#location-paths

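
To see the difference between "anywhere in the document", "anywhere below
this node" and "direct children only" on a small input, here is a sketch that
uses the standard library's ``xml.etree.ElementTree`` instead of Scrapy
selectors. ElementTree implements only part of XPath and rejects absolute
paths on an element, so the whole-document case is shown with ``iter()``.
The markup is invented for the example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<body>"
    "<p>outside</p>"
    "<div><p>direct child</p><span><p>nested</p></span></div>"
    "</body>"
)
div = doc.find("div")

# Whole document: every <p>, including the one outside the <div>.
all_p = [p.text for p in doc.iter("p")]

# Relative, any depth: every <p> somewhere below this <div>.
descendants = [p.text for p in div.findall(".//p")]

# Relative, one level: only <p> elements that are direct children.
children = [p.text for p in div.findall("p")]

print(all_p, descendants, children)
```
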
.. _topics-selectors-ref:

Built-in XPath Selectors reference
==================================

.. module:: scrapy.selector
   :synopsis: XPath selectors classes

There are two types of selectors bundled with Scrapy:
:class:`HtmlXPathSelector` and :class:`XmlXPathSelector`. Both of them
implement the same :class:`XPathSelector` interface. The only difference is that
one is used to process HTML data and the other XML data.

XPathSelector objects
---------------------

.. class:: XPathSelector(response)

    A :class:`XPathSelector` object is a wrapper over a response, used to
    select certain parts of its content.

    ``response`` is a :class:`~scrapy.http.Response` object that will be used
    for selecting and extracting data.

    .. method:: select(xpath)

        Apply the given XPath relative to this XPathSelector and return a list
        of :class:`XPathSelector` objects (i.e. a :class:`XPathSelectorList`)
        with the result.

        ``xpath`` is a string containing the XPath to apply.

    .. method:: re(regex)

        Apply the given regex and return a list of unicode strings with the
        matches.

        ``regex`` can be either a compiled regular expression or a string which
        will be compiled to a regular expression using ``re.compile(regex)``.

    .. method:: extract()

        Return a unicode string with the content of this :class:`XPathSelector`
        object.

    .. method:: register_namespace(prefix, uri)

        Register the given namespace to be used in this :class:`XPathSelector`.
        Without registering namespaces you can't select or extract data from
        non-standard namespaces. See examples below.

    .. method:: __nonzero__()

        Returns ``True`` if there is any real content selected by this
        :class:`XPathSelector` or ``False`` otherwise. In other words, the
        boolean value of an XPathSelector is given by the contents it selects.

XPathSelectorList objects
-------------------------

.. class:: XPathSelectorList

    The :class:`XPathSelectorList` class is a subclass of the builtin ``list``
    class, which provides a few additional methods.

    .. method:: select(xpath)

        Call the :meth:`XPathSelector.select` method for all
        :class:`XPathSelector` objects in this list and return their results
        flattened, as a new :class:`XPathSelectorList`.

        ``xpath`` is the same argument as the one in
        :meth:`XPathSelector.select`.

    .. method:: re(regex)

        Call the :meth:`XPathSelector.re` method for all :class:`XPathSelector`
        objects in this list and return their results flattened, as a list of
        unicode strings.

        ``regex`` is the same argument as the one in :meth:`XPathSelector.re`.

    .. method:: extract()

        Call the :meth:`XPathSelector.extract` method for all
        :class:`XPathSelector` objects in this list and return their results
        flattened, as a list of unicode strings.

    .. method:: extract_unquoted()

        Call the :meth:`XPathSelector.extract_unquoted` method for all
        :class:`XPathSelector` objects in this list and return their results
        flattened, as a list of unicode strings. This method should not be
        applied to all kinds of XPathSelectors. For more info see
        :meth:`XPathSelector.extract_unquoted`.

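
"Call the method on every item and return the results flattened" may be
easier to picture with a toy example. The ``FlatList`` class and its
``apply()`` method below are hypothetical names invented for this sketch.
They are not part of Scrapy and do not reproduce how
:class:`XPathSelectorList` is actually implemented.

```python
class FlatList(list):
    """Hypothetical list subclass used only to illustrate flattening."""

    def apply(self, func):
        # Call func on every item (func returns a list of results) and
        # merge all the per-item lists into one new FlatList.
        return FlatList(result for item in self for result in func(item))


words = FlatList(["a b", "c d e"])

# Each string yields a list of words; the lists are merged into one.
flat = words.apply(str.split)

# The result is again a FlatList, so calls can be chained.
upper = flat.apply(lambda word: [word.upper()])
print(flat, upper)
```

Returning the same list type from each call is what makes chained calls such
as ``hxs.select(...).select(...)`` possible.
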
HtmlXPathSelector objects
-------------------------

.. class:: HtmlXPathSelector(response)

    A subclass of :class:`XPathSelector` for working with HTML content. It uses
    the `libxml2`_ HTML parser. See the :class:`XPathSelector` API for more
    info.

.. _libxml2: http://xmlsoft.org/

HtmlXPathSelector examples
~~~~~~~~~~~~~~~~~~~~~~~~~~

Here are a couple of :class:`HtmlXPathSelector` examples to illustrate several
concepts. In all cases, we assume there is already an
:class:`HtmlXPathSelector` instantiated with a :class:`~scrapy.http.Response`
object like this::

    x = HtmlXPathSelector(html_response)

1. Select all ``<h1>`` elements from an HTML response body, returning a list of
   :class:`XPathSelector` objects (i.e. a :class:`XPathSelectorList` object)::

       x.select("//h1")

2. Extract the text of all ``<h1>`` elements from an HTML response body,
   returning a list of unicode strings::

       x.select("//h1").extract()         # this includes the h1 tag
       x.select("//h1/text()").extract()  # this excludes the h1 tag

3. Iterate over all ``<p>`` tags and print their class attribute::

       for node in x.select("//p"):
           print node.select("@class").extract()

4. Extract textual data from all ``<p>`` tags without entities, as a list of
   unicode strings::

       x.select("//p/text()").extract_unquoted()

       # the following line is wrong. extract_unquoted() should only be used
       # with textual XPathSelectors
       x.select("//p").extract_unquoted()  # it may work but output is unpredictable

XmlXPathSelector objects
------------------------

.. class:: XmlXPathSelector(response)

    A subclass of :class:`XPathSelector` for working with XML content. It uses
    the `libxml2`_ XML parser. See the :class:`XPathSelector` API for more
    info.

XmlXPathSelector examples
~~~~~~~~~~~~~~~~~~~~~~~~~

Here are a couple of :class:`XmlXPathSelector` examples to illustrate several
concepts. In all cases we assume there is already an
:class:`XmlXPathSelector` instantiated with a :class:`~scrapy.http.Response`
object like this::

    x = XmlXPathSelector(xml_response)

1. Select all ``<product>`` elements from an XML response body, returning a
   list of :class:`XPathSelector` objects (i.e. a :class:`XPathSelectorList`
   object)::

       x.select("//product")

2. Extract all prices from a `Google Base XML feed`_ which requires registering
   a namespace::

       x.register_namespace("g", "http://base.google.com/ns/1.0")
       x.select("//g:price").extract()

.. _Google Base XML feed: http://base.google.com/support/bin/answer.py?hl=en&answer=59461

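
Why the prefix has to be bound to a URI before ``//g:price`` can match is
easier to see on a tiny feed. The sketch below uses the standard library's
``xml.etree.ElementTree``, where the prefix mapping is passed as a dictionary
rather than registered on a selector. The ``feed`` string is made-up sample
data that only borrows the namespace URI from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed; only the namespace URI comes from the example above.
feed = """
<rss xmlns:g="http://base.google.com/ns/1.0">
  <channel>
    <item><g:price>25.00</g:price></item>
    <item><g:price>12.50</g:price></item>
  </channel>
</rss>
"""
root = ET.fromstring(feed.strip())

# Map the "g" prefix used in the query to the namespace URI.
namespaces = {"g": "http://base.google.com/ns/1.0"}
prices = [element.text for element in root.findall(".//g:price", namespaces)]

# Without the namespace, a plain "price" query finds nothing.
unprefixed = root.findall(".//price")

print(prices, unprefixed)
```
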