/Doc/library/htmllib.rst
- :mod:`htmllib` --- A parser for HTML documents
- ==============================================
- .. module:: htmllib
- :synopsis: A parser for HTML documents.
- :deprecated:
- .. deprecated:: 2.6
- The :mod:`htmllib` module has been removed in Python 3.0.
- .. index::
- single: HTML
- single: hypertext
- .. index::
- module: sgmllib
- module: formatter
- single: SGMLParser (in module sgmllib)
- This module defines a class which can serve as a base for parsing text files
- formatted in the HyperText Mark-up Language (HTML). The class is not directly
- concerned with I/O --- it must be provided with input in string form via a
- method, and makes calls to methods of a "formatter" object in order to produce
- output. The :class:`HTMLParser` class is designed to be used as a base class
- for other classes in order to add functionality, and allows most of its methods
- to be extended or overridden. In turn, this class is derived from and extends
- the :class:`SGMLParser` class defined in module :mod:`sgmllib`. The
- :class:`HTMLParser` implementation supports the HTML 2.0 language as described
- in :rfc:`1866`. Two implementations of formatter objects are provided in the
- :mod:`formatter` module; refer to the documentation for that module for
- information on the formatter interface.
- The following is a summary of the interface defined by
- :class:`sgmllib.SGMLParser`:
- * The interface to feed data to an instance is through the :meth:`feed` method,
- which takes a string argument. This can be called with as little or as much
- text at a time as desired; ``p.feed(a); p.feed(b)`` has the same effect as
- ``p.feed(a+b)``. When the data contains complete HTML markup constructs, these
- are processed immediately; incomplete constructs are saved in a buffer. To
- force processing of all unprocessed data, call the :meth:`close` method.
- For example, to parse the entire contents of a file, use::
- with open('myfile.html') as f:
-     parser.feed(f.read())
- parser.close()
- * The interface to define semantics for HTML tags is simple: derive a class
- and define methods called :meth:`start_tag`, :meth:`end_tag`, or :meth:`do_tag`,
- where ``tag`` stands for the lowercase name of the element (for example,
- ``start_h1``). The parser will call these at appropriate moments:
- :meth:`start_tag` or :meth:`do_tag` is called when an opening tag of the form
- ``<tag ...>`` is encountered; :meth:`end_tag` is called when a closing tag of
- the form ``</tag>`` is encountered. If an opening tag requires a corresponding
- closing tag, like ``<H1>`` ... ``</H1>``, the class should define the
- :meth:`start_tag` method; if a tag requires no closing tag, like ``<P>``, the
- class should define the :meth:`do_tag` method.
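
The two points above can be sketched in runnable form. Because :mod:`htmllib` and :mod:`sgmllib` do not exist on Python 3, this sketch swaps in ``html.parser.HTMLParser`` and rebuilds the ``start_tag`` / ``end_tag`` / ``do_tag`` lookup by hand. The class ``TagDispatchParser`` and the helper ``parse_in_chunks`` are hypothetical names introduced for illustration; they are not part of any library.

```python
from html.parser import HTMLParser  # Python 3 stand-in; htmllib is Python 2 only


class TagDispatchParser(HTMLParser):
    """Hypothetical helper mimicking the start_<tag>/end_<tag>/do_<tag> lookup."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Prefer start_<tag>; fall back to do_<tag> for tags with no closing tag.
        handler = getattr(self, "start_" + tag, None) or getattr(self, "do_" + tag, None)
        if handler is not None:
            handler(attrs)

    def handle_endtag(self, tag):
        handler = getattr(self, "end_" + tag, None)
        if handler is not None:
            handler()

    # Tag methods: <h1> needs a closing tag, <p> is treated as needing none.
    def start_h1(self, attrs):
        self.events.append("start h1")

    def end_h1(self):
        self.events.append("end h1")

    def do_p(self, attrs):
        self.events.append("do p")


def parse_in_chunks(chunks):
    """Feed each chunk in turn, then close() to flush any buffered input."""
    parser = TagDispatchParser()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.events


# Feeding the text whole or split mid-tag should yield the same events.
whole = parse_in_chunks(["<h1>Title</h1><p>Body"])
split = parse_in_chunks(["<h1>Tit", "le</h", "1><p>Body"])
```

The second call splits the input in the middle of the ``</h1>`` tag, so the incomplete construct has to wait in the parser's buffer until the next ``feed()`` call completes it.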
- The module defines a parser class and an exception:
- .. class:: HTMLParser(formatter)
- This is the basic HTML parser class. It supports all entity names required by
- the XHTML 1.0 Recommendation (http://www.w3.org/TR/xhtml1). It also defines
- handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
- .. exception:: HTMLParseError
- Exception raised by the :class:`HTMLParser` class when it encounters an error
- while parsing.
- .. versionadded:: 2.4
- .. seealso::
- Module :mod:`formatter`
- Interface definition for transforming an abstract flow of formatting events into
- specific output events on writer objects.
- Module :mod:`HTMLParser`
- Alternate HTML parser that offers a slightly lower-level view of the input, but
- is designed to work with XHTML, and does not implement some of the SGML syntax
- not used in "HTML as deployed" and which isn't legal for XHTML.
- Module :mod:`htmlentitydefs`
- Definition of replacement text for XHTML 1.0 entities.
- Module :mod:`sgmllib`
- Base class for :class:`HTMLParser`.
- .. _html-parser-objects:
- HTMLParser Objects
- ------------------
- In addition to tag methods, the :class:`HTMLParser` class provides the
- following methods and instance variables for use within tag methods.
- .. attribute:: HTMLParser.formatter
- This is the formatter instance associated with the parser.
- .. attribute:: HTMLParser.nofill
- Boolean flag which should be true when whitespace should not be collapsed, or
- false when it should be. In general, this should only be true when character
- data is to be treated as "preformatted" text, as within a ``<PRE>`` element.
- The default value is false. This affects the operation of :meth:`handle_data`
- and :meth:`save_end`.
- .. method:: HTMLParser.anchor_bgn(href, name, type)
- This method is called at the start of an anchor region. The arguments
- correspond to the attributes of the ``<A>`` tag with the same names. The
- default implementation maintains a list of hyperlinks (defined by the ``HREF``
- attribute for ``<A>`` tags) within the document. The list of hyperlinks is
- available as the data attribute :attr:`anchorlist`.
- .. method:: HTMLParser.anchor_end()
- This method is called at the end of an anchor region. The default
- implementation adds a textual footnote marker using an index into the list of
- hyperlinks created by :meth:`anchor_bgn`.
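
The anchor bookkeeping described above can be approximated on Python 3 as follows. This is a sketch built on ``html.parser.HTMLParser`` rather than :mod:`htmllib`. The class name ``AnchorCollector`` and its ``pieces`` list are assumptions made for illustration, and the ``[n]`` footnote format is one plausible rendering of the "textual footnote marker".

```python
from html.parser import HTMLParser  # Python 3 stand-in; htmllib is Python 2 only


class AnchorCollector(HTMLParser):
    """Hypothetical sketch of the anchor_bgn()/anchor_end() behaviour."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []   # hyperlinks seen so far, in document order
        self.anchor = None     # href of the anchor currently open, if any
        self.pieces = []       # text output, standing in for the formatter

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Plays the role of anchor_bgn(): remember the link target.
            self.anchor = dict(attrs).get("href")
            if self.anchor:
                self.anchorlist.append(self.anchor)

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor:
            # Plays the role of anchor_end(): emit a marker indexing anchorlist.
            self.pieces.append("[%d]" % len(self.anchorlist))
            self.anchor = None

    def handle_data(self, data):
        self.pieces.append(data)


collector = AnchorCollector()
collector.feed('See <a href="http://example.com/a">one</a> and '
               '<a href="http://example.com/b">two</a>.')
collector.close()
rendered = "".join(collector.pieces)
```

Each marker is the 1-based position of the link in ``anchorlist``, so a reader of the rendered text can look up the target afterwards.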
- .. method:: HTMLParser.handle_image(source, alt[, ismap[, align[, width[, height]]]])
- This method is called to handle images. The default implementation simply
- passes the *alt* value to the :meth:`handle_data` method.
- .. method:: HTMLParser.save_bgn()
- Begins saving character data in a buffer instead of sending it to the formatter
- object. Retrieve the stored data via :meth:`save_end`. Calls to the
- :meth:`save_bgn` / :meth:`save_end` pair may not be nested.
- .. method:: HTMLParser.save_end()
- Ends buffering character data and returns all data saved since the preceding
- call to :meth:`save_bgn`. If the :attr:`nofill` flag is false, whitespace is
- collapsed to single spaces. A call to this method without a preceding call to
- :meth:`save_bgn` will raise a :exc:`TypeError` exception.
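
The buffering contract of :meth:`save_bgn` and :meth:`save_end`, including the effect of :attr:`nofill`, can be illustrated without the parser. The class ``SaveBuffer`` below is a hypothetical stand-in written for this purpose. It is not the :mod:`htmllib` implementation, and its ``sent`` list merely plays the part of the formatter.

```python
class SaveBuffer(object):
    """Hypothetical stand-in for the save_bgn()/save_end() pair."""

    def __init__(self):
        self.nofill = False    # true means: keep whitespace as-is
        self.savedata = None   # None means "not currently saving"
        self.sent = []         # data that would have gone to the formatter

    def save_bgn(self):
        # Start a fresh buffer; a nested call would discard the outer one.
        self.savedata = ""

    def handle_data(self, data):
        if self.savedata is not None:
            self.savedata += data
        else:
            self.sent.append(data)

    def save_end(self):
        if self.savedata is None:
            # Mirrors the documented behaviour for a missing save_bgn().
            raise TypeError("save_end() called without save_bgn()")
        data, self.savedata = self.savedata, None
        if not self.nofill:
            # Collapse whitespace runs (this also trims both ends).
            data = " ".join(data.split())
        return data


buf = SaveBuffer()
buf.save_bgn()
buf.handle_data("  Hello\n")
buf.handle_data("   world  ")
collapsed = buf.save_end()

buf.nofill = True
buf.save_bgn()
buf.handle_data("a  b\n")
verbatim = buf.save_end()
```

Resetting the buffer in ``save_bgn()`` is what makes nesting unsafe in this sketch: an inner call silently throws away whatever the outer call had collected.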
- :mod:`htmlentitydefs` --- Definitions of HTML general entities
- ==============================================================
- .. module:: htmlentitydefs
- :synopsis: Definitions of HTML general entities.
- .. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
- .. note::
- The :mod:`htmlentitydefs` module has been renamed to :mod:`html.entities` in
- Python 3.0. The :term:`2to3` tool will automatically adapt imports when
- converting your sources to 3.0.
- This module defines three dictionaries, ``name2codepoint``, ``codepoint2name``,
- and ``entitydefs``. ``entitydefs`` is used by the :mod:`htmllib` module to
- provide the :attr:`entitydefs` member of the :class:`HTMLParser` class. The
- definition provided here contains all the entities defined by XHTML 1.0 that
- can be handled using simple textual substitution in the Latin-1 character set
- (ISO-8859-1).
- .. data:: entitydefs
- A dictionary mapping XHTML 1.0 entity definitions to their replacement text in
- ISO Latin-1.
- .. data:: name2codepoint
- A dictionary that maps HTML entity names to their Unicode codepoints.
- .. versionadded:: 2.3
- .. data:: codepoint2name
- A dictionary that maps Unicode codepoints to HTML entity names.
- .. versionadded:: 2.3
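
A short sketch of how the three dictionaries might be used. The import falls back from the Python 3 name, ``html.entities``, to :mod:`htmlentitydefs`. The helper ``escape_non_ascii`` is a hypothetical function written for this example and is not provided by the module.

```python
try:
    from html.entities import codepoint2name, entitydefs, name2codepoint  # Python 3
except ImportError:
    from htmlentitydefs import codepoint2name, entitydefs, name2codepoint  # Python 2


def escape_non_ascii(text):
    """Replace non-ASCII characters that have a named entity with ``&name;``."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None and ord(ch) > 127:
            out.append("&%s;" % name)
        else:
            out.append(ch)
    return "".join(out)


amp_codepoint = name2codepoint["amp"]    # entity name -> Unicode codepoint
eacute_name = codepoint2name[233]        # Unicode codepoint -> entity name
has_amp = "amp" in entitydefs            # entity name -> replacement text
escaped = escape_non_ascii(u"caf\u00e9")
```

The two codepoint dictionaries are inverses of each other for the names they share, so a round trip through ``name2codepoint`` and ``codepoint2name`` returns the original entity name.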