
/lib/cpython-doc/tools/docutils/parsers/rst/states.py

https://bitbucket.org/csenger/benchmarks
Python | 3008 lines | 2912 code | 21 blank | 75 comment | 48 complexity | 494a8d8114df0149de8ee8fef642e535 MD5 | raw file
Possible License(s): BSD-3-Clause, Apache-2.0, GPL-2.0


  1. # $Id: states.py 78909 2010-03-13 10:49:23Z georg.brandl $
  2. # Author: David Goodger <goodger@python.org>
  3. # Copyright: This module has been placed in the public domain.
  4. """
  5. This is the ``docutils.parsers.restructuredtext.states`` module, the core of
  6. the reStructuredText parser. It defines the following:
  7. :Classes:
  8. - `RSTStateMachine`: reStructuredText parser's entry point.
  9. - `NestedStateMachine`: recursive StateMachine.
  10. - `RSTState`: reStructuredText State superclass.
  11. - `Inliner`: For parsing inline markup.
  12. - `Body`: Generic classifier of the first line of a block.
  13. - `SpecializedBody`: Superclass for compound element members.
  14. - `BulletList`: Second and subsequent bullet_list list_items
  15. - `DefinitionList`: Second+ definition_list_items.
  16. - `EnumeratedList`: Second+ enumerated_list list_items.
  17. - `FieldList`: Second+ fields.
  18. - `OptionList`: Second+ option_list_items.
  19. - `RFC2822List`: Second+ RFC2822-style fields.
  20. - `ExtensionOptions`: Parses directive option fields.
  21. - `Explicit`: Second+ explicit markup constructs.
  22. - `SubstitutionDef`: For embedded directives in substitution definitions.
  23. - `Text`: Classifier of second line of a text block.
  24. - `SpecializedText`: Superclass for continuation lines of Text-variants.
  25. - `Definition`: Second line of potential definition_list_item.
  26. - `Line`: Second line of overlined section title or transition marker.
  27. - `Struct`: An auxiliary collection class.
  28. :Exception classes:
  29. - `MarkupError`
  30. - `ParserError`
  31. - `MarkupMismatch`
  32. :Functions:
  33. - `escape2null()`: Return a string, escape-backslashes converted to nulls.
  34. - `unescape()`: Return a string, nulls removed or restored to backslashes.
  35. :Attributes:
  36. - `state_classes`: set of State classes used with `RSTStateMachine`.
  37. Parser Overview
  38. ===============
  39. The reStructuredText parser is implemented as a recursive state machine,
  40. examining its input one line at a time. To understand how the parser works,
  41. please first become familiar with the `docutils.statemachine` module. In the
  42. description below, references are made to classes defined in this module;
  43. please see the individual classes for details.
  44. Parsing proceeds as follows:
  45. 1. The state machine examines each line of input, checking each of the
  46. transition patterns of the state `Body`, in order, looking for a match.
  47. The implicit transitions (blank lines and indentation) are checked before
  48. any others. The 'text' transition is a catch-all (matches anything).
  49. 2. The method associated with the matched transition pattern is called.
  50. A. Some transition methods are self-contained, appending elements to the
  51. document tree (`Body.doctest` parses a doctest block). The parser's
  52. current line index is advanced to the end of the element, and parsing
  53. continues with step 1.
  54. B. Other transition methods trigger the creation of a nested state machine,
  55. whose job is to parse a compound construct ('indent' does a block quote,
  56. 'bullet' does a bullet list, 'overline' does a section [first checking
  57. for a valid section header], etc.).
  58. - In the case of lists and explicit markup, a one-off state machine is
  59. created and run to parse contents of the first item.
  60. - A new state machine is created and its initial state is set to the
  61. appropriate specialized state (`BulletList` in the case of the
  62. 'bullet' transition; see `SpecializedBody` for more detail). This
  63. state machine is run to parse the compound element (or series of
  64. explicit markup elements), and returns as soon as a non-member element
  65. is encountered. For example, the `BulletList` state machine ends as
  66. soon as it encounters an element which is not a list item of that
  67. bullet list. The optional omission of inter-element blank lines is
  68. enabled by this nested state machine.
  69. - The current line index is advanced to the end of the elements parsed,
  70. and parsing continues with step 1.
  71. C. The result of the 'text' transition depends on the next line of text.
  72. The current state is changed to `Text`, under which the second line is
  73. examined. If the second line is:
  74. - Indented: The element is a definition list item, and parsing proceeds
  75. similarly to step 2.B, using the `DefinitionList` state.
  76. - A line of uniform punctuation characters: The element is a section
  77. header; again, parsing proceeds as in step 2.B, and `Body` is still
  78. used.
  79. - Anything else: The element is a paragraph, which is examined for
  80. inline markup and appended to the parent element. Processing
  81. continues with step 1.
  82. """
  83. __docformat__ = 'reStructuredText'
  84. import sys
  85. import re
  86. import roman
  87. from types import FunctionType, MethodType
  88. from docutils import nodes, statemachine, utils, urischemes
  89. from docutils import ApplicationError, DataError
  90. from docutils.statemachine import StateMachineWS, StateWS
  91. from docutils.nodes import fully_normalize_name as normalize_name
  92. from docutils.nodes import whitespace_normalize_name
  93. from docutils.utils import escape2null, unescape, column_width
  94. import docutils.parsers.rst
  95. from docutils.parsers.rst import directives, languages, tableparser, roles
  96. from docutils.parsers.rst.languages import en as _fallback_language_module
  97. class MarkupError(DataError): pass
  98. class UnknownInterpretedRoleError(DataError): pass
  99. class InterpretedRoleNotImplementedError(DataError): pass
  100. class ParserError(ApplicationError): pass
  101. class MarkupMismatch(Exception): pass
  102. class Struct:
  103. """Stores data attributes for dotted-attribute access."""
  104. def __init__(self, **keywordargs):
  105. self.__dict__.update(keywordargs)
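A tiny illustration of how the parser uses this class (not part of the module itself):

    memo = Struct(section_level=0, title_styles=[])
    memo.section_level   # 0 -- keyword arguments become plain attributes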
  106. class RSTStateMachine(StateMachineWS):
  107. """
  108. reStructuredText's master StateMachine.
  109. The entry point to reStructuredText parsing is the `run()` method.
  110. """
  111. def run(self, input_lines, document, input_offset=0, match_titles=1,
  112. inliner=None):
  113. """
  114. Parse `input_lines` and modify the `document` node in place.
  115. Extend `StateMachineWS.run()`: set up parse-global data and
  116. run the StateMachine.
  117. """
  118. self.language = languages.get_language(
  119. document.settings.language_code)
  120. self.match_titles = match_titles
  121. if inliner is None:
  122. inliner = Inliner()
  123. inliner.init_customizations(document.settings)
  124. self.memo = Struct(document=document,
  125. reporter=document.reporter,
  126. language=self.language,
  127. title_styles=[],
  128. section_level=0,
  129. section_bubble_up_kludge=0,
  130. inliner=inliner)
  131. self.document = document
  132. self.attach_observer(document.note_source)
  133. self.reporter = self.memo.reporter
  134. self.node = document
  135. results = StateMachineWS.run(self, input_lines, input_offset,
  136. input_source=document['source'])
  137. assert results == [], 'RSTStateMachine.run() results should be empty!'
  138. self.node = self.memo = None # remove unneeded references
  139. class NestedStateMachine(StateMachineWS):
  140. """
  141. StateMachine run from within other StateMachine runs, to parse nested
  142. document structures.
  143. """
  144. def run(self, input_lines, input_offset, memo, node, match_titles=1):
  145. """
  146. Parse `input_lines` and populate a `docutils.nodes.document` instance.
  147. Extend `StateMachineWS.run()`: set up document-wide data.
  148. """
  149. self.match_titles = match_titles
  150. self.memo = memo
  151. self.document = memo.document
  152. self.attach_observer(self.document.note_source)
  153. self.reporter = memo.reporter
  154. self.language = memo.language
  155. self.node = node
  156. results = StateMachineWS.run(self, input_lines, input_offset)
  157. assert results == [], ('NestedStateMachine.run() results should be '
  158. 'empty!')
  159. return results
  160. class RSTState(StateWS):
  161. """
  162. reStructuredText State superclass.
  163. Contains methods used by all State subclasses.
  164. """
  165. nested_sm = NestedStateMachine
  166. nested_sm_cache = []
  167. def __init__(self, state_machine, debug=0):
  168. self.nested_sm_kwargs = {'state_classes': state_classes,
  169. 'initial_state': 'Body'}
  170. StateWS.__init__(self, state_machine, debug)
  171. def runtime_init(self):
  172. StateWS.runtime_init(self)
  173. memo = self.state_machine.memo
  174. self.memo = memo
  175. self.reporter = memo.reporter
  176. self.inliner = memo.inliner
  177. self.document = memo.document
  178. self.parent = self.state_machine.node
  179. def goto_line(self, abs_line_offset):
  180. """
  181. Jump to input line `abs_line_offset`, ignoring jumps past the end.
  182. """
  183. try:
  184. self.state_machine.goto_line(abs_line_offset)
  185. except EOFError:
  186. pass
  187. def no_match(self, context, transitions):
  188. """
  189. Override `StateWS.no_match` to generate a system message.
  190. This code should never be run.
  191. """
  192. self.reporter.severe(
  193. 'Internal error: no transition pattern match. State: "%s"; '
  194. 'transitions: %s; context: %s; current line: %r.'
  195. % (self.__class__.__name__, transitions, context,
  196. self.state_machine.line),
  197. line=self.state_machine.abs_line_number())
  198. return context, None, []
  199. def bof(self, context):
  200. """Called at beginning of file."""
  201. return [], []
  202. def nested_parse(self, block, input_offset, node, match_titles=0,
  203. state_machine_class=None, state_machine_kwargs=None):
  204. """
  205. Create a new StateMachine rooted at `node` and run it over the input
  206. `block`.
  207. """
  208. use_default = 0
  209. if state_machine_class is None:
  210. state_machine_class = self.nested_sm
  211. use_default += 1
  212. if state_machine_kwargs is None:
  213. state_machine_kwargs = self.nested_sm_kwargs
  214. use_default += 1
  215. block_length = len(block)
  216. state_machine = None
  217. if use_default == 2:
  218. try:
  219. state_machine = self.nested_sm_cache.pop()
  220. except IndexError:
  221. pass
  222. if not state_machine:
  223. state_machine = state_machine_class(debug=self.debug,
  224. **state_machine_kwargs)
  225. state_machine.run(block, input_offset, memo=self.memo,
  226. node=node, match_titles=match_titles)
  227. if use_default == 2:
  228. self.nested_sm_cache.append(state_machine)
  229. else:
  230. state_machine.unlink()
  231. new_offset = state_machine.abs_line_offset()
  232. # No `block.parent` implies disconnected -- lines aren't in sync:
  233. if block.parent and (len(block) - block_length) != 0:
  234. # Adjustment for block if modified in nested parse:
  235. self.state_machine.next_line(len(block) - block_length)
  236. return new_offset
  237. def nested_list_parse(self, block, input_offset, node, initial_state,
  238. blank_finish,
  239. blank_finish_state=None,
  240. extra_settings={},
  241. match_titles=0,
  242. state_machine_class=None,
  243. state_machine_kwargs=None):
  244. """
  245. Create a new StateMachine rooted at `node` and run it over the input
  246. `block`. Also keep track of optional intermediate blank lines and the
  247. required final one.
  248. """
  249. if state_machine_class is None:
  250. state_machine_class = self.nested_sm
  251. if state_machine_kwargs is None:
  252. state_machine_kwargs = self.nested_sm_kwargs.copy()
  253. state_machine_kwargs['initial_state'] = initial_state
  254. state_machine = state_machine_class(debug=self.debug,
  255. **state_machine_kwargs)
  256. if blank_finish_state is None:
  257. blank_finish_state = initial_state
  258. state_machine.states[blank_finish_state].blank_finish = blank_finish
  259. for key, value in extra_settings.items():
  260. setattr(state_machine.states[initial_state], key, value)
  261. state_machine.run(block, input_offset, memo=self.memo,
  262. node=node, match_titles=match_titles)
  263. blank_finish = state_machine.states[blank_finish_state].blank_finish
  264. state_machine.unlink()
  265. return state_machine.abs_line_offset(), blank_finish
  266. def section(self, title, source, style, lineno, messages):
  267. """Check for a valid subsection and create one if it checks out."""
  268. if self.check_subsection(source, style, lineno):
  269. self.new_subsection(title, lineno, messages)
  270. def check_subsection(self, source, style, lineno):
  271. """
  272. Check for a valid subsection header. Return 1 (true) or None (false).
  273. When a new section is reached that isn't a subsection of the current
  274. section, back up the line count (use ``previous_line(-x)``), then
  275. ``raise EOFError``. The current StateMachine will finish, then the
  276. calling StateMachine can re-examine the title. This will work its way
277. back up the calling chain until the correct section level is reached.
  278. @@@ Alternative: Evaluate the title, store the title info & level, and
  279. back up the chain until that level is reached. Store in memo? Or
  280. return in results?
  281. :Exception: `EOFError` when a sibling or supersection encountered.
  282. """
  283. memo = self.memo
  284. title_styles = memo.title_styles
  285. mylevel = memo.section_level
  286. try: # check for existing title style
  287. level = title_styles.index(style) + 1
  288. except ValueError: # new title style
  289. if len(title_styles) == memo.section_level: # new subsection
  290. title_styles.append(style)
  291. return 1
  292. else: # not at lowest level
  293. self.parent += self.title_inconsistent(source, lineno)
  294. return None
  295. if level <= mylevel: # sibling or supersection
  296. memo.section_level = level # bubble up to parent section
  297. if len(style) == 2:
  298. memo.section_bubble_up_kludge = 1
  299. # back up 2 lines for underline title, 3 for overline title
  300. self.state_machine.previous_line(len(style) + 1)
  301. raise EOFError # let parent section re-evaluate
  302. if level == mylevel + 1: # immediate subsection
  303. return 1
  304. else: # invalid subsection
  305. self.parent += self.title_inconsistent(source, lineno)
  306. return None
  307. def title_inconsistent(self, sourcetext, lineno):
  308. error = self.reporter.severe(
  309. 'Title level inconsistent:', nodes.literal_block('', sourcetext),
  310. line=lineno)
  311. return error
  312. def new_subsection(self, title, lineno, messages):
  313. """Append new subsection to document tree. On return, check level."""
  314. memo = self.memo
  315. mylevel = memo.section_level
  316. memo.section_level += 1
  317. section_node = nodes.section()
  318. self.parent += section_node
  319. textnodes, title_messages = self.inline_text(title, lineno)
  320. titlenode = nodes.title(title, '', *textnodes)
  321. name = normalize_name(titlenode.astext())
  322. section_node['names'].append(name)
  323. section_node += titlenode
  324. section_node += messages
  325. section_node += title_messages
  326. self.document.note_implicit_target(section_node, section_node)
  327. offset = self.state_machine.line_offset + 1
  328. absoffset = self.state_machine.abs_line_offset() + 1
  329. newabsoffset = self.nested_parse(
  330. self.state_machine.input_lines[offset:], input_offset=absoffset,
  331. node=section_node, match_titles=1)
  332. self.goto_line(newabsoffset)
  333. if memo.section_level <= mylevel: # can't handle next section?
  334. raise EOFError # bubble up to supersection
  335. # reset section_level; next pass will detect it properly
  336. memo.section_level = mylevel
  337. def paragraph(self, lines, lineno):
  338. """
  339. Return a list (paragraph & messages) & a boolean: literal_block next?
  340. """
  341. data = '\n'.join(lines).rstrip()
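# The pattern below matches a paragraph whose text ends in an unescaped "::":
# any backslashes directly before the colons must pair up (even count), so
# "Example::" introduces a literal block while "Example\::" does not.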
  342. if re.search(r'(?<!\\)(\\\\)*::$', data):
  343. if len(data) == 2:
  344. return [], 1
  345. elif data[-3] in ' \n':
  346. text = data[:-3].rstrip()
  347. else:
  348. text = data[:-1]
  349. literalnext = 1
  350. else:
  351. text = data
  352. literalnext = 0
  353. textnodes, messages = self.inline_text(text, lineno)
  354. p = nodes.paragraph(data, '', *textnodes)
  355. p.line = lineno
  356. return [p] + messages, literalnext
  357. def inline_text(self, text, lineno):
  358. """
  359. Return 2 lists: nodes (text and inline elements), and system_messages.
  360. """
  361. return self.inliner.parse(text, lineno, self.memo, self.parent)
  362. def unindent_warning(self, node_name):
  363. return self.reporter.warning(
  364. '%s ends without a blank line; unexpected unindent.' % node_name,
  365. line=(self.state_machine.abs_line_number() + 1))
  366. def build_regexp(definition, compile=1):
  367. """
  368. Build, compile and return a regular expression based on `definition`.
  369. :Parameter: `definition`: a 4-tuple (group name, prefix, suffix, parts),
  370. where "parts" is a list of regular expressions and/or regular
  371. expression definitions to be joined into an or-group.
  372. """
  373. name, prefix, suffix, parts = definition
  374. part_strings = []
  375. for part in parts:
  376. if type(part) is tuple:
  377. part_strings.append(build_regexp(part, None))
  378. else:
  379. part_strings.append(part)
  380. or_group = '|'.join(part_strings)
  381. regexp = '%(prefix)s(?P<%(name)s>%(or_group)s)%(suffix)s' % locals()
  382. if compile:
  383. return re.compile(regexp, re.UNICODE)
  384. else:
  385. return regexp
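For example, a small made-up definition (the group name `num` and its parts are illustrative only) yields an or-group wrapped in the prefix and suffix; nested tuples in `parts` are expanded recursively, which the `Inliner.parts` definition below relies on:

    pattern = build_regexp(('num', '#', r'(?=\s|$)', [r'[0-9]+', r'[ivxlcdm]+']))
    pattern.pattern                     # '#(?P<num>[0-9]+|[ivxlcdm]+)(?=\s|$)'
    pattern.match('#42').group('num')   # '42'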
  386. class Inliner:
  387. """
  388. Parse inline markup; call the `parse()` method.
  389. """
  390. def __init__(self):
  391. self.implicit_dispatch = [(self.patterns.uri, self.standalone_uri),]
  392. """List of (pattern, bound method) tuples, used by
  393. `self.implicit_inline`."""
  394. def init_customizations(self, settings):
  395. """Setting-based customizations; run when parsing begins."""
  396. if settings.pep_references:
  397. self.implicit_dispatch.append((self.patterns.pep,
  398. self.pep_reference))
  399. if settings.rfc_references:
  400. self.implicit_dispatch.append((self.patterns.rfc,
  401. self.rfc_reference))
  402. def parse(self, text, lineno, memo, parent):
  403. # Needs to be refactored for nested inline markup.
  404. # Add nested_parse() method?
  405. """
  406. Return 2 lists: nodes (text and inline elements), and system_messages.
  407. Using `self.patterns.initial`, a pattern which matches start-strings
  408. (emphasis, strong, interpreted, phrase reference, literal,
  409. substitution reference, and inline target) and complete constructs
  410. (simple reference, footnote reference), search for a candidate. When
  411. one is found, check for validity (e.g., not a quoted '*' character).
  412. If valid, search for the corresponding end string if applicable, and
  413. check it for validity. If not found or invalid, generate a warning
  414. and ignore the start-string. Implicit inline markup (e.g. standalone
  415. URIs) is found last.
  416. """
  417. self.reporter = memo.reporter
  418. self.document = memo.document
  419. self.language = memo.language
  420. self.parent = parent
  421. pattern_search = self.patterns.initial.search
  422. dispatch = self.dispatch
  423. remaining = escape2null(text)
  424. processed = []
  425. unprocessed = []
  426. messages = []
  427. while remaining:
  428. match = pattern_search(remaining)
  429. if match:
  430. groups = match.groupdict()
  431. method = dispatch[groups['start'] or groups['backquote']
  432. or groups['refend'] or groups['fnend']]
  433. before, inlines, remaining, sysmessages = method(self, match,
  434. lineno)
  435. unprocessed.append(before)
  436. messages += sysmessages
  437. if inlines:
  438. processed += self.implicit_inline(''.join(unprocessed),
  439. lineno)
  440. processed += inlines
  441. unprocessed = []
  442. else:
  443. break
  444. remaining = ''.join(unprocessed) + remaining
  445. if remaining:
  446. processed += self.implicit_inline(remaining, lineno)
  447. return processed, messages
  448. openers = u'\'"([{<\u2018\u201c\xab\u00a1\u00bf' # see quoted_start below
  449. closers = u'\'")]}>\u2019\u201d\xbb!?'
  450. unicode_delimiters = u'\u2010\u2011\u2012\u2013\u2014\u00a0'
  451. start_string_prefix = (u'((?<=^)|(?<=[-/: \\n\u2019%s%s]))'
  452. % (re.escape(unicode_delimiters),
  453. re.escape(openers)))
  454. end_string_suffix = (r'((?=$)|(?=[-/:.,; \n\x00%s%s]))'
  455. % (re.escape(unicode_delimiters),
  456. re.escape(closers)))
  457. non_whitespace_before = r'(?<![ \n])'
  458. non_whitespace_escape_before = r'(?<![ \n\x00])'
  459. non_whitespace_after = r'(?![ \n])'
  460. # Alphanumerics with isolated internal [-._+:] chars (i.e. not 2 together):
  461. simplename = r'(?:(?!_)\w)+(?:[-._+:](?:(?!_)\w)+)*'
  462. # Valid URI characters (see RFC 2396 & RFC 2732);
  463. # final \x00 allows backslash escapes in URIs:
  464. uric = r"""[-_.!~*'()[\];/:@&=+$,%a-zA-Z0-9\x00]"""
  465. # Delimiter indicating the end of a URI (not part of the URI):
  466. uri_end_delim = r"""[>]"""
  467. # Last URI character; same as uric but no punctuation:
  468. urilast = r"""[_~*/=+a-zA-Z0-9]"""
  469. # End of a URI (either 'urilast' or 'uric followed by a
  470. # uri_end_delim'):
  471. uri_end = r"""(?:%(urilast)s|%(uric)s(?=%(uri_end_delim)s))""" % locals()
  472. emailc = r"""[-_!~*'{|}/#?^`&=+$%a-zA-Z0-9\x00]"""
  473. email_pattern = r"""
  474. %(emailc)s+(?:\.%(emailc)s+)* # name
  475. (?<!\x00)@ # at
  476. %(emailc)s+(?:\.%(emailc)s*)* # host
  477. %(uri_end)s # final URI char
  478. """
  479. parts = ('initial_inline', start_string_prefix, '',
  480. [('start', '', non_whitespace_after, # simple start-strings
  481. [r'\*\*', # strong
  482. r'\*(?!\*)', # emphasis but not strong
  483. r'``', # literal
  484. r'_`', # inline internal target
  485. r'\|(?!\|)'] # substitution reference
  486. ),
  487. ('whole', '', end_string_suffix, # whole constructs
  488. [# reference name & end-string
  489. r'(?P<refname>%s)(?P<refend>__?)' % simplename,
  490. ('footnotelabel', r'\[', r'(?P<fnend>\]_)',
  491. [r'[0-9]+', # manually numbered
  492. r'\#(%s)?' % simplename, # auto-numbered (w/ label?)
  493. r'\*', # auto-symbol
  494. r'(?P<citationlabel>%s)' % simplename] # citation reference
  495. )
  496. ]
  497. ),
  498. ('backquote', # interpreted text or phrase reference
  499. '(?P<role>(:%s:)?)' % simplename, # optional role
  500. non_whitespace_after,
  501. ['`(?!`)'] # but not literal
  502. )
  503. ]
  504. )
  505. patterns = Struct(
  506. initial=build_regexp(parts),
  507. emphasis=re.compile(non_whitespace_escape_before
  508. + r'(\*)' + end_string_suffix),
  509. strong=re.compile(non_whitespace_escape_before
  510. + r'(\*\*)' + end_string_suffix),
  511. interpreted_or_phrase_ref=re.compile(
  512. r"""
  513. %(non_whitespace_escape_before)s
  514. (
  515. `
  516. (?P<suffix>
  517. (?P<role>:%(simplename)s:)?
  518. (?P<refend>__?)?
  519. )
  520. )
  521. %(end_string_suffix)s
  522. """ % locals(), re.VERBOSE | re.UNICODE),
  523. embedded_uri=re.compile(
  524. r"""
  525. (
  526. (?:[ \n]+|^) # spaces or beginning of line/string
  527. < # open bracket
  528. %(non_whitespace_after)s
  529. ([^<>\x00]+) # anything but angle brackets & nulls
  530. %(non_whitespace_before)s
  531. > # close bracket w/o whitespace before
  532. )
  533. $ # end of string
  534. """ % locals(), re.VERBOSE),
  535. literal=re.compile(non_whitespace_before + '(``)'
  536. + end_string_suffix),
  537. target=re.compile(non_whitespace_escape_before
  538. + r'(`)' + end_string_suffix),
  539. substitution_ref=re.compile(non_whitespace_escape_before
  540. + r'(\|_{0,2})'
  541. + end_string_suffix),
  542. email=re.compile(email_pattern % locals() + '$', re.VERBOSE),
  543. uri=re.compile(
  544. (r"""
  545. %(start_string_prefix)s
  546. (?P<whole>
  547. (?P<absolute> # absolute URI
  548. (?P<scheme> # scheme (http, ftp, mailto)
  549. [a-zA-Z][a-zA-Z0-9.+-]*
  550. )
  551. :
  552. (
  553. ( # either:
  554. (//?)? # hierarchical URI
  555. %(uric)s* # URI characters
  556. %(uri_end)s # final URI char
  557. )
  558. ( # optional query
  559. \?%(uric)s*
  560. %(uri_end)s
  561. )?
  562. ( # optional fragment
  563. \#%(uric)s*
  564. %(uri_end)s
  565. )?
  566. )
  567. )
  568. | # *OR*
  569. (?P<email> # email address
  570. """ + email_pattern + r"""
  571. )
  572. )
  573. %(end_string_suffix)s
  574. """) % locals(), re.VERBOSE),
  575. pep=re.compile(
  576. r"""
  577. %(start_string_prefix)s
  578. (
  579. (pep-(?P<pepnum1>\d+)(.txt)?) # reference to source file
  580. |
  581. (PEP\s+(?P<pepnum2>\d+)) # reference by name
  582. )
  583. %(end_string_suffix)s""" % locals(), re.VERBOSE),
  584. rfc=re.compile(
  585. r"""
  586. %(start_string_prefix)s
  587. (RFC(-|\s+)?(?P<rfcnum>\d+))
  588. %(end_string_suffix)s""" % locals(), re.VERBOSE))
  589. def quoted_start(self, match):
  590. """Return 1 if inline markup start-string is 'quoted', 0 if not."""
  591. string = match.string
  592. start = match.start()
  593. end = match.end()
  594. if start == 0: # start-string at beginning of text
  595. return 0
  596. prestart = string[start - 1]
  597. try:
  598. poststart = string[end]
  599. if self.openers.index(prestart) \
  600. == self.closers.index(poststart): # quoted
  601. return 1
  602. except IndexError: # start-string at end of text
  603. return 1
  604. except ValueError: # not quoted
  605. pass
  606. return 0
  607. def inline_obj(self, match, lineno, end_pattern, nodeclass,
  608. restore_backslashes=0):
  609. string = match.string
  610. matchstart = match.start('start')
  611. matchend = match.end('start')
  612. if self.quoted_start(match):
  613. return (string[:matchend], [], string[matchend:], [], '')
  614. endmatch = end_pattern.search(string[matchend:])
  615. if endmatch and endmatch.start(1): # 1 or more chars
  616. text = unescape(endmatch.string[:endmatch.start(1)],
  617. restore_backslashes)
  618. textend = matchend + endmatch.end(1)
  619. rawsource = unescape(string[matchstart:textend], 1)
  620. return (string[:matchstart], [nodeclass(rawsource, text)],
  621. string[textend:], [], endmatch.group(1))
  622. msg = self.reporter.warning(
  623. 'Inline %s start-string without end-string.'
  624. % nodeclass.__name__, line=lineno)
  625. text = unescape(string[matchstart:matchend], 1)
  626. rawsource = unescape(string[matchstart:matchend], 1)
  627. prb = self.problematic(text, rawsource, msg)
  628. return string[:matchstart], [prb], string[matchend:], [msg], ''
  629. def problematic(self, text, rawsource, message):
  630. msgid = self.document.set_id(message, self.parent)
  631. problematic = nodes.problematic(rawsource, text, refid=msgid)
  632. prbid = self.document.set_id(problematic)
  633. message.add_backref(prbid)
  634. return problematic
  635. def emphasis(self, match, lineno):
  636. before, inlines, remaining, sysmessages, endstring = self.inline_obj(
  637. match, lineno, self.patterns.emphasis, nodes.emphasis)
  638. return before, inlines, remaining, sysmessages
  639. def strong(self, match, lineno):
  640. before, inlines, remaining, sysmessages, endstring = self.inline_obj(
  641. match, lineno, self.patterns.strong, nodes.strong)
  642. return before, inlines, remaining, sysmessages
  643. def interpreted_or_phrase_ref(self, match, lineno):
  644. end_pattern = self.patterns.interpreted_or_phrase_ref
  645. string = match.string
  646. matchstart = match.start('backquote')
  647. matchend = match.end('backquote')
  648. rolestart = match.start('role')
  649. role = match.group('role')
  650. position = ''
  651. if role:
  652. role = role[1:-1]
  653. position = 'prefix'
  654. elif self.quoted_start(match):
  655. return (string[:matchend], [], string[matchend:], [])
  656. endmatch = end_pattern.search(string[matchend:])
  657. if endmatch and endmatch.start(1): # 1 or more chars
  658. textend = matchend + endmatch.end()
  659. if endmatch.group('role'):
  660. if role:
  661. msg = self.reporter.warning(
  662. 'Multiple roles in interpreted text (both '
  663. 'prefix and suffix present; only one allowed).',
  664. line=lineno)
  665. text = unescape(string[rolestart:textend], 1)
  666. prb = self.problematic(text, text, msg)
  667. return string[:rolestart], [prb], string[textend:], [msg]
  668. role = endmatch.group('suffix')[1:-1]
  669. position = 'suffix'
  670. escaped = endmatch.string[:endmatch.start(1)]
  671. rawsource = unescape(string[matchstart:textend], 1)
  672. if rawsource[-1:] == '_':
  673. if role:
  674. msg = self.reporter.warning(
  675. 'Mismatch: both interpreted text role %s and '
  676. 'reference suffix.' % position, line=lineno)
  677. text = unescape(string[rolestart:textend], 1)
  678. prb = self.problematic(text, text, msg)
  679. return string[:rolestart], [prb], string[textend:], [msg]
  680. return self.phrase_ref(string[:matchstart], string[textend:],
  681. rawsource, escaped, unescape(escaped))
  682. else:
  683. rawsource = unescape(string[rolestart:textend], 1)
  684. nodelist, messages = self.interpreted(rawsource, escaped, role,
  685. lineno)
  686. return (string[:rolestart], nodelist,
  687. string[textend:], messages)
  688. msg = self.reporter.warning(
  689. 'Inline interpreted text or phrase reference start-string '
  690. 'without end-string.', line=lineno)
  691. text = unescape(string[matchstart:matchend], 1)
  692. prb = self.problematic(text, text, msg)
  693. return string[:matchstart], [prb], string[matchend:], [msg]
  694. def phrase_ref(self, before, after, rawsource, escaped, text):
  695. match = self.patterns.embedded_uri.search(escaped)
  696. if match:
  697. text = unescape(escaped[:match.start(0)])
  698. uri_text = match.group(2)
  699. uri = ''.join(uri_text.split())
  700. uri = self.adjust_uri(uri)
  701. if uri:
  702. target = nodes.target(match.group(1), refuri=uri)
  703. else:
  704. raise ApplicationError('problem with URI: %r' % uri_text)
  705. if not text:
  706. text = uri
  707. else:
  708. target = None
  709. refname = normalize_name(text)
  710. reference = nodes.reference(rawsource, text,
  711. name=whitespace_normalize_name(text))
  712. node_list = [reference]
  713. if rawsource[-2:] == '__':
  714. if target:
  715. reference['refuri'] = uri
  716. else:
  717. reference['anonymous'] = 1
  718. else:
  719. if target:
  720. reference['refuri'] = uri
  721. target['names'].append(refname)
  722. self.document.note_explicit_target(target, self.parent)
  723. node_list.append(target)
  724. else:
  725. reference['refname'] = refname
  726. self.document.note_refname(reference)
  727. return before, node_list, after, []
  728. def adjust_uri(self, uri):
  729. match = self.patterns.email.match(uri)
  730. if match:
  731. return 'mailto:' + uri
  732. else:
  733. return uri
  734. def interpreted(self, rawsource, text, role, lineno):
  735. role_fn, messages = roles.role(role, self.language, lineno,
  736. self.reporter)
  737. if role_fn:
  738. nodes, messages2 = role_fn(role, rawsource, text, lineno, self)
  739. return nodes, messages + messages2
  740. else:
  741. msg = self.reporter.error(
  742. 'Unknown interpreted text role "%s".' % role,
  743. line=lineno)
  744. return ([self.problematic(rawsource, rawsource, msg)],
  745. messages + [msg])
  746. def literal(self, match, lineno):
  747. before, inlines, remaining, sysmessages, endstring = self.inline_obj(
  748. match, lineno, self.patterns.literal, nodes.literal,
  749. restore_backslashes=1)
  750. return before, inlines, remaining, sysmessages
  751. def inline_internal_target(self, match, lineno):
  752. before, inlines, remaining, sysmessages, endstring = self.inline_obj(
  753. match, lineno, self.patterns.target, nodes.target)
  754. if inlines and isinstance(inlines[0], nodes.target):
  755. assert len(inlines) == 1
  756. target = inlines[0]
  757. name = normalize_name(target.astext())
  758. target['names'].append(name)
  759. self.document.note_explicit_target(target, self.parent)
  760. return before, inlines, remaining, sysmessages
  761. def substitution_reference(self, match, lineno):
  762. before, inlines, remaining, sysmessages, endstring = self.inline_obj(
  763. match, lineno, self.patterns.substitution_ref,
  764. nodes.substitution_reference)
  765. if len(inlines) == 1:
  766. subref_node = inlines[0]
  767. if isinstance(subref_node, nodes.substitution_reference):
  768. subref_text = subref_node.astext()
  769. self.document.note_substitution_ref(subref_node, subref_text)
  770. if endstring[-1:] == '_':
  771. reference_node = nodes.reference(
  772. '|%s%s' % (subref_text, endstring), '')
  773. if endstring[-2:] == '__':
  774. reference_node['anonymous'] = 1
  775. else:
  776. reference_node['refname'] = normalize_name(subref_text)
  777. self.document.note_refname(reference_node)
  778. reference_node += subref_node
  779. inlines = [reference_node]
  780. return before, inlines, remaining, sysmessages
  781. def footnote_reference(self, match, lineno):
  782. """
  783. Handles `nodes.footnote_reference` and `nodes.citation_reference`
  784. elements.
  785. """
  786. label = match.group('footnotelabel')
  787. refname = normalize_name(label)
  788. string = match.string
  789. before = string[:match.start('whole')]
  790. remaining = string[match.end('whole'):]
  791. if match.group('citationlabel'):
  792. refnode = nodes.citation_reference('[%s]_' % label,
  793. refname=refname)
  794. refnode += nodes.Text(label)
  795. self.document.note_citation_ref(refnode)
  796. else:
  797. refnode = nodes.footnote_reference('[%s]_' % label)
  798. if refname[0] == '#':
  799. refname = refname[1:]
  800. refnode['auto'] = 1
  801. self.document.note_autofootnote_ref(refnode)
  802. elif refname == '*':
  803. refname = ''
  804. refnode['auto'] = '*'
  805. self.document.note_symbol_footnote_ref(
  806. refnode)
  807. else:
  808. refnode += nodes.Text(label)
  809. if refname:
  810. refnode['refname'] = refname
  811. self.document.note_footnote_ref(refnode)
  812. if utils.get_trim_footnote_ref_space(self.document.settings):
  813. before = before.rstrip()
  814. return (before, [refnode], remaining, [])
  815. def reference(self, match, lineno, anonymous=None):
  816. referencename = match.group('refname')
  817. refname = normalize_name(referencename)
  818. referencenode = nodes.reference(
  819. referencename + match.group('refend'), referencename,
  820. name=whitespace_normalize_name(referencename))
  821. if anonymous:
  822. referencenode['anonymous'] = 1
  823. else:
  824. referencenode['refname'] = refname
  825. self.document.note_refname(referencenode)
  826. string = match.string
  827. matchstart = match.start('whole')
  828. matchend = match.end('whole')
  829. return (string[:matchstart], [referencenode], string[matchend:], [])
  830. def anonymous_reference(self, match, lineno):
  831. return self.reference(match, lineno, anonymous=1)
  832. def standalone_uri(self, match, lineno):
  833. if (not match.group('scheme')
  834. or match.group('scheme').lower() in urischemes.schemes):
  835. if match.group('email'):
  836. addscheme = 'mailto:'
  837. else:
  838. addscheme = ''
  839. text = match.group('whole')
  840. unescaped = unescape(text, 0)
  841. return [nodes.reference(unescape(text, 1), unescaped,
  842. refuri=addscheme + unescaped)]
  843. else: # not a valid scheme
  844. raise MarkupMismatch
  845. def pep_reference(self, match, lineno):
  846. text = match.group(0)
  847. if text.startswith('pep-'):
  848. pepnum = int(match.group('pepnum1'))
  849. elif text.startswith('PEP'):
  850. pepnum = int(match.group('pepnum2'))
  851. else:
  852. raise MarkupMismatch
  853. ref = (self.document.settings.pep_base_url
  854. + self.document.settings.pep_file_url_template % pepnum)
  855. unescaped = unescape(text, 0)
  856. return [nodes.reference(unescape(text, 1), unescaped, refuri=ref)]
  857. rfc_url = 'rfc%d.html'
  858. def rfc_reference(self, match, lineno):
  859. text = match.group(0)
  860. if text.startswith('RFC'):
  861. rfcnum = int(match.group('rfcnum'))
  862. ref = self.document.settings.rfc_base_url + self.rfc_url % rfcnum
  863. else:
  864. raise MarkupMismatch
  865. unescaped = unescape(text, 0)
  866. return [nodes.reference(unescape(text, 1), unescaped, refuri=ref)]
  867. def implicit_inline(self, text, lineno):
  868. """
  869. Check each of the patterns in `self.implicit_dispatch` for a match,
  870. and dispatch to the stored method for the pattern. Recursively check
  871. the text before and after the match. Return a list of `nodes.Text`
  872. and inline element nodes.
  873. """
  874. if not text:
  875. return []
  876. for pattern, method in self.implicit_dispatch:
  877. match = pattern.search(text)
  878. if match:
  879. try:
  880. # Must recurse on strings before *and* after the match;
  881. # there may be multiple patterns.
  882. return (self.implicit_inline(text[:match.start()], lineno)
  883. + method(match, lineno) +
  884. self.implicit_inline(text[match.end():], lineno))
  885. except MarkupMismatch:
  886. pass
  887. return [nodes.Text(unescape(text), rawsource=unescape(text, 1))]
  888. dispatch = {'*': emphasis,
  889. '**': strong,
  890. '`': interpreted_or_phrase_ref,
  891. '``': literal,
  892. '_`': inline_internal_target,
  893. ']_': footnote_reference,
  894. '|': substitution_reference,
  895. '_': reference,
  896. '__': anonymous_reference}
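A rough illustration of how `parse()` uses these tables (sample text invented; the group names come from the `parts` definition above):

    match = Inliner.patterns.initial.search('see the *emphasized* word')
    match.groupdict()['start']   # '*', which the dispatch table maps to Inliner.emphasis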
  897. def _loweralpha_to_int(s, _zero=(ord('a')-1)):
  898. return ord(s) - _zero
  899. def _upperalpha_to_int(s, _zero=(ord('A')-1)):
  900. return ord(s) - _zero
  901. def _lowerroman_to_int(s):
  902. return roman.fromRoman(s.upper())
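These converters map an enumerator to its ordinal value, e.g.:

    _loweralpha_to_int('c')    # 3
    _upperalpha_to_int('B')    # 2
    _lowerroman_to_int('xiv')  # 14, via the `roman` module imported above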
  903. class Body(RSTState):
  904. """
  905. Generic classifier of the first line of a block.
  906. """
  907. double_width_pad_char = tableparser.TableParser.double_width_pad_char
  908. """Padding character for East Asian double-width text."""
  909. enum = Struct()
  910. """Enumerated list parsing information."""
  911. enum.formatinfo = {
  912. 'parens': Struct(prefix='(', suffix=')', start=1, end=-1),
  913. 'rparen': Struct(prefix='', suffix=')', start=0, end=-1),
  914. 'period': Struct(prefix='', suffix='.', start=0, end=-1)}
  915. enum.formats = enum.formatinfo.keys()
  916. enum.sequences = ['arabic', 'loweralpha', 'upperalpha',
  917. 'lowerroman', 'upperroman'] # ORDERED!
  918. enum.sequencepats = {'arabic': '[0-9]+',
  919. 'loweralpha': '[a-z]',
  920. 'upperalpha': '[A-Z]',
  921. 'lowerroman': '[ivxlcdm]+',
  922. 'upperroman': '[IVXLCDM]+',}
  923. enum.converters = {'arabic': int,
  924. 'loweralpha': _loweralpha_to_int,
  925. 'upperalpha': _upperalpha_to_int,
  926. 'lowerroman': _lowerroman_to_int,
  927. 'upperroman': roman.fromRoman}
  928. enum.sequenceregexps = {}
  929. for sequence in enum.sequences:
  930. enum.sequenceregexps[sequence] = re.compile(
  931. enum.sequencepats[sequence] + '$')
  932. grid_table_top_pat = re.compile(r'\+-[-+]+-\+ *$')
  933. """Matches the top (& bottom) of a full table)."""
  934. simple_table_top_pat = re.compile('=+( +=+)+ *$')
  935. """Matches the top of a simple table."""
  936. simple_table_border_pat = re.compile('=+[ =]*$')
  937. """Matches the bottom & header bottom of a simple table."""
  938. pats = {}
  939. """Fragments of patterns used by transitions."""
  940. pats['nonalphanum7bit'] = '[!-/:-@[-`{-~]'
  941. pats['alpha'] = '[a-zA-Z]'
  942. pats['alphanum'] = '[a-zA-Z0-9]'
  943. pats['alphanumplus'] = '[a-zA-Z0-9_-]'
  944. pats['enum'] = ('(%(arabic)s|%(loweralpha)s|%(upperalpha)s|%(lowerroman)s'
  945. '|%(upperroman)s|#)' % enum.sequencepats)
  946. pats['optname'] = '%(alphanum)s%(alphanumplus)s*' % pats
  947. # @@@ Loosen up the pattern? Allow Unicode?
  948. pats['optarg'] = '(%(alpha)s%(alphanumplus)s*|<[^<>]+>)' % pats
  949. pats['shortopt'] = r'(-|\+)%(alphanum)s( ?%(optarg)s)?' % pats
  950. pats['longopt'] = r'(--|/)%(optname)s([ =]%(optarg)s)?' % pats
  951. pats['option'] = r'(%(shortopt)s|%(longopt)s)' % pats
  952. for format in enum.formats:
  953. pats[format] = '(?P<%s>%s%s%s)' % (
  954. format, re.escape(enum.formatinfo[format].prefix),
  955. pats['enum'], re.escape(enum.formatinfo[format].suffix))
  956. patterns = {
  957. 'bullet': u'[-+*\u2022\u2023\u2043]( +|$)',
  958. 'enumerator': r'(%(parens)s|%(rparen)s|%(period)s)( +|$)' % pats,
  959. 'field_marker': r':(?![: ])([^:\\]|\\.)*(?<! ):( +|$)',
  960. 'option_marker': r'%(option)s(, %(option)s)*( +| ?$)' % pats,
  961. 'doctest': r'>>>( +|$)',
  962. 'line_block': r'\|( +|$)',
  963. 'grid_table_top': grid_table_top_pat,
  964. 'simple_table_top': simple_table_top_pat,
  965. 'explicit_markup': r'\.\.( +|$)',
  966. 'anonymous': r'__( +|$)',
  967. 'line': r'(%(nonalphanum7bit)s)\1* *$' % pats,
  968. 'text': r''}
  969. initial_transitions = (
  970. 'bullet',
  971. 'enumerator',
  972. 'field_marker',
  973. 'option_marker',
  974. 'doctest',
  975. 'line_block',
  976. 'grid_table_top',
  977. 'simple_table_top',
  978. 'explicit_markup',
  979. 'anonymous',
  980. 'line',
  981. 'text')
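# The transition patterns above are tried in the listed order; 'text' (an
# empty pattern) matches anything and acts as the catch-all.  For example,
# "3. third item" is claimed by 'enumerator', "====" by 'line' (a possible
# section underline or transition), and ordinary prose falls through to 'text'.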
  982. def indent(self, match, context, next_state):
  983. """Block quote."""
  984. indented, indent, line_offset, blank_finish = \
  985. self.state_machine.get_indented()
  986. elements = self.block_quote(indented, line_offset)
  987. self.parent += elements
  988. if not blank_finish:
  989. self.parent += self.unindent_warning('Block quote')
  990. return context, next_state, []
  991. def block_quote(self, indented, line_offset):
  992. elements = []
  993. while indented:
  994. (blockquote_lines,
  995. attribution_lines,
  996. attribution_offset,
  997. indented,
  998. new_line_offset) = self.split_attribution(indented, line_offset)
  999. blockquote = nodes.block_quote()
  1000. self.nested_parse(blockquote_lines, line_offset, blockquote)
  1001. elements.append(blockquote)
  1002. if attribution_lines:
  1003. attribution, messages = self.parse_attribution(
  1004. attribution_lines, attribution_offset)
  1005. blockquote += attribution
  1006. elements += messages
  1007. line_offset = new_line_offset
  1008. while indented and not indented[0]:
  1009. indented = indented[1:]
  1010. line_offset += 1
  1011. return elements
  1012. # U+2014 is an em-dash:
  1013. attribution_pattern = re.compile(u'(---?(?!-)|\u2014) *(?=[^ \\n])')
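# i.e. an attribution starts with "--", "---", or an em-dash, optionally
# followed by spaces, and must be followed by non-blank text on the same
# line ("-- Author" matches; a bare "--" or "---- text" does not).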
  1014. def split_attribution(self, indented, line_offset):
  1015. """
  1016. Check for a block quote attribution and split it off:
  1017. * First line after a blank line must begin with a dash ("--", "---",
  1018. em-dash; matches `self.attribution_pattern`).
  1019. * Every line after that must have consistent indentation.
  1020. * Attributions must be preceded by block quote content.
1021. Return a tuple of: (block quote content lines, attribution lines,
1022. attribution offset, remaining indented lines, new line offset).
  1023. """
  1024. blank = None
  1025. nonblank_seen = False
  1026. for i in range(len(indented)):
  1027. line = indented[i].rstrip()
  1028. if line:
  1029. if nonblank_seen and blank == i - 1: # last line blank
  1030. match = self.attribution_pattern.match(line)
  1031. if match:
  1032. attribution_end, indent = self.check_attribution(
  1033. indented, i)
  1034. if attribution_end:
  1035. a_lines = indented[i:attribution_end]
  1036. a_lines.trim_left(match.end(), end=1)
  1037. a_lines.trim_left(indent, start=1)
  1038. return (indented[:i], a_lines,
  1039. i, indented[attribution_end:],
  1040. line_offset + attribution_end)
  1041. nonblank_seen = True
  1042. else:
  1043. blank = i
  1044. else:
  1045. return (indented, None, None, None, None)
  1046. def check_attribution(self, indented, attribution_start):
  1047. """
  1048. Check attribution shape.
  1049. Return the index past the end of the attribution, and the indent.
  1050. """
  1051. indent = None
  1052. i = attribution_start + 1
  1053. for i in range(attribution_start + 1, len(indented)):
  1054. line = indented[i].rstrip()
  1055. if not line:
  1056. break
  1057. if indent is None:
  1058. indent = len(line) - len(line.lstrip())
  1059. elif len(line) - len(line.lstrip()) != indent:
  1060. return None, None # bad shape; not an attribution
  1061. else:
  1062. # return index of line after last attribution line:
  1063. i += 1
  1064. return i, (indent or 0)
  1065. def parse_attribution(self, indented, line_offset):
  1066. text = '\n'.join(indented).rstrip()
  1067. lineno = self.state_machine.abs_line_number() + line_offset
  1068. textnodes, messages = self.inline_text(text, lin

Large files are truncated, but you can click here to view the full file