/PAPER

http://github.com/fizx/parsley

Abstract
================================================================
A common programming task is extracting data from XML and HTML documents. I introduce parsley, an embedded language (à la SQL or regular expressions) that improves on the usability and/or speed of current extraction techniques.
Introduction
================================================================
Today, developers use a couple of toolsets for data extraction. Many use libraries such as Hpricot for Ruby and Beautiful Soup for Python. These libraries extract XML subtrees via XPath or CSS selectors. The subtrees are then further refined in the host scripting language, often with the help of regular expressions.
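The two-step workflow described above (select subtrees, then refine them in the host language) can be sketched roughly as follows. This is only an illustration: it uses Python's standard-library xml.etree.ElementTree and its limited XPath subset as a stand-in for Hpricot or Beautiful Soup, and the sample document, the PRICE_RE pattern, and the extract_items helper are all hypothetical names made up for this sketch.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document (not taken from the paper).
DOC = """
<html>
  <body>
    <ul id="products">
      <li class="item"><span class="name">Widget</span> <span class="price">Price: $4.50</span></li>
      <li class="item"><span class="name">Gadget</span> <span class="price">Price: $12.00</span></li>
    </ul>
  </body>
</html>
"""

# Assumed price format: a dollar sign followed by digits and two decimals.
PRICE_RE = re.compile(r"\$(\d+\.\d{2})")


def extract_items(markup):
    """Select subtrees with XPath, then refine their text with a regex."""
    root = ET.fromstring(markup)
    items = []
    # Step 1: pick out the subtrees of interest with an XPath expression.
    for li in root.findall(".//li[@class='item']"):
        name = li.findtext("span[@class='name']")
        price_text = li.findtext("span[@class='price']") or ""
        # Step 2: refine the extracted text in the host language.
        match = PRICE_RE.search(price_text)
        price = float(match.group(1)) if match else None
        items.append({"name": name, "price": price})
    return items
```

Note that the selection logic and the refinement logic live in two different languages (XPath and Python plus regular expressions), which is the split the introduction is describing.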
Other developers use XSLT. While fast, mature, and conceptually elegant, XSLT
- current techniques
- benefits of standardization
- best of current
Features
================================================================
- integrated grammars
- with some expression examples
- multiple elements, one pass / context switching
- EXSLT / standard library
- JSON
- language integration
- pruning
- structural parsing
Examples
================================================================
- Ruby/Python/JSON
- structural parse
-
Benchmarks
================================================================
- size comparison with XSLT
- speed comparison with Nokogiri, Hpricot
Conclusion
================================================================