
Towards a universal scraping API
or, an introduction to parsley
Web scraping is a chore.  Scraper scripts are brittle and slow, and everyone writes their own custom implementation, resulting in countless hours of repeated work.  Let's work together to make it easier.  Let's do what regular expressions did for text processing, and what SQL did for databases.  Let's create a universal domain-specific language for web scraping.
What features do we need?  The must-haves:

- Concise
- Easy to learn
- Powerful
- Idiomatic
- Portable
- FAST!!!
In order to make this easy to learn, let's keep the best of what's working today.  I really like Hpricot's ability to use either XPath or CSS to specify the tags to extract.  (For those who don't know, you can use "h1 a" [CSS] or "//h1//a" [XPath] to represent all of the hyperlinks inside h1 headings in a document.)  Sometimes I'd even like to mix XPath and CSS, e.g.: "substring-after(h1, ':')".  Regular expressions are *really* useful, so let's support them too.  Let's use the XPath 2 syntax.
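The descendant-selector case of that CSS-to-XPath correspondence is purely mechanical.  Here's a toy sketch of just that one case — nothing like a real selector compiler, which also has to handle combinators, attributes, and pseudo-classes:

```ruby
# Toy CSS-to-XPath mapping for descendant selectors ONLY ("h1 a" -> "//h1//a").
# A real compiler must also handle >, +, [attr], :nth-child, and so on.
def css_to_xpath(css)
  "//" + css.strip.split(/\s+/).join("//")
end

puts css_to_xpath("h1 a")    # //h1//a
puts css_to_xpath("ul li a") # //ul//li//a
```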
Now for some examples:

- 3rd paragraph:
	p:nth-child(3)
- First sentence in that paragraph (period-delimited):
	substring-before(p:nth-child(3), '.')
- Any simple phone number in an unordered list with id "numbers":
	re:match(ul#numbers>li, '\d{3}-\d{4}', 'g')
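That last expression boils down to a global regex scan over the text of each matched list item.  In plain Ruby, with made-up sample data standing in for the list items:

```ruby
# Stand-ins for the text of each <li> in ul#numbers (sample data, not a real page).
items = ["call 555-1234", "fax 555-9876 or 555-0000", "no number here"]

# The 'g' flag means: collect every match, not just the first per item.
numbers = items.flat_map { |text| text.scan(/\d{3}-\d{4}/) }
p numbers  # => ["555-1234", "555-9876", "555-0000"]
```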
We support all of CSS3 and XPath 1.0, as well as all of the functions in XSLT 1.0 and EXSLT (the required and regexp modules).
I think this is a pretty good way to grab a single piece of data from a page.  It's simple and gives you all of the tools (CSS for simplicity, XPath for power, regex for detailed text handling) you are used to, in one expression.
We'd like to make our scraper script both portable and fast.  For both of these reasons, we need to be able to express the structure of the scraped data independently of the general-purpose programming language you happen to be working in.  Jumping from XPath to Python and back means multiple passes over the document, and Python idioms prevent easy use of your scraper by Rubyists.  If we can represent the entire scrape in a language-independent way, we can compile it into something that libxml2 can handle in one pass, giving screaming-fast (milliseconds per parse) performance.
To describe the output structure, let's use JSON.  It's compact, and the Ruby/Python/etc. bindings can use hashes/lists/dictionaries to represent the same structure.  We can also have the scraper output JSON or native data structures.  Here's an example script that grabs the title and all hyperlinks on a page:
		{
		  "title": "h1",
		  "links": ["a"]
		}
Applying this to a page yields:
		{
		  "title": "Amnesia",
		  "links": ["Yelp", "Welcome", "About Me", ... ]
		}
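To make the mirroring concrete, here's a hypothetical, stripped-down evaluator — not parsley's real internals.  A string value selects the first match; an array value collects all matches.  The `doc` hash stands in for a parsed page (selector => matched strings):

```ruby
# Hypothetical sketch: evaluate a flat parselet against a fake "document"
# represented as a hash mapping each selector to its list of matched strings.
def evaluate(parselet, doc)
  parselet.each_with_object({}) do |(key, sel), out|
    out[key] = if sel.is_a?(Array)
                 doc.fetch(sel.first, [])   # ["a"] -> every match
               else
                 doc.fetch(sel, []).first   # "h1"  -> first match only
               end
  end
end

doc = { "h1" => ["Amnesia"], "a" => ["Yelp", "Welcome", "About Me"] }
# The result mirrors the parselet: a single title, a list of links.
p evaluate({ "title" => "h1", "links" => ["a"] }, doc)
```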
You'll note that the output structure mirrors the input structure.  In the Ruby binding, you can get both input and output natively:
		> require "open-uri"
		> require "parsley"
		> Parsley.new({"title" => "h1", "links" => ["a"]}).parse(:url => "")
		#=> {"title"=>"Amnesia", "links"=>["Yelp", "Welcome", "About Me"]}
We'll also add both explicit and implicit grouping.  Here's an extension of the previous example with explicit grouping:
		{
		  "title": "h1",
		  "links(a)": [{
		    "text": ".",
		    "link": "@href"
		  }]
		}
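Operationally, explicit grouping iterates the matches of the parenthesized selector and evaluates the inner parselet relative to each match.  Roughly, with stubbed-out anchor nodes (the link texts and URLs below are invented for illustration):

```ruby
# Each hash stands in for one <a> node matched by the "(a)" scope.
anchors = [
  { text: "Yelp",     href: "http://example.com/yelp" },
  { text: "About Me", href: "/about" },
]

# Inside the group, "." resolves to the node's text and "@href" to its attribute.
out = { "links" => anchors.map { |a| { "text" => a[:text], "link" => a[:href] } } }
p out
```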
The JSON structure in the output still mirrors the input, but now you can get both the link text and the href.
Pages like craigslist are slightly trickier to group.  Elements on such a page go h4, p, p, p, h4, p, p, p.  To group this, you could do:
		{
		  "entry(p)": [{
		    "title": ".",
		    "date": "preceding::h4"
		  }]
		}
If you instead wanted to group by date, you could use implicit grouping.  It's implicit because the parenthesized filter is omitted.  Grouping happens in page order: we treat the first single (i.e. non-square-bracketed) value (the h4 in the example below) as the beginning of a new group, and add the following values to the group (i.e. [h4, p, p, p], [h4, p, p], [h4, p]).
		{
		  "entry": [{
		    "date": "h4",
		    "title": ["p"]
		  }]
		}
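The implicit rule above is just a page-order sweep: start a new group at each match of the single-valued key, and append the bracketed matches that follow it.  A sketch over stand-in (tag, text) pairs — the dates and listing texts are invented sample data:

```ruby
# Page-order stream of matched elements: each h4 starts a group, each p joins
# the current one.
elements = [["h4", "Mon Mar 2"], ["p", "sofa"], ["p", "bike"], ["p", "lamp"],
            ["h4", "Tue Mar 3"], ["p", "desk"], ["p", "rug"]]

entries = []
elements.each do |tag, text|
  if tag == "h4"
    entries << { "date" => text, "title" => [] }  # single value: new group begins
  elsif entries.any?
    entries.last["title"] << text                 # bracketed value: attach to group
  end
end

p entries  # two groups, one per h4, each with its following p's
```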