# cs-application-crawling-lab

## Learning goals

1. Analyze the performance of Web indexing algorithms.
2. Implement a Web crawler.

## Overview

In this lab, we present our solution to the previous lab and analyze the performance of Web indexing algorithms. Then we build a simple Web crawler.

## Our Redis-backed indexer

In our solution, we store two kinds of structures in Redis:

* For each search term, we have a `URLSet`, which is a Redis Set of URLs that contain the search term.
* For each URL, we have a `TermCounter`, which is a Redis Hash that maps each search term to the number of times it appears.

We discussed these data types in the previous lab. You can also [read about Redis Sets and Hashes here](http://redis.io/topics/data-types).
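To make these two structures concrete, here is a short, self-contained sketch that writes and reads them with the Jedis client directly. The URL, term, and count are made up, and we connect to a local Redis server purely for illustration; in the lab itself the connection comes from `JedisMaker`, as described later.

```java
import java.util.Map;
import java.util.Set;
import redis.clients.jedis.Jedis;

public class RedisStructuresSketch {
    public static void main(String[] args) {
        // for illustration only: connect to a Redis server running locally
        Jedis jedis = new Jedis("localhost", 6379);

        // suppose we indexed a (made-up) page that contains the word "java" twice;
        // the URLSet for "java" is a Redis Set of URLs:
        jedis.sadd("URLSet:java", "http://example.com/page1");

        // the TermCounter for that page is a Redis Hash from term to count:
        jedis.hset("TermCounter:http://example.com/page1", "java", "2");

        Set<String> urls = jedis.smembers("URLSet:java");
        Map<String, String> counts = jedis.hgetAll("TermCounter:http://example.com/page1");
        System.out.println(urls);    // prints [http://example.com/page1]
        System.out.println(counts);  // prints {java=2}

        jedis.close();
    }
}
```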
In `JedisIndex`, we provide a function that takes a search term and returns the Redis key of its `URLSet`:

```java
private String urlSetKey(String term) {
    return "URLSet:" + term;
}
```

And a function that takes a URL and returns the Redis key of its `TermCounter`:

```java
private String termCounterKey(String url) {
    return "TermCounter:" + url;
}
```
Here's our implementation of `indexPage`, which takes a URL and a JSoup `Elements` object that contains the DOM tree of the paragraphs we want to index:

```java
public void indexPage(String url, Elements paragraphs) {
    System.out.println("Indexing " + url);

    // make a TermCounter and count the terms in the paragraphs
    TermCounter tc = new TermCounter(url);
    tc.processElements(paragraphs);

    // push the contents of the TermCounter to Redis
    pushTermCounterToRedis(tc);
}
```
To index a page, we

1. Make a Java `TermCounter` for the contents of the page, using code from a previous lab.
2. Push the contents of the `TermCounter` to Redis.
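For example, indexing a single page end to end might look like the following sketch. It assumes that `JedisMaker.make` returns a connected `Jedis` object and that the `JedisIndex` constructor takes that connection, as in the previous labs; check the provided code for the exact signatures.

```java
import org.jsoup.select.Elements;
import redis.clients.jedis.Jedis;

public class IndexOnePage {
    public static void main(String[] args) throws Exception {
        Jedis jedis = JedisMaker.make();           // assumed helper from the previous lab
        JedisIndex index = new JedisIndex(jedis);  // assumed constructor signature

        WikiFetcher wf = new WikiFetcher();
        String url = "https://en.wikipedia.org/wiki/Java_(programming_language)";
        Elements paragraphs = wf.fetchWikipedia(url);  // fetches and parses the live page
        index.indexPage(url, paragraphs);
    }
}
```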
Here's the new code that pushes a `TermCounter` to Redis:

```java
public List<Object> pushTermCounterToRedis(TermCounter tc) {
    Transaction t = jedis.multi();

    String url = tc.getLabel();
    String hashname = termCounterKey(url);

    // if this page has already been indexed, delete the old hash
    t.del(hashname);

    // for each term, add an entry in the TermCounter and a new
    // member of the index
    for (String term: tc.keySet()) {
        Integer count = tc.get(term);
        t.hset(hashname, term, count.toString());
        t.sadd(urlSetKey(term), url);
    }
    List<Object> res = t.exec();
    return res;
}
```
This method uses a `Transaction` to collect the operations and send them to the server all at once, which is much faster than sending a series of small operations.

It loops through the terms in the `TermCounter`. For each one, it

1. Finds or creates a `TermCounter` on Redis, then adds a field for the new term.
2. Finds or creates a `URLSet` on Redis, then adds the current URL.

If the page has already been indexed, we delete its old `TermCounter` before pushing the new contents.

That's it for indexing new pages.

The second part of the lab asked you to write `getCounts`, which takes a search term and returns a map from each URL where the term appears to the number of times it appears there. Here is our solution:
```java
public Map<String, Integer> getCounts(String term) {
    Map<String, Integer> map = new HashMap<String, Integer>();
    Set<String> urls = getURLs(term);
    for (String url: urls) {
        Integer count = getCount(url, term);
        map.put(url, count);
    }
    return map;
}
```
This method uses two helpers:

* `getURLs` takes a search term and returns the Set of URLs where the term appears.
* `getCount` takes a URL and a term and returns the number of times the term appears at the given URL.

Here are the implementations:

```java
public Set<String> getURLs(String term) {
    Set<String> set = jedis.smembers(urlSetKey(term));
    return set;
}

public Integer getCount(String url, String term) {
    String redisKey = termCounterKey(url);
    String count = jedis.hget(redisKey, term);
    return new Integer(count);
}
```

Because of the way we designed the index, these methods are simple and efficient.
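As a quick usage sketch (continuing the hypothetical `index` object from the earlier example), a lookup might print its results like this:

```java
// print every URL where "java" appears, along with the number of appearances
Map<String, Integer> counts = index.getCounts("java");
for (Map.Entry<String, Integer> entry: counts.entrySet()) {
    System.out.println(entry.getKey() + ": " + entry.getValue());
}
```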
## Analysis of lookup

Suppose we have indexed `N` pages and discovered `M` unique search terms. How long will it take to look up a search term? Think about your answer before you continue.

To look up a search term, we run `getCounts`, which

1. Creates a map.
2. Runs `getURLs` to get a Set of URLs.
3. For each URL in the Set, runs `getCount` and adds an entry to a HashMap.

`getURLs` takes time proportional to the number of URLs that contain the search term. For rare terms, that might be a small number, but for common terms it might be as large as `N`.

Inside the loop, we run `getCount`, which finds a `TermCounter` on Redis, looks up a term, and adds an entry to a HashMap. Those are all constant time operations, so the overall complexity of `getCounts` is O(N) in the worst case. However, in practice the runtime is proportional to the number of pages that contain the term, which is normally much less than `N`.

This algorithm is about as efficient as it can be, in terms of algorithmic complexity, but it is very slow because it sends many small operations to Redis. You can make it much faster using a `Transaction`. You might want to do that as an exercise, or you can see our solution in `JedisIndex.java`.
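Here is one way that faster version might look; this is a sketch under our own assumptions (the method name is ours, and the actual code in `JedisIndex.java` may differ). The idea is to queue one `HGET` per URL inside a single `Transaction`, so all the lookups go to the server in one round trip.

```java
public Map<String, Integer> getCountsFaster(String term) {
    // collect the URLs in a fixed order so we can match them to the replies
    List<String> urls = new ArrayList<String>(getURLs(term));

    // queue one HGET per URL and send them all at once
    Transaction t = jedis.multi();
    for (String url: urls) {
        t.hget(termCounterKey(url), term);
    }
    List<Object> res = t.exec();

    // the replies come back in the same order the commands were queued
    Map<String, Integer> map = new HashMap<String, Integer>();
    int i = 0;
    for (String url: urls) {
        String count = (String) res.get(i++);
        map.put(url, new Integer(count));
    }
    return map;
}
```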
## Analysis of indexing

Using the data structures we designed, how long will it take to index a page? Again, think about your answer before you continue.

To index a page, we traverse its DOM tree, find all the `TextNode` objects, and split up the strings into search terms. That all takes time proportional to the number of words on the page.

For each term, we increment a counter in a HashMap, which is a constant time operation. So making the `TermCounter` takes time proportional to the number of words on the page.

Pushing the `TermCounter` to Redis requires deleting a `TermCounter`, which is linear in the number of unique terms. Then for each term we have to

1. Add an element to a `URLSet`, and
2. Add an element to a Redis `TermCounter`.

Both of these are constant time operations, so the total time to push the `TermCounter` is linear in the number of unique search terms.

In summary, making the `TermCounter` is proportional to the number of words on the page. Pushing the `TermCounter` to Redis is proportional to the number of unique terms.

Since the number of words on the page usually exceeds the number of unique search terms, the overall complexity is proportional to the number of words on the page. In theory a page might contain all search terms in the index, so the worst case performance is O(M), but we don't expect to see the worst case in practice.

This analysis suggests a way to improve performance: we should probably avoid indexing very common words. First of all, they take up a lot of time and space, because they appear in almost every `URLSet` and `TermCounter`. Furthermore, they are not very useful because they don't help identify relevant pages.

Most search engines avoid indexing common words, which are known in this context as [stop words](https://en.wikipedia.org/wiki/Stop_words).
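A stop-word filter can be as simple as the following sketch. The class name and the tiny word list are ours, not part of the lab code; a real list would contain hundreds of words.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopWordFilter {
    // a tiny, made-up list for illustration; real stop-word lists are much longer
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("the", "a", "an", "and", "or", "of", "to", "in", "is"));

    /** Returns true if the term is worth indexing. */
    public static boolean shouldIndex(String term) {
        return !STOP_WORDS.contains(term.toLowerCase());
    }
}
```

A filter like this could be applied in `TermCounter` before a term is counted, or in `indexPage` before the `TermCounter` is pushed to Redis.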
## Graph traversal

If you did the "Getting to Philosophy" lab, you already have a program that reads a Wikipedia page, finds the first link, uses the link to load the next page, and repeats. This program is a specialized kind of crawler, but when people say "Web crawler" they usually mean a program that:

* Loads a starting page and indexes the contents,
* Finds all the links on the page and adds the linked URLs to a queue,
* Works its way through the queue, loading pages, indexing them, and adding new URLs to the queue, and
* Skips any URL in the queue that has already been indexed.

You can think of the Web as a [graph](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)) where each page is a node and each link is a directed edge from one node to another. Starting from a source node, a crawler traverses this graph, visiting each reachable node once.

The behavior of the queue determines what kind of traversal the crawler performs:

* If the queue is first-in-first-out (FIFO), the crawler performs a breadth-first traversal.
* If the queue is last-in-first-out (LIFO), the crawler performs a depth-first traversal.
* More generally, the items in the queue might be prioritized. For example, we might want to give higher priority to pages that have not been indexed for a long time.

You can [read more about graph traversal here](https://en.wikipedia.org/wiki/Graph_traversal).
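To make the FIFO/LIFO distinction concrete, here is a minimal, generic traversal sketch. The toy `Node` type is ours, not part of the lab code; the lab's `WikiCrawler` works on real pages instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Node {
    String url;
    List<Node> links = new ArrayList<Node>();

    Node(String url) {
        this.url = url;
    }
}

public class TraversalSketch {
    public static void traverse(Node source, boolean depthFirst) {
        Deque<Node> queue = new ArrayDeque<Node>();
        Set<Node> visited = new HashSet<Node>();
        queue.addLast(source);

        while (!queue.isEmpty()) {
            // taking from the head (FIFO) gives breadth-first order;
            // taking from the tail (LIFO) gives depth-first order
            Node node = depthFirst ? queue.pollLast() : queue.pollFirst();
            if (!visited.add(node)) {
                continue;  // already visited; skip it
            }
            System.out.println("Visiting " + node.url);
            for (Node next: node.links) {
                queue.addLast(next);
            }
        }
    }
}
```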
## Making a crawler

Now it's time to write the crawler.

When you check out the repository for this lab, you should find a file structure similar to what you saw in previous labs. The top level directory contains `CONTRIBUTING.md`, `LICENSE.md`, `README.md`, and the directory with the code for this lab, `javacs-lab11`.

In the subdirectory `javacs-lab11/src/com/flatironschool/javacs` you'll find the source files for this lab:

* `WikiCrawler.java`, which contains starter code for your crawler.
* `WikiCrawlerTest.java`, which contains test code for `WikiCrawler`.
* `JedisIndex.java`, which is our solution to the previous lab.

You'll also find some of the helper classes we've used in previous labs:

* `JedisMaker.java`
* `WikiFetcher.java`
* `TermCounter.java`
* `WikiNodeIterable.java`

And as usual, in `javacs-lab11`, you'll find the Ant build file `build.xml`.

Before you run `JedisMaker`, you have to provide a file with information about your Redis server. If you did this in the previous lab, you can just copy it over. Otherwise you can find instructions in the previous lab.

Run `ant build` to compile the source files, then run `ant JedisMaker` to make sure it is configured to connect to your Redis server.

Now run `ant test` to run `WikiCrawlerTest`. It should fail, because you have work to do!
Here's the beginning of the `WikiCrawler` class we provided:

```java
public class WikiCrawler {
    public final String source;
    private JedisIndex index;
    private Queue<String> queue = new LinkedList<String>();
    final static WikiFetcher wf = new WikiFetcher();

    public WikiCrawler(String source, JedisIndex index) {
        this.source = source;
        this.index = index;
        queue.offer(source);
    }

    public int queueSize() {
        return queue.size();
    }
```
The instance variables are

* `source` is the URL where we start crawling.
* `index` is the `JedisIndex` where the results should go.
* `queue` is a `LinkedList` where we keep track of URLs that have been discovered but not yet indexed.
* `wf` is the `WikiFetcher` we'll use to read and parse Web pages.

Your job is to fill in `crawl`. Here's the prototype:

```java
public String crawl(boolean testing) throws IOException {}
```
The parameter `testing` will be `true` when this method is called from `WikiCrawlerTest` and should be `false` otherwise.

When `testing` is `true`, the `crawl` method should:

* Choose and remove a URL from the queue in FIFO order.
* Read the contents of the page using `WikiFetcher.readWikipedia`, which reads cached copies of pages included in this repository for testing purposes (to avoid problems if the Wikipedia version changes).
* Index the page regardless of whether it is already indexed.
* Find all the internal links on the page and add them to the queue in the order they appear; "internal links" are links to other Wikipedia pages. (One possible way to recognize them is sketched below.)
* Return the URL of the page it indexed.

When `testing` is `false`, this method should:

* Choose and remove a URL from the queue in FIFO order.
* If the URL is already indexed, not index it again, and return `null`.
* Otherwise, read the contents of the page using `WikiFetcher.fetchWikipedia`, which reads current content from the Web.
* Then index the page, add the links to the queue, and return the URL of the page it indexed.
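Here is one possible way to find the internal links, as referenced above. This is a sketch only: the method name and the `"/wiki/"` test are our assumptions, not part of the provided starter code.

```java
// a possible helper inside WikiCrawler (name and details are our assumptions)
void queueInternalLinks(Elements paragraphs) {
    for (Element paragraph: paragraphs) {
        // JSoup represents links as <a> elements with an href attribute
        for (Element link: paragraph.select("a[href]")) {
            String relURL = link.attr("href");

            // relative URLs that start with "/wiki/" point to other Wikipedia pages
            if (relURL.startsWith("/wiki/")) {
                String absURL = "https://en.wikipedia.org" + relURL;
                queue.offer(absURL);
            }
        }
    }
}
```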
`WikiCrawlerTest` loads the queue with about 200 links and then invokes `crawl` three times. After each invocation, it checks the return value and the new length of the queue.

When your crawler is working as specified, this test should pass. Good luck!

## Resources

* [Stop words](https://en.wikipedia.org/wiki/Stop_words)
* [Graphs](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics))
* [Graph traversal](https://en.wikipedia.org/wiki/Graph_traversal)