.. _cloudsearch_tut:

===============================================
An Introduction to boto's Cloudsearch interface
===============================================

This tutorial focuses on the boto interface to AWS' Cloudsearch_. It
assumes that you have already downloaded and installed boto.

.. _Cloudsearch: http://aws.amazon.com/cloudsearch/

Creating a Domain
-----------------

>>> import boto
>>> our_ip = '192.168.1.0'
>>> conn = boto.connect_cloudsearch()
>>> domain = conn.create_domain('demo')
>>> # Allow our IP address to access the document and search services
>>> policy = domain.get_access_policies()
>>> policy.allow_search_ip(our_ip)
>>> policy.allow_doc_ip(our_ip)
>>> # Create a 'text' index field called 'username'
>>> uname_field = domain.create_index_field('username', 'text')
>>> # A faceted 'location' field lets us drill down into different countries
>>> loc_field = domain.create_index_field('location', 'text', facet=True)
>>> # Epoch time of when the user last did something
>>> time_field = domain.create_index_field('last_activity', 'uint', default=0)
>>> follower_field = domain.create_index_field('follower_count', 'uint', default=0)
>>> # We'll want to be able to show just the most recently active users
>>> domain.create_rank_expression('recently_active', 'last_activity')
>>> # Let's get trickier and combine text relevance with a really dynamic expression
>>> domain.create_rank_expression('activish', 'text_relevance + ((follower_count/(time() - last_activity))*1000)')
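
To get a feel for how the ``activish`` expression orders users, the sketch
below mirrors it in plain Python. It is only an illustration: the
``text_relevance`` value of 100, the fixed ``now`` timestamp, and the sample
users are hypothetical, and it assumes floating-point division, which may
differ from how CloudSearch evaluates the expression.

```python
def activish(text_relevance, follower_count, last_activity, now):
    """Plain-Python mirror of the 'activish' rank expression.

    `now` stands in for the time() call in the expression.
    Float division is an assumption made for this sketch.
    """
    return text_relevance + (follower_count / float(now - last_activity)) * 1000


now = 1334260000  # hypothetical "current" epoch time

# (username, follower_count, last_activity) -- hypothetical sample data
users = [
    ("dan", 20, 1334252740),
    ("danielle", 100, 1334252969),
]

# Same made-up text relevance for everyone, so only activity and followers differ
scores = dict(
    (name, activish(100, followers, last, now)) for name, followers, last in users
)

# Highest score first
ranking = sorted(scores, key=scores.get, reverse=True)
```

With equal text relevance, a user with more followers and more recent
activity ends up with the higher score.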

Viewing and Adjusting Stemming for a Domain
--------------------------------------------

A stemming dictionary maps related words to a common stem. A stem is
typically the root or base word from which variants are derived. For
example, run is the stem of running and ran. During indexing, Amazon
CloudSearch uses the stemming dictionary when it performs
text-processing on text fields. At search time, the stemming
dictionary is used to perform text-processing on the search
request. This enables matching on variants of a word. For example, if
you map the term running to the stem run and then search for running,
the request matches documents that contain run as well as running.
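
The matching behaviour described above can be sketched in a few lines of
plain Python. This is not how CloudSearch is implemented; the ``normalize``
and ``matches`` helpers are hypothetical names used only to show why mapping
both the indexed terms and the query term to a stem lets variants match.

```python
# term -> stem pairs, in the same shape as the 'stems' dictionary used below
stems = {"running": "run", "ran": "run"}


def normalize(term, stems):
    """Return the stem for a term, or the term itself if no stem is defined."""
    return stems.get(term, term)


def matches(query_term, document_terms, stems):
    """True if any document term shares a stem with the query term."""
    target = normalize(query_term, stems)
    return any(normalize(term, stems) == target for term in document_terms)


# A search for 'running' matches a document that only contains 'run'
found = matches("running", ["run", "fast"], stems)
```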

To get the current stemming dictionary defined for a domain, use the
``get_stemming`` method of the Domain object.

>>> stems = domain.get_stemming()
>>> stems
{u'stems': {}}

This returns a dictionary object that can be manipulated directly to
add stems for your search domain by adding ``term: stem`` pairs to the
stems dictionary.

>>> stems['stems']['running'] = 'run'
>>> stems['stems']['ran'] = 'run'
>>> stems
{u'stems': {u'ran': u'run', u'running': u'run'}}

This has changed the value locally. To update the information in
Amazon CloudSearch, you need to save the data.

>>> stems.save()

You can also access certain CloudSearch-specific attributes related to
the stemming dictionary defined for your domain.

>>> stems.status
u'RequiresIndexDocuments'
>>> stems.creation_date
u'2012-05-01T12:12:32Z'
>>> stems.update_date
u'2012-05-01T12:12:32Z'
>>> stems.update_version
19

The status indicates that, because you have changed the stems associated
with the domain, you will need to re-index the documents in the domain
before the new stems are used.

Viewing and Adjusting Stopwords for a Domain
--------------------------------------------

Stopwords are words that should typically be ignored both during
indexing and at search time because they are either insignificant or
so common that including them would result in a massive number of
matches.
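
As a rough illustration of the effect, the sketch below drops stopwords
from a piece of text before it would be indexed or searched. The
``significant_terms`` helper and the whitespace tokenizer are assumptions
made for this example; CloudSearch's own text processing is more involved.

```python
# A small subset of typical English stopwords (illustrative only)
STOPWORDS = {"a", "an", "and", "the", "of", "to"}


def significant_terms(text, stopwords=STOPWORDS):
    """Lower-case the text, split on whitespace, and drop any stopwords."""
    return [word for word in text.lower().split() if word not in stopwords]


terms = significant_terms("The history of the cat")
```

Only ``history`` and ``cat`` survive, so a query made up entirely of
stopwords would have nothing left to match on.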

To view the stopwords currently defined for your domain, use the
``get_stopwords`` method of the Domain object.

>>> stopwords = domain.get_stopwords()
>>> stopwords
{u'stopwords': [u'a',
                u'an',
                u'and',
                u'are',
                u'as',
                u'at',
                u'be',
                u'but',
                u'by',
                u'for',
                u'in',
                u'is',
                u'it',
                u'of',
                u'on',
                u'or',
                u'the',
                u'to',
                u'was']}

You can add stopwords by appending values to the list.

>>> stopwords['stopwords'].append('foo')
>>> stopwords['stopwords'].append('bar')
>>> stopwords

Similarly, you can remove currently defined stopwords from the list.

To save the changes, use the ``save`` method.

>>> stopwords.save()

The stopwords object has attributes similar to those described above for
stemming; they provide additional information about the stopwords in your
domain.

Viewing and Adjusting Synonyms for a Domain
--------------------------------------------

You can configure synonyms for terms that appear in the data you are
searching. That way, if a user searches for the synonym rather than
the indexed term, the results will include documents that contain the
indexed term.

If you want two terms to match the same documents, you must define
them as synonyms of each other. For example::

    cat, feline
    feline, cat
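
Because each direction has to be defined explicitly, it can be convenient
to generate the mapping from groups of equivalent terms. The helper below
is a hypothetical sketch, not part of boto; it produces a plain
``term -> [synonyms]`` dictionary in which every term in a group lists all
the others.

```python
def symmetric_synonyms(groups):
    """Build a term -> sorted list of synonyms mapping from groups of terms.

    Every term in a group becomes a synonym of every other term in it.
    If a term appears in more than one group, the later group wins.
    """
    mapping = {}
    for group in groups:
        for term in group:
            mapping[term] = sorted(other for other in group if other != term)
    return mapping


synonym_map = symmetric_synonyms([
    ["cat", "feline"],
    ["dog", "canine", "puppy"],
])
```

Here ``synonym_map['cat']`` is ``['feline']`` and ``synonym_map['feline']``
is ``['cat']``, which corresponds to the two-way definition shown above.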

To view the synonyms currently defined for your domain, use the
``get_synonyms`` method of the Domain object.

>>> synonyms = domain.get_synonyms()
>>> synonyms
{u'synonyms': {}}

You can define new synonyms by adding new ``term: synonyms`` entries to the
synonyms dictionary object.

>>> synonyms['synonyms']['cat'] = ['feline', 'kitten']
>>> synonyms['synonyms']['dog'] = ['canine', 'puppy']

To save the changes, use the ``save`` method.

>>> synonyms.save()

The synonyms object has attributes similar to those described above for
stemming; they provide additional information about the synonyms in your
domain.

Adding Documents to the Index
-----------------------------

Now, we can add some documents to our new search domain.

>>> doc_service = domain.get_document_service()
>>> # Presumably get some users from your db of choice.
>>> users = [
...     {
...         'id': 1,
...         'username': 'dan',
...         'last_activity': 1334252740,
...         'follower_count': 20,
...         'location': 'USA'
...     },
...     {
...         'id': 2,
...         'username': 'dankosaur',
...         'last_activity': 1334252904,
...         'follower_count': 1,
...         'location': 'UK'
...     },
...     {
...         'id': 3,
...         'username': 'danielle',
...         'last_activity': 1334252969,
...         'follower_count': 100,
...         'location': 'DE'
...     },
...     {
...         'id': 4,
...         'username': 'daniella',
...         'last_activity': 1334253279,
...         'follower_count': 7,
...         'location': 'USA'
...     }
... ]
>>> for user in users:
...     doc_service.add(user['id'], user['last_activity'], user)
>>> result = doc_service.commit()  # Actually post the SDF to the document service

The result is an instance of ``cloudsearch.CommitResponse``, which wraps
the plain dictionary response in a nicer object (i.e. ``result.adds``,
``result.deletes``) and raises an exception if not all of our documents
were actually committed.

Searching Documents
-------------------

Now, let's try performing a search.

>>> # Get an instance of cloudsearch.SearchServiceConnection
>>> search_service = domain.get_search_service()
>>> # Hooray, wildcard search
>>> query = "username:'dan*'"
>>> results = search_service.search(bq=query, rank=['-recently_active'], start=0, size=10)
>>> # Results will give us back a nice cloudsearch.SearchResults object that looks as
>>> # close as possible to pysolr.Results
>>> print("Got %s results back." % results.hits)
>>> print("User ids are:")
>>> for result in results:
...     print(result['id'])
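
Deleting Documents
------------------

>>> import time
>>> from datetime import datetime
>>> doc_service = domain.get_document_service()
>>> # Again we'll cheat and use the current epoch time as our version number
>>> doc_service.delete(4, int(time.mktime(datetime.utcnow().timetuple())))
>>> doc_service.commit()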
  193. >>> doc_service = domain.get_document_service()
  194. >>> # Again we'll cheat and use the current epoch time as our version number
  195. >>> doc_service.delete(4, int(time.mktime(datetime.utcnow().timetuple())))
  196. >>> service.commit()