.. _cloudsearch_tut:

===============================================
An Introduction to boto's Cloudsearch interface
===============================================

This tutorial focuses on the boto interface to AWS' Cloudsearch_ and assumes
that you have already downloaded and installed boto.

.. _Cloudsearch: http://aws.amazon.com/cloudsearch/

Creating a Domain
-----------------

    >>> import boto

    >>> our_ip = '192.168.1.0'

    >>> conn = boto.connect_cloudsearch()
    >>> domain = conn.create_domain('demo')

    >>> # Allow our IP address to access the document and search services
    >>> policy = domain.get_access_policies()
    >>> policy.allow_search_ip(our_ip)
    >>> policy.allow_doc_ip(our_ip)

    >>> # Create a 'text' index field called 'username'
    >>> uname_field = domain.create_index_field('username', 'text')

    >>> # A faceted field, so that we can drill down into different countries
    >>> loc_field = domain.create_index_field('location', 'text', facet=True)

    >>> # Epoch time of when the user last did something
    >>> time_field = domain.create_index_field('last_activity', 'uint', default=0)

    >>> # Number of followers the user has
    >>> follower_field = domain.create_index_field('follower_count', 'uint', default=0)

    >>> # We'll want to be able to just show the most recently active users
    >>> domain.create_rank_expression('recently_active', 'last_activity')

    >>> # Let's get trickier and combine text relevance with a really dynamic expression
    >>> domain.create_rank_expression('activish', 'text_relevance + ((follower_count/(time() - last_activity))*1000)')
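
To get a feel for how the ``activish`` expression behaves, it can help to
evaluate the same arithmetic locally. The following is only a sketch: the
function name ``activish``, the ``text_relevance`` value of 200 and the chosen
``now`` are made up for illustration, and it assumes floating-point division,
which may not match how CloudSearch itself evaluates the expression.

```python
def activish(text_relevance, follower_count, last_activity, now):
    """Local stand-in for the 'activish' rank expression.

    Mirrors: text_relevance + ((follower_count / (time() - last_activity)) * 1000)
    """
    seconds_idle = now - last_activity  # would be zero if the user was active "right now"
    return text_relevance + (follower_count / float(seconds_idle)) * 1000


# Hypothetical inputs: identical text relevance for both users, evaluated one
# hour after the latest activity timestamp in the sample data used later on.
now = 1334253279 + 3600
score_danielle = activish(200, follower_count=100, last_activity=1334252969, now=now)
score_dankosaur = activish(200, follower_count=1, last_activity=1334252904, now=now)

# With equal text relevance, the user with more followers ranks higher.
print(score_danielle > score_dankosaur)
```

The follower term shrinks as a user stays idle, so over time the score falls
back towards plain text relevance.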

Viewing and Adjusting Stemming for a Domain
--------------------------------------------

A stemming dictionary maps related words to a common stem. A stem is
typically the root or base word from which variants are derived. For
example, run is the stem of running and ran. During indexing, Amazon
CloudSearch uses the stemming dictionary when it performs
text-processing on text fields. At search time, the stemming
dictionary is used to perform text-processing on the search
request. This enables matching on variants of a word. For example, if
you map the term running to the stem run and then search for running,
the request matches documents that contain run as well as running.
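
The effect described above can be imitated in a few lines of plain Python.
This is a simplified sketch rather than CloudSearch's actual text processing:
``stem_map`` and ``normalize`` are hypothetical names, and tokenization is
reduced to lowercasing and splitting on whitespace.

```python
# term -> stem, in the same spirit as the stemming dictionary described above
stem_map = {'running': 'run', 'ran': 'run'}


def normalize(text, stem_map):
    """Lowercase and split text, replacing each term with its stem if one is defined."""
    return [stem_map.get(term, term) for term in text.lower().split()]


# The same normalization is applied to the document and to the query.
indexed_terms = normalize('She ran home', stem_map)
query_terms = normalize('running', stem_map)

# The query matches because both sides were reduced to the stem 'run'.
matches = any(term in indexed_terms for term in query_terms)
print(matches)
```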

To get the current stemming dictionary defined for a domain, use the
``get_stemming`` method of the Domain object.

    >>> stems = domain.get_stemming()
    >>> stems
    {u'stems': {}}
    >>>

This returns a dictionary object that can be manipulated directly to
add additional stems for your search domain by adding pairs of term:stem
to the stems dictionary.

    >>> stems['stems']['running'] = 'run'
    >>> stems['stems']['ran'] = 'run'
    >>> stems
    {u'stems': {u'ran': u'run', u'running': u'run'}}
    >>>

This has changed the value locally.  To update the information in
Amazon CloudSearch, you need to save the data.

    >>> stems.save()

You can also access certain CloudSearch-specific attributes related to
the stemming dictionary defined for your domain.

    >>> stems.status
    u'RequiresIndexDocuments'
    >>> stems.creation_date
    u'2012-05-01T12:12:32Z'
    >>> stems.update_date
    u'2012-05-01T12:12:32Z'
    >>> stems.update_version
    19
    >>>

The status indicates that, because you have changed the stems associated
with the domain, you will need to re-index the documents in the domain
before the new stems are used.

Viewing and Adjusting Stopwords for a Domain
--------------------------------------------

Stopwords are words that should typically be ignored both during
indexing and at search time because they are either insignificant or
so common that including them would result in a massive number of
matches.
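
As a rough illustration of what ignoring stopwords means, the sketch below
drops them from a piece of text before it would be indexed or searched. The
names ``sample_stopwords`` and ``remove_stopwords`` are hypothetical, the list
is only a short subset, and the whitespace tokenization is a simplification of
what the service does.

```python
# A short illustrative subset of stopwords
sample_stopwords = ['a', 'an', 'and', 'of', 'the', 'to']


def remove_stopwords(text, stopwords):
    """Return the lowercased terms of text that are not stopwords."""
    ignored = set(stopwords)
    return [term for term in text.lower().split() if term not in ignored]


remaining = remove_stopwords('The history of the internet', sample_stopwords)
print(remaining)
```

Only the terms that carry meaning are left to match against, which is why a
query consisting solely of stopwords would have nothing to search for.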

To view the stopwords currently defined for your domain, use the
``get_stopwords`` method of the Domain object.

    >>> stopwords = domain.get_stopwords()
    >>> stopwords
    {u'stopwords': [u'a',
     u'an',
     u'and',
     u'are',
     u'as',
     u'at',
     u'be',
     u'but',
     u'by',
     u'for',
     u'in',
     u'is',
     u'it',
     u'of',
     u'on',
     u'or',
     u'the',
     u'to',
     u'was']}
    >>>

You can add additional stopwords by simply appending the values to the
list.

    >>> stopwords['stopwords'].append('foo')
    >>> stopwords['stopwords'].append('bar')
    >>> stopwords

Similarly, you could remove currently defined stopwords from the list.
To save the changes, use the ``save`` method.

    >>> stopwords.save()
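
Removing a word is ordinary list manipulation. The sketch below uses a plain
dictionary as a stand-in for the object returned by ``get_stopwords`` (an
assumption made so that it runs without a domain), and ``discard_stopword`` is
a hypothetical helper. With the real object you would still call ``save``
afterwards.

```python
# Plain-dict stand-in for the stopwords object shown above
stopword_options = {'stopwords': ['a', 'an', 'and', 'foo', 'bar']}


def discard_stopword(options, word):
    """Remove word from the stopword list if present; return True if it was removed."""
    words = options['stopwords']
    if word in words:
        words.remove(word)
        return True
    return False


removed = discard_stopword(stopword_options, 'foo')
missing = discard_stopword(stopword_options, 'not-there')
print(stopword_options['stopwords'])
```

Checking for membership first avoids the ``ValueError`` that ``list.remove``
raises when the word is not in the list.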

The stopwords object has attributes similar to those described above for
stemming, which provide additional information about the stopwords in your
domain.

Viewing and Adjusting Synonyms for a Domain
--------------------------------------------

You can configure synonyms for terms that appear in the data you are
searching. That way, if a user searches for the synonym rather than
the indexed term, the results will include documents that contain the
indexed term.

If you want two terms to match the same documents, you must define
them as synonyms of each other. For example::

    cat, feline
    feline, cat

To view the synonyms currently defined for your domain, use the
``get_synonyms`` method of the Domain object.

    >>> synonyms = domain.get_synonyms()
    >>> synonyms
    {u'synonyms': {}}
    >>>

You can define new synonyms by adding new term:synonyms entries to the
synonyms dictionary object.

    >>> synonyms['synonyms']['cat'] = ['feline', 'kitten']
    >>> synonyms['synonyms']['dog'] = ['canine', 'puppy']
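
Because terms only match each other when they are defined as synonyms in both
directions, it can be convenient to build the reverse entries automatically.
The following is a sketch of one way to do that on a plain dictionary;
``make_symmetric`` is a hypothetical helper, not part of boto.

```python
def make_symmetric(synonym_map):
    """Return a copy of the synonym map in which every pairing goes both ways."""
    result = {}
    for term, alternatives in synonym_map.items():
        result.setdefault(term, set())
        for alt in alternatives:
            result[term].add(alt)
            # Add the reverse direction: alt -> term
            result.setdefault(alt, set()).add(term)
    # Sorted lists keep the output stable and easy to read
    return dict((term, sorted(alts)) for term, alts in result.items())


synonym_map = {'cat': ['feline', 'kitten'], 'dog': ['canine', 'puppy']}
symmetric = make_symmetric(synonym_map)
print(symmetric['feline'])
```

Note that this only pairs each alternative with its original term. It does not
make ``feline`` and ``kitten`` synonyms of each other.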

To save the changes, use the ``save`` method.

    >>> synonyms.save()

The synonyms object has attributes similar to those described above for
stemming, which provide additional information about the synonyms in your
domain.

Adding Documents to the Index
-----------------------------

Now, we can add some documents to our new search domain.

    >>> doc_service = domain.get_document_service()

    >>> # Presumably get some users from your db of choice.
    >>> users = [
    ...     {
    ...         'id': 1,
    ...         'username': 'dan',
    ...         'last_activity': 1334252740,
    ...         'follower_count': 20,
    ...         'location': 'USA'
    ...     },
    ...     {
    ...         'id': 2,
    ...         'username': 'dankosaur',
    ...         'last_activity': 1334252904,
    ...         'follower_count': 1,
    ...         'location': 'UK'
    ...     },
    ...     {
    ...         'id': 3,
    ...         'username': 'danielle',
    ...         'last_activity': 1334252969,
    ...         'follower_count': 100,
    ...         'location': 'DE'
    ...     },
    ...     {
    ...         'id': 4,
    ...         'username': 'daniella',
    ...         'last_activity': 1334253279,
    ...         'follower_count': 7,
    ...         'location': 'USA'
    ...     }
    ... ]

    >>> # Use each user's id as the document id and last_activity as the version
    >>> for user in users:
    ...     doc_service.add(user['id'], user['last_activity'], user)

    >>> result = doc_service.commit()  # Actually post the SDF to the document service

The result is an instance of ``cloudsearch.CommitResponse``, which wraps the
plain dictionary response in an object (e.g. ``result.adds``,
``result.deletes``) and raises an exception if not all of our documents were
actually committed.
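
For orientation, the batch that gets posted on commit is a JSON list of
operations. The sketch below builds such a batch by hand. The field layout
(``type``, ``id``, ``version``, ``lang``, ``fields``) is our understanding of
the SDF format and should be treated as an assumption, and ``build_sdf`` is a
hypothetical helper rather than a boto function.

```python
import json


def build_sdf(users):
    """Build a JSON batch with one 'add' operation per user (assumed SDF layout)."""
    batch = []
    for user in users:
        batch.append({
            'type': 'add',
            'id': user['id'],
            'version': user['last_activity'],  # same versioning choice as above
            'lang': 'en',
            'fields': user,
        })
    return json.dumps(batch)


sample_users = [
    {'id': 1, 'username': 'dan', 'last_activity': 1334252740,
     'follower_count': 20, 'location': 'USA'},
]
sdf = build_sdf(sample_users)
print(sdf)
```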

Searching Documents
-------------------

Now, let's try performing a search.

    >>> # Get an instance of cloudsearch.SearchServiceConnection
    >>> search_service = domain.get_search_service()

    >>> # Hooray, wildcard search
    >>> query = "username:'dan*'"

    >>> results = search_service.search(bq=query, rank=['-recently_active'], start=0, size=10)

    >>> # Results will give us back a nice cloudsearch.SearchResults object that looks as
    >>> # close as possible to pysolr.Results

    >>> print "Got %s results back." % results.hits
    >>> print "User ids are:"
    >>> for result in results:
    ...     print result['id']
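
To see what to expect from this query, the sketch below imitates it locally on
the sample users. It is not the search API: ``prefix_search`` is a hypothetical
helper, matching is reduced to a plain ``startswith`` check, and it assumes
that the leading ``-`` in ``-recently_active`` means descending order.

```python
def prefix_search(users, prefix, size=10):
    """Return up to `size` users whose username starts with prefix, most recently active first."""
    hits = [user for user in users if user['username'].startswith(prefix)]
    hits.sort(key=lambda user: user['last_activity'], reverse=True)
    return hits[:size]


sample_users = [
    {'id': 1, 'username': 'dan', 'last_activity': 1334252740},
    {'id': 2, 'username': 'dankosaur', 'last_activity': 1334252904},
    {'id': 3, 'username': 'danielle', 'last_activity': 1334252969},
    {'id': 4, 'username': 'daniella', 'last_activity': 1334253279},
]

ids = [user['id'] for user in prefix_search(sample_users, 'dan')]
print(ids)
```

All four sample usernames start with ``dan``, so under these assumptions the
ids come back ordered by ``last_activity``, newest first.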

Deleting Documents
------------------

    >>> import time

    >>> doc_service = domain.get_document_service()

    >>> # Again we'll cheat and use the current epoch time as our version number
    >>> doc_service.delete(4, int(time.time()))
    >>> doc_service.commit()
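
The version number matters because it decides which operation wins. The toy
model below illustrates the idea under the assumption that an operation is
applied only when its version is greater than the one already stored for that
document. ``VersionedStore`` is a hypothetical class written for this sketch,
not part of boto or CloudSearch.

```python
import time


class VersionedStore(object):
    """Toy model: an add or delete is applied only if its version is newer."""

    def __init__(self):
        self.versions = {}  # document id -> last applied version
        self.docs = {}      # document id -> fields

    def _is_newer(self, doc_id, version):
        return version > self.versions.get(doc_id, -1)

    def add(self, doc_id, version, fields):
        if not self._is_newer(doc_id, version):
            return False
        self.versions[doc_id] = version
        self.docs[doc_id] = fields
        return True

    def delete(self, doc_id, version):
        if not self._is_newer(doc_id, version):
            return False
        self.versions[doc_id] = version
        self.docs.pop(doc_id, None)
        return True


store = VersionedStore()
store.add(4, 1334253279, {'username': 'daniella'})

stale = store.delete(4, 1334253000)       # older than the add, so ignored
fresh = store.delete(4, int(time.time())) # current epoch time is newer
print(stale, fresh)
```

Using the current epoch time for the delete works here because it is later
than the ``last_activity`` value that served as the version when the document
was added.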