cloudsearch_tut.rst | searchcode

/boto-2.5.2/docs/source/cloudsearch_tut.rst

# · ReStructuredText · 264 lines · 196 code · 68 blank · 0 comment · 0 complexity · 7509bd998b71f9b6df35a53c81a60b12 MD5 · raw file

.. cloudsearch_tut:

===============================================
An Introduction to boto's Cloudsearch interface
===============================================

This tutorial focuses on the boto interface to AWS' Cloudsearch_. This tutorial
assumes that you have boto already downloaded and installed.

.. _Cloudsearch: http://aws.amazon.com/cloudsearch/

Creating a Domain
-----------------

    >>> import boto

    >>> our_ip = '192.168.1.0'

    >>> conn = boto.connect_cloudsearch()
    >>> domain = conn.create_domain('demo')

    >>> # Allow our IP address to access the document and search services
    >>> policy = domain.get_access_policies()
    >>> policy.allow_search_ip(our_ip)
    >>> policy.allow_doc_ip(our_ip)

    >>> # Create an 'text' index field called 'username'
    >>> uname_field = domain.create_index_field('username', 'text')
    
    >>> # But it would be neat to drill down into different countries    
    >>> loc_field = domain.create_index_field('location', 'text', facet=True)
    
    >>> # Epoch time of when the user last did something
    >>> time_field = domain.create_index_field('last_activity', 'uint', default=0)
    
    >>> follower_field = domain.create_index_field('follower_count', 'uint', default=0)

    >>> domain.create_rank_expression('recently_active', 'last_activity')  # We'll want to be able to just show the most recently active users
    
    >>> domain.create_rank_expression('activish', 'text_relevance + ((follower_count/(time() - last_activity))*1000)')  # Let's get trickier and combine text relevance with a really dynamic expression

Viewing and Adjusting Stemming for a Domain
--------------------------------------------

A stemming dictionary maps related words to a common stem. A stem is
typically the root or base word from which variants are derived. For
example, run is the stem of running and ran. During indexing, Amazon
CloudSearch uses the stemming dictionary when it performs
text-processing on text fields. At search time, the stemming
dictionary is used to perform text-processing on the search
request. This enables matching on variants of a word. For example, if
you map the term running to the stem run and then search for running,
the request matches documents that contain run as well as running.

To get the current stemming dictionary defined for a domain, use the
``get_stemming`` method of the Domain object.

    >>> stems = domain.get_stemming()
    >>> stems
    {u'stems': {}}
    >>>

This returns a dictionary object that can be manipulated directly to
add additional stems for your search domain by adding pairs of term:stem
to the stems dictionary.

    >>> stems['stems']['running'] = 'run'
    >>> stems['stems']['ran'] = 'run'
    >>> stems
    {u'stems': {u'ran': u'run', u'running': u'run'}}
    >>>

This has changed the value locally.  To update the information in
Amazon CloudSearch, you need to save the data.

    >>> stems.save()

You can also access certain CloudSearch-specific attributes related to
the stemming dictionary defined for your domain.

    >>> stems.status
    u'RequiresIndexDocuments'
    >>> stems.creation_date
    u'2012-05-01T12:12:32Z'
    >>> stems.update_date
    u'2012-05-01T12:12:32Z'
    >>> stems.update_version
    19
    >>>

The status indicates that, because you have changed the stems associated
with the domain, you will need to re-index the documents in the domain
before the new stems are used.

Viewing and Adjusting Stopwords for a Domain
--------------------------------------------

Stopwords are words that should typically be ignored both during
indexing and at search time because they are either insignificant or
so common that including them would result in a massive number of
matches.

To view the stopwords currently defined for your domain, use the
``get_stopwords`` method of the Domain object.

    >>> stopwords = domain.get_stopwords()
    >>> stopwords
    {u'stopwords': [u'a',
     u'an',
     u'and',
     u'are',
     u'as',
     u'at',
     u'be',
     u'but',
     u'by',
     u'for',
     u'in',
     u'is',
     u'it',
     u'of',
     u'on',
     u'or',
     u'the',
     u'to',
     u'was']}
     >>>

You can add additional stopwords by simply appending the values to the
list.

    >>> stopwords['stopwords'].append('foo')
    >>> stopwords['stopwords'].append('bar')
    >>> stopwords

Similarly, you could remove currently defined stopwords from the list.
To save the changes, use the ``save`` method.

    >>> stopwords.save()

The stopwords object has similar attributes defined above for stemming
that provide additional information about the stopwords in your domain.


Viewing and Adjusting Stopwords for a Domain
--------------------------------------------

You can configure synonyms for terms that appear in the data you are
searching. That way, if a user searches for the synonym rather than
the indexed term, the results will include documents that contain the
indexed term.

If you want two terms to match the same documents, you must define
them as synonyms of each other. For example:

    cat, feline
    feline, cat

To view the synonyms currently defined for your domain, use the
``get_synonyms`` method of the Domain object.

    >>> synonyms = domain.get_synsonyms()
    >>> synonyms
    {u'synonyms': {}}
    >>>

You can define new synonyms by adding new term:synonyms entries to the
synonyms dictionary object.

    >>> synonyms['synonyms']['cat'] = ['feline', 'kitten']
    >>> synonyms['synonyms']['dog'] = ['canine', 'puppy']

To save the changes, use the ``save`` method.

    >>> synonyms.save()

The synonyms object has similar attributes defined above for stemming
that provide additional information about the stopwords in your domain.

Adding Documents to the Index
-----------------------------

Now, we can add some documents to our new search domain.

    >>> doc_service = domain.get_document_service()

    >>> # Presumably get some users from your db of choice.
    >>> users = [
        {
            'id': 1,
            'username': 'dan',
            'last_activity': 1334252740,
            'follower_count': 20,
            'location': 'USA'
        },
        {
            'id': 2,
            'username': 'dankosaur',
            'last_activity': 1334252904,
            'follower_count': 1,
            'location': 'UK'
        },
        {
            'id': 3,
            'username': 'danielle',
            'last_activity': 1334252969,
            'follower_count': 100,
            'location': 'DE'
        },
        {
            'id': 4,
            'username': 'daniella',
            'last_activity': 1334253279,
            'follower_count': 7,
            'location': 'USA'
        }
    ]

    >>> for user in users:
    >>>     doc_service.add(user['id'], user['last_activity'], user)

    >>> result = doc_service.commit()  # Actually post the SDF to the document service

The result is an instance of `cloudsearch.CommitResponse` which will
makes the plain dictionary response a nice object (ie result.adds,
result.deletes) and raise an exception for us if all of our documents
weren't actually committed.


Searching Documents
-------------------

Now, let's try performing a search.

    >>> # Get an instance of cloudsearch.SearchServiceConnection
    >>> search_service = domain.get_search_service()

    >>> # Horray wildcard search
    >>> query = "username:'dan*'"


    >>> results = search_service.search(bq=query, rank=['-recently_active'], start=0, size=10)
    
    >>> # Results will give us back a nice cloudsearch.SearchResults object that looks as
    >>> # close as possible to pysolr.Results

    >>> print "Got %s results back." % results.hits
    >>> print "User ids are:"
    >>> for result in results:
    >>>     print result['id']


Deleting Documents
------------------

    >>> import time
    >>> from datetime import datetime

    >>> doc_service = domain.get_document_service()

    >>> # Again we'll cheat and use the current epoch time as our version number
     
    >>> doc_service.delete(4, int(time.mktime(datetime.utcnow().timetuple())))
    >>> service.commit()