.. _emr_tut:

=====================================================
An Introduction to boto's Elastic MapReduce interface
=====================================================

This tutorial focuses on the boto interface to Elastic MapReduce from
Amazon Web Services.  This tutorial assumes that you have already
downloaded and installed boto.

Creating a Connection
---------------------
The first step in accessing Elastic MapReduce is to create a connection
to the service.  There are two ways to do this in boto.  The first is:

>>> from boto.emr.connection import EmrConnection
>>> conn = EmrConnection('<aws access key>', '<aws secret key>')

At this point the variable conn will point to an EmrConnection object.
In this example, the AWS access key and AWS secret key are passed in to
the method explicitly.  Alternatively, you can set the environment variables:

* AWS_ACCESS_KEY_ID - Your AWS Access Key ID
* AWS_SECRET_ACCESS_KEY - Your AWS Secret Access Key

and then call the constructor without any arguments, like this:

>>> conn = EmrConnection()

There is also a shortcut function in the boto package called connect_emr
that may provide a slightly easier means of creating a connection:

>>> import boto
>>> conn = boto.connect_emr()

In either case, conn points to an EmrConnection object which we will use
throughout the remainder of this tutorial.

Creating Streaming JobFlow Steps
--------------------------------
Upon creating a connection to Elastic MapReduce you will next
want to create one or more jobflow steps.  There are two types of steps, streaming
and custom jar, both of which have a class in the boto Elastic MapReduce implementation.

Creating a streaming step that runs the AWS wordcount example, itself written in Python, can be accomplished by:

>>> from boto.emr.step import StreamingStep
>>> step = StreamingStep(name='My wordcount example',
...                      mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
...                      reducer='aggregate',
...                      input='s3n://elasticmapreduce/samples/wordcount/input',
...                      output='s3n://<my output bucket>/output/wordcount_output')

where <my output bucket> is a bucket you have created in S3.

Note that this statement does not run the step; that is accomplished later when we create a jobflow.

Additional arguments of note to the streaming jobflow step are cache_files, cache_archives and step_args.  The cache_files and cache_archives arguments enable you to use Hadoop's distributed cache to share files amongst the instances that run the step.  The step_args argument allows one to pass additional arguments to Hadoop streaming, for example modifications to the Hadoop job configuration, as shown in the sketch below.
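
As an illustrative sketch (the stopwords file, its bucket name, and the -jobconf setting are hypothetical, not part of the AWS sample), a streaming step that ships a file to each node via cache_files and tunes the job configuration via step_args might look like:

>>> step = StreamingStep(name='My wordcount example with extras',
...                      mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
...                      reducer='aggregate',
...                      cache_files=['s3n://<my bucket>/stopwords.txt#stopwords.txt'],
...                      step_args=['-jobconf', 'mapred.reduce.tasks=1'],
...                      input='s3n://elasticmapreduce/samples/wordcount/input',
...                      output='s3n://<my output bucket>/output/wordcount_output')

The #stopwords.txt fragment names the symlink under which the cached file appears in each task's working directory.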

Creating Custom Jar JobFlow Steps
---------------------------------

The second type of jobflow step executes tasks written with a custom jar.  Creating a custom jar step for the AWS CloudBurst example can be accomplished by:

>>> from boto.emr.step import JarStep
>>> step = JarStep(name='CloudBurst example',
...                jar='s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar',
...                step_args=['s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br',
...                           's3n://elasticmapreduce/samples/cloudburst/input/100k.br',
...                           's3n://<my output bucket>/output/cloudburst_output',
...                           36, 3, 0, 1, 240, 48, 24, 24, 128, 16])

Note that this statement does not actually run the step; that is accomplished later when we create a jobflow.  Also note that this JarStep does not include a main_class argument since the jar MANIFEST.MF has a Main-Class entry.

Creating JobFlows
-----------------
Once you have created one or more jobflow steps, you will next want to create and run a jobflow.  Creating a jobflow that executes either of the steps we created above can be accomplished by:

>>> import boto
>>> conn = boto.connect_emr()
>>> jobid = conn.run_jobflow(name='My jobflow',
...                          log_uri='s3://<my log uri>/jobflow_logs',
...                          steps=[step])

The method does not block until the jobflow completes; it returns immediately.  The status of the jobflow can be determined by:

>>> status = conn.describe_jobflow(jobid)
>>> status.state
u'STARTING'

One can then use this state to block until the jobflow completes, as in the sketch below.  Valid jobflow states currently defined in the AWS API are COMPLETED, FAILED, TERMINATED, RUNNING, SHUTTING_DOWN, STARTING and WAITING.
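
For example, a minimal polling loop (a sketch; the 30-second interval is an arbitrary choice) that waits for the jobflow to reach a terminal state could look like:

>>> import time
>>> while conn.describe_jobflow(jobid).state not in (u'COMPLETED', u'FAILED', u'TERMINATED'):
...     time.sleep(30)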

In some cases you may not have built all of the steps prior to running the jobflow.  In these cases additional steps can be added to a jobflow by running:

>>> conn.add_jobflow_steps(jobid, [second_step])

If you wish to add additional steps to a running jobflow, you may want to set the keep_alive parameter to True in run_jobflow so that the jobflow does not automatically terminate when the first step completes; see the sketch below.
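
For instance, assuming first_step and second_step have been built as shown earlier, a long-lived jobflow could be started and extended like this sketch:

>>> jobid = conn.run_jobflow(name='My long-lived jobflow',
...                          log_uri='s3://<my log uri>/jobflow_logs',
...                          keep_alive=True,
...                          steps=[first_step])
>>> conn.add_jobflow_steps(jobid, [second_step])

With keep_alive set, the jobflow sits in the WAITING state between steps and must eventually be terminated explicitly, as described below.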

The run_jobflow method has a number of important parameters that are worth investigating.  They include parameters to change the number and type of EC2 instances on which the jobflow is executed, set an SSH key for manual debugging, and enable AWS console debugging, as in the sketch below.
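
As a sketch of those parameters (the key pair name and the log bucket are placeholders, and the instance count and types are arbitrary choices):

>>> jobid = conn.run_jobflow(name='My configured jobflow',
...                          log_uri='s3://<my log uri>/jobflow_logs',
...                          ec2_keyname='<my ec2 key pair>',
...                          num_instances=5,
...                          master_instance_type='m1.small',
...                          slave_instance_type='m1.small',
...                          enable_debugging=True,
...                          steps=[step])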

Terminating JobFlows
--------------------
By default, when all the steps of a jobflow have finished or failed, the jobflow terminates.  However, if you set the keep_alive parameter to True, or just want to halt the execution of a jobflow early, you can terminate a jobflow by:

>>> import boto
>>> conn = boto.connect_emr()
>>> conn.terminate_jobflow('<jobflow id>')