.. include:: ../../global.inc
.. include:: chapter_numbers.inc

.. _manual.split:

###################################################################################
|manual.split.chapter_num|: `Splitting up large tasks / files with` **@split**
###################################################################################
    .. hlist::

        * :ref:`Manual overview <manual>`
        * :ref:`@split <decorators.split>` syntax in detail

    A common requirement in computational pipelines is to split up a large task into
    smaller jobs which can be run on different processors (or sent to a computational
    cluster). Very often, the number of jobs depends dynamically on the size of the
    task, and cannot be known for sure beforehand.

    *Ruffus* uses the :ref:`@split <decorators.split>` decorator to indicate that
    the :term:`task` function will produce an indeterminate number of output files.



    .. index::
        pair: @split; Manual

=================
**@split**
=================
This example is borrowed from :ref:`step 5 <Simple_Tutorial_5th_step>` of the simple tutorial.

    .. note :: See :ref:`accompanying Python Code <Simple_Tutorial_5th_step_code>`

**************************************************************************************
Splitting up a long list of random numbers to calculate their variance
**************************************************************************************

    .. csv-table::
        :widths: 1,99
        :class: borderless

        ".. centered::
            Step 5 from the tutorial:

        .. image:: ../../images/simple_tutorial_step5_sans_key.png", "
            Suppose we had a list of 100,000 random numbers in the file ``random_numbers.list``:

                ::

                    import random
                    NUMBER_OF_RANDOMS = 100000
                    f = open('random_numbers.list', 'w')
                    for i in range(NUMBER_OF_RANDOMS):
                        f.write('%g\n' % (random.random() * 100.0))


            We might want to calculate the sample variance more quickly by splitting them
            into ``NNN`` parcels of 1000 numbers each and working on them in parallel.
            In this case we know that ``NNN == 100``, but usually the number of resulting files
            is only apparent after we have finished processing our starting file."

    Our pipeline function needs to take the random numbers file ``random_numbers.list``,
    read the random numbers from it, and write to a new file every 1000 lines.

    The *Ruffus* decorator :ref:`@split<decorators.split>` is designed specifically for
    splitting up *inputs* into an indeterminate ``NNN`` number of *outputs*:

        .. image:: ../../images/simple_tutorial_split.png

        ::

            @split("random_numbers.list", "*.chunks")
            def step_4_split_numbers_into_chunks (input_file_name, output_files):
                """code goes here"""

    Ruffus will set

        | ``input_file_name`` to ``"random_numbers.list"``
        | ``output_files`` to all files which match ``*.chunks`` (i.e. ``"1.chunks"``, ``"2.chunks"`` etc.).
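
    The body of such a splitting function can be sketched in plain Python. This is only a sketch, assuming a chunk size of 1000 numbers and the ``N.chunks`` naming from the example above; in the pipeline it would carry the ``@split`` decorator shown previously:

```python
# Sketch of the splitting logic. In the pipeline this function would be
# decorated with:  @split("random_numbers.list", "*.chunks")
CHUNK_SIZE = 1000   # assumption: 1000 numbers per chunk file

def step_4_split_numbers_into_chunks(input_file_name, output_files):
    output_file = None
    with open(input_file_name) as input_file:
        for count, line in enumerate(input_file):
            # start a new "N.chunks" file every CHUNK_SIZE lines
            if count % CHUNK_SIZE == 0:
                if output_file:
                    output_file.close()
                output_file = open("%d.chunks" % (count // CHUNK_SIZE + 1), "w")
            output_file.write(line)
    if output_file:
        output_file.close()
```

    For 100,000 input lines this writes ``1.chunks`` through ``100.chunks``, each holding 1000 numbers.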


.. _manual.split.output_files:

=================
Output files
=================

    The *output* (second) parameter of **@split** usually contains a
    |glob|_ pattern like the ``*.chunks`` above.

    .. note::
        **Ruffus** is quite relaxed about the contents of the ``output`` parameter.
        Strings are treated as file names. Strings containing |glob|_ patterns are expanded.
        Other types are passed verbatim to the decorated task function.

    The files which match the |glob|_ will be passed as the actual parameters to the job
    function. Thus, the first time you run the example code, ``*.chunks`` will match nothing
    because no ``.chunks`` files have been created yet, resulting in the following call:

        ::

            step_4_split_numbers_into_chunks ("random_numbers.list", [])

    After that, ``*.chunks`` will match the list of current ``.chunks`` files created by
    the previous pipeline run.


    The file names in *output* are therefore generally out of date or superfluous. They are useful
    mainly for cleaning up detritus from previous runs
    (have a look at :ref:`step_4_split_numbers_into_chunks(...) <Simple_Tutorial_5th_step_code>`).

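    Because the matched file names from the previous run arrive in the *output* parameter, the splitting function can use that list for exactly this clean-up. A minimal sketch (the function name is hypothetical, and only the deletion step is shown):

```python
import os

def cleanup_then_split(input_file_name, output_files):
    # output_files holds whatever matched the output glob(s) from the
    # previous run: delete these stale files first, so that left-over
    # chunks from a larger earlier run cannot be mistaken for new output
    for stale_file_name in output_files:
        if os.path.exists(stale_file_name):
            os.unlink(stale_file_name)
    # ... then split input_file_name and write the fresh output files ...
```
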
    .. note ::

        It is important, nevertheless, to specify the list of *output* files correctly.
        Otherwise, dependent tasks will not know what files you have created, and it will
        not be possible to chain the *output* of this pipeline task automatically into the
        *inputs* of the next step.

        You can specify multiple |glob|_ patterns to match *all* the files which are the
        result of the splitting task function. These can even cover different directories,
        or groups of file names. This is a more extreme example:

            ::

                @split("input.file", ['a*.bits', 'b*.pieces', 'somewhere_else/c*.stuff'])
                def split_function (input_filename, output_files):
                    "Code to split up 'input.file'"


    The actual files resulting from this task function are not constrained by the file names
    in the *output* parameter of the function. The whole point of **@split** is that the number
    of resulting output files cannot be known beforehand, after all.

******************
Example
******************


    Suppose ``random_numbers.list`` can be split into four pieces; this function will create
    ``1.chunks``, ``2.chunks``, ``3.chunks`` and ``4.chunks``.

    Subsequently, we receive a larger ``random_numbers.list`` which should be split into 10
    pieces. If the pipeline is called again, the task function receives the following parameters:

        ::

            step_4_split_numbers_into_chunks("random_numbers.list",
                                             ["1.chunks",               #   previously created files
                                              "2.chunks",               #
                                              "3.chunks",               #
                                              "4.chunks" ])             #


    This doesn't stop the function from creating the extra ``5.chunks``, ``6.chunks`` and so on.

    .. note::

        Any task **@follow**\ ing and specifying
        ``step_4_split_numbers_into_chunks(...)`` as its *inputs* parameter is going to receive
        ``1.chunks``, ``...``, ``10.chunks`` and not merely the first four files.

        In other words, dependent / down-stream tasks which obtain output files automatically
        from the task decorated by **@split** receive the most current file list.
        The |glob|_ patterns will be matched again to see exactly what files the task function
        has created in reality *after* the task completes.
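
    As an aside, the per-chunk arithmetic that such a downstream task might perform, producing partial sums for each ``.chunks`` file and then combining them into the overall variance, is plain Python. This is an illustrative sketch, not the tutorial's actual code; it computes the population variance via ``var = E[x^2] - E[x]^2``:

```python
def chunk_sums(chunk_file_name):
    """Per-chunk partial results: count, sum, and sum of squares."""
    count, total, total_sq = 0, 0.0, 0.0
    with open(chunk_file_name) as chunk_file:
        for line in chunk_file:
            value = float(line)
            count += 1
            total += value
            total_sq += value * value
    return count, total, total_sq

def combined_variance(partial_results):
    """Combine per-chunk partial results into the overall variance."""
    n = sum(count for count, _, _ in partial_results)
    total = sum(t for _, t, _ in partial_results)
    total_sq = sum(sq for _, _, sq in partial_results)
    mean = total / n
    return total_sq / n - mean * mean
```

    Each chunk can be summarised in parallel; only the three partial sums need to be gathered at the end.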