.. include:: ../../global.inc
.. include:: chapter_numbers.inc

.. _manual.split:

###################################################################################
|manual.split.chapter_num|: `Splitting up large tasks / files with` **@split**
###################################################################################
    .. hlist::

        * :ref:`Manual overview <manual>`
        * :ref:`@split <decorators.split>` syntax in detail

    A common requirement in computational pipelines is to split up a large task into
    smaller jobs which can be run on different processors (or sent to a computational
    cluster). Very often, the number of jobs depends dynamically on the size of the
    task, and cannot be known for sure beforehand.

    *Ruffus* uses the :ref:`@split <decorators.split>` decorator to indicate that
    the :term:`task` function will produce an indeterminate number of output files.



    .. index::
        pair: @split; Manual

=================
**@split**
=================
This example is borrowed from :ref:`step 5 <Simple_Tutorial_5th_step>` of the simple tutorial.

    .. note :: See :ref:`accompanying Python Code <Simple_Tutorial_5th_step_code>`

**************************************************************************************
Splitting up a long list of random numbers to calculate their variance
**************************************************************************************

    .. csv-table::
        :widths: 1,99
        :class: borderless

        ".. centered::
            Step 5 from the tutorial:

        .. image:: ../../images/simple_tutorial_step5_sans_key.png", "
            Suppose we had a list of 100,000 random numbers in the file ``random_numbers.list``:

                ::

                    import random
                    NUMBER_OF_RANDOMS = 100000
                    f = open('random_numbers.list', 'w')
                    for i in range(NUMBER_OF_RANDOMS):
                        f.write('%g\n' % (random.random() * 100.0))


            We might want to calculate the sample variance more quickly by splitting them
            into ``NNN`` parcels of 1000 numbers each and working on them in parallel.
            In this case we know that ``NNN == 100``, but usually the number of resulting files
            is only apparent after we have finished processing our starting file."

    Our pipeline function needs to take the random numbers file ``random_numbers.list``,
    read the random numbers from it, and write to a new file every 1000 lines.

    The *Ruffus* decorator :ref:`@split<decorators.split>` is designed specifically for
    splitting up *inputs* into an indeterminate ``NNN`` number of *outputs*:

        .. image:: ../../images/simple_tutorial_split.png

        ::

            @split("random_numbers.list", "*.chunks")
            def step_4_split_numbers_into_chunks (input_file_name, output_files):
                """code goes here"""

    Ruffus will set

        | ``input_file_name`` to ``"random_numbers.list"``
        | ``output_files`` to all files which match ``*.chunks`` (i.e. ``"1.chunks"``, ``"2.chunks"`` etc.).
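
    The body of such a splitting function can be sketched in plain Python. This is only a sketch, assuming a chunk size of 1000 numbers and the ``N.chunks`` naming from the example above; in the pipeline it would carry the ``@split`` decorator shown previously:

```python
# Sketch of the splitting logic. In the pipeline this function would be
# decorated with:  @split("random_numbers.list", "*.chunks")
CHUNK_SIZE = 1000   # assumption: 1000 numbers per chunk file

def step_4_split_numbers_into_chunks(input_file_name, output_files):
    output_file = None
    with open(input_file_name) as input_file:
        for count, line in enumerate(input_file):
            # start a new "N.chunks" file every CHUNK_SIZE lines
            if count % CHUNK_SIZE == 0:
                if output_file:
                    output_file.close()
                output_file = open("%d.chunks" % (count // CHUNK_SIZE + 1), "w")
            output_file.write(line)
    if output_file:
        output_file.close()
```

    For 100,000 input lines this writes ``1.chunks`` through ``100.chunks``, each holding 1000 numbers.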


.. _manual.split.output_files:

=================
Output files
=================

    The *output* (second) parameter of **@split** usually contains a
    |glob|_ pattern like the ``*.chunks`` above.

    .. note::
        **Ruffus** is quite relaxed about the contents of the ``output`` parameter.
        Strings are treated as file names. Strings containing |glob|_ patterns are expanded.
        Other types are passed verbatim to the decorated task function.

    The files which match the |glob|_ will be passed as the actual parameters to the job
    function. Thus, the first time you run the example code, ``*.chunks`` will match nothing
    because no ``.chunks`` files have been created yet, resulting in the following call:

        ::

            step_4_split_numbers_into_chunks ("random_numbers.list", [])

    After that, ``*.chunks`` will match the list of current ``.chunks`` files created by
    the previous pipeline run.


    The file names in *output* are therefore generally out of date or superfluous. They are useful
    mainly for cleaning up detritus from previous runs
    (have a look at :ref:`step_4_split_numbers_into_chunks(...) <Simple_Tutorial_5th_step_code>`).

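    Because the matched file names from the previous run arrive in the *output* parameter, the splitting function can use that list for exactly this clean-up. A minimal sketch (the function name is hypothetical, and only the deletion step is shown):

```python
import os

def cleanup_then_split(input_file_name, output_files):
    # output_files holds whatever matched the output glob(s) from the
    # previous run: delete these stale files first, so that left-over
    # chunks from a larger earlier run cannot be mistaken for new output
    for stale_file_name in output_files:
        if os.path.exists(stale_file_name):
            os.unlink(stale_file_name)
    # ... then split input_file_name and write the fresh output files ...
```
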
    .. note ::

        It is important, nevertheless, to specify the list of *output* files correctly.
        Otherwise, dependent tasks will not know what files you have created, and it will
        not be possible to chain the *output* of this pipeline task automatically into the
        *inputs* of the next step.

        You can specify multiple |glob|_ patterns to match *all* the files which are the
        result of the splitting task function. These can even cover different directories,
        or groups of file names. This is a more extreme example:

            ::

                @split("input.file", ['a*.bits', 'b*.pieces', 'somewhere_else/c*.stuff'])
                def split_function (input_filename, output_files):
                    "Code to split up 'input.file'"


    The actual files resulting from this task function are not constrained by the file names
    in the *output* parameter of the function. The whole point of **@split** is that the number
    of resulting output files cannot be known beforehand, after all.

******************
Example
******************


    Suppose ``random_numbers.list`` can be split into four pieces; this function will create
    ``1.chunks``, ``2.chunks``, ``3.chunks`` and ``4.chunks``.

    Subsequently, we receive a larger ``random_numbers.list`` which should be split into 10
    pieces. If the pipeline is called again, the task function receives the following parameters:

        ::

            step_4_split_numbers_into_chunks("random_numbers.list",
                                             ["1.chunks",               #   previously created files
                                              "2.chunks",               #
                                              "3.chunks",               #
                                              "4.chunks" ])             #


    This doesn't stop the function from creating the extra ``5.chunks``, ``6.chunks`` and so on.

    .. note::

        Any task **@follow**\ ing and specifying
        ``step_4_split_numbers_into_chunks(...)`` as its *inputs* parameter is going to receive
        ``1.chunks``, ``...``, ``10.chunks`` and not merely the first four files.

        In other words, dependent / down-stream tasks which obtain output files automatically
        from the task decorated by **@split** receive the most current file list.
        The |glob|_ patterns will be matched again to see exactly what files the task function
        has created in reality *after* the task completes.
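
    As an aside, the per-chunk arithmetic that such a downstream task might perform, producing partial sums for each ``.chunks`` file and then combining them into the overall variance, is plain Python. This is an illustrative sketch, not the tutorial's actual code; it computes the population variance via ``var = E[x^2] - E[x]^2``:

```python
def chunk_sums(chunk_file_name):
    """Per-chunk partial results: count, sum, and sum of squares."""
    count, total, total_sq = 0, 0.0, 0.0
    with open(chunk_file_name) as chunk_file:
        for line in chunk_file:
            value = float(line)
            count += 1
            total += value
            total_sq += value * value
    return count, total, total_sq

def combined_variance(partial_results):
    """Combine per-chunk partial results into the overall variance."""
    n = sum(count for count, _, _ in partial_results)
    total = sum(t for _, t, _ in partial_results)
    total_sq = sum(sq for _, _, sq in partial_results)
    mean = total / n
    return total_sq / n - mean * mean
```

    Each chunk can be summarised in parallel; only the three partial sums need to be gathered at the end.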