.. include:: ../../global.inc
.. _Simple_Tutorial_2nd_step:

.. index::
    pair: @files; Tutorial

###################################################################
Step 2: Passing parameters to the pipeline
###################################################################

   * :ref:`Simple tutorial overview <Simple_Tutorial>`
   * :ref:`@files syntax in detail <decorators.files>`

.. note::
    Remember to look at the example code:

    * :ref:`Python Code for step 2 <Simple_Tutorial_2nd_step_code>`

***************************************
Overview
***************************************
    | The Python functions which do the actual work of each stage or
      :term:`task` of a **Ruffus** pipeline are written by you.
    | The role of **Ruffus** is to make sure these functions are called in the right order,
      with the right parameters, running in parallel using multiprocessing if desired.
    | This step of the tutorial explains how these pipeline parameters are specified.

    By letting **Ruffus** manage your pipeline parameters, you get the following features
    for free:

        #. only out-of-date parts of the pipeline will be re-run
        #. multiple jobs can be run in parallel (on different processors if possible)
        #. pipeline stages can be chained together automatically


    Let us start with the simplest case, where a pipeline stage consists of a single
    job with one *input*, one *output*, and an optional number of extra parameters:

************************************
*@files*
************************************
    The :ref:`@files <decorators.files>` decorator provides parameters to a task.


    .. note::

        All strings in the first two parameters are treated as *input* and *output* files (respectively).

    Let us provide *input* and *output* parameters to our new pipeline:

        .. image:: ../../images/simple_tutorial_files1.png

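    Written out in code, this might look like the following minimal sketch (the task
    body is an assumption for illustration; it simply creates the *output* file so
    that **Ruffus** can see the job has completed):

        ::

            from ruffus import *

            @files('task1.input', 'task1.output', 'optional_1.extra', 'optional_2.extra')
            def pipeline_task(input_file, output_file, extra_1, extra_2):
                # create the output file so its timestamp marks the job as up to date
                open(output_file, "w").close()
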

    This is exactly equivalent to the following function call:

        ::

            pipeline_task('task1.input', 'task1.output', 'optional_1.extra', 'optional_2.extra')


    and the output from **Ruffus** will thus be:

        .. image:: ../../images/simple_tutorial_files2.png

.. ::
            >>> pipeline_run([pipeline_task])

                Job = [task1.input -> task1.output, optional_1.extra, optional_2.extra] completed
            Completed Task = pipeline_task


************************************
Specifying jobs in parallel
************************************
    | Each :term:`task` function of the pipeline can also be a recipe or
      `rule <http://www.gnu.org/software/make/manual/make.html#Rule-Introduction>`_
      which can be applied at the same time to many different parameters.
    | For example, one can have a *compile task* which will compile any source code, or
      a *count_lines task* which will count the number of lines in any file.

    | In the example above, we produced a single output by supplying a single set of
      *input* and *output* parameters. You can use much the same syntax to apply the
      same recipe to *multiple* inputs at the same time.

    .. note ::

        Each time a separate set of parameters is forwarded to your task function,
        Ruffus calls this a :term:`job`. Each task can thus have many jobs which
        can be run in parallel.

    Instead of providing a single *input* and a single *output*, we are going to specify
    the parameters for *two* jobs at once:


    .. image:: ../../images/simple_tutorial_files3.png

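    The decorator parameters in the image might be written out as in this minimal
    sketch (the task body is an assumption for illustration: it prints the extra
    parameter and creates the *output* file; the *input* files are assumed to exist):

        ::

            from ruffus import *

            @files([['job1.stage1', 'job1.stage2', '    1st_job'],
                    ['job2.stage1', 'job2.stage2', '    2nd_job']])
            def second_task(input_file, output_file, extra_parameter):
                # each inner list of parameters above becomes one job
                print(extra_parameter)
                open(output_file, "w").close()
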
    To run this example, copy and paste the code :ref:`here <Simple_Tutorial_2nd_step_code>` into your Python interpreter.


    This is exactly equivalent to the following function calls:

        ::

            second_task('job1.stage1', 'job1.stage2', '    1st_job')
            second_task('job2.stage1', 'job2.stage2', '    2nd_job')

    The result of running this should look familiar:
        ::

            Start Task = second_task
                1st_job
                Job = [job1.stage1 -> job1.stage2,     1st_job] completed
                2nd_job
                Job = [job2.stage1 -> job2.stage2,     2nd_job] completed
            Completed Task = second_task

************************************
Multi-tasking
************************************

    Though the two jobs have been specified in parallel, **Ruffus** defaults to running
    each of them successively. With modern CPUs, it is often a lot faster to run parts
    of your pipeline in parallel, all at the same time.

    To do this, all you have to do is add a ``multiprocess`` parameter to ``pipeline_run``::

            >>> pipeline_run([second_task], multiprocess = 5)

    In this case, **Ruffus** will try to run up to 5 jobs at the same time. Since our
    second task only has two jobs, these will be started simultaneously.


************************************
Up-to-date jobs are not re-run
************************************

    | A job will be run only if the output file timestamps are out of date.
    | If you ran the same code a second time,

        ::

            >>> pipeline_run([second_task])


    | nothing would happen because
    | ``job1.stage2`` is more recent than ``job1.stage1`` and
    | ``job2.stage2`` is more recent than ``job2.stage1``.

    However, if you subsequently modified ``job1.stage1`` and re-ran the pipeline:
        ::

            # update the timestamp of job1.stage1 so it is newer than job1.stage2
            open("job1.stage1", "w").close()
            pipeline_run([second_task], verbose = 2, multiprocess = 5)


    You would see the following:
        .. image:: ../../images/simple_tutorial_files4.png

.. index::
    pair: input / output parameters; Tutorial

***************************************
*Input* and *output* data for each job
***************************************

    In the above examples, the *input* and *output* parameters are single file names. In a real
    computational pipeline, the task parameters could be all sorts of data, from
    lists of files, to numbers, sets or tuples. Ruffus imposes few constraints on what *you*
    would like to send to each stage of your pipeline.

    **Ruffus** will, however, look inside each of your *input* and *output* parameters
    for file names, which are used to check whether each job is up to date.

    If the *input* parameter contains a |glob|_ pattern,
    it will even be expanded to the matching file names.


    For example,

        | the *input* parameter for our task function might be all files which match the glob ``*.input``, plus the number ``2``
        | the *output* parameter could be a tuple nested inside a list: ``["task1.output1", ("task1.output2", "task1.output3")]``

    Running the following code:

        ::

            from ruffus import *

            @files(["*.input", 2], ["task1.output1", ("task1.output2", "task1.output3")])
            def pipeline_task(inputs, outputs):
                pass

            # make sure the input files are there
            open("task1a.input", "w").close()
            open("task1b.input", "w").close()

            pipeline_run([pipeline_task])

    will result in the following function call:

        ::

            pipeline_task(["task1a.input", "task1b.input", 2], ["task1.output1", ("task1.output2", "task1.output3")])


    and will give the following results:

        .. image:: ../../images/simple_tutorial_files5.png

        .. ::

          ::

            >>> pipeline_run([pipeline_task])

                Job = [[task1a.input, task1b.input, 2] -> [task1.output1, (task1.output2, task1.output3)]] completed
            Completed Task = pipeline_task


    The files
        ::

            "task1a.input"
            "task1b.input"

        and ::

            "task1.output1"
            "task1.output2"
            "task1.output3"

    will be used to check if the task is up to date. The number ``2`` is ignored for this purpose.

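    If you want to see which file names **Ruffus** has found in your parameters, and
    why it considers each job up to date or not, one option is to print out the
    pipeline instead of running it. A minimal sketch, using the ``pipeline_task``
    defined above (the exact output format may vary between **Ruffus** versions):

        ::

            import sys
            from ruffus import pipeline_printout

            # show which jobs would run and why, without executing anything;
            # higher verbose levels print more detail about parameters and files
            pipeline_printout(sys.stdout, [pipeline_task], verbose = 3)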