.. include:: ../../global.inc
.. _Simple_Tutorial_2nd_step:

.. index::
    pair: @files; Tutorial

###################################################################
Step 2: Passing parameters to the pipeline
###################################################################

* :ref:`Simple tutorial overview <Simple_Tutorial>`
* :ref:`@files syntax in detail <decorators.files>`

.. note::

    Remember to look at the example code:

    * :ref:`Python Code for step 2 <Simple_Tutorial_2nd_step_code>`

***************************************
Overview
***************************************

| The python functions which do the actual work of each stage or
  :term:`task` of a **Ruffus** pipeline are written by you.
| The role of **Ruffus** is to make sure these functions are called in the right order,
  with the right parameters, running in parallel using multiprocessing if desired.
| This step of the tutorial explains how these pipeline parameters are specified.

By letting **Ruffus** manage your pipeline parameters, you will get the following features
for free:

#. only out-of-date parts of the pipeline will be re-run
#. multiple jobs can be run in parallel (on different processors if possible)
#. pipeline stages can be chained together automatically

Let us start with the simplest case, where a pipeline stage consists of a single
job with one *input*, one *output*, and any number of optional extra parameters:

************************************
*@files*
************************************

The :ref:`@files <decorators.files>` decorator provides parameters to a task.

.. note::

    All strings in the first two parameters are treated as *input* and *output* files (respectively).

Let us provide an *input* and an *output* parameter to our new pipeline:

.. image:: ../../images/simple_tutorial_files1.png

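In code, the task in the image would look something like the sketch below. This is
only an illustration: the parameter names and the function body are ours, and the
real work of the task is up to you::

    from ruffus import *

    # create the input file first so the job has something to read
    open('task1.input', 'w').close()

    @files('task1.input', 'task1.output', 'optional_1.extra', 'optional_2.extra')
    def pipeline_task(input_file, output_file, extra_1, extra_2):
        # the real work of the stage goes here; this sketch just
        # creates the (empty) output file
        open(output_file, 'w').close()

    pipeline_run([pipeline_task])
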
This is exactly equivalent to the following function call:

::

    pipeline_task('task1.input', 'task1.output', 'optional_1.extra', 'optional_2.extra')

and the output from **Ruffus** will thus be:

.. image:: ../../images/simple_tutorial_files2.png

.. ::

    >>> pipeline_run([pipeline_task])
    Job = [task1.input -> task1.output, optional_1.extra, optional_2.extra] completed
    Completed Task = pipeline_task

************************************
Specifying jobs in parallel
************************************

| Each :term:`task` function of the pipeline can also be a recipe or
  `rule <http://www.gnu.org/software/make/manual/make.html#Rule-Introduction>`_
  which can be applied at the same time to many different parameters.
| For example, one can have a *compile task* which will compile any source code, or
  a *count_lines task* which will count the number of lines in any file.

| In the example above, we produced a single output by supplying a single *input* parameter.
| You can use much the same syntax to apply the same recipe to *multiple* inputs at
  the same time.

.. note::

    Each time a separate set of parameters is forwarded to your task function,
    **Ruffus** calls this a :term:`job`. Each task can thus have many jobs, which
    can be run in parallel.

Instead of providing a single *input* and a single *output*, we are going to specify
the parameters for *two* jobs at once:

.. image:: ../../images/simple_tutorial_files3.png

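The task in the image presumably looks something like the following sketch, where
each inner list supplies the parameters for one job (again, the function body is
only illustrative)::

    from ruffus import *

    # create the two input files so each job has something to read
    open('job1.stage1', 'w').close()
    open('job2.stage1', 'w').close()

    @files([['job1.stage1', 'job1.stage2', ' 1st_job'],
            ['job2.stage1', 'job2.stage2', ' 2nd_job']])
    def second_task(input_file, output_file, extra):
        # print the extra parameter and create the (empty) output file
        print(extra)
        open(output_file, 'w').close()
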
To run this example, copy and paste the code :ref:`here <Simple_Tutorial_2nd_step_code>` into your python interpreter.

This is exactly equivalent to the following function calls:

::

    second_task('job1.stage1', 'job1.stage2', ' 1st_job')
    second_task('job2.stage1', 'job2.stage2', ' 2nd_job')

The result of running this should look familiar:

::

    Start Task = second_task
    1st_job
    Job = [job1.stage1 -> job1.stage2, 1st_job] completed
    2nd_job
    Job = [job2.stage1 -> job2.stage2, 2nd_job] completed
    Completed Task = second_task

************************************
Multi-tasking
************************************

Though the two jobs have been specified in parallel, **Ruffus** defaults to running
them one after another. With modern CPUs, it is often a lot faster to run parts
of your pipeline in parallel, all at the same time.

To do this, all you have to do is to add a ``multiprocess`` parameter to ``pipeline_run``::

    >>> pipeline_run([second_task], multiprocess = 5)

In this case, **Ruffus** will try to run up to 5 jobs at the same time. Since our second
task only has two jobs, these will be started simultaneously.

************************************
Up-to-date jobs are not re-run
************************************

| A job will be run only if the output file timestamps are out of date.
| If you ran the same code a second time:

::

    >>> pipeline_run([second_task])

nothing would happen, because

| ``job1.stage2`` is more recent than ``job1.stage1`` and
| ``job2.stage2`` is more recent than ``job2.stage1``.

However, if you subsequently modified ``job1.stage1`` and re-ran the pipeline:

::

    open("job1.stage1", "w").close()
    pipeline_run([second_task], verbose = 2, multiprocess = 5)

you would see the following:

.. image:: ../../images/simple_tutorial_files4.png

.. index::
    pair: input / output parameters; Tutorial

***************************************
*Input* and *output* data for each job
***************************************

In the above examples, the *input* and *output* parameters are single file names. In a real
computational pipeline, the task parameters could be all sorts of data, from
lists of files, to numbers, sets or tuples. **Ruffus** imposes few constraints on what *you*
would like to send to each stage of your pipeline.

**Ruffus** will, however, look inside each of your *input* and *output* parameters
to see if they contain any file names, which it uses to check whether each job is up to date.
If the *input* parameter contains a |glob|_ pattern,
it will even be expanded to the matching file names.

For example,

| the *input* parameter for our task function might be all files which match the glob ``*.input``, plus the number ``2``
| the *output* parameter could be a tuple nested inside a list: ``["task1.output1", ("task1.output2", "task1.output3")]``

Running the following code:

::

    from ruffus import *

    @files(["*.input", 2], ["task1.output1", ("task1.output2", "task1.output3")])
    def pipeline_task(inputs, outputs):
        pass

    # make sure the input files are there
    open("task1a.input", "w").close()
    open("task1b.input", "w").close()

    pipeline_run([pipeline_task])

will result in the following function call:

::

    pipeline_task(["task1a.input", "task1b.input", 2], ["task1.output1", ("task1.output2", "task1.output3")])

and will give the following results:

.. image:: ../../images/simple_tutorial_files5.png

.. ::

    >>> pipeline_run([pipeline_task])
    Job = [[task1a.input, task1b.input, 2] -> [task1.output1, (task1.output2, task1.output3)]] completed
    Completed Task = pipeline_task

The files:

::

    "task1a.input"
    "task1b.input"

and:

::

    "task1.output1"
    "task1.output2"
    "task1.output3"

will be used to check if the task is up to date. The number ``2`` is ignored for this purpose.
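
If you want to see which jobs **Ruffus** considers out of date, without actually
running anything, you can print out the pipeline instead. A minimal sketch, assuming
the ``pipeline_task`` defined above::

    import sys
    from ruffus import *

    # pipeline_printout lists each job and whether it is up to date,
    # without running the pipeline; higher verbose values give more detail
    pipeline_printout(sys.stdout, [pipeline_task], verbose = 3)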