.. include:: ../../global.inc
.. include:: chapter_numbers.inc

.. _manual.split:

###################################################################################
|manual.split.chapter_num|: `Splitting up large tasks / files with` **@split**
###################################################################################

.. hlist::

    * :ref:`Manual overview <manual>`
    * :ref:`@split <decorators.split>` syntax in detail

A common requirement in computational pipelines is to split up a large task into
small jobs which can be run on different processors (or sent to a computational
cluster). Very often, the number of jobs depends dynamically on the size of the
task, and cannot be known for sure beforehand.

*Ruffus* uses the :ref:`@split <decorators.split>` decorator to indicate that
the :term:`task` function will produce an indeterminate number of output files.

.. index::
    pair: @split; Manual

=================
**@split**
=================

This example is borrowed from :ref:`step 5 <Simple_Tutorial_5th_step>` of the simple tutorial.

.. note:: See :ref:`accompanying Python Code <Simple_Tutorial_5th_step_code>`

**************************************************************************************
Splitting up a long list of random numbers to calculate their variance
**************************************************************************************

.. csv-table::
    :widths: 1,99
    :class: borderless

    ".. centered::
        Step 5 from the tutorial:

    .. image:: ../../images/simple_tutorial_step5_sans_key.png", "
        Suppose we had a list of 100,000 random numbers in the file ``random_numbers.list``:

            ::

                import random

                NUMBER_OF_RANDOMS = 100000
                # Write one random number between 0 and 100 per line
                with open('random_numbers.list', 'w') as f:
                    for i in range(NUMBER_OF_RANDOMS):
                        f.write('%g\n' % (random.random() * 100.0))

        We might want to calculate the sample variance more quickly by splitting them
        into ``NNN`` parcels of 1000 numbers each and working on them in parallel.
        In this case we know that ``NNN == 100``, but usually the number of resulting files
        is only apparent after we have finished processing our starting file."

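Working on the parcels in parallel is possible because each parcel only has to report three numbers: its count, its sum and its sum of squares. The following is a minimal sketch of that bookkeeping, assuming plain Python lists stand in for the parcel files. The helper names ``summarise_chunk`` and ``combined_variance`` are hypothetical and are not part of *Ruffus* or the tutorial code.

```python
import random
import statistics


def summarise_chunk(numbers):
    """Return (count, sum, sum of squares) for one parcel of numbers."""
    return len(numbers), sum(numbers), sum(x * x for x in numbers)


def combined_variance(summaries):
    """Combine per-parcel summaries into the sample variance of all numbers."""
    count = sum(n for n, _, _ in summaries)
    total = sum(s for _, s, _ in summaries)
    total_sq = sum(sq for _, _, sq in summaries)
    # Sample variance: (sum(x^2) - (sum(x))^2 / n) / (n - 1)
    return (total_sq - total * total / count) / (count - 1)


random.seed(0)
numbers = [random.random() * 100.0 for _ in range(10000)]

# Split into parcels of 1000 numbers each, as in the text.
parcels = [numbers[i:i + 1000] for i in range(0, len(numbers), 1000)]

# Each call to summarise_chunk() is independent and could run as its own job.
variance = combined_variance([summarise_chunk(p) for p in parcels])
```

The sum-of-squares formula is used here only because it is easy to combine across parcels; it can lose precision for data with a large mean relative to its spread.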
Our pipeline function needs to take the random numbers file ``random_numbers.list``,
read the random numbers from it, and write to a new file every 1000 lines.

The *Ruffus* decorator :ref:`@split<decorators.split>` is designed specifically for
splitting up *inputs* into an indeterminate ``NNN`` number of *outputs*:

.. image:: ../../images/simple_tutorial_split.png

::

    from ruffus import split

    @split("random_numbers.list", "*.chunks")
    def step_4_split_numbers_into_chunks(input_file_name, output_files):
        """code goes here"""

Ruffus will set:

    | ``input_file_name`` to ``"random_numbers.list"``
    | ``output_files`` to all files which match ``*.chunks`` (e.g. ``"1.chunks"``, ``"2.chunks"`` etc.).

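The body of such a task function might look like the sketch below. It is written as an ordinary function and called by hand, so that it runs without *Ruffus* installed. The name ``split_numbers_into_chunks``, the ``chunk_size`` parameter and the return value are assumptions made for this illustration, not the tutorial's actual code.

```python
import glob
import os
import tempfile


def split_numbers_into_chunks(input_file_name, output_files, chunk_size=1000):
    """Write every `chunk_size` lines of the input to a new N.chunks file."""
    # output_files holds matches left over from a previous run: remove them.
    for stale_file in output_files:
        os.unlink(stale_file)

    output_dir = os.path.dirname(input_file_name)
    chunk_file = None
    chunk_count = 0
    with open(input_file_name) as input_file:
        for line_number, line in enumerate(input_file):
            # Start a new chunk file every chunk_size lines.
            if line_number % chunk_size == 0:
                if chunk_file is not None:
                    chunk_file.close()
                chunk_count += 1
                chunk_path = os.path.join(output_dir, "%d.chunks" % chunk_count)
                chunk_file = open(chunk_path, "w")
            chunk_file.write(line)
    if chunk_file is not None:
        chunk_file.close()
    # Returned only so the demonstration below can inspect it.
    return chunk_count


# Demonstration in a temporary directory, passing the two parameters by hand.
work_dir = tempfile.mkdtemp()
numbers_file = os.path.join(work_dir, "random_numbers.list")
with open(numbers_file, "w") as f:
    for i in range(2500):
        f.write("%d\n" % i)

stale = glob.glob(os.path.join(work_dir, "*.chunks"))  # nothing exists yet
chunks_written = split_numbers_into_chunks(numbers_file, stale)
created = sorted(
    os.path.basename(p) for p in glob.glob(os.path.join(work_dir, "*.chunks"))
)
```

With 2500 input lines and the default ``chunk_size`` of 1000, the last chunk file is shorter than the others, which is why the number of outputs depends on the input.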
.. _manual.split.output_files:

=================
Output files
=================

The *output* (second) parameter of **@split** usually contains a
|glob|_ pattern like the ``*.chunks`` above.

.. note::

    **Ruffus** is quite relaxed about the contents of the ``output`` parameter.
    Strings are treated as file names. Strings containing |glob|_ patterns are expanded.
    Other types are passed verbatim to the decorated task function.

The files which match the |glob|_ will be passed as the actual parameters to the job
function. Thus, the first time you run the example code, ``*.chunks`` will match nothing because
no ``.chunks`` files have been created yet, and the job function is called with an empty list:

::

    step_4_split_numbers_into_chunks("random_numbers.list", [])

After that, ``*.chunks`` will match the list of current ``.chunks`` files created by
the previous pipeline run.

The file names passed in *output* are therefore generally out of date or superfluous.
They are useful mainly for cleaning up detritus from previous runs
(have a look at :ref:`step_4_split_numbers_into_chunks(...) <Simple_Tutorial_5th_step_code>`).

.. note::

    It is important, nevertheless, to specify the list of *output* files correctly.
    Otherwise, dependent tasks will not know what files you have created, and it will
    not be possible to chain the *output* of this pipeline task automatically into the
    *inputs* of the next step.

You can specify multiple |glob|_ patterns to match *all* the files which are the
result of the splitting task function. These can even cover different directories,
or groups of file names. This is a more extreme example:

::

    @split("input.file", ['a*.bits', 'b*.pieces', 'somewhere_else/c*.stuff'])
    def split_function(input_filename, output_files):
        "Code to split up 'input.file'"

The actual resulting files of this task function are not constrained by the file names
in the *output* parameter of the function. The whole point of **@split** is that the number
of resulting output files cannot be known beforehand, after all.

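To make the effect of several patterns concrete, the sketch below expands a list of |glob|_ patterns into one combined file list using Python's standard ``glob`` module. The helper ``expand_patterns`` and the file names created here are hypothetical. How *Ruffus* performs the expansion internally is not shown in this manual and may differ.

```python
import glob
import os
import tempfile


def expand_patterns(patterns, root):
    """Return the sorted, relative paths under `root` matching any pattern."""
    matches = []
    for pattern in patterns:
        for path in glob.glob(os.path.join(root, pattern)):
            matches.append(os.path.relpath(path, root))
    return sorted(matches)


# Create a few empty files, one of which matches none of the patterns.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "somewhere_else"))
file_names = (
    "a1.bits",
    "b1.pieces",
    os.path.join("somewhere_else", "c1.stuff"),
    "unrelated.txt",
)
for name in file_names:
    with open(os.path.join(root, name), "w") as f:
        f.write("")

patterns = ["a*.bits", "b*.pieces", "somewhere_else/c*.stuff"]
output_files = expand_patterns(patterns, root)
```

Only files matching one of the three patterns end up in ``output_files``; anything else the function writes is invisible to dependent tasks.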
******************
Example
******************

Suppose ``random_numbers.list`` can be split into four pieces. This function will then create
``1.chunks``, ``2.chunks``, ``3.chunks`` and ``4.chunks``.

Subsequently, we receive a larger ``random_numbers.list`` which should be split into 10
pieces. If the pipeline is called again, the task function receives the following parameters:

::

    step_4_split_numbers_into_chunks("random_numbers.list",
                                     ["1.chunks",     # previously created files
                                      "2.chunks",     #
                                      "3.chunks",     #
                                      "4.chunks"])    #

This doesn't stop the function from creating the extra ``5.chunks``, ``6.chunks`` etc.

.. note::

    Any task **@follow**\ ing ``step_4_split_numbers_into_chunks(...)`` and specifying
    it as its *inputs* parameter will receive
    ``1.chunks``, ``...``, ``10.chunks`` and not merely the first four files.

    In other words, dependent / down-stream tasks which obtain output files automatically
    from the task decorated by **@split** receive the most current file list.
    The |glob|_ patterns are matched again *after* the task completes, to see exactly
    what files the task function has actually created.

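The difference between the stale list handed to the split task and the fresh list seen down-stream can be imitated with the standard ``glob`` module. This is only a sketch of the behaviour described above: ``write_chunks`` is a hypothetical stand-in for the split task, and the second ``glob.glob()`` call plays the role of the re-matching done after the task completes.

```python
import glob
import os
import tempfile

work_dir = tempfile.mkdtemp()
pattern = os.path.join(work_dir, "*.chunks")


def write_chunks(count):
    """Stand-in for the split task: create 1.chunks .. <count>.chunks."""
    for i in range(1, count + 1):
        with open(os.path.join(work_dir, "%d.chunks" % i), "w") as f:
            f.write("%d\n" % i)


write_chunks(4)               # earlier run: the input split into four pieces
before = glob.glob(pattern)   # stale list the split task is handed this time
write_chunks(10)              # this run: the larger input needs ten pieces
after = glob.glob(pattern)    # list matched again once the task has finished
```

Here ``before`` holds the four old file names while ``after`` holds all ten, which is the list a dependent task would need.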