/doc/design.rst

  1. .. Design:
  2. .. include:: global.inc
  3. .. index::
  4. pair: Design; Ruffus
  5. ###############################
  6. Design & Architecture
  7. ###############################
  8. The *ruffus* module has the following design goals:
  9. * Simple
  10. * Intuitive
  11. * Lightweight
  12. * Unintrusive
  13. * Flexible/Powerful
  14. Computational pipelines, especially in science, are best thought of in terms of data
  15. flowing through successive, dependent stages (**ruffus** calls these :term:`task`\ s).
  16. Traditionally, files have been used to
  17. link pipelined stages together. This means that computational pipelines can be managed
  18. using traditional software construction (`build`) systems.
  19. =================================================
  20. `GNU Make`
  21. =================================================
  22. The grand-daddy of these is UNIX `make <http://en.wikipedia.org/wiki/Make_(software)>`_.
  23. `GNU make <http://www.gnu.org/software/make/>`_ is ubiquitous in the Linux world for
  24. installing and compiling software.
  25. It has been widely used to build computational pipelines because it supports:
  26. * Stopping and restarting computational processes
  27. * Running multiple, even thousands of jobs in parallel
  28. .. _design.make_syntax_ugly:
  29. ******************************************************
  30. Deficiencies of `make` / `gmake`
  31. ******************************************************
  32. However, make and `GNU make <http://www.gnu.org/software/make/>`_ use a specialised (domain-specific)
  33. language, which has been much criticised for its poor support for modern
  34. programming language features, such as variable scope, pattern matching and debugging.
  35. Make scripts require large amounts of often obscure shell scripting
  36. and makefiles can quickly become unmaintainable.
  37. .. _design.scons_and_rake:
  38. =================================================
  39. `Scons`, `Rake` and other `Make` alternatives
  40. =================================================
  41. Many attempts have been made to produce a more modern version of make, with less of its
  42. historical baggage. These include the Java-based `Apache ant <http://ant.apache.org/>`_ which is specified in XML.
  43. More interesting is a new breed of build systems whose scripts are written in modern programming
  44. languages, rather than a specially-invented "build" specification syntax.
  45. These include the Python `scons <http://www.scons.org/>`_, Ruby `rake <http://rake.rubyforge.org/>`_ and
  46. its Python port `Smithy <http://packages.python.org/Smithy/>`_.
  47. The great advantages are that computation pipelines do not need to be artificially parcelled out
  48. between (the often second-class) workflow management code, and the logic which does the real computation
  49. in the pipeline. It also means that workflow management can use all the standard language and library
  50. features, for example, to read in directories, match file names using regular expressions and so on.
  51. **Ruffus** is much like scons in that the modern dynamic programming language Python is used seamlessly
  52. throughout its pipeline scripts.
  53. .. _design.implicit_dependencies:
  54. **************************************************************************
  55. Implicit dependencies: disadvantages of `make` / `scons` / `rake`
  56. **************************************************************************
  57. Although Python `scons <http://www.scons.org/>`_ and Ruby `rake <http://rake.rubyforge.org/>`_
  58. are in many ways more powerful and easier to use for building software, they are still an
  59. imperfect fit to the world of computational pipelines.
  60. This is a result of the way dependencies are specified, an essential part of their design inherited
  61. from `GNU make <http://www.gnu.org/software/make/>`_.
  62. The order of operations in all of these tools is specified in a *declarative* rather than
  63. *imperative* manner. This means that the sequence of steps that a build should take is
  64. not spelled out explicitly and directly. Instead, recipes are provided for turning input files
  65. of one type into another.
  66. So, for example, knowing that ``a->b``, ``b->c``, ``c->d``, the build
  67. system can infer how to get from ``a`` to ``d`` by performing the necessary operations in the correct order.
  68. This is immensely powerful for three reasons:
  69. #) The plumbing, such as dependency checking and passing output
  70. from one stage to another, is handled automatically by the build system. (This is the whole point!)
  71. #) The same *recipe* can be re-used at different points in the build.
  72. #) | Intermediate files do not need to be retained.
  73. | Given the automatic inference that ``a->b->c->d``,
  74. we don't need to keep ``b`` and ``c`` files around once ``d`` has been produced.
  75. |
  76. The disadvantage is that because stages are specified only indirectly, in terms of
  77. file name matches, the flow through a complex build or a pipeline can be difficult to trace, and nigh
  78. impossible to debug when there are problems.
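
The inference described above can be sketched in a few lines of Python. This is only an illustration of the declarative idea, not how make, scons or rake are implemented; the ``RULES`` table and the ``infer_steps`` function are hypothetical names invented for this example.

```python
# Hypothetical rule table: each file type maps to the
# (input type, recipe name) pair that produces it.
RULES = {
    "b": ("a", "a_to_b"),
    "c": ("b", "b_to_c"),
    "d": ("c", "c_to_d"),
}


def infer_steps(target, available):
    """Work backward from the target type until an available type is
    reached, then return the recipe names in the order they must run."""
    steps = []
    current = target
    while current not in available:
        if current not in RULES:
            raise ValueError(f"no rule produces {current!r}")
        source, recipe = RULES[current]
        steps.append(recipe)
        current = source
    steps.reverse()
    return steps


plan = infer_steps("d", available={"a"})
print(plan)  # ['a_to_b', 'b_to_c', 'c_to_d']
```

Note that nothing in ``RULES`` states the order of the steps. The order only emerges from matching types, which is why the flow can be hard to trace in a large build.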
  79. .. _design.explicit_dependencies_in_ruffus:
  80. **************************************************************************
  81. Explicit dependencies in `Ruffus`
  82. **************************************************************************
  83. **Ruffus** takes a different approach. The order of operations is specified explicitly rather than inferred
  84. indirectly from the input and output types. So, for example, we would explicitly specify three successive and
  85. linked operations ``a->b``, ``b->c``, ``c->d``. The build system knows that the operations always proceed in
  86. this order.
  87. Looking at a **Ruffus** script, it is always immediately clear which succession of computational steps
  88. will be taken.
  89. **Ruffus** values clarity over syntactic cleverness.
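
As a rough contrast, explicit ordering might look like the plain Python sketch below. It deliberately does not use the Ruffus API; ``change_suffix``, ``PIPELINE`` and ``run_pipeline`` are hypothetical names, and each stage simply renames a file suffix to stand in for real work.

```python
def change_suffix(name, old, new):
    """Replace a trailing suffix, failing loudly if the input is the wrong type."""
    if not name.endswith(old):
        raise ValueError(f"{name!r} does not end with {old!r}")
    return name[: -len(old)] + new


def a_to_b(name):
    return change_suffix(name, ".a", ".b")


def b_to_c(name):
    return change_suffix(name, ".b", ".c")


def c_to_d(name):
    return change_suffix(name, ".c", ".d")


# The order of operations is written down explicitly, not inferred.
PIPELINE = [a_to_b, b_to_c, c_to_d]


def run_pipeline(stages, name):
    """Feed the output of each stage into the next, in the order given."""
    for stage in stages:
        name = stage(name)
    return name


result = run_pipeline(PIPELINE, "sample.a")
print(result)  # sample.d
```

Reading ``PIPELINE`` is enough to know what will run and in which order. In this sketch, listing the stages in the wrong order fails immediately with a ``ValueError``.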
  90. .. _design.static_dependencies:
  91. **************************************************************************
  92. Static dependencies: What `make` / `scons` / `rake` can't do (easily)
  93. **************************************************************************
  94. `GNU make <http://www.gnu.org/software/make/>`_, `scons <http://www.scons.org/>`_ and `rake <http://rake.rubyforge.org/>`_
  95. work by inferring a static dependency graph (a directed acyclic graph) between all the files which
  96. are used by a computational pipeline. These tools locate the target that they are supposed
  97. to build and work backward through the dependency graph from that target,
  98. rebuilding anything that is out of date. This is perfect for building software,
  99. where the list of files can be computed **statically** at the beginning of the build.
  100. This is not an ideal match for scientific computational pipelines because:
  101. * | Though the *stages* of a pipeline (i.e. `compile` or `DNA alignment`) are
  102. invariably well-specified in advance, the number of
  103. operations (*job*\s) involved at each stage may not be.
  104. |
  105. * | A common approach is to break up large data sets into manageable chunks which
  106. can be operated on in parallel in computational clusters or farms
  107. (See `embarrassingly parallel problems <http://en.wikipedia.org/wiki/Embarrassingly_parallel>`_).
  108. | This means that the number of parallel operations or jobs varies with the data (the number of manageable chunks),
  109. and dependency trees cannot be calculated statically beforehand.
  110. |
  111. Computational pipelines require **dynamic** dependencies which are not calculated up-front, but
  112. at each stage of the pipeline.
  113. This is a *known* issue with traditional build systems, each of which has partial strategies to work around
  114. this problem:
  115. * gmake always builds the dependencies when first invoked, so dynamic dependencies require (complex!) recursive calls to gmake
  116. * `Rake dependencies unknown prior to running tasks <http://objectmix.com/ruby/759716-rake-dependencies-unknown-prior-running-tasks-2.html>`_.
  117. * `Scons: Using a Source Generator to Add Targets Dynamically <http://www.scons.org/wiki/DynamicSourceGenerator>`_
  118. **Ruffus** explicitly and straightforwardly handles tasks which produce an indeterminate (i.e. runtime dependent)
  119. number of outputs, using its **@split**, **@transform** and **@merge** function annotations.
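
The split / transform / merge pattern can be illustrated without Ruffus itself. The sketch below is plain Python under simplifying assumptions: data lives in memory rather than in files, and ``split``, ``transform``, ``merge`` and ``run`` are hypothetical stand-ins, not the Ruffus decorators of similar names. The point is that the number of jobs is only known once the split stage has run.

```python
def split(records, chunk_size):
    """Break the data into chunks; how many depends on the data itself."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def transform(chunk):
    """One job per chunk (here: a partial sum)."""
    return sum(chunk)


def merge(partials):
    """Combine the per-chunk results into a single output."""
    return sum(partials)


def run(records, chunk_size=3):
    chunks = split(records, chunk_size)          # job count discovered at run time
    partials = [transform(c) for c in chunks]    # independent jobs, could run in parallel
    return len(chunks), merge(partials)


print(run(list(range(10))))  # (4, 45)
```

A static dependency graph would have to know the ``4`` in advance. Here it falls out of the data when the pipeline reaches that stage.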
  120. =============================================================================
  121. Managing pipelines stage-by-stage using **Ruffus**
  122. =============================================================================
  123. **Ruffus** manages pipeline stages directly.
  124. #) | The computational operations for each stage of the pipeline are written by you, in
  125. separate Python functions.
  126. | (These correspond to `gmake pattern rules <http://www.gnu.org/software/make/manual/make.html#Pattern-Rules>`_)
  127. |
  128. #) | The dependencies between pipeline stages (Python functions) are specified up-front.
  129. | These can be displayed as a flow chart.
  130. .. image:: images/front_page_flowchart.png
  131. #) **Ruffus** makes sure pipeline stage functions are called in the right order,
  132. with the right parameters, running in parallel using multiprocessing if necessary.
  133. #) Data file timestamps can be used to automatically determine if all or any parts
  134. of the pipeline are out-of-date and need to be rerun.
  135. #) Separate pipeline stages, and operations within each pipeline stage,
  136. can be run in parallel provided they are not inter-dependent.
  137. Another way of looking at this is that **ruffus** re-constructs datafile dependencies dynamically
  138. on-the-fly when it gets to each stage of the pipeline, giving much more flexibility.
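
The timestamp check mentioned in the list above can be sketched as follows. This is a minimal illustration using only the standard library, not the actual Ruffus implementation; ``needs_update`` is a hypothetical helper, and it assumes that file modification times are a reliable signal.

```python
import os


def needs_update(input_paths, output_paths):
    """Return True if any output is missing, or if the oldest output
    is older than the newest input."""
    if not output_paths or not all(os.path.exists(p) for p in output_paths):
        return True
    if not input_paths:
        return False
    newest_input = max(os.path.getmtime(p) for p in input_paths)
    oldest_output = min(os.path.getmtime(p) for p in output_paths)
    return oldest_output < newest_input
```

A stage whose outputs are all newer than its inputs can be skipped, which is what allows a pipeline to be stopped and restarted without repeating finished work.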
  139. **************************************************************************
  140. Disadvantages of the Ruffus design
  141. **************************************************************************
  142. Are there any disadvantages to this trade-off for additional clarity?
  143. #) Each pipeline stage needs to take the right input and output. For example if we specified the
  144. steps in the wrong order: ``a->b``, ``c->d``, ``b->c``, then no useful output would be produced.
  145. #) We cannot re-use the same recipes in different parts of the pipeline.
  146. #) Intermediate files need to be retained.
  147. In our experience, it is always obvious when pipeline operations are in the wrong order, precisely because the
  148. order of computation is the very essence of the design of each pipeline. Ruffus produces extra diagnostics when
  149. no output is created in a pipeline stage (this usually happens with incorrectly specified regular expressions).
  150. Re-use of recipes is as simple as an extra call to common function code.
  151. Finally, some users have proposed future enhancements to **Ruffus** to handle unnecessary temporary / intermediate files.
  152. .. index::
  153. pair: Design; Comparison of Ruffus with alternatives
  154. =================================================
  155. Alternatives to **Ruffus**
  156. =================================================
  157. A comparison of more make-like tools is available from `Ian Holmes' group <http://biowiki.org/MakeComparison>`_.
  158. Build systems include:
  159. * `GNU make <http://www.gnu.org/software/make/>`_
  160. * `scons <http://www.scons.org/>`_
  161. * `ant <http://ant.apache.org/>`_
  162. * `rake <http://rake.rubyforge.org/>`_
  163. There are also complete workload management systems such as Condor.
  164. Various bioinformatics pipelines are also available, including that used by the
  165. leading genome annotation website Ensembl, Pegasys, GPIPE, Taverna, Wildfire, MOWserv,
  166. Triana, Cyrille2 etc. These are all either hardwired to specific databases and tasks,
  167. or have steep learning curves for both the scientist/developer and the IT system
  168. administrators.
  169. **Ruffus** is designed to be lightweight and unintrusive enough to use for writing pipelines
  170. with just 10 lines of code.
  171. .. seealso::
  172. **Bioinformatics workload management systems**
  173. Condor:
  174. http://www.cs.wisc.edu/condor/description.html
  175. Ensembl Analysis pipeline:
  176. http://www.ncbi.nlm.nih.gov/pubmed/15123589
  177. Pegasys:
  178. http://www.ncbi.nlm.nih.gov/pubmed/15096276
  179. GPIPE:
  180. http://www.biomedcentral.com/pubmed/15096276
  181. Taverna:
  182. http://www.ncbi.nlm.nih.gov/pubmed/15201187
  183. Wildfire:
  184. http://www.biomedcentral.com/pubmed/15788106
  185. MOWserv:
  186. http://www.biomedcentral.com/pubmed/16257987
  187. Triana:
  188. http://dx.doi.org/10.1007/s10723-005-9007-3
  189. Cyrille2:
  190. http://www.biomedcentral.com/1471-2105/9/96
  191. .. index::
  192. single: Acknowledgements
  193. **************************************************
  194. Acknowledgements
  195. **************************************************
  196. * Bruce Eckel's insightful article on
  197. `A Decorator Based Build System <http://www.artima.com/weblogs/viewpost.jsp?thread=241209>`_
  198. was the obvious inspiration for the use of decorators in *Ruffus*.
  199. The rest of *Ruffus* takes a different approach. In particular:
  200. #. *Ruffus* uses task-based not file-based dependencies
  201. #. *Ruffus* tries to have minimal impact on the functions it decorates.
  202. Bruce Eckel's design wraps functions in "rule" objects.
  203. *Ruffus* tasks are added as attributes of the functions, which can still be
  204. called normally. This is how *Ruffus* decorators can be layered in any order
  205. onto the same task.
  206. * Languages like C++ and Java would probably use a "mixin" approach.
  207. Python's easy support for reflection and function references,
  208. as well as the necessity of marshalling over process boundaries, dictated the
  209. internal architecture of *Ruffus*.
  210. * The `Boost Graph library <http://www.boost.org>`_ for text book implementations of directed
  211. graph traversals.
  212. * `Graphviz <http://www.graphviz.org/>`_. Just works. Wonderful.
  213. * Andreas Heger, Christoffer Nellåker and Grant Belgard for driving Ruffus towards
  214. ever simpler syntax.
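
The idea noted above, of attaching task information to a function as an attribute instead of wrapping it, can be sketched like this. It is an assumption-laden illustration, not Ruffus's internal code: the ``depends_on`` decorator and the ``pipeline_upstream`` attribute are hypothetical names.

```python
def depends_on(*upstream):
    """Hypothetical decorator: record upstream task names on the function
    and return the function itself, unwrapped."""
    def decorate(func):
        recorded = getattr(func, "pipeline_upstream", [])
        func.pipeline_upstream = recorded + [u.__name__ for u in upstream]
        return func
    return decorate


def load():
    return [3, 1, 2]


def check():
    return True


# Decorators can be stacked because each one only adds to the attribute.
@depends_on(load)
@depends_on(check)
def sort_numbers(numbers):
    return sorted(numbers)


print(sort_numbers([3, 1, 2]))                 # [1, 2, 3]
print(sorted(sort_numbers.pipeline_upstream))  # ['check', 'load']
```

Because the decorated function is returned unchanged, it can still be called directly, for example from a unit test, without going through any pipeline machinery.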
  215. .. index::
  216. pair: Ruffus; Etymology
  217. pair: Ruffus; Name origins
  218. **************************************************
  219. Whence the name *Ruffus*?
  220. **************************************************
  221. .. image:: images/wikimedia_cyl_ruffus.jpg
  222. **Cylindrophis ruffus** is the name of the
  223. `red-tailed pipe snake <http://en.wikipedia.org/wiki/Cylindrophis_ruffus>`_ (bad python-y pun)
  224. which can be found in `Hong Kong <http://www.discoverhongkong.com/eng/index.html>`_ where the original author comes from.
  225. Be careful not to step on one when running down country park lanes at full speed
  226. in Hong Kong: this snake is a `rare breed <http://www.hkras.org/eng/info/hkspp.htm>`_!
  227. *Ruffus* is a shy creature, and pretends to be a cobra by putting up its red tail and ducking its
  228. head in its coils when startled. It is not venomous and is
  229. `Mostly Harmless <http://en.wikipedia.org/wiki/Mostly_Harmless>`_.
  230. *Ruffus* does most of its work at night and sleeps during the day:
  231. typical of many (but alas not all) Python programmers!
  232. The original image is from `wikimedia <http://upload.wikimedia.org/wikipedia/commons/a/a1/Cyl_ruffus_061212_2025_tdp.jpg>`_