.. Design:

.. include:: global.inc

.. index:: 
    pair: Design; Ruffus

###############################
Design & Architecture
###############################

    The *ruffus* module has the following design goals:

        * Simple
        * Intuitive
        * Lightweight
        * Unintrusive
        * Flexible/Powerful


    Computational pipelines, especially in science, are best thought of in terms of data
    flowing through successive, dependent stages (**ruffus** calls these :term:`task`\ s).
    Traditionally, files have been used to
    link pipelined stages together. This means that computational pipelines can be managed
    using traditional software construction (`build`) systems.

=================================================
`GNU Make`
=================================================
    The grand-daddy of these is UNIX `make <http://en.wikipedia.org/wiki/Make_(software)>`_.
    `GNU make <http://www.gnu.org/software/make/>`_ is ubiquitous in the Linux world for
    installing and compiling software.
    It has been widely used to build computational pipelines because it supports:

    * Stopping and restarting computational processes
    * Running multiple, even thousands, of jobs in parallel

.. _design.make_syntax_ugly:

******************************************************
Deficiencies of `make` / `gmake`
******************************************************

    However, make and `GNU make <http://www.gnu.org/software/make/>`_ use a specialised (domain-specific)
    language, which has been much criticised for its poor support for modern
    programming language features such as variable scope, pattern matching and debugging.
    Make scripts require large amounts of often obscure shell scripting,
    and makefiles can quickly become unmaintainable.

.. _design.scons_and_rake:

=================================================
`Scons`, `Rake` and other `Make` alternatives
=================================================

    Many attempts have been made to produce a more modern version of make, with less of its
    historical baggage. These include the Java-based `Apache ant <http://ant.apache.org/>`_ which is specified in XML.

    More interesting are a new breed of build systems whose scripts are written in modern programming
    languages, rather than a specially-invented "build" specification syntax.
    These include the Python `scons <http://www.scons.org/>`_, Ruby `rake <http://rake.rubyforge.org/>`_ and
    its Python port `Smithy <http://packages.python.org/Smithy/>`_.

    The great advantage is that computational pipelines do not need to be artificially parcelled out
    between the (often second-class) workflow management code and the logic which does the real computation
    in the pipeline. It also means that workflow management can use all the standard language and library
    features, for example, to read in directories, match file names using regular expressions and so on.

    **Ruffus** is much like scons in that the modern dynamic programming language Python is used seamlessly
    throughout its pipeline scripts.

.. _design.implicit_dependencies:

**************************************************************************
Implicit dependencies: disadvantages of `make` / `scons` / `rake`
**************************************************************************

    Although Python `scons <http://www.scons.org/>`_ and Ruby `rake <http://rake.rubyforge.org/>`_
    are in many ways more powerful and easier to use for building software, they are still an
    imperfect fit to the world of computational pipelines.

    This is a result of the way dependencies are specified, an essential part of their design inherited
    from `GNU make <http://www.gnu.org/software/make/>`_.

    The order of operations in all of these tools is specified in a *declarative* rather than
    *imperative* manner. This means that the sequence of steps that a build should take is
    not spelled out explicitly and directly. Instead, recipes are provided for turning input files
    of one type into another.

    So, for example, knowing that ``a->b``, ``b->c``, ``c->d``, the build
    system can infer how to get from ``a`` to ``d`` by performing the necessary operations in the correct order.

    This is immensely powerful for three reasons:
     #) The plumbing, such as dependency checking and passing output
        from one stage to another, is handled automatically by the build system. (This is the whole point!)
     #) The same *recipe* can be re-used at different points in the build.
     #) | Intermediate files do not need to be retained.
        | Given the automatic inference that ``a->b->c->d``,
          we don't need to keep ``b`` and ``c`` files around once ``d`` has been produced.
        |

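    As a rough illustration of this kind of inference, the plain-Python sketch below chains
    single-step recipes to find a route from ``a`` to ``d``. It is not how any of these build
    tools is actually implemented; the ``RULES`` table and the ``infer_steps`` function are
    hypothetical names used only for this example.

    .. code-block:: python

        # Hypothetical recipe table: each entry says "files of type X can be
        # turned into files of type Y". The order of the entries carries no meaning.
        RULES = {"c": "d", "a": "b", "b": "c"}


        def infer_steps(source, target, rules=RULES):
            """Return the list of (input, output) steps leading from source to target."""
            steps = []
            current = source
            while current != target:
                if current not in rules:
                    raise ValueError("no recipe turns %r into anything" % current)
                steps.append((current, rules[current]))
                current = rules[current]
                # More steps than recipes means we are going round in circles.
                if len(steps) > len(rules):
                    raise ValueError("recipes form a cycle")
            return steps


        # The caller only names the end points; the intermediate steps are inferred.
        print(infer_steps("a", "d"))

    Note that nothing in the recipe table states the order of the steps; it has to be
    reconstructed from the matching types.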

    The disadvantage is that because stages are specified only indirectly, in terms of
    file name matches, the flow through a complex build or a pipeline can be difficult to trace, and nigh
    impossible to debug when there are problems.


.. _design.explicit_dependencies_in_ruffus:

**************************************************************************
Explicit dependencies in `Ruffus`
**************************************************************************

    **Ruffus** takes a different approach. The order of operations is specified explicitly rather than inferred
    indirectly from the input and output types. So, for example, we would explicitly specify three successive and
    linked operations ``a->b``, ``b->c``, ``c->d``. The build system knows that the operations always proceed in
    this order.

    Looking at a **Ruffus** script, it is always immediately clear which succession of computational steps
    will be taken.

    **Ruffus** values clarity over syntactic cleverness.
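
    The idea of explicit ordering can be sketched in plain Python without any pipeline library.
    This is not the Ruffus API: the stage functions, the ``PIPELINE`` list and ``run_pipeline``
    are hypothetical, and only illustrate that the order of stages is written down directly
    rather than inferred.

    .. code-block:: python

        def a_to_b(data):
            return data + ["a->b"]


        def b_to_c(data):
            return data + ["b->c"]


        def c_to_d(data):
            return data + ["c->d"]


        # The order of the stages is stated explicitly, so the flow can be
        # read straight off the script.
        PIPELINE = [a_to_b, b_to_c, c_to_d]


        def run_pipeline(stages, data):
            """Call each stage in the stated order, feeding output to the next stage."""
            for stage in stages:
                data = stage(data)
            return data


        print(run_pipeline(PIPELINE, []))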

.. _design.static_dependencies:

**************************************************************************
Static dependencies: What `make` / `scons` / `rake` can't do (easily)
**************************************************************************

    `GNU make <http://www.gnu.org/software/make/>`_, `scons <http://www.scons.org/>`_ and `rake <http://rake.rubyforge.org/>`_
    work by inferring a static dependency graph (a directed acyclic graph) between all the files which
    are used by a computational pipeline. These tools locate the target that they are supposed
    to build and work backward through the dependency graph from that target,
    rebuilding anything that is out of date. This is perfect for building software,
    where the list of files can be computed **statically** at the beginning of the build.

    This is not an ideal match for scientific computational pipelines because:

        *  | Though the *stages* of a pipeline (e.g. `compile` or `DNA alignment`) are
             invariably well-specified in advance, the number of
             operations (*job*\s) involved at each stage may not be.
           |

        *  | A common approach is to break up large data sets into manageable chunks which
             can be operated on in parallel in computational clusters or farms
             (See `embarrassingly parallel problems <http://en.wikipedia.org/wiki/Embarrassingly_parallel>`_).
           | This means that the number of parallel operations or jobs varies with the data (the number of manageable chunks),
             and dependency trees cannot be calculated statically beforehand.
           |

    Computational pipelines require **dynamic** dependencies which are not calculated up-front, but
    at each stage of the pipeline.

    This is a *known* issue with traditional build systems, each of which has partial strategies to work around
    this problem:

        * gmake always builds the dependencies when first invoked, so dynamic dependencies require (complex!) recursive calls to gmake
        * `Rake dependencies unknown prior to running tasks <http://objectmix.com/ruby/759716-rake-dependencies-unknown-prior-running-tasks-2.html>`_.
        * `Scons: Using a Source Generator to Add Targets Dynamically <http://www.scons.org/wiki/DynamicSourceGenerator>`_


    **Ruffus** explicitly and straightforwardly handles tasks which produce an indeterminate (i.e. runtime dependent)
    number of outputs, using its **@split**, **@transform** and **@merge** function annotations.
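
    The split / transform / merge pattern itself can be sketched in a few lines of plain Python.
    The functions below are hypothetical stand-ins and do not use the Ruffus decorators; they
    only show that the number of jobs in the middle stage is not known until the data has been
    split.

    .. code-block:: python

        def split(records, chunk_size):
            """Break the data into chunks; the number of chunks depends on the data."""
            return [records[i:i + chunk_size]
                    for i in range(0, len(records), chunk_size)]


        def transform(chunk):
            """One job per chunk: here, simply square each value."""
            return [value * value for value in chunk]


        def merge(results):
            """Combine the per-chunk results into a single output."""
            return [value for chunk in results for value in chunk]


        def run(records, chunk_size=3):
            chunks = split(records, chunk_size)   # number of jobs is only known now
            results = [transform(chunk) for chunk in chunks]
            return len(chunks), merge(results)


        # The same pipeline runs a different number of jobs for different inputs.
        print(run(list(range(7))))
        print(run(list(range(2))))

    In a real pipeline each call to ``transform`` would typically be an independent job that
    could run in parallel.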

=============================================================================
Managing pipelines stage-by-stage using **Ruffus**
=============================================================================
    **Ruffus** manages pipeline stages directly.

        #) | The computational operations for each stage of the pipeline are written by you, in
             separate Python functions.
           | (These correspond to `gmake pattern rules <http://www.gnu.org/software/make/manual/make.html#Pattern-Rules>`_)
           |

        #) | The dependencies between pipeline stages (Python functions) are specified up-front.
           | These can be displayed as a flow chart.

           .. image:: images/front_page_flowchart.png

        #) **Ruffus** makes sure pipeline stage functions are called in the right order,
           with the right parameters, running in parallel using multiprocessing if necessary.

        #) Data file timestamps can be used to automatically determine if all or any parts
           of the pipeline are out-of-date and need to be rerun.

        #) Separate pipeline stages, and operations within each pipeline stage,
           can be run in parallel provided they are not inter-dependent.

    Another way of looking at this is that **ruffus** re-constructs datafile dependencies dynamically
    on-the-fly when it gets to each stage of the pipeline, giving much more flexibility.
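
    The timestamp test mentioned in the list above can be approximated as follows. This is a
    minimal sketch using only the Python standard library, and the ``needs_update`` function is
    a hypothetical helper; the checks that **Ruffus** itself performs may differ in detail.

    .. code-block:: python

        import os
        import tempfile


        def needs_update(input_files, output_files):
            """Return True if any output is missing or older than the newest input."""
            if not all(os.path.exists(name) for name in output_files):
                return True
            newest_input = max(os.path.getmtime(name) for name in input_files)
            oldest_output = min(os.path.getmtime(name) for name in output_files)
            return oldest_output < newest_input


        with tempfile.TemporaryDirectory() as folder:
            source = os.path.join(folder, "data.a")
            result = os.path.join(folder, "data.b")

            with open(source, "w") as handle:
                handle.write("input\n")
            print(needs_update([source], [result]))   # output does not exist yet

            with open(result, "w") as handle:
                handle.write("output\n")
            # Set explicit modification times so that the comparison does not
            # depend on the resolution of the file system clock.
            os.utime(source, (1000, 1000))
            os.utime(result, (2000, 2000))
            print(needs_update([source], [result]))   # output is newer than input

            os.utime(source, (3000, 3000))
            print(needs_update([source], [result]))   # input was modified afterwards

    The sketch assumes that both file lists are non-empty.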

**************************************************************************
Disadvantages of the Ruffus design
**************************************************************************
    Are there any disadvantages to this trade-off for additional clarity?

        #) Each pipeline stage needs to take the right input and output. For example, if we specified the
           steps in the wrong order: ``a->b``, ``c->d``, ``b->c``, then no useful output would be produced.
        #) We cannot re-use the same recipes in different parts of the pipeline.
        #) Intermediate files need to be retained.


    In our experience, it is always obvious when pipeline operations are in the wrong order, precisely because the
    order of computation is the very essence of the design of each pipeline. Ruffus produces extra diagnostics when
    no output is created in a pipeline stage (this usually happens because of incorrectly specified regular expressions).

    Re-use of recipes is as simple as an extra call to common function code.
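
    For instance, two different stages can share one recipe simply by calling the same helper.
    All of the names in this plain-Python sketch are hypothetical and chosen only for
    illustration.

    .. code-block:: python

        def count_lines(text):
            """Common recipe shared by more than one pipeline stage."""
            return len(text.splitlines())


        def summarise_raw(text):
            # First stage that needs the shared recipe.
            return {"stage": "raw", "lines": count_lines(text)}


        def summarise_filtered(text):
            # A later stage re-uses the same recipe on different data.
            kept = "\n".join(line for line in text.splitlines() if line.strip())
            return {"stage": "filtered", "lines": count_lines(kept)}


        sample = "one\n\ntwo\n"
        print(summarise_raw(sample))
        print(summarise_filtered(sample))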

    Finally, some users have proposed future enhancements to **Ruffus** to handle unnecessary temporary / intermediate files.


.. index:: 
    pair: Design; Comparison of Ruffus with alternatives

=================================================
Alternatives to **Ruffus**
=================================================

    A comparison of more make-like tools is available from `Ian Holmes' group <http://biowiki.org/MakeComparison>`_.

    Build systems include:

            * `GNU make <http://www.gnu.org/software/make/>`_
            * `scons <http://www.scons.org/>`_
            * `ant <http://ant.apache.org/>`_
            * `rake <http://rake.rubyforge.org/>`_

    There are also complete workload management systems such as Condor.
    Various bioinformatics pipelines are also available, including that used by the
    leading genome annotation website Ensembl, Pegasys, GPIPE, Taverna, Wildfire, MOWserv,
    Triana, Cyrille2 etc. These are all either hardwired to specific databases and tasks,
    or have steep learning curves for both the scientist/developer and the IT system
    administrators.

    **Ruffus** is designed to be lightweight and unintrusive enough to use for writing pipelines
    with just 10 lines of code.


.. seealso::


   **Bioinformatics workload management systems**

    Condor:
        http://www.cs.wisc.edu/condor/description.html

    Ensembl Analysis pipeline:
        http://www.ncbi.nlm.nih.gov/pubmed/15123589


    Pegasys:
        http://www.ncbi.nlm.nih.gov/pubmed/15096276

    GPIPE:
        http://www.biomedcentral.com/pubmed/15096276

    Taverna:
        http://www.ncbi.nlm.nih.gov/pubmed/15201187

    Wildfire:
        http://www.biomedcentral.com/pubmed/15788106

    MOWserv:
        http://www.biomedcentral.com/pubmed/16257987

    Triana:
        http://dx.doi.org/10.1007/s10723-005-9007-3

    Cyrille2:
        http://www.biomedcentral.com/1471-2105/9/96


.. index:: 
    single: Acknowledgements

**************************************************
Acknowledgements
**************************************************
 *  Bruce Eckel's insightful article on
    `A Decorator Based Build System <http://www.artima.com/weblogs/viewpost.jsp?thread=241209>`_
    was the obvious inspiration for the use of decorators in *Ruffus*.

    The rest of *Ruffus* takes a different approach. In particular:
        #. *Ruffus* uses task-based, not file-based, dependencies.
        #. *Ruffus* tries to have minimal impact on the functions it decorates.

           Bruce Eckel's design wraps functions in "rule" objects.

           *Ruffus* tasks are added as attributes of the functions, which can still be
           called normally. This is how *Ruffus* decorators can be layered in any order
           onto the same task.

 *  Languages like C++ and Java would probably use a "mixin" approach.
    Python's easy support for reflection and function references,
    as well as the necessity of marshalling over process boundaries, dictated the
    internal architecture of *Ruffus*.
 *  The `Boost Graph library <http://www.boost.org>`_ for textbook implementations of directed
    graph traversals.
 *  `Graphviz <http://www.graphviz.org/>`_. Just works. Wonderful.
 *  Andreas Heger, Christoffer Nellåker and Grant Belgard for driving Ruffus towards
    ever simpler syntax.


.. index:: 
    pair: Ruffus; Etymology
    pair: Ruffus; Name origins

**************************************************
Whence the name *Ruffus*?
**************************************************

.. image:: images/wikimedia_cyl_ruffus.jpg

**Cylindrophis ruffus** is the name of the
`red-tailed pipe snake <http://en.wikipedia.org/wiki/Cylindrophis_ruffus>`_ (bad python-y pun)
which can be found in `Hong Kong <http://www.discoverhongkong.com/eng/index.html>`_ where the original author comes from.
Be careful not to step on one when running down country park lanes at full speed
in Hong Kong: this snake is a `rare breed <http://www.hkras.org/eng/info/hkspp.htm>`_!

*Ruffus* is a shy creature, and pretends to be a cobra by putting up its red tail and ducking its
head in its coils when startled. It is not venomous and is
`Mostly Harmless <http://en.wikipedia.org/wiki/Mostly_Harmless>`_.
*Ruffus* does most of its work at night and sleeps during the day:
typical of many (but alas not all) Python programmers!


The original image is from `Wikimedia <http://upload.wikimedia.org/wikipedia/commons/a/a1/Cyl_ruffus_061212_2025_tdp.jpg>`_.
