/CHANGES.TXT
https://code.google.com/p/ruffus/ · Plain Text · 593 lines · 502 code · 91 blank · 0 comment · 0 complexity · be4ea61b290fb379d8d324e03f692c3c MD5 · raw file
- = v. 2.3=
- _03/October/2011_
- ==`@active_if` turns off tasks at runtime==
- * Issue 36
- * Design and initial implementation from Jacob Biesinger
- * Takes one or more parameters which can be either booleans or functions or callable objects which return True / False
- * The expressions inside @active_if are evaluated each time
- `pipeline_run`, `pipeline_printout` or `pipeline_printout_graph` is called.
- * Dormant tasks behave as if they are up to date and have no output.
- ==Command line parsing for pipelines==
- * From "Issue 44"
- * Added ruffus/cmdline.py
- * Supports both argparse (python 2.7) and optparse (python 2.6):
- * The following options are defined by default:
- {{{
- --verbose
- --version
- --log_file
- -t, --target_tasks
- -j, --jobs
- -n, --just_print
- --flowchart
- --key_legend_in_graph
- --draw_graph_horizontally
- --flowchart_format
- --forced_tasks
- }}}
- * Usage with argparse (Python > 2.7):
- {{{
- from ruffus import *
- parser = cmdline.get_argparse( description='WHAT DOES THIS PIPELINE DO?')
- # for example...
- parser.add_argument("--input_file")
- options = parser.parse_args()
- # optional logger which can be passed to ruffus tasks
- logger, logger_mutex = cmdline.setup_logging (__name__, options.log_file, options.verbose)
- #_____________________________________________________________________________________
- # pipelined functions go here
- #_____________________________________________________________________________________
- cmdline.run (options)
- }}}
- * Usage with optparse (Python 2.6):
- {{{
- from ruffus import *
- parser = cmdline.get_optgparse(version="%prog 1.0", usage = "\n\n %prog [options]")
- # for example...
- parser.add_option("-c", "--custom", dest="custom", action="count")
- (options, remaining_args) = parser.parse_args()
- # logger which can be passed to ruffus tasks
- logger, logger_mutex = cmdline.setup_logging ("this_program", options.log_file, options.verbose)
- #_____________________________________________________________________________________
- # pipelined functions go here
- #_____________________________________________________________________________________
- cmdline.run (options)
- }}}
- ==Optionally terminate pipeline after first exception==
- * To have all exceptions interrupt immediately:
- {{{
- pipeline_run(..., exceptions_terminate_immediately = True)
- }}}
- * From "Issue 43"
- * By default ruffus accumulates `NN` errors before interrupting the pipeline prematurely. `NN` is the specified parallelism for `pipeline_run(...)`.
- * By default, a pipeline will only be interrupted immediately if exceptions of type `ruffus.JobSignalledBreak` are thrown.
- ==Display exceptions without delay==
- * To see exceptions as they occur:
- {{{
- pipeline_run(..., log_exceptions = True)
- }}}
- * From "Issue 43"
- * By default, Ruffus re-throws exceptions in ensemble after pipeline termination.
- * `logger.error(...)` will be invoked with the string representation of the each exception, and associated stack trace.
- * The default logger prints to sys.stderr, but this can be changed to any class from the logging module or compatible object via `pipeline_run(..., logger = ???)`
- ==`@split` operations now show the 1->many output in pipeline_printout==
- * From "Issue 45"
- * New output
- {{{
- Task = split_animals
- Job = [None
- -> cows
- -> horses
- -> pigs
- , any_extra_parameters]
- }}}
- ==Improved display from `pipeline_printout()`==
- * File date and time are displayed in human readable form and out of date
- files are flagged with asterisks.
- = v. 2.2=
- _21/July/2010_
- ==Parameter substitution for `inputs(...)` / `add_inputs(...)`==
- `glob`s and tasks can be added as the prerequisites / input files using
- `inputs(...)` and `add_inputs(...)`. `glob` expansions will take place when the task
- is run.
- ==Simplifying `@transform` syntax with suffix==
- Regular expressions within ruffus are very powerful, and can allow files to be moved
- from one directory to another and renamed at will.<br><br>
- However, using consistent file extensions and
- `@transform(..., suffix(...))` makes the code much simpler and easier to read. <br><br>
- Previously, `suffix(...)` did not cooperately well with `inputs(...)`.
- For example, finding the corresponding header file (``'.h'``) for the matching input
- required a complicated `regex(...)` regular expression and `input(...)`. This simple case,
- e.g. matching ``'something.c'`` with ``'something.h'``, is now much easier in Ruffus.<br><br>
- For example:
- {{{
- source_files = ["something.c", "more_code.c"]
- @transform(source_files, suffix(".c"), add_inputs(r"\1.h", "common.h"), ".o")
- def compile(input_files, output_file):
- ( source_file,
- header_file,
- common_header) = input_files
- # call compiler to make object file
- }}}
- This is equivalent to calling:
- {{{
- compile(["something.c", "something.h", "common.h"], "something.o")
- compile(["more_code.c", "more_code.h", "common.h"], "more_code.o")
- }}}
- The `\1` matches everything *but* the suffix and will be applied to both `glob`s and file names.<br>
- For simplicity and compatibility with previous versions, there is always an implied `r"\1"` before
- the output parameters. I.e. output parameters strings are *always* substituted.<br>
-
- ==Advanced form of `@split`:==
- The standard `@split` divided one set of inputs into multiple outputs (the number of which
- can be determined at runtime).<br>
- This is a `one->many` operation.<br><br>
- An advanced form of `@split` has been added which can split each of several files further.<br>
- In other words, this is a `many->"many more"` operation.<br><br>
- For example, given three starting files:
- {{{
- original_files = ["original_0.file",
- "original_1.file",
- "original_2.file"]
- }}}
- We can split each into its own set of sub-sections:
- {{{
- @split(original_files,
- regex(r"starting_(\d+).fa"), # match starting files
- r"files.split.\1.*.fa" # glob pattern
- r"\1") # index of original file
- def split_files(input_file, output_files, original_index):
- """
- Code to split each input_file
- "original_0.file" -> "files.split.0.*.fa"
- "original_1.file" -> "files.split.1.*.fa"
- "original_2.file" -> "files.split.2.*.fa"
- """
- }}}
- This is, conceptually, the reverse of the @collate(...) decorator
- ==Ruffus will complain about unescaped regular expression special characters:==
- Ruffus uses ``'\1'`` and ``'\2'`` in regular expression substitutions. Even seasoned python
- users may not remember that these have to be 'escaped' in strings. The best option is
- to use 'raw' python strings e.g. `r"\1_substitutes\2correctly\3four\4times"`.<br>
- Ruffus will throw an exception if it sees an unescaped ``'\1'`` or ``'\2'`` in a file name,
- which should catch most of these bugs.
- ==Flowchart changes:==
- Changed to nicer colours, symbols etc. for a more professional look.
- Colours, size and resolution are now fully customisable. An svg bug in firefox has
- been worked around so that font size are displayed correctly
- {{{
- pipeline_printout_graph( #...
- user_colour_scheme = {
- "colour_scheme_index":1
- "Task to run" : {"fillcolor":"blue"},
- pipeline_name : "My flowchart",
- size : (11,8),
- dpi : 120)})
- }}}
- ==Bug Fix:==
- * From "Issue 27"
- Previously, Ruffus paused for one second after each job.
- This accomodates poor (one second) timestamp precision in some older file systems (ext3?),
- and makes sure that output from the previous tasks has a different
- timestamp from that of the following task.<br><br>
- Unfortunately, Ruffus (was too clever by half and) paused only when the jobs were less
- than a second in duration.
- Output files may be created at the end of a task, and
- the timestamps checked at the beginning of the following task. We thus *always* need a
- gap of > 1 seconds between tasks in older filesystems, whether the jobs are long or short.<br><br>
- The fix is to introduce a pause before the first job of each task.
- (See `one_second_per_job` in `task.py:make_job_parameter_generator(...)`)<br><br>
- As previously, if you are using a modern file system (e.g. ext4 / JFS / NTFS), you can avoid these unnecessary pauses by setting the `one_second_per_job` flag:
- {{{
- pipeline_run(one_second_per_job=False)
- }}}
- * From "Issue 30"
- @split with empty input files crashes Ruffus
- ==Documentation changes:==
- * New bioinformatics example
- * New contributed Gallery of flowcharts
- = v. 2.1.1=
- _12/March/2010_
- ==Bug Fix:==
- * From "Issue 26"
- The code "with job_limit_semaphore" breaks compatability with python 2.5<br>
- Thanks to patch from S. Binet
- * From "Issue 25"
- @merge forwarding single arguments to @merge erroneously as lists<br>
- Thanks to A. heger.
- * @transform(..., suffix(...), inputs(...))
- Suffix substitution should not have been taking place within `inputs()`. This
- makes it pass a file name to `inputs()` without suffix substitution.
- `Regex()` regular expression substitution continues to take place within `inputs()`<br>
- However, see changes in v. 2.2
- ==Documentation changes:==
- * New step in tutorial emphasising the value of Pipeline_printout(...) in pipeline development
- * pipeline_printout discussion in the manual.
- * @jobs_limit directive described in the manual.
- * Advance uses of @split described in the manual.
- * touch_files_only parameter described in the manual.
- * `add_inputs(...)` parameter described in the manual.
- ==`@transform(.., add_inputs(...))`==
- * `inputs(...)` allowed the addition of extra input dependencies / parameters for each job.
- For example, compiling a source file might require pulling in a corresponding
- header file.
- However, replacing all the input parameters always seemed a very blunt instrument
- just to inject an extra dependency (e.g. a header file).
- * `add_inputs(...)`, as the name suggests, just adds the additional items as the input parameter
- {{{
- from ruffus import *
- @transform(["a.input", "b.input"], suffix(".input"), add_inputs("just.1.more","just.2.more"), ".output")
- def task(i, o):
- ""
- }}}
- produces:
- {{{
- Job = [[a.input, just.1.more, just.2.more] ->a.output]
- Job = [[b.input, just.1.more, just.2.more] ->b.output]
- }}}
- * like `inputs`, `add_inputs` accepts strings, tasks and globs
- This minor syntactic change promises to add much clarity to some of our
- Ruffus code.
- * `add_inputs()` is available for `@transform`, `@collate` and `@split`
-
- = v. 2.1.0=
- _2/March/2010_
- ==Bug Fix:==
- * From "Issue 25".
- Regression for v. 2.0.10
- @files forwarding single arguments as lists.
- (Thanks to A. Heger)
- ==@jobs_limit directive==
- * Some tasks are resource intensive and too many jobs should not be run at the
- same time. Examples include disk intensive operations such as unzipping, or
- downloading from FTP sites.
- Adding
- {{{
- @jobs_limit(4)
- @transform(new_data_list, suffix(".big_data.gz"), ".big_data")
- def unzip(i, o):
- "unzip code goes here"
- }}}
- would limit the unzip operation to 4 jobs at a time, even if the rest of the
- pipeline runs highly in parallel.
- (Thanks to R. Young for suggesting this.)
- = v. 2.0.10=
- _27/February/2010_
- ==pipeline_run(..., touch_files_only = True)==
- * This will only `touch` output files for each job without running the
- python function. I.e. The output files are updated if they are old, or created
- if missing.
- This can be useful for simulating a pipeline run so that all files look as
- if they are up-to-date.
- Caveats:
- * This may not work correctly where output files are only determined at runtime, e.g. with @split
- * Only the output from pipelined jobs which are currently out-of-date will be touched.
- In other words, the pipeline runs *as normal*, the only difference is that the
- output files are touched instead of being created by the python task functions
- which would otherwise have been invoked.
- ==parameter substitution for inputs(...)==
- * The inputs(...) parameter in @transform, @collate can now take tasks and globs,
- and these will be substituted appropriately (after regular expression replacement).
- ==Bug Fix:==
- * From "Issue 21".
- Empty @files specifications no longer throw exceptions.
- If verbose logging is on, a warning is printed.
- (Thanks to A. Heger)
- = v. 2.0.9=
- _25/February/2010_
- ==Bug Fix:==
- * From "Issue 10".
- Source code directory under svn is now in /ruffus rather than src/ruffus
- (Thanks to P.J. Davis)
- * Better display of @split parameters when logging output
- The output parameters in @split should not be expanded
- if they are wildcards. This was previously handled as a special case. Now
- all parameter factories return two sets of parameters:
- The first to go to jobs, the second for displaying in trace logs.
- * Pipeline_printout defaults to verbose of 1. Verbose of 0 does nothing
- (Thanks to L.S.G)
- * The "Start Task" log message at verbosity of 3 was misleading.
- This is only when the task enters the queue.
- If there are multiple independent tasks, they may all enter the queue at the
- same time even with multiprocess=1. Jobs will be run one at a time.
- (Thanks to C. Nellaker.)
- ==Advanced form of split:==
- * Previously split only takes 1 set of input (tasks/files/globs) and split these into an indeterminate number of output.
- The new advanced form of split takes multiple input, and splits EACH of these
- further. I.e. it is like a combination of @split and @transform.
- For example:
- {{{
- @split(get_files, regex(r"(.+).original"), r"\1.*.split")
- def split_files(i, o):
- pass
- }}}
- This experimental feature will be in beta without documentation. Caveat utilitor!
-
- = v. 2.0.8=
- _22/January/2010_
- ==Bug Fix:==
- * Now accepts unicode file names:
- Change `isinstance(x,str)` to `isinstance(x, basestring)`
- (Thanks to P.J. Davis for contributing this.)
- * inputs(...) now raises an exception when passed multiple arguments.
- If the input parameter is being passed a tuple, add an extra set of enclosing
- brackets. Documentation updated accordingly.
- (Thanks to Z. Harris for spotting this.)
- * tasks where regular expressions are incorrectly specified are a great source of frustration
- and puzzlement.
- Now if no regular expression matches occur, a warning is printed
- (Thanks to C. Nellaker for suggesting this)
- = v. 2.0.7=
- _11/December/2009_
- ==Bug Fix:==
- * graph printout blows up because of missing run time data error
- (Thanks to A. Heger for reporting this!)
- = v. 2.0.6=
- _10/December/2009_
- ==Bug Fix:==
- * several minor bugs
- * better error messages when eoncountering decorator problems when checking if the pipeline is uptodate
- * Exception when output specifications in @split were expanded (unnecessarily) in logging.
- (Thanks to N. Spies for reporting this!)
- = v. 2.0.4=
- _22/November/2009_
- ==Bug Fix:==
- * task.get_job_names() dies for jobs with no parameters
- * JobSignalledBreak was not exported
- = v. 2.0.3=
- _18/November/2009_
- ==Bug Fix:==
- * @transform accepts single file names. Thanks Chris N.
- = v. 2.0.2=
- _18/November/2009_
- ==Better Logging:==
- * pipeline_printout output much prettier
- * pipeline_run at high verbose levels
- Shows which tasks are being checked
- to see if they are up-to-date or not
- ==Documentation:==
- * New tutorial
- * New manual
- * pretty code figures
- = v. 2.0.1=
- _18/November/2009_
- All unit tests passed
- ==Bug Fix:==
- * Numerous bugs to do with ordering of glob / job output consistency
- = v. 2.0.1 beta4=
- _16/November/2009_
- ==Bug Fix:==
- * Fixed problems with tasks depending on @split
- = v. 2.0 beta=
- _30/October/2009_
- With the experience and feedback over the past few months, I have reworked **Ruffus**
- completely mainly to make the syntax more transparent, with fewer gotchas.
- Previous limitations to do with parameters have been removed.
- The experience with what *Ruffus* calls "Indicator Objects" has been very positive
- and there are more of them.
- These are dummy classes with obvious names like "regex" or "suffix" which indicate the
- type of optional parameters much like named parameters.
- ==New Decorators:==
- * @split
- * @merge
- * @transform
- * @collate
- ==Deprecated Decorators:==
- * @files_re
- Functionality is divided among the new decorators
-
- ==New Features:==
- * Files can be chained from task to task, implicit dependencies are inferred automatically
- * Limitations on parameters removed. Any degree of nesting is allowed.
- * Strings contain glob letters ``[]?*`` automatically inferred as globs and expanded
- * input and output parameters containing strings assumed to be filenames, whatever the nested data structures they are found in
- ==Documentation:==
- * New documentation almost complete
- * New Simplified 7 step tutorial
- * New manual work in progress
- ==Bug Fix:==
- * Scheduling errors
- = v. 1.1.4=
- _15/October/2009_
- ==New Feature:==
- * Tasks can get their input by automatically chaining to the output from one or more parent tasks using the `@files_re`
- * Added example showing how files can be split up into multiple jobs and then recombined
- # Run `test/test_filesre_split_and_combine.py` with `-v|--verbose` `-s|--start_again`
- # Run with `-D|--debug` to test.
- * Documentation to follow
- ==Bug Fix:==
- * Scheduling race conditions
- = v. 1.1.3=
- _14/October/2009_
- ==Bug Fix:==
- * Minor (but show stopping) bug in task.generic_job_descriptor
- = v. 1.1.2=
- _9/October/2009_
-
- ==Bug Fix:==
- * Nasty (long standing) bug for single job tasks only decorated with `@follows(mkdir(...))` to be caught in an infinite loop
- ==Code Changes:==
- * Add example of combining multiple input files depending on a regular expression pattern.
- # Run `test/test_filesre_combine.py` with -v (verbose)
- # Run with -D (debug) to test.
- = v. 1.1.1=
- _8/October/2009_
-
- ==New Feature:==
- * _Combine multiple input files using a regular expression_
- * Added `combine` syntax to `@files_re` decorators:
- * Documentation to follow...
- * Example from `src/ruffus/test/test_branching_dependencies.py`:
- {{{
- @files_re('*.*', '(.*/)(.*\.[345]$)', combine(r'\1\2'), r'\1final.6')
- def test(input_files, output_files):
- pass`
- }}}
- * will take all files in the current directory
- * will identify files which end in `.3`, `.4` and `.5` as input files
- * will use `final.6` as the output file
- * `input_files == [a.3, a.4, b.3, b.5]` (for example)
- * `output_files == [final.6]`
-
- ==Bug Fix:==
- * All (known) bugs for running jobs from independent tasks in parallel
- = v. 1.0.9=
- _8/October/2009_
-
- ==New Feature:==
- _Multitasking independent tasks_
- * In a major piece of retooling, jobs from independent tasks which do not depend on each other will be run in parallel.
- * This involves major changes to the scheduling code.
- * Please contact me asap if anything breaks.
- ==Code Changes:==
- * Add example of independent tasks running concurrently in
- `test/test_branching_dependencies.py`
- * Run with -v (verbose) and -j 1 or -j 10 to show the indeterminancy of multiprocessing.
- * Run with -D (debug) to test.
- = v. 1.0.8=
- _12/August/2009_
-
- ==Documentation:==
- * Errors fixed. Thanks to Matteo Bertini!
- ==Code Changes:==
- * Added functions which print out job parameters more prettily.
- * `task.shorten_filenames_encoder`
- * `task.ignore_unknown_encoder`
- * Parameters which look like file paths will only have the file part printed
- (i.e. `"/a/b/c" -> 'c'`)
- * Test scripts `simpler_with_shared_logging.py` and `test_follows_mkdir.py`
- have been changed to test for this.
- = v. 1.0.7=
- _17/June/2009_
-
- ==Code Changes:==
- * Added `proxy_logger` module for accessing a shared log across multiple jobs in
- different processes.
- = v. 1.0.6=
- _12/June/2009_
-
- ==Bug fix:==
- * _Ruffus_ version module (`ruffus_version.py`) links fixed
- Soft links in linux do not travel well
- * `mkdir` now can take a list of strings
- added test case
- ==Documentation:==
- * Added history of changes
- = v. 1.0.5=
- _11/June/2009_
- ==Bug fix:==
- * Changed "graph_printout" to `pipeline_printout_graph` in documentation.
- This function had been renamed in the code but not in the documentation :-(
- ==Documentation:==
- * Added example for sharing synchronising data between jobs.
- This shows how different jobs can write to a common log file while still leaveraging the full power of _ruffus_.
-
- ==Code Changes:==
- * The graph and print_dependencies modules are no longer exported by default from task.
- Please email me if this breaks anything.
- * More informative error message when refer to unadorned (without _Ruffus_ decorators) python functions as pipelined Tasks
- * Added Ruffus version module `ruffus_version.py`
- = v. 1.0.4=
- _05/June/2009_
- ==Bug fix: ==
- * `task.task_names_to_tasks` did not include tasks specified by function rather than name
- * `task.run_all_jobs_in_task` did not work properly without multiprocessing (# of jobs = 1)
- * `task.pipeline_run` only uses multiprocessing pools if `multiprocess` (# of jobs) > 1
-
- ==Changes to allow python 2.4/2.5 to run:==
- * `setup.py` changed to remove dependency
- * `simplejson` can be loaded instead of python 2.6 `json` module
- * Changed `NamedTemporaryFile` to `mkstemp` because delete parameter is not available before python 2.6
- ==Windows programmes==
- It is necessary to protect the "entry point" of the program under windows.
- Otherwise, a new process with be created recursively, like the magicians's apprentice
- See: http://docs.python.org/library/multiprocessing.html#multiprocessing-programming
- = v. 1.0.3=
- _04/June/2009_
- ==Documentation ==
-
- Including SGE `qrsh` workaround in FAQ.
- = v. 1.0.1=
- _22/May/2009_
- ==Add simple tutorial.==
-
- No major bugs so far...!!
- = v. 1.0.0 beta =
- _28/April/2009_
- Initial Release in Oxford