/CHANGES.TXT

https://code.google.com/p/ruffus/

= v. 2.3=
_03/October/2011_
==`@active_if` turns off tasks at runtime==
* Issue 36
* Design and initial implementation from Jacob Biesinger
* Takes one or more parameters, each of which can be a boolean, a function or a callable object returning True / False (an example follows below)
* The expressions inside @active_if are evaluated each time `pipeline_run`, `pipeline_printout` or `pipeline_printout_graph` is called.
* Dormant tasks behave as if they are up to date and have no output.
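A minimal sketch of the intended usage (the flag `run_qc` and the task below are illustrative, not from these release notes):
{{{
from ruffus import *

run_qc = False    # could equally be a function or callable object returning True / False

@active_if(run_qc)
@transform(["a.input", "b.input"], suffix(".input"), ".qc")
def quality_check(input_file, output_file):
    open(output_file, "w").close()

# quality_check is skipped: it is treated as up to date and as having no output
pipeline_run()
}}}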
==Command line parsing for pipelines==
* From "Issue 44"
* Added ruffus/cmdline.py
* Supports both argparse (python 2.7) and optparse (python 2.6):
* The following options are defined by default:
{{{
    --verbose
    --version
    --log_file
-t, --target_tasks
-j, --jobs
-n, --just_print
    --flowchart
    --key_legend_in_graph
    --draw_graph_horizontally
    --flowchart_format
    --forced_tasks
}}}
* Usage with argparse (Python >= 2.7):
{{{
from ruffus import *
parser = cmdline.get_argparse( description='WHAT DOES THIS PIPELINE DO?')

# for example...
parser.add_argument("--input_file")

options = parser.parse_args()

# optional logger which can be passed to ruffus tasks
logger, logger_mutex = cmdline.setup_logging (__name__, options.log_file, options.verbose)

#_____________________________________________________________________________________
#   pipelined functions go here
#_____________________________________________________________________________________

cmdline.run (options)
}}}
* Usage with optparse (Python 2.6):
{{{
from ruffus import *
parser = cmdline.get_optgparse(version="%prog 1.0", usage = "\n\n %prog [options]")

# for example...
parser.add_option("-c", "--custom", dest="custom", action="count")

(options, remaining_args) = parser.parse_args()

# logger which can be passed to ruffus tasks
logger, logger_mutex = cmdline.setup_logging ("this_program", options.log_file, options.verbose)

#_____________________________________________________________________________________
#   pipelined functions go here
#_____________________________________________________________________________________

cmdline.run (options)
}}}
==Optionally terminate pipeline after first exception==
* To have all exceptions interrupt immediately:
{{{
pipeline_run(..., exceptions_terminate_immediately = True)
}}}
* From "Issue 43"
* By default ruffus accumulates `NN` errors before interrupting the pipeline prematurely. `NN` is the specified parallelism for `pipeline_run(...)`.
* By default, a pipeline will only be interrupted immediately if exceptions of type `ruffus.JobSignalledBreak` are thrown (see the sketch below).
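For illustration, a sketch of a job raising `ruffus.JobSignalledBreak` (the task and file names are made up):
{{{
from ruffus import *

@files("raw.data", "clean.data")
def clean_data(input_file, output_file):
    # raising JobSignalledBreak interrupts the whole pipeline straight away,
    # without waiting for the usual NN accumulated errors
    raise JobSignalledBreak("stop the pipeline now")
}}}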
==Display exceptions without delay==
* To see exceptions as they occur:
{{{
pipeline_run(..., log_exceptions = True)
}}}
* From "Issue 43"
* By default, Ruffus re-throws exceptions in ensemble after pipeline termination.
* `logger.error(...)` will be invoked with the string representation of each exception, and the associated stack trace.
* The default logger prints to sys.stderr, but this can be changed to any class from the logging module or compatible object via `pipeline_run(..., logger = ???)` (see the example below).
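For example, a sketch passing a standard `logging` logger (names illustrative):
{{{
import logging
from ruffus import *

custom_logger = logging.getLogger("my_pipeline")
custom_logger.addHandler(logging.StreamHandler())
custom_logger.setLevel(logging.DEBUG)

# exceptions are reported via custom_logger.error(...) as they occur
pipeline_run(logger = custom_logger, log_exceptions = True)
}}}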
==`@split` operations now show the 1->many output in pipeline_printout==
* From "Issue 45"
* New output:
{{{
    Task = split_animals
        Job = [None
              -> cows
              -> horses
              -> pigs
               , any_extra_parameters]
}}}
==Improved display from `pipeline_printout()`==
* File dates and times are displayed in human readable form, and out of date files are flagged with asterisks (an example call is shown below).
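A typical call to see this output (the target task name is illustrative):
{{{
import sys
from ruffus import *

@files("old.data", "new.results")
def final_task(input_file, output_file):
    pass

# higher verbose levels list each job's files; out of date files are flagged with asterisks
pipeline_printout(sys.stdout, [final_task], verbose = 3)
}}}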

= v. 2.2=
_21/July/2010_
==Parameter substitution for `inputs(...)` / `add_inputs(...)`==
`glob`s and tasks can be added as the prerequisites / input files using `inputs(...)` and `add_inputs(...)`. `glob` expansions will take place when the task is run.
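For example, a sketch mixing a glob and an upstream task inside `add_inputs(...)` (all file and task names are illustrative):
{{{
from ruffus import *

@files(None, "reference.index")
def build_index(input_file, output_file):
    open(output_file, "w").close()

# each job receives its matched ".data" file plus the files matching "*.config"
# and the output of build_index; the glob is expanded when the task runs
@transform(["a.data", "b.data"],
           suffix(".data"),
           add_inputs("*.config", build_index),
           ".result")
def analyse(input_files, output_file):
    open(output_file, "w").close()
}}}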
==Simplifying `@transform` syntax with suffix==
Regular expressions within ruffus are very powerful, and can allow files to be moved
from one directory to another and renamed at will.<br><br>
However, using consistent file extensions and
`@transform(..., suffix(...))` makes the code much simpler and easier to read. <br><br>
Previously, `suffix(...)` did not cooperate well with `inputs(...)`.
For example, finding the corresponding header file (``'.h'``) for the matching input
required a complicated `regex(...)` regular expression and `inputs(...)`. This simple case,
e.g. matching ``'something.c'`` with ``'something.h'``, is now much easier in Ruffus.<br><br>
For example:
{{{
source_files = ["something.c", "more_code.c"]

@transform(source_files, suffix(".c"), add_inputs(r"\1.h", "common.h"), ".o")
def compile(input_files, output_file):
    ( source_file,
      header_file,
      common_header) = input_files
    # call compiler to make object file
}}}
This is equivalent to calling:
{{{
compile(["something.c", "something.h", "common.h"], "something.o")
compile(["more_code.c", "more_code.h", "common.h"], "more_code.o")
}}}
The `\1` matches everything *but* the suffix and will be applied to both `glob`s and file names.<br>
For simplicity and compatibility with previous versions, there is always an implied `r"\1"` before
the output parameters. I.e. output parameter strings are *always* substituted.<br>
==Advanced form of `@split`:==
The standard `@split` divided one set of inputs into multiple outputs (the number of which
can be determined at runtime).<br>
This is a `one->many` operation.<br><br>
An advanced form of `@split` has been added which can split each of several files further.<br>
In other words, this is a `many->"many more"` operation.<br><br>
For example, given three starting files:
{{{
original_files = ["original_0.file",
                  "original_1.file",
                  "original_2.file"]
}}}
We can split each into its own set of sub-sections:
{{{
@split(original_files,
       regex(r"original_(\d+)\.file"),   # match starting files
       r"files.split.\1.*.fa",           # glob pattern for each job's output
       r"\1")                            # index of original file
def split_files(input_file, output_files, original_index):
    """
        Code to split each input_file
            "original_0.file" -> "files.split.0.*.fa"
            "original_1.file" -> "files.split.1.*.fa"
            "original_2.file" -> "files.split.2.*.fa"
    """
}}}
This is, conceptually, the reverse of the @collate(...) decorator.
==Ruffus will complain about unescaped regular expression special characters:==
Ruffus uses ``'\1'`` and ``'\2'`` in regular expression substitutions. Even seasoned python
users may not remember that these have to be 'escaped' in strings. The best option is
to use 'raw' python strings e.g. `r"\1_substitutes\2correctly\3four\4times"`.<br>
Ruffus will throw an exception if it sees an unescaped ``'\1'`` or ``'\2'`` in a file name,
which should catch most of these bugs.
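For example, a small sketch of the difference (task and file names made up):
{{{
from ruffus import *

@files(None, "demo.input")
def make_input(input_file, output_file):
    open(output_file, "w").close()

# r"\1.output" is a raw string, so the "\1" survives for Ruffus to substitute.
# Writing "\1.output" without the r prefix embeds the character '\x01',
# and Ruffus raises an exception about the unescaped '\1'.
@transform(make_input, regex(r"(.+)\.input"), r"\1.output")
def rename(input_file, output_file):
    open(output_file, "w").close()
}}}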
==Flowchart changes:==
Changed to nicer colours, symbols etc. for a more professional look.
Colours, size and resolution are now fully customisable. An svg bug in firefox has
been worked around so that font sizes are displayed correctly.
{{{
pipeline_printout_graph( #...
     user_colour_scheme = {"colour_scheme_index" : 1,
                           "Task to run"         : {"fillcolor" : "blue"}},
     pipeline_name = "My flowchart",
     size = (11, 8),
     dpi = 120)
}}}
==Bug Fix:==
* From "Issue 27"
Previously, Ruffus paused for one second after each job.
This accommodates poor (one second) timestamp precision in some older file systems (ext3?),
and makes sure that output from the previous tasks has a different
timestamp from that of the following task.<br><br>
Unfortunately, Ruffus (was too clever by half and) paused only when the jobs were less
than a second in duration.
Output files may be created at the end of a task, and
the timestamps checked at the beginning of the following task. We thus *always* need a
gap of > 1 second between tasks on older filesystems, whether the jobs are long or short.<br><br>
The fix is to introduce a pause before the first job of each task.
(See `one_second_per_job` in `task.py:make_job_parameter_generator(...)`)<br><br>
As before, if you are using a modern file system (e.g. ext4 / JFS / NTFS), you can avoid these unnecessary pauses by turning off the `one_second_per_job` flag:
{{{
pipeline_run(one_second_per_job = False)
}}}
* From "Issue 30"
@split with empty input files no longer crashes Ruffus.
==Documentation changes:==
* New bioinformatics example
* New contributed Gallery of flowcharts

= v. 2.1.1=
_12/March/2010_
==Bug Fix:==
* From "Issue 26"
The code "with job_limit_semaphore" broke compatibility with python 2.5<br>
Thanks to patch from S. Binet
* From "Issue 25"
@merge erroneously forwarded single arguments to tasks as lists<br>
Thanks to A. Heger.
* @transform(..., suffix(...), inputs(...))
Suffix substitution should not have been taking place within `inputs()`. Ruffus
now passes file names to `inputs()` without suffix substitution.
`Regex()` regular expression substitution continues to take place within `inputs()`<br>
However, see changes in v. 2.2
==Documentation changes:==
* New step in tutorial emphasising the value of pipeline_printout(...) in pipeline development
* pipeline_printout discussion in the manual.
* @jobs_limit directive described in the manual.
* Advanced uses of @split described in the manual.
* touch_files_only parameter described in the manual.
* `add_inputs(...)` parameter described in the manual.
==`@transform(.., add_inputs(...))`==
* `inputs(...)` allowed the addition of extra input dependencies / parameters for each job.
For example, compiling a source file might require pulling in a corresponding
header file.
However, replacing all the input parameters always seemed a very blunt instrument
just to inject an extra dependency (e.g. a header file).
* `add_inputs(...)`, as the name suggests, just adds the additional items to the input parameters
{{{
from ruffus import *

@transform(["a.input", "b.input"], suffix(".input"), add_inputs("just.1.more","just.2.more"), ".output")
def task(i, o):
    ""
}}}
produces:
{{{
    Job = [[a.input, just.1.more, just.2.more] -> a.output]
    Job = [[b.input, just.1.more, just.2.more] -> b.output]
}}}
* like `inputs`, `add_inputs` accepts strings, tasks and globs
This minor syntactic change promises to add much clarity to some of our
Ruffus code.
* `add_inputs()` is available for `@transform`, `@collate` and `@split`

= v. 2.1.0=
_2/March/2010_
==Bug Fix:==
* From "Issue 25".
Regression from v. 2.0.10:
@files forwarding single arguments as lists.
(Thanks to A. Heger)
==@jobs_limit directive==
* Some tasks are resource intensive and too many jobs should not be run at the
same time. Examples include disk intensive operations such as unzipping, or
downloading from FTP sites.
Adding
{{{
@jobs_limit(4)
@transform(new_data_list, suffix(".big_data.gz"), ".big_data")
def unzip(i, o):
    "unzip code goes here"
}}}
would limit the unzip operation to 4 jobs at a time, even if the rest of the
pipeline runs highly in parallel.
(Thanks to R. Young for suggesting this.)

= v. 2.0.10=
_27/February/2010_
==pipeline_run(..., touch_files_only = True)==
* This will only `touch` output files for each job without running the
python function. I.e. the output files are updated if they are old, or created
if missing.
This can be useful for simulating a pipeline run so that all files look as
if they are up-to-date.
Caveats:
* This may not work correctly where output files are only determined at runtime, e.g. with @split
* Only the output from pipelined jobs which are currently out-of-date will be touched.
In other words, the pipeline runs *as normal*; the only difference is that the
output files are touched instead of being created by the python task functions
which would otherwise have been invoked.
==parameter substitution for inputs(...)==
* The inputs(...) parameter in @transform and @collate can now take tasks and globs,
and these will be substituted appropriately (after regular expression replacement).
==Bug Fix:==
* From "Issue 21".
Empty @files specifications no longer throw exceptions.
If verbose logging is on, a warning is printed.
(Thanks to A. Heger)

= v. 2.0.9=
_25/February/2010_
==Bug Fix:==
* From "Issue 10".
Source code directory under svn is now in /ruffus rather than src/ruffus
(Thanks to P.J. Davis)
* Better display of @split parameters when logging output
The output parameters in @split should not be expanded
if they are wildcards. This was previously handled as a special case. Now
all parameter factories return two sets of parameters:
the first to go to jobs, the second for displaying in trace logs.
* pipeline_printout now defaults to a verbosity of 1. A verbosity of 0 produces no output.
(Thanks to L.S.G)
* The "Start Task" log message at verbosity of 3 was misleading.
This is only printed when the task enters the queue.
If there are multiple independent tasks, they may all enter the queue at the
same time even with multiprocess=1. Jobs will still be run one at a time.
(Thanks to C. Nellaker.)
==Advanced form of split:==
* Previously, split only took one set of inputs (tasks/files/globs) and split these into an indeterminate number of outputs.
The new advanced form of split takes multiple inputs, and splits EACH of these
further. I.e. it is like a combination of @split and @transform.
For example:
{{{
@split(get_files, regex(r"(.+).original"), r"\1.*.split")
def split_files(i, o):
    pass
}}}
This experimental feature will be in beta without documentation. Caveat utilitor!

= v. 2.0.8=
_22/January/2010_
==Bug Fix:==
* Now accepts unicode file names:
Change `isinstance(x, str)` to `isinstance(x, basestring)`
(Thanks to P.J. Davis for contributing this.)
* inputs(...) now raises an exception when passed multiple arguments.
If the input parameter is being passed a tuple, add an extra set of enclosing
brackets. Documentation updated accordingly.
(Thanks to Z. Harris for spotting this.)
* Tasks where regular expressions are incorrectly specified are a great source of frustration
and puzzlement.
Now, if no regular expression matches occur, a warning is printed.
(Thanks to C. Nellaker for suggesting this)

= v. 2.0.7=
_11/December/2009_
==Bug Fix:==
* Graph printout blew up because of a missing run time data error
(Thanks to A. Heger for reporting this!)

= v. 2.0.6=
_10/December/2009_
==Bug Fix:==
* several minor bugs
* better error messages when encountering decorator problems while checking if the pipeline is up to date
* Exception when output specifications in @split were expanded (unnecessarily) in logging.
(Thanks to N. Spies for reporting this!)

= v. 2.0.4=
_22/November/2009_
==Bug Fix:==
* task.get_job_names() dies for jobs with no parameters
* JobSignalledBreak was not exported

= v. 2.0.3=
_18/November/2009_
==Bug Fix:==
* @transform accepts single file names. Thanks Chris N.

= v. 2.0.2=
_18/November/2009_
==Better Logging:==
* pipeline_printout output much prettier
* pipeline_run at high verbose levels shows which tasks are being checked
to see if they are up-to-date or not
==Documentation:==
* New tutorial
* New manual
* pretty code figures

= v. 2.0.1=
_18/November/2009_
All unit tests passed
==Bug Fix:==
* Numerous bugs to do with ordering of glob / job output consistency

= v. 2.0.1 beta4=
_16/November/2009_
==Bug Fix:==
* Fixed problems with tasks depending on @split

= v. 2.0 beta=
_30/October/2009_
With the experience and feedback of the past few months, I have reworked **Ruffus**
completely, mainly to make the syntax more transparent, with fewer gotchas.
Previous limitations to do with parameters have been removed.
The experience with what *Ruffus* calls "Indicator Objects" has been very positive
and there are more of them.
These are dummy classes with obvious names like "regex" or "suffix" which indicate the
type of optional parameters, much like named parameters.
==New Decorators:==
* @split
* @merge
* @transform
* @collate
==Deprecated Decorators:==
* @files_re
Functionality is divided among the new decorators
==New Features:==
* Files can be chained from task to task; implicit dependencies are inferred automatically
* Limitations on parameters removed. Any degree of nesting is allowed.
* Strings containing glob letters ``[]?*`` are automatically inferred as globs and expanded
* Strings in input and output parameters are assumed to be filenames, whatever nested data structures they are found in
==Documentation:==
* New documentation almost complete
* New Simplified 7 step tutorial
* New manual work in progress
==Bug Fix:==
* Scheduling errors

= v. 1.1.4=
_15/October/2009_
==New Feature:==
* Tasks can get their input by automatically chaining to the output from one or more parent tasks using `@files_re`
* Added example showing how files can be split up into multiple jobs and then recombined
# Run `test/test_filesre_split_and_combine.py` with `-v|--verbose` `-s|--start_again`
# Run with `-D|--debug` to test.
* Documentation to follow
==Bug Fix:==
* Scheduling race conditions

= v. 1.1.3=
_14/October/2009_
==Bug Fix:==
* Minor (but show-stopping) bug in task.generic_job_descriptor

= v. 1.1.2=
_9/October/2009_
==Bug Fix:==
* Nasty (long-standing) bug which caused single-job tasks decorated only with `@follows(mkdir(...))` to be caught in an infinite loop
==Code Changes:==
* Add example of combining multiple input files depending on a regular expression pattern.
# Run `test/test_filesre_combine.py` with -v (verbose)
# Run with -D (debug) to test.

= v. 1.1.1=
_8/October/2009_
==New Feature:==
* _Combine multiple input files using a regular expression_
* Added `combine` syntax to `@files_re` decorators:
* Documentation to follow...
* Example from `src/ruffus/test/test_branching_dependencies.py`:
{{{
@files_re('*.*', r'(.*/)(.*\.[345]$)', combine(r'\1\2'), r'\1final.6')
def test(input_files, output_files):
    pass
}}}
* will take all files in the current directory
* will identify files which end in `.3`, `.4` and `.5` as input files
* will use `final.6` as the output file
* `input_files == [a.3, a.4, b.3, b.5]` (for example)
* `output_files == [final.6]`
==Bug Fix:==
* All (known) bugs for running jobs from independent tasks in parallel

= v. 1.0.9=
_8/October/2009_
==New Feature:==
_Multitasking independent tasks_
* In a major piece of retooling, jobs from independent tasks which do not depend on each other will be run in parallel.
* This involves major changes to the scheduling code.
* Please contact me asap if anything breaks.
==Code Changes:==
* Add example of independent tasks running concurrently in
`test/test_branching_dependencies.py`
* Run with -v (verbose) and -j 1 or -j 10 to show the indeterminacy of multiprocessing.
* Run with -D (debug) to test.

= v. 1.0.8=
_12/August/2009_
==Documentation:==
* Errors fixed. Thanks to Matteo Bertini!
==Code Changes:==
* Added functions which print out job parameters more prettily.
* `task.shorten_filenames_encoder`
* `task.ignore_unknown_encoder`
* Parameters which look like file paths will only have the file part printed
(i.e. `"/a/b/c" -> 'c'`)
* Test scripts `simpler_with_shared_logging.py` and `test_follows_mkdir.py`
have been changed to test for this.

= v. 1.0.7=
_17/June/2009_
==Code Changes:==
* Added `proxy_logger` module for accessing a shared log across multiple jobs in
different processes.

= v. 1.0.6=
_12/June/2009_
==Bug fix:==
* _Ruffus_ version module (`ruffus_version.py`) links fixed
Soft links in Linux do not travel well
* `mkdir` now can take a list of strings
added test case
==Documentation:==
* Added history of changes

= v. 1.0.5=
_11/June/2009_
==Bug fix:==
* Changed "graph_printout" to `pipeline_printout_graph` in documentation.
This function had been renamed in the code but not in the documentation :-(
==Documentation:==
* Added example for sharing / synchronising data between jobs.
This shows how different jobs can write to a common log file while still leveraging the full power of _ruffus_.
==Code Changes:==
* The graph and print_dependencies modules are no longer exported by default from task.
Please email me if this breaks anything.
* More informative error message when referring to unadorned (without _Ruffus_ decorators) python functions as pipelined Tasks
* Added Ruffus version module `ruffus_version.py`

= v. 1.0.4=
_05/June/2009_
==Bug fix: ==
* `task.task_names_to_tasks` did not include tasks specified by function rather than name
* `task.run_all_jobs_in_task` did not work properly without multiprocessing (# of jobs = 1)
* `task.pipeline_run` only uses multiprocessing pools if `multiprocess` (# of jobs) > 1
==Changes to allow python 2.4/2.5 to run:==
* `setup.py` changed to remove dependency
* `simplejson` can be loaded instead of the python 2.6 `json` module
* Changed `NamedTemporaryFile` to `mkstemp` because the delete parameter is not available before python 2.6
==Windows programmes==
It is necessary to protect the "entry point" of the program under Windows.
Otherwise, a new process will be created recursively, like the magician's apprentice.
See: http://docs.python.org/library/multiprocessing.html#multiprocessing-programming
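A minimal sketch of the usual idiom (task and file names illustrative):
{{{
from ruffus import *

@files(None, "out.file")
def a_task(input_file, output_file):
    open(output_file, "w").close()

# guard the entry point so that worker processes spawned by multiprocessing
# on Windows do not import this module and re-launch the pipeline recursively
if __name__ == '__main__':
    pipeline_run([a_task], multiprocess = 4)
}}}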

= v. 1.0.3=
_04/June/2009_
==Documentation ==
Including SGE `qrsh` workaround in FAQ.

= v. 1.0.1=
_22/May/2009_
==Add simple tutorial.==
No major bugs so far...!!

= v. 1.0.0 beta =
_28/April/2009_
Initial Release in Oxford