/tools/fastx_toolkit/fastx_collapser.xml

https://bitbucket.org/cistrome/cistrome-harvard/

<tool id="cshl_fastx_collapser" name="Collapse">
  <description>sequences</description>
  <requirements><requirement type="package">fastx_toolkit</requirement></requirements>
  <command>zcat -f '$input' | fastx_collapser -v -o '$output'
#if $input.ext == "fastqsanger":
-Q 33
#end if
  </command>
  <inputs>
    <param format="fasta,fastqsanger,fastqsolexa" name="input" type="data" label="Library to collapse" />
  </inputs>
  <!-- The order of sequences in the test output differs between 32-bit and 64-bit machines.
  <tests>
    <test>
      <param name="input" value="fasta_collapser1.fasta" />
      <output name="output" file="fasta_collapser1.out" />
    </test>
  </tests>
  -->
  <outputs>
    <data format="fasta" name="output" metadata_source="input" />
  </outputs>
  <help>

**What it does**

This tool collapses identical sequences in a FASTA or FASTQ file into a single sequence. The output is always written as FASTA.

--------

**Example**

Example Input File (Sequence "ATAT" appears multiple times)::

    >CSHL_2_FC0042AGLLOO_1_1_605_414
    TGCG
    >CSHL_2_FC0042AGLLOO_1_1_537_759
    ATAT
    >CSHL_2_FC0042AGLLOO_1_1_774_520
    TGGC
    >CSHL_2_FC0042AGLLOO_1_1_742_502
    ATAT
    >CSHL_2_FC0042AGLLOO_1_1_781_514
    TGAG
    >CSHL_2_FC0042AGLLOO_1_1_757_487
    TTCA
    >CSHL_2_FC0042AGLLOO_1_1_903_769
    ATAT
    >CSHL_2_FC0042AGLLOO_1_1_724_499
    ATAT

Example Output file::

    >1-1
    TGCG
    >2-4
    ATAT
    >3-1
    TGGC
    >4-1
    TGAG
    >5-1
    TTCA

.. class:: infomark

Original Sequence Names / Lane descriptions (e.g. "CSHL_2_FC0042AGLLOO_1_1_742_502") are discarded.

The output sequence name is composed of two numbers: the first is the sequence's number, the second is the multiplicity value.

The following output::

    >2-4
    ATAT

means that the sequence "ATAT" is the second sequence in the output file, and it appeared 4 times in the input FASTA file.

------

This tool is based on `FASTX-toolkit`__ by Assaf Gordon.

 .. __: http://hannonlab.cshl.edu/fastx_toolkit/

  </help>
</tool>
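
The naming scheme described in the help text (sequence number, then multiplicity) can be illustrated with a small Python sketch. This is not the FASTX-toolkit implementation: the function names `read_sequences` and `collapse` are hypothetical, and the sketch assumes output records are numbered in order of first appearance, which matches the example in the help text. The real `fastx_collapser` may order its output differently, as the commented-out test notes.

```python
def read_sequences(lines):
    """Yield each sequence from FASTA lines, joining multi-line records."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # A new header ends the previous record; the name itself is discarded.
            if parts:
                yield "".join(parts)
                parts = []
        elif line:
            parts.append(line)
    if parts:
        yield "".join(parts)


def collapse(lines):
    """Return FASTA lines with one record per distinct sequence.

    Each header is ">N-M": N is the record's number in the output and
    M is how many times the sequence occurred in the input.
    Assumption: records are numbered by first appearance in the input.
    """
    counts = {}  # dicts keep insertion order in Python 3.7+
    for seq in read_sequences(lines):
        counts[seq] = counts.get(seq, 0) + 1

    out = []
    for number, (seq, count) in enumerate(counts.items(), start=1):
        out.append(f">{number}-{count}")
        out.append(seq)
    return out


# The example input from the help text above.
EXAMPLE_INPUT = [
    ">CSHL_2_FC0042AGLLOO_1_1_605_414", "TGCG",
    ">CSHL_2_FC0042AGLLOO_1_1_537_759", "ATAT",
    ">CSHL_2_FC0042AGLLOO_1_1_774_520", "TGGC",
    ">CSHL_2_FC0042AGLLOO_1_1_742_502", "ATAT",
    ">CSHL_2_FC0042AGLLOO_1_1_781_514", "TGAG",
    ">CSHL_2_FC0042AGLLOO_1_1_757_487", "TTCA",
    ">CSHL_2_FC0042AGLLOO_1_1_903_769", "ATAT",
    ">CSHL_2_FC0042AGLLOO_1_1_724_499", "ATAT",
]
```

Calling `collapse(EXAMPLE_INPUT)` yields the five records shown in the example output, with "ATAT" reported as `>2-4`.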