PageRenderTime 51ms CodeModel.GetById 13ms RepoModel.GetById 0ms app.codeStats 1ms

/Bio/Tools/Infernal.pm

http://github.com/bioperl/bioperl-live
Perl | 492 lines | 440 code | 25 blank | 27 comment | 21 complexity | 2e5596861c8280666fae28ea17d64f04 MD5 | raw file
Possible License(s): GPL-3.0, AGPL-1.0
  1. #
  2. # BioPerl module for Bio::Tools::Infernal
  3. #
  4. # Please direct questions and support issues to <bioperl-l@bioperl.org>
  5. #
  6. # Cared for by Chris Fields <cjfields-at-uiuc-dot-edu>
  7. #
  8. # Copyright Chris Fields
  9. #
  10. # You may distribute this module under the same terms as perl itself
  11. # POD documentation - main docs before the code
  12. =head1 NAME
  13. Bio::Tools::Infernal - A parser for Infernal output
  14. =head1 SYNOPSIS
  15. use Bio::Tools::Infernal;
  16. my $parser = Bio::Tools::Infernal->new(-file => $rna_output,
  17. -motiftag => 'misc_binding'
  18. -desctag => 'Lysine riboswitch',
  19. -cm => 'RF00168',
  20. -rfam => 'RF00168',
  21. -minscore => 20);
  22. #parse the results, get a Bio::SeqFeature::FeaturePair
  23. while( my $motif = $parser->next_prediction) {
  24. # do something here
  25. }
  26. =head1 DESCRIPTION
  27. This is a highly experimental parser for Infernal output from the cmsearch
  28. program. At some point it is anticipated that this will morph into a proper
  29. SearchIO parser, along with the related RNAMotif and ERPIN tools.
  30. The Infernal suite of programs are used for generating RNA CM (covariance
  31. models) and searching sequences using CMs to locate potentially similar
  32. structures. The program is under active development; it is anticiapted that
  33. this will support the latest version available.
  34. This parser has been tested and is capable of parsing Infernal 0.7 and 0.71
  35. output. However, future Infernal versions may break parsing as the output is
  36. constantly evolving, so keep an eye on this space for additional notes.
  37. Currently data is parsed into a Bio::SeqFeature::FeaturePair object, consisting
  38. of a query (the covariance model) and the hit (sequence searched).
  39. Model data is accessible via the following:
  40. Data SeqFeature::FeaturePair Note
  41. --------------------------------------------------------------------------
  42. primary tag $sf->primary_tag Rfam ID (if passed to new())
  43. start $sf->start Based on CM length
  44. end $sf->end Based on CM length
  45. score $sf->score Bit score
  46. strand $sf->strand 0 (CM does not have a strand)
  47. seqid $sf->seq_id Rfam ID (if passed to new())
  48. display name $sf->feature1->display_name CM name (if passed to new())
  49. source $sf->feature1->source tag 'Infernal' followed by version
  50. Hit data is accessible via the following:
  51. Data SeqFeature::FeaturePair Note
  52. ------------------------------------------------------------------
  53. start $sf->hstart
  54. end $sf->hend
  55. score(bits) $sf->hscore
  56. strand $sf->hstrand
  57. seqid $sf->hseqid
  58. Primary Tag $sf->hprimary_tag
  59. Source Tag $sf->hsource_tag
  60. Added FeaturePair tags are :
  61. secstructure - entire description line (in case the regex used for
  62. sequence ID doesn't adequately catch the name
  63. model - name of the descriptor file (may include path to file)
  64. midline - contains structural information from the descriptor
  65. used as a query
  66. hit - sequence of motif, separated by spaces according to
  67. matches to the structure in the descriptor (in
  68. SecStructure).
  69. seqname - raw sequence name (for downstream parsing if needed)
  70. An additional parameter ('minscore') is added due to the huge number
  71. of spurious hits generated by cmsearch. This screens data, only building
  72. and returning objects when a minimal bitscore is present.
  73. See t/rnamotif.t for example usage.
  74. =head1 FEEDBACK
  75. =head2 Mailing Lists
  76. User feedback is an integral part of the evolution of this and other
  77. Bioperl modules. Send your comments and suggestions preferably to
  78. the Bioperl mailing list. Your participation is much appreciated.
  79. bioperl-l@bioperl.org - General discussion
  80. http://bioperl.org/wiki/Mailing_lists - About the mailing lists
  81. =head2 Support
  82. Please direct usage questions or support issues to the mailing list:
  83. I<bioperl-l@bioperl.org>
  84. rather than to the module maintainer directly. Many experienced and
  85. reponsive experts will be able look at the problem and quickly
  86. address it. Please include a thorough description of the problem
  87. with code and data examples if at all possible.
  88. =head2 Reporting Bugs
  89. Report bugs to the Bioperl bug tracking system to help us keep track
  90. of the bugs and their resolution. Bug reports can be submitted via the
  91. web:
  92. https://redmine.open-bio.org/projects/bioperl/
  93. =head1 AUTHOR - Chris Fields
  94. Email cjfields-at-uiuc-dot-edu
  95. =head1 APPENDIX
  96. The rest of the documentation details each of the object methods.
  97. Internal methods are usually preceded with a _
  98. =cut
  99. # Let the code begin...
  100. package Bio::Tools::Infernal;
  101. use strict;
  102. use Bio::SeqFeature::Generic;
  103. use Bio::SeqFeature::FeaturePair;
  104. use Data::Dumper;
  105. use base qw(Bio::Tools::AnalysisResult);
  106. our($MotifTag,$SrcTag,$DescTag) = qw(misc_binding Infernal infernal);
  107. our $MINSCORE = 0;
  108. our $DEFAULT_VERSION = '0.71';
  109. =head2 new
  110. Title : new
  111. Usage : my $obj = Bio::Tools::Infernal->new();
  112. Function: Builds a new Bio::Tools::Infernal object
  113. Returns : an instance of Bio::Tools::Infernal
  114. Args : -fh/-file - for input filehandle/filename
  115. -motiftag - primary tag used in gene features (default 'misc_binding')
  116. -desctag - tag used for display_name name (default 'infernal')
  117. -srctag - source tag used in all features (default 'Infernal')
  118. -rfam - Rfam id number
  119. -cm - covariance model used in analysis (may be same as rfam #)
  120. -minscore - minimum score (simple screener, since Infernal generates
  121. a ton of spurious hits)
  122. -version - Infernal program version
  123. =cut
  124. # yes, this is actually _initialize, but the args are passed to
  125. # the constructor.
  126. # see Bio::Tools::AnalysisResult for further details
  127. sub _initialize {
  128. my($self,@args) = @_;
  129. $self->warn('Use of this module is deprecated. Use Bio::SearchIO::infernal instead');
  130. $self->SUPER::_initialize(@args);
  131. my ($motiftag,$desctag,$srctag,$rfam,$cm,$ms,$ver) =
  132. $self->SUPER::_rearrange([qw(MOTIFTAG
  133. DESCTAG
  134. SRCTAG
  135. RFAM
  136. CM
  137. MINSCORE
  138. VERSION
  139. )],@args);
  140. $self->motif_tag(defined $motiftag ? $motiftag : $MotifTag);
  141. $self->source_tag(defined $srctag ? $srctag : $SrcTag);
  142. $self->desc_tag(defined $desctag ? $desctag : $DescTag);
  143. $cm && $self->covariance_model($cm);
  144. $rfam && $self->rfam($rfam);
  145. $self->program_version(defined $ver ? $ver : $DEFAULT_VERSION);
  146. $self->minscore(defined $ms ? $ms : $MINSCORE);
  147. }
  148. =head2 motif_tag
  149. Title : motif_tag
  150. Usage : $obj->motif_tag($newval)
  151. Function: Get/Set the value used for 'motif_tag', which is used for setting the
  152. primary_tag.
  153. Default is 'misc_binding' as set by the global $MotifTag.
  154. 'misc_binding' is used here because a conserved RNA motif is capable
  155. of binding proteins (regulatory proteins), antisense RNA (siRNA),
  156. small molecules (riboswitches), or nothing at all (tRNA,
  157. terminators, etc.). It is recommended that this be changed to other
  158. tags ('misc_RNA', 'protein_binding', 'tRNA', etc.) where appropriate.
  159. For more information, see:
  160. http://www.ncbi.nlm.nih.gov/collab/FT/index.html
  161. Returns : value of motif_tag (a scalar)
  162. Args : on set, new value (a scalar or undef, optional)
  163. =cut
  164. sub motif_tag{
  165. my $self = shift;
  166. return $self->{'motif_tag'} = shift if @_;
  167. return $self->{'motif_tag'};
  168. }
  169. =head2 source_tag
  170. Title : source_tag
  171. Usage : $obj->source_tag($newval)
  172. Function: Get/Set the value used for the 'source_tag'.
  173. Default is 'Infernal' as set by the global $SrcTag
  174. Returns : value of source_tag (a scalar)
  175. Args : on set, new value (a scalar or undef, optional)
  176. =cut
  177. sub source_tag{
  178. my $self = shift;
  179. return $self->{'source_tag'} = shift if @_;
  180. return $self->{'source_tag'};
  181. }
  182. =head2 desc_tag
  183. Title : desc_tag
  184. Usage : $obj->desc_tag($newval)
  185. Function: Get/Set the value used for the query motif. This will be placed in
  186. the tag '-display_name'. Default is 'infernal' as set by the global
  187. $DescTag. Use this to manually set the descriptor (motif searched for).
  188. Since there is no way for this module to tell what the motif is from the
  189. name of the descriptor file or the Infernal output, this should
  190. be set every time an Infernal object is instantiated for clarity
  191. Returns : value of exon_tag (a scalar)
  192. Args : on set, new value (a scalar or undef, optional)
  193. =cut
  194. sub desc_tag{
  195. my $self = shift;
  196. return $self->{'desc_tag'} = shift if @_;
  197. return $self->{'desc_tag'};
  198. }
  199. =head2 covariance_model
  200. Title : covariance_model
  201. Usage : $obj->covariance_model($newval)
  202. Function: Get/Set the value used for the covariance model used in the analysis.
  203. Returns : value of exon_tag (a scalar)
  204. Args : on set, new value (a scalar or undef, optional)
  205. =cut
  206. sub covariance_model{
  207. my $self = shift;
  208. return $self->{'_cmodel'} = shift if @_;
  209. return $self->{'_cmodel'};
  210. }
  211. =head2 rfam
  212. Title : rfam
  213. Usage : $obj->rfam($newval)
  214. Function: Get/Set the Rfam accession number
  215. Returns : value of exon_tag (a scalar)
  216. Args : on set, new value (a scalar or undef, optional)
  217. =cut
  218. sub rfam {
  219. my $self = shift;
  220. return $self->{'_rfam'} = shift if @_;
  221. return $self->{'_rfam'};
  222. }
  223. =head2 minscore
  224. Title : minscore
  225. Usage : $obj->minscore($newval)
  226. Function: Get/Set the minimum score threshold for generating SeqFeatures
  227. Returns : value of exon_tag (a scalar)
  228. Args : on set, new value (a scalar or undef, optional)
  229. =cut
  230. sub minscore {
  231. my $self = shift;
  232. return $self->{'_minscore'} = shift if @_;
  233. return $self->{'_minscore'};
  234. }
  235. =head2 program_version
  236. Title : program_version
  237. Usage : $obj->program_version($newval)
  238. Function: Get/Set the Infernal program version
  239. Returns : value of exon_tag (a scalar)
  240. Args : on set, new value (a scalar or undef, optional)
  241. Note: this is set to $DEFAULT_VERSION by, um, default
  242. =cut
  243. sub program_version {
  244. my $self = shift;
  245. return $self->{'_program_version'} = shift if @_;
  246. return $self->{'_program_version'};
  247. }
  248. =head2 analysis_method
  249. Usage : $obj->analysis_method();
  250. Purpose : Inherited method. Overridden to ensure that the name matches
  251. /Infernal/i.
  252. Returns : String
  253. Argument : n/a
  254. =cut
  255. sub analysis_method {
  256. my ($self, $method) = @_;
  257. if($method && ($method !~ /Infernal/i)) {
  258. $self->throw("method $method not supported in " . ref($self));
  259. }
  260. return $self->SUPER::analysis_method($method);
  261. }
  262. =head2 next_feature
  263. Title : next_feature
  264. Usage : while($gene = $obj->next_feature()) {
  265. # do something
  266. }
  267. Function: Returns the next gene structure prediction of the RNAMotif result
  268. file. Call this method repeatedly until FALSE is returned.
  269. The returned object is actually a SeqFeatureI implementing object.
  270. This method is required for classes implementing the
  271. SeqAnalysisParserI interface, and is merely an alias for
  272. next_prediction() at present.
  273. Returns : A Bio::Tools::Prediction::Gene object.
  274. Args : None (at present)
  275. =cut
  276. sub next_feature {
  277. my ($self,@args) = @_;
  278. # even though next_prediction doesn't expect any args (and this method
  279. # does neither), we pass on args in order to be prepared if this changes
  280. # ever
  281. return $self->next_prediction(@args);
  282. }
  283. =head2 next_prediction
  284. Title : next_prediction
  285. Usage : while($gene = $obj->next_prediction()) {
  286. # do something
  287. }
  288. Function: Returns the next gene structure prediction of the RNAMotif result
  289. file. Call this method repeatedly until FALSE is returned.
  290. Returns : A Bio::SeqFeature::Generic object
  291. Args : None (at present)
  292. =cut
  293. sub next_prediction {
  294. my ($self) = @_;
  295. my ($start, $end, $strand, $score);
  296. my %hsp_key = ('0' => 'structure',
  297. '1' => 'model',
  298. '2' => 'midline',
  299. '3' => 'hit');
  300. my $line;
  301. PARSER:
  302. while($line = $self->_readline) {
  303. next if $line =~ m{^\s+$};
  304. if ($line =~ m{^sequence:\s+(\S+)} ){
  305. $self->_current_hit($1);
  306. next PARSER;
  307. } elsif ($line =~ m{^hit\s+\d+\s+:\s+(\d+)\s+(\d+)\s+(\d+\.\d+)\s+bits}xms) {
  308. $strand = 1;
  309. ($start, $end, $score) = ($1, $2, $3);
  310. if ($start > $end) {
  311. ($start, $end) = ($end, $start);
  312. $strand = -1;
  313. }
  314. #$self->debug(sprintf("Hit: %-30s\n\tStrand:%-4d Start:%-6d End:%-6d Score:%-10s\n",
  315. # $self->_current_hit, $strand, $start, $end, $score));
  316. } elsif ($line =~ m{^(\s+)[<>\{\}\(\)\[\]:_,-\.]+}xms) { # start of HSP
  317. $self->_pushback($line); # set up for loop
  318. # what is length of the gap to the structure data?
  319. my $offset = length($1);
  320. my ($ct, $strln) = 0;
  321. my $hsp;
  322. HSP:
  323. while ($line = $self->_readline) {
  324. next if $line =~ m{^\s*$}; # toss empty lines
  325. chomp $line;
  326. # exit loop if at end of file or upon next hit/HSP
  327. if (!defined($line) || $line =~ m{^(sequence|hit)}) {
  328. $self->_pushback($line);
  329. last HSP;
  330. }
  331. # iterate to keep track of each line (4 lines per hsp block)
  332. my $iterator = $ct%4;
  333. # strlen set only with structure lines (proper length)
  334. $strln = length($line) if $iterator == 0;
  335. # only grab the data needed (hit start and stop in hit line above;
  336. # query start, end are from the actual query length (entire hit is
  337. # mapped to CM data, so all CM data is represented
  338. # yes this is kinda clumsy, but I'll probably morph this into
  339. # a proper SearchIO module soon. For now this works...
  340. my $data = substr($line, $offset, $strln-$offset);
  341. $hsp->{ $hsp_key{$iterator} } .= $data;
  342. $ct++;
  343. }
  344. if ($self->minscore < $score) {
  345. my ($name, $program, $rfam, $cm, $dt, $st, $mt) =
  346. ($self->_current_hit, $self->analysis_method, $self->rfam,
  347. $self->covariance_model, $self->desc_tag, $self->source_tag,
  348. $self->motif_tag);
  349. my $ver = $self->program_version || '';
  350. my $qid = $name;
  351. if ($name =~ m{(?:gb|gi|emb|dbj|sp|pdb|bbs|ref|lcl)\|(\d+)((?:\:|\|)\w+\|(\S*.\d+)\|)?}xms) {
  352. $qid = $1;
  353. }
  354. my $fp = Bio::SeqFeature::FeaturePair->new();
  355. my $strlen = $hsp->{'model'} =~ tr{A-Za-z}{A-Za-z}; # gaps don't count
  356. my $qf = Bio::SeqFeature::Generic->new( -primary_tag => $mt,
  357. -source_tag => "$st $ver",
  358. -display_name => $cm || '',
  359. -score => $score,
  360. -start => 1,
  361. -end => $strlen,
  362. -seq_id => $rfam || '',
  363. -strand => 0, # covariance model does not have a strand
  364. );
  365. my $hf = Bio::SeqFeature::Generic->new( -primary_tag => $mt,
  366. -source_tag => "$st $ver",
  367. -display_name => $dt || '',
  368. -score => $score,
  369. -start => $start,
  370. -end => $end,
  371. -seq_id => $qid,
  372. -strand => $strand
  373. );
  374. $fp->feature1($qf);
  375. $fp->feature2($hf); # should emphasis be on the hit?
  376. $fp->add_tag_value('secstructure', $hsp->{'structure'});
  377. $fp->add_tag_value('model', $hsp->{'model'});
  378. $fp->add_tag_value('midline', $hsp->{'midline'});
  379. $fp->add_tag_value('hit', $hsp->{'hit'});
  380. $fp->add_tag_value('seq_name', $name);
  381. $fp->display_name($dt);
  382. return $fp;
  383. } else {
  384. next PARSER;
  385. }
  386. }
  387. }
  388. return (defined($line)) ? 1 : 0;
  389. }
  390. sub _current_hit {
  391. my $self = shift;
  392. return $self->{'_currhit'} = shift if @_;
  393. return $self->{'_currhit'};
  394. }
  395. 1;