PageRenderTime 134ms CodeModel.GetById 21ms RepoModel.GetById 1ms app.codeStats 0ms

/Bio/Matrix/PSM/SiteMatrixI.pm

http://github.com/bioperl/bioperl-live
Perl | 539 lines | 531 code | 6 blank | 2 comment | 0 complexity | fb64cba52519295bd3ec9ec8a10f860e MD5 | raw file
Possible License(s): GPL-3.0, AGPL-1.0
  1. =head1 NAME
  2. Bio::Matrix::PSM::SiteMatrixI - SiteMatrixI implementation, holds a
  3. position scoring matrix (or position weight matrix) and log-odds
  4. =head1 SYNOPSIS
  5. # You cannot use this module directly; see Bio::Matrix::PSM::SiteMatrix
  6. # for an example implementation
  7. =head1 DESCRIPTION
  8. SiteMatrix is designed to provide some basic methods when working with position
  9. scoring (weight) matrices, such as transcription factor binding sites for
  10. example. A DNA PSM consists of four vectors with frequencies {A,C,G,T}. This is
  11. the minimum information you should provide to construct a PSM object. The
  12. vectors can be provided as strings with frequenciesx10 rounded to an int, going
  13. from {0..a} and 'a' represents the maximum (10). This is like MEME's compressed
  14. representation of a matrix and it is quite useful when working with relational
  15. DB. If arrays are provided as an input (references to arrays actually) they can
  16. be any number, real or integer (frequency or count).
  17. When creating the object you can ask the constructor to make a simple pseudo
  18. count correction by adding a number (typically 1) to all positions (with the
  19. -correction option). After adding the number the frequencies will be
  20. calculated. Only use correction when you supply counts, not frequencies.
  21. Throws an exception if: You mix as an input array and string (for example A
  22. matrix is given as array, C - as string). The position vector is (0,0,0,0). One
  23. of the probability vectors is shorter than the rest.
  24. Summary of the methods I use most frequently (details bellow):
  25. iupac - return IUPAC compliant consensus as a string
  26. score - Returns the score as a real number
  27. IC - information content. Returns a real number
  28. id - identifier. Returns a string
  29. accession - accession number. Returns a string
  30. next_pos - return the sequence probably for each letter, IUPAC
  31. symbol, IUPAC probability and simple sequence
  32. consenus letter for this position. Rewind at the end. Returns a hash.
  33. pos - current position get/set. Returns an integer.
  34. regexp - construct a regular expression based on IUPAC consensus.
  35. For example AGWV will be [Aa][Gg][AaTt][AaCcGg]
  36. width - site width
  37. get_string - gets the probability vector for a single base as a string.
  38. get_array - gets the probability vector for a single base as an array.
  39. get_logs_array - gets the log-odds vector for a single base as an array.
  40. New methods, which might be of interest to anyone who wants to store PSM in a relational
  41. database without creating an entry for each position is the ability to compress the
  42. PSM vector into a string with losing usually less than 1% of the data.
  43. this can be done with:
  44. my $str=$matrix->get_compressed_freq('A');
  45. or
  46. my $str=$matrix->get_compressed_logs('A');
  47. Loading from a database should be done with new, but is not yest implemented.
  48. However you can still uncompress such string with:
  49. my @arr=Bio::Matrix::PSM::_uncompress_string ($str,1,1); for PSM
  50. or
  51. my @arr=Bio::Matrix::PSM::_uncompress_string ($str,1000,2); for log odds
  52. =head1 FEEDBACK
  53. =head2 Mailing Lists
  54. User feedback is an integral part of the evolution of this and other
  55. Bioperl modules. Send your comments and suggestions preferably to one
  56. of the Bioperl mailing lists. Your participation is much appreciated.
  57. bioperl-l@bioperl.org - General discussion
  58. http://bioperl.org/wiki/Mailing_lists - About the mailing lists
  59. =head2 Support
  60. Please direct usage questions or support issues to the mailing list:
  61. I<bioperl-l@bioperl.org>
  62. rather than to the module maintainer directly. Many experienced and
  63. reponsive experts will be able look at the problem and quickly
  64. address it. Please include a thorough description of the problem
  65. with code and data examples if at all possible.
  66. =head2 Reporting Bugs
  67. Report bugs to the Bioperl bug tracking system to help us keep track
  68. the bugs and their resolution. Bug reports can be submitted via the
  69. web:
  70. https://github.com/bioperl/bioperl-live/issues
  71. =head1 AUTHOR - Stefan Kirov
  72. Email skirov@utk.edu
  73. =head1 APPENDIX
  74. =cut
  75. # Let the code begin...
  76. package Bio::Matrix::PSM::SiteMatrixI;
  77. # use strict;
  78. use base qw(Bio::Root::RootI);
  79. =head2 calc_weight
  80. Title : calc_weight
  81. Usage : $self->calc_weight({A=>0.2562,C=>0.2438,G=>0.2432,T=>0.2568});
  82. Function: Recalculates the PSM (or weights) based on the PFM (the frequency matrix)
  83. and user supplied background model.
  84. Throws : if no model is supplied
  85. Example :
  86. Returns :
  87. Args : reference to a hash with background frequencies for A,C,G and T
  88. =cut
  89. sub calc_weight {
  90. my $self = shift;
  91. $self->throw_not_implemented();
  92. }
  93. =head2 next_pos
  94. Title : next_pos
  95. Usage : my %base=$site->next_pos;
  96. Function:
  97. Retrieves the next position features: frequencies and weights for
  98. A,C,G,T, the main letter (as in consensus) and the
  99. probabilty for this letter to occur at this position and
  100. the current position
  101. Throws :
  102. Example :
  103. Returns : hash (pA,pC,pG,pT,lA,lC,lG,lT,base,prob,rel)
  104. Args : none
  105. =cut
  106. sub next_pos {
  107. my $self = shift;
  108. $self->throw_not_implemented();
  109. }
  110. =head2 curpos
  111. Title : curpos
  112. Usage : my $pos=$site->curpos;
  113. Function: Gets/sets the current position. Converts to 0 if argument is minus and
  114. to width if greater than width
  115. Throws :
  116. Example :
  117. Returns : integer
  118. Args : integer
  119. =cut
  120. sub curpos {
  121. my $self = shift;
  122. $self->throw_not_implemented();
  123. }
  124. =head2 e_val
  125. Title : e_val
  126. Usage : my $score=$site->e_val;
  127. Function: Gets/sets the e-value
  128. Throws :
  129. Example :
  130. Returns : real number
  131. Args : real number
  132. =cut
  133. sub e_val {
  134. my $self = shift;
  135. $self->throw_not_implemented();
  136. }
  137. =head2 consensus
  138. Title : consensus
  139. Usage :
  140. Function: Returns the consensus
  141. Returns : string
  142. Args : (optional) threshold value 1 to 10, default 5
  143. '5' means the returned characters had a 50% or higher presence at
  144. their position
  145. =cut
  146. sub consensus {
  147. my $self = shift;
  148. $self->throw_not_implemented();
  149. }
  150. =head2 accession_number
  151. Title : accession_number
  152. Usage :
  153. Function: accession number, this will be unique id for the SiteMatrix object as
  154. well for any other object, inheriting from SiteMatrix
  155. Throws :
  156. Example :
  157. Returns : string
  158. Args : string
  159. =cut
  160. sub accession_number {
  161. my $self = shift;
  162. $self->throw_not_implemented();
  163. }
  164. =head2 width
  165. Title : width
  166. Usage : my $width=$site->width;
  167. Function: Returns the length of the site
  168. Throws :
  169. Example :
  170. Returns : number
  171. Args :
  172. =cut
  173. sub width {
  174. my $self = shift;
  175. $self->throw_not_implemented();
  176. }
  177. =head2 IUPAC
  178. Title : IUPAC
  179. Usage : my $iupac_consensus=$site->IUPAC;
  180. Function: Returns IUPAC compliant consensus
  181. Throws :
  182. Example :
  183. Returns : string
  184. Args :
  185. =cut
  186. sub IUPAC {
  187. my $self = shift;
  188. $self->throw_not_implemented();
  189. }
  190. =head2 IC
  191. Title : IC
  192. Usage : my $ic=$site->IC;
  193. Function: Information content
  194. Throws :
  195. Example :
  196. Returns : real number
  197. Args : none
  198. =cut
  199. sub IC {
  200. my $self=shift;
  201. $self->throw_not_implemented();
  202. }
  203. =head2 get_string
  204. Title : get_string
  205. Usage : my $freq_A=$site->get_string('A');
  206. Function: Returns given probability vector as a string. Useful if you want to
  207. store things in a rel database, where arrays are not first choice
  208. Throws : If the argument is outside {A,C,G,T}
  209. Example :
  210. Returns : string
  211. Args : character {A,C,G,T}
  212. =cut
  213. sub get_string {
  214. my $self=shift;
  215. $self->throw_not_implemented();
  216. }
  217. =head2 id
  218. Title : id
  219. Usage : my $id=$site->id;
  220. Function: Gets/sets the site id
  221. Throws :
  222. Example :
  223. Returns : string
  224. Args : string
  225. =cut
  226. sub id {
  227. my $self = shift;
  228. $self->throw_not_implemented();
  229. }
  230. =head2 regexp
  231. Title : regexp
  232. Usage : my $regexp=$site->regexp;
  233. Function: Returns a regular expression which matches the IUPAC convention.
  234. N will match X, N, - and .
  235. Throws :
  236. Example :
  237. Returns : string
  238. Args :
  239. =cut
  240. sub regexp {
  241. my $self=shift;
  242. $self->throw_not_implemented();
  243. }
  244. =head2 regexp_array
  245. Title : regexp_array
  246. Usage : my @regexp=$site->regexp;
  247. Function: Returns a regular expression which matches the IUPAC convention.
  248. N will match X, N, - and .
  249. Throws :
  250. Example :
  251. Returns : array
  252. Args :
  253. To do : I have separated regexp and regexp_array, but
  254. maybe they can be rewritten as one - just check what
  255. should be returned
  256. =cut
  257. sub regexp_array {
  258. my $self=shift;
  259. $self->throw_not_implemented();
  260. }
  261. =head2 get_array
  262. Title : get_array
  263. Usage : my @freq_A=$site->get_array('A');
  264. Function: Returns an array with frequencies for a specified base
  265. Throws :
  266. Example :
  267. Returns : array
  268. Args : char
  269. =cut
  270. sub get_array {
  271. my $self=shift;
  272. $self->throw_not_implemented();
  273. }
  274. =head2 _to_IUPAC
  275. Title : _to_IUPAC
  276. Usage :
  277. Function: Converts a single position to IUPAC compliant symbol and
  278. returns its probability. For rules see the implementation.
  279. Throws :
  280. Example :
  281. Returns : char, real number
  282. Args : real numbers for A,C,G,T (positional)
  283. =cut
  284. sub _to_IUPAC {
  285. my $self = shift;
  286. $self->throw_not_implemented();
  287. }
  288. =head2 _to_cons
  289. Title : _to_cons
  290. Usage :
  291. Function: Converts a single position to simple consensus character and
  292. returns its probability. For rules see the implementation,
  293. Throws :
  294. Example :
  295. Returns : char, real number
  296. Args : real numbers for A,C,G,T (positional)
  297. =cut
  298. sub _to_cons {
  299. my $self = shift;
  300. $self->throw_not_implemented();
  301. }
  302. =head2 _calculate_consensus
  303. Title : _calculate_consensus
  304. Usage :
  305. Function: Internal stuff
  306. Throws :
  307. Example :
  308. Returns :
  309. Args :
  310. =cut
  311. sub _calculate_consensus {
  312. my $self = shift;
  313. $self->throw_not_implemented();
  314. }
  315. =head2 _compress_array
  316. Title : _compress_array
  317. Usage :
  318. Function: Will compress an array of real signed numbers to a string (ie vector of bytes)
  319. -127 to +127 for bi-directional(signed) and 0..255 for unsigned ;
  320. Throws :
  321. Example : Internal stuff
  322. Returns : String
  323. Args : array reference, followed by an max value and
  324. direction (optional, default 1-unsigned),1 unsigned, any other is signed.
  325. =cut
  326. sub _compress_array {
  327. my $self = shift;
  328. $self->throw_not_implemented();
  329. }
  330. =head2 _uncompress_string
  331. Title : _uncompress_string
  332. Usage :
  333. Function: Will uncompress a string (vector of bytes) to create an array of real
  334. signed numbers (opposite to_compress_array)
  335. Throws :
  336. Example : Internal stuff
  337. Returns : string, followed by an max value and
  338. direction (optional, default 1-unsigned), 1 unsigned, any other is signed.
  339. Args : array
  340. =cut
  341. sub _uncompress_string {
  342. my $self = shift;
  343. $self->throw_not_implemented();
  344. }
  345. =head2 get_compressed_freq
  346. Title : get_compressed_freq
  347. Usage :
  348. Function: A method to provide a compressed frequency vector. It uses one byte to
  349. code the frequence for one of the probability vectors for one position.
  350. Useful for relational database. Improvment of the previous 0..a coding.
  351. Throws :
  352. Example : my $strA=$self->get_compressed_freq('A');
  353. Returns : String
  354. Args : char
  355. =cut
  356. sub get_compressed_freq {
  357. my $self = shift;
  358. $self->throw_not_implemented();
  359. }
  360. =head2 get_compressed_logs
  361. Title : get_compressed_logs
  362. Usage :
  363. Function: A method to provide a compressed log-odd vector. It uses one byte to
  364. code the log value for one of the log-odds vectors for one position.
  365. Throws :
  366. Example : my $strA=$self->get_compressed_logs('A');
  367. Returns : String
  368. Args : char
  369. =cut
  370. sub get_compressed_logs {
  371. my $self = shift;
  372. $self->throw_not_implemented();
  373. }
  374. =head2 sequence_match_weight
  375. Title : sequence_match_weight
  376. Usage :
  377. Function: This method will calculate the score of a match, based on the PWM
  378. if such is associated with the matrix object. Returns undef if no
  379. PWM data is available.
  380. Throws : if the length of the sequence is different from the matrix width
  381. Example : my $score=$matrix->sequence_match_weight('ACGGATAG');
  382. Returns : Floating point
  383. Args : string
  384. =cut
  385. sub sequence_match_weight {
  386. my $self = shift;
  387. $self->throw_not_implemented();
  388. }
  389. =head2 get_all_vectors
  390. Title : get_all_vectors
  391. Usage :
  392. Function: returns all possible sequence vectors to satisfy the PFM under
  393. a given threshold
  394. Throws : If threshold outside of 0..1 (no sense to do that)
  395. Example : my @vectors=$self->get_all_vectors(4);
  396. Returns : Array of strings
  397. Args : (optional) floating
  398. =cut
  399. sub get_all_vectors {
  400. my $self = shift;
  401. $self->throw_not_implemented();
  402. }
  403. 1;