/tools/rgenetics/rgQQ.py

https://bitbucket.org/cistrome/cistrome-harvard/ · Python · 365 lines

"""
oct 2009 - multiple output files
Dear Matthias,
Yes, you can define number of outputs dynamically in Galaxy. For doing
this, you'll have to declare one output dataset in your xml and pass
its ID ($out_file.id) to your python script. Also, set
force_history_refresh="True" in your tool tag in xml, like this:
<tool id="split1" name="Split" force_history_refresh="True">
In your script, if your outputs are named in the following format,
primary_associatedWithDatasetID_designation_visibility_extension
(_DBKEY), all your datasets will show up in the history pane.
associatedWithDatasetID is the $out_file.ID passed from xml;
designation will be a unique identifier for each output (set in your
script);
visibility can be set to visible if you want the dataset visible in
your history, or notvisible otherwise;
extension is the required format for your dataset (bed, tabular, fasta
etc);
DBKEY is optional, and can be set if required (e.g. hg18, mm9 etc).
One of our tools, "MAF to Interval converter" (tools/maf/
maf_to_interval.xml), already uses this feature. You can use it as a
reference.

qq.chisq - Quantile-quantile plot for chi-squared tests

Description
This function plots ranked observed chi-squared test statistics against the
corresponding expected order statistics. It also estimates an inflation (or
deflation) factor, lambda, as the ratio of the trimmed means of observed and
expected values. This is useful for inspecting the results of whole-genome
association studies for overdispersion due to population substructure and
other sources of bias or confounding.

Usage
qq.chisq(x, df=1, x.max, main="QQ plot",
         sub=paste("Expected distribution: chi-squared (",df," df)", sep=""),
         xlab="Expected", ylab="Observed",
         conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5,
         slope.one=FALSE, slope.lambda=FALSE,
         thin=c(0.25,50), oor.pch=24, col.shade="gray", ...)

Arguments
x            A vector of observed chi-squared test values
df           The degrees of freedom for the tests
x.max        If present, truncate the observed value (Y) axis here
main         The main heading
sub          The subheading
xlab         x-axis label (default "Expected")
ylab         y-axis label (default "Observed")
conc         Lower and upper probability bounds for the concentration band of
             the plot. Set this to NA to suppress the band
overdisp     If TRUE, an overdispersion factor, lambda, will be estimated and
             used in calculating the concentration band
trim         Quantile point for trimmed mean calculations for estimation of
             lambda. Default is to trim at the median
slope.one    Is a line of slope one to be superimposed?
slope.lambda Is a line of slope lambda to be superimposed?
thin         A pair of numbers indicating how points will be thinned before
             plotting (see Details). If NA, no thinning will be carried out
oor.pch      Observed values greater than x.max are plotted at x.max. This
             argument sets the plotting symbol to be used for out-of-range
             observations
col.shade    The colour with which the concentration band will be filled
...          Further graphical parameter settings to be passed to points()

Details
To reduce plotting time and the size of plot files, the smallest observed and
expected points are thinned so that only a reduced number of (approximately
equally spaced) points are plotted. The precise behaviour is controlled by
the parameter thin, whose value should be a pair of numbers. The first number
must lie between 0 and 1 and sets the proportion of the X axis over which
thinning is to be applied. The second number should be an integer and sets
the maximum number of points to be plotted in this section.
The "concentration band" for the plot is shown in grey. This region is
defined by upper and lower probability bounds for each order statistic; the
default is to use the 2.5 and 97.5 percent points. Note that this is not a
simultaneous confidence region; the probability that the plot will stray
outside the band at some point exceeds 95 percent.
When required, the dispersion factor is estimated by the ratio of the
observed trimmed mean to its expected value under the chi-squared assumption.

Value
The function returns the number of tests, the number of values omitted from
the plot (greater than x.max), and the estimated dispersion factor, lambda.

Note
All tests must have the same number of degrees of freedom. If this is not
the case, I suggest transforming to p-values and then plotting -2log(p) as
chi-squared on 2 df.

Author(s)
David Clayton <david.clayton@cimr.cam.ac.uk>

References
Devlin, B. and Roeder, K. (1999) Genomic control for association studies.
Biometrics, 55:997-1004
"""
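The `primary_<id>_<designation>_<visibility>_<extension>(_<DBKEY>)` naming
convention quoted above is mechanical enough to sketch as a small helper. This
is only an illustration: the function name and signature are made up here, not
part of this tool or of the Galaxy API.

```python
def galaxy_output_name(dataset_id, designation, visible=True, ext="pdf", dbkey=None):
    """Build a Galaxy 'primary' dataset filename of the form
    primary_<id>_<designation>_<visible|notvisible>_<extension>[_<dbkey>]
    so extra outputs show up in the history pane (hypothetical helper)."""
    parts = ["primary", str(dataset_id), str(designation),
             "visible" if visible else "notvisible", ext]
    if dbkey:  # DBKEY is optional (e.g. hg18, mm9)
        parts.append(dbkey)
    return "_".join(parts)
```

This mirrors the `'primary_%s_col%s_visible_pdf'` pattern used later in
`main()` for the second and subsequent output PDFs.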
import sys, random, math, copy, os, subprocess, tempfile
from rgutils import RRun, rexe
rqq = """
# modified by ross lazarus for the rgenetics project may 2000
# makes a pdf for galaxy from an x vector of chisquare values
# from snpMatrix
# http://www.bioconductor.org/packages/bioc/html/snpMatrix.html
qq.chisq <-
  function(x, df=1, x.max,
           main="QQ plot",
           sub=paste("Expected distribution: chi-squared (",df," df)", sep=""),
           xlab="Expected", ylab="Observed",
           conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5,
           slope.one=T, slope.lambda=FALSE,
           thin=c(0.5,200), oor.pch=24, col.shade="gray", ofname="qqchi.pdf",
           h=6, w=6, printpdf=F, ...) {
    # Function to shade concentration band
    shade <- function(x1, y1, x2, y2, color=col.shade) {
      n <- length(x2)
      polygon(c(x1, x2[n:1]), c(y1, y2[n:1]), border=NA, col=color)
    }
    # Sort values and see how many out of range
    obsvd <- sort(x, na.last=NA)
    N <- length(obsvd)
    if (missing(x.max)) {
      Np <- N
    }
    else {
      Np <- sum(obsvd<=x.max)
    }
    if (Np==0)
      stop("Nothing to plot")
    # Expected values
    if (df==2) {
      expctd <- 2*cumsum(1/(N:1))
    }
    else {
      expctd <- qchisq(p=(1:N)/(N+1), df=df)
    }
    # Concentration bands
    if (!is.null(conc)) {
      if (conc[1]>0) {
        e.low <- qchisq(p=qbeta(conc[1], 1:N, N:1), df=df)
      }
      else {
        e.low <- rep(0, N)
      }
      if (conc[2]<1) {
        e.high <- qchisq(p=qbeta(conc[2], 1:N, N:1), df=df)
      }
      else {
        e.high <- 1.1*rep(max(x), N)
      }
    }
    # Plot outline
    if (Np < N)
      top <- x.max
    else
      top <- obsvd[N]
    right <- expctd[N]
    if (printpdf) {pdf(ofname, h, w)}
    plot(c(0, right), c(0, top), type="n", xlab=xlab, ylab=ylab,
         main=main, sub=sub)
    # Thinning
    if (is.na(thin[1])) {
      show <- 1:Np
    }
    else if (length(thin)!=2 || thin[1]<0 || thin[1]>1 || thin[2]<1) {
      warning("invalid thin parameter; no thinning carried out")
      show <- 1:Np
    }
    else {
      space <- right*thin[1]/floor(thin[2])
      iat <- round((N+1)*pchisq(q=(1:floor(thin[2]))*space, df=df))
      if (max(iat)>thin[2])
        show <- unique(c(iat, (1+max(iat)):Np))
      else
        show <- 1:Np
    }
    Nu <- floor(trim*N)
    if (Nu>0)
      lambda <- mean(obsvd[1:Nu])/mean(expctd[1:Nu])
    if (!is.null(conc)) {
      if (Np<N)
        vert <- c(show, (Np+1):N)
      else
        vert <- show
      if (overdisp)
        shade(expctd[vert], lambda*e.low[vert],
              expctd[vert], lambda*e.high[vert])
      else
        shade(expctd[vert], e.low[vert], expctd[vert], e.high[vert])
    }
    points(expctd[show], obsvd[show], ...)
    # Overflow
    if (Np<N) {
      over <- (Np+1):N
      points(expctd[over], rep(x.max, N-Np), pch=oor.pch)
    }
    # Lines
    line.types <- c("solid", "dashed", "dotted")
    key <- NULL
    txt <- NULL
    if (slope.one) {
      key <- c(key, line.types[1])
      txt <- c(txt, "y = x")
      abline(a=0, b=1, lty=line.types[1])
    }
    if (slope.lambda && Nu>0) {
      key <- c(key, line.types[2])
      txt <- c(txt, paste("y = ", format(lambda, digits=4), "x", sep=""))
      abline(a=0, b=lambda, lty=line.types[2])
    }
    # legend must go onto the plot before the pdf device is closed
    if (!is.null(key))
      legend(0, top, legend=txt, lty=key)
    if (printpdf) {dev.off()}
    # Returned value
    c(N=N, omitted=N-Np, lambda=lambda)
  }
"""
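The lambda (inflation factor) estimate in `qq.chisq` is just a ratio of
trimmed means of observed versus expected order statistics. A minimal
pure-Python analogue for the df=2 case, where the expected order statistics
have the closed form `2*cumsum(1/(N:1))` used in the R code above, might look
like this (the function name is invented for illustration):

```python
def lambda_estimate_df2(observed, trim=0.5):
    """Estimate the inflation factor lambda as qq.chisq does for df=2:
    ratio of the trimmed means of observed and expected
    chi-squared(2 df) order statistics. Returns None if nothing to trim."""
    obs = sorted(x for x in observed if x == x)  # drop NaN, sort ascending
    N = len(obs)
    # expected order statistics for chi-squared on 2 df: 2*cumsum(1/(N:1))
    expctd = []
    s = 0.0
    for k in range(N, 0, -1):
        s += 1.0 / k
        expctd.append(2.0 * s)
    Nu = int(trim * N)  # floor(trim*N), as in the R code
    if Nu == 0:
        return None
    return (sum(obs[:Nu]) / Nu) / (sum(expctd[:Nu]) / Nu)
```

With well-calibrated test statistics lambda is close to 1; values well above
1 suggest overdispersion (e.g. population substructure), which is the point
of the Devlin and Roeder genomic-control reference in the docstring.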
def makeQQ(dat=[], sample=1.0, maxveclen=4000, fname='fname', title='title',
           xvar='Sample', h=8, w=8, logscale=True, outdir=None):
    """
    y is data for a qq plot and ends up on the x axis go figure
    if sampling, oversample low values - all the top 1% ?
    assume we have 0-1 p values
    """
    R = []
    colour = "maroon"
    nrows = len(dat)
    dat.sort() # small to large
    fn = float(nrows)
    unifx = [x/fn for x in range(1, (nrows+1))]
    if logscale:
        unifx = [-math.log10(x) for x in unifx] # uniform distribution
    if sample < 1.0 and len(dat) > maxveclen:
        # now have half a million markers eg - too many to plot all for a pdf - sample to get 10k or so points
        # oversample part of the distribution
        always = min(1000, nrows/20) # oversample smaller of lowest few hundred items or 5%
        skip = int(nrows/float(maxveclen)) # take 1 in skip to get about maxveclen points
        if skip <= 1:
            skip = 2
        samplei = [i for i in range(nrows) if (i < always) or (i % skip == 0)]
        # always oversample first sorted (here lowest) values
        yvec = [dat[i] for i in samplei] # always get first and last
        xvec = [unifx[i] for i in samplei] # and sample xvec same way
        maint = 'QQ %s (random %d of %d)' % (title, len(yvec), nrows)
    else:
        yvec = [x for x in dat]
        maint = 'QQ %s (n=%d)' % (title, nrows)
        xvec = unifx
    if logscale:
        maint = 'Log%s' % maint
        mx = [0, math.log10(nrows)] # if 1000, becomes 3 for the null line
        ylab = '-log10(%s) Quantiles' % title
        xlab = '-log10(Uniform 0-1) Quantiles'
        yvec = [-math.log10(x) for x in yvec if x > 0.0]
    else:
        mx = [0, 1]
        ylab = '%s Quantiles' % title
        xlab = 'Uniform 0-1 Quantiles'
    xv = ['%f' % x for x in xvec]
    R.append('xvec = c(%s)' % ','.join(xv))
    yv = ['%f' % x for x in yvec]
    R.append('yvec = c(%s)' % ','.join(yv))
    R.append('mx = c(%f,%f)' % (mx[0], mx[1])) # null line endpoints computed above for either scale
    R.append('pdf("%s",h=%d,w=%d)' % (fname, h, w))
    R.append("par(lab=c(10,10,10))")
    R.append('qqplot(xvec,yvec,xlab="%s",ylab="%s",main="%s",sub="%s",pch=19,col="%s",cex=0.8)' % (xlab, ylab, maint, title, colour))
    R.append('points(mx,mx,type="l")')
    R.append('grid(col="lightgray",lty="dotted")')
    R.append('dev.off()')
    RRun(rcmd=R, title='makeQQplot', outdir=outdir)
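The down-sampling scheme in `makeQQ` (keep all of the low tail, then every
`skip`-th point overall) can be isolated into a standalone helper with the same
arithmetic; the name `qq_sample_indices` is hypothetical, chosen here just to
show the index-selection step on its own:

```python
def qq_sample_indices(nrows, maxveclen=4000):
    """Pick row indices to plot the way makeQQ samples: keep every index
    below 'always' (oversampling the smallest sorted values) plus every
    'skip'-th index overall, so roughly maxveclen points survive."""
    always = min(1000, nrows // 20)           # lowest few hundred items or 5%
    skip = max(2, int(nrows / float(maxveclen)))  # take 1 in skip rows
    return [i for i in range(nrows) if i < always or i % skip == 0]
```

Because the data are sorted ascending before sampling, the retained low tail
corresponds to the smallest p-values, i.e. the most interesting corner of the
QQ plot.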
def fail(msg):
    # report a fatal input problem and stop with a nonzero exit code
    print >> sys.stdout, msg
    sys.exit(1)

def main():
    u = """<command interpreter="python">
        rgQQ.py "$input1" "$name" $sample "$cols" $allqq $height $width $logtrans $allqq.id $__new_file_path__
    </command>
    """
    print >> sys.stdout, '## rgQQ.py. cl=', sys.argv
    npar = 11
    if len(sys.argv) < npar:
        print >> sys.stdout, '## error - too few command line parameters - wanting %d' % npar
        print >> sys.stdout, u
        sys.exit(1)
    in_fname = sys.argv[1]
    name = sys.argv[2]
    sample = float(sys.argv[3])
    head = None
    columns = [int(x) for x in sys.argv[4].strip().split(',')] # work with python columns!
    allout = sys.argv[5]
    height = int(sys.argv[6])
    width = int(sys.argv[7])
    logscale = (sys.argv[8].lower() == 'true')
    outid = sys.argv[9] # this is used to allow multiple output files
    outdir = sys.argv[10]
    nan_row = False
    rows = []
    for i, line in enumerate(file(in_fname)):
        # Skip comments and remember the header row
        if line.startswith('#') or (i == 0):
            if i == 0:
                head = line.strip().split("\t")
            continue
        if len(line.strip()) == 0:
            continue
        # Extract values and convert to floats
        fields = line.strip().split("\t")
        row = []
        nan_row = False
        for column in columns:
            if len(fields) <= column:
                return fail("No column %d on line %d: %s" % (column, i, fields))
            val = fields[column]
            if val.lower() == "na":
                nan_row = True
            else:
                try:
                    row.append(float(fields[column]))
                except ValueError:
                    return fail("Value '%s' in column %d on line %d is not numeric" % (fields[column], column+1, i))
        if not nan_row:
            rows.append(row)
    if i > 1:
        i = i - 1 # remove header row from count
    if head is None:
        head = ['Col%d' % (x+1) for x in columns]
    for c, column in enumerate(columns): # we appended each column in turn
        outname = allout
        if c > 0: # after first time
            outname = 'primary_%s_col%s_visible_pdf' % (outid, column)
            outname = os.path.join(outdir, outname)
        dat = []
        nrows = len(rows) # sometimes lots of NA's!!
        for arow in rows:
            dat.append(arow[c]) # remember, we appended each col in turn!
        cname = head[column]
        makeQQ(dat=dat, sample=sample, fname=outname, title='%s_%s' % (name, cname),
               xvar=cname, h=height, w=width, logscale=logscale, outdir=outdir)

if __name__ == "__main__":
    main()