/tools/rgenetics/rgQQ.py
Python | 365 lines | 358 code | 2 blank | 5 comment | 21 complexity | f5fb45250e3c91bb7fd6e6df1aa86b04 MD5 | raw file
1""" 2oct 2009 - multiple output files 3Dear Matthias, 4 5Yes, you can define number of outputs dynamically in Galaxy. For doing 6this, you'll have to declare one output dataset in your xml and pass 7its ID ($out_file.id) to your python script. Also, set 8force_history_refresh="True" in your tool tag in xml, like this: 9<tool id="split1" name="Split" force_history_refresh="True"> 10In your script, if your outputs are named in the following format, 11primary_associatedWithDatasetID_designation_visibility_extension 12(_DBKEY), all your datasets will show up in the history pane. 13associatedWithDatasetID is the $out_file.ID passed from xml, 14designation will be a unique identifier for each output (set in your 15script), 16visibility can be set to visible if you want the dataset visible in 17your history, or notvisible otherwise 18extension is the required format for your dataset (bed, tabular, fasta 19etc) 20DBKEY is optional, and can be set if required (e.g. hg18, mm9 etc) 21 22One of our tools "MAF to Interval converter" (tools/maf/ 23maf_to_interval.xml) already uses this feature. You can use it as a 24reference. 25 26qq.chisq Quantile-quantile plot for chi-squared tests 27Description 28This function plots ranked observed chi-squared test statistics against the corresponding expected 29order statistics. It also estimates an inflation (or deflation) factor, lambda, by the ratio of the trimmed 30means of observed and expected values. This is useful for inspecting the results of whole-genome 31association studies for overdispersion due to population substructure and other sources of bias or 32confounding. 33Usage 34qq.chisq(x, df=1, x.max, main="QQ plot", 35sub=paste("Expected distribution: chi-squared (",df," df)", sep=""), 36xlab="Expected", ylab="Observed", 37conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5, 38slope.one=FALSE, slope.lambda=FALSE, 39thin=c(0.25,50), oor.pch=24, col.shade="gray", ...) 40Arguments 41x A vector of observed chi-squared test values 42df The degreees of freedom for the tests 43x.max If present, truncate the observed value (Y) axis here 44main The main heading 45sub The subheading 46xlab x-axis label (default "Expected") 47ylab y-axis label (default "Observed") 48conc Lower and upper probability bounds for concentration band for the plot. Set this 49to NA to suppress this 50overdisp If TRUE, an overdispersion factor, lambda, will be estimated and used in calculating 51concentration band 52trim Quantile point for trimmed mean calculations for estimation of lambda. Default 53is to trim at the median 54slope.one Is a line of slope one to be superimpsed? 55slope.lambda Is a line of slope lambda to be superimposed? 56thin A pair of numbers indicating how points will be thinned before plotting (see 57Details). If NA, no thinning will be carried out 58oor.pch Observed values greater than x.max are plotted at x.max. This argument sets 59the plotting symbol to be used for out-of-range observations 60col.shade The colour with which the concentration band will be filled 61... Further graphical parameter settings to be passed to points() 62 63Details 64To reduce plotting time and the size of plot files, the smallest observed and expected points are 65thinned so that only a reduced number of (approximately equally spaced) points are plotted. The 66precise behaviour is controlled by the parameter thin, whose value should be a pair of numbers. 67The first number must lie between 0 and 1 and sets the proportion of the X axis over which thinning 68is to be applied. The second number should be an integer and sets the maximum number of points 69to be plotted in this section. 70The "concentration band" for the plot is shown in grey. This region is defined by upper and lower 71probability bounds for each order statistic. The default is to use the 2.5 Note that this is not a 72simultaneous confidence region; the probability that the plot will stray outside the band at some 73point exceeds 95 74When required, he dispersion factor is estimated by the ratio of the observed trimmed mean to its 75expected value under the chi-squared assumption. 76Value 77The function returns the number of tests, the number of values omitted from the plot (greater than 78x.max), and the estimated dispersion factor, lambda. 79Note 80All tests must have the same number of degrees of freedom. If this is not the case, I suggest 81transforming to p-values and then plotting -2log(p) as chi-squared on 2 df. 82Author(s) 83David Clayton hdavid.clayton@cimr.cam.ac.uki 84References 85Devlin, B. and Roeder, K. (1999) Genomic control for association studies. Biometrics, 55:997-1004 86""" 87 88import sys, random, math, copy,os, subprocess, tempfile 89from rgutils import RRun, rexe 90 91rqq = """ 92# modified by ross lazarus for the rgenetics project may 2000 93# makes a pdf for galaxy from an x vector of chisquare values 94# from snpMatrix 95# http://www.bioconductor.org/packages/bioc/html/snpMatrix.html 96 qq.chisq <- 97 function(x, df=1, x.max, 98 main="QQ plot", 99 sub=paste("Expected distribution: chi-squared (",df," df)", sep=""), 100 xlab="Expected", ylab="Observed", 101 conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5, 102 slope.one=T, slope.lambda=FALSE, 103 thin=c(0.5,200), oor.pch=24, col.shade="gray", ofname="qqchi.pdf", 104 h=6,w=6,printpdf=F,...) { 105 106 # Function to shade concentration band 107 108 shade <- function(x1, y1, x2, y2, color=col.shade) { 109 n <- length(x2) 110 polygon(c(x1, x2[n:1]), c(y1, y2[n:1]), border=NA, col=color) 111 } 112 113 # Sort values and see how many out of range 114 115 obsvd <- sort(x, na.last=NA) 116 N <- length(obsvd) 117 if (missing(x.max)) { 118 Np <- N 119 } 120 else { 121 Np <- sum(obsvd<=x.max) 122 } 123 if(Np==0) 124 stop("Nothing to plot") 125 126 # Expected values 127 128 if (df==2) { 129 expctd <- 2*cumsum(1/(N:1)) 130 } 131 else { 132 expctd <- qchisq(p=(1:N)/(N+1), df=df) 133 } 134 135 # Concentration bands 136 137 if (!is.null(conc)) { 138 if(conc[1]>0) { 139 e.low <- qchisq(p=qbeta(conc[1], 1:N, N:1), df=df) 140 } 141 else { 142 e.low <- rep(0, N) 143 } 144 if (conc[2]<1) { 145 e.high <- qchisq(p=qbeta(conc[2], 1:N, N:1), df=df) 146 } 147 else { 148 e.high <- 1.1*rep(max(x),N) 149 } 150 } 151 152 # Plot outline 153 154 if (Np < N) 155 top <- x.max 156 else 157 top <- obsvd[N] 158 right <- expctd[N] 159 if (printpdf) {pdf(ofname,h,w)} 160 plot(c(0, right), c(0, top), type="n", xlab=xlab, ylab=ylab, 161 main=main, sub=sub) 162 163 # Thinning 164 165 if (is.na(thin[1])) { 166 show <- 1:Np 167 } 168 else if (length(thin)!=2 || thin[1]<0 || thin[1]>1 || thin[2]<1) { 169 warning("invalid thin parameter; no thinning carried out") 170 show <- 1:Np 171 } 172 else { 173 space <- right*thin[1]/floor(thin[2]) 174 iat <- round((N+1)*pchisq(q=(1:floor(thin[2]))*space, df=df)) 175 if (max(iat)>thin[2]) 176 show <- unique(c(iat, (1+max(iat)):Np)) 177 else 178 show <- 1:Np 179 } 180 Nu <- floor(trim*N) 181 if (Nu>0) 182 lambda <- mean(obsvd[1:Nu])/mean(expctd[1:Nu]) 183 if (!is.null(conc)) { 184 if (Np<N) 185 vert <- c(show, (Np+1):N) 186 else 187 vert <- show 188 if (overdisp) 189 shade(expctd[vert], lambda*e.low[vert], 190 expctd[vert], lambda*e.high[vert]) 191 else 192 shade(expctd[vert], e.low[vert], expctd[vert], e.high[vert]) 193 } 194 points(expctd[show], obsvd[show], ...) 195 # Overflow 196 if (Np<N) { 197 over <- (Np+1):N 198 points(expctd[over], rep(x.max, N-Np), pch=oor.pch) 199 } 200 # Lines 201 line.types <- c("solid", "dashed", "dotted") 202 key <- NULL 203 txt <- NULL 204 if (slope.one) { 205 key <- c(key, line.types[1]) 206 txt <- c(txt, "y = x") 207 abline(a=0, b=1, lty=line.types[1]) 208 } 209 if (slope.lambda && Nu>0) { 210 key <- c(key, line.types[2]) 211 txt <- c(txt, paste("y = ", format(lambda, digits=4), "x", sep="")) 212 if (!is.null(conc)) { 213 if (Np<N) 214 vert <- c(show, (Np+1):N) 215 else 216 vert <- show 217 } 218 abline(a=0, b=lambda, lty=line.types[2]) 219 } 220 if (printpdf) {dev.off()} 221 # Returned value 222 223 if (!is.null(key)) 224 legend(0, top, legend=txt, lty=key) 225 c(N=N, omitted=N-Np, lambda=lambda) 226 227 } 228 229""" 230 231 232 233 234def makeQQ(dat=[], sample=1.0, maxveclen=4000, fname='fname',title='title', 235 xvar='Sample',h=8,w=8,logscale=True,outdir=None): 236 """ 237 y is data for a qq plot and ends up on the x axis go figure 238 if sampling, oversample low values - all the top 1% ? 239 assume we have 0-1 p values 240 """ 241 R = [] 242 colour="maroon" 243 nrows = len(dat) 244 dat.sort() # small to large 245 fn = float(nrows) 246 unifx = [x/fn for x in range(1,(nrows+1))] 247 if logscale: 248 unifx = [-math.log10(x) for x in unifx] # uniform distribution 249 if sample < 1.0 and len(dat) > maxveclen: 250 # now have half a million markers eg - too many to plot all for a pdf - sample to get 10k or so points 251 # oversample part of the distribution 252 always = min(1000,nrows/20) # oversample smaller of lowest few hundred items or 5% 253 skip = int(nrows/float(maxveclen)) # take 1 in skip to get about maxveclen points 254 if skip <= 1: 255 skip = 2 256 samplei = [i for i in range(nrows) if (i < always) or (i % skip == 0)] 257 # always oversample first sorted (here lowest) values 258 yvec = [dat[i] for i in samplei] # always get first and last 259 xvec = [unifx[i] for i in samplei] # and sample xvec same way 260 maint='QQ %s (random %d of %d)' % (title,len(yvec),nrows) 261 else: 262 yvec = [x for x in dat] 263 maint='QQ %s (n=%d)' % (title,nrows) 264 xvec = unifx 265 if logscale: 266 maint = 'Log%s' % maint 267 mx = [0,math.log10(nrows)] # if 1000, becomes 3 for the null line 268 ylab = '-log10(%s) Quantiles' % title 269 xlab = '-log10(Uniform 0-1) Quantiles' 270 yvec = [-math.log10(x) for x in yvec if x > 0.0] 271 else: 272 mx = [0,1] 273 ylab = '%s Quantiles' % title 274 xlab = 'Uniform 0-1 Quantiles' 275 276 xv = ['%f' % x for x in xvec] 277 R.append('xvec = c(%s)' % ','.join(xv)) 278 yv = ['%f' % x for x in yvec] 279 R.append('yvec = c(%s)' % ','.join(yv)) 280 R.append('mx = c(0,%f)' % (math.log10(fn))) 281 R.append('pdf("%s",h=%d,w=%d)' % (fname,h,w)) 282 R.append("par(lab=c(10,10,10))") 283 R.append('qqplot(xvec,yvec,xlab="%s",ylab="%s",main="%s",sub="%s",pch=19,col="%s",cex=0.8)' % (xlab,ylab,maint,title,colour)) 284 R.append('points(mx,mx,type="l")') 285 R.append('grid(col="lightgray",lty="dotted")') 286 R.append('dev.off()') 287 RRun(rcmd=R,title='makeQQplot',outdir=outdir) 288 289 290 291def main(): 292 u = """ 293 """ 294 u = """<command interpreter="python"> 295 rgQQ.py "$input1" "$name" $sample "$cols" $allqq $height $width $logtrans $allqq.id $__new_file_path__ 296 </command> 297 298 </command> 299 """ 300 print >> sys.stdout,'## rgQQ.py. cl=',sys.argv 301 npar = 11 302 if len(sys.argv) < npar: 303 print >> sys.stdout, '## error - too few command line parameters - wanting %d' % npar 304 print >> sys.stdout, u 305 sys.exit(1) 306 in_fname = sys.argv[1] 307 name = sys.argv[2] 308 sample = float(sys.argv[3]) 309 head = None 310 columns = [int(x) for x in sys.argv[4].strip().split(',')] # work with python columns! 311 allout = sys.argv[5] 312 height = int(sys.argv[6]) 313 width = int(sys.argv[7]) 314 logscale = (sys.argv[8].lower() == 'true') 315 outid = sys.argv[9] # this is used to allow multiple output files 316 outdir = sys.argv[10] 317 nan_row = False 318 rows = [] 319 for i, line in enumerate( file( sys.argv[1] ) ): 320 # Skip comments 321 if line.startswith( '#' ) or ( i == 0 ): 322 if i == 0: 323 head = line.strip().split("\t") 324 continue 325 if len(line.strip()) == 0: 326 continue 327 # Extract values and convert to floats 328 fields = line.strip().split( "\t" ) 329 row = [] 330 nan_row = False 331 for column in columns: 332 if len( fields ) <= column: 333 return fail( "No column %d on line %d: %s" % ( column, i, fields ) ) 334 val = fields[column] 335 if val.lower() == "na": 336 nan_row = True 337 else: 338 try: 339 row.append( float( fields[column] ) ) 340 except ValueError: 341 return fail( "Value '%s' in column %d on line %d is not numeric" % ( fields[column], column+1, i ) ) 342 if not nan_row: 343 rows.append( row ) 344 if i > 1: 345 i = i-1 # remove header row from count 346 if head == None: 347 head = ['Col%d' % (x+1) for x in columns] 348 R = [] 349 for c,column in enumerate(columns): # we appended each column in turn 350 outname = allout 351 if c > 0: # after first time 352 outname = 'primary_%s_col%s_visible_pdf' % (outid,column) 353 outname = os.path.join(outdir,outname) 354 dat = [] 355 nrows = len(rows) # sometimes lots of NA's!! 356 for arow in rows: 357 dat.append(arow[c]) # remember, we appended each col in turn! 358 cname = head[column] 359 makeQQ(dat=dat,sample=sample,fname=outname,title='%s_%s' % (name,cname), 360 xvar=cname,h=height,w=width,logscale=logscale,outdir=outdir) 361 362 363 364if __name__ == "__main__": 365 main()