PageRenderTime 67ms CodeModel.GetById 19ms app.highlight 42ms RepoModel.GetById 1ms app.codeStats 0ms

/tools/rgenetics/rgQQ.py

https://bitbucket.org/cistrome/cistrome-harvard/
Python | 365 lines | 358 code | 2 blank | 5 comment | 21 complexity | f5fb45250e3c91bb7fd6e6df1aa86b04 MD5 | raw file
  1"""
  2oct 2009 - multiple output files 
  3Dear Matthias,
  4
  5Yes, you can define number of outputs dynamically in Galaxy. For doing
  6this, you'll have to declare one output dataset in your xml and pass
  7its ID ($out_file.id) to your python script. Also, set
  8force_history_refresh="True" in your tool tag in xml, like this:
  9<tool id="split1" name="Split" force_history_refresh="True">
 10In your script, if your outputs are named in the following format,
 11primary_associatedWithDatasetID_designation_visibility_extension
 12(_DBKEY), all your datasets will show up in the history pane.
 13associatedWithDatasetID is the $out_file.ID passed from xml,
 14designation will be a unique identifier for each output (set in your
 15script),
 16visibility can be set to visible if you want the dataset visible in
 17your history, or notvisible otherwise
 18extension is the required format for your dataset (bed, tabular, fasta
 19etc)
 20DBKEY is optional, and can be set if required (e.g. hg18, mm9 etc)
 21
 22One of our tools "MAF to Interval converter" (tools/maf/
 23maf_to_interval.xml) already uses this feature. You can use it as a
 24reference.
 25
 26qq.chisq Quantile-quantile plot for chi-squared tests
 27Description
 28This function plots ranked observed chi-squared test statistics against the corresponding expected
 29order statistics. It also estimates an inflation (or deflation) factor, lambda, by the ratio of the trimmed
 30means of observed and expected values. This is useful for inspecting the results of whole-genome
 31association studies for overdispersion due to population substructure and other sources of bias or
 32confounding.
 33Usage
 34qq.chisq(x, df=1, x.max, main="QQ plot",
 35sub=paste("Expected distribution: chi-squared (",df," df)", sep=""),
 36xlab="Expected", ylab="Observed",
 37conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5,
 38slope.one=FALSE, slope.lambda=FALSE,
 39thin=c(0.25,50), oor.pch=24, col.shade="gray", ...)
 40Arguments
 41x A vector of observed chi-squared test values
 42df The degreees of freedom for the tests
 43x.max If present, truncate the observed value (Y) axis here
 44main The main heading
 45sub The subheading
 46xlab x-axis label (default "Expected")
 47ylab y-axis label (default "Observed")
 48conc Lower and upper probability bounds for concentration band for the plot. Set this
 49to NA to suppress this
 50overdisp If TRUE, an overdispersion factor, lambda, will be estimated and used in calculating
 51concentration band
 52trim Quantile point for trimmed mean calculations for estimation of lambda. Default
 53is to trim at the median
 54slope.one Is a line of slope one to be superimpsed?
 55slope.lambda Is a line of slope lambda to be superimposed?
 56thin A pair of numbers indicating how points will be thinned before plotting (see
 57Details). If NA, no thinning will be carried out
 58oor.pch Observed values greater than x.max are plotted at x.max. This argument sets
 59the plotting symbol to be used for out-of-range observations
 60col.shade The colour with which the concentration band will be filled
 61... Further graphical parameter settings to be passed to points()
 62
 63Details
 64To reduce plotting time and the size of plot files, the smallest observed and expected points are
 65thinned so that only a reduced number of (approximately equally spaced) points are plotted. The
 66precise behaviour is controlled by the parameter thin, whose value should be a pair of numbers.
 67The first number must lie between 0 and 1 and sets the proportion of the X axis over which thinning
 68is to be applied. The second number should be an integer and sets the maximum number of points
 69to be plotted in this section.
 70The "concentration band" for the plot is shown in grey. This region is defined by upper and lower
 71probability bounds for each order statistic. The default is to use the 2.5 Note that this is not a
 72simultaneous confidence region; the probability that the plot will stray outside the band at some
 73point exceeds 95
 74When required, he dispersion factor is estimated by the ratio of the observed trimmed mean to its
 75expected value under the chi-squared assumption.
 76Value
 77The function returns the number of tests, the number of values omitted from the plot (greater than
 78x.max), and the estimated dispersion factor, lambda.
 79Note
 80All tests must have the same number of degrees of freedom. If this is not the case, I suggest
 81transforming to p-values and then plotting -2log(p) as chi-squared on 2 df.
 82Author(s)
 83David Clayton hdavid.clayton@cimr.cam.ac.uki
 84References
 85Devlin, B. and Roeder, K. (1999) Genomic control for association studies. Biometrics, 55:997-1004
 86"""
 87
 88import sys, random, math, copy,os, subprocess, tempfile
 89from rgutils import RRun, rexe
 90
 91rqq = """
 92# modified by ross lazarus for the rgenetics project may 2000
 93# makes a pdf for galaxy from an x vector of chisquare values
 94# from snpMatrix
 95# http://www.bioconductor.org/packages/bioc/html/snpMatrix.html
 96 qq.chisq <-
 97  function(x, df=1, x.max,
 98    main="QQ plot",
 99    sub=paste("Expected distribution: chi-squared (",df," df)", sep=""),
100    xlab="Expected", ylab="Observed",
101    conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5,
102    slope.one=T, slope.lambda=FALSE,
103    thin=c(0.5,200), oor.pch=24, col.shade="gray", ofname="qqchi.pdf",
104    h=6,w=6,printpdf=F,...) {
105
106    # Function to shade concentration band
107
108    shade <- function(x1, y1, x2, y2, color=col.shade) {
109      n <- length(x2)
110      polygon(c(x1, x2[n:1]), c(y1, y2[n:1]), border=NA, col=color)
111    }
112
113    # Sort values and see how many out of range
114
115    obsvd <- sort(x, na.last=NA)
116    N <- length(obsvd)
117    if (missing(x.max)) {
118      Np <- N
119    }
120    else {
121      Np <- sum(obsvd<=x.max)
122    }
123    if(Np==0)
124      stop("Nothing to plot")
125
126    # Expected values
127
128    if (df==2) {
129      expctd <- 2*cumsum(1/(N:1))
130    }
131    else {
132      expctd <- qchisq(p=(1:N)/(N+1), df=df)
133    }
134
135    # Concentration bands
136
137    if (!is.null(conc)) {
138      if(conc[1]>0) {
139        e.low <- qchisq(p=qbeta(conc[1], 1:N, N:1), df=df)
140      }
141      else {
142        e.low <- rep(0, N)
143      }
144      if (conc[2]<1) {
145        e.high <- qchisq(p=qbeta(conc[2], 1:N, N:1), df=df)
146      }
147      else {
148        e.high <- 1.1*rep(max(x),N)
149      }
150    }
151
152    # Plot outline
153
154    if (Np < N)
155      top <- x.max
156    else
157      top <- obsvd[N]
158    right <- expctd[N]
159    if (printpdf) {pdf(ofname,h,w)}
160    plot(c(0, right), c(0, top), type="n", xlab=xlab, ylab=ylab,
161         main=main, sub=sub)
162
163    # Thinning
164
165    if (is.na(thin[1])) {
166      show <- 1:Np
167    }
168    else if (length(thin)!=2 || thin[1]<0 || thin[1]>1 || thin[2]<1) {
169      warning("invalid thin parameter; no thinning carried out")
170      show <- 1:Np
171    }
172    else {
173      space <- right*thin[1]/floor(thin[2])
174      iat <- round((N+1)*pchisq(q=(1:floor(thin[2]))*space, df=df))
175      if (max(iat)>thin[2])
176        show <- unique(c(iat, (1+max(iat)):Np))
177      else
178        show <- 1:Np
179    }
180    Nu <- floor(trim*N)
181    if (Nu>0)
182      lambda <- mean(obsvd[1:Nu])/mean(expctd[1:Nu])
183    if (!is.null(conc)) {
184      if (Np<N)
185        vert <- c(show, (Np+1):N)
186      else
187        vert <- show
188      if (overdisp)
189        shade(expctd[vert], lambda*e.low[vert],
190              expctd[vert], lambda*e.high[vert])
191      else
192        shade(expctd[vert], e.low[vert], expctd[vert], e.high[vert])
193    }
194    points(expctd[show], obsvd[show], ...)
195    # Overflow
196    if (Np<N) {
197      over <- (Np+1):N
198      points(expctd[over], rep(x.max, N-Np), pch=oor.pch)
199    }
200    # Lines
201    line.types <- c("solid", "dashed", "dotted")
202    key <- NULL
203    txt <- NULL
204    if (slope.one) {
205      key <- c(key, line.types[1])
206      txt <- c(txt, "y = x")
207      abline(a=0, b=1, lty=line.types[1])
208    }
209    if (slope.lambda && Nu>0) {
210      key <- c(key, line.types[2])
211      txt <- c(txt, paste("y = ", format(lambda, digits=4), "x", sep=""))
212      if (!is.null(conc)) {
213        if (Np<N)
214          vert <- c(show, (Np+1):N)
215        else
216          vert <- show
217      }
218      abline(a=0, b=lambda, lty=line.types[2])
219    }
220    if (printpdf) {dev.off()}
221    # Returned value
222
223    if (!is.null(key))
224       legend(0, top, legend=txt, lty=key)
225    c(N=N, omitted=N-Np, lambda=lambda)
226
227  }
228
229"""
230
231
232               
233    
234def makeQQ(dat=[], sample=1.0, maxveclen=4000, fname='fname',title='title',
235           xvar='Sample',h=8,w=8,logscale=True,outdir=None):
236    """
237    y is data for a qq plot and ends up on the x axis go figure
238    if sampling, oversample low values - all the top 1% ?
239    assume we have 0-1 p values
240    """
241    R = []
242    colour="maroon"
243    nrows = len(dat)
244    dat.sort() # small to large
245    fn = float(nrows)
246    unifx = [x/fn for x in range(1,(nrows+1))]
247    if logscale:
248        unifx = [-math.log10(x) for x in unifx] # uniform distribution
249    if sample < 1.0 and len(dat) > maxveclen:
250        # now have half a million markers eg - too many to plot all for a pdf - sample to get 10k or so points
251        # oversample part of the distribution
252        always = min(1000,nrows/20) # oversample smaller of lowest few hundred items or 5%
253        skip = int(nrows/float(maxveclen)) # take 1 in skip to get about maxveclen points
254        if skip <= 1:
255            skip = 2
256        samplei = [i for i in range(nrows) if (i < always) or (i % skip == 0)]
257        # always oversample first sorted (here lowest) values
258        yvec = [dat[i] for i in samplei] # always get first and last
259        xvec = [unifx[i] for i in samplei] # and sample xvec same way
260        maint='QQ %s (random %d of %d)' % (title,len(yvec),nrows)
261    else:
262        yvec = [x for x in dat] 
263        maint='QQ %s (n=%d)' % (title,nrows)
264        xvec = unifx
265    if logscale:
266        maint = 'Log%s' % maint
267        mx = [0,math.log10(nrows)] # if 1000, becomes 3 for the null line
268        ylab = '-log10(%s) Quantiles' % title
269        xlab = '-log10(Uniform 0-1) Quantiles'
270        yvec = [-math.log10(x) for x in yvec if x > 0.0]
271    else:
272        mx = [0,1]
273        ylab = '%s Quantiles' % title
274        xlab = 'Uniform 0-1 Quantiles'
275
276    xv = ['%f' % x for x in xvec]
277    R.append('xvec = c(%s)' % ','.join(xv))
278    yv = ['%f' % x for x in yvec]
279    R.append('yvec = c(%s)' % ','.join(yv))
280    R.append('mx = c(0,%f)' % (math.log10(fn)))
281    R.append('pdf("%s",h=%d,w=%d)' % (fname,h,w))
282    R.append("par(lab=c(10,10,10))")
283    R.append('qqplot(xvec,yvec,xlab="%s",ylab="%s",main="%s",sub="%s",pch=19,col="%s",cex=0.8)' % (xlab,ylab,maint,title,colour))
284    R.append('points(mx,mx,type="l")')
285    R.append('grid(col="lightgray",lty="dotted")')
286    R.append('dev.off()')
287    RRun(rcmd=R,title='makeQQplot',outdir=outdir)
288
289
290
291def main():
292    u = """
293    """
294    u = """<command interpreter="python">
295        rgQQ.py "$input1" "$name" $sample "$cols" $allqq $height $width $logtrans $allqq.id $__new_file_path__ 
296    </command>                                                                                                 
297
298    </command>
299    """
300    print >> sys.stdout,'## rgQQ.py. cl=',sys.argv
301    npar = 11
302    if len(sys.argv) < npar:
303            print >> sys.stdout, '## error - too few command line parameters - wanting %d' % npar
304            print >> sys.stdout, u
305            sys.exit(1)
306    in_fname = sys.argv[1]
307    name = sys.argv[2]
308    sample = float(sys.argv[3])
309    head = None
310    columns = [int(x) for x in sys.argv[4].strip().split(',')] # work with python columns!
311    allout = sys.argv[5]
312    height = int(sys.argv[6])
313    width = int(sys.argv[7])
314    logscale = (sys.argv[8].lower() == 'true')
315    outid = sys.argv[9] # this is used to allow multiple output files 
316    outdir = sys.argv[10]
317    nan_row = False
318    rows = []
319    for i, line in enumerate( file( sys.argv[1] ) ):
320        # Skip comments
321        if  line.startswith( '#' ) or ( i == 0 ):
322            if i == 0:
323                 head = line.strip().split("\t")
324            continue
325        if len(line.strip()) == 0:
326            continue
327        # Extract values and convert to floats
328        fields = line.strip().split( "\t" )
329        row = []
330        nan_row = False
331        for column in columns:
332            if len( fields ) <= column:
333                return fail( "No column %d on line %d: %s" % ( column, i, fields ) )
334            val = fields[column]
335            if val.lower() == "na":
336                nan_row = True
337            else:
338                try:
339                    row.append( float( fields[column] ) )
340                except ValueError:
341                    return fail( "Value '%s' in column %d on line %d is not numeric" % ( fields[column], column+1, i ) )
342        if not nan_row:
343           rows.append( row )
344    if i > 1:
345       i = i-1 # remove header row from count
346    if head == None:
347       head = ['Col%d' % (x+1) for x in columns]
348    R = []
349    for c,column in enumerate(columns): # we appended each column in turn
350        outname = allout
351        if c > 0: # after first time
352            outname = 'primary_%s_col%s_visible_pdf' % (outid,column)
353            outname = os.path.join(outdir,outname)
354        dat = []
355        nrows = len(rows) # sometimes lots of NA's!!
356        for arow in rows:
357           dat.append(arow[c]) # remember, we appended each col in turn!
358        cname = head[column]        
359        makeQQ(dat=dat,sample=sample,fname=outname,title='%s_%s' % (name,cname),
360                   xvar=cname,h=height,w=width,logscale=logscale,outdir=outdir)
361
362
363
364if __name__ == "__main__":
365    main()