/tools/rgenetics/rgQQ.py

https://bitbucket.org/cistrome/cistrome-harvard/ · Python · 365 lines

"""
oct 2009 - multiple output files
Dear Matthias,
Yes, you can define number of outputs dynamically in Galaxy. For doing
this, you'll have to declare one output dataset in your xml and pass
its ID ($out_file.id) to your python script. Also, set
force_history_refresh="True" in your tool tag in xml, like this:
<tool id="split1" name="Split" force_history_refresh="True">
In your script, if your outputs are named in the following format,
primary_associatedWithDatasetID_designation_visibility_extension
(_DBKEY), all your datasets will show up in the history pane.
associatedWithDatasetID is the $out_file.ID passed from xml;
designation will be a unique identifier for each output (set in your
script);
visibility can be set to visible if you want the dataset visible in
your history, or notvisible otherwise;
extension is the required format for your dataset (bed, tabular, fasta
etc);
DBKEY is optional, and can be set if required (e.g. hg18, mm9 etc).
One of our tools, "MAF to Interval converter" (tools/maf/
maf_to_interval.xml), already uses this feature. You can use it as a
reference.

qq.chisq - Quantile-quantile plot for chi-squared tests

Description
This function plots ranked observed chi-squared test statistics against the
corresponding expected order statistics. It also estimates an inflation (or
deflation) factor, lambda, as the ratio of the trimmed means of observed and
expected values. This is useful for inspecting the results of whole-genome
association studies for overdispersion due to population substructure and
other sources of bias or confounding.

Usage
qq.chisq(x, df=1, x.max, main="QQ plot",
         sub=paste("Expected distribution: chi-squared (",df," df)", sep=""),
         xlab="Expected", ylab="Observed",
         conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5,
         slope.one=FALSE, slope.lambda=FALSE,
         thin=c(0.25,50), oor.pch=24, col.shade="gray", ...)

Arguments
x            A vector of observed chi-squared test values
df           The degrees of freedom for the tests
x.max        If present, truncate the observed value (Y) axis here
main         The main heading
sub          The subheading
xlab         x-axis label (default "Expected")
ylab         y-axis label (default "Observed")
conc         Lower and upper probability bounds for the concentration band of
             the plot. Set this to NA to suppress the band
overdisp     If TRUE, an overdispersion factor, lambda, will be estimated and
             used in calculating the concentration band
trim         Quantile point for trimmed mean calculations for estimation of
             lambda. Default is to trim at the median
slope.one    Is a line of slope one to be superimposed?
slope.lambda Is a line of slope lambda to be superimposed?
thin         A pair of numbers indicating how points will be thinned before
             plotting (see Details). If NA, no thinning will be carried out
oor.pch      Observed values greater than x.max are plotted at x.max. This
             argument sets the plotting symbol to be used for out-of-range
             observations
col.shade    The colour with which the concentration band will be filled
...          Further graphical parameter settings to be passed to points()

Details
To reduce plotting time and the size of plot files, the smallest observed and
expected points are thinned so that only a reduced number of (approximately
equally spaced) points are plotted. The precise behaviour is controlled by
the parameter thin, whose value should be a pair of numbers. The first number
must lie between 0 and 1 and sets the proportion of the X axis over which
thinning is to be applied. The second number should be an integer and sets
the maximum number of points to be plotted in this section.
The "concentration band" for the plot is shown in grey. This region is
defined by upper and lower probability bounds for each order statistic; the
default is to use the 2.5 and 97.5 percent points. Note that this is not a
simultaneous confidence region; the probability that the plot will stray
outside the band at some point exceeds 95 percent.
When required, the dispersion factor is estimated by the ratio of the
observed trimmed mean to its expected value under the chi-squared assumption.

Value
The function returns the number of tests, the number of values omitted from
the plot (greater than x.max), and the estimated dispersion factor, lambda.

Note
All tests must have the same number of degrees of freedom. If this is not
the case, I suggest transforming to p-values and then plotting -2log(p) as
chi-squared on 2 df.

Author(s)
David Clayton <david.clayton@cimr.cam.ac.uk>

References
Devlin, B. and Roeder, K. (1999) Genomic control for association studies.
Biometrics, 55:997-1004
"""
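The `primary_<id>_<designation>_<visibility>_<extension>(_<DBKEY>)` naming
convention quoted above is mechanical enough to sketch as a small helper. This
is only an illustration: the function name and signature are made up here, not
part of this tool or of the Galaxy API.

```python
def galaxy_output_name(dataset_id, designation, visible=True, ext="pdf", dbkey=None):
    """Build a Galaxy 'primary' dataset filename of the form
    primary_<id>_<designation>_<visible|notvisible>_<extension>[_<dbkey>]
    so extra outputs show up in the history pane (hypothetical helper)."""
    parts = ["primary", str(dataset_id), str(designation),
             "visible" if visible else "notvisible", ext]
    if dbkey:  # DBKEY is optional (e.g. hg18, mm9)
        parts.append(dbkey)
    return "_".join(parts)
```

This mirrors the `'primary_%s_col%s_visible_pdf'` pattern used later in
`main()` for the second and subsequent output PDFs.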
import sys, random, math, copy, os, subprocess, tempfile
from rgutils import RRun, rexe
rqq = """
# modified by ross lazarus for the rgenetics project may 2000
# makes a pdf for galaxy from an x vector of chisquare values
# from snpMatrix
# http://www.bioconductor.org/packages/bioc/html/snpMatrix.html
qq.chisq <-
  function(x, df=1, x.max,
           main="QQ plot",
           sub=paste("Expected distribution: chi-squared (",df," df)", sep=""),
           xlab="Expected", ylab="Observed",
           conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5,
           slope.one=T, slope.lambda=FALSE,
           thin=c(0.5,200), oor.pch=24, col.shade="gray", ofname="qqchi.pdf",
           h=6, w=6, printpdf=F, ...) {
    # Function to shade concentration band
    shade <- function(x1, y1, x2, y2, color=col.shade) {
      n <- length(x2)
      polygon(c(x1, x2[n:1]), c(y1, y2[n:1]), border=NA, col=color)
    }
    # Sort values and see how many out of range
    obsvd <- sort(x, na.last=NA)
    N <- length(obsvd)
    if (missing(x.max)) {
      Np <- N
    }
    else {
      Np <- sum(obsvd<=x.max)
    }
    if (Np==0)
      stop("Nothing to plot")
    # Expected values
    if (df==2) {
      expctd <- 2*cumsum(1/(N:1))
    }
    else {
      expctd <- qchisq(p=(1:N)/(N+1), df=df)
    }
    # Concentration bands
    if (!is.null(conc)) {
      if (conc[1]>0) {
        e.low <- qchisq(p=qbeta(conc[1], 1:N, N:1), df=df)
      }
      else {
        e.low <- rep(0, N)
      }
      if (conc[2]<1) {
        e.high <- qchisq(p=qbeta(conc[2], 1:N, N:1), df=df)
      }
      else {
        e.high <- 1.1*rep(max(x), N)
      }
    }
    # Plot outline
    if (Np < N)
      top <- x.max
    else
      top <- obsvd[N]
    right <- expctd[N]
    if (printpdf) {pdf(ofname, h, w)}
    plot(c(0, right), c(0, top), type="n", xlab=xlab, ylab=ylab,
         main=main, sub=sub)
    # Thinning
    if (is.na(thin[1])) {
      show <- 1:Np
    }
    else if (length(thin)!=2 || thin[1]<0 || thin[1]>1 || thin[2]<1) {
      warning("invalid thin parameter; no thinning carried out")
      show <- 1:Np
    }
    else {
      space <- right*thin[1]/floor(thin[2])
      iat <- round((N+1)*pchisq(q=(1:floor(thin[2]))*space, df=df))
      if (max(iat)>thin[2])
        show <- unique(c(iat, (1+max(iat)):Np))
      else
        show <- 1:Np
    }
    Nu <- floor(trim*N)
    if (Nu>0)
      lambda <- mean(obsvd[1:Nu])/mean(expctd[1:Nu])
    if (!is.null(conc)) {
      if (Np<N)
        vert <- c(show, (Np+1):N)
      else
        vert <- show
      if (overdisp)
        shade(expctd[vert], lambda*e.low[vert],
              expctd[vert], lambda*e.high[vert])
      else
        shade(expctd[vert], e.low[vert], expctd[vert], e.high[vert])
    }
    points(expctd[show], obsvd[show], ...)
    # Overflow
    if (Np<N) {
      over <- (Np+1):N
      points(expctd[over], rep(x.max, N-Np), pch=oor.pch)
    }
    # Lines
    line.types <- c("solid", "dashed", "dotted")
    key <- NULL
    txt <- NULL
    if (slope.one) {
      key <- c(key, line.types[1])
      txt <- c(txt, "y = x")
      abline(a=0, b=1, lty=line.types[1])
    }
    if (slope.lambda && Nu>0) {
      key <- c(key, line.types[2])
      txt <- c(txt, paste("y = ", format(lambda, digits=4), "x", sep=""))
      abline(a=0, b=lambda, lty=line.types[2])
    }
    # legend must go onto the plot before the pdf device is closed
    if (!is.null(key))
      legend(0, top, legend=txt, lty=key)
    if (printpdf) {dev.off()}
    # Returned value
    c(N=N, omitted=N-Np, lambda=lambda)
  }
"""
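The lambda (inflation factor) estimate in `qq.chisq` is just a ratio of
trimmed means of observed versus expected order statistics. A minimal
pure-Python analogue for the df=2 case, where the expected order statistics
have the closed form `2*cumsum(1/(N:1))` used in the R code above, might look
like this (the function name is invented for illustration):

```python
def lambda_estimate_df2(observed, trim=0.5):
    """Estimate the inflation factor lambda as qq.chisq does for df=2:
    ratio of the trimmed means of observed and expected
    chi-squared(2 df) order statistics. Returns None if nothing to trim."""
    obs = sorted(x for x in observed if x == x)  # drop NaN, sort ascending
    N = len(obs)
    # expected order statistics for chi-squared on 2 df: 2*cumsum(1/(N:1))
    expctd = []
    s = 0.0
    for k in range(N, 0, -1):
        s += 1.0 / k
        expctd.append(2.0 * s)
    Nu = int(trim * N)  # floor(trim*N), as in the R code
    if Nu == 0:
        return None
    return (sum(obs[:Nu]) / Nu) / (sum(expctd[:Nu]) / Nu)
```

With well-calibrated test statistics lambda is close to 1; values well above
1 suggest overdispersion (e.g. population substructure), which is the point
of the Devlin and Roeder genomic-control reference in the docstring.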
def makeQQ(dat=[], sample=1.0, maxveclen=4000, fname='fname', title='title',
           xvar='Sample', h=8, w=8, logscale=True, outdir=None):
    """
    y is data for a qq plot and ends up on the x axis go figure
    if sampling, oversample low values - all the top 1% ?
    assume we have 0-1 p values
    """
    R = []
    colour = "maroon"
    nrows = len(dat)
    dat.sort() # small to large
    fn = float(nrows)
    unifx = [x/fn for x in range(1, (nrows+1))]
    if logscale:
        unifx = [-math.log10(x) for x in unifx] # uniform distribution
    if sample < 1.0 and len(dat) > maxveclen:
        # now have half a million markers eg - too many to plot all for a pdf - sample to get 10k or so points
        # oversample part of the distribution
        always = min(1000, nrows/20) # oversample smaller of lowest few hundred items or 5%
        skip = int(nrows/float(maxveclen)) # take 1 in skip to get about maxveclen points
        if skip <= 1:
            skip = 2
        samplei = [i for i in range(nrows) if (i < always) or (i % skip == 0)]
        # always oversample first sorted (here lowest) values
        yvec = [dat[i] for i in samplei] # always get first and last
        xvec = [unifx[i] for i in samplei] # and sample xvec same way
        maint = 'QQ %s (random %d of %d)' % (title, len(yvec), nrows)
    else:
        yvec = [x for x in dat]
        maint = 'QQ %s (n=%d)' % (title, nrows)
        xvec = unifx
    if logscale:
        maint = 'Log%s' % maint
        mx = [0, math.log10(nrows)] # if 1000, becomes 3 for the null line
        ylab = '-log10(%s) Quantiles' % title
        xlab = '-log10(Uniform 0-1) Quantiles'
        yvec = [-math.log10(x) for x in yvec if x > 0.0]
    else:
        mx = [0, 1]
        ylab = '%s Quantiles' % title
        xlab = 'Uniform 0-1 Quantiles'
    xv = ['%f' % x for x in xvec]
    R.append('xvec = c(%s)' % ','.join(xv))
    yv = ['%f' % x for x in yvec]
    R.append('yvec = c(%s)' % ','.join(yv))
    R.append('mx = c(%f,%f)' % (mx[0], mx[1])) # null line endpoints computed above for either scale
    R.append('pdf("%s",h=%d,w=%d)' % (fname, h, w))
    R.append("par(lab=c(10,10,10))")
    R.append('qqplot(xvec,yvec,xlab="%s",ylab="%s",main="%s",sub="%s",pch=19,col="%s",cex=0.8)' % (xlab, ylab, maint, title, colour))
    R.append('points(mx,mx,type="l")')
    R.append('grid(col="lightgray",lty="dotted")')
    R.append('dev.off()')
    RRun(rcmd=R, title='makeQQplot', outdir=outdir)
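The down-sampling scheme in `makeQQ` (keep all of the low tail, then every
`skip`-th point overall) can be isolated into a standalone helper with the same
arithmetic; the name `qq_sample_indices` is hypothetical, chosen here just to
show the index-selection step on its own:

```python
def qq_sample_indices(nrows, maxveclen=4000):
    """Pick row indices to plot the way makeQQ samples: keep every index
    below 'always' (oversampling the smallest sorted values) plus every
    'skip'-th index overall, so roughly maxveclen points survive."""
    always = min(1000, nrows // 20)           # lowest few hundred items or 5%
    skip = max(2, int(nrows / float(maxveclen)))  # take 1 in skip rows
    return [i for i in range(nrows) if i < always or i % skip == 0]
```

Because the data are sorted ascending before sampling, the retained low tail
corresponds to the smallest p-values, i.e. the most interesting corner of the
QQ plot.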
def fail(msg):
    # report a fatal input problem and stop with a nonzero exit code
    print >> sys.stdout, msg
    sys.exit(1)

def main():
    u = """<command interpreter="python">
        rgQQ.py "$input1" "$name" $sample "$cols" $allqq $height $width $logtrans $allqq.id $__new_file_path__
    </command>
    """
    print >> sys.stdout, '## rgQQ.py. cl=', sys.argv
    npar = 11
    if len(sys.argv) < npar:
        print >> sys.stdout, '## error - too few command line parameters - wanting %d' % npar
        print >> sys.stdout, u
        sys.exit(1)
    in_fname = sys.argv[1]
    name = sys.argv[2]
    sample = float(sys.argv[3])
    head = None
    columns = [int(x) for x in sys.argv[4].strip().split(',')] # work with python columns!
    allout = sys.argv[5]
    height = int(sys.argv[6])
    width = int(sys.argv[7])
    logscale = (sys.argv[8].lower() == 'true')
    outid = sys.argv[9] # this is used to allow multiple output files
    outdir = sys.argv[10]
    nan_row = False
    rows = []
    for i, line in enumerate(file(in_fname)):
        # Skip comments and remember the header row
        if line.startswith('#') or (i == 0):
            if i == 0:
                head = line.strip().split("\t")
            continue
        if len(line.strip()) == 0:
            continue
        # Extract values and convert to floats
        fields = line.strip().split("\t")
        row = []
        nan_row = False
        for column in columns:
            if len(fields) <= column:
                return fail("No column %d on line %d: %s" % (column, i, fields))
            val = fields[column]
            if val.lower() == "na":
                nan_row = True
            else:
                try:
                    row.append(float(fields[column]))
                except ValueError:
                    return fail("Value '%s' in column %d on line %d is not numeric" % (fields[column], column+1, i))
        if not nan_row:
            rows.append(row)
    if i > 1:
        i = i - 1 # remove header row from count
    if head is None:
        head = ['Col%d' % (x+1) for x in columns]
    for c, column in enumerate(columns): # we appended each column in turn
        outname = allout
        if c > 0: # after first time
            outname = 'primary_%s_col%s_visible_pdf' % (outid, column)
            outname = os.path.join(outdir, outname)
        dat = []
        nrows = len(rows) # sometimes lots of NA's!!
        for arow in rows:
            dat.append(arow[c]) # remember, we appended each col in turn!
        cname = head[column]
        makeQQ(dat=dat, sample=sample, fname=outname, title='%s_%s' % (name, cname),
               xvar=cname, h=height, w=width, logscale=logscale, outdir=outdir)

if __name__ == "__main__":
    main()