/tools/plotting/histogram.py

https://bitbucket.org/cistrome/cistrome-harvard/ · Python · 101 lines · 85 code · 11 blank · 5 comment · 30 complexity · a6b4b022439f285578766354544feb80 MD5


#!/usr/bin/env python
#Greg Von Kuster

import sys
from rpy import *

assert sys.version_info[:2] >= ( 2, 4 )

def stop_err(msg):
    sys.stderr.write(msg)
    sys.exit()

def main():

    # Handle input params
    in_fname = sys.argv[1]
    out_fname = sys.argv[2] 
    try:
        column = int( sys.argv[3] ) - 1
    except:
        stop_err( "Column not specified, your query does not contain a column of numerical data." )
    title = sys.argv[4]
    xlab = sys.argv[5]
    breaks = int( sys.argv[6] )
    if breaks == 0:
        breaks = "Sturges"
    if sys.argv[7] == "true":
        density = True
    else: density = False
    if len( sys.argv ) >= 9 and sys.argv[8] == "true":
        frequency = True
    else: frequency = False

    matrix = []
    skipped_lines = 0
    first_invalid_line = 0
    invalid_value = ''
    i = 0
    for i, line in enumerate( file( in_fname ) ):
        valid = True
        line = line.rstrip('\r\n')
        # Skip comments
        if line and not line.startswith( '#' ): 
            # Extract values and convert to floats
            row = []
            try:
                fields = line.split( "\t" )
                val = fields[column]
                if val.lower() == "na":
                    row.append( float( "nan" ) )
            except:
                valid = False
                skipped_lines += 1
                if not first_invalid_line:
                    first_invalid_line = i+1
            else:
                try:
                    row.append( float( val ) )
                except ValueError:
                    valid = False
                    skipped_lines += 1
                    if not first_invalid_line:
                        first_invalid_line = i+1
                        invalid_value = fields[column]
        else:
            valid = False
            skipped_lines += 1
            if not first_invalid_line:
                first_invalid_line = i+1

        if valid:
            matrix += row

    if matrix:  # plot only when at least one valid value was read
        try:
            a = r.array( matrix )
            r.pdf( out_fname, 8, 8 )
            histogram = r.hist( a, probability=not frequency, main=title, xlab=xlab, breaks=breaks )
            if density:
                density = r.density( a )
                if frequency:
                    scale_factor = len( matrix ) * ( histogram['mids'][1] - histogram['mids'][0] ) #uniform bandwidth taken from first 2 midpoints
                    density[ 'y' ] = map( lambda x: x * scale_factor, density[ 'y' ] )
                r.lines( density )
            r.dev_off()
        except Exception, exc:
            stop_err( "%s" %str( exc ) )
    else:
        if i == 0:
            stop_err("Input dataset is empty.")
        else:
            stop_err( "All values in column %s are non-numeric." %sys.argv[3] )

    print "Histogram of column %s. " %sys.argv[3]
    if skipped_lines > 0:
        print "Skipped %d invalid lines starting with line #%d, '%s'." % ( skipped_lines, first_invalid_line, invalid_value )

    r.quit( save="no" )
    
if __name__ == "__main__":
    main()

Summary ✨

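This is a Python script that plots a histogram of one column of a tab-delimited file using R’s hist function, called through the rpy module. It takes positional command-line arguments: the input file name, the output file name, the 1-based column number, the title, the x-axis label, the number of breaks (0 selects R’s default "Sturges" rule), a density flag, and an optional frequency flag. The script does not prompt for missing arguments: if the column number is missing or not an integer, it writes an error to stderr and exits, and any other missing required argument raises an uncaught IndexError.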
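
The positional interface can be sketched as follows. This is a minimal Python 3 illustration, not the script's own code: the function name `parse_params` and the returned dictionary are hypothetical, and the sketch raises `ValueError` where the script calls `stop_err`.

```python
def parse_params(argv):
    """Parse histogram.py-style positional arguments (hypothetical helper).

    argv mirrors sys.argv: script name, input file, output file, 1-based
    column, title, x-axis label, break count, density flag and an optional
    frequency flag.
    """
    try:
        column = int(argv[3]) - 1  # convert to a 0-based index
    except (IndexError, ValueError):
        raise ValueError("Column not specified or not an integer.")
    breaks = int(argv[6])
    return {
        "in_fname": argv[1],
        "out_fname": argv[2],
        "column": column,
        "title": argv[4],
        "xlab": argv[5],
        # A break count of 0 selects R's default "Sturges" rule.
        "breaks": breaks if breaks != 0 else "Sturges",
        "density": argv[7] == "true",
        # The frequency flag is optional and defaults to False.
        "frequency": len(argv) >= 9 and argv[8] == "true",
    }


params = parse_params(
    ["histogram.py", "in.tab", "out.pdf", "2", "Scores", "score", "0", "true"]
)
print(params["column"], params["breaks"], params["density"], params["frequency"])
```

Only the literal string "true" enables a flag; any other value, or a missing frequency argument, leaves it off.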

The script first asserts that the Python version is at least 2.4. It then reads the input file line by line and converts the value in the specified column to a float. Blank lines, comment lines starting with #, lines that lack the specified column, and lines whose value is not numeric are skipped and counted. If the input is empty or every value in the column is non-numeric, the script writes an error message to stderr and exits without producing a plot.
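
The line-by-line extraction and its bookkeeping can be sketched as below. This is a Python 3 illustration under stated assumptions, not the original code: `extract_column` is a hypothetical name, it takes an iterable of lines rather than a file name, and it simply treats `NA` as non-numeric.

```python
def extract_column(lines, column):
    """Collect floats from one tab-separated column (hypothetical helper).

    Returns (values, skipped_lines, first_invalid_line, invalid_value).
    Blank lines, '#' comments, short rows and non-numeric values all count
    as skipped; line numbers in the report are 1-based.
    """
    values = []
    skipped_lines = 0
    first_invalid_line = 0
    invalid_value = ""
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        value = None
        bad_text = ""
        if line and not line.startswith("#"):
            fields = line.split("\t")
            if column < len(fields):
                try:
                    value = float(fields[column])
                except ValueError:
                    bad_text = fields[column]  # remember the offending text
        if value is None:
            skipped_lines += 1
            if not first_invalid_line:
                first_invalid_line = line_no
                invalid_value = bad_text
        else:
            values.append(value)
    return values, skipped_lines, first_invalid_line, invalid_value


rows = ["# header", "a\t1.5", "b\tfoo", "", "c\t2"]
values, skipped, first_bad, bad_value = extract_column(rows, 1)
print(values, skipped, first_bad, repr(bad_value))
```

In this example the comment on line 1 is the first skipped line, so the recorded invalid value stays empty even though a later line contains non-numeric text.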

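Once the data is extracted, the script builds an R array and plots it with hist into an 8 by 8 inch PDF at the output path. The frequency flag selects counts instead of probability densities on the y-axis. The density flag overlays a kernel density curve, computed with R’s density function, on the histogram; when frequency is also set, the curve is rescaled by the number of values times the bin width so that it matches the counts. Finally, the script prints a one-line report, including how many lines were skipped, and quits R.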
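
When a density curve is drawn over a count histogram, the script multiplies each density height by the number of values times the bin width, taking the width from the first two bin midpoints. A minimal Python 3 sketch of that arithmetic follows; the function name is hypothetical and the numbers are made up for illustration.

```python
def scale_density_to_counts(density_y, n_values, mids):
    """Rescale density heights to the count scale (hypothetical helper).

    Assumes uniform bins, so the bin width is the gap between the first
    two bin midpoints.
    """
    bin_width = mids[1] - mids[0]
    scale_factor = n_values * bin_width
    return [y * scale_factor for y in density_y]


# Made-up example: 200 values in bins of width 0.5.
scaled = scale_density_to_counts([0.125, 0.5, 0.25], 200, [0.25, 0.75, 1.25])
print(scaled)
```

The scaling follows from the fact that a bin's expected count is roughly the density at its center times the bin width times the sample size. It relies on the bins being equally wide.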

Tech Fingerprint

Standard Library: OS Interaction

Alerts (8)

'import *' (line 5): Avoid wildcard imports to prevent namespace pollution; import specific names or use aliases
'def' (lines 9, 13): Ensure functions have docstrings for documentation
'except:' (lines 20, 51): Avoid catching all exceptions; specify exception types to catch only expected errors
Complexity hotspot (lines 29 to 30; total complexity: 3)
'lambda' (line 83): Avoid complex 'lambda' functions; prefer named functions for clarity and debugging