/scripts/tools/annotation_profiler/README.txt

https://bitbucket.org/cistrome/cistrome-harvard/
This file explains how to create annotation indexes for the annotation profiler tool. Annotation profiler indexes use an exceedingly simple binary format
with no header information: each index is an ordered linear list of (start, stop) regions covered by a UCSC table, partitioned by chromosome name.
Each start and stop value is encoded individually as '<I' (Python struct notation for a little-endian unsigned 32-bit integer).
Genomic regions are merged by overlap / direct adjacency (e.g. a table having ranges of: 1-10, 6-12, 12-20 and 25-28 results in two merged ranges of: 1-20 and 25-28).
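
The merging rule and the '<I' encoding described above can be sketched in a few lines of Python. This is only an illustration, not the code used by build_profile_indexes.py: the function names are made up for this example, and "direct adjacency" is assumed to mean that one region starts exactly where the previous one stops (as with 6-12 and 12-20 above).

```python
import struct


def merge_regions(regions):
    """Merge (start, stop) pairs that overlap or touch end-to-start."""
    merged = []
    for start, stop in sorted(regions):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous region: extend that region.
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    return [tuple(region) for region in merged]


def encode_regions(regions):
    """Pack every start and stop as its own '<I' value, with no header."""
    return b"".join(struct.pack("<I", value)
                    for region in regions for value in region)


def decode_regions(data):
    """Inverse of encode_regions: rebuild the (start, stop) pairs."""
    values = [item[0] for item in struct.iter_unpack("<I", data)]
    return list(zip(values[0::2], values[1::2]))


# The example ranges from the paragraph above.
merged = merge_regions([(1, 10), (6, 12), (12, 20), (25, 28)])
payload = encode_regions(merged)
```

Each merged region occupies 8 bytes (two 4-byte integers), so the two regions in this example encode to a 16-byte payload.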

Files are arranged like:
/profiled_annotations/DBKEY/TABLE_NAME/
                                       CHROMOSOME_NAME.covered
                                       CHROMOSOME_NAME.total_coverage
                                       CHROMOSOME_NAME.total_regions
/profiled_annotations/DBKEY/
                            DBKEY_tables.xml
                            chromosomes.txt
                            profiled_info.txt

where CHROMOSOME_NAME.covered is the binary file, CHROMOSOME_NAME.total_coverage is a text file containing the integer count of bases covered by the
table and CHROMOSOME_NAME.total_regions contains the integer count of the number of regions found in CHROMOSOME_NAME.covered.
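
As a rough sanity check, the three per-chromosome files can be compared against each other. The sketch below is not part of the profiler. Its helper names and the chrDemo files are invented for illustration, and it assumes regions are half-open, so that a region covers stop - start bases; if the real indexes use a different convention, the coverage sum needs adjusting.

```python
import os
import struct
import tempfile


def read_covered(path):
    """Return the (start, stop) pairs stored in a .covered file."""
    with open(path, "rb") as handle:
        data = handle.read()
    # The file is a headerless run of '<I' values: start, stop, start, stop, ...
    values = [item[0] for item in struct.iter_unpack("<I", data)]
    return list(zip(values[0::2], values[1::2]))


def read_count(path):
    """Read the single integer held in a total_* text file."""
    with open(path) as handle:
        return int(handle.read().strip())


def check_chromosome(table_dir, chrom):
    """Compare a .covered file against its two companion count files."""
    base = os.path.join(table_dir, chrom)
    regions = read_covered(base + ".covered")
    # ASSUMPTION: regions are half-open, so each covers stop - start bases.
    coverage = sum(stop - start for start, stop in regions)
    return (len(regions) == read_count(base + ".total_regions")
            and coverage == read_count(base + ".total_coverage"))


# Build a tiny made-up table directory so the sketch runs on its own.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "chrDemo.covered"), "wb") as out:
    for value in (1, 20, 25, 28):
        out.write(struct.pack("<I", value))
with open(os.path.join(demo_dir, "chrDemo.total_regions"), "w") as out:
    out.write("2\n")
with open(os.path.join(demo_dir, "chrDemo.total_coverage"), "w") as out:
    out.write("22\n")

demo_ok = check_chromosome(demo_dir, "chrDemo")
```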

DBKEY_tables.xml should be appended to the annotation profiler's available-tables configuration file (tool-data/annotation_profiler_options.xml).
The DBKEY should also be added as a new line to the annotation profiler valid builds file (annotation_profiler_valid_builds.txt).
The output (/profiled_annotations/DBKEY) should be made available as GALAXY_ROOT/tool-data/annotation_profiler/DBKEY.

profiled_info.txt contains info on the generated annotations, one tab-delimited label/value pair per line:
        profiler_version - the version of the build_profile_indexes.py script that was used to generate the profiled data
        dbkey - the dbkey used for the run
        chromosomes - the names and lengths of chromosomes that were used to parse single-chromosome tables (tables divided into individual files by chromosome)
        dump_time - the declared dump time of the database, taken from trackDb.txt.gz
        profiled_time - seconds since the epoch (UTC) at which the database dump was profiled
        database_hash - an MD5 hex digest of all the profiled table info
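
Since profiled_info.txt is described as tab-delimited label/value lines, it can be read into a dictionary with very little code. The following is a minimal sketch under that assumption. The function name is hypothetical and the sample values are placeholders, not output from a real run.

```python
def parse_profiled_info(text):
    """Turn tab-delimited label/value lines into a dict of strings."""
    info = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so values may contain further tabs.
        label, _, value = line.partition("\t")
        info[label] = value
    return info


# Placeholder content; real files hold the values written by the script.
sample = "profiler_version\t0.0.0\ndbkey\thg19\nprofiled_time\t1300000000\n"
info = parse_profiled_info(sample)
profiled_at = int(info["profiled_time"])  # seconds since the epoch, UTC
```

Values are kept as strings because the exact layout of fields such as chromosomes is not specified here; convert individual fields as needed.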

Typical usage looks like:

python build_profile_indexes.py -d hg19 -i /ucsc_data/hg19/database/ > hg19.txt

where the genome build is hg19 and /ucsc_data/hg19/database/ contains the downloaded database dump from UCSC (e.g. obtained by rsync: rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ /ucsc_data/hg19/database/).

By default, chromosome names come from a file named 'chromInfo.txt.gz' found in the input directory, with FTP used as a backup.
When FTP is used to obtain the names of chromosomes from UCSC for a particular genome build, alternate FTP sites and paths can be specified with the --ftp_site and --ftp_path options.
Chromosome names can instead be provided on the command line via the --chromosomes option, which accepts a comma-separated list of: ChromName1[=length],ChromName2[=length],...
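
To make the --chromosomes syntax concrete, here is one way such a value could be split into names and optional lengths. This is an illustrative sketch only and may differ from how build_profile_indexes.py handles the option. The function name, the chrA/chrB names and the default_length parameter are assumptions made for this example.

```python
def parse_chromosomes(spec, default_length=None):
    """Parse 'ChromName1[=length],ChromName2[=length],...' into a dict."""
    chromosomes = {}
    for field in spec.split(","):
        field = field.strip()
        if not field:
            continue  # tolerate empty input and trailing commas
        name, separator, length = field.partition("=")
        # ASSUMPTION: a name given without '=length' gets default_length.
        chromosomes[name] = int(length) if separator else default_length
    return chromosomes


# Made-up chromosome names and lengths, for illustration only.
parsed = parse_chromosomes("chrA=1000,chrB")
```

Here chrA gets an explicit length of 1000 while chrB falls back to default_length, loosely mirroring the --bitset_size help text, which says the default size is overridden by lengths given via --chromosomes.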

Available options:

    usage = "usage: %prog options"
    parser = OptionParser( usage=usage )
    parser.add_option( '-d', '--dbkey', dest='dbkey', default='hg18', help='dbkey to process' )
    parser.add_option( '-i', '--input_dir', dest='input_dir', default=os.path.join( 'golden_path','%s', 'database' ), help='Input Directory' )
    parser.add_option( '-o', '--output_dir', dest='output_dir', default=os.path.join( 'profiled_annotations','%s' ), help='Output Directory' )
    parser.add_option( '-c', '--chromosomes', dest='chromosomes', default='', help='Comma separated list of: ChromName1[=length],ChromName2[=length],...' )
    parser.add_option( '-b', '--bitset_size', dest='bitset_size', default=DEFAULT_BITSET_SIZE, type='int', help='Default BitSet size; overridden by sizes specified in chromInfo.txt.gz or by --chromosomes' )
    parser.add_option( '-f', '--ftp_site', dest='ftp_site', default='hgdownload.cse.ucsc.edu', help='FTP site; used for chromosome info when chromInfo.txt.gz method fails' )
    parser.add_option( '-p', '--ftp_path', dest='ftp_path', default='/goldenPath/%s/chromosomes/', help='FTP Path; used for chromosome info when chromInfo.txt.gz method fails' )