{ "info": { "author": "James Blachly", "author_email": "james blachly at gmail com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Environment :: Console", "Intended Audience :: Healthcare Industry", "Intended Audience :: Science/Research", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "Natural Language :: English", "Operating System :: OS Independent", "Programming Language :: Python", "Topic :: Scientific/Engineering :: Bio-Informatics" ], "description": "Contents\n========\n\n`Introduction <#introduction>`__\n\n`Requirements <#requirements>`__\n\n`Installation <#installation>`__\n\n`Setup and Operation <#setup-and-operation>`__\n\n`DepthGauge <#depthgauge>`__\n\n`Known Issues <#known-issues>`__\n\n`Bug Reports and Issue Tracking <#bug-reports-and-issue-tracking>`__\n\n`Contact <#contact>`__\n\nIntroduction\n============\n\nmucor is software to aggregate variant information sourced from multiple\nVCF files (and some others, see below) into a variety of summary files\nwith varying levels of detail and statistics. The outputs range from\nhigh-level summaries to full, detailed reports in text and Microsoft\nExcel formats. The intended audience is both biologists and\nbioinformaticians.\n\nRequirements\n============\n\nPython modules\n--------------\n\nThe following modules are required:\n\n- numpy (http://www.numpy.org/)\n- pandas (http://pandas.pydata.org/)\n- HTSeq (http://www-huber.embl.de/HTSeq/)\n\nThe following modules are optional:\n\n- pytabix (https://github.com/slowkow/pytabix)\n- XlsxWriter (https://github.com/jmcnamara/XlsxWriter)\n- xlwt (https://pypi.python.org/pypi/xlwt)\n\nAdditional tools\n----------------\n\n- bgzip and Tabix (optional, see *Databases*, below;\n https://github.com/samtools/htslib)\n\nAnnotation\n----------\n\nmucor requires a reference annotation file (in GTF or GFF3 format) for\nfeature definition. 
Contig (chromosome) names must match between the\ninput VCF and reference annotation file; for example, for *Homo\nsapiens*, UCSC uses the form **chr17** while Ensembl uses the form\n**17**.\n\nProtip: If your human data are aligned and variants called with *chr*\nstyle contig names but you wish to use Ensembl genes, you can try the\nGENCODE annotation, which contains most Ensembl genes but uses *chr*\nstyle contig naming.\n\nDatabases\n---------\n\nmucor can *optionally* check variants against any number of supplied\ndatabases and report presence (including identifier) or absence in the\ndatabase. Common choices would be one or more releases of dbSNP (e.g.,\ndbSNP 137 and latest dbSNP; dbSNP clinvar and dbSNP all), 1000 Genomes,\nNHLBI 6500 Exomes, or COSMIC. Users may also supply their own custom\ndatabases.\n\nDatabases must meet the following requirements:\n\n- conform to the VCF standard format\n- be compressed with bgzip (*not* gzip; see below)\n- be indexed with tabix (see below)\n\nbgzip and tabix provide massive speedups when looking up variants in these\nexternal files. The database lookup feature is not available if files\nare not correctly bgzipped or tabix indexed, or if the pytabix module is\nnot installed. ``bgzip`` and ``tabix`` are available as part of\n``htslib`` (https://github.com/samtools/htslib).\n\nDatabase preparation\n~~~~~~~~~~~~~~~~~~~~\n\nBeginning with an example VCF ``dbSNP.vcf``, execute the following:\n\n::\n\n $ bgzip dbSNP.vcf\n $ tabix -p vcf dbSNP.vcf.gz\n\nNote that bgzipped files can be treated as regular .gz files by other\ntools.\n\nVariant call data\n-----------------\n\nThese are the data that you wish to aggregate and summarize. 
In\ntheory, mucor will work with any well-formed generic VCF, but it has\nbeen tested with and contains specific code to handle VCF files\ngenerated by the following tools:\n\n- snpEff\n- Ion Torrent PGM (default machine output)\n- Illumina Miseq (default machine output)\n- GATK\n- GATK SomaticIndelDetector\n- GATK HaplotypeCaller\n- MuTect\n- VarScan\n- FreeBayes\n- Samtools\n\nIn addition, mucor can read the more detailed .out files produced by\nMuTect.\n\nInstallation\n============\n\nclone from https://github.com/blachlylab\n\nSetup and Operation\n===================\n\nOverview\n--------\n\nRunning mucor to aggregate and summarize variants is a two step process:\n\n1. Project setup / configuration\n\n - ``mucor_config.py`` creates a JSON config file\n\n2. Variant Aggregation!\n\n - ``mucor.py``\n\nProject setup\n-------------\n\nConfiguration with ``mucor_config.py``\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nThe configuration step is completed using the provided\n``mucor_config.py`` utility. It accepts the following command line\narguments and creates a JSON file that will be passed to the main mucor\nscript.\n\n``-ex, --example`` Print a valid, example JSON config file and exit.\nFunction that will write a template of the JSON config. It can be edited\nmanually and supplied to mucor.\n\n``-g GFF, --gff GFF`` Reference annotation GFF/GTF for feature binning.\nRequired\n\n``-f FEATURETYPE, --featuretype FEATURETYPE`` Feature type into which to\nbin. Gencode GTF example: gene\\_name, gene\\_id, transcript\\_name,\ntranscript\\_id, etc. Required\n\n``-db DATABASES, --databases DATABASES`` Colon delimited name and path\nto variant database in bgzipped VCF format. Can be declared >= 0 times.\nEx: -db name1:/full/user/path/name1.vcf.gz. Optional\n\n``-s SAMPLES, --samples SAMPLES`` Text file containing sample names. One\nsample per line. ``mucor_config.py`` attempts to guess which files\nbelong with which sample IDs using globbing (wildcard filename\nmatching). 
This means that the auto-configuration may be incorrect if\nany sample names are contained within another sample name. Ex: U-23 and\nU-238. U-23 would erroneously identify U-238 files, requiring manual\nmodification of the JSON file. Sample IDs U-023 and U-238 would not\nexhibit this problem. Required\n\n``-d PROJECT_DIRECTORY, --project_directory PROJECT_DIRECTORY``\nWorking/project directory, in which to find variant call files to\naggregate. Variant calls can be in the provided directory, or any of its\nsubdirectories. Default: current working directory\n\n``-vcff VCF_FILTERS, --vcf_filters VCF_FILTERS`` Comma separated list of\nVCF filters to allow. Default: PASS\n\n``-a ARCHIVE_DIRECTORY, --archive_directory ARCHIVE_DIRECTORY`` Specify\ndirectory in which to read/write archived annotations. This step will\nsignificantly speed up future runs that use the same annotation and\nfeature type, even if the sample data changes. Undefined will prevent\nusing the annotation archive features. Optional\n\n``-r REGIONS, --regions REGIONS`` Comma separated list of bed regions\nand/or bed files by which to limit output. Bed regions can be specific\npositions, or entire chromosomes. Ex:\nchr1:10230-10240,chr2,my\\_regions.bed. Optional\n\n``-u, --union`` Join all items with same ID for feature\\_type (specified\nby -f) into a single, continuous bin. For example, if you want intronic\nvariants counted with a gene, use this option. WARNING, this will lead\nto spurious results for features that are duplicated on the same contig.\nWhen feature names are identical, the bin will range from the beginning\nof the first instance to the end of the last, even if they are several\nmegabases apart. Refer to the documentation for a resolution using\n'detect\\_union\\_bin\\_errors.py.' Optional.\n\n``-jco JSON_CONFIG_OUTPUT, --json_config_output JSON_CONFIG_OUTPUT``\nName of JSON configuration output file. This is the configuration file\nfed into mucor. 
Required\n\n``-outd OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY`` Name of\ndirectory in which to write mucor output. Required\n\n``-outt OUTPUT_TYPE, --output_type OUTPUT_TYPE`` Comma separated list of\ndesired output types. Options include: counts, txt, longtxt, xls,\nlongxls, bed, featXsamp, featmutXsamp, all. Default: counts,txt (See the\ndetailed description of `Output File Formats <#output-file-formats>`__,\nbelow)\n\nExample:\n\n::\n\n python ./mucor_config.py -g ~/references/human/gencode/gencode.v19.annotation.gtf -f gene_name -s samples.txt -a ./fast -u -jco mucor_config.json -outd ./mucor_output -outt all -db 1000G:~/references/human/1000_genomes/chrm_1000Genomes.20130502.genotypes.vcf.gz -db dbsnp:~/references/human/dbsnp/common_all.hg19.sorted.leftalign.vcf.gz\n\nJSON config file\n~~~~~~~~~~~~~~~~\n\n``mucor_config.py`` produces a JSON-formatted configuration file\nembodying the selected options for the subsequent mucor run. The\nconfiguration could be edited manually (e.g., to tweak a database name\nor path) or programmatically (e.g., if ``mucor_config.py``'s guesses\nabout sample IDs were incomplete) at this stage.\n\nAlternatively, ``mucor_config.py -ex`` produces a syntactically valid\nexample JSON file for editing.\n\nProducing a configuration file is intended to facilitate the following:\n\\* Consistency between runs \\* Documentation of settings \\* Easier use\nwith dozens to thousands of input files\n\nVariant Aggregation!\n--------------------\n\nExecution: ``mucor.py ``\n\nmucor is executed by launching the main ``mucor.py`` script, passing as\na single parameter the name of the previously generated/edited JSON\nfile.\n\nOutput files of the specified type are placed in the output directory\nspecified during configuration.\n\nOutput File Formats\n-------------------\n\nOutput types are specified at the time of configuration. 
The user may\nselect any number and combination of output types from the list below.\n\n**all** Execute all output types\n\n**counts** Print counts of mutations per feature. Output: ``counts.txt``\n\n**txt** Print all information about each variant, one variant per row,\nirrespective of how many samples it appears in. Useful for\nvariant-centric projects. Identical to xls in layout. Output:\n``variant_details.txt``\n\n**longtxt** Similar to txt above, but writes each instance of a variant\nto a new row. Each variant is written once per source file, instead of\ncombining recurrent variations into 1 unique row. Identical to longxls\nin layout. Output: ``long_variant_details.txt``\n\n**xls** Print all information about each variant, one variant per row,\nirrespective of how many samples it appears in. Useful for\nvariant-centric projects. Identical to txt in layout. Output:\n``variant_details.xls/xlsx``\n\n**longxls** Similar to xls above, but writes each instance of a variant\nto a new row. Each variant is written once per source file, instead of\ncombining recurrent variations into 1 unique row. Identical to longtxt\nin layout. Output: ``long_variant_details.xls/xlsx``\n\n*NB*: The XLS format has a hard limit of 2^16 rows; in long record\nformat, a moderately sized study could reach this (2,048 total\nvariants/sample \\* 32 samples = 65,536 rows). mucor can use Python's\n``xlwt`` module to write .xls format, but it is preferable to have\n``XlsxWriter`` or ``openpyxl`` installed for .xlsx support.\n\n**bed** Print bed file of the variant locations. Output:\n``variant_locations.bed``\n\n**vcf** Print vcf file of the variant locations, features, depths, and\nvariant frequencies. Output: ``variant_locations.vcf``\n\n**featXsamp** Print table of mutation counts per feature per sample.\nSamples are in columns, while features are in rows. The table values are\nthe counts of unique mutations per sample per feature. 
This output is\nuseful for examining patterns in variation across samples, for example,\nto look at combinatoric mutation status for selected recurrently mutated\ngenes. This output could be used directly to make a heatmap. Output:\n``feature_by_sample.txt``\n\n**mutXsamp** Print table of mutations per sample. Unlike **featXsamp**,\nthis differentiates among different variants within the same features.\nFor example, in acute leukemia, the functional effect of mutations in\nDNMT3A depends on whether it is an R882 mutation or non-R882 mutation.\nAs before, samples are in columns, with features in rows. However, rows\n2-4 contain information about chromosome, position, ref, and alt. The\ntable values are boolean: 1 for present mutation, 0 for missing\nmutation. This output could be used directly or with appropriate\nfiltering to make an Oncoprint. Output:\n``feature_and_mutation_by_sample.txt``\n\n**mutXsampVAF** Print table of mutations per sample. Identical to\n**mutXsamp** except boolean values are replaced by the respective\nvariant VAF. Output: ``feature_and_mutation_by_sample.txt``\n\nDepthGauge\n==========\n\nDepthGauge, a companion utility to mucor, measures coverage at regions\nof interest to increase confidence in variant calls. The purpose of this\ntool is not only to verify sufficient depth at mutations, but more\nimportantly, to verify sufficient depth in places that are not called\nmutant. The lack of a mutation call may indicate no mutation, or simply\nthat the region of interest had insufficient coverage for analysis. Like\nmucor, DepthGauge is dependent only on the JSON formatted configuration\nfile which may be entirely hand created, entirely automatically\ngenerated by mucor\\_config.py, or automatically generated and\nhand-tuned. 
DepthGauge has three additional options, the last of which\ncan override the regions, if any, specified in the JSON configuration\nfile.\n\ndepth\\_gauge.py config\\_file.json [-p] [-c] [-r REGIONS]\n\n-p, --point Instead of reporting an average depth for every location\nwithin a window (default), take the middle coordinate within the range\nand calculate the depth at that point as a surrogate for the entire\nregion.\n\n-c, --count Rather than reporting an average depth for every location\nwithin a window (default), or taking a point estimate from the middle\n(-p), count the total number of reads mapped within the region.\n\n-r REGIONS This option specifies a list of regions to query for depth.\nIf present, it overrides the region(s) specified in the JSON\nconfiguration file. As an alternative, the JSON configuration file could\nbe edited before running DepthGauge.\n\nOutput: Depth\\_of\\_Coverage.xlsx\n\nKnown Issues\n============\n\nThe --union feature will behave inappropriately when genomic feature\nnames are duplicated on the same contig. For example, if gene \"ABC\" is\nduplicated at the beginning and the end of chromosome 1, the feature bin\nfor gene \"ABC\" will cover the whole contig (from the beginning of the\nfirst copy of \"ABC\" to the end of the last copy). Users may select\nanother feature type, such as gene\\_id, which is unique to every copy of\na gene. Otherwise, users may run the included Python script\n[detect\\_union\\_bin\\_errors.py] to detect potential bin errors. It\naccepts a feature\\_type, a GTF/GFF annotation, and an output directory.\nThe output is a text file listing feature names likely to cause large\nbin errors. Place this text file into the working directory where\nmucor will be executed. 
mucor will automatically search for the text\nfile by name ['union\\_incompatible\\_genes.txt'] in the current directory\nand print a message indicating when the file is found.\n\n::\n\n python ./detect_union_bin_errors.py -o ~/projects/mucor -g ~/references/human/gencode/gencode.v19.annotation.gtf -f gene_name\n\nThere is an issue when a variant file presents a contig that the pickled\n(archived) annotation does not have. This will throw a warning that\nshows how many contigs were unknown, and how many mutations were\nencountered on these contigs. The solution is to disable the archive\nfeature by omitting the ``-a`` or ``--archive_directory`` option. This\nwill permit the unknown contig in output, but all mutations on the\nunknown contig will be shown as having no feature.\n\n::\n\n *** WARNING: 18 Contigs and 39 mutations are in areas unknown to the genomic array of sets. If using --fast, perhaps try again without it. *** \n\nThe VCF files need to have columns #CHROM, POS \u2026 etc. The configuration\nscript checks each VCF file for proper columns and will print a warning\nif any are missing or wrong. However, it does not halt execution and\nwill include any malformed column VCF files and attempt to process them\nregardless. The main script may finish execution with the malformed VCF,\nbut the output may be perturbed or useless.\n\n::\n\n File \"mucor/inputs.py\", in parse_VarScan\n position = int(row[fieldId['POS']])\n KeyError: 'POS'\n\nUsers may not supply vcf files that have inconsistent 'effect' and/or\n'functional consequence' for the same variant. Presumably, if the\nvariant has the same location and reference and alternate allele, the\neffect and functional consequences would be predicted to be the same.\nThe problem lies in collapsing mutations in the ``variant_details``\noutput type(s). This issue may arise when different platforms or\npipelines are used for samples containing a common variant within a\nsingle run of mucor. 
The current solution is to annotate all VCF files\nwith the same functional consequence decoration software.\n\nBug Reports and Issue Tracking\n==============================\n\nCheck the issue tracker at the github repository. Please report bugs and\nrequest new features there for better tracking.\n\nContact\n=======\n\n| James S. Blachly, MD\n| james.blachly@osumc.edu\n", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/blachlylab/mucor", "keywords": "sequencing,VCF,mutation,variant", "license": "GNU GPLv3", "maintainer": null, "maintainer_email": null, "name": "mucor", "package_url": "https://pypi.org/project/mucor/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/mucor/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://github.com/blachlylab/mucor" }, "release_url": "https://pypi.org/project/mucor/1.51/", "requires_dist": null, "requires_python": null, "summary": "Genomic Variant Aggregation and Mutation Correlation", "version": "1.51" }, "last_serial": 1940026, "releases": { "1.5": [ { "comment_text": "", "digests": { "md5": "fcbdd5d94196506913c8a6585465f5c9", "sha256": "9dbbfa29a44e9737bc8d56c715dc2be4858daa219acbf329d5201392aa5a4656" }, "downloads": -1, "filename": "mucor-1.5.tar.gz", "has_sig": false, "md5_digest": "fcbdd5d94196506913c8a6585465f5c9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7644, "upload_time": "2016-01-07T00:23:00", "url": "https://files.pythonhosted.org/packages/25/b6/556549a053bb9b87e9806bf3dc49a33d510fffeb5a591db6ac252a7c7879/mucor-1.5.tar.gz" } ], "1.51": [ { "comment_text": "", "digests": { "md5": "6904b862b0c0881950e3950bedb2c794", "sha256": "4dd8e3db4d135e45f818b07ba199cd9cc40c11e7438441065f131538f0ecba11" }, "downloads": -1, "filename": "mucor-1.51.tar.gz", "has_sig": false, "md5_digest": "6904b862b0c0881950e3950bedb2c794", 
"packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 28694, "upload_time": "2016-02-04T17:09:54", "url": "https://files.pythonhosted.org/packages/78/63/e21e2a2fa8798fbbd28fe886f51760004497323d3830c83df99fb83976f4/mucor-1.51.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "6904b862b0c0881950e3950bedb2c794", "sha256": "4dd8e3db4d135e45f818b07ba199cd9cc40c11e7438441065f131538f0ecba11" }, "downloads": -1, "filename": "mucor-1.51.tar.gz", "has_sig": false, "md5_digest": "6904b862b0c0881950e3950bedb2c794", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 28694, "upload_time": "2016-02-04T17:09:54", "url": "https://files.pythonhosted.org/packages/78/63/e21e2a2fa8798fbbd28fe886f51760004497323d3830c83df99fb83976f4/mucor-1.51.tar.gz" } ] }