{ "info": { "author": "Christopher Wardell", "author_email": "github@cpwardell.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "
\n\n# F***i***NGS: ***F******i***lters for ***N***ext ***G***eneration ***S***equencing\n\n## Key features\n- **Filters SNVs from any variant caller to remove false positives**\n- **Calculates metrics based on BAM files and provides filtering not possible with other tools**\n- **Fully user-configurable filtering (including which filters to use and their thresholds)**\n- **Option to use filters identical to ICGC recommendations**\n\n## Introduction\nSomatic variant callers compare matched pairs of tumor-normal samples to produce variant calls. The results can be extremely rich in false positives due to confounding factors such as the purity of the samples, artifacts introduced by sequencing chemistry, the alignment algorithm and the incomplete and repetitive nature of reference genomes. \nIt has become common practice to attempt to ameliorate these effects using a variety of filtering techniques, including taking the intersect of results from multiple variant callers and employing some post-calling filtering. This ad-hoc filtering varies greatly between laboratories. Attempts have been made to standardize the methodology of this filtering, with recommendations produced by the International Cancer Genome Consortium (ICGC) [(Alioto et al., 2015)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4682041/). \nWe have developed Filters for Next Generation Sequencing (FiNGS), software written specifically to address these filtering issues. FiNGS can implement the ICGC filtering standards and has filters and thresholds that are fully configurable, which substantially increases the precision of results and provides high quality variants for further analysis. \n\n## Availability\nFiNGS is open source and released under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0.html). 
The latest source code is [freely available at GitHub](https://github.com/cpwardell/FiNGS).\n\n## Dependencies\nPython 3\n\n## Quickstart guide, with example data and test\nYou have three options for running FiNGS:\n1. Download using Bioconda (preferred)\n2. Download directly from GitHub\n3. Download using Docker/Singularity\n\n### Bioconda installation\nCreate new conda environment\nInstall package (and dependencies)\nRun example\n\n\n### Standalone installation via GitHub\nFiNGS depends on Python3 and several packages. We recommend using a Conda environment to manage Python and associated packages.\nOn Debian-based systems such as Ubuntu, you can install Python3 and pip, the recommended tool for installing Python packages, using the following commands:\n\n```\napt-get -y install python3\napt-get -y install python3-pip\n```\n\nNow that Python3 and pip are installed, you may install the required Python packages:\n```\npip3 install PyVCF\npip3 install pysam\npip3 install editdistance\npip3 install scipy\npip3 install joblib\n```\n\nClone the git repository and run the included example script in the `exampledata` directory.\n```\ngit clone https://github.com/cpwardell/FiNGS.git\ncd FiNGS/exampledata\n./test.sh\n```\nYou should see the progress printed to the screen. When filtering is complete, a `results` directory will have been created containing the output.\n\n### Docker installation\nThis guide assumes you have Docker installed and have some basic knowledge of how to use it. The Dockerfile builds an image based on Ubuntu 16.04. \nYou need to get a copy of the Dockerfile in this repository; below we use \"wget\" on Linux to download it, but you could just as easily \ncopy and paste the link in your web browser and \"right click/save as\" the file. 
The Docker build command works identically in both Bash on Linux and PowerShell on Windows\nand assumes that you're in the same directory as the dockerfile named \"Dockerfile\".\n\n```\n# Download the Dockerfile from this address:\nwget https://raw.githubusercontent.com/cpwardell/FiNGS/master/Dockerfile\n# Build the image and call it \"fings\" (lowercase)\ndocker build -t fings .\n```\n\nYou can test that your image works by running a container interactively:\n```\ndocker run -it fings\ncd /FiNGS/exampledata\n./test.sh\n```\n\nWhen you run it on your own data, you can mount the location of your files as below. This would output a results directory in the directory the command was executed in. Note that the `-u` argument ensures that files created by the \ndocker container will be owned by the user invoking the docker.\n```\ndocker run -v /path/to/tumorbamdir:/tumorbamdir -v /path/to/normalbamdir:/normalbamdir -v /path/to/vcfdir:/vcfdir -v $PWD:/local -w /local -u $UID:$UID fings /bin/bash -c \"python3 /FiNGS/FiNGS.py -n /normalbamdir/normal.bam -t /tumorbamdir/tumor.bam -v /vcfdir/somatic.vcf --PASSonlyin --PASSonlyout\"\n```\n\n## Suggested usage\n+ Use default filtering thresholds\n+ Use every available processor\n+ Only consider variants with a PASS filter value from the variant caller\n+ Only emit variants that PASS all filters \n\n```\npython3 /path/to/FiNGS.py -n /path/to/normal.bam -t /path/to/tumor.bam -v /path/to/somaticvariants.vcf --PASSonlyin --PASSonlyout\n```\n\n+ ICGC mode:\n\n```\npython3 /path/to/FiNGS.py -n /path/to/normal.bam -t /path/to/tumor.bam -v /path/to/somaticvariants.vcf -r /path/to/reference/genome.fasta --PASSonlyin --PASSonlyout --ICGC\n```\n\n## Basic usage and outputs\nIn the simplest case, provide FiNGS with paths to a VCF from a somatic variant caller and the BAM files used to produce it.\n```\npython3 FiNGS.py -n /path/to/normal.bam -t /path/to/tumor.bam -v /path/to/somaticvariants.vcf\n```\n\nFiNGS will create a directory called 
\"results\" containing the following files:\n\n+ `inputvcf.filtered.vcf` \nA VCF containing the filtered results. Descriptions of the filters used and their threshold values are stored in the header, and PASS/fail status stored in the FILTER column of each record\n+ `plots.pdf` \nPlots for every filter applied in a single PDF. The first page shows a table of the PASS/Fail counts for each filter. Subsequent pages show kernel density plots of the data used, with a dashed vertical line demarcating the threshold used; the red region failed the filter, the green region passed\n+ `log.txt` \nThe log file for the run. Contains the command line arguments used and a complete log of the run\n+ `filterresults.txt.gz` \nA gzipped text file giving the results of each filter for every variant \n+ `tumor.combined.txt.gz` and `normal.combined.txt.gz` \nAll metrics collected and used for filtering are stored in these gzipped text files. A dictionary explaining the contents of each column is below\n+ `summarystats.txt.gz`\nA gzipped text file containing summary stats that may be used for filtering\n\n## Advanced usage\n\n+ **-v** Required; path to the somatic variant VCF from any variant caller\n+ **-t** Required; path to the tumor BAM file used to produce the VCF\n+ **-n** Required; path to the normal BAM file used to produce the VCF\n+ **-r** Optional; absolute path to faidx indexed reference genome; required if using 'repeats' filter \n+ **-d** Optional; path to output directory. Default is to create a directory called \"results\" in the current working directory\n+ **-p** Optional; path to file specifying filtering parameters. Details on filters and default values is provided below. Default is a file located at `FiNGS/filter_parameters.txt`\n+ **-c** Optional; chunk size, the number of variants to process per chunk. Default is 100\n+ **-m** Optional; maximum read depth. Reads beyond this depth will be ignored and a warning emitted to the log file. 
Default is 1000 \n+ **-j** Optional; number of processors to use. -1 uses all available. Default is -1 \n+ **--ICGC** Optional; use filters identical to those recommended by the ICGC (Alioto et al, 2015). File is located at `FiNGS/icgc_filter_parameters.txt`\n+ **--logging** Optional; change logging level. Default is INFO, can be DEBUG for more detail or NOTSET for silent\n+ **--overwrite** Optional; allow overwriting of existing results if rerunning \n+ **--PASSonlyin** Optional; only consider variants with a PASS in the filter field of the input VCF \n+ **--PASSonlyout** Optional; only write variants that PASS all filters to the output VCF\n\n## Getting help\nRun FiNGS with no additional arguments to get the help file. If there's something not addressed here, or if you need further help, raise an issue on GitHub or find me online.\n\n## Citing FiNGS\nA paper is being prepared for submission shortly and will be referenced here when available.\n\n## Description of filters\nFiNGS assesses variants using any combination of these possible filters. Below is a table describing them, their default thresholds and ICGC thresholds. An NA value means that the filter is not employed in the corresponding mode (default or ICGC). 
Users can create their own parameter file using *any* combination of filters and thresholds.\n\nFilter name | Description | Default value | ICGC value\n--- | --- | --- | ---\n**minaltcount** | Minimum number of ALT reads in tumor | 3 | 4\n**minbasequality** | Minimum median base quality (separate filters for ALT reads in tumor, REF reads in tumor and REF reads in normal) | 30 | 30\n**minmapquality** | Minimum median mapping quality of ALT reads in tumor | 50 | 40\n**minmapqualitydifference** | Maximum difference between median mapping quality of ALT reads in tumor and REF reads in normal | 5 | 5\n**enddistance** | Maximum median shortest distance to either aligned end in tumor | 10 | 10\n**enddistancemad** | Minimum MAD of ALT position in tumor | 3 | 3\n**zeroproportion** | Maximum proportion of zero mapping quality reads in tumor and normal | 0.05 | 0.1\n**minimumdepth** | Minimum depth in tumor and normal | 10 | NA\n**maximumdepth** | Maximum depth in tumor and normal | 1000 | NA\n**minvaftumor** | Minimum VAF in tumor | 0.05 | NA\n**maxvafnormal** | Maximum VAF in normal | 0.03 | NA\n**maxoaftumor** | Maximum OAF in tumor | 0.04 | NA\n**foxog** | FoxoG artifact proportion (see note below) | 0.9 | NA\n**editdistance** | Maximum edit distance of ALT reads in tumor (maximum edit distance of REF reads in tumor is 1 less than this value) | 4 | NA\n**maxsecondtumor** | Maximum proportion of secondary alignments in tumor | 0.05 | NA\n**maxbadorient** | Maximum proportion of inversion orientation reads in normal | 0.2 | NA\n**strandbiasprop** | Strand bias exclusion proportion (see note below) | 0.1 | NA\n**strandbiassimple** | Maximum strand bias (see note below) | NA | 0.02\n**maxaltcount** | Maximum number of ALT reads in normal | NA | 1\n**snvcluster50** | Maximum number of mutations within 50 bp (see note below) | NA | 2\n**snvcluster100** | Maximum number of mutations within 100 bp (see note below) | NA | 4\n**repeats** | Maximum length of 1/2/3/4mer repeats 
around the variant position (see note below) | NA | 12\n\n+ **Note on _foxog_ filter** \nC>A|G>T variants can be oxidation artifacts introduced during library preparation [(Costello et al, 2013)](http://doi.org/10.1093/nar/gks1443).\nThese \"OxoG\" artifacts have a telltale read orientation, with the majority of ALT reads in the artifact orientation. All C>A|G>T variants are classified as belonging to one of two binomial distributions, one centered at 0.5 (50% artifact orientation, 50% non-artifact orientation) and the other at the filter value (default is 0.9, which is 90% artifact orientation reads). C>A|G>T variants classified as OxoG are removed.\n\n+ **Note on _strandbiasprop_ filter** \nStrand bias is defined below [(Guo et al., 2012)](https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-666) where a,c represent the forward and reverse strand allele counts of REF reads and b,d represent the forward and reverse strand allele counts of ALT reads. The topmost proportion of biased variants is removed (e.g. the suggested value of 0.1 removes the 10% of variants with the highest strand bias). This is not the GATK strand bias, which is calculated differently. Note that this filter is available, but is *not* implemented in either of the default settings. \nStrand bias: `|b/(a+b)-d/(c+d)|/ ((b+d)/(a+b+c+d))` \nGATK strand bias: `max(((b/(a+b))*(c/(c+d)))/((a+c)/(a+b+c+d)),((d/(c+d))*(a/(a+b)))/((a+c)/(a+b+c+d)))`\n\n+ **Note on _strandbiassimple_ filter** \nMaximum strand bias. This is defined very simply as the minimum proportion of reads in either direction, e.g. if there were 100 reads and only 1 was forward, the strand bias would be 0.01.\nStrand bias: `min(forward/(forward+reverse),reverse/(reverse+forward))` or `min((a+b)/(a+b+c+d),(c+d)/(a+b+c+d))`\n\n+ **Note on _SNVcluster50_ and _SNVcluster100_ filters** \nMaximum number of SNVs in the input VCF in a 50 or 100 bp window centered on the current SNV. 
SNVs must have at least 2 reads supporting them and a VAF>=5%.\n\n+ **Note on _repeats_ filter** \nMaximum length of 1/2/3/4mers surrounding the SNV. Lengths must be factors of repeats, e.g. n=8 would only consider 1/2/4mers because 3 is not a factor of 8. Repeated regions frequently result in false positive variants. When using this filter, a reference genome *must* be supplied so FiNGS can find the flanking sequences.\n\n## Dictionary of values reported in the metrics files\nThe gzipped `tumor.combined.txt.gz` and `normal.combined.txt.gz` output files contain all metrics calculated for every input variant. Each row is a single variant. Non-integer values are rounded to 3 decimal places. Not all of the values reported are used for filtering.\n\nColumn | Description | Example\n--- | --- | ---\n**UID** | Unique Identifier for variant | 1:931362:G:A\n**CHR** | Chromosome | 1\n**POS** | Position on chromosome | 931362\n**REF** | Reference allele | G\n**ALT** | Alternate allele | A\n**refcount** | Count of REF alleles | 100\n**altcount** | Count of ALT alleles | 19\n**varianttype** | SNV or INDEL | SNV\n**depth** | Depth of all reads | 122\n**vaf** | Variant Allele Frequency (altcount/depth) | 0.156\n**raf** | Reference Allele Frequency (refcount/depth) | 0.82\n**oaf** | Other Allele Frequency ((depth-altcount-refcount)/depth) | 0.025\n**medianbaseq** | Median base quality (all reads) | 32\n**medianbaseqref** | Median base quality (REF reads only) | 34.5\n**medianbaseqalt** | Median base quality (ALT reads only) | 32\n**medianmapq** | Median mapping quality (all reads) | 60\n**medianmapqref** | Median mapping quality (REF reads only) | 60\n**medianmapqalt** | Median mapping quality (ALT reads only) | 60\n**zeros** | Total number of zero mapping quality reads | 0\n**zerospersite** | Proportion of reads that have zero mapping quality | 0\n**softreadlengthsrefmean** | Mean length of REF reads after soft clipping | 147.68\n**softreadlengthsaltmean** | Mean length of ALT 
reads after soft clipping | 151\n**goodoffsetproportion** | Proportion of variants that occur within the first 2/3rds of the mean read length | 0.664\n**distancetoend1median** | Median distance to lefthand soft-clipped read end (all reads) | 74.5\n**mad1** | Median absolute deviation of distancetoend1median | 34\n**distancetoend2median** | Median distance to righthand soft-clipped read end (all reads) | 70\n**mad2** | Median absolute deviation of distancetoend2median | 34.5\n**distancetoend1medianref** | Median distance to lefthand soft-clipped read end (REF reads only) | 76\n**madref1** | Median absolute deviation of distancetoend1medianref | 31.5\n**distancetoend2medianref** | Median distance to righthand soft-clipped read end (REF reads only) | 64\n**madref2** | Median absolute deviation of distancetoend2medianref | 29\n**distancetoend1medianalt** | Median distance to lefthand soft-clipped read end (ALT reads only) | 65\n**madalt1** | Median absolute deviation of distancetoend1medianalt | 25\n**distancetoend2medianalt** | Median distance to righthand soft-clipped read end (ALT reads only) | 85\n**madalt2** | Median absolute deviation of distancetoend2medianalt | 25\n**shortestdistancetoendmedian** | Median of shortest distance of distancetoend1alt and distancetoend2alt | 42\n**madaltshort** | Median absolute deviation of shortestdistancetoendmedian | 19\n**sb** | Strand bias, see definition above | 0.22\n**gsb** | GATK strand bias, see definition above | 0.185\n**fishp** | P value for Fisher's exact test for strand bias | 0.614\n**FR** | Count of forward reads supporting REF allele | 36\n**FA** | Count of forward reads supporting ALT allele | 8\n**RR** | Count of reverse reads supporting REF allele | 64\n**RA** | Count of reverse reads supporting ALT allele | 11\n**altsb** | Simple strand bias of ALT reads | 0.421\n**refsb** | Simple strand bias of REF reads | 0.36\n**allsb** | Simple strand bias of all reads | 0.37\n**F1R2** | Count of reads in FoxoG 
orientation 1 | 8 \n**F2R1** | Count of reads in FoxoG orientation 2 | 11\n**FoxoG** | Oxoguanine artifact orientation proportion, only relevant for C>A or G>T mutations, defined in [Costello et al, 2013](https://academic.oup.com/nar/article/41/6/e67/2902364) | 0.421\n**refld** | Edit distance for REF reads | 2\n**altld** | Edit distance for ALT reads | 3\n**refsecondprop** | Proportion of REF reads that have secondary alignments | 0\n**altsecondprop** | Proportion of ALT reads that have secondary alignments | 0\n**refbadorientationprop** | Proportion of REF reads with an inverted orientation | 0\n**altbadorientationprop** | Proportion of ALT reads with an inverted orientation | 0\n**refmatecontigcount** | Number of contigs seen in REF reads | 1\n**altmatecontigcount** | Number of contigs seen in ALT reads | 1\n**sixtypes** | Types of SNV (of the six possible types) | C>T/G>A\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/cpwardell/FiNGS", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "fings", "package_url": "https://pypi.org/project/fings/", "platform": "", "project_url": "https://pypi.org/project/fings/", "project_urls": { "Homepage": "https://github.com/cpwardell/FiNGS" }, "release_url": "https://pypi.org/project/fings/1.6.7/", "requires_dist": [ "pyvcf", "pysam", "numpy", "scipy", "pandas", "joblib", "seaborn", "statsmodels", "editdistance" ], "requires_python": ">=3.6", "summary": "Filters for Next Generation Sequencing", "version": "1.6.7" }, "last_serial": 5822224, "releases": { "1.6.6": [ { "comment_text": "", "digests": { "md5": "326e8206c36a2573f084b98ae6df9e5e", "sha256": "719f55e725fa53d7ed9b7742e205624d78e68ff8976be2e3b14714b4d6c6f0e0" }, "downloads": -1, "filename": "fings-1.6.6-py3-none-any.whl", "has_sig": false, "md5_digest": "326e8206c36a2573f084b98ae6df9e5e", 
"packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 4456724, "upload_time": "2019-09-12T18:45:44", "url": "https://files.pythonhosted.org/packages/d6/41/3051b525556c5e380d80db59f4c571571d2997425e3daf951a4f76a777ea/fings-1.6.6-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "2aa936b89480638e5486de474234a697", "sha256": "433f988867d390ec768902b995e52a14fcc44f6dc3c98100c0bce17d2baa2467" }, "downloads": -1, "filename": "fings-1.6.6.tar.gz", "has_sig": false, "md5_digest": "2aa936b89480638e5486de474234a697", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 4458654, "upload_time": "2019-09-12T18:45:48", "url": "https://files.pythonhosted.org/packages/0e/83/27674e3876bfe6ad4005fe91499cc16ca26bde1d1aa8d5ad52a5f9cecafc/fings-1.6.6.tar.gz" } ], "1.6.7": [ { "comment_text": "", "digests": { "md5": "b51bad6401ed8445604ad67d16976e4a", "sha256": "393bfdddc44aae622010b32ee295b827d5deb73ef4bfe14a9319d5ac6d258fa6" }, "downloads": -1, "filename": "fings-1.6.7-py3-none-any.whl", "has_sig": false, "md5_digest": "b51bad6401ed8445604ad67d16976e4a", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 4456689, "upload_time": "2019-09-12T19:54:23", "url": "https://files.pythonhosted.org/packages/d6/2b/c592894f0ba54a3f035809c8e70484e2c6551ceb6dd08379789394829ef6/fings-1.6.7-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "62ead58dcda95f9f540bd591357ef5b1", "sha256": "e1825b1a1664cf134dadcdad762dd50f9309e48a4544c7fd337749e6c9e06e6f" }, "downloads": -1, "filename": "fings-1.6.7.tar.gz", "has_sig": false, "md5_digest": "62ead58dcda95f9f540bd591357ef5b1", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 4458642, "upload_time": "2019-09-12T19:54:25", "url": "https://files.pythonhosted.org/packages/d9/81/7dd8b9ce7cddccbddc16fa13cf59eb79c6f37824b33c985d63c61971da86/fings-1.6.7.tar.gz" } ] }, "urls": [ { 
"comment_text": "", "digests": { "md5": "b51bad6401ed8445604ad67d16976e4a", "sha256": "393bfdddc44aae622010b32ee295b827d5deb73ef4bfe14a9319d5ac6d258fa6" }, "downloads": -1, "filename": "fings-1.6.7-py3-none-any.whl", "has_sig": false, "md5_digest": "b51bad6401ed8445604ad67d16976e4a", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 4456689, "upload_time": "2019-09-12T19:54:23", "url": "https://files.pythonhosted.org/packages/d6/2b/c592894f0ba54a3f035809c8e70484e2c6551ceb6dd08379789394829ef6/fings-1.6.7-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "62ead58dcda95f9f540bd591357ef5b1", "sha256": "e1825b1a1664cf134dadcdad762dd50f9309e48a4544c7fd337749e6c9e06e6f" }, "downloads": -1, "filename": "fings-1.6.7.tar.gz", "has_sig": false, "md5_digest": "62ead58dcda95f9f540bd591357ef5b1", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 4458642, "upload_time": "2019-09-12T19:54:25", "url": "https://files.pythonhosted.org/packages/d9/81/7dd8b9ce7cddccbddc16fa13cf59eb79c6f37824b33c985d63c61971da86/fings-1.6.7.tar.gz" } ] }