{ "info": { "author": "Wes Woollard", "author_email": "wes.woollard3985@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "CoverageCompacter\n==============\n\nNext generation sequencing (NGS) is an expensive process but its cost could be significantly reduced if only good quality libraries\nwere sequenced. CoverageCompacter can be used to generate estimates of the coverage of NGS libraries that have been\nsequenced at ultra low depth if those libraries were to undergo further sequencing. This is useful for analysing DNA which\nis likely to have regions of allelic dropout such as samples that have undergone whole genome amplification (WGA) from single celled\nsequencing experiments or other experiments that use low amounts of starting material such as DNA extracted from laser capture \nmicrodissection experiments. The output from CoverageCompacter can be used to guide which libraries to shortlist for further sequencing\nor producing useful summaries of sequencing depth in the BED format.\n\nPrinciples and applications\n===========================\nThis software was inspired by work by Zhang et al 2015 (see link to original article below). NGS libraries produced from DNA that has \nundergone WGA are likely to contain amplificaton bias, i.e. areas of the genome which are over/under represented. This bias can be \nestimated from low pass sequencing data based on 2 principles as discussed by Zhang et al. (1) Essentially sequencing a library at ultra low sequencing \ncan be thought of as 'sampling' a library, we are sequencing regions which are well represented in the overall fragment pool. (2) WGA tends to produce \namplicons rangeing between 10-50Kb in size. Infering from these 2 concepts; if a library is sequenced at low depth and 1 or more\nreads map to a given loci, then it can be asumed that if more sequencing is done then its likely to cover the 10-50kb region surrounding the loci. \n\n- Original aritcle: https://www.nature.com/articles/ncomms7822\n- Pubmed link: https://www.ncbi.nlm.nih.gov/pubmed/25879913\n\nCoverageCompacter uses the BED format output of Samtools depth (see Samtools docs for detals) and compacts this output into smaller loci. CoverageCompacter \ntakes the binsize as an argument and will compact areas of coverage into a single loci surounded by areas of no coverage up to half the size of the binSize \narguement. The reasoning behind this is that up to half an amplicon telomeric and centromeric of a given read are likely be contained within the library \nbut not yet sequenced. Where 2 amplicons are separated by less than half a bin, CoverageCompacter will merge the loci into a single loci. See 'Inputs \nand outputs of the software' examples 1-4 for more detail about how this is implemented. The output files can then be more easily compared with established tools \nsuch as 'bedtools intersect'.\n\nCoverageCompacter can also be used to compress loci with a minimum depth chosen by the user. This can be suplied as an arguement. The 'NoCov' arguement is set to 0 by \ndefault but if set to 10 (for example) then loci will be identified as having sufficient coverage if they have greater than 10 reads covering a single base. \nThe loci will be compressed into regions with above or below the minimum coverage and separated by the binsize. This is useful for identifying regions that may\nrequire more sequencing, or comparing multiple libraries to identify areas of poor coverage across a range of samples. It can also be used to identify regions \nthat have excessive coverage or verifying that targetted sequencing experiments have performed appropriately. It can also be used to identify transcript boundaries \nin RNASeq datasets.\n\n\nDependencies\n============\nAnalysis ready BAM files\nReference genome\nSamtools (we used samtools v1.3.1)\nPython 3\n\nInstallation\n============\n\ngit clone https://github.com/wes3985/CoverageCompacter.git\n\n\nInstructions for usage\n================\n(1) Run samtools depth over your BAM file(s) to generate a depth file. You can use the following command and launch \neach chromosome to a separate core on a cluster.\n\nbam_file_list \t\t\t\t= a text file containing a list with the full paths to your bam files.\npath_to_ref \t\t\t\t= the full path to the reference genome.\nchr\t\t \t\t\t= the chromosome in the same format as your reference.\ndepth_file_full_outpath\t\t\t= the full file path to write the depth data to.\n\nsamtools depth -a -q 10 -Q 20 --reference $path_to_ref -r $chr -f $bam_file_list > $depth_file_full_outpath\n\n(2) Call CoverageCompacter, this can take some time for large genomes and/or many samples, we recommend launching each chromosome to a different core.\n\nin_depthfile\t\t\t\t= the input depth file produced by samtools depth\noutfile\t\t\t\t\t= the full file path to write the depth data to.\nsamples\t\t\t\t\t= a string containing the samples separated by commas, if calling from a python shell: \"sample1,sample2,sample3\"\nCHROM\t\t\t\t\t= The chromosome covered by the infile e.g \"chr1\", \"1\". The infile must span only a single chromosome or loci from sequential chromosomes\n\t\t\t\t\t\t will be merged. This functionality may be improved in future versions depending on demand.\nbinSize\t\t\t\t\t= the presumed size of the amplicons, recommend size is 10000 based on Zhang et al (see link above). Setting the binSize to\n\t\t\t\t\t\t zero also makes this a useful tool to check for transcript boundaries in RNASeq data.\nNoCov\t\t\t\t\t= the minimum depth at which will be regions will be separated into coverage / no coverage\n\n# from a UNIX command line or launch script\npython CoverageCompacter $in_depthfile $outfile $samples $CHROM $binSize\n\n# from a python interface\nfrom CoverageCompacter import CoverageCompacter\nCoverageCompacter(depth_file, outfile, samples, CHROM, 10000, 0)\n\nInputs and outputs of the software\n==================================\n\n# INPUT EXAMPLE: This is the output format of samtools depth, col1 = chromosome, col2 = position in chromosome, col3 = depth at position\n# See samtools depth docs for more information\n\nchr1\t1\t0\nchr1\t2\t0\nchr1\t3\t0\nchr1\t4\t0\nchr1\t5\t0\nchr1\t6\t0\nchr1\t7\t0\nchr1\t8\t0\nchr1\t9\t1\nchr1\t10\t1\nchr1\t11\t1\nchr1\t12\t1\nchr1\t13\t1\nchr1\t14\t1\nchr1\t15\t1\n\n# OUTPUT EXAMPLE\n\nchr\tstart\tend\tsize\tfirstCoveredBase\tlastCoveredBase\tmeanCoverage\tNBasesCovered\tDepthSum\tcoverageFraction\nchr1\t1\t8\t8\tNone\tNone\t0\t0\t0\t0\nchr1\t9\t15\t7\t9\t15\t1\t7\t7\t1\n\n# The output file is a headered file in the bed format. Below is a description of each header:\n\nchr\t\t\t\t\t= chromosome\nstart\t\t\t\t= start position of the loci\nend\t\t\t\t\t= end postion of the loci\nsize\t\t\t\t= total size of the loci\nfirstCoveredBase\t= the first base in the loci with coverage >0\nlastCoveredBase\t\t= the last base in the loci with coverage >0\nmeanCoverage\t\t= mean coverage accross the loci\nNBasesCovered\t = toal number of bases covered in the loci\nDepthSum\t\t\t= the total sum of depth in the loci, this is useful for identifying regions of high relative coverage\ncoverageFraction\t= fraction of the loci that has coverage >0\n\n# INPUT EXAMPLE 1\n\nchr1\t1\t0\nchr1\t2\t0\nchr1\t3\t1\nchr1\t4\t1\nchr1\t5\t1\nchr1\t6\t0\nchr1\t7\t0\nchr1\t8\t0\nchr1\t9\t1\nchr1\t10\t1\nchr1\t11\t0\nchr1\t12\t0\nchr1\t13\t1\nchr1\t14\t1\nchr1\t15\t0\nchr1\t16\t1\nchr1\t17\t1\nchr1\t18\t1\nchr1\t19\t0\n\n(1) OUTPUT of CoverageCompacter with binSize set to '0' \n# Notes: \n\t- areas of no coverage are joined into a single loci in the bed format\n\t- areas of continuous coverage are also joined into single loci\n\nchr\t\tstart\tend\t\tsize\tfirstCoveredBase\tlastCoveredBase\tmeanCoverage\tNBasesCovered\tDepthSum\tcoverageFraction\nchr1\t1\t\t2\t\t2\t\tNone\t\t\t\tNone\t\t\t0.0\t\t\t\t0\t\t\t\t0\t\t\t0.0\nchr1\t3\t\t5\t\t3\t\t3\t\t\t\t\t5\t\t\t\t1.0\t\t\t\t3\t\t\t\t3\t\t\t1.0\nchr1\t6\t\t8\t\t3\t\tNone\t\t\t\tNone\t\t\t0.0\t\t\t\t0\t\t\t\t0\t\t\t0.0\nchr1\t9\t\t10\t\t2\t\t9\t\t\t\t\t10\t\t\t\t1.0\t\t\t\t2\t\t\t\t2\t\t\t1.0\nchr1\t11\t\t12\t\t2\t\tNone\t\t\t\tNone\t\t\t0.0\t\t\t\t0\t\t\t\t0\t\t\t0.0\nchr1\t13\t\t14\t\t2\t\t13\t\t\t\t\t14\t\t\t\t1.0\t\t\t\t2\t\t\t\t2\t\t\t1.0\nchr1\t15\t\t15\t\t1\t\tNone\t\t\t\tNone\t\t\t0.0\t\t\t\t0\t\t\t\t0\t\t\t0.0\nchr1\t16\t\t18\t\t3\t\t16\t\t\t\t\t18\t\t\t\t1.0\t\t\t\t3\t\t\t\t3\t\t\t1.0\nchr1\t19\t\t19\t\t1\t\tNone\t\t\t\tNone\t\t\t0.0\t\t\t\t0\t\t\t\t0\t\t\t0.0\n\n(2) OUTPUT of CoverageCompacter with binSize set to '1'\n# Notes:\n\t- The boundary of the bin is moved to binSize/2 and rounded up to the nearest whole number - '1' in this case which means that loci\n\twith coverage will include 1 bp with zero coverage on either side. Where a single base has no coverage between 2 loci where coverage\n\tis present these adjacent loci will be merged.\n\nchr\t\tstart\tend\t\tsize\tfirstCoveredBase\tlastCoveredBase\tmeanCoverage\tNBasesCovered\tDepthSum\tcoverageFraction\nchr1\t1\t\t1\t\t1\t\tNone\t\t\t\tNone\t\t\t0.0\t\t\t\t0\t\t\t\t0\t\t\t0.0\nchr1\t2\t\t6\t\t5\t\t3\t\t\t\t\t5\t\t\t\t0.6\t\t\t\t3\t\t\t\t3\t\t\t0.6\nchr1\t7\t\t7\t\t1\t\tNone\t\t\t\tNone\t\t\t0.0\t\t\t\t0\t\t\t\t0\t\t\t0.0\nchr1\t8\t\t11\t\t4\t\t9\t\t\t\t\t10\t\t\t\t0.5\t\t\t\t2\t\t\t\t2\t\t\t0.5\nchr1\t12\t\t19\t\t8\t\t13\t\t\t\t\t18\t\t\t\t0.625\t\t\t5\t\t\t\t5\t\t\t0.625\n\n(3) OUTPUT of CoverageCompacter with binSize set to '2'\n# Notes:\n\t- The output is the same as supplying binsize of 1 (2/2=1) because a boundary is formed at half the bin size and a single base cannot\n\tbe cut in half. \n\nchr\t\tstart\tend\t\tsize\tfirstCoveredBase\tlastCoveredBase\tmeanCoverage\tNBasesCovered\tDepthSum\tcoverageFraction\nchr1\t1\t\t1\t\t1\t\tNone\t\t\t\tNone\t\t\t0.0\t\t\t\t0\t\t\t\t0\t\t\t0.0\nchr1\t2\t\t6\t\t5\t\t3\t\t\t\t\t5\t\t\t\t0.6\t\t\t\t3\t\t\t\t3\t\t\t0.6\nchr1\t7\t\t7\t\t1\t\tNone\t\t\t\tNone\t\t\t0.0\t\t\t\t0\t\t\t\t0\t\t\t0.0\nchr1\t8\t\t11\t\t4\t\t9\t\t\t\t\t10\t\t\t\t0.5\t\t\t\t2\t\t\t\t2\t\t\t0.5\nchr1\t12\t\t19\t\t8\t\t13\t\t\t\t\t18\t\t\t\t0.625\t\t\t5\t\t\t\t5\t\t\t0.625\n\n(4) OUTPUT of CoverageCompacter with binSize set to '3'\n# Notes:\n - The area between postion 6 and postion 8 has 3 bp with zero coverage, in cases where splits occur that could reasonably fit into \n both adjacent loci, the loci is split the extra non-covered base is allocated to the previous loci\n\nchr\t\tstart\tend\t\tsize\tfirstCoveredBase\tlastCoveredBase\tmeanCoverage\t\tNBasesCovered\tDepthSum\tcoverageFraction\nchr1\t1\t\t7\t\t7\t\t3\t\t\t\t\t5\t\t\t\t0.42857142857142855\t3\t\t\t\t3\t\t\t0.42857142857142855\nchr1\t8\t\t19\t\t12\t\t9\t\t\t\t\t18\t\t\t\t0.5833333333333334\t7\t\t\t\t7\t\t\t0.5833333333333334\n\n\n# INPUT EXAMPLE 2:\n\nchr1\t1\t0\nchr1\t2\t0\nchr1\t3\t1\nchr1\t4\t2\nchr1\t5\t1\nchr1\t6\t0\nchr1\t7\t0\nchr1\t8\t0\nchr1\t9\t1\nchr1\t10\t1\nchr1\t11\t2\nchr1\t12\t2\nchr1\t13\t1\nchr1\t14\t1\nchr1\t15\t0\nchr1\t16\t1\nchr1\t17\t1\nchr1\t18\t1\nchr1\t19\t0\n\n(4) OUTPUT of CoverageCompacter with binSize set to '0' and noCov set to '1'\n\nchr\t\tstart\tend\t\tsize\tfirstCoveredBase\tlastCoveredBase\tmeanCoverage\tNBasesCovered\tDepthSum\tcoverageFraction\nchr1\t1\t\t3\t\t3\t\tNone\t\t\t\tNone\t\t\t0.0\t\t\t\t0\t\t\t\t0\t\t\t0.0\nchr1\t4\t\t4\t\t1\t\t4\t\t\t\t\t4\t\t\t\t2.0\t\t\t\t1\t\t\t\t2\t\t\t1.0\nchr1\t5\t\t10\t\t6\t\tNone\t\t\t\tNone\t\t\t0.0\t\t\t\t0\t\t\t\t0\t\t\t0.0\nchr1\t11\t\t12\t\t2\t\t11\t\t\t\t\t12\t\t\t\t2.0\t\t\t\t2\t\t\t\t4\t\t\t1.0\nchr1\t13\t\t19\t\t7\t\tNone\t\t\t\tNone\t\t\t0.0\t\t\t\t0\t\t\t\t0\t\t\t0.0\n\nFuture Updates\n==============\n(1) Add an argument which can be specified by the user to exclude regions of no coverage from the output file.\n\n\n\nSupport\n=======\nPlease inform us of any issues at 'w.woollard@ucl.ac.uk'\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/wes3985/CoverageCompacter", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "CoverageCompacter", "package_url": "https://pypi.org/project/CoverageCompacter/", "platform": "", "project_url": "https://pypi.org/project/CoverageCompacter/", "project_urls": { "Homepage": "https://github.com/wes3985/CoverageCompacter" }, "release_url": "https://pypi.org/project/CoverageCompacter/1.0/", "requires_dist": null, "requires_python": "", "summary": "Package to extrapolate coverage estimates from ultra low sequenced genomes generated by whole genome amplification (WGA) based library preparation methods", "version": "1.0" }, "last_serial": 4435809, "releases": { "1.0": [ { "comment_text": "", "digests": { "md5": "d8122f368148a96a4e90927df581f6fe", "sha256": "53db4457423b37a40c4cd52da30b2e2841c5551ed3ffc948566e3b5de3afc785" }, "downloads": -1, "filename": "CoverageCompacter-1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "d8122f368148a96a4e90927df581f6fe", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 10390, "upload_time": "2018-10-31T13:41:34", "url": "https://files.pythonhosted.org/packages/d8/b6/5e56bd8ed920efc8c1fe62284e186cd2ef710d060a7e5c622012a0d9bdc8/CoverageCompacter-1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "76949175cef1b591a06d5a848f888202", "sha256": "8420ca0000b2b0f82447f58d1fac7efeb8ca2abe48d993896265ad34b4385d35" }, "downloads": -1, "filename": "CoverageCompacter-1.0.tar.gz", "has_sig": false, "md5_digest": "76949175cef1b591a06d5a848f888202", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9208, "upload_time": "2018-10-31T13:41:34", "url": "https://files.pythonhosted.org/packages/a3/5e/92d60df39f37fce4c305b9f5bc60e2d1095aafa3f52f776140fc393fddec/CoverageCompacter-1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "d8122f368148a96a4e90927df581f6fe", "sha256": "53db4457423b37a40c4cd52da30b2e2841c5551ed3ffc948566e3b5de3afc785" }, "downloads": -1, "filename": "CoverageCompacter-1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "d8122f368148a96a4e90927df581f6fe", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 10390, "upload_time": "2018-10-31T13:41:34", "url": "https://files.pythonhosted.org/packages/d8/b6/5e56bd8ed920efc8c1fe62284e186cd2ef710d060a7e5c622012a0d9bdc8/CoverageCompacter-1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "76949175cef1b591a06d5a848f888202", "sha256": "8420ca0000b2b0f82447f58d1fac7efeb8ca2abe48d993896265ad34b4385d35" }, "downloads": -1, "filename": "CoverageCompacter-1.0.tar.gz", "has_sig": false, "md5_digest": "76949175cef1b591a06d5a848f888202", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9208, "upload_time": "2018-10-31T13:41:34", "url": "https://files.pythonhosted.org/packages/a3/5e/92d60df39f37fce4c305b9f5bc60e2d1095aafa3f52f776140fc393fddec/CoverageCompacter-1.0.tar.gz" } ] }