{ "info": { "author": "Tony Yang", "author_email": "tony@tony.tc", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "[![Build Status](https://travis-ci.org/tonyyzy/GC_analysis.svg?branch=master)](https://travis-ci.org/tonyyzy/GC_analysis)\n[![Build Status](https://travis-ci.org/tonyyzy/GC_analysis.svg?branch=parallel)](https://travis-ci.org/tonyyzy/GC_analysis)\n# GC-analysis\nA command-line utility for calculating GC percentages of genome sequences\n\n# Quick starter\nCalculate the GC content of chromosome 17 of the human reference genome with window size (or span) = 5 and shift (or step) = 5. The input FASTA file is `GRCh38-Chrom17.fasta` and the output wiggle file is `GRCh38-Chrom17.wig`. Note that the output file's extension is added by the program.\n```\n~ $ GC_analysis -i GRCh38-Chrom17.fasta -w 5 -s 5 -o GRCh38-Chrom17\n```\n\n# Installation guide\nNote that pyBigWig can only be used in a Linux environment. On Windows, the Docker image can be used as shown below. Alternatively, you can clone the repository and comment out `import pyBigWig`; the script will then work, but without BigWig support.\n\n1. Pip install GC_analysis (NB. Python3 is recommended but GC_analysis should work with Python2 as well)\n```\npip3 install GC_analysis\n```\nThen the `GC_analysis.py` command will be available globally.\n```\nGC_analysis.py -i [INPUT] -o [OUTPUT] -w [window size] -s [shift]\n```\n\n2. Run the Python script directly. Please ensure you have Python3 installed with pyBigWig and Biopython.\nClone the GitHub repository and install the packages.\n```\ngit clone https://github.com/tonyyzy/GC_analysis\ncd GC_analysis\npip3 install -r requirements.txt\n```\nRun the script from the `GC_analysis` directory.\n```\npython3 ./GC_analysis/GC_analysis.py -i [INPUT] -o [OUTPUT] -w [window size] -s [shift]\n```\n\n3. 
Use the packaged binary.\n```\nmkdir ~/GC_analysis\ncd ~/GC_analysis\nwget https://github.com/tonyyzy/GC_analysis/releases/download/v0.3/GC_analysis\n```\nExecute the binary command\n```\nGC_analysis -i [INPUT] -o [OUTPUT] -w [window size] -s [shift]\n```\n\n4. Use the Docker image.\nFirstly, pull the docker image (around 384 MB)\n```\ndocker pull tonyyzy/gc_analysis\n```\nTo use input files outside the container and save output files on your computer, the `-v` volume mapping option will be used. You will need to know the absolute path of the directory you want to map (which can be found out with `pwd`).\n```\ndocker run -v /your/local/path:/app tonyyzy/gc_analysis GC_analysis -i /app/yours.fasta -o /app/yours -w 5 -s 5\n```\nThis option maps `/your/local/path` to `/app` under the container's root directory. Your result file will be saved to `/your/local/path/yours.wig`.\n\n# Command-line options\n```\n~ $ GC_analysis -h\nusage: GC_analysis [-h] -i INPUT_FILE -w WINDOW_SIZE -s SHIFT [-o OUTPUT_FILE]\n [-ot] [-f {wiggle,gzip,bigwig}]\n\nrequired named arguments:\n\n-i INPUT_FILE, --input_file INPUT_FILE\nINPUTFILE: Name of the input file in FASTA format\n\n-w WINDOW_SIZE, --window_size WINDOW_SIZE\nWINDOW_SIZE: Number of base pairs that the GC percentage is calculated for\n\n-s SHIFT, --shift SHIFT\nSHIFT: The shift increment (step size)\n\noptional arguments:\n\n-h, --help\nShow the help message and exit\n\n-o OUTPUT_FILE, --output_file OUTPUT_FILE\nOUTPUT_FILE: Name of the output file\n\n-ot, --omit_tail\nUse if the trailing sequence should be omitted. Default behaviour is to retain the leftover sequence.\n\n-f {wiggle,bigwig,gzip}, --output_format {wiggle,bigwig,gzip}\nChoose output formats from wiggle, bigwig or gzip compressed wiggle file.\n\n```\n## Example usage\n1. 
Calculate the GC content of chromosome 17 of the human reference genome. The percentage is calculated over five base pairs (window_size), and the window is shifted by five base pairs every time (i.e. no base pairs overlap between entries).\n```\n~ $ GC_analysis -i GRCh38-Chrom17.fasta -w 5 -s 5 -o GRCh38-Chrom17\n```\n\n2. By default, the GC percentage of the trailing sequence is calculated and appended to the end of the output file. For example, with the following input\n```\n~ $ GC_analysis -i example1.fasta -w 5 -s 5 -o with_tail\n```\nand `example1.fasta` is\n```\n>chr1\nAAAAACC\n```\nthe generated `with_tail.wig` will look like\n```\ntrack type=wiggle_0 name=\"GC percentage\" description=\"chr1\"\nvariableStep chrom=chr1 span=5\n1\t0\n6\t100\n```\nTo omit the trailing sequence from the result, use the `-ot` or `--omit_tail` option. For example\n```\n~ $ GC_analysis -i example1.fasta -w 5 -s 5 -o without_tail -ot\n```\nwill generate the output file `without_tail.wig` with the following content\n```\ntrack type=wiggle_0 name=\"GC percentage\" description=\"chr1\"\nvariableStep chrom=chr1 span=5\n1\t0\n```\n\n3. The program supports three output file formats: wiggle, bigwig and gzip-compressed wiggle.\nThe wiggle output file follows the [UCSC variableStep format definition](https://genome.ucsc.edu/goldenpath/help/wiggle.html). Wiggle is the default output format. The output format can be changed with the `-f` or `--output_format` option.\n```\n~ $ GC_analysis -i GRCh38-Chrom17.fasta -w 5 -s 5 -o GRCh38-Chrom17\n```\nand\n```\n~ $ GC_analysis -i GRCh38-Chrom17.fasta -w 5 -s 5 -o GRCh38-Chrom17 -f wiggle\n```\nwill generate `GRCh38-Chrom17.wig` as the output file.\n\n```\n~ $ GC_analysis -i GRCh38-Chrom17.fasta -w 5 -s 5 -o GRCh38-Chrom17 -f gzip\n```\nwill generate `GRCh38-Chrom17.wig.gz` as the output file. 
Decompressing `GRCh38-Chrom17.wig.gz` gives the same wiggle file as choosing wiggle as the output format.\n\n```\n~ $ GC_analysis -i GRCh38-Chrom17.fasta -w 5 -s 5 -o GRCh38-Chrom17 -f bigwig\n```\nwill generate `GRCh38-Chrom17.bw` as the output file. Note that the bigwig format does not allow overlapping bases, which means that `-w 5 -s 3` is invalid when bigwig is chosen as the output format. In this case, where the shift is smaller than the window size and the bigwig format is specified, the program will generate a wiggle file instead and output a warning message.\n\n```\n~ $ GC_analysis -i GRCh38-Chrom17.fasta -w 5 -s 3 -o GRCh38-Chrom17 -f bigwig\nWARNING! BigWig file does not allow overlapped items. A wiggle file was generated instead.\n```\n\n4. If an output filename is not given, the result will be written to stdout. If the output filename is not given and a file format other than wiggle is chosen, the program will still write the result to stdout and print a warning before and after the result.\nE.g.\n```\nGC_analysis -i example1.fasta -w 5 -s 3 -f bigwig\nWARNING! BigWig file does not allow overlapped items. A wiggle file will be generated instead.\nWARNING! An output filename is needed to save output as bigwig. The result is shown below:\ntrack type=wiggle_0 name=\"GC percentage\" description=\"chr1\"\nvariableStep chrom=chr1 span=5\n1 0\n4 50\nWARNING! BigWig file does not allow overlapped items. A wiggle file was generated instead.\nWARNING! An output filename is needed to save output as bigwig. The result is shown above.\n```\n\n## Timing against human chromosomes\n
Raw data table:

| Entry | Human chromosome | No. of base pairs | Average real time - single thread (s) | Average real time - multi threads (s) |
| --- | --- | --- | --- | --- |
| CM000663.2.fasta | 1 | 248956422 | 288.429 | 179.221 |
| CM000664.2.fasta | 2 | 242193529 | 276.355 | 169.611 |
| CM000665.2.fasta | 3 | 198295559 | 227.528 | 135.637 |
| CM000666.2.fasta | 4 | 190214555 | 217.846 | 153.091 |
| CM000667.2.fasta | 5 | 181538259 | 205.623 | 123.858 |
| CM000668.2.fasta | 6 | 170805979 | 193.209 | 117.180 |
| CM000669.2.fasta | 7 | 159345973 | 183.445 | 109.135 |
| CM000670.2.fasta | 8 | 145138636 | 166.607 | 98.632 |
| CM000671.2.fasta | 9 | 138394717 | 157.142 | 93.898 |
| CM000672.2.fasta | 10 | 133797422 | 150.872 | 92.371 |
| CM000673.2.fasta | 11 | 135086622 | 154.003 | 92.498 |
| CM000674.2.fasta | 12 | 133275309 | 150.533 | 90.807 |
| CM000675.2.fasta | 13 | 114364328 | 129.951 | 77.498 |
| CM000676.2.fasta | 14 | 107043718 | 121.008 | 71.970 |
| CM000677.2.fasta | 15 | 101991189 | 115.194 | 68.336 |
| CM000678.2.fasta | 16 | 90338345 | 103.169 | 60.799 |
| CM000679.2.fasta | 17 | 83257441 | 94.353 | 55.729 |
| CM000680.2.fasta | 18 | 80373285 | 92.020 | 53.395 |
| CM000681.2.fasta | 19 | 58617616 | 67.506 | 39.308 |
| CM000682.2.fasta | 20 | 64444167 | 74.048 | 43.280 |
| CM000683.2.fasta | 21 | 46709983 | 53.633 | 31.118 |
| CM000684.2.fasta | 22 | 50818468 | 57.466 | 33.701 |
| CM000685.2.fasta | X | 156040895 | 176.895 | 105.408 |
| CM000686.2.fasta | Y | 57227415 | 67.016 | 38.142 |
| J01415.2.fasta | MT | 16569 | 0.231 | 0.397 |

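The ratio between the two timing columns can be worked out directly from the raw data. The snippet below is only a minimal sketch: the three rows are copied from the table, while the `timings` dictionary and the `speedup` helper are hypothetical names introduced for illustration and are not part of `GC_analysis`.

```python
# Timings copied from three rows of the raw data table:
# chromosome -> (single-thread seconds, multi-thread seconds).
# `timings` and `speedup` are illustrative names, not part of GC_analysis.
timings = {
    "1": (288.429, 179.221),
    "17": (94.353, 55.729),
    "MT": (0.231, 0.397),
}


def speedup(single_s, multi_s):
    """Return how many times faster the multi-threaded run was."""
    return single_s / multi_s


for chrom, (single_s, multi_s) in timings.items():
    print(f"chr{chrom}: {speedup(single_s, multi_s):.2f}x")
```

A ratio below 1 means the multi-threaded run was slower than the single-threaded one, which is the case for the 16569 bp mitochondrial sequence in the table.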
\n\n![Execution time vs. number of base pairs plot](https://github.com/tonyyzy/GC_analysis/blob/master/tests/time_profile/GC_time_profile.png \"execution time plot\")\n\\* 1) Real time data is the average of three runs; 2) the GC_analysis parameters for each run are `-w 5 -s 5`; 3) `Serial` data is collected with the `Master` branch, `Parallel` data is collected with the `Parallel` branch.\n\nAs can be seen from the plot, `GC_analysis` scales well with the number of base pairs: the execution time grows linearly with the size of the chromosome. Although the multi-threaded version can provide a ~1.7x speed improvement, it consumes significantly more memory and is therefore not recommended.\n\n\n## (EXPERIMENTAL!!!) Multi-threaded GC_analysis\nGit clone the `parallel` branch from the GitHub repo:\n```\ngit clone --single-branch -b parallel https://github.com/tonyyzy/GC_analysis\n```\nExecute as normal from the `GC_analysis` directory\n```\n~ $ python3 ./scripts/GC_analysis.py -i GRCh38-Chrom17.fasta -w 5 -s 5 -o GRCh38-Chrom17\n```\n\nThis multithreading implementation is very crude and only results in a ~1.7x speed-up. 
A large amount of RAM is needed to store out-of-order intermediate results for sorting.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/tonyyzy/GC_analysis", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "GC-analysis", "package_url": "https://pypi.org/project/GC-analysis/", "platform": "", "project_url": "https://pypi.org/project/GC-analysis/", "project_urls": { "Homepage": "https://github.com/tonyyzy/GC_analysis" }, "release_url": "https://pypi.org/project/GC-analysis/0.4.5/", "requires_dist": [ "pyBigWig", "Biopython" ], "requires_python": "", "summary": "A program that compute the GC percentage of a given genomic sequence", "version": "0.4.5" }, "last_serial": 4411074, "releases": { "0.3": [ { "comment_text": "", "digests": { "md5": "5b1d3cd0b7befdc1b0d5c66083f12746", "sha256": "7b434825d4571230cc7dea474768ebe18fb877e48000cd2d08d1ac9ab6b59346" }, "downloads": -1, "filename": "GC_analysis-0.3-py3-none-any.whl", "has_sig": false, "md5_digest": "5b1d3cd0b7befdc1b0d5c66083f12746", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 11211, "upload_time": "2018-07-05T10:27:16", "url": "https://files.pythonhosted.org/packages/d5/57/60cacb3fa7003a8c63c5de08cf705be7c06348b858e5f34a71876e373fae/GC_analysis-0.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "aedc5c07b0173ff007da2479b1d3cd60", "sha256": "64f99617fc177dfcd97170a83a9ce4dee02b51933c6dce53723c27c064f77878" }, "downloads": -1, "filename": "GC_analysis-0.3.tar.gz", "has_sig": false, "md5_digest": "aedc5c07b0173ff007da2479b1d3cd60", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6783, "upload_time": "2018-07-05T10:27:17", "url": 
"https://files.pythonhosted.org/packages/0d/67/d69a022c277a22324192d3c30930237a063f0aca7ecb7855629bad5801f2/GC_analysis-0.3.tar.gz" } ], "0.3.1": [ { "comment_text": "", "digests": { "md5": "71aecf284a6a9e46c75a1c87fc834d23", "sha256": "c0dfe7bedea949820a037db204f8ba6684f74ebd14683d3e513abc92b2ac109e" }, "downloads": -1, "filename": "GC_analysis-0.3.1-py3-none-any.whl", "has_sig": false, "md5_digest": "71aecf284a6a9e46c75a1c87fc834d23", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 16702, "upload_time": "2018-07-05T14:14:55", "url": "https://files.pythonhosted.org/packages/1c/1e/d0f8f01f3f8cb30544c0da1c0023c3a28c059139444ba7855c5b601b1f9f/GC_analysis-0.3.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d103017ed51dbed1996d36285c19b06d", "sha256": "0075bb37bae7b7894c96d2198df402030ea79d8d7e223e2548752c703017dac2" }, "downloads": -1, "filename": "GC_analysis-0.3.1.tar.gz", "has_sig": false, "md5_digest": "d103017ed51dbed1996d36285c19b06d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6814, "upload_time": "2018-07-05T14:14:56", "url": "https://files.pythonhosted.org/packages/b6/a0/42c06e4968d9ba74bcb7a1cb9237697505b0a30d55c3e77ca868761ecaa7/GC_analysis-0.3.1.tar.gz" } ], "0.3.2": [ { "comment_text": "", "digests": { "md5": "d817dc70013204dd6641ea44586b2043", "sha256": "4dfb6b6988e1475e84681ecd476349696a831b897eca414d15ad9f27618eb651" }, "downloads": -1, "filename": "GC_analysis-0.3.2-py3-none-any.whl", "has_sig": false, "md5_digest": "d817dc70013204dd6641ea44586b2043", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6074, "upload_time": "2018-07-05T15:07:25", "url": "https://files.pythonhosted.org/packages/b6/fb/3b67c81c8aa31897c8df6296ca4450e51a312d2249129b4fa20fc890587e/GC_analysis-0.3.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "40d11cb674e50f325fdbbf21082ab4dd", "sha256": 
"34186a3d26fb2822d14df2447a9fce537b685ea7642e5b4f1f304613b7926c30" }, "downloads": -1, "filename": "GC_analysis-0.3.2.tar.gz", "has_sig": false, "md5_digest": "40d11cb674e50f325fdbbf21082ab4dd", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5819, "upload_time": "2018-07-05T15:07:26", "url": "https://files.pythonhosted.org/packages/3e/ff/a34a2183970486af6218611a5f252b63c9f74b0e4e82c7f46d3746732bd7/GC_analysis-0.3.2.tar.gz" } ], "0.3.3": [ { "comment_text": "", "digests": { "md5": "a3560330b8ff40df00d2d124647d8e07", "sha256": "a142a661b9454e8ec4bd008927060697b9c5cd47a285d5a4d6f268f8f47bc006" }, "downloads": -1, "filename": "GC_analysis-0.3.3-py3-none-any.whl", "has_sig": false, "md5_digest": "a3560330b8ff40df00d2d124647d8e07", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6095, "upload_time": "2018-07-07T16:06:06", "url": "https://files.pythonhosted.org/packages/90/59/8526753980145bfe6a8b67e6cda0b0f341a208ce2d6fc1bcca59306c4bd5/GC_analysis-0.3.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1bf65c68ed7cf07c62f9ee474de286d4", "sha256": "f7281397a80a78bc8413e95bfa64f8a6e33a4ec1d9965e8aafc8889d58a52c21" }, "downloads": -1, "filename": "GC_analysis-0.3.3.tar.gz", "has_sig": false, "md5_digest": "1bf65c68ed7cf07c62f9ee474de286d4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5878, "upload_time": "2018-07-07T16:06:07", "url": "https://files.pythonhosted.org/packages/7d/33/e1c38f7592d8b3d22af00b08815e6b3a2bde607e9e0e857064fcb84a586a/GC_analysis-0.3.3.tar.gz" } ], "0.4": [ { "comment_text": "", "digests": { "md5": "e49819be19040bc12af98f29110120ab", "sha256": "f00f17bb66506ce02cf951d9fff5e55faab42f8e6621f5dd11fbd63cda8894bc" }, "downloads": -1, "filename": "GC_analysis-0.4-py3-none-any.whl", "has_sig": false, "md5_digest": "e49819be19040bc12af98f29110120ab", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, 
"size": 7584, "upload_time": "2018-07-29T22:05:49", "url": "https://files.pythonhosted.org/packages/a7/24/1565570a4bd469a671e7f92682320889848771fd26d39018d61319c10b0e/GC_analysis-0.4-py3-none-any.whl" } ], "0.4.3": [ { "comment_text": "", "digests": { "md5": "8ba2f05f26209719e12286087740ca3f", "sha256": "cd1df59183dcbd8a2610700073b90c9eb6b80fe70d386c870b90469da3ad3cbb" }, "downloads": -1, "filename": "GC_analysis-0.4.3-py3-none-any.whl", "has_sig": false, "md5_digest": "8ba2f05f26209719e12286087740ca3f", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7607, "upload_time": "2018-07-31T02:30:43", "url": "https://files.pythonhosted.org/packages/f8/43/09cc70f15b730bd6444303047b030a46d1e80fc82ec451f3a3f667a68b5f/GC_analysis-0.4.3-py3-none-any.whl" } ], "0.4.5": [ { "comment_text": "", "digests": { "md5": "42634919988715b9cc984556d2527a7b", "sha256": "9e56a987469fb56421e646a00e3429fbb27bd670c9ca040ba2138bf7a533c28d" }, "downloads": -1, "filename": "GC_analysis-0.4.5-py3-none-any.whl", "has_sig": false, "md5_digest": "42634919988715b9cc984556d2527a7b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7606, "upload_time": "2018-07-31T03:11:24", "url": "https://files.pythonhosted.org/packages/c4/59/ecc6979b22e2a9ff5e98ac398315ae2fcbe68ecc8ce4d6b0eef2cff9e254/GC_analysis-0.4.5-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "42634919988715b9cc984556d2527a7b", "sha256": "9e56a987469fb56421e646a00e3429fbb27bd670c9ca040ba2138bf7a533c28d" }, "downloads": -1, "filename": "GC_analysis-0.4.5-py3-none-any.whl", "has_sig": false, "md5_digest": "42634919988715b9cc984556d2527a7b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7606, "upload_time": "2018-07-31T03:11:24", "url": "https://files.pythonhosted.org/packages/c4/59/ecc6979b22e2a9ff5e98ac398315ae2fcbe68ecc8ce4d6b0eef2cff9e254/GC_analysis-0.4.5-py3-none-any.whl" } ] }