{ "info": { "author": "Serafim Nenarokov, Martin Kolisko", "author_email": "serafim.nenarokov@gmail.com", "bugtrack_url": null, "classifiers": [ "Operating System :: OS Independent", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7" ], "description": "# WinstonCleaner\nWinstonCleaner is a software tool for detecting and removing cross-contaminated \ncontigs from assembled transcriptomes. The program uses BLAST to identify \nsuspicious contigs and RPKM values to sort these as either correct or \ncontamination. \n\n# Requirements\n\nTo run WinstonCleaner, the following requirements must be satisfied:\n* Python 2.7\n* [blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi)\n* [bbtools](https://jgi.doe.gov/data-and-tools/bbtools/) (`pileup.sh`)\n* [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)\n\n# Installation\n\n1. Checkout repository\n\n `git clone https://github.com/kolecko007/WinstonCleaner.git`\n\n `cd WinstonCleaner`\n\n2. Install pip dependencies:\n\n `pip2 install --user -r requirements.txt`\n\n3. Initialize settings:\n\n `cp config/settings.yml.default config/settings.yml`\n\n4. Check installation by running `test/integration/run.sh` from `WinstonCleaner` folder\n\n# Quick Start\n1. Prepare the folder with input data and an empty folder for the results\n1. Open `config/settings.yml` and specify input and output paths\n1. `bin/prepare_data.py`\n1. `bin/find_contaminations.py`\n1. Inspect the results in the output folder\n\n# Usage\n## Input\nThe input data should be presented as a set of triads of files for each dataset.\nFor each dataset it is necessary to prepare:\n* left reads `.fastq`\n* right reads `.fastq`\n* assembled transcriptome `.fasta` file\n\nNames of the files must be in the following format:\n* `NAME_1.fastq`\n* `NAME_2.fastq`\n* `NAME.fasta`\n\nFor example:\n* `brucei_1.fastq`\n* `brucei_2.fastq`\n* `brucei.fasta`\n* `giardia_1.fastq`\n* `giardia_2.fastq`\n* `giardia.fasta`\n\nFor file names only letters, digits and `_` symbols are allowed.\n\nAll the files must be placed together in one folder.\n\n## Configuration\n\nAll the settings are declared in `config/settings.yml`.\n\n* `winston.paths.input` — input folder with reads and contigs\n* `winston.paths.output` — output folder with the results\n* `winston.paths.tools.pileup_sh` — (_optional_) bbtools `pileup.sh` execution command\n* `winston.paths.tools.bowtie2` — (_optional_) bowtie2 execution command\n* `winston.paths.tools.bowtie2_build` — (_optional_) bowtie2-build execution command\n* `winston.hits_filtering.len_ratio` — minimal `qcovhsp` for hits filtering\n* `winston.hits_filtering.len_minimum` — minimal hit lenth for hits filtering\n* `winston.coverage_ratio.regular` — coverage ratio for REGULAR dataset pair type \n(lower values make contamination prediction more strict, less contaminations will be found)\n* `winston.coverage_ratio.close` — coverage ratio for CLOSE dataset pair type\n* `winston.threads.multithreading` — enable multithreading (disabling is convenient for debugging purposes)\n* `winston.threads.count` — number of threads if multithreading enabled\n* `winston.tools.blast.threads` — number of threads for BLAST processing\n* `winston.tools.bowtie.threads` — number of threads for bowtie2 processing\n* `winston.in_memory_db` — load coverage database to RAM in the beginning. \nMakes contamination lookup faster, but requires decent amount of memory.\n\nThe default configuration can be found in file `config/settings.yml.default`.\n\n```\nwinston:\n in_memory_db: false\n\n paths:\n input: /path/to/folder/with/data/\n output: /path/to/output/folder\n\n hits_filtering:\n len_ratio: 70\n len_minimum: 100\n\n coverage_ratio:\n REGULAR: 1.1\n CLOSE: 0.04\n\n threads:\n multithreading: true\n count: 8\n\n tools:\n blast:\n threads: 8\n bowtie:\n threads: 8\n```\n\n## Data preparation\nThe first step is to prepare the data for WinstonCleaner processing.\n\n`bin/prepare_data.py`\n\nThe result will be stored in the folder, specified in `winston.paths.output` option.\n\nAfter the preparation the file `types.csv` can be inspected and edited.\nIt contains all possible combinations of dataset pairs and their types.\n\nThe default types are:\n* `CLOSE` - taxonomically close organisms\n* `REGULAR` - simple pair of organisms\n\nIn `types.csv` there can also be specified any amount of custom types.\nTheir names must be in upper case. \n\n```\npredator,prey,95.0,LEFT_EATS_RIGHT\nprey,predator,95.0,RIGHT_EATS_LEFT\n``` \n\nIn these case coverage ratio for each custom type must be specified in `winston.coverage_ratio` section of\n `settings.yml` file:\n\n```\n...\n coverage_ratio:\n REGULAR: 1.1\n CLOSE: 0.04\n LEFT_EATS_RIGHT: 10\n RIGHT_EATS_LEFT: 0.1\n...\n```\n\n\n## Contamination cleanup\n\n`bin/find_contaminations.py`\n\n## Output\n\nThe results will be saved in the folder, specified in `winston.paths.output` option.\n\nFor each datasets there will be the following structure of files.\n\n* **DATASET_NAME_clean.fasta** — clean contigs\n* **DATASET_NAME_deleted.fasta** — contaminated contigs\n* **DATASET_NAME_suspicious_hits.csv** — all suspicious BLAST hits\n* **DATASET_NAME_contamination_sources.csv** — \nsources of contaminations with a following columns: source contamination dataset name, number of sequences\n* **DATASET_NAME_contaminations.csv** — list of blast hits from which contaminations were detected\n* **DATASET_NAME_missing_coverage.csv** — list of contig ids without a coverage\n\n\n# TODO\n* Moving to python3\n* Logging system\n* Extended testing\n* export to graph format\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/kolecko007/WinstonCleaner", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "winston-cleaner", "package_url": "https://pypi.org/project/winston-cleaner/", "platform": "", "project_url": "https://pypi.org/project/winston-cleaner/", "project_urls": { "Homepage": "https://github.com/kolecko007/WinstonCleaner" }, "release_url": "https://pypi.org/project/winston-cleaner/0.1.0/", "requires_dist": [ "matplotlib (==2.0.2)", "numpy (>=1.14.0)", "scipy (>=0.19.0)", "biopython (>=1.68)", "tqdm (>=4.14.0)", "pathlib (>=1.0.1)", "PyYAML (>=3.11)" ], "requires_python": "==2.7.*", "summary": "WinstonCleaner - transcriptomic data cross-contamination eliminator", "version": "0.1.0" }, "last_serial": 4973836, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "2f2f53e6deb8db7c9e9d3af99bad24c4", "sha256": "1110bfe6c0b028fc7076ac359872fb03275356437638a2530cfd3419936a92fb" }, "downloads": -1, "filename": "winston_cleaner-0.1.0-py2-none-any.whl", "has_sig": false, "md5_digest": "2f2f53e6deb8db7c9e9d3af99bad24c4", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": "==2.7.*", "size": 26942, "upload_time": "2019-03-22T17:46:58", "url": "https://files.pythonhosted.org/packages/70/f2/388759daf73edd64d412c3f7b2b20258bc1e40404f3422de192cb0849b5e/winston_cleaner-0.1.0-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a595e8e27ab7e6088b0c7479194e3927", "sha256": "2b5a8307b10ab9244379fb5982ba7f5ba30844ea6e8728d1d19b39b05535b879" }, "downloads": -1, "filename": "winston_cleaner-0.1.0.tar.gz", "has_sig": false, "md5_digest": "a595e8e27ab7e6088b0c7479194e3927", "packagetype": "sdist", "python_version": "source", "requires_python": "==2.7.*", "size": 16825, "upload_time": "2019-03-22T17:47:01", "url": "https://files.pythonhosted.org/packages/58/f3/75ffb3796dda3a8b55aac6b9e769029177b761e06f1659385f4701d953c3/winston_cleaner-0.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "2f2f53e6deb8db7c9e9d3af99bad24c4", "sha256": "1110bfe6c0b028fc7076ac359872fb03275356437638a2530cfd3419936a92fb" }, "downloads": -1, "filename": "winston_cleaner-0.1.0-py2-none-any.whl", "has_sig": false, "md5_digest": "2f2f53e6deb8db7c9e9d3af99bad24c4", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": "==2.7.*", "size": 26942, "upload_time": "2019-03-22T17:46:58", "url": "https://files.pythonhosted.org/packages/70/f2/388759daf73edd64d412c3f7b2b20258bc1e40404f3422de192cb0849b5e/winston_cleaner-0.1.0-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a595e8e27ab7e6088b0c7479194e3927", "sha256": "2b5a8307b10ab9244379fb5982ba7f5ba30844ea6e8728d1d19b39b05535b879" }, "downloads": -1, "filename": "winston_cleaner-0.1.0.tar.gz", "has_sig": false, "md5_digest": "a595e8e27ab7e6088b0c7479194e3927", "packagetype": "sdist", "python_version": "source", "requires_python": "==2.7.*", "size": 16825, "upload_time": "2019-03-22T17:47:01", "url": "https://files.pythonhosted.org/packages/58/f3/75ffb3796dda3a8b55aac6b9e769029177b761e06f1659385f4701d953c3/winston_cleaner-0.1.0.tar.gz" } ] }