{ "info": { "author": "Asaf Peer", "author_email": "asaf.peer@jax.org", "bugtrack_url": null, "classifiers": [], "description": "# domain classifier - Classify a sequence to its taxonomic domain using PFAM domains\n## Install\npip install domain-classifier\n## How to use\nThe input is usually a DNA fasta sequence which will be translated to proteins and will be searched for PFAM domains. These domains will be used to classify each sequence to one of four domains: Bacteria, Eukaryota, Archaea and Viruses. The main idea is for the user to be easily spot contaminations where the reference genomes are not known. Since protein domains are highly distinctive between the above taxonomic domain (with the exception of Viruses which kidnap their hosts' genes), the distinction is quite easy. \n\nFirst you'll have to download the data from PFAM ftp server:\n```\nmodule load diamond\nbuild_domains_DB.py -d \n```\nIf diamond is not installed or you wish not to use diamond for domain prediction (which is fine) you can add `--nodiamond`.\n\nNow take a fasta file and run the classifier, you'll have to define two files for intermediate steps: one for the translated protein sequences and one for the hmmsearch results. Output will be written to STDOUT:\n```\n# You'll need hmmer and EMBOSS loaded\nmodule load hmmer emboss\npredict_domain.py -d -i input.fna -t input.faa -b input.hmm > output_table.txt\n```\nThe script does some funny things like concatenating all the protein sequences and selecting only 100 proteins for each sequence. These are done to save running time of course, you can change the number of proteins using `--limit [int]` flag, I found it unnecessary with the sequences I tested. \nOther options are `--diamond` which will use diamond instead of hmmsearch to find domains, not recommended unless the sequences you have are well known. `--threads` to use another number of threads (default is 20), `--getorf` and `--hmmsearch` to define other paths to EMBOSS getorf and hmmer hmmsearch. \n`--pseudocounts` allows you to introduce more pseudocounts to the Naive-Bayes classifier initial counts (number of genomes the domain was found in) to introduce some uncertainty in the results, the default is 1 (just to avoid log of zero).\n\n## Output\nThe output table will contain the maximum a-posterior (MAP) which is the most probable domain and the log probability of each of the four domains, unnormalized. You can use some filtering for the minimal number of domains or the difference between the maximal domain and the second best.\n\n## Filtering kraken2 database\nThe first step would be to run each library.fna file of the downloaded kraken2 database. The next step would be to use the script `filter_kraken_db.py` with the input fna file and the table to get the records that match their taxonomic domain. Run:\n```\nfilter_kraken_db.py --dbfile --taxonomy \n```\nThe script will generate an sqlite3 database in the file given by `--dbfile` if file doesn't exist, if omitted it will be written in memory. The `--taxonomy` flag is the taxonomy directory under the kraken2 database dir, if `--dbfile` exists it can be omitted. Some criteria must be met to filter a sequence:\n - At least `--mindomains` are found for the record (default 5)\n - The record's taxid doesn't match the predicted domain\n - The difference between the MAP domain and the second best is > `--midiff` (1 by default)\n\nIf there is no prediction or the prediction didn't meet the criteria the record will be written to the output (which goes to STDOUT)\n\n© 2019 The Jackson Laboratory", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/asafpr/domain_classifier", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "domain-classifier", "package_url": "https://pypi.org/project/domain-classifier/", "platform": "", "project_url": "https://pypi.org/project/domain-classifier/", "project_urls": { "Homepage": "https://github.com/asafpr/domain_classifier" }, "release_url": "https://pypi.org/project/domain-classifier/0.0.4/", "requires_dist": null, "requires_python": "", "summary": "Use PFAM domains to classify DNA or proteins to taxonomic domain", "version": "0.0.4" }, "last_serial": 5189107, "releases": { "0.0.3": [ { "comment_text": "", "digests": { "md5": "cf0acb96f905195c4e1c9cfc07b9f9a8", "sha256": "bbb4d8ef3e52a39ba471b16c081c0fb330fd7a503c5ca23ddb772b9d02f21e2a" }, "downloads": -1, "filename": "domain_classifier-0.0.3.tar.gz", "has_sig": false, "md5_digest": "cf0acb96f905195c4e1c9cfc07b9f9a8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8825, "upload_time": "2019-04-24T12:01:05", "url": "https://files.pythonhosted.org/packages/ff/03/ae4f15dbb05fbedc783f030016db1d545f5eb9685ad413459d3a9df9831c/domain_classifier-0.0.3.tar.gz" } ], "0.0.4": [ { "comment_text": "", "digests": { "md5": "62faf17268800327fb1ed87a89f79783", "sha256": "f70b1078550736277514bff61fbc17e5d4db6f351288da0e2f0b71a29b48c962" }, "downloads": -1, "filename": "domain_classifier-0.0.4.tar.gz", "has_sig": false, "md5_digest": "62faf17268800327fb1ed87a89f79783", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8826, "upload_time": "2019-04-25T17:27:26", "url": "https://files.pythonhosted.org/packages/63/da/0b8fc951fb57c22336f4637ae52ef745f04f2d30ca59e775c2a982847ad4/domain_classifier-0.0.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "62faf17268800327fb1ed87a89f79783", "sha256": "f70b1078550736277514bff61fbc17e5d4db6f351288da0e2f0b71a29b48c962" }, "downloads": -1, "filename": "domain_classifier-0.0.4.tar.gz", "has_sig": false, "md5_digest": "62faf17268800327fb1ed87a89f79783", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8826, "upload_time": "2019-04-25T17:27:26", "url": "https://files.pythonhosted.org/packages/63/da/0b8fc951fb57c22336f4637ae52ef745f04f2d30ca59e775c2a982847ad4/domain_classifier-0.0.4.tar.gz" } ] }