{ "info": { "author": "xiao hu", "author_email": "xiaohu@iastate.edu", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "Programming Language :: Python :: 2.7", "Topic :: Software Development :: Build Tools" ], "description": "# Debiasing a Protein Annotation Database\n\nDebiaser removes bias from [GAF](http://www.geneontology.org/page/go-annotation-file-formats) files based on annotation information content, GO evidence, annotation source, number of proteins annotated from a given source, and date. Debiaser accepts one or more GAF files as input. The motivation for Debiaser lies in the observation that many organism annotations are biased due to high-throughput experimental studies ([1](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003063)). Removing such annotation biases can help present a more balanced picture of protein annotations for a given organism or set of proteins. \n\n### Prerequisites\n\n#### Required modules\n\nModules are available in most GNU/Linux distributions, or from their respective websites.\n\n* [networkx](https://networkx.github.io/)\n\n* [matplotlib](https://matplotlib.org/)\n\n* [numpy](http://www.numpy.org/)\n\n* [Biopython](http://biopython.org/)\n\n* [xlsxwriter](http://xlsxwriter.readthedocs.io/)\n\n#### Required files\nYou need an OBO-formatted version of the Gene Ontology. Depending on your needs, this will usually be one of [go-basic.obo](http://purl.obolibrary.org/obo/go/go-basic.obo) or [go.obo](http://purl.obolibrary.org/obo/go.obo). For more details, and to download either the most recent daily version or the latest version, go to the [Gene Ontology website](http://geneontology.org/page/download-ontology). 
\n\n\n\n### Installation\n\nInstalling from source\n```\ngit clone https://github.com/Rinoahu/debias\ncd debias\npython setup.py install\n```\n\nInstalling with pip\n```\npip install debias\n```\nOR\n```\npip install git+https://github.com/Rinoahu/debias\n```\n\n### Initial files\nThese files will be created upon running `debias_prep`. \n`debias_prep -i data/GOFILE.obo`\n\nGOFILE will usually be one of `go.obo` or `go-basic.obo`\n\nThis will generate seven files in total. Three files correspond to the three ontologies. Three files correspond to the mapping between each GO_term and its ancestors in its own respective ontology. The last file contains the mapping from alternate GO_ID to actual GO_ID. Please use this command when a new go.obo file is released.\n```\n1. ./data/alt_to_id.graph : Needed to obtain mapping from alternate GO_ID to actual GO_ID\n2. ./data/mf.graph : The MFO Ontology graph\n3. ./data/bp.graph : The BPO Ontology graph\n4. ./data/cc.graph : The CCO Ontology graph\n5. ./data/mf_ancestors.map : The MFO Ancestors map\n6. ./data/bp_ancestors.map : The BPO Ancestors map\n7. ./data/cc_ancestors.map : The CCO Ancestors map\n```\n\n### Quick setup steps\n\n1. Download the latest go.obo file from http://www.geneontology.org/ontology/ \n\n2. Run the `debias_prep` program and provide the downloaded\n .obo file. See the usage details below. This program needs to be run only when a new .obo file needs to be used.\n\n3. Run the `debias` program \n\n\n```\nusage: debias [-h] [--prefix PREFIX] [--cutoff_prot CUTOFF_PROT]\n [--cutoff_attn CUTOFF_ATTN] [--output OUTPUT]\n [--evidence EVIDENCE [EVIDENCE ...] | --evidence_inverse\n EVIDENCE_INVERSE [EVIDENCE_INVERSE ...]] --input INPUT\n [INPUT ...] [--aspect ASPECT [ASPECT ...]]\n [--assigned_by ASSIGNED_BY [ASSIGNED_BY ...] 
|\n --assigned_by_inverse ASSIGNED_BY_INVERSE\n [ASSIGNED_BY_INVERSE ...]] [--recalculate RECALCULATE]\n [--info_threshold_Wyatt_Clark_percentile INFO_THRESHOLD_WYATT_CLARK_PERCENTILE | --info_threshold_Wyatt_Clark INFO_THRESHOLD_WYATT_CLARK]\n [--info_threshold_Phillip_Lord_percentile INFO_THRESHOLD_PHILLIP_LORD_PERCENTILE | --info_threshold_Phillip_Lord INFO_THRESHOLD_PHILLIP_LORD]\n [--verbose VERBOSE] [--date_before DATE_BEFORE]\n [--date_after DATE_AFTER] [--single_file SINGLE_FILE]\n [--select_references SELECT_REFERENCES [SELECT_REFERENCES ...]\n | --select_references_inverse SELECT_REFERENCES_INVERSE\n [SELECT_REFERENCES_INVERSE ...]] [--report REPORT]\n [-histogram HISTOGRAM]\n\noptional arguments:\n -h, --help show this help message and exit\n --prefix PREFIX, -pref PREFIX\n Add a prefix to the name of your output files.\n --cutoff_prot CUTOFF_PROT, -cprot CUTOFF_PROT\n The threshold level for deciding to eliminate\n annotations which come from references that annotate\n more than the given 'threshold' number of PROTEINS\n --cutoff_attn CUTOFF_ATTN, -cattn CUTOFF_ATTN\n The threshold level for deciding to eliminate\n annotations which come from references that annotate\n more than the given 'threshold' number of ANNOTATIONS\n --output OUTPUT, -odir OUTPUT\n Writes the final outputs to the directory in this\n path.\n --evidence EVIDENCE [EVIDENCE ...], -e EVIDENCE [EVIDENCE ...]\n Accepts Standard Evidence Codes outlined in\n http://geneontology.org/page/guide-go-evidence-codes.\n All 3 letter code for each standard evidence is\n acceptable. In addition to that EXPEC is accepted\n which will pull out all annotations which are made\n experimentally. COMPEC will extract all annotations\n which have been done computationally. Similarly,\n AUTHEC and CUREC are also accepted. 
Cannot be provided\n if -einv is provided\n --evidence_inverse EVIDENCE_INVERSE [EVIDENCE_INVERSE ...], -einv EVIDENCE_INVERSE [EVIDENCE_INVERSE ...]\n Leaves out the provided Evidence Codes. Cannot be\n provided if -e is provided\n --aspect ASPECT [ASPECT ...], -a ASPECT [ASPECT ...]\n Enter P, C or F for Biological Process, Cellular\n Component or Molecular Function respectively\n --assigned_by ASSIGNED_BY [ASSIGNED_BY ...], -assgn ASSIGNED_BY [ASSIGNED_BY ...]\n Choose only those annotations which have been\n annotated by the provided list of databases. Cannot be\n provided if -assgninv is provided\n --assigned_by_inverse ASSIGNED_BY_INVERSE [ASSIGNED_BY_INVERSE ...], -assgninv ASSIGNED_BY_INVERSE [ASSIGNED_BY_INVERSE ...]\n Choose only those annotations which have NOT been\n annotated by the provided list of databases. Cannot be\n provided if -assgn is provided\n --recalculate RECALCULATE, -recal RECALCULATE\n Set this to 1 if you wish to enforce the recalculation\n of the Information Accretion for every GO term.\n Calculation of the information accretion is time\n consuming. Therefore keep it to zero if you are\n performing rerun on old data. The program will then\n read the information accretion values from a file\n which it wrote to in the previous run of the program\n --info_threshold_Wyatt_Clark_percentile INFO_THRESHOLD_WYATT_CLARK_PERCENTILE, -WCTHRESHp INFO_THRESHOLD_WYATT_CLARK_PERCENTILE\n Provide the percentile p. All annotations having\n information content below p will be discarded\n --info_threshold_Wyatt_Clark INFO_THRESHOLD_WYATT_CLARK, -WCTHRESH INFO_THRESHOLD_WYATT_CLARK\n Provide a threshold value t. All annotations having\n information content below t will be discarded\n --info_threshold_Phillip_Lord_percentile INFO_THRESHOLD_PHILLIP_LORD_PERCENTILE, -PLTHRESHp INFO_THRESHOLD_PHILLIP_LORD_PERCENTILE\n Provide the percentile p. All annotations having\n information content below p will be discarded. 
So if 5 is provided, proteins annotated by \n terms whose score is in the top 5% will be left in, the rest will be discarded.\n --info_threshold_Phillip_Lord INFO_THRESHOLD_PHILLIP_LORD, -PLTHRESH INFO_THRESHOLD_PHILLIP_LORD\n Provide a value t. All annotations having\n information content below t will be discarded\n --verbose VERBOSE, -v VERBOSE\n Set this argument to 1 if you wish to view the outcome\n of each operation on the console\n --date_before DATE_BEFORE, -dbfr DATE_BEFORE\n The date entered here will be parsed by the parser\n from dateutil package. For more information on\n acceptable date formats please visit\n https://github.com/dateutil/dateutil/. All annotations\n made prior to this date will be picked up\n --date_after DATE_AFTER, -daftr DATE_AFTER\n The date entered here will be parsed by the parser\n from dateutil package. For more information on\n acceptable date formats please visit\n https://github.com/dateutil/dateutil/. All annotations\n made after this date will be picked up\n --single_file SINGLE_FILE, -single SINGLE_FILE\n Set to 1 in order to output the results of each\n individual species in a single file.\n --select_references SELECT_REFERENCES [SELECT_REFERENCES ...], -selref SELECT_REFERENCES [SELECT_REFERENCES ...]\n Provide the paths to files which contain references\n you wish to select. It is possible to include\n references in case you wish to select annotations made\n by a few references. This will prompt the program to\n interpret string which have the keywords\n 'GO_REF','PMID' and 'Reactome' as a GO reference.\n Strings which do not contain that keyword will be\n interpreted as a file path which the program will\n except to contain a list of GO references. The program\n will accept a mixture of GO_REF and file names. It is\n also possible to choose all references of a particular\n category and a handful of references from another. For\n example if you wish to choose all PMID references,\n just put PMID. 
The program will then select all PMID\n references. Currently the program can accept PMID,\n GO_REF and Reactome\n --select_references_inverse SELECT_REFERENCES_INVERSE [SELECT_REFERENCES_INVERSE ...], -selrefinv SELECT_REFERENCES_INVERSE [SELECT_REFERENCES_INVERSE ...]\n Works like -selref but does not select the references\n which have been provided as input\n --report REPORT, -r REPORT\n Provide the path where the report file will be stored.\n If you are providing a path please make sure your path\n ends with a '/'. Otherwise the program will assume the\n last string after the final '/' as the name of the\n report file. A single report file will be generated.\n Information for each species will be put into\n individual worksheets.\n --histogram HISTOGRAM, -hist HISTOGRAM\n Set this option to 1 if you wish to view the histogram\n of GO_TERM frequency before and after debiasing is\n performed with respect to cutoffs based on number of\n proteins or annotations. If you wish to save the file\n then please enter a filepath. If you are providing a\n path please make sure your path ends with a '/'.\n Otherwise the program will assume the last string\n after the final '/' as the name of the image file.\n Separate histograms will be generated for each\n species.\n\nRequired arguments:\n --input INPUT [INPUT ...], -i INPUT [INPUT ...]\n The input file path. Please remember the name of the\n file must start with goa in front of it, with the name\n of the species following separated by an underscore\n```\n\nNOTE: The files inside the \"temp\" folder are the ones generated by executing the commands below.
 \n### Examples\n\n1. `debias_prep -i data/go.obo` \n\nThis command will generate seven files in total. Three files correspond\nto the three ontologies. Three files correspond to the mapping between\neach GO_term and its ancestors in its own respective ontology. The last\nfile contains the mapping from alternate GO_ID to actual GO_ID. Please use\nthis command every time you update GOFILE. \n\n2. `debias -cprot 100 -i data/goa_yeast.gaf data/goa_dicty.gaf -a C -WCTHRESHp 2 -recal 1`\n\nThis command reads from two input files, one for yeast and the other for\ndicty. The -a C argument selects only the annotations which are CCO. The\n-WCTHRESHp argument specifies that the Wyatt Clark threshold is the 2nd\npercentile, which means all annotations having a Wyatt Clark information\ncontent below 2% will be removed. Instead of providing a percentage\nvalue one can also provide a threshold value using the argument\n-WCTHRESH. In addition, annotations will be removed if they\ncome from references that have in turn annotated more than\n100 **proteins**. The output will be put in the current directory. It is\nnecessary to have -recal 1 in this command since the GO_term to IC mapping has\nto be created. Subsequent runs with a different threshold and all other\nparameters fixed are possible **WITHOUT** providing the argument -recal.\nThis command will produce 3 output files: one for each of the two organisms\nand a third in which both organisms are combined. \n\n3. `debias -i data/goa_yeast.gaf data/goa_dicty.gaf -a C P -PLTHRESHp 30 -e EXPEC IBA -odir data/output -single 1`\n\nThis command will read from two input files and select CCO and BPO\nannotations. Further, it will **choose** only those annotations which\nhave been made experimentally or have been annotated computationally as\n\"IBA\" (Inferred from Biological aspect of Ancestor). In addition to that\nit will discard all annotations which have a Phillip Lord information\ncontent less than 30%. 
Instead of providing a percentage value one can\nalso provide a threshold value using the argument -PLTHRESH. The final\noutput will be put inside the data/output directory. You can include non\nexistent paths. The program will attempt to create the folders if\nrequired permissions are present. This will lead to only one file, since\nthe -single argument has been provided, which will contain all the\nselected annotations from both the organisms. \n\n4. `debias -cattn 1000 -i data/goa_yeast.gaf data/goa_dicty.gaf -a C P -einv COMPEC -pref testing -selrefinv Reactome`\n\nThis command will read from two input files, select CCO and BPO\nannotations. Further, it will **discard** those annotations which have\nbeen made computationally. The program further filters out all\nannotations made by \"Reactome\". All files will be prefixed with the\nstring \"testing\". Since the program creates a meaningful name for each\nfile, the user has been given the opportunity to give a prefix.\n\n### Running test data\n\nTo test all the commands mentioned above, you can run the shell script named test.sh in the tests directory.\n\n```\ngit clone https://github.com/Rinoahu/debias\ncd ./debias/tests\nbash test.sh\n```", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Rinoahu/debias", "keywords": "GO Annotation", "license": "GPLv3", "maintainer": "", "maintainer_email": "", "name": "debias", "package_url": "https://pypi.org/project/debias/", "platform": "", "project_url": "https://pypi.org/project/debias/", "project_urls": { "Homepage": "https://github.com/Rinoahu/debias" }, "release_url": "https://pypi.org/project/debias/0.165/", "requires_dist": null, "requires_python": "", "summary": "remove bias from GAF files", "version": "0.165" }, "last_serial": 2936021, "releases": { "0.165": [ { "comment_text": "", "digests": { "md5": "437f009daa376e7260d7666a72af29d0", 
"sha256": "2a232f8729e0052b28a39147e912c17e43cda12067be2aa4d73efeee4f3ef8d8" }, "downloads": -1, "filename": "debias-0.165.tar.gz", "has_sig": false, "md5_digest": "437f009daa376e7260d7666a72af29d0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6343113, "upload_time": "2017-06-08T16:54:26", "url": "https://files.pythonhosted.org/packages/90/09/143e3c2aba5588dd2899fc68179ffafa9917c469183190cb35d308b26029/debias-0.165.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "437f009daa376e7260d7666a72af29d0", "sha256": "2a232f8729e0052b28a39147e912c17e43cda12067be2aa4d73efeee4f3ef8d8" }, "downloads": -1, "filename": "debias-0.165.tar.gz", "has_sig": false, "md5_digest": "437f009daa376e7260d7666a72af29d0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6343113, "upload_time": "2017-06-08T16:54:26", "url": "https://files.pythonhosted.org/packages/90/09/143e3c2aba5588dd2899fc68179ffafa9917c469183190cb35d308b26029/debias-0.165.tar.gz" } ] }