{ "info": { "author": "Robert Ietswaart", "author_email": "robert_ietswaart@hms.harvard.edu", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Console", "Intended Audience :: Science/Research", "License :: OSI Approved :: BSD License", "Operating System :: OS Independent", "Programming Language :: Python :: 3", "Topic :: Scientific/Engineering :: Bio-Informatics" ], "description": "# GeneWalk\n\n[![License](https://img.shields.io/badge/License-BSD%202--Clause-orange.svg)](https://opensource.org/licenses/BSD-2-Clause)\n[![Documentation](https://readthedocs.org/projects/genewalk/badge/?version=latest)](https://genewalk.readthedocs.io/en/latest/?badge=latest)\n[![PyPI version](https://badge.fury.io/py/genewalk.svg)](https://badge.fury.io/py/genewalk)\n[![Python 3](https://img.shields.io/pypi/pyversions/genewalk.svg)](https://www.python.org/downloads/release/python-357/)\n\nGeneWalk determines for individual genes the functions that are relevant in a\nparticular biological context and experimental condition. GeneWalk quantifies \nthe similarity between vector representations of a gene and annotated GO terms \nthrough representation learning with random walks on a condition-specific gene \nregulatory network. Similarity significance is determined through comparison \nwith node similarities from randomized networks. \n\n## Install GeneWalk\nTo install the latest release of GeneWalk (preferred):\n```\npip install genewalk\n```\nTo install the latest code from Github (typically ahead of releases):\n```\npip install git+https://github.com/churchmanlab/genewalk.git\n```\n\n## Using GeneWalk\n\n### Gene list file\nGeneWalk always requires as input a text file containing a list with genes of interest \nrelevant to the biological context. For example, differentially expressed genes \nfrom a sequencing experiment that compares an experimental versus control condition. 
\nGeneWalk supports gene list files containing HGNC human gene symbols, \nHGNC IDs, or MGI mouse gene IDs. Each line in the file contains a gene identifier of \none of these types.\n\n### GeneWalk command line interface\nOnce installed, GeneWalk can be run from the command line as `genewalk`, with\na set of required and optional arguments. The required arguments include the\nproject name, a path to a text file containing a list of genes, and an argument\nspecifying the type of gene identifiers used in the file.\n\nExample\n```bash\ngenewalk --project context1 --genes gene_list.txt --id_type hgnc_symbol\n```\n\nBelow is the full documentation of the command line interface:\n\n```\ngenewalk [-h] --project PROJECT --genes GENES --id_type\n {hgnc_symbol,hgnc_id,mgi_id}\n [--stage {all,node_vectors,null_distribution,statistics}]\n [--base_folder BASE_FOLDER]\n [--network_source {pc,indra,edge_list,sif}]\n [--network_file NETWORK_FILE] [--nproc NPROC]\n [--nreps_graph NREPS_GRAPH] [--nreps_null NREPS_NULL]\n [--alpha_fdr ALPHA_FDR] [--save_dw SAVE_DW]\n [--random_seed RANDOM_SEED]\n\n\nrequired arguments:\n --project PROJECT A name for the project which determines the folder\n within the base folder in which the intermediate and\n final results are written. Must contain only\n characters that are valid in folder names.\n --genes GENES Path to a text file with a list of differentially\n expressed genes. The type of gene identifiers used in\n the text file is provided in the id_type argument.\n --id_type {hgnc_symbol,hgnc_id,mgi_id}\n The type of gene IDs provided in the text file in the\n genes argument. Possible values are: hgnc_symbol,\n hgnc_id, and mgi_id.\n\noptional arguments:\n --stage {all,node_vectors,null_distribution,statistics}\n The stage of processing to run. Default: all\n --base_folder BASE_FOLDER\n The base folder used to store GeneWalk temporary and\n result files for a given project. 
Default:\n ~/genewalk\n --network_source {pc,indra,edge_list,sif}\n The source of the network to be used. Possible values\n are: pc, indra, edge_list, and sif. In case of indra,\n edge_list, and sif, the network_file argument must be\n specified. Default: pc\n --network_file NETWORK_FILE\n If network_source is indra, this argument points to a\n Python pickle file in which a list of INDRA Statements\n constituting the network is contained. In case\n network_source is edge_list or sif, the network_file\n argument points to a text file representing the\n network.\n --nproc NPROC The number of processors to use in a multiprocessing\n environment. Default: 1\n --nreps_graph NREPS_GRAPH\n The number of repeats to run when calculating node\n vectors on the GeneWalk graph. Default: 3\n --nreps_null NREPS_NULL\n The number of repeats to run when calculating node\n vectors on the random network graphs for constructing\n the null distribution. Default: 3\n --alpha_fdr ALPHA_FDR\n The false discovery rate to use when outputting the\n final statistics table. If 1 (default), all\n similarities are output, otherwise only the ones whose\n false discovery rate is below this parameter are\n included. Default: 1\n --save_dw SAVE_DW If True, the full DeepWalk object for each repeat is\n saved in the project folder. This can be useful for\n debugging but the files are typically very large.\n Default: False\n --random_seed RANDOM_SEED\n If provided, the random number generator is seeded\n with the given value. This should only be used if the\n goal is to deterministically reproduce a prior result\n obtained with the same random seed.\n\n```\n\n\n### Output files\nGeneWalk automatically creates a `genewalk` folder in the user's home folder \n(or the user-specified base_folder).\nWhen running GeneWalk, one of the required inputs is a project name.\nA sub-folder is created for the given project name where all intermediate and\nfinal results are stored. 
The files stored in the project folder are:\n- **`genewalk_results.csv`** - The main results table, a comma-separated values text file. See below for a detailed description.\n- `genes.pkl` - A processed representation of the given gene list, in Python pickle (.pkl) binary file format.\n- `multi_graph.pkl` - A networkx MultiGraph representing the GeneWalk network, which was assembled based on the\ngiven list of genes, an interaction network, GO annotations, and the GO ontology.\n- `deepwalk_node_vectors_*.pkl` - A set of learned node vectors for each analysis repeat for the graph.\n- `deepwalk_node_vectors_rand_*.pkl` - A set of learned node vectors for each analysis repeat for a random graph.\n- `genewalk_rand_simdists.pkl` - Distributions constructed from repeats.\n- `deepwalk_*.pkl` - A DeepWalk object for each analysis repeat on the graph \n(only present if save_dw argument is set to True).\n- `deepwalk_rand_*.pkl` - A DeepWalk object for each analysis repeat on a random graph \n(only present if save_dw argument is set to True).\n\n\n### GeneWalk results file description\n`genewalk_results.csv` is the main GeneWalk output table, a comma-separated values text file \nwith the following column headers:\n- hgnc_id - human gene HGNC identifier.\n- **hgnc_symbol** - human gene symbol.\n- **go_name** - GO term name.\n- go_id - GO term identifier.\n- go_domain - Ontology domain that the GO term belongs to \n(biological process, cellular component or molecular function).\n- ncon_gene - number of connections to gene in GeneWalk network.\n- ncon_go - number of connections to GO term in GeneWalk network.\n- **mean_padj** - mean false discovery rate (FDR) adjusted p-value of the similarity between gene and GO term.\nThis is the key statistic indicating how relevant the GO term (function) is to the gene in the \nparticular biological context or tested condition. 
GeneWalk determines an adjusted p-value with \nBenjamini-Hochberg FDR correction for multiple testing over all connected GO terms for each \nnreps_graph repeat analysis. The value presented here is the average of the adjusted p-values \nover all repeat analyses. \n- cilow_padj - lower bound of 95% confidence interval on mean_padj estimate from the nreps_graph repeat analyses.\n- ciupp_padj - upper bound of 95% confidence interval on mean_padj estimate.\n- mean_pval - mean p-value of gene - GO term similarities, not FDR corrected for multiple testing.\n- cilow_pval - lower bound of 95% confidence interval on mean_pval estimate.\n- ciupp_pval - upper bound of 95% confidence interval on mean_pval estimate.\n- mean_sim - mean of gene - GO term similarities.\n- sem_sim - standard error on mean_sim estimate.\n- mgi_id - in case mouse gene MGI identifiers were provided as input, the GeneWalk results \ntable starts with an additional mgi_id column to indicate these mouse genes. In this\ncase the corresponding hgnc_id and hgnc_symbol represent the human ortholog\ngene used for the GeneWalk analysis.\n\n\n### Stages and run times of GeneWalk algorithm\nGiven a list of genes, GeneWalk runs three stages of analysis:\n1. Assembling a GeneWalk network and learning node vector representations\nby running DeepWalk on this network, for a specified number of repeats. \nTypical run time: one to a few hours.\n2. Learning random node vector representations by running DeepWalk on a set of\nrandomized versions of the GeneWalk network, for a specified number of\nrepeats. Typical run time: one to a few hours.\n3. Calculating statistics of similarities between genes and GO terms, and\noutputting the GeneWalk results in a table. Typical run time: a few minutes.\n\nGeneWalk can either be run once to complete all these stages (default), or called separately\nfor each stage (optional argument: stage). \nRecommended memory availability on your operating system: 16 GB or 32 GB RAM. 
\nRecommended number of processors (optional argument: nproc) for a short run time is 4:\n```bash\ngenewalk --project context1 --genes gene_list.txt --id_type hgnc_symbol --nproc 4\n```\nGeneWalk outputs the uncertainty (95% confidence intervals) of the similarity significance\n(mean p-adjust). Depending on the context-specific network topology, this uncertainty can be \nlarge for individual gene - function associations. However, if overall the uncertainties \nturn out very large, one can set the optional arguments nreps_graph to 10 (or more) and \nnreps_null to 10 to increase the algorithm's precision. This comes at the cost of an \nincreased run time. \n\n\n### Further documentation\nFor a tutorial and more general information see the \n[GeneWalk website](http://churchman.med.harvard.edu/genewalk). \nFor further code documentation see our [readthedocs page](https://genewalk.readthedocs.io).\n\n\n### Citation\nRobert Ietswaart, Benjamin M. Gyori, John A. Bachman, Peter K. Sorger, and \nL. Stirling Churchman \n*GeneWalk identifies relevant gene functions for a biological context using network \nrepresentation learning* (2019), BioRxiv; 755579.\n\n\n### Funding\nThis work was supported by National Institutes of Health grant 5R01HG007173-07 \n(L.S.C.), EMBO fellowship ALTF 2016-422 (R.I.), and DARPA grants W911NF-15-1-0544 \nand W911NF018-1-0124 (P.K.S.). 
\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/churchmanlab/genewalk", "keywords": "gene function,network,embedding", "license": "", "maintainer": "", "maintainer_email": "", "name": "genewalk", "package_url": "https://pypi.org/project/genewalk/", "platform": "", "project_url": "https://pypi.org/project/genewalk/", "project_urls": { "Homepage": "https://github.com/churchmanlab/genewalk" }, "release_url": "https://pypi.org/project/genewalk/1.1.0/", "requires_dist": [ "numpy", "pandas", "networkx (>=2.1)", "gensim", "goatools", "indra", "scipy (>=1.3.0)" ], "requires_python": "", "summary": "Determine gene function based on network embeddings.", "version": "1.1.0" }, "last_serial": 5781478, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "251fc4d7f94b4b859bfd2b993c1bd82f", "sha256": "92eab6508671c43c67aae9f574391d7e4106639adafbbc14f448e33a5f2dceff" }, "downloads": -1, "filename": "genewalk-1.0.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "251fc4d7f94b4b859bfd2b993c1bd82f", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 25277, "upload_time": "2019-08-30T13:58:18", "url": "https://files.pythonhosted.org/packages/0f/8e/27fb05a1d89a43a056c0f599d626be4482aa76d36ccbda68f8f29da60478/genewalk-1.0.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "38ed7eeef5b79b6e281ac3e309d9d01b", "sha256": "6385c5c13898b0beeaea9eb4a2c4c0e3d09236170ba6b1cc3750c5cec77fe22d" }, "downloads": -1, "filename": "genewalk-1.0.0.tar.gz", "has_sig": false, "md5_digest": "38ed7eeef5b79b6e281ac3e309d9d01b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 26212, "upload_time": "2019-08-30T13:58:20", "url": 
"https://files.pythonhosted.org/packages/ef/53/81f9971fc84b0f664d92f41e5602beaa9bc279b08d15af53bbbe638bd5c1/genewalk-1.0.0.tar.gz" } ], "1.1.0": [ { "comment_text": "", "digests": { "md5": "513d248f6a18d1471e44f027ed78920c", "sha256": "b0ad45ce7ed268aeeb6d0bb1db87e3407ff4d0da2935790e1ea3234f3f6dd6e1" }, "downloads": -1, "filename": "genewalk-1.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "513d248f6a18d1471e44f027ed78920c", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 25907, "upload_time": "2019-09-04T14:19:38", "url": "https://files.pythonhosted.org/packages/f0/d7/024a2682f138656e0715e85001d548965e4cfd9d41477f131bea9d8faa1a/genewalk-1.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "edf25640142ca41a3d291728832467d0", "sha256": "10e58f27c30079ff2c555fce6b95ea6104fb0067a9d0bfe756993e911e35f719" }, "downloads": -1, "filename": "genewalk-1.1.0.tar.gz", "has_sig": false, "md5_digest": "edf25640142ca41a3d291728832467d0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 27665, "upload_time": "2019-09-04T14:19:40", "url": "https://files.pythonhosted.org/packages/00/52/ad037a22d336993a2b53646c28e5318fed7cc18c8f8327c38d596dc0767c/genewalk-1.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "513d248f6a18d1471e44f027ed78920c", "sha256": "b0ad45ce7ed268aeeb6d0bb1db87e3407ff4d0da2935790e1ea3234f3f6dd6e1" }, "downloads": -1, "filename": "genewalk-1.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "513d248f6a18d1471e44f027ed78920c", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 25907, "upload_time": "2019-09-04T14:19:38", "url": "https://files.pythonhosted.org/packages/f0/d7/024a2682f138656e0715e85001d548965e4cfd9d41477f131bea9d8faa1a/genewalk-1.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "edf25640142ca41a3d291728832467d0", "sha256": 
"10e58f27c30079ff2c555fce6b95ea6104fb0067a9d0bfe756993e911e35f719" }, "downloads": -1, "filename": "genewalk-1.1.0.tar.gz", "has_sig": false, "md5_digest": "edf25640142ca41a3d291728832467d0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 27665, "upload_time": "2019-09-04T14:19:40", "url": "https://files.pythonhosted.org/packages/00/52/ad037a22d336993a2b53646c28e5318fed7cc18c8f8327c38d596dc0767c/genewalk-1.1.0.tar.gz" } ] }