{ "info": { "author": "Andrew Kern", "author_email": "", "bugtrack_url": null, "classifiers": [], "description": "# diploS/HIC\nThis repo contains the implementation for `diploS/HIC` as described in Kern and Schrider (2018; https://doi.org/10.1534/g3.118.200262), along \nwith its associated support scripts. `diploS/HIC` uses a deep convolutional neural network to identify\nhard and soft selective sweep in population genomic data. \n\nThe workflow for analysis using `diploS/HIC` consists of four basic parts. 1) Generation of a training set for `diploS/HIC` \nusing simulation. 2) `diploS/HIC` training and performance evaluation. 3) Calculation of `dipoS/HIC` feature vectors from genomic data.\n4) prediction on empirical data using the trained network. The software provided here can handle the last three parts; population\ngenetic simulations must be performed using separate software such as discoal (https://github.com/kern-lab/discoal) \n\n## Installation\n`diploS/HIC` has a number of dependencies that should be straightforward to install using python package managers\nsuch as `conda` or `pip`. The complete list of dependencies looks like this:\n\n- numpy\n- scipy\n- pandas\n- scikit-allel\n- scikit-learn\n- tensorflow\n- keras\n\n## Install on linux\nI'm going to focus on the steps involved to install on a linux machine using Anaconda as our python source / main\npackage manager. First download and install Anaconda\n\n```\n$ wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh\n$ bash Anaconda3-5.0.1-Linux-x86_64.sh\n```\nThat will give us the basics (numpy, scipy, scikit-learn, etc). Next lets install scikit-allel using `conda`\n```\n$ conda install -c conda-forge scikit-allel\n```\nThat's easy. Installing tensorflow and keras can be slightly more touchy. You will need to determine if \nyou want to use a CPU-only implementation (probably) or a GPU implementation of tensorflow. See\nhttps://www.tensorflow.org/install/install_linux for install instructions. I'm going to install the\nCPU version for simplicity. tensorflow and keras are the libraries which handle the deep learning\nportion of things so it is important to make sure the versions of these libraries play well together \n```\n$ pip install tensorflow \n$ pip install keras\n```\nNote that because I'm using the Anaconda version of python, pip will only install this in the anaconda directory\nwhich is a good thing. Okay that should be the basics of dependencies. Now we are ready to install `diploS/HIC` itself\n```\n$ git clone https://github.com/kern-lab/diploSHIC.git\n$ cd diploSHIC \n$ python setup.py install\n```\nAssuming all the dependencies were installed this should be all set\n\n## Usage\nThe main program that you will interface with is `diploSHIC.py`. This script has four run modes that allow the user to \nperform each of the main steps in the supervised machine learning process. We will briefly lay out the modes of use\nand then will provide a complete example of how to use the program for fun and profit.\n\n`diploSHIC.py` uses the `argparse` module in python to try to give the user a complete, command line based help menu. \nWe can see the top level of this help by typing\n```\n$ python diploSHIC.py -h\nusage: diploSHIC.py [-h] {train,predict,fvecSim,fvecVcf} ...\n\ncalculate feature vectors, train, or predict with diploSHIC\n\npossible modes (enter 'python diploSHIC.py modeName -h' for modeName's help message:\n {fvecSim,makeTrainingSets,train,fvecVcf,predict}\n sub-command help\n fvecSim Generate feature vectors from simulated data\n makeTrainingSets Combine feature vectors from muliple fvecSim runs into\n 5 balanced training sets\n train train and test a shic CNN\n fvecVcf Generate feature vectors from data in a VCF file\n predict perform prediction using an already-trained SHIC CNN\n\noptional arguments:\n -h, --help show this help message and exit\n```\n### before running diploSHIC: simulating training/testing data\nAll flavors of S/HIC require simulated data for training (and ideally, testing). Users can select whatever simulator \nthey prefer and parameterize them however they wish. We have included an example script in this respository \n(generateSimLaunchScript.py) which demonstrates how a training set can be simulated with discoal (available at \nhttps://github.com/kern-lab/discoal).\n\n### feature vector generation modes\nThe first task in our pipeline is generating feature vectors from simulation data (or empirical data) to\nuse with the CNN that we will train and then use for prediction. The `diploSHIC.py` script eases this \nprocess with two run modes\n\n#### fvecSim mode\nThe fvecSim run mode is used for turning ms-style output into feature vectors compatible with `diploSHIC.py`. The\nhelp message from this mode looks like this\n```\n$ python diploSHIC.py fvecSim -h\nusage: diploSHIC.py fvecSim [-h] [--totalPhysLen TOTALPHYSLEN]\n [--numSubWins NUMSUBWINS]\n [--maskFileName MASKFILENAME]\n [--chrArmsForMasking CHRARMSFORMASKING]\n [--unmaskedFracCutoff UNMASKEDFRACCUTOFF]\n [--outStatsDir OUTSTATSDIR]\n [--ancFileName ANCFILENAME] [--pMisPol PMISPOL]\n shicMode msOutFile fvecFileName\n\nrequired arguments:\n shicMode specifies whether to use original haploid SHIC (use\n 'haploid') or diploSHIC ('diploid')\n msOutFile path to simulation output file (must be same format\n used by Hudson's ms)\n fvecFileName path to file where feature vectors will be written\n\noptional arguments:\n -h, --help show this help message and exit\n --totalPhysLen TOTALPHYSLEN\n Length of simulated chromosome for converting infinite\n sites ms output to finite sites (default=1100000)\n --numSubWins NUMSUBWINS\n The number of subwindows that our chromosome will be\n divided into (default=11)\n --maskFileName MASKFILENAME\n Path to a fasta-formatted file that contains masking\n information (marked by 'N'). If specified, simulations\n will be masked in a manner mirroring windows drawn\n from this file.\n --chrArmsForMasking CHRARMSFORMASKING\n A comma-separated list (no spaces) of chromosome arms\n from which we want to draw masking information (or\n 'all' if we want to use all arms. Ignored if\n maskFileName is not specified.\n --unmaskedFracCutoff UNMASKEDFRACCUTOFF\n Minimum fraction of unmasked sites, if masking\n simulated data\n --outStatsDir OUTSTATSDIR\n Path to a directory where values of each statistic in\n each subwindow are recorded for each rep\n --ancFileName ANCFILENAME\n Path to a fasta-formatted file that contains inferred\n ancestral states ('N' if unknown). This is used for\n masking, as sites that cannot be polarized are masked,\n and we mimic this in the simulted data. Ignored in\n diploid mode which currently does not use ancestral\n state information\n --pMisPol PMISPOL The fraction of sites that will be intentionally\n polarized to better approximate real data\n```\nThis mode takes three arguments and then offers many options. The arguments are the \"shicMode\", i.e. whether\nto calculate the haploid or diploid summary statistics, the name of the input file, and the name of the output file. \nThe various options allow one to account for missing data (via masking), unfolding the site frequency spectrum via the ancestral\nstates file (haploid only), and a mis-polarization rate of that unfolded site frequency spectrum. Please see the example usage below\nfor a fleshed out example of how to use these features.\n\n#### fvecVcf mode\nThe fvecVcf mode is used for calculating feature vectors from data that is stored as a VCF file. \nThe help message from this mode is as follows\n```\n$ python diploSHIC.py fvecVcf -h\nusage: diploSHIC.py fvecVcf [-h] [--targetPop TARGETPOP]\n [--sampleToPopFileName SAMPLETOPOPFILENAME]\n [--winSize WINSIZE] [--numSubWins NUMSUBWINS]\n [--maskFileName MASKFILENAME]\n [--unmaskedFracCutoff UNMASKEDFRACCUTOFF]\n [--ancFileName ANCFILENAME]\n [--statFileName STATFILENAME]\n [--segmentStart SEGMENTSTART]\n [--segmentEnd SEGMENTEND]\n shicMode chrArmVcfFile chrArm chrLen\n\nrequired arguments:\n shicMode specifies whether to use original haploid SHIC (use\n 'haploid') or diploSHIC ('diploid')\n chrArmVcfFile VCF format file containing data for our chromosome arm\n (other arms will be ignored)\n chrArm Exact name of the chromosome arm for which feature\n vectors will be calculated\n chrLen Length of the chromosome arm\n fvecFileName path to file where feature vectors will be written\n\noptional arguments:\n -h, --help show this help message and exit\n --targetPop TARGETPOP\n Population ID of samples we wish to include\n --sampleToPopFileName SAMPLETOPOPFILENAME\n Path to tab delimited file with population\n assignments; format: SampleID popID\n --winSize WINSIZE Length of the large window (default=1100000)\n --numSubWins NUMSUBWINS\n Number of sub-windows within each large window\n (default=11)\n --maskFileName MASKFILENAME\n Path to a fasta-formatted file that contains masking\n information (marked by 'N'); must have an entry with\n title matching chrArm\n --unmaskedFracCutoff UNMASKEDFRACCUTOFF\n Fraction of unmasked sites required to retain a\n subwindow\n --ancFileName ANCFILENAME\n Path to a fasta-formatted file that contains inferred\n ancestral states ('N' if unknown); must have an entry\n with title matching chrArm. Ignored for diploid mode\n which currently does not use ancestral state\n information.\n --statFileName STATFILENAME\n Path to a file where statistics will be written for\n each subwindow that is not filtered out\n --segmentStart SEGMENTSTART\n Left boundary of region in which feature vectors are\n calculated (whole arm if omitted)\n --segmentEnd SEGMENTEND\n Right boundary of region in which feature vectors are\n calculated (whole arm if omitted)\n```\nThis mode takes five arguments and again has many options. The required arguments are the \"shicMode\", i.e. whether\nto calculate the haploid or diploid summary statistics, the name of the input file, which chromosome to arm to calculate\nstatistics for, the length of that chromosome, and the name of the output file.\n\n### training the CNN and prediction\nOnce we have feature vector files ready to go we can train and test our CNN and then finally do prediction on empirical data.\n\n### formatting our training set\nBefore entering train mode we need to consolidate our training set into 5 files, one for each class. This is done using the\nmakeTrainingSets mode whose help message is as follows:\n```\n$ python diploSHIC.py makeTrainingSets -h\nusage: diploSHIC.py makeTrainingSets [-h]\n neutTrainingFileName\n softTrainingFilePrefix\n hardTrainingFilePrefix\n sweepTrainingWindows\n linkedTrainingWindows outDir\n\nrequired arguments:\n neutTrainingFileName Path to our neutral feature vectors\n softTrainingFilePrefix\n Prefix (including higher-level path) of files\n containing soft training examples; files must end with\n '_$i.$ext' where $i is the subwindow index of the\n sweep and $ext is any extension.\n hardTrainingFilePrefix\n Prefix (including higher-level path) of files\n containing hard training examples; files must end with\n '_$i.$ext' where $i is the subwindow index of the\n sweep and $ext is any extension.\n sweepTrainingWindows comma-separated list of windows to classify as sweeps\n (usually just '5' but without the quotes)\n linkedTrainingWindows\n list of windows to treat as linked to sweeps (usually\n '0,1,2,3,4,6,7,8,9,10' but without the quotes)\n outDir path to directory where the training sets will be\n written\n\noptional arguments:\n -h, --help show this help message and exit\n```\n#### train mode\nHere is the help message for the train mode of `diploSHIC.py`\n```\n$ python diploSHIC.py train -h\nusage: diploSHIC.py train [-h] [--epochs EPOCHS] [--numSubWins NUMSUBWINS]\n trainDir testDir outputModel\n\nrequired arguments:\n trainDir path to training set files\n testDir path to test set files, can be same as trainDir\n outputModel file name for output model, will create two files one\n with structure one with weights\n\noptional arguments:\n -h, --help show this help message and exit\n --epochs EPOCHS max epochs for training CNN (default = 100)\n --numSubWins NUMSUBWINS\n number of subwindows that our chromosome is divided\n into (default = 11)\n```\nAs you will see in a moment train mode is used for training the deep learning classifier. Its required\narguments are trainDir (the directory where the training feature vectors\nare kept), testDir (the directory where the testing feature vectors are kept), and outputModel the file name for the trained\nnetwork. One note -- `diploSHIC.py` expects five files named `hard.fvec`, `soft.fvec`, `neut.fvec`, `linkedSoft.fvec`, and \n`linkedHard.fvec` in the training and testing directories. The training and testing directory can be the same directory in \nwhich case 20% of the training examples are held out for use in testing and validation.\n\ntrain mode has two options, the number of subwindows used for the feature vectors and the number of training epochs for the\nnetwork.\n\n### predict mode\nOnce a classifier has been trained, one uses the predict mode of `diploSHIC.py` to classify empirical data. Here is the help\nstatement\n```\n$ python diploSHIC.py predict -h\nusage: diploSHIC.py predict [-h] [--numSubWins NUMSUBWINS]\n modelStructure modelWeights predictFile\n predictFileOutput\n\nrequired arguments:\n modelStructure path to CNN structure .json file\n modelWeights path to CNN weights .h5 file\n predictFile input file to predict\n predictFileOutput output file name\n\noptional arguments:\n -h, --help show this help message and exit\n --numSubWins NUMSUBWINS\n number of subwindows that our chromosome is divided\n into (default = 11)\n```\nThe predict mode takes as input the two model files output by the train mode, an input file of empirical feature \nvectors, and a file name for the prediction output. \n\n#### a quick example of the train/predict cycle\nWe have supplied in the repo some example data that can give you a quick run through the train/predict cycle (we will also\nshortly provide a soup-to-nuts example that starts by calculating feature vectors from simulations and ends with prediction of \ngenomic data). Let's quickly give that code a spin. The directories `testing/` and `training/` each contain appropriately\nformatted diploid feature vectors that are ready to be fed into diploSHIC. First we will train the diploSHIC CNN, but we will\nrestrict the number of training epochs to 10 to keep things relatively brief (this runs in less than 5 minutes on our server). \n```\n$ python diploSHIC.py train training/ testing/ fooModel --epochs 10\n```\nas it runs a bunch of information monitoring the training of the network will apear. We are tracking the loss and accuracy in the\nvalidation set. When optimization is complete our trained network will be contained in two files, `fooModel.json` and \n`fooModel.weights.hdf5`. The last bit of output from `diploSHIC.py` gives us information about the loss and accuracy on\nthe held out test data. From the above run my looks like this:\n```\nevaluation on test set:\ndiploSHIC loss: 0.404791\ndiploSHIC accuracy: 0.846800\n```\nNot bad. In practice I would set the `--epochs` value much higher than 10- the default setting of 100 should suffice in most cases.\nNow that we have a trained model we can make predictions on some empirical data. In the repo there is a file called `testEmpirical.fvec`\nthat we will use as input\n```\n$ python diploSHIC.py predict fooModel.json fooModel.weights.hdf5 testEmpirical.fvec testEmpirical.preds\n```\nthe output predictions will be saved in `testEmpirical.preds` and should be straightforward to interpret.\n\n### A complete test case\nIn the interest of showing the user the whole enchilada when it comes to the workflow, I've provided the user\nwith a more detailed example on the wiki of this repo. That example can be found here: https://github.com/kern-lab/diploSHIC/wiki/A-soup-to-nuts-example\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/jgallowa07/diploSHIC", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "diploshictest", "package_url": "https://pypi.org/project/diploshictest/", "platform": "", "project_url": "https://pypi.org/project/diploshictest/", "project_urls": { "Homepage": "https://github.com/jgallowa07/diploSHIC" }, "release_url": "https://pypi.org/project/diploshictest/0.2/", "requires_dist": [ "numpy", "scipy", "pandas", "scikit-allel", "scikit-learn", "tensorflow", "keras" ], "requires_python": "", "summary": "DiploS/HIC", "version": "0.2" }, "last_serial": 4573163, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "362024d8c26792dcad8f5cbbdf58c547", "sha256": "04803e0b65a37a24b341dea1d417e483b8369d7b30e0c480685353106ba87bcc" }, "downloads": -1, "filename": "diploshictest-0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "362024d8c26792dcad8f5cbbdf58c547", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 37035, "upload_time": "2018-12-07T19:37:23", "url": "https://files.pythonhosted.org/packages/b5/ca/058422e9e06950b92548bf1b7c065bf6516d4b66de457733b8e5b3d9a8a3/diploshictest-0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1265020ef644e78fb87001dc262d688d", "sha256": "71dd2ec3d53a0b2173afd91f0b3ac042f01aeedf6eb7779ffb00022fc6058016" }, "downloads": -1, "filename": "diploshictest-0.1.tar.gz", "has_sig": false, "md5_digest": "1265020ef644e78fb87001dc262d688d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 35815, "upload_time": "2018-12-07T19:37:25", "url": "https://files.pythonhosted.org/packages/c3/e6/70ee7a6d0d086ee224fe91174b7353101d24a4d5b051ab827e2e5638e361/diploshictest-0.1.tar.gz" } ], "0.2": [ { "comment_text": "", "digests": { "md5": "08c28bfd958e90108bf2e3b741aa755e", "sha256": "27fde049f711a88883b2a170fefaff60a1193a335f28cb7327e13ec9f33a41ca" }, "downloads": -1, "filename": "diploshictest-0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "08c28bfd958e90108bf2e3b741aa755e", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 37039, "upload_time": "2018-12-07T20:01:25", "url": "https://files.pythonhosted.org/packages/1d/a3/96015400dc10c4398e35d4b5819d64a74cef45f1593c896212756df443a9/diploshictest-0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "4e621d5221fccef5d5143e8da72faf7f", "sha256": "b873d834e3fba042042e50a20b5d9dac280b99e98bc5ec82419adf244f8316e1" }, "downloads": -1, "filename": "diploshictest-0.2.tar.gz", "has_sig": false, "md5_digest": "4e621d5221fccef5d5143e8da72faf7f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13868243, "upload_time": "2018-12-07T20:01:53", "url": "https://files.pythonhosted.org/packages/82/bf/ecb9830bba7ed2d75f5523332ae85122510d9f284e47cf03346809a7d59a/diploshictest-0.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "08c28bfd958e90108bf2e3b741aa755e", "sha256": "27fde049f711a88883b2a170fefaff60a1193a335f28cb7327e13ec9f33a41ca" }, "downloads": -1, "filename": "diploshictest-0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "08c28bfd958e90108bf2e3b741aa755e", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 37039, "upload_time": "2018-12-07T20:01:25", "url": "https://files.pythonhosted.org/packages/1d/a3/96015400dc10c4398e35d4b5819d64a74cef45f1593c896212756df443a9/diploshictest-0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "4e621d5221fccef5d5143e8da72faf7f", "sha256": "b873d834e3fba042042e50a20b5d9dac280b99e98bc5ec82419adf244f8316e1" }, "downloads": -1, "filename": "diploshictest-0.2.tar.gz", "has_sig": false, "md5_digest": "4e621d5221fccef5d5143e8da72faf7f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13868243, "upload_time": "2018-12-07T20:01:53", "url": "https://files.pythonhosted.org/packages/82/bf/ecb9830bba7ed2d75f5523332ae85122510d9f284e47cf03346809a7d59a/diploshictest-0.2.tar.gz" } ] }