{ "info": { "author": "Jake Biesinger, Yuanfeng Wang, Xiaohui Xie", "author_email": "jake.biesinger@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Science/Research", "License :: OSI Approved :: GNU General Public License (GPL)", "Topic :: Scientific/Engineering :: Bio-Informatics" ], "description": "This package implements a tree hidden Markov model for learning epigenetic \nstates in multiple cell types. It's similar to the bioinformatics tools \n`ChromHMM `_ and \n`Segway `_, except that it allows \nthe user to explicitly model the relationships between cell types or\nspecies. Please see our `paper `_ \nfor further details!\n\nFor the development version, see our page on \n`Github `_.\n\nQuickstart\n----------\n::\n\n # INSTALLATION\n # If you're on Ubuntu\n sudo apt-get install python-pip python-scipy python-matplotlib cython git\n\n # OR if you're on a mac\n ruby -e \"$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)\"\n brew install python scipy matplotlib cython git\n\n # don't forget python prerequisite `pysam`\n sudo pip install -U pysam\n\n\n # grab and build the tree-hmm code\n # easiest way first:\n sudo pip install -U treehmm\n\n # OR using the latest development version\n git clone https://github.com/uci-cbcl/tree-hmm.git\n cd tree-hmm\n\n # building the package:\n # don't install-- just test:\n python setup.py build_ext --inplace\n export PYTHONPATH=$PYTHONPATH:`pwd`\n export PATH=$PATH:`pwd`/bin\n\n # OR go ahead and install\n sudo pip install -e .\n\n\n # RUNNING\n # download some sample .bam files from our server and histogram them to binary .npy\n mkdir data && cd data\n tree-hmm convert \n --download_first \\\n --base_url \"https://cbcl.ics.uci.edu/public_data/tree-hmm-sample-data/%s\" \\\n --chromosomes chr19 \\\n --bam_template \"wgEncodeBroadHistone{species}{mark}StdAlnRep{repnum}.REF_chr19.bam\" \\\n --marks Ctcf H3k27me3 H3k36me3 H4k20me1 H3k4me1 H3k4me2 H3k4me3 H3k27ac H3k9ac \\\n --species H1hesc K562 Gm12878 Huvec Hsmm Nhlf Nhek Hmec\n\n\n # split chr19 into smaller pieces\n tree-hmm split observations.chr19.npy\n\n\n # do inference on the resulting chunks (creates infer_out directory)\n tree-hmm infer 5 --max_iter 3 --approx poc --run_local \"observations.chr19.chunk*.npy\"\n\n\n # convert the inferred states into BED format for viewing on the genome browser\n tree-hmm q_to_bed infer_out/mf/*\n\n\n # upload the BED files to UCSC, analyze, etc\n\n\nPrerequisites\n-------------\n- Scipy\n- Matplotlib (optional for plotting)\n- PySAM (optional for converting SAM files to 0/1 Numpy matrices)\n- Cython >= 0.15.1+ (required for converting .pyx => .c => .so)\n\nPlatform-specific installation instructions are available in the quickstart. \nPlease note we don't support windows!\n\n\nInstalling tree-hmm\n-------------------\n- The easiest way::\n\n sudo pip install treehmm\n\n- For the latest development version::\n \n sudo pip install git+https://github.com/uci-cbcl/tree-hmm.git#egg=treehmm\n\n- To be able to hack on the code yourself::\n\n git clone https://github.com/uci-cbcl/tree-hmm.git\n cd tree-hmm\n sudo pip install -e .\n\n- Or alternatively, if you don't have admin rights::\n\n git clone https://github.com/uci-cbcl/tree-hmm.git\n cd tree-hmm\n python setup.py build_ext --inplace\n export PYTHONPATH=`pwd`:$PYTHONPATH\n export PATH=`pwd`/bin:$PATH\n\n\nRunning the commands\n--------------------\nUse the tree-hmm command from the command line to perform the major\ncommands, including:\n\n- **convert** a set of BAM-formatted mapped reads to numpy matrices \n\n- **split** a chromosome into several variable-length chunks, \n determined via a gaussian convolution of the raw read signal \n\n- **infer** underlying chromatin states from a converted binary matrix\n\n- **q_to_bed** convert the numpy probability matrices into BED files\n (maximum a posteriori state)\n\nEach of these tasks has its own command-line help, accessible via::\n\n tree-hmm convert --help\n tree-hmm split --help\n tree-hmm infer --help\n tree-hmm q_to_bed --help\n\n\nData Conversion\n---------------\n::\n\n tree-hmm convert \\\n --download_first \\\n --marks Ctcf H3k4me2 \\\n --species H1hesc K562\n\nWe require at least one BAM input file for each species/mark combination.\nYou can specify the naming convention your data follows via the \n``--bam_template`` argument.\n\nDuring this step, all of the BAM files will be scanned and\nhistogrammed. Replicants are pooled together and the reads are binarized\nusing a poisson background rate specific to each mark/species\ncombination, i.e.,::\n\n bgrate_{il} = \\frac{\\sum_{t=1}^T count_{itl}} {T}\n markpresence_{il} = poisson.survival(count_{itl}, bg_rate_{il}) < max_pvalue\n\nwhere `max_pvalue` can be specified by the user. This simply imposes a\nthreshold value that's specific to a species/mark combination and should\naccount for sequencing depth differences and a variable number of\nreplicates. Finally, the histograms are split by chromosome and, by\ndefault, written to ``observations.{chrom}.npy``.\n\nThe ENCODE histone data available on UCSC can be downloaded by\nspecifying the ``--download_first`` option, or you can specify other\ndatasets by changing ``--base_url`` and ``--bam_template``, just so long\nas the files are named systematically. See the quickstart above for an\nexample using sample ENCODE data from hg19's chr19.\n\nThe species and marks in use can be modified either by directly editing\nthe treehmm/static.py file or by specifying ``--species`` and/or\n``--marks`` in the convert step. Again, see the quickstart for an\nexample.\n\nThis step creates a new file ``start_positions.pkl`` which contains\ninformation about the original coordinates and the species and marks\nused during the conversion. This file (or its twin created by the\n``split`` step) is required by the final ``q_to_bed`` step.\n\nNote: there is already preliminary support for missing marks/species\nalready present which I can make more user-friendly if there is\ninterest. There is also some work on allowing continuous observations (rather\nthan the current binary-only observations). Raise an issue/feature request on \nGithub if you're interested.\n\n\nSplitting the data\n------------------\nSince the majority of the genome will not have any signal in any of the\nmarks, it is very useful to split up the histone data and only perform\ninference on the regions with adequate signal. This step can also speed\nup inference as only a portion of the genome is used and each chunk is\nconsidered independent of all others.\n\nWe use a gaussian smoothing filter over the binarized data and cut out\nregions where there is no signal in any species or mark. A histogram of\nthe resulting read lengths is drawn to help identify how much of the\ngenome is retained for a given setting. The defaults retained about 50%\nof hg19 on the ENCODE data.\n::\n\n tree-hmm split \\\n observations.chr*.npy \n start_positions.pkl \\\n --min_reads .1 \\\n --gauss_window_size 5 \\\n --min_size 25\n\nNote: the ``--gauss_window_size`` and ``--min_size`` are in terms of\n*bins*. So if you want a smoothing window that acts over 10kb up and\ndownstream (20kb total), and had specified a ``--window_size`` of 200bp\nin the convert phase, you'd specify a ``--gauss_window_size`` of 50.\n\nThis step creates a new file ``start_positions_split.pkl`` which retains\ninformation about the original read coordinates. This file (or its twin\ncreated by the ``convert`` step) is required during the final\n``q_to_bed`` step.\n\n\nInference\n---------\nFor inference, you must specify the number of hidden states to infer and\none or more input files. There are also many parameters you can\nfine-tune with defaults as used in the paper.\n\nInference will try to submit jobs to an SGE cluster. If you don't have\nsuch a cluster, make sure you specify the ``--run_local`` option. If you\ndo, you should clean up the ``SGE_*`` files when inference is complete.\nThose files contain parameters and return values job submissions.\n\nThere are several inference engines implemented for comparison\nincluding:\n\n:mean field approximation (mf): the simplest variational\n approximation, with every node optimized independently. Memory\n use is O(I \\* T \\* K), that is, it scales linearly with the \n number of species (I), the number of bins (T), and the number of states\n (K). \n\n:loopy belief propagation (loopy): similar to mean field in that\n each node is optimized independently, but is has a non-monotonic\n energy trajectory. Works well in some applications, but not very \n well in ours. Memory use is O(I \\* T \\* K). \n\n:product of chains (poc): the entire chain from each species is \n solved exactly, but different chains are optimized indepently.\n Memory use is O(I \\* T \\* K^2). This mode performed the best in\n our testing.\n\n:Exact inference using a cliqued HMM (clique): the entire graph is\n solved exactly using a naive cliquing approach. Memory use is\n O(T \\* K^I). **This mode EATS memory** and is not recommended for \n K > 10.\n\nThere are also a few more \"experimental\" modes: \n\n:Independent chains (indep): the entire chain from each species is\n solved exactly, but there are no connections between different \n chains (this is how ChromHMM handles joint inference). \n Memory use is O(I \\* T \\* K^2).\n\n:Exact inference using the Graphical Models Toolkit (gmtk): exact\n inference using the Graphical Models Toolkit. If you have GMTK\n installed, this package provides an alternative to the clique \n mode. We found it to be a bit slower clique, but has much more \n reasonable memory usage. The easiest way to get GMTK might be to \n follow the documentation of \n `segway `_, which\n requires it to run.\n\n\nChanging the phylogeny\n**********************\nThe phyologeny connecting each species is specified in `treehmm/static.py` and\nis in the form of a python dictionary, each of whose keys are the child cell \ntype and whose values are the parent cell type. For example, the default \nphylogeny used for the ENCODE data specifies that ``H1hesc`` is the parent of \nall other cell types::\n\n phylogeny = {'H1hesc':'H1hesc', \n 'Huvec':'H1hesc',\n 'Hsmm':'H1hesc',\n 'Nhlf':'H1hesc',\n 'Gm12878':'H1hesc',\n 'K562':'H1hesc',\n 'Nhek':'H1hesc',\n 'Hmec':'H1hesc',\n 'Hepg2':'H1hesc'}\n\nA few things to note:\n\n- Cell types that specify themselves as their own parents (like ``H1hesc``\n above) are considered root nodes (they have no parents) and use a different\n set of parameters than cell types with parents.\n- While each cell type is allowed to have zero or as many children as you want,\n each cell type is only allowed to have a single parent. This is enforced\n already since dictionaries can't have duplicate keys.\n- Connecting the tree in a loop (s.t. there is no root node) violates a\n fundamental assumption in Bayesian networks (they are supposed to be\n directed, Acyclic Graphs). The code may run okay, but you will probably get\n incorrect results.\n- You can have multiple roots in a graph (e.g., two independent trees or one \n tree and a single, unrelated cell type). This might help in learning a global\n set of parameters (more data), but shouldn't affect the inference quality \n within each tree. Cell types without parents or children will behave exactly\n like standard hidden Markov models.\n- We have preliminary code that uses a separate transition parameterization for\n each parent:child relationship. While this mode should increase accuracy and\n may reveal some unique trends along each branch of your phylogeny, keep in \n mind that the number of parameters is greatly increased and can lead to \n overfitting. You may want to reduce the number of states in use (K). If\n you're interested in this mode, contact me/raise an issue on Github.\n- Internal \"hidden\" nodes are possible using the ``--mark_avail`` parameter\n (which in turn is allowing some marks/species to be missing). This mode \n has some weird side-effects when run on heterogeneous mark combinations and\n wasn't pursued further.\n\n\nRunning on SGE vs. locally\n**************************\nBy default, the inference stage in tree-hmm will try to submit jobs through the\nSGE grid. If it can't do so, it will fall back to running the jobs locally.\nHere are a few tips to make things run faster:\n\n- If you haven't split your data into chunks using the ``split`` subcommand,\n you should set the ``--chunksize`` to 1. That way, each chromosome will be\n handled by a different job.\n- For split data running on SGE, set the ``--chunksize`` fairly high (like \n 100, or even more if you have tens of thousands of chunks). This will start\n fewer SGE jobs that will each run longer and save you from being yelled at \n by your system administrators\n- If you're not using SGE, you should explicitly set ``--run_local`` since \n it will use a more efficient message passing algorithm (inter-process \n communication rather than writing parameters and results to disk).\n- If you are using SGE, be sure to clean up the (many, many) ``SGE_*`` files\n which serve as temporary messages outputs while the job is running.\n\n\nQuick Example\n*************\nTo infer K=18 states, but only do five M-step iterations, up to 10\nE-step iterations per M-step, use the product-of-chains approximation on\nthe entire converted and split set of files, iterating until the change\nin free energy is < 1e-5 in either the E-step or the M-step and running\nin parallel locally rather than on an SGE grid::\n\n tree-hmm infer \\\n 18 \\\n --max_iter 5 \\\n --max_E_iter 10 \\\n --approx poc \\\n --epsilon 1e-5 \\\n --epsilon_e 1e-5 \\\n --run_local \\\n \"observations.chr*.chunk*.npy\"\n\nAfter running this, you'll find a new directory ``infer_out/mf/TIMESTAMP/``.\n\nIn this directory, you'll find several png formatted files showing the\nfree energy trajectory across iterations, the parameters as they are\nlearned (plotted at each iteration) as well as the Q distributions\n(marginal probabilities of each node being in a particular state) and\nthe contributions of each job's inferred state to each parameter (i.e.,\nthe transition matrices alpha, beta, gamma, and theta as well as the\nemission matrix e).\n\n\nPost-processing\n---------------\nTo make any sense of the actual genomic segmentation, you'll need to\nconvert the marginal Q probabilities into BED-formatted files. If you\nused the ``split`` subcommand, you need to specify the\n``start_positions_split.pkl`` file generated by that command::\n\n tree-hmm q_to_bed infer_out/mf/TIMESTAMP/ start_positions_split.pkl\n\nIf you did not use the ``split``, you may use the original pkl file::\n\n tree-hmm q_to_bed infer_out/mf/TIMESTAMP/ start_positions.pkl\n\nThese commands will find and output the most likely state assignment in\neach bin for all species to a set of bed files \n``treehmm_states.{species}.state{k}.bed``.\n\nNote that this is the maximum a posteriori (MAP) assignment, NOT the\nmost likely joint configuration. ChromHMM also outputs the MAP, whereas\nSegway uses the most likely joint configuration or viterbi path. The\n``gtmk`` inference mode can find the most likely joint configuration,\nbut downstream tools are lacking at the moment. If you're interested in\nthis, please raise an issue on Github.\n\nYou can also get the full probability matrix (not just most likely\nstate) by specifying ``--save_probs``. This step relies on the\n``start_positions.pkl`` file generated during the ``split`` phase. You\nmay specify where that file is located via ``--start_positions``. If you\ndon't want to split your data beyond by-chromosome, I can modify this\nstep accordingly. Again, please raise an issue on Github if you're\ninterested.\n\nFinally, you may want to check out `pybedtools `_ or \n`Galaxy `_ to do downstream analysis.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/uci-cbcl/tree-hmm", "keywords": null, "license": "UNKNOWN", "maintainer": null, "maintainer_email": null, "name": "treehmm", "package_url": "https://pypi.org/project/treehmm/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/treehmm/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/uci-cbcl/tree-hmm" }, "release_url": "https://pypi.org/project/treehmm/0.1.0/", "requires_dist": null, "requires_python": null, "summary": "Variational Inference for tree-structured Hidden-Markov Models", "version": "0.1.0" }, "last_serial": 812176, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "769706a9af6ba52e325cf3356244bad9", "sha256": "c7cadc0a0c9eff80888872fe7cd93333f3281f4ee7c33af31bb2c19c7af2d713" }, "downloads": -1, "filename": "treehmm-0.1.0.tar.gz", "has_sig": false, "md5_digest": "769706a9af6ba52e325cf3356244bad9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 83258, "upload_time": "2013-07-09T22:36:01", "url": "https://files.pythonhosted.org/packages/19/89/b4e1a693e78c5343344980b732c9d85c32ea391737ed708fd0b5dddabe61/treehmm-0.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "769706a9af6ba52e325cf3356244bad9", "sha256": "c7cadc0a0c9eff80888872fe7cd93333f3281f4ee7c33af31bb2c19c7af2d713" }, "downloads": -1, "filename": "treehmm-0.1.0.tar.gz", "has_sig": false, "md5_digest": "769706a9af6ba52e325cf3356244bad9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 83258, "upload_time": "2013-07-09T22:36:01", "url": "https://files.pythonhosted.org/packages/19/89/b4e1a693e78c5343344980b732c9d85c32ea391737ed708fd0b5dddabe61/treehmm-0.1.0.tar.gz" } ] }