{ "info": { "author": "Byron Ruth", "author_email": "b@devel.io", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Framework :: Django", "Intended Audience :: Developers", "Intended Audience :: Healthcare Industry", "Intended Audience :: Information Technology", "Intended Audience :: Science/Research", "License :: OSI Approved :: BSD License", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Topic :: Internet :: WWW/HTTP" ], "description": "# Varify Data Warehouse Development Guide\n\n[![Build Status](https://travis-ci.org/cbmi/varify-data-warehouse.png?branch=master)](https://travis-ci.org/cbmi/varify-data-warehouse) [![Coverage Status](https://coveralls.io/repos/cbmi/varify-data-warehouse/badge.png)](https://coveralls.io/r/cbmi/varify-data-warehouse)\n\n## Need some help?\nJoin our chat room and speak with our dev team: http://www.hipchat.com/gZcKr0p3y\n\n## Dependencies\n\nListed are the download links to each dependency, however most OSes have a\npackage manager or binaries that can be easily installed. Most of the below\nlinks describe alternate download and install methods.\n\nOn Mac OS X, [Homebrew](http://mxcl.github.com/homebrew/) is the recommended\nway to install most of these libraries.\n\n- [Python 2.6+](http://python.org/download/releases/2.6.9/)\n- [Redis 2.6+](http://redis.io/download)\n- [PostgreSQL 9.2+](http://www.postgresql.org/download/)\n- [Memcached](http://memcached.org)\n\n## Setup & Install\n\nDistribute, Pip and virtualenv are required. To check if you have them:\n\n```bash\nwhich pip easy_install virtualenv\n```\n\nIf nothing prints out, install the libraries corresponding to the commands\nbelow:\n\n_Watch out for sudo! The root user `$PATH` most likely does not include\n`/usr/local/bin`. If you did not install Python through your distro's package\nmanager, use the absolute path to the new Python binary to prevent installing\nthe above libraries with the wrong version (like Python 2.4 on CentOS 5),\ne.g. `/usr/local/bin/python2.7`._\n\n```bash\ncurl http://python-distribute.org/distribute_setup.py | python\ncurl https://raw.github.com/pypa/pip/master/contrib/get-pip.py | python\npip install virtualenv\n```\n\nCreate your virtualenv:\n\n```bash\nvirtualenv vdw-env\ncd vdw-env\n. bin/activate\n```\n\nClone the repo:\n\n```bash\ngit clone https://github.com/cbmi/varify-data-warehouse.git\ncd varify-data-warehouse\n```\n\nInstall the requirements:\n\n```bash\npip install -r requirements.txt\n```\n\nUnder Mac OS X 10.8 or later, with XCode 5.1 or later, the following may be necessary in order for pip to install requirements:\n\n```bash\nexport CFLAGS=-Qunused-arguments\n```\n\n[Start the postgres server](http://www.postgresql.org/docs/9.2/static/server-start.html). This *may* look something like:\n```\ninitdb /usr/local/var/postgres -E utf8\n\npg_ctl -D /usr/local/var/postgres -l /usr/local/var/postgres/server.log start\n```\n\nCreate the varify database, you might first want to make sure you are a user\n```\ncreateuser --user postgres -s -r yourusername\ncreatedb varify\n\n```\n\nStart memcached\n```bash\nmemcached -d\n```\n\nStart redis\n```\nredis-server /usr/local/etc/redis.conf\n```\n\nIf you are on a Mac, you will need to start postfix to allow SMTP:\n```\nsudo postfix start\n```\n\nInitialize the Django and Varify schemas\n```\n./bin/manage.py syncdb\n./bin/manage.py migrate\n```\n\nThen either start the built-in Django server:\n\n```bash\n./bin/manage.py runserver\n```\n\nor run a `uwsgi` process:\n\n```bash\nuwsgi --ini server/uwsgi/local.ini --protocol http --socket 127.0.0.1:8000 --check-static _site\n```\n\n## Local Settings\n\n`local_settings.py` is intentionally not versioned (via `.gitignore`). It should\ncontain any environment-specific settings and/or sensitive settings such as\npasswords, the `SECRET_KEY` and other information that should not be in version\ncontrol. Defining `local_settings.py` is not mandatory but will warn if it does\nnot exist.\n\n## Pipeline\n\nThe following describes the steps to execute the loading pipeline, the performance of the pipeline, and the process behind it.\n\n\n#### NOTE: All VCF files being loaded into Varify must be annotated with the [CBMi fork of SnpEff](https://github.com/CBMi-BiG/snpEff). The key difference is that the CBMi fork attempts to generate valid HGVS for insertions and deletions, including those which require \"walking and rolling\" to identify the correct indel frame while the standard SnpEff version only contains a partial implementation of HGVS notation [as noted here](http://snpeff.sourceforge.net/SnpEff_manual.html#filters).\n\n### Retrieving Test Data\n\nWe have provided a [set of test data](https://github.com/cbmi/varify-demo-data) to use to test the load pipeline or use as sample data when first standing up your Varify instance. To use the test data, run the commands below.\n\n```bash\nwget https://github.com/cbmi/varify-demo-data/archive/0.1.tar.gz -O varify-demo-data-0.1.tar.gz\ntar -zxf varify-demo-data-0.1.tar.gz\ngunzip varify-demo-data-0.1/CEU.trio.2010_03.genotypes.annotated.vcf.gz\n```\n\nAt this point, the VCF and MANIFEST in the `varify-demo-data-0.1` directory are ready for loading in the pipeline. You can use the `varify-demo-data-0.1` directory as the argument to the `samples queue` command in the _Queue Samples_ step below if you want to just load this test data.\n\n#### Tmux (optional)\n\nSince the pipeline can take a while to load large collections(see Performance section below), you may want to consider following the Tmux steps to attach/detach to/from the load process.\n\n[Tmux](http://robots.thoughtbot.com/post/2641409235/a-tmux-crash-course) is like [screen](http://www.gnu.org/software/screen/), just newer. It is useful for detaching/reattaching sessions with long running processes.\n\n**New Session**\n\n```bash\ntmux\n```\n\n**Existing Session**\n\n```bash\ntmux attach -t 0 # first session\n```\n\n### Activate Environment\n\n```bash\nsource bin/activate\n```\n\n#### Define RQ_QUEUES\n\nFor this example, we will assume you have `redis-server` running on `localhost:6379` against the database with index 0. If you have redis running elsewhere simply update the settings below with the address info and DB you wish to use. Open your `local_settings.py` file and add the following setting:\n\n```python\nRQ_QUEUES = {\n 'default': {\n 'HOST': 'localhost',\n 'PORT': 6379,\n 'DB': 0,\n },\n 'samples': {\n 'HOST': 'localhost',\n 'PORT': 6379,\n 'DB': 0,\n },\n 'variants': {\n 'HOST': 'localhost',\n 'PORT': 6379,\n 'DB': 0,\n },\n}\n```\n\n#### Queue Samples\n\nOptionally specify a directory, otherwise it will recursively scan all directories defined in the `VARIFY_SAMPLE_DIRS` setting in the Varify project.\n\n```bash\n./bin/manage.py samples queue [directory]\n```\n\n#### Kick Off Workers\n\nYou can technically start as many of each type for loading data in parallel, but this may cause undesired database contention which could actually slow down the loading process. A single worker for `variants` is generally preferred and two or three are suitable for the `default` type.\n\n```bash\n./bin/manage.py rqworker variants &\n./bin/manage.py rqworker default &\n```\n\nNote, these workers will run forever, if there is only a single sample being loaded, the `--burst` argument can be used to terminate the worker when there are no more items left in the queue.\n\n#### Monitor Workers\n\nYou can monitor the workers and the queues using the `rq-dashboard` or `rqinfo`. Information on setting up and using those services can be found [here](http://python-rq.org/docs/monitoring/).\n\n#### Post-Load\n\nAfter the batch of samples have been loaded, two more commands need to be executed to update the annotations and cohort frequencies. These are performed _post-load_ for performance reasons.\n\n```bash\n./bin/manage.py variants load --evs --1000g --sift --polyphen2 > variants.load.txt 2>&1 &\n./bin/manage.py samples allele-freqs > samples.allele-freqs.txt 2>&1 &\n```\n\n### Performance\n\n- File size: 610 MB\n- Variant count: 1,794,055\n\n#### Baseline\n\nIteration over flat file (no parsing) with batch counting (every 1000)\n\n- Time: 80 seconds\n- Memory: 0\n\n#### Baseline VCF\n\nIteration over VCF parsed file using PyVCF\n\n- Time: 41 minutes (extrapolated)\n- Memory: 246 KB\n\n### Parallelized Queue/Worker Process\n\n#### Summary of Workflow\n\n1. Fill Queue\n2. Spawn Worker(s)\n3. Consume Job(s)\n - Validate Input\n - (work)\n - Validate Output\n - Commit\n\n#### Parallelism Constraints\n\nThe COPY command is a single statement which means the data being loaded is\nall or nothing. If multiple samples are being loaded in parallel, it is likely\nthey will have overlapping variants.\n\nTo prevent integrity errors, workers will need to consult one or more\ncentralized caches to check if the current variant has been _addressed_\nalready. If this is the case, the variant will be skipped by the worker.\n\nThis incurs a second issue in that downstream jobs depend on the existence of\nsome data that does not yet exist because another worker has not yet committed\nits data. In this case, non-matches will be queued up in the `deferred` queue\nthat can be run at a later time, after the `default` queue is empty or in\nparallel with the `default` queue.\n\n", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/cbmi/varify-data-warehouse", "keywords": "vcf varify harvest orm genome", "license": "BSD", "maintainer": null, "maintainer_email": null, "name": "vdw", "package_url": "https://pypi.org/project/vdw/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/vdw/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://github.com/cbmi/varify-data-warehouse" }, "release_url": "https://pypi.org/project/vdw/0.4.1/", "requires_dist": null, "requires_python": null, "summary": "Models and loader for processing, storing, and querying genetic data", "version": "0.4.1" }, "last_serial": 1274281, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "5cfd95de968fbe3c48c3207dd4c20263", "sha256": "c97dc820e5894a0ce7f4b2948f5f4f1c40f7d827aba4f99149872275c6cac66a" }, "downloads": -1, "filename": "vdw-0.1.0.tar.gz", "has_sig": false, "md5_digest": "5cfd95de968fbe3c48c3207dd4c20263", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 102310, "upload_time": "2014-09-03T00:58:02", "url": "https://files.pythonhosted.org/packages/56/fc/562095500087b322d2c2bfba6f291d8766eff5a9fa9c2952262ee82de2f7/vdw-0.1.0.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "c7bade298050944b7b3214beb454009b", "sha256": "098c53d30568765d415294a247cacd58537d541660b382be5acf53051117bb11" }, "downloads": -1, "filename": "vdw-0.2.0.tar.gz", "has_sig": false, "md5_digest": "c7bade298050944b7b3214beb454009b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 102397, "upload_time": "2014-09-09T19:09:21", "url": "https://files.pythonhosted.org/packages/d5/aa/451c5b5f1f95cc92b84f1956ef9ef9131d9ce4112af4337108b3289bf757/vdw-0.2.0.tar.gz" } ], "0.3.0": [ { "comment_text": "", "digests": { "md5": "e8ca056211f123855405d0d11cd28b14", "sha256": "f068b2fd762f15b13508c3e20e3bf7b0a7aa6a183241fdb2e1b78c9ab9fb108b" }, "downloads": -1, "filename": "vdw-0.3.0.tar.gz", "has_sig": false, "md5_digest": "e8ca056211f123855405d0d11cd28b14", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 102385, "upload_time": "2014-09-26T00:42:45", "url": "https://files.pythonhosted.org/packages/01/49/f7dac535f2ac78353456fffe6e903917790f7934f758d128e93d494ada7e/vdw-0.3.0.tar.gz" } ], "0.4.0": [ { "comment_text": "", "digests": { "md5": "cd50a4e0c397aac594a6f2fea9769395", "sha256": "5a61e0e2d0c9b5c22d2d9b831ec9f690c50aeba87eb85897cb1b7942a055d80c" }, "downloads": -1, "filename": "vdw-0.4.0.tar.gz", "has_sig": false, "md5_digest": "cd50a4e0c397aac594a6f2fea9769395", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 114782, "upload_time": "2014-10-17T02:16:53", "url": "https://files.pythonhosted.org/packages/26/e0/31d4aed698ee4986f8672936a5299a4ca94cccad55966234ff12b1e3bf54/vdw-0.4.0.tar.gz" } ], "0.4.1": [ { "comment_text": "", "digests": { "md5": "c286c1fd8674dfc546dfd49b12c2d2da", "sha256": "126706bd68e874fa2a1c53ba08569e010fa2f458cdb772657ecdac9d26fd0d84" }, "downloads": -1, "filename": "vdw-0.4.1.tar.gz", "has_sig": false, "md5_digest": "c286c1fd8674dfc546dfd49b12c2d2da", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 115244, "upload_time": "2014-10-17T20:17:41", "url": "https://files.pythonhosted.org/packages/59/2f/dcf34bab64569603d38962069c8824522de027cdee2a17d559e612681933/vdw-0.4.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "c286c1fd8674dfc546dfd49b12c2d2da", "sha256": "126706bd68e874fa2a1c53ba08569e010fa2f458cdb772657ecdac9d26fd0d84" }, "downloads": -1, "filename": "vdw-0.4.1.tar.gz", "has_sig": false, "md5_digest": "c286c1fd8674dfc546dfd49b12c2d2da", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 115244, "upload_time": "2014-10-17T20:17:41", "url": "https://files.pythonhosted.org/packages/59/2f/dcf34bab64569603d38962069c8824522de027cdee2a17d559e612681933/vdw-0.4.1.tar.gz" } ] }