{ "info": { "author": "Chuan-Sheng Foo", "author_email": "csfoo@cs.stanford.edu", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: BSD License", "Operating System :: POSIX :: Linux", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Topic :: Scientific/Engineering" ], "description": "# genomelake\n[![CircleCI](https://circleci.com/gh/kundajelab/genomelake.svg?style=svg)](https://circleci.com/gh/kundajelab/genomelake)[![Coverage Status](https://coveralls.io/repos/github/kundajelab/genomelake/badge.svg)](https://coveralls.io/github/kundajelab/genomelake)\n\nEfficient random access to genomic data for deep learning models.\n\nSupports the following types of input data:\n\n- bigwig\n- DNA sequence\n\ngenomelake extracts signal from genomic inputs in provided BED intervals.\n\n## Requirements\n- python 2.7 or 3.5\n- bcolz\n- cython\n- numpy\n- pybedtools\n- pysam\n\n## Installation\nClone the repository and run:\n\n`python setup.py install`\n\n## Getting started: training a protein-DNA binding model\nExtract genome-wide sequence data into a genomelake data source:\n```python\nfrom genomelake.backend import extract_fasta_to_file\n\ngenome_fasta = \"/mnt/data/annotations/by_release/hg19.GRCh37/hg19.genome.fa\"\ngenome_data_directory = \"./hg19_data_directory\"\nextract_fasta_to_file(genome_fasta, genome_data_directory)\n```\n\nUsing a BED intervals file with labels, a genome data source, and genomelake's `ArrayExtractor`, generate input DNA sequences and labels:\n```python\nimport pybedtools\nfrom genomelake.extractors import ArrayExtractor\nimport numpy as np\n\ndef batch_iter(iterable, batch_size):\n it = iter(iterable)\n try:\n while True:\n values = []\n for n in range(batch_size):\n values += (next(it),)\n yield values\n except StopIteration:\n yield values\n\ndef generate_inputs_and_labels(intervals_file, data_source, batch_size=128):\n bt = pybedtools.BedTool(intervals_file)\n extractor = ArrayExtractor(data_source)\n intervals_generator = batch_iter(bt, batch_size)\n for intervals_batch in intervals_generator:\n \tinputs = extractor(intervals_batch)\n\tlabels = []\n\tfor interval in intervals_batch:\n\t labels.append(float(interval.name))\n labels = np.array(labels)\n yield inputs, labels\n```\n\nTrain a keras model of JUND binding to DNA using 101 base pair intervals and labels in `./examples/JUND.HepG2.chr22.101bp_intervals.tsv.gz`:\n```python\nfrom keras.models import Sequential\nfrom keras.layers import Conv1D, Flatten, Dense\n\nintervals_file = \"./examples/JUND.HepG2.chr22.101bp_intervals.tsv.gz\"\ninputs_labels_generator = generate_inputs_and_labels(intervals_file, genome_data_directory)\n\nmodel = Sequential()\nmodel.add(Conv1D(15, 25, input_shape=(101, 4)))\nmodel.add(Flatten())\nmodel.add(Dense(1, activation='sigmoid'))\n\nmodel.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\nmodel.fit_generator(inputs_labels_generator, steps_per_epoch=100)\n```\n\nHere is the expected result:\n```\n100/100 [==============================] - 7s - loss: 0.0584 - acc: 0.9905 \n```\n\n## License\ngenomelake is released under the BSD-3 license. See ``LICENSE`` for details.\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/kundajelab/genomelake", "keywords": "deeplearning neuralnets genomics", "license": "BSD-3", "maintainer": "", "maintainer_email": "", "name": "genomelake", "package_url": "https://pypi.org/project/genomelake/", "platform": "", "project_url": "https://pypi.org/project/genomelake/", "project_urls": { "Homepage": "https://github.com/kundajelab/genomelake" }, "release_url": "https://pypi.org/project/genomelake/0.1.4/", "requires_dist": null, "requires_python": "", "summary": "Simple and efficient random access to genomic data for deep learning models.", "version": "0.1.4" }, "last_serial": 3763519, "releases": { "0.1.1": [ { "comment_text": "", "digests": { "md5": "bed27099a06d968c8ad55775284d37a6", "sha256": "bd43c5feb0850cb645753235a7aa42672bbf019b2a71aef25ea69ebc59aac4c6" }, "downloads": -1, "filename": "genomelake-0.1.1.tar.gz", "has_sig": false, "md5_digest": "bed27099a06d968c8ad55775284d37a6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 85333, "upload_time": "2018-02-27T03:49:43", "url": "https://files.pythonhosted.org/packages/61/ff/e44bf090afa15196ab8b561fa006df736d03a9112614f613f097f5232b20/genomelake-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "5b01525fdb87e8b69bea4fb126259e03", "sha256": "26667a7e3b01cfb65952ab128569d2302102bea96f1a63f2f26415f51ace9be1" }, "downloads": -1, "filename": "genomelake-0.1.2.tar.gz", "has_sig": false, "md5_digest": "5b01525fdb87e8b69bea4fb126259e03", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 88130, "upload_time": "2018-02-28T03:01:00", "url": "https://files.pythonhosted.org/packages/7f/a4/3d4211b0cf4ef57bb2103b37cfe8d0a4484f77d45bf5969fc02d2521fc4a/genomelake-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "b36801371d9ce796095109cd1ee501b7", "sha256": "a4987f9212e122e8f6ded701bfbecc4d4c7e60efe6e999e71cc1f1f66b832b30" }, "downloads": -1, "filename": "genomelake-0.1.3.tar.gz", "has_sig": false, "md5_digest": "b36801371d9ce796095109cd1ee501b7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 88481, "upload_time": "2018-02-28T04:14:22", "url": "https://files.pythonhosted.org/packages/7d/2c/aadd37728e26d89dbbe47200d60d861b11e4416b707fa4ec16cf9e1f1141/genomelake-0.1.3.tar.gz" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "6ce6e2977502c489a7473cc7bc0c7b16", "sha256": "fcec11be1cce15f08de0df34916f946a9470ed911050f5ab7098a1279e892253" }, "downloads": -1, "filename": "genomelake-0.1.4.tar.gz", "has_sig": false, "md5_digest": "6ce6e2977502c489a7473cc7bc0c7b16", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 88469, "upload_time": "2018-04-14T00:27:20", "url": "https://files.pythonhosted.org/packages/07/f9/bfefaffe5478615f126704c385d0f38383d9dd912a65fb090eced83ba01e/genomelake-0.1.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "6ce6e2977502c489a7473cc7bc0c7b16", "sha256": "fcec11be1cce15f08de0df34916f946a9470ed911050f5ab7098a1279e892253" }, "downloads": -1, "filename": "genomelake-0.1.4.tar.gz", "has_sig": false, "md5_digest": "6ce6e2977502c489a7473cc7bc0c7b16", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 88469, "upload_time": "2018-04-14T00:27:20", "url": "https://files.pythonhosted.org/packages/07/f9/bfefaffe5478615f126704c385d0f38383d9dd912a65fb090eced83ba01e/genomelake-0.1.4.tar.gz" } ] }