{ "info": { "author": "Luca Pinello", "author_email": "lpinello@jimmy.harvard.edu", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Console", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: BSD License", "Operating System :: MacOS :: MacOS X", "Operating System :: POSIX", "Programming Language :: Python", "Topic :: Scientific/Engineering :: Bio-Informatics" ], "description": "HAYSTACK\r\n========\r\nEpigenetic Variability and Motif Analysis Pipeline \r\n--------------------------------------------------\r\n\r\nSummary\r\n-------\r\nHaystack is a suite of computational tools to study \r\nepigenetic variability, cross-cell-type plasticity of chromatin states and transcription factors (TFs) motifs providing mechanistic insights into chromatin structure, cellular identity and gene regulation. \r\n\r\nHaystack identifies highly variable regions across different cell types also called _hotspots_, and the potential regulators that mediate the cell-type specific variation through integration of multiple data-types. \r\n\r\nHaystack can be used with histone modifications data, DNase I hypersensitive sites data and methylation data obtained for example by ChIP-seq, DNase-Seq and Bisulfite-seq assays and measured across multiple cell-types. In addition, it is also possible to integrate gene expression data obtained from array based or RNA-seq approaches.\r\n\r\nIn particular, Haystack highlights enriched TF motifs in variable and cell-type specific regions and quantifies their activity and specificity on nearby genes if gene expression data are available.\r\n\r\nA summary of the pipeline and an example on H3k27ac data is shown in the following figure:\r\n\r\n![Haystack Pipeline](http://bcb.dfci.harvard.edu/~lpinello/HAYSTACK/Final_figure.png)\r\n\r\n\r\n**(A)** Haystack overview: modules and corresponding functions. 
**(B)** Hotspot analysis on H3k27ac: signal tracks, variability track and the hotspots of variability are computed from the ChIP-seq aligned data; in addition, the regions specific for a given cell type are extracted. **(C)** Motif analysis on the regions specific for the H1hesc cell line: Pou5f1::Sox2 is significant; p-value and q-value, motif logo and average profile are calculated. **(D)** Transcription factor activity for Sox2 in H1hesc (star) compared to the other cell types (circles); x-axis: specificity of Sox2 expression (z-score), y-axis: effect (z-score) on the genes near the regions containing the Sox2 motif. \r\n\r\nHaystack was designed to be highly modular. The whole pipeline can be called using the _haystack_pipeline_ command, or alternatively the different modules can be used and combined independently. For example, it is possible to run only the motif analysis by calling the _haystack_motifs_ module on a given set of genomic regions. A description of each module is given in the **_How to use HAYSTACK_** section.\r\n\r\nInstallation and Requirements\r\n-----------------------------\r\nTo install HAYSTACK, some dependencies must be installed before running the setup:\r\n\r\n1) Python 2.7 Anaconda: http://continuum.io/downloads\r\n\r\n2) Java: http://java.com/download\r\n\r\n3) C compiler / make. 
For Mac with OSX 10.7 or greater, open the terminal app and run the command 'make', which will trigger the installation of the OSX developer tools. Windows systems are not officially supported.\r\n\r\nAfter checking that the required software is installed, you can install Haystack from the official Python repository by following these steps:\r\n\r\n1) Open a terminal window\r\n\r\n2) Type the command:\r\n\t\r\n\tpip install haystack_bio --no-use-wheel --verbose\r\n\r\n\r\nAlternatively, if you want to install the package without the pip utility:\r\n\r\n1) Download the setup file: \r\n   https://github.com/lucapinello/HAYSTACK/archive/master.zip\r\n   or download this one if you want the human and mouse genomes (hg19 and mm9) preloaded: \r\n   http://bcb.dfci.harvard.edu/~lpinello/HAYSTACK/haystack_setup_with_genomes.zip\r\n \r\n2) Decompress the file; you will get a folder called: Haystack-master\r\n\r\n3) Open a terminal window and go to the folder where you have decompressed the zip file, for example:\r\n\r\n\r\n    cd ~/Downloads\r\n    cd Haystack-master\r\n\r\n4) Type the installation command: \r\n\r\n    python setup.py install\r\n\r\n**IMPORTANT**: The setup will automatically create a folder in your HOME folder called *HAYASTACK\\_dependencies*, and will put all the required dependencies there. 
__If this folder is deleted, HAYSTACK will not work!__\r\n\r\nIf you want to put the folder in a different location, you need to set the environment variable: \r\n\r\n    HAYSTACK_DEPENDENCIES_FOLDER \r\n\r\nFor example, to put the folder in /home/lpinello/other_stuff you can type in the terminal **BEFORE** the installation:\r\n\r\n    export HAYSTACK_DEPENDENCIES_FOLDER=/home/lpinello/other_stuff\r\n\r\nDocker Image\r\n------------\r\nIf you like Docker, we also provide a Docker image:\r\n\t\r\n\thttps://hub.docker.com/r/lucapinello/haystack_bio/\r\n\r\nTo use the image, first install Docker: http://docker.com\r\n\r\nThen type the command:\r\n\r\n\tdocker pull lucapinello/haystack_bio\r\n\r\nFor an example of how to run Haystack with a Docker image, see the section **Testing HAYSTACK** below. __If you get memory errors, try allocating at least 8GB to the Docker image in order to run Haystack__.\r\n\r\nThe current version is compatible only with Unix-like operating systems on 64-bit architectures and was tested on:\r\n- CentOS 6.5\r\n- Debian 6.0\r\n- Ubuntu 12.04 and 14.04 LTS\r\n- OSX Mavericks and Mountain Lion\r\n\r\n \r\nOperating System Notes\r\n----------------------\r\n**UBUNTU (tested on 14.04 LTS) in the Amazon Web Services (AWS) Cloud**\r\n\r\n1. Launch and connect to the Amazon instance you have chosen from the AWS console (an m3.large is suggested) or to your Ubuntu machine.\r\n\r\n2. Create a swap file (**this step is only for the AWS cloud**)\r\n   ```\r\n   sudo dd if=/dev/zero of=/mnt/swapfile bs=1M count=20096\r\n   sudo chown root:root /mnt/swapfile\r\n   sudo chmod 600 /mnt/swapfile\r\n   sudo mkswap /mnt/swapfile\r\n   sudo swapon /mnt/swapfile\r\n   sudo sh -c \"echo '/mnt/swapfile swap swap defaults 0 0' >> /etc/fstab\"\r\n   sudo swapon -a\r\n   ```\r\n3. 
Install dependencies\r\n   ```\r\n   sudo apt-get update && sudo apt-get install git wget default-jre python-setuptools python-pip python-dev python-numpy python-scipy python-matplotlib python-pandas python-imaging unzip ghostscript make gcc g++ zlib1g-dev zlib1g -y \r\n   ```\r\n \r\n   __If you are installing in a Docker image, you don't need sudo before the apt-get commands__\r\n\r\n4. Install Haystack \r\n   ```\r\n   sudo pip install haystack_bio --no-use-wheel --verbose\r\n   ```\r\n \r\n5. Download and run the test dataset\r\n   ```\r\n   wget http://bcb.dfci.harvard.edu/~lpinello/HAYSTACK/haystack_test_dataset_h3k27ac.tar.gz\r\n   tar xvzf haystack_test_dataset_h3k27ac.tar.gz\r\n   cd TEST_DATASET\r\n   haystack_pipeline samples_names.txt hg19\r\n   ```\r\n \r\n   All the results will be stored in the folder HAYSTACK_PIPELINE_RESULT\t\r\n\r\n**Apple OSX**\r\n\r\nTo install HAYSTACK on OSX you need the _Command Line Tools_ (usually shipped with Xcode). \r\nIf you don't have them, you can download them from here: \r\nhttps://developer.apple.com/downloads/index.action\r\n\r\nYou may need to create a free Apple developer account.\r\n\r\nTo generate the motif logo you need a recent version of XQuartz; download and install the dmg from here: http://xquartz.macosforge.org/landing/.\r\n\r\nUpdating from Yosemite may break the motif logo generation.\r\nIf you don't see the motif logo in the output of the haystack_motifs utility, please install the latest version of XQuartz: http://xquartz.macosforge.org/landing/.\r\n\r\nAlternatively, if you don't want to update XQuartz, you can fix the problem from the terminal by typing the following commands:\r\n```\r\nsudo ln -s /opt/X11 /usr/X11\r\nsudo ln -s /opt/X11 /usr/X11R6\r\n```\r\n\r\nIn addition, you need to install Java for OSX.\r\n\r\nNote: If you install HAYSTACK in a custom folder, please make sure to select a path without spaces.\r\n\r\n\r\nPrecomputed Analysis\r\n--------------------\r\n\r\nWe have 
run Haystack on several ENCODE datasets for which you can download the precomputed results (variability tracks, hotspots, specific regions, enriched motifs and activity planes):\r\n\r\n1. Analysis on 12 ChIP-seq tracks of H3k27ac in human cell lines + gene expression: http://bcb.dfci.harvard.edu/~lpinello/HAYSTACK/HAYSTACK_H3k27ac.tar.gz\r\n2. Analysis on 17 DNase-seq tracks in human cell lines + gene expression: (Gain) http://bcb.dfci.harvard.edu/~lpinello/HAYSTACK/HAYSTACK_DNASE.tar.gz and (Loss) http://bcb.dfci.harvard.edu/~lpinello/HAYSTACK/HAYSTACK_DNASE_DEPLETED.tar.gz\r\n3. Analysis on 10 RRBS-seq tracks of DNA-Methylation in human cell lines + gene expression: http://bcb.dfci.harvard.edu/~lpinello/HAYSTACK/HAYSTACK_Methylation.tar.gz\r\n4. Analysis on 17 ChIP-seq tracks of H3k27me3 in human cell lines + gene expression: http://bcb.dfci.harvard.edu/~lpinello/HAYSTACK/HAYSTACK_H3k27me3.tar.gz\r\n\r\n\r\nHow to use HAYSTACK\r\n-------------------\r\nHAYSTACK consists of 5 modules:\r\n\r\n1) **haystack_hotspots**: finds the regions that are variable across different ChIP-seq, DNase-seq or Bisulfite-seq tracks (only processed BigWig files are supported for methylation data). The input is a folder containing bam files (with PCR duplicates removed) or bigwig files (the extension must be .bw), or a tab delimited text file with two columns containing: 1. the sample name and 2. the path of the corresponding .bam/.bw file. For example, you can write inside a file called _samples_names_hotspot.txt_ something like this:\r\n```\r\nK562\t./INPUT_DATA/K562H3k27ac_sorted_rmdup.bam\t\r\nGM12878\t./INPUT_DATA/Gm12878H3k27ac_sorted_rmdup.bam\t\r\nHEPG2\t./INPUT_DATA/Hepg2H3k27ac_sorted_rmdup.bam\t\r\nH1hesc\t./INPUT_DATA/H1hescH3k27ac_sorted_rmdup.bam\t\r\nHSMM\t./INPUT_DATA/HsmmH3k27ac_sorted_rmdup.bam\t\r\nNHLF\t./INPUT_DATA/NhlfH3k27ac_sorted_rmdup.bam\r\n```\r\nThe output will consist of:\r\n- The normalized bigwig files for each track\r\n- The hotspots i.e. 
the regions that are most variable\r\n- The regions that are variable and specific for each track, meaning that the signal is more enriched in a particular track compared to the rest.\r\n- A session file (.xml) for the IGV software (http://www.broadinstitute.org/igv/) from the Broad Institute to easily visualize all the tracks produced, the hotspots and the specific regions for each cell line. To load it, just drag and drop the file _OPEN_ME_WITH_IGV.xml_ from the output folder on top of the IGV window, or alternatively load it in IGV with File -> Open Session... If you have trouble opening the file, please update your IGV version. Additionally, please don't move the .xml file alone; you need all the files in the output folder to correctly load the session.\r\n\r\n**_Examples_**\r\nSuppose you have a folder called /users/luca/mybamfolder; you can run the variability analysis with: \r\n\t\r\n\thaystack_hotspots /users/luca/mybamfolder hg19\r\n\r\nIf you have instead a file with the samples description, like the _samples_names_hotspot.txt_ above, you can run the variability analysis with: \r\n\t\r\n\thaystack_hotspots samples_names_hotspot.txt hg19\r\n \t\r\n2) **haystack_motifs**: finds enriched transcription factor motifs in a given set of genomic regions.\r\n\r\nThe input is a set of regions in .bed format (http://genome.ucsc.edu/FAQ/FAQformat.html#format1) and the reference genome. The output consists of an HTML report with:\r\n- enriched motifs with p and q values\r\n- motif profiles and logos\r\n- list of regions with a particular motif and coordinates of the motif in those regions\r\n- list of closest genes to the regions with a particular motif \r\n\r\n**_Examples_**\r\nTo analyze the bed file _myregions.bed_ on the _hg19_ genome run:\r\n\t\r\n\thaystack_motifs myregions.bed hg19\r\n\r\nTo specify a custom background file for the analysis, for example _mybackgroundregions.bed_ run:\r\n\t\r\n\thaystack_motifs myregions.bed hg19 --bed_bg_filename 
mybackgroundregions.bed\r\n\r\nTo use a particular motif database (the default is JASPAR) use:\r\n\t\r\n\thaystack_motifs myregions.bed hg19 --meme_motifs_filename my_database.meme\r\n\r\nThe database file must be in the MEME format: http://meme.nbcr.net/meme/doc/meme-format.html#min_format\r\n\r\n3) **haystack_tf_activity_plane**: quantifies the specificity and the activity of the TFs highlighted by **haystack_motifs**, integrating gene expression data.\r\n\r\nThe input consists of 1. the output folder of the **haystack_motifs** tool, 2. a set of files containing gene expression data, specified in a tab delimited file and 3. the target cell-type name to use to perform the analysis. Each gene expression data file must be a tab delimited text file with two columns: 1. gene symbol 2. gene expression value. Such a file (one for each cell-type profiled) should look like this:\r\n```\r\nRNF14\t7.408579\r\nUBE2Q1\t9.107306\r\nUBE2Q2\t7.847002\r\nRNF10\t9.500193\r\nRNF11\t7.545264\r\nLRRC31\t3.477048\r\nRNF13\t7.670409\r\nCBX4\t7.070998\r\nREM1\t6.148991\r\nREM2\t5.957589\r\n.\r\n.\r\n.\r\n```\r\nThe file that describes the samples, for example a file called _sample_names_tf_activity.txt_, should contain something like this:\r\n```\r\nK562\t./INPUT_DATA/K562_genes.txt\r\nGM12878\t./INPUT_DATA/GM12878_genes.txt\r\nHEPG2\t./INPUT_DATA/HEPG2_genes.txt\r\nH1hesc\t./INPUT_DATA/h1hesc_genes.txt\r\nHSMM\t./INPUT_DATA/HSMM_genes.txt\r\nNHLF\t./INPUT_DATA/NHLF_genes.txt\r\n```\r\n\r\nThe output is a set of figures, each containing the TF activity plane for a given motif.\r\n\r\n**_Example_**\r\nSuppose the utility **haystack_motifs** created the folder called _HAYSTACK_MOTIFS_on_K562/_ analyzing the cell type named K562, and you have written the _sample_names_tf_activity.txt_ as above; you can run the TF activity analysis with:\r\n\r\n\thaystack_tf_activity_plane HAYSTACK_MOTIFS_on_K562/ sample_names_tf_activity.txt K562\r\n\r\n4) **haystack_pipeline**: executes the whole pipeline 
automatically, i.e. 1) and 2) and optionally 3) (if gene expression files are provided), finding hotspots, specific regions and motifs, and quantifying their activity on nearby genes.\r\n\r\nThe input is a tab delimited text file with two or three columns containing 1. the sample name 2. the path of the corresponding bam file 3. the path of the gene expression file with the same format described in 3). Note that this last column is optional. \r\n\r\nFor example, you can have a file called _samples_names.txt_ with something like this:\r\n```\r\nK562\t./INPUT_DATA/K562H3k27ac_sorted_rmdup.bam\t./INPUT_DATA/K562_genes.txt\r\nGM12878\t./INPUT_DATA/Gm12878H3k27ac_sorted_rmdup.bam\t./INPUT_DATA/GM12878_genes.txt\r\nHEPG2\t./INPUT_DATA/Hepg2H3k27ac_sorted_rmdup.bam\t./INPUT_DATA/HEPG2_genes.txt\r\nH1hesc\t./INPUT_DATA/H1hescH3k27ac_sorted_rmdup.bam\t./INPUT_DATA/h1hesc_genes.txt\r\nHSMM\t./INPUT_DATA/HsmmH3k27ac_sorted_rmdup.bam\t./INPUT_DATA/HSMM_genes.txt\r\nNHLF\t./INPUT_DATA/NhlfH3k27ac_sorted_rmdup.bam\t./INPUT_DATA/NHLF_genes.txt\r\n```\r\n\r\nAlternatively, you can specify a folder containing .bam files (with PCR duplicates removed) or .bw files (bigwig format).\r\n\r\n**_Examples_**\r\nSuppose you have a folder called /users/luca/mybamfolder; you can run the command with: \r\n\r\n\thaystack_pipeline /users/luca/mybamfolder hg19 \r\n\r\nNote: In this case the pipeline runs 1) and 2), but not 3), since no gene expression data are provided.\r\n\r\nIf you have instead a file with the samples description containing .bam or .bw filenames (note: it is not possible to mix .bam and .bw) and gene expression data, like the _samples_names.txt_ described above, you can run the whole pipeline with: \r\n\t\r\n\thaystack_pipeline samples_names.txt hg19\r\n\r\n5) **download_genome**: downloads a reference genome from UCSC and adds it to Haystack in the appropriate format. 
To download a genome run: \r\n\t\r\n\tdownload_genome genome_name \r\n\r\n**_Example_**\r\nTo download the human genome assembly hg19 run: \r\n\t\r\n\tdownload_genome hg19\r\n\r\nNote: You probably don't need to call this command explicitly, since it is called when the other commands need to download a particular assembly.\r\n\r\nYou can get more details about all the parameters of each of these 5 commands using the -h or --help flag, which prints a description.\r\n\r\n\r\nTesting HAYSTACK\r\n----------------\r\n\r\nTo test the whole pipeline you can download this set of bam files from the ENCODE project:\r\nhttp://bcb.dfci.harvard.edu/~lpinello/HAYSTACK/haystack_test_dataset_h3k27ac.tar.gz\r\n\r\nDecompress the file with the following command: \r\n\t\r\n\ttar xvzf haystack_test_dataset_h3k27ac.tar.gz\r\n\t\r\nGo into the folder with the test data:\r\n\r\n\tcd TEST_DATASET\r\n\r\nThen run the haystack_pipeline command using the provided samples_names.txt file:\r\n\r\n\thaystack_pipeline samples_names.txt hg19\r\n\t\r\nIf you use the Docker image instead, run the following command:\r\n\r\n\tdocker run -v ${PWD}:/DATA -w /DATA -i lucapinello/haystack_bio haystack_pipeline samples_names.txt hg19 \r\n\r\nIf you run Docker on Windows you have to specify the full path:\r\n\r\n\tdocker run -v //c/Users/Luca/Downloads/TEST_DATASET:/DATA -w /DATA -i lucapinello/haystack_bio haystack_pipeline samples_names.txt hg19 \r\n\r\nThis will recreate the panels and the plots shown in the figure in the summary, plus other panels and plots for all the other cell-types contained in the test dataset.\r\n\r\nCitation\r\n--------\r\n*Please cite the following article if you use HAYSTACK in your research*:\r\n * Luca Pinello, Jian Xu, Stuart H. Orkin, and Guo-Cheng Yuan. 
Analysis of chromatin-state plasticity identifies cell-type specific regulators of H3K27me3 patterns PNAS 2014; published ahead of print January 6, 2014, doi:10.1073/pnas.1322570111\r\n\r\nContacts\r\n--------\r\nPlease send any comment or bug to lpinello AT jimmy DOT harvard DOT edu \r\n\r\nThird part software included and used in this distribution\r\n-----------------------------------------------------------\r\n1. PeakAnnotator: http://www.ebi.ac.uk/research/bertone/software\r\n2. FIMO from the MEME suite (4.9.1): http://meme.nbcr.net/meme/\r\n3. WebLogo: http://weblogo.berkeley.edu/logo.cgi\r\n4. Samtools (0.1.19): http://samtools.sourceforge.net/\r\n5. Bedtools (2.20.1): https://github.com/arq5x/bedtools2\r\n6. bedGraphToBigWig and bigWigAverageOverBed from the UCSC Kent's Utilities: http://hgdownload.cse.ucsc.edu/admin/jksrc.zip", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/lucapinello/Haystack", "keywords": "", "license": "UNKNOWN", "maintainer": "", "maintainer_email": "", "name": "haystack_bio", "package_url": "https://pypi.org/project/haystack_bio/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/haystack_bio/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://github.com/lucapinello/Haystack" }, "release_url": "https://pypi.org/project/haystack_bio/0.4.0/", "requires_dist": null, "requires_python": null, "summary": "Epigenetic Variability and Transcription Factor Motifs Analysis Pipeline", "version": "0.4.0" }, "last_serial": 2089595, "releases": { "0.3.9": [ { "comment_text": "", "digests": { "md5": "eb28448185015ab6e156a0df55179e88", "sha256": "da1997680c234a2fa319b02408025d0a5db15e90c21bf2fd98ce5fc91faf5369" }, "downloads": -1, "filename": "haystack_bio-0.3.9.tar.gz", "has_sig": false, "md5_digest": "eb28448185015ab6e156a0df55179e88", "packagetype": "sdist", "python_version": "source", 
"requires_python": null, "size": 20464508, "upload_time": "2016-04-28T13:49:05", "url": "https://files.pythonhosted.org/packages/7e/99/801eeefe20bbc177c06eabf7b65975453a3e0fd048c112b8b5cdfad3ded5/haystack_bio-0.3.9.tar.gz" } ], "0.4.0": [ { "comment_text": "", "digests": { "md5": "901a2cdbab4f7061a9afbd675bc26662", "sha256": "2dd668bddcca6b544fe9eb82e0b65db9943b16a7c4d7509da5b631a4b06035f8" }, "downloads": -1, "filename": "haystack_bio-0.4.0.tar.gz", "has_sig": false, "md5_digest": "901a2cdbab4f7061a9afbd675bc26662", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13492583, "upload_time": "2016-04-28T21:48:34", "url": "https://files.pythonhosted.org/packages/f0/4c/d7b1da325b52c780d72f01d578ccf93c8278ece0ce0e2f05c4ea4be6060e/haystack_bio-0.4.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "901a2cdbab4f7061a9afbd675bc26662", "sha256": "2dd668bddcca6b544fe9eb82e0b65db9943b16a7c4d7509da5b631a4b06035f8" }, "downloads": -1, "filename": "haystack_bio-0.4.0.tar.gz", "has_sig": false, "md5_digest": "901a2cdbab4f7061a9afbd675bc26662", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13492583, "upload_time": "2016-04-28T21:48:34", "url": "https://files.pythonhosted.org/packages/f0/4c/d7b1da325b52c780d72f01d578ccf93c8278ece0ce0e2f05c4ea4be6060e/haystack_bio-0.4.0.tar.gz" } ] }