{ "info": { "author": "Ashley Tehranchi", "author_email": "mike.dacre@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Console", "Intended Audience :: End Users/Desktop", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Operating System :: MacOS :: MacOS X", "Operating System :: POSIX", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 3", "Topic :: Utilities" ], "description": "cisVar\n======\n\ncisVar is a pipeline to estimate pre-ChIP frequencies from pooled post-ChIP\nfrequencies via a regression on the genotypes of the individuals in the pool. To\nuse this code you must have a mapped BAM file for each pool (a single pool is\nfine) as well as genotype info for all individuals in the pool in VCF format.\nAdditional information can also be provided; see below for details.\n\nFor an overview of this method, or to cite this work, please see the following:\n\nTehranchi, Ashley K., Marsha Myrthil, Trevor Martin, Brian L. Hie, David Golan,\nand Hunter B. Fraser. \u201c`Pooled ChIP-Seq Links Variation in Transcription Factor\nBinding to Complex Disease Risk `_.\u201d\nCell 165, no. 3 (April 21, 2016): 730\u201341.\n\nThis code was written by Ashley Tehranchi with minor modifications by Mike\nDacre. It was produced at Stanford University, but is released here under the\nMIT license.\n\nCurrent version: 2.0.0b3\n\n\n.. 
contents:: **Contents**\n\n\nOverview\n--------\n\n``cisVar.py`` is the pipeline, written in Python 3; it uses ``regression_qtls.R``\n(``regression_qtls.R`` is called inside of ``cisVar.py`` so you will not run it\ndirectly).\n\nIn addition, the code at ``scripts/combine.py`` can be used to create combined\nand merged (per-SNP) pandas dataframes from the outputs of ``cisVar.py`` for\nmultiple samples.\n\nA complete example Snakemake pipeline is provided in ``pipeline``; you should be\nable to run it by modifying only the ``cisvar_config.json`` file (see the\nSnakemake section below).\n\nRight now the default minor allele frequency filter is ``0.1 < MAF < 0.99``. To\nchange this and other regression constants, edit the ``regression_qtls.R``\nscript.\n\nInstallation\n~~~~~~~~~~~~\n\nInstall via PyPI:\n\n.. code:: shell\n\n pip install cisVar\n\nOr install from GitHub:\n\n.. code:: shell\n\n pip install https://github.com/TheFraserLab/cisVar/tarball/master\n\nAlternatively, you can just clone the repo and use it directly from\nthere.\n\n.. 
code:: shell\n\n git clone https://github.com/TheFraserLab/cisVar.git\n\nIt should work with Python 2 or 3, but Python 3 is highly recommended.\n\nPrereqs\n.......\n\nRequires ``samtools v1.9`` and ``bedtools v2.26.0`` as well as the\nfollowing Python modules (installed automatically by pip):\n\n``pandas``, ``numpy``, ``psutil``, ``snakemake``, ``wget``\n\nAdditionally requires ``R`` with ``Rscript``, with the ``ggplot2``\npackage installed.\n\nUsage\n~~~~~\n\nTo run this code on your own data you need:\n\n- VCF(s) with genotypes and ref/alt alleles for every SNP you wish to\n test\n- A mapped BAM file; ideally this will make use of `Hornet\n `_ to remove reference bias.\n\nIn addition, the following data can be helpful:\n\n- A list of individuals to filter genotypes with (one newline separated file of\n individuals per pool)\n\n- A file with ref and alt alleles for your SNPs of interest, to modify those in\n the genotype VCF files (BED/VCF/txt)\n\n- A file to limit the SNPs to consider to a subset of those in the VCF\n (BED/VCF/txt)\n\nExample pipeline:\n\n.. code:: shell\n\n cisVar.py prep -F SampleName -i individuals.txt.gz --chrom-format chr /path/to/geno/*vcf.gz\n\n cisVar.py mpileup -F SampleName -f fastaFile -B sortedBam\n\n cisVar.py post -F SampleName -r readDepth -a allelesFile\n\n cisVar.py geno -F SampleName -r readDepth -i individualsFile -g genotypesFile\n\n cisVar.py qtls -F SampleName -r readDepth -n numberIndividuals\n\n cisVar.py tidy -F SampleName -r readDepth\n\n scripts/combine.py {sample}.readDepth.regression.pd\n\nA readDepth of 20 is generally optimal. The sample name can be whatever you\nwant, but should be unique per sample. 
``{sample}`` is a placeholder to allow\nany number of samples to be combined in the last step.\n\nSnakemake\n.........\n\nThe above pipeline can be automated with\n`Snakemake `_.\n\nTo use, install cisVar, navigate to the root of your project, and run\n``cisVar.py get_snake`` to copy the Snakefile and config file over. Then edit\nthe ``cisvar_config.json`` file to match your needs.\n\nYou will also need to edit the ``Snakefile`` to set the ``script_prep`` string\nto match what is needed by your system.\n\nThe following are the config options for that file:\n\n+-------------+-------------------------------------------------------+\n| Option | Description |\n+=============+=======================================================+\n| name | A general name for this run, file prefix will be |\n| | ``..`` |\n+-------------+-------------------------------------------------------+\n| sample_name | The name of the sample, default is population. Used |\n| | only in the combination of multiple samples. 
|\n+-------------+-------------------------------------------------------+\n| samples | A list of samples, can just be a list, or a |\n| | dictionary of {sample:group}, the 'group' in this |\n| | case allows the use of the same genotype files for |\n| | multiple samples, can also be a path to a separate |\n| | json file |\n+-------------+-------------------------------------------------------+\n| read_depth | An integer depth to require for each SNP to be |\n| | considered |\n+-------------+-------------------------------------------------------+\n| max_cores | Used only when parsing VCFs, if you have multiple VCF |\n| | files (e.g.\u00a0per chromosome), they will be parsed in |\n| | parallel up to this many cores (or max available on |\n| | machine) |\n+-------------+-------------------------------------------------------+\n| sort_vcfs | Either 1 or 0, if 1 assumes that VCF files contain a |\n| | ``chr#`` string in the file name, and sorts the order |\n| | of files to be chr1->22,X,Y,MT. Don't use if your |\n| | VCFs don't have ``chr#`` in the name |\n+-------------+-------------------------------------------------------+\n| chrom_format | 'chr', 'num', 'ignore': Force format of chromosome |\n| | name to be ``chr#`` or ``#``. This ensures that all |\n| | input files have the same format. Use ignore to do |\n| | nothing. |\n+-------------+-------------------------------------------------------+\n| bams | A path to the mapped BAM files, must contain the |\n| | ``{sample}`` string (unless you only have one bam), |\n| | e.g. ``/path/to/{sample}.sorted.bam``, ``{sample}`` |\n| | must be in samples |\n+-------------+-------------------------------------------------------+\n| cisVar | Path to the cisVar repository |\n+-------------+-------------------------------------------------------+\n| vcfs | Can be a single path (for one vcf), a list of vcfs, |\n| | or a glob string (e.g. 
|\n| | ``/path/to/vcfs/*.vcf.comm.gz``) |\n+-------------+-------------------------------------------------------+\n| genome_fa | Path to a FastA file of the genome you mapped to, |\n| | single file only. |\n+-------------+-------------------------------------------------------+\n| inds | Optional: used to filter VCFs so that the genotype |\n| | files contain only the individuals in the sample, |\n| | e.g. ``/path/to/inds/{sample}.ind.txt.gz``. Newline |\n| | separated file of individuals. |\n+-------------+-------------------------------------------------------+\n| locs | Optional: a BED/VCF/text file of SNP locations to |\n| | consider, used to limit the total to be a subset of |\n| | the genotype file. |\n+-------------+-------------------------------------------------------+\n| alleles | Optional: a BED/VCF/text file of alternate ref/alt |\n| | alleles. Must be a subset of the genotype VCFs. If |\n| | there is an entry in this file, its ref/alt alleles |\n| | will be used instead of those in the genotype file |\n+-------------+-------------------------------------------------------+\n\n\nNote that the last three files are optional. Also, if ``samples`` is a dict, then the\nvalue will be used in place of the sample. For example, if you have two samples\nfor the same population that are ``yri1`` and ``yri2``, but they both use the\nsame genotype file ``yri.geno.vcf``, you can make samples ``{'yri1': 'yri',\n'yri2': 'yri'}`` and then ``yri`` will be used to pick the ind, loc, and allele\nfiles.\n\nCluster\n'''''''\n\nTo run on a cluster, run ``cisVar.py get_snake`` with ``-x`` and edit the\n``cluster.json`` file to match your cluster environment, then run e.g.:\n\n.. code:: shell\n\n snakemake -j 100 --cluster-config cluster.json \\\n --cluster \"sbatch -n {threads} -t {params.time} --mem={resources.mem_mb} -p {cluster.queue} -o {cluster.out} -e {cluster.err}\" \\\n all\n\nor\n\n.. 
code:: shell\n\n snakemake -j 100 --cluster-config cluster.json \\\n --cluster \"qsub -l nodes=1:ppn={threads} -l walltime={params.time} -l mem={resources.mem_mb}MB -o {cluster.out} -e {cluster.err}\" \\\n all\n\nTo cap memory use, add the argument ``--resources\nmem_mb=32000``. Note that Snakemake applies this limit to the whole pipeline, not\nto each job.\n\nTo also combine files, replace ``all`` with ``combine`` at the end of the\ncommand.\n\nScript help\n-----------\n\nBelow are help options available on the command line for cisVar; all of these steps\nare run by the Snakemake pipeline above.\n\ncisVar.py\n~~~~~~~~~\n\n.. code::\n\n usage: cisVar.py [-h] {prep,mpileup,post,geno,qtls,tidy,get_snake} ...\n\n cisVar: Find cis QTLs based on an experimental selection method\n\n Ashley Tehranchi \n\n Stanford University\n\n Version: 2.0.0b1\n Created: 2015-12-12\n Updated: 2018-05-16\n\n Example usage:\n cisVar prep -F test_new -i individuals.txt.gz --chrom-format chr\n cisVar mpileup -F -f -B \n cisVar post -F -r \n cisVar geno -F -r -i \n cisVar qtls -F -r -n \n cisVar tidy -F -r -p out.dataframe.pandas -t out.dataframe.txt\n\n Note:\n The qtls regression step will use approximately 32GB of memory on an averaged-\n sized dataset.\n\n The geno step will use approximately 20GB of memory on the same dataset.\n\n positional arguments:\n {mpileup,post,geno,qtls}\n prep Prepare genotype files\n mpileup (m) Run mpileup\n post (p) Run POST frequency calculation\n geno (g) Munge genotypes to prepare for regression\n qtls (q, regression, r)\n Run the regression\n tidy (t) Tidy up regression, call open/clsoed\n\n optional arguments:\n -h, --help show this help message and exit\n\nprep\n....\n\nThis step converts VCFs into genotype and individual files that can be used by\nthe pipeline.\n\n.. 
code::\n\n usage: cisVar.py prep [-h] [-F PREFIX_NAME] [-r TRIAL_DEPTHS] [-i ALL_INDS]\n [-l LIMIT_FILE] [-a ALLELE_FILE]\n [--chrom-format {chr,num,ignore}] [--include-indels]\n [-c CORES]\n vcf_files [vcf_files ...]\n\n Prepare genotype files\n\n optional arguments:\n -h, --help show this help message and exit\n\n Run Options:\n -F PREFIX_NAME, --SampleName PREFIX_NAME\n sample/population name (default: cis_var)\n -r TRIAL_DEPTHS, --readDepth TRIAL_DEPTHS\n minimum read depth per variant (default: 20)\n\n Prep Options:\n -i ALL_INDS, --all-inds ALL_INDS\n File of individuals in all groups, one per line\n -l LIMIT_FILE, --limit-file LIMIT_FILE\n BED/VCF/txt file of SNPs to consider\n -a ALLELE_FILE, --allele-file ALLELE_FILE\n BED/VCF/txt file of alleles to override VCF allels\n (subset of vcf)\n --chrom-format {chr,num,ignore}\n chr: make format \"chr#\", num: make format \"#\", ignore:\n do nothing (default: ignore)\n --include-indels Do not skip indels\n -c CORES, --cores CORES\n Number of cores to use (default: all)\n vcf_files VCF files with genotypes\n\nmpileup\n.......\n\nThis is just a simple wrapper for samtools mpileup\n\n.. code::\n\n usage: cisVar.py mpileup [-h] [-F PREFIX_NAME] [-r TRIAL_DEPTHS] -f\n ALLCHRFASTA -B SORTEDBAM [-p MPILEUPBEDFILE]\n\n Run mpileup\n\n optional arguments:\n -h, --help show this help message and exit\n\n Run Options:\n -F PREFIX_NAME, --SampleName PREFIX_NAME\n sample/population name (default: cis_var)\n -r TRIAL_DEPTHS, --readDepth TRIAL_DEPTHS\n minimum read depth per variant (default: 20)\n\n mpileup Options:\n -f ALLCHRFASTA, --fasta ALLCHRFASTA\n fasta file with all chromosomes (Required)\n -B SORTEDBAM, --BAMfile SORTEDBAM\n sorted BAM file (Required)\n -p MPILEUPBEDFILE, --mpileupBEDfile MPILEUPBEDFILE\n BED to use instead of the BED generated in the prep\n phase (Do not use if possible, use prep with limit\n instead)\n\npost\n....\n\nThis step actually calculates the POST-frequencies for the data.\n\n.. 
code::\n\n usage: cisVar.py post [-h] [-F PREFIX_NAME] [-r TRIAL_DEPTHS] [-a GENOSFILE]\n\n Run POST frequency calculation\n\n optional arguments:\n -h, --help show this help message and exit\n\n Run Options:\n -F PREFIX_NAME, --SampleName PREFIX_NAME\n sample/population name (default: cis_var)\n -r TRIAL_DEPTHS, --readDepth TRIAL_DEPTHS\n minimum read depth per variant (default: 20)\n\n POST Options (Deprecated):\n -a GENOSFILE, --allelesFile GENOSFILE\n The genotypes file, (Optional, default is file created\n in prep)\n\ngeno\n....\n\nThis step converts the genotype file made in the prep step into a matrix that\ncan be used in the regression. It is important that this genotype file is\nperfectly sorted to match the outputs of the POST step.\n\n.. code::\n\n usage: cisVar.py geno [-h] [-F PREFIX_NAME] [-r TRIAL_DEPTHS] [-g GENOSFILE]\n [-i INDIVIDUALSLIST]\n\n Munge genotypes to prepare for regression\n\n optional arguments:\n -h, --help show this help message and exit\n\n Run Options:\n -F PREFIX_NAME, --SampleName PREFIX_NAME\n sample/population name (default: cis_var)\n -r TRIAL_DEPTHS, --readDepth TRIAL_DEPTHS\n minimum read depth per variant (default: 20)\n\n Genotype Options:\n -g GENOSFILE, --genoFile GENOSFILE\n The genotypes file, (Optional, default is file created\n in prep)\n -i INDIVIDUALSLIST, --individualsFile INDIVIDUALSLIST\n list of individuals matching genotype matrix; one indv\n per line\n\nqtls\n....\n\nThis is the actual regression step, it makes sure all the files are in the right\nplace and then calls ``regression_qtls.R`` to do the actual regression.\n\n.. 
code::\n\n usage: cisVar.py qtls [-h] [-F PREFIX_NAME] [-r TRIAL_DEPTHS] [-n NUMINDV]\n\n Run the regression\n\n optional arguments:\n -h, --help show this help message and exit\n\n Run Options:\n -F PREFIX_NAME, --SampleName PREFIX_NAME\n sample/population name (default: cis_var)\n -r TRIAL_DEPTHS, --readDepth TRIAL_DEPTHS\n minimum read depth per variant (default: 20)\n\n Regression Options:\n -n NUMINDV, --numberIndividuals NUMINDV\n The number of individuals in the pool (if omitted,\n calculated from genotype file length)\n\nThe regression produces z-scores and p-values, and additionally writes\ncoefficients and some simple summary plots in separate files.\n\ntidy\n....\n\nThis step calls the open/closed alleles and produces a final integrated file\nwith all available data as both a tab-delimited file and as a pandas dataframe.\n\n.. code::\n\n usage: cisVar.py tidy [-h] [-F PREFIX_NAME] [-r TRIAL_DEPTHS] [-b BEDFILE]\n [-t TEXTFILE] [-p PANDASFILE]\n\n Tidy up regression, call open/closed\n\n optional arguments:\n -h, --help show this help message and exit\n\n Run Options:\n -F PREFIX_NAME, --SampleName PREFIX_NAME\n sample/population name (default: cis_var)\n -r TRIAL_DEPTHS, --readDepth TRIAL_DEPTHS\n minimum read depth per variant (default: 20)\n\n inputs:\n -b BEDFILE, --bedfile BEDFILE\n BED file to extract rsIDs from (optional)\n\n outputs:\n -t TEXTFILE, --textfile TEXTFILE\n Parsed output\n -p PANDASFILE, --pandasfile PANDASFILE\n Parsed dataframe\n\nget_snake\n.........\n\nThis option just downloads the Snakefile and config files from this repo, for\neasy access when code is installed via pip.\n\n.. code::\n\n usage: cisVar.py get_snake [-h] [-x]\n\n Download Snakefile and config to current dir\n\n optional arguments:\n -h, --help show this help message and exit\n -x, --extra Get additional sample and cluster configs\n\ncombine.py\n~~~~~~~~~~\n\nThis script is separate and is in the ``scripts`` folder. 
It takes a search string\nas an input and produces both combined and merged DataFrames. The combined\ndataframe is just all dataframes combined in order with sample data added as a\ncolumn and to the index. The merged dataframe is a collapsed dataframe that has\none entry per SNP with p-values combined using Fisher's method and supporting\npopulation data. It also includes information on the level of support for the\nopen and closed calls.\n\nThe search string should match your prefix and depth from the main pipeline. For\nexample, if you used a name of 'cis_var' plus a sample name (the variable part)\nof e.g. CEU and YRI, and a read depth of 20, your search string would be:\n``cis_var.{sample}.20.regression.pd``.\n\nThe script will write ``cis_var.combined.20.regression.pd`` and\n``cis_var.merged.20.regression.pd``.\n\n.. code::\n\n usage: combine.py [-h] [-c COLUMN_NAME] [--no-merge] search_str\n\n Combine a bunch of cisVar pandas files by sample (e.g. population).\n\n Requires a search string such as prefix.{sample}.regression.pd.\n\n Writes\n ------\n prefix.combined.regression.pd\n A simple merger of all DataFrames\n prefix.merged.regression.pd\n A per-snp merger based on p-value\n\n positional arguments:\n search_str e.g. name.{sample}.regression.pd, used to find files\n\n optional arguments:\n -h, --help show this help message and exit\n -c COLUMN_NAME, --column-name COLUMN_NAME\n Name for combine column, e.g. population\n --no-merge Produce only a combined dataframe, not a merged\n dataframe. 
merging can add half an hour over\n combination, which takes seconds\n\nplot_fun.R\n~~~~~~~~~~\n\nThere is an additional script in ``scripts`` called ``plot_fun.R`` that takes a\nsingle argument\u2014the output of the regression step (e.g.\n``cis_var.YRI.20.totals.txt``) and creates a simple density pre-freq vs post freq\nplot.\n", "description_content_type": "", "docs_url": null, "download_url": "https://github.com/TheFraserLab/cisVar/archive/v2.0.0b3.tar.gz", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/TheFraserLab/cisVar", "keywords": "ATACseq ChIPseq regression bioinformatics", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "cisVar", "package_url": "https://pypi.org/project/cisVar/", "platform": "", "project_url": "https://pypi.org/project/cisVar/", "project_urls": { "Download": "https://github.com/TheFraserLab/cisVar/archive/v2.0.0b3.tar.gz", "Homepage": "https://github.com/TheFraserLab/cisVar" }, "release_url": "https://pypi.org/project/cisVar/2.0.0b3/", "requires_dist": null, "requires_python": "", "summary": "Calculate both pre and post frequencies for ChIP or ATAC style data", "version": "2.0.0b3" }, "last_serial": 3874381, "releases": { "2.0.0b3": [ { "comment_text": "", "digests": { "md5": "e5443bccfc7882d2777b2c135e58fe7d", "sha256": "4270291538b055be275920122badc3594a97bab7ea6d55464e1d1db256f8ee8a" }, "downloads": -1, "filename": "cisVar-2.0.0b3.tar.gz", "has_sig": false, "md5_digest": "e5443bccfc7882d2777b2c135e58fe7d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 45294, "upload_time": "2018-05-18T01:59:00", "url": "https://files.pythonhosted.org/packages/d8/12/58ebc7867fa7b88ccfb6d879a47c0f60899e6e8113789f294872f6127812/cisVar-2.0.0b3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "e5443bccfc7882d2777b2c135e58fe7d", "sha256": "4270291538b055be275920122badc3594a97bab7ea6d55464e1d1db256f8ee8a" }, "downloads": -1, 
"filename": "cisVar-2.0.0b3.tar.gz", "has_sig": false, "md5_digest": "e5443bccfc7882d2777b2c135e58fe7d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 45294, "upload_time": "2018-05-18T01:59:00", "url": "https://files.pythonhosted.org/packages/d8/12/58ebc7867fa7b88ccfb6d879a47c0f60899e6e8113789f294872f6127812/cisVar-2.0.0b3.tar.gz" } ] }