{ "info": { "author": "Woltjen & Bourque labs", "author_email": "jean.monlong@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# MHcut\n\nMHcut was run on deletions from dbSNP and ClinVar (see commands in the [*scripts-dbSNP-ClinVar* folder](scripts-dbSNP-ClinVar)). \nThe results can be explored online through the [MHcut Browser](https://mhcut-browser.genap.ca/).\nThe dataset was also deposited on FigShare: [https://doi.org/10.6084/m9.figshare.9118364](https://doi.org/10.6084/m9.figshare.9118364).\n\n- [Installation](#installation)\n- [Preparing the reference genome](#preparing-the-reference-genome-and-jellyfish-index)\n- [Usage](#usage)\n- [Output](#output)\n- [Test installation](#test-installation)\n- [Install dependencies](#install-dependencies)\n- [Docker image](#docker-image)\n- [Methods](#methods)\n\n# Installation\n\n*Python 2.7 or higher (but not Python 3).*\n\nMHcut is in [PyPI](https://pypi.org/project/MHcut/) and can be installed with:\n\n```sh\npip install MHcut ## add --user if you don't have root permissions\n```\n\nOr for the latest version on GitHub:\n\n```sh\ngit clone --recursive https://github.com/jmonlong/MHcut.git\ncd MHcut\npip install . ## add --user if you don't have root permissions\n```\n\nIf using pip install `--user` make sure to add `/home/$(whoami)/.local/bin` to your `$PATH` if you want to run the MHcut script.\n\nYou will also need [JellyFish](http://www.genome.umd.edu/jellyfish.html).\nFollow the instruction in its Homepage or find [more information below](#install-dependencies).\n\nThese dependencies are not particularly \"painful\" to install but we also built a **Docker container** as an alternative (see [Docker instructions](#docker-image)).\n\n# Preparing the reference genome and JellyFish index\n\nFirst download and unzip the reference genome, for example:\n\n```sh\nwget http://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.fa.gz\ngunzip hg38.fa.gz\n```\n\nEventually, you can index the genome using the following command:\n\n```sh\nMHcut -ref hg38.fa\n```\n\nOtherwise this indexing will be done automatically the first time that MHcut is run (might take a few extra minutes).\n\nAfter indexing the genome, JellyFish can quickly find the number of exact matches for a particular sequence.\nTo index 23-mers in the reference genome in both strands:\n\n```sh\njellyfish count --out-counter-len 1 -C -m 23 -s 100M hg38.fa\n```\n\nNote: Use `-t` to use multiple cores, e.g. `-t 10` to use 10 cores.\n\nThe output file `mer_counts.jf` will later be given to MHcut using `-jf` (see Usage below).\n\n# Usage\n\n```sh\nMHcut -var NCBI_Variation_Viewer_data_uniq.tsv -ref hg38.fa -jf mer_counts.jf -out MHcut-NCBI-chrX\n```\n\nThe required parameters are:\n\n- *-var* a TAB delimited file starting with chr/start/end columns and potentially followed by other columns (e.g. rsid, gene). The first row contains the column names.\n- *-ref* a fasta file with the reference genome.\n- *-jf* the 23-mers count file created by JellyFish. \n- *-out* the prefix for the output files (TSV files). \n\nOther optional parameters:\n\n- *-minvarL* the minimum length for a variant to be considered. Default is `3`.\n- *-minMHL* the minimum length of the MH. Default is `3`.\n- *-minm1L* the minimum length of the first stretch if the microhomology. Default is `3`.\n- *-nofilt* don't filter variants without MH. all input variants will be present in the output *-variants* file. If used, the following parameters will NOT be taken into account: -minMHL, -minhom, -minm1L.\n- *-maxConsMM* the maximum number of consecutive mismatches allowed when extending the MH. Default is `1`.\n- *-maxTail* the maximum distance betweem the MHs and a PAM cut to be considered valid. Currently default at `50`. Relevant for large variants.\n- *-minhom* the minimum ratio of homology in the whole microhomology. Default is `0`.\n- *-PAM* the PAM sequence. Default is `NGG`. Possibly several separated by \",\".\n- *-PAMcut* the cut position relative to the PAM motif. Default is `-3`\n- *-minLnhm* the minimum length of nested MH to be considered in the nested MH check. Default is `3`.\n- *-2fls* report results for both flank configurations instead of the one with strongest MH.\n- *-noShift* use input coordinates without trying to shift the variant to find the best MH.\n- *-restart* will check for existing output files and continue from the variant (useful when long jobs hit a cluster's walltime).\n\n\n# Output\n\n## The \"variant\" file\n\nNamed `PREFIX-variants.tsv`, the \"variant\" file has one line per input variant with information about the MH found and if a valid PAM cut is available or not.\n\nCurrently the columns of the output are:\n\n- The columns of the input `.tsv` file. For example: *chr*, *start*, *end*, *rsid*, *gene*.\n- *flank*/*mhScore* the flank configuration (1:outer-inner, 2:inner-outer) and MH score (\"Microhomology search\" above).\n- *mhL*: MH length.\n- *mh1L*: number of first consecutive matches.\n- *hom*: proportion of matches.\n- *nbMM*: number of mismatches.\n- *mhMaxCons*: longest stretch of consecutive matches in the MH.\n- *mhDist*: distance between the end of MH and the variant boundary.\n- *mh1Dist*: distance between the end of the first consecutive matches and the variant boundary.\n- *MHseq1*/*MHseq2*: sequences of the MH.\n- *GC*: the GC content of the MH sequence (maximum of *MHseq1*/*MHseq2*).\n- *pamMot*: the number of PAM motives in a valid location, no matter how unique the protospacer is.\n- *pamUniq*: the number of PAM motives in a valid location and with unique the protospacer.\n- *guidesNoNMH*: the number of guides that have no nested MH.\n- *guidesMinNMH*: the number of nested MH for the guide which have the least amount of nested MH.\n- *max2cutsDist*: the distance between the two unique cuts the furthest from each other. *NA* if *pamUniq*<2.\n- *maxInDelphiFreqmESC*,*maxInDelphiFreqU2OS*,*maxInDelphiFreqHEK293*,*maxInDelphiFreqHCT116*,*maxInDelphiFreqK562*: the maximum frequency predicted by inDelphi for this exact deletion for different cell types. \n- *maxInDelphiFreqMean* the maximum average frequency predicted by inDelphi for this exact deletion (average across the different cell types).\n\n## The \"guide\" file\n\nNamed `PREFIX-guides.tsv`, the \"guide\" file has one line per protospacer. \nThis means that the same variant can be present several times if several valid PAM cuts are available. \n\nCurrently the columns of the output are the same as for the \"variant\" file with the following additional columns:\n\n- *protospacer*/*protoStrand* the sequence and strand of the protospacer.\n- *pamSeq* the PAM.\n- *mm0* the number of position in the genome where the sequence align with no mismatches.\n- *m1Dist1* and *m1Dist2*: the distance between the PAM cut the left or right stretch of perfect match, respectively.\n- *mhDist1* and *mhDist2*: the distance between the PAM cut the left or right micro-homology, respectively.\n- *nbNMH* the number of nested MH.\n- *largestNMH* the size of the largest nested MH.\n- *nmhScore* the MMEJ score of the best **n**ested **m**icro-**h**omology MH (best defined as the highest MMEJ score).\n- *nmhSize* the MH length of the best **n**ested **m**icro-**h**omology MH (best defined as the highest MMEJ score).\n- *nmhVarL* the length of the variant created by the best **n**ested **m**icro-**h**omology MH (best defined as the highest MMEJ score).\n- *nmhGC* the GC content of the best **n**ested **m**icro-**h**omology MH (best defined as the highest MMEJ score).\n- *nmhSeq* the sequence of the best **n**ested **m**icro-**h**omology MH (best defined as the highest MMEJ score).\n- *inDelphiFreqmESC*,*inDelphiFreqU2OS*,*inDelphiFreqHEK293*,*inDelphiFreqHCT116*,*inDelphiFreqK562*: the frequency predicted by inDelphi for this exact deletion. \n- *inDelphiFreqMean*: the average frequency predicted by inDelphi for this exact deletion across the different cell types. \n\n## The \"cartoon\" file\n\nNamed `PREFIX-cartoons.tsv`, the \"cartoon\" file has one paragraph per variant with:\n\n1. The corresponding line from the \"variant\" file (e.g. MH metrics, PAM found or not).\n1. The location of the micro-homology. `|` means a match, `x` a mismatch.\n1. The sequence of the flanks and variant. `-` marks the limit of the variant. Eventually `...` marks the middle part of a large variant that is not shown for clarity purpose.\n1. The location of valid PAM cuts. `\\` and `/` depending on the strand of the PAM motif. `X` means there are cuts on each side of the base in opposite strand.\n\nFor example, a 3 bases perfect MH with 3 valid PAM cuts:\n\n```\nchr8\t41725834\t41725853\tANK1\tTrue\t3\t3\t1.0\t0\t17\tGCG\tGCG\n ||| |||\nGCGTGTCGTCGTTGCGGGCC-GCGATGTGCAGGGCCGGGAG-GCGCACCTTCCCCTTGGTGC\n____________________ _______\\___\\\\_______ ____________________\n```\n\n# Test installation\n\nA small dummy dataset is provided in the `testdata` folder, to test that the installation was completed successfully.\n\nTry running:\n\n```shell\ncd testdata\njellyfish count --out-counter-len 1 -C -m 23 -s 100M chr20-1Mbp.fa\nMHcut -var test-chr20-1Mbp.tsv -ref chr20-1Mbp.fa -jf mer_counts.jf -out test\n```\n\nOr for a more comprehensive run:\n\n```sh\nMHcut -var test-chr20-1Mbp.tsv -ref chr20-1Mbp.fa -jf mer_counts.jf -out test -nofilt -minvarL 1 -indelphi\n```\n\n# Install dependencies\n\n## JellyFish\n\nThe easiest way to install JellyFish is to download a binary and include it in your *PATH*.\nThe latest releases of [JellyFish](http://www.genome.umd.edu/jellyfish.html) provides a **macosx** binary and a **linux** binary.\nSee for examples the [2.2.10 release](https://github.com/gmarcais/Jellyfish/releases/tag/v2.2.10).\n\nOnce the binary downloaded it's just a matter of making it executable and updating your PATH to find it.\nFor example (works for both linux and macosx binary):\n\n```sh\nchmod +x jellyfish-linux\nmkdir -p ~/bin\nmv jellyfish-linux ~/bin/jellyfish\nexport PATH=~/bin:$PATH\n```\n*Add the last line to your `~/.basrc` file to make sure the PATH is always correct.*\n\nOtherwise you can always compile it from source.\nFor example if you want to build it in a folder `~/soft`:\n\n```sh\nmkdir -p ~/soft\ncd ~/soft\nwget https://github.com/gmarcais/Jellyfish/releases/download/v2.2.10/jellyfish-2.2.10.tar.gz\ntar -xzvf jellyfish-2.2.10.tar.gz\ncd jellyfish-2.2.10\n./configure --prefix=`pwd`\nmake\nmake install\nexport PATH=~/soft/jellyfish-2.2.10/bin:$PATH\n```\n\n*Add the last line to your `~/.basrc` file to make sure the PATH is always correct.*\n\nIf you prefer to use Docker, see the [Docker instructions](#docker-image) below.\n\n# Docker image\n\nDocker simplifies the installation of tools by preparing an image of an OS with all the requirements and dependencies to run the tool.\nNo need for the user to install many different libraries and tools on their system, the docker image can be used directly.\nThe user still has to [install Docker](https://docs.docker.com/install/) though.\n\n## Docker crash course\n\nTo run a command `command arg1 arg2 ...` within a docker container, a typical docker command looks like this:\n\n```shell\ndocker run -v PATHL:PATHC -w PATHC imageName command arg1 arg2 ...\n```\n\n- `-v PATHL:PATHC` links the local folder `PATHL` to the folder `PATHC` in the container. Typically `PATHL` contains the input files and will be where we want the output files written.\n- `-w PATHC` means that the working directory (where the command will be run) is `PATHC`, i.e. we want to run the command in the folder that we linked with the previous parameter (and that contains the input files).\n- `imageName` is the name of the docker container. If not built manually, Docker will try to download it from Docker Hub or other platforms.\n\nHence for us, a dockerized command might look like that:\n\n```shell\ndocker run -v \"`pwd`\":/home -w /home jmonlong/mhcut MHcut -var clinvar-grch38-all-deletion.tsv -ref hg38.fa -out docker-test\n```\n\nWe link the current folder (`` `pwd` ``) with a `home` folder in the container that we will use as working directory and run the python command.\n\n## Docker workflow for MHcut\n\n*Note: this assumes that the reference genome file was downloaded and the input TSV file is ready.*\n\nThe [MHcut docker image](https://hub.docker.com/r/jmonlong/mhcut) will automatically be downloaded by Docker the first time that `docker run jmonlong/mhcut` is run.\n\nIndexing the reference with JellyFish (if not already done):\n\n```shell\ndocker run -v \"`pwd`\":/home -w /home jmonlong/mhcut jellyfish count --out-counter-len 1 -C -m 23 -s 100M hg38.fa\n```\n\nRunning MHcut:\n\n```shell\ndocker run -v \"`pwd`\":/home -w /home jmonlong/mhcut MHcut -var clinvar-grch38-all-deletion.tsv -ref hg38.fa -jf mer_counts.jf -out docker-test\n```\n\nYou can test it on the test data:\n\n```sh\ncd testdata\ndocker run -v \"`pwd`\":/home -w /home jmonlong/mhcut jellyfish count --out-counter-len 1 -C -m 23 -s 100M chr20-1Mbp.fa\ndocker run -v \"`pwd`\":/home -w /home jmonlong/mhcut MHcut -var test-chr20-1Mbp.tsv -ref chr20-1Mbp.fa -jf mer_counts.jf -out test\n```\n\n## Optional: build the Docker image manually\n\nAn image of MHcut is [available on Docker Hub](https://hub.docker.com/r/jmonlong/mhcut) but one might still want to build an image manually to use the current version on the master branch.\n\nAfter cloning the repo, we can build the container manually by running the following command at the root of the repo (where the `Dockerfile` is):\n\n```shell\ndocker build -t mhcut .\n```\n\n# Methods\n\n## Microhomology search\n\n Flank 1: |||x||| |||x|||\n\tAGTGCCGTTAATCAGAGGTC-GGGCTGTGATGGTC-GGGGTGTTGTCGTTGACGTC\n\tFlank 2: |||| ||||\n\nFor each flank the microhomology is extended until the end of the variant as long as:\n\n- first base is a match.\n- no 2 consecutive mismatches.\n\nTwo flanking configurations exists: outer-inner (1) and inner-outer (2).\nIf flank1-variant-flank2, configuration 1 is MH between flank1 and variant, configuration 2 is MH between flank2 and variant. \nIn other words:\n- Outer-inner: MH between 3' variant sequence and 5' flanking sequence.\n- Inner-outer: MH between 5' variant sequence and 3' flanking sequence.\n\nMHcut uses a score to choose the \"best\" flank. \nThe score is currently the number of matches + number of consecutive first matches.\nIn the example above, *Flank 1* is chosen (score: 9 vs 8).\n\n## Shifting deletion\n\nSometimes the same deletion can be represented by different coordinates. \nFor this reason, MHcut will try to shift the deletion when possible and pick the coordinates that result in the highest micro-homology.\n\nFor example, in the previous example, the following representation has a better homology and represent the exact same deletion:\n\n ||||||| |||||||\n\tAGTGCCGTTAATCAGAGGTCGGG-CTGTGATGGTCGGG-GTGTTGTCGTTGACGTC\n\n## PAM cut search\n\n ||||x|| ||||x||\n\tAGTGCCGTTAATCAGAGGTC-GGGCTGTGATGGTC-GGGCAGTTGTCGTTGACGTC\n\t <---------->\n\nPAM cuts are searched between the end of the first exact match stretch of the MH and the variant boundary. \nIn addition, a PAM cut is allowed to be within the MH region if after the first 3 positions. \nIn the example above, the first valid cut is between the G and C, and the last valid cut between the C and G.\n\nWe consider only PAMs that result in the a cut at less than 50 bp from the variant's breakpoints. \nThis can be changed with the `-maxTail` parameter (see [Usage](#usage)).\nIt saves time by skipping PAMs in the middle of large deletions, as they couldn't be used in practice anyway.\n\nPAM cuts are enumerated in both strands. \nFor each valid cut, the protospacer sequence is retrieved.\n\nProtospacers are checked against the genome to count the number of exact matches: *mm0* represents the number of position with full alignment and no mismatch. \nIf *mm0* is equal to one, i.e. a unique match in the genome, the PAM is considered *unique*.\n\nFor each protospacer/cut, we also list other MHs that flank the cut and could be used preferentially instead of the one desired.\nThese nested MH could decrease the efficiency of recreating the deletion.\nOnly exact MHs of at least 3 bp are considered and if at least as close from each other as the target MH.\nAmong others, the output contains information about the best nested MH (shortened to *nmh*) defined as the nested MH with the highest pattern score ([Bae et al 2014](http://www.nature.com.proxy3.library.mcgill.ca/articles/nmeth.3015)).\n\nEach protospacer/cut is also tested with [inDelphi](https://indelphi.giffordlab.mit.edu/) to provide predicted frequency of the desired deletion.\n\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/WoltjenLab/MHcut", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "MHcut", "package_url": "https://pypi.org/project/MHcut/", "platform": "", "project_url": "https://pypi.org/project/MHcut/", "project_urls": { "Homepage": "http://github.com/WoltjenLab/MHcut" }, "release_url": "https://pypi.org/project/MHcut/1.0.0/", "requires_dist": [ "pyfaidx", "tqdm", "argparse", "numpy (==1.15.3)", "pandas (==0.23.4)", "scikit-learn (==0.20.0)", "scipy (==1.1.0)" ], "requires_python": "", "summary": "Micro-Homology cut finder", "version": "1.0.0" }, "last_serial": 5602680, "releases": { "0.1.1": [ { "comment_text": "", "digests": { "md5": "c5bf9dff855f58aeff8b5da07f4f5f81", "sha256": "d5054f56c401e791e9b85f73a455999153dfacbf23574446429329efcef5506e" }, "downloads": -1, "filename": "MHcut-0.1.1-py2-none-any.whl", "has_sig": false, "md5_digest": "c5bf9dff855f58aeff8b5da07f4f5f81", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 31742, "upload_time": "2019-07-25T06:47:33", "url": "https://files.pythonhosted.org/packages/72/bd/835fc0df6bbba04310682303fbb06a27a5dff5e9dd972d3116d6658c1290/MHcut-0.1.1-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "78cdf2df6695390bebd03d860e5e691f", "sha256": "82aed2e5beb33010259869ada4c7fc4ffb345c62d9bf2d57f1f0bd38f3133c99" }, "downloads": -1, "filename": "MHcut-0.1.1.tar.gz", "has_sig": false, "md5_digest": "78cdf2df6695390bebd03d860e5e691f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 31494, "upload_time": "2019-07-25T06:48:13", "url": "https://files.pythonhosted.org/packages/e5/b7/0c43d47df24aec275bf317f5fe667a04d5b832cae3d32bfed9c0724613b6/MHcut-0.1.1.tar.gz" } ], "1.0.0": [ { "comment_text": "", "digests": { "md5": "e33294c2c01885e10f2bf57154cec87a", "sha256": "e33678a64b45bfb07d4e580585a89332e8297b139d67dfc2aad9016d155625e8" }, "downloads": -1, "filename": "MHcut-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "e33294c2c01885e10f2bf57154cec87a", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 49153, "upload_time": "2019-07-30T00:50:06", "url": "https://files.pythonhosted.org/packages/ec/4c/af0805b2c8cbe3f456d90ec0091aaa67df3db82f0310d5fc609439abb6d1/MHcut-1.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "02e68260b4b72389c4ddc5a4c7b73ea6", "sha256": "e3a1f1a77379a963f704235f66081a7a759560fdc12821073a18f21ecb271679" }, "downloads": -1, "filename": "MHcut-1.0.0.tar.gz", "has_sig": false, "md5_digest": "02e68260b4b72389c4ddc5a4c7b73ea6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 34044, "upload_time": "2019-07-30T00:50:08", "url": "https://files.pythonhosted.org/packages/c3/72/9b9242a1bfb234cc6734f987f9856ff56051ac8eae7542e5784ab430d879/MHcut-1.0.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "e33294c2c01885e10f2bf57154cec87a", "sha256": "e33678a64b45bfb07d4e580585a89332e8297b139d67dfc2aad9016d155625e8" }, "downloads": -1, "filename": "MHcut-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "e33294c2c01885e10f2bf57154cec87a", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 49153, "upload_time": "2019-07-30T00:50:06", "url": "https://files.pythonhosted.org/packages/ec/4c/af0805b2c8cbe3f456d90ec0091aaa67df3db82f0310d5fc609439abb6d1/MHcut-1.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "02e68260b4b72389c4ddc5a4c7b73ea6", "sha256": "e3a1f1a77379a963f704235f66081a7a759560fdc12821073a18f21ecb271679" }, "downloads": -1, "filename": "MHcut-1.0.0.tar.gz", "has_sig": false, "md5_digest": "02e68260b4b72389c4ddc5a4c7b73ea6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 34044, "upload_time": "2019-07-30T00:50:08", "url": "https://files.pythonhosted.org/packages/c3/72/9b9242a1bfb234cc6734f987f9856ff56051ac8eae7542e5784ab430d879/MHcut-1.0.0.tar.gz" } ] }