{ "info": { "author": "Jonathan Belyeu", "author_email": "jrbelyeu@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 2", "Programming Language :: Python :: 3" ], "description": "[![Build Status](https://travis-ci.org/ryanlayer/samplot.svg?branch=master)](https://travis-ci.org/ryanlayer/samplot)\n\n
\n\n
\n\n`samplot` is a command line tool for rapid, multi-sample structural variant\nvisualization. `samplot` takes SV coordinates and bam files and produces\nhigh-quality images that highlight any alignment and depth signals that\nsubstantiate the SV.\n\n## Usage\n```\nUsage: samplot.py [options]\n\nOptions:\n -h, --help show this help message and exit\n --marker_size=MARKER_SIZE\n Size of marks on pairs and splits (default 3)\n -n TITLES Space-delimited list of plot titles. Use quote marks to include spaces (i.e. \\\"plot 1\\\" \\\"plot 2\\\")\"\n -r REFERENCE Reference file for CRAM\n -z Z Number of stdevs from the mean (default 4)\n -b BAMS Bam file names (CSV)\n -o OUTPUT_FILE Output file name\n -s START Start range\n -e END End range\n -c CHROM Chromosome range\n -w WINDOW Window size (count of bases to include), default(0.5 *\n len)\n -d MAX_DEPTH Max number of normal pairs to plot\n -t SV_TYPE SV type\n -T TRANSCRIPT_FILE GFF of transcripts\n -A ANNOTATION_FILE Space-delimited list of bed.gz tabixed files of annotations (such as repeats, mappability, etc.)\n -a Print commandline arguments\n -H PLOT_HEIGHT Plot height\n -W PLOT_WIDTH Plot width\n -j Create only the json file, not the image plot\n --long_read=LONG_READ\n Min length of a read to be a long-read (default 1000)\n --common_insert_size Set common insert size for all plots\n```\n\n## Installing\nSince samplot runs as a Python script, the only requirements to use it are a working version of Python (2 or 3) and the required Python [libraries](https://raw.githubusercontent.com/ryanlayer/samplot/master/requirements.txt). 
Installation of these libraries can be performed easily by using conda:\n```\nconda install -y --file https://raw.githubusercontent.com/ryanlayer/samplot/master/requirements.txt\n```\n\nIf you have issues with `pysam`, then you may need to update your conda channels:\n```\nconda config --add channels r\nconda config --add channels bioconda\n```\n\nAll of these libraries are also available from [pip](https://pypi.python.org/pypi/pip). \n\nYou can download samplot by cloning the git repository:\n```\ngit clone https://github.com/ryanlayer/samplot.git\n```\nNo other installation is required.\n\n## Examples: \n\nSamplot requires either BAM files or CRAM files as primary input. If you use\nCRAM, you'll also need a reference genome like the one used by the 1000 Genomes\nProject\n(ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz).\n\n### Basic use case\nUsing data from NA12878, NA12889, and NA12890 in the \n[1000 Genomes Project](http://www.internationalgenome.org/about), we will\ninspect a possible deletion in NA12878 at 4:115928726-115931880 with respect\nto that same region in two unrelated samples NA12889 and NA12890.\n\nThe following command will create an image of that region:\n```\ntime samplot/src/samplot.py \\\n -n NA12878 NA12889 NA12890 \\\n -b samplot/test/data/NA12878_restricted.bam \\\n samplot/test/data/NA12889_restricted.bam \\\n samplot/test/data/NA12890_restricted.bam \\\n -o 4_115928726_115931880.png \\\n -c chr4 \\\n -s 115928726 \\\n -e 115931880 \\\n -t DEL\n\nreal 0m9.450s\nuser 0m9.199s\nsys 0m0.217s\n```\n\nThe arguments used above are:\n\n`-n` The names to be shown for each sample in the plot\n\n`-b` The BAM/CRAM files of the samples (space-delimited)\n\n`-o` The name of the output file containing the plot\n\n`-c` The chromosome of the region of interest\n\n`-s` The start location of the region of interest\n\n`-e` The end location of the region of interest\n\n`-t` The type of the variant of interest\n\nThis 
will create an image file named `4_115928726_115931880.png`, shown below:\n\n\n\n### Downsampling \"normal\" pairs\n\nThe runtime of `samplot` can be reduced by only plotting a portion of the concordant \npaired-end reads (+/- strand orientation, within z s.d. of the mean insert size, where z \nis a command line option that defaults to 4). If we rerun the prior example, but only plot\na random sampling of 100 normal pairs, we get a similar result 3.6X faster.\n\n```\ntime samplot/src/samplot.py \\\n -n NA12878 NA12889 NA12890 \\\n -b samplot/test/data/NA12878_restricted.bam \\\n samplot/test/data/NA12889_restricted.bam \\\n samplot/test/data/NA12890_restricted.bam \\\n -o 4_115928726_115931880.d100.png \\\n -c chr4 \\\n -s 115928726 \\\n -e 115931880 \\\n -t DEL \\\n -d 100\n\nreal 0m2.621s\nuser 0m2.466s\nsys 0m0.124s\n```\n\n\n\n\n### Gene and other genomic feature annotations\n\nGene annotations (tabixed, gff3 file) and genome features (tabixed, bgzipped, bed file) can be \nincluded in the plots.\n\nGet the gene annotations:\n```\nwget ftp://ftp.ensembl.org/pub/grch37/release-84/gff3/homo_sapiens/Homo_sapiens.GRCh37.82.gff3.gz\nbedtools sort -i Homo_sapiens.GRCh37.82.gff3.gz \\\n| bgzip -c > Homo_sapiens.GRCh37.82.sort.gff3.gz\ntabix Homo_sapiens.GRCh37.82.sort.gff3.gz\n```\n\nGet genome annotations, in this case Repeat Masker tracks and a mappability track:\n```\nwget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDukeMapabilityUniqueness35bp.bigWig\nbigWigToBedGraph wgEncodeDukeMapabilityUniqueness35bp.bigWig wgEncodeDukeMapabilityUniqueness35bp.bed\nbgzip wgEncodeDukeMapabilityUniqueness35bp.bed\ntabix wgEncodeDukeMapabilityUniqueness35bp.bed.gz\n\ncurl http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/rmsk.txt.gz \\\n| bgzip -d -c \\\n| cut -f 6,7,8,13 \\\n| bedtools sort -i stdin \\\n| bgzip -c > rmsk.bed.gz\ntabix rmsk.bed.gz\n```\n\nPlot:\n```\nsamplot/src/samplot.py \\\n -n NA12878 NA12889 NA12890 \\\n -b 
samplot/test/data/NA12878_restricted.bam \\\n samplot/test/data/NA12889_restricted.bam \\\n samplot/test/data/NA12890_restricted.bam \\\n -o 4_115928726_115931880.d100.genes_reps_map.png \\\n -c chr4 \\\n -s 115928726 \\\n -e 115931880 \\\n -t DEL \\\n -d 100 \\\n -T Homo_sapiens.GRCh37.82.sort.gff3.gz \\\n -A rmsk.bed.gz wgEncodeDukeMapabilityUniqueness35bp.bed.gz\n\nreal 0m2.784s\nuser 0m2.633s\nsys 0m0.129s\n```\n\n\n\n## Generating images from a VCF file\nTo plot images from structural variant calls in a VCF file, use samplot's\n`samplot_vcf.py` script. This accepts a VCF file and the BAM files of samples\nyou wish to plot, outputting images and the index for a web page for review. \n\n### Usage\n```\n$ python samplot/src/samplot_vcf.py -h\nusage: note that additional arguments are passed through to samplot.py\n [-h] [--vcf VCF] [-d OUT_DIR] [--ped PED] [--dn_only]\n [--min_call_rate MIN_CALL_RATE] [--filter FILTER]\n [-O {png,pdf,eps,jpg}] [--max_hets MAX_HETS]\n [--min_entries MIN_ENTRIES] [--max_entries MAX_ENTRIES]\n [--max_mb MAX_MB] [--important_regions IMPORTANT_REGIONS] -b BAMS\n [BAMS ...]\n\noptional arguments:\n -h, --help show this help message and exit\n --vcf VCF, -v VCF VCF file containing structural variants\n -d OUT_DIR, --out-dir OUT_DIR\n path to write output PNGs\n --ped PED path ped (or .fam) file\n --dn_only plots only putative de novo variants (PED file\n required)\n --min_call_rate MIN_CALL_RATE\n only plot variants with at least this call-rate\n --filter FILTER simple filter that samples must meet. Join multiple\n filters with '&' and specify --filter multiple times\n for 'or' e.g. 
DHFFC < 0.7 & SVTYPE = 'DEL'\n -O {png,pdf,eps,jpg}, --output_type {png,pdf,eps,jpg}\n type of output figure\n --max_hets MAX_HETS only plot variants with at most this many\n heterozygotes\n --min_entries MIN_ENTRIES\n try to include homref samples as controls to get this\n many samples in plot\n --max_entries MAX_ENTRIES\n only plot at most this many heterozygotes\n --max_mb MAX_MB skip variants longer than this many megabases\n --important_regions IMPORTANT_REGIONS\n only report variants that overlap regions in this bed\n file\n -b BAMS [BAMS ...], --bams BAMS [BAMS ...]\n Space-delimited list of BAM/CRAM file names\n```\n\n`samplot_vcf.py` can be used to quickly apply some basic filters to variants. Filters are applied via the `--filter` argument, which may be repeated as many times as desired. Each expression specified with the `--filter` option is applied separately in an OR fashion, while `&` characters may be used within an expression for AND operations.\n\n### Example:\n```\npython samplot_vcf.py \\\n --filter \"SVTYPE == 'DEL' & SU >= 8\" \\\n --filter \"SVTYPE == 'INV' & SU >= 5\" \\\n --vcf example.vcf\\\n -d test/\\\n -O png\\\n --important_regions regions.bed\\\n -b example.bam > samplot_commands.sh\n```\nThis example will create a directory named test (in the current working directory). A file named `index.html` will be created inside that directory. Samplot commands will be printed out for the creation of plots for all samples/variants that pass the above filters, assuming that the `samplot.py` script is in the same directory as the `samplot_vcf.py` script.\n\n**Filters:** The above filters will remove all samples/variants from output except:\n* `DEL` variants with `SU` of at least 8\n* `INV` variants with `SU` of at least 5\n\nThe specific `FORMAT` fields available in your VCF file may be different. 
I recommend SV VCF annotation with [duphold](https://github.com/brentp/duphold) by [brentp](https://github.com/brentp).\n\nFor more complex expression-based VCF filtering, try brentp's [slivar](https://github.com/brentp/slivar), which provides similar but broader options for filter expressions.\n\n**Region restriction.** Variants can also be filtered by overlap with a set of regions (for example, gene coordinates for genes correlated with a disease). The `important_regions` argument provides a BED file of such regions for this example.\n\n**Filtering for de novo SVs** \nUsing a [PED](https://gatkforums.broadinstitute.org/gatk/discussion/7696/pedigree-ped-files) file with `samplot_vcf.py` allows filtering for variants that may be spontaneous/de novo variants. This filter is a simple Mendelian violation test. If a sample 1) has valid parent IDs in the PED file, 2) has a non-homref genotype (1/0, 0/1, or 1/1 in VCF), 3) passes filters, and 4) both parents have homref genotypes (0/0 in VCF), the sample may have a de novo variant. Filter parameters are not applied to the parents. The sample is plotted along with both parents, which are labeled as father and mother in the image. \n\nExample call with the addition of a PED file:\n\n
```\npython samplot_vcf.py \\\n    --filter \"SVTYPE == 'DEL' & SU >= 8\" \\\n    --filter \"SVTYPE == 'INV' & SU >= 5\" \\\n    --vcf example.vcf\\\n    -d test/\\\n    -O png\\\n    --ped family.ped\\\n    --important_regions regions.bed\\\n    -b example.bam > samplot_commands.sh\n```
\n\n**Additional notes.** \n* Variants where fewer than 95% of samples have a call (whether reference or alternate) will be excluded by default. This can be altered via the command-line argument `min_call_rate`.\n* If you're primarily interested in rare variants, you can use the `max_hets` filter to remove variants that appear in more than `max_hets` samples.\n* Large variants can now be plotted easily by samplot through use of `samplot.py`'s `zoom` argument. However, you can still choose to only plot variants smaller than a given size using the `max_mb` argument. The `zoom` argument takes an integer parameter and shows only the intervals within +/- that parameter on either side of the breakpoints. A dotted line connects the ends of the variant call bar at the top of the window, showing that the region between breakpoint intervals is not shown.\n* By default, if fewer than 6 samples have a variant and additional homref samples are given, control samples will be added from the homref group to reach a total of 6 samples in the plot. This number may be altered using the `min_entries` argument.\n* Arguments that are optional in `samplot.py` can be given as arguments to `samplot_vcf.py`. They will be applied to each image generated.\n\n\n#### CRAM inputs\nSamplot also supports CRAM input, which requires a reference fasta file for\nreading as noted above. Notice that the reference file is not included in this\nrepository due to size. 
This time we'll plot an interesting duplication at\nX:101055330-101067156.\n\n```\nsamplot/src/samplot.py \\\n -n NA12878 NA12889 NA12890 \\\n -b samplot/test/data/NA12878_restricted.cram \\\n samplot/test/data/NA12889_restricted.cram \\\n samplot/test/data/NA12890_restricted.cram \\\n -o cramX_101055330_101067156.png \\\n -c chrX \\\n -s 101055330 \\\n -e 101067156 \\\n -t DUP \\\n -r hg19.fa\n```\n\n\nThe arguments used above are the same as those used for the basic use case, with the addition of the following:\n\n`-r` The reference file used for reading CRAM files\n\n#### Plotting without the SV \nSamplot can also plot genomic regions that are unrelated to an SV. If you do\nnot pass the SV type option (`-t`) then the top SV bar will go away and only\nthe region that is given by `-c` `-s` and `-e` will be displayed.\n\n#### Long read (Oxford nanopore and PacBio) and linked read support\nAny alignment that is longer than 1000 bp is treated as a long read, and\nthe plot design will focus on aligned regions and gaps. Aligned regions are in orange, and gaps follow the same DEL/DUP/INV color code used for short reads. The height of the alignment is based on the size of its largest gap.\n\n\n\nIf the bam file has an MI tag, then the reads will be treated as linked reads.\nThe plots will be similar to short read plots, but all alignments with the same MI are plotted at the same height, set by the alignment with the largest gap in the group. 
A green line connects all alignments in a group.\n\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/ryanlayer/samplot.git", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "samplot", "package_url": "https://pypi.org/project/samplot/", "platform": "", "project_url": "https://pypi.org/project/samplot/", "project_urls": { "Homepage": "https://github.com/ryanlayer/samplot.git" }, "release_url": "https://pypi.org/project/samplot/1.0.3/", "requires_dist": [ "numpy", "matplotlib", "pysam (>=0.15.2)", "statistics" ], "requires_python": "", "summary": "plotting package for genomic structural variation", "version": "1.0.3" }, "last_serial": 5372669, "releases": { "1.0.3": [ { "comment_text": "", "digests": { "md5": "c6659c660268473c2abdd60b89291fc6", "sha256": "6492d82a01f936723afa08185a25802e393df17c93e23357a0130ab779f28e79" }, "downloads": -1, "filename": "samplot-1.0.3-py2-none-any.whl", "has_sig": false, "md5_digest": "c6659c660268473c2abdd60b89291fc6", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 34731, "upload_time": "2019-06-07T18:36:45", "url": "https://files.pythonhosted.org/packages/81/6b/3b741027ccb1a35946720578cd2bba8b64c3923e648d4f47e53c643a25a2/samplot-1.0.3-py2-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "c6659c660268473c2abdd60b89291fc6", "sha256": "6492d82a01f936723afa08185a25802e393df17c93e23357a0130ab779f28e79" }, "downloads": -1, "filename": "samplot-1.0.3-py2-none-any.whl", "has_sig": false, "md5_digest": "c6659c660268473c2abdd60b89291fc6", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 34731, "upload_time": "2019-06-07T18:36:45", "url": 
"https://files.pythonhosted.org/packages/81/6b/3b741027ccb1a35946720578cd2bba8b64c3923e648d4f47e53c643a25a2/samplot-1.0.3-py2-none-any.whl" } ] }