{ "info": { "author": "Alastair Maxwell/University of Glasgow", "author_email": "alastair.maxwell@glasgow.ac.uk", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Environment :: Console", "Intended Audience :: Education", "Intended Audience :: End Users/Desktop", "Intended Audience :: Science/Research", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "Operating System :: MacOS :: MacOS X", "Operating System :: POSIX", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7" ], "description": "ScaleHD: Automated Huntington Disease genotyping\n=========================================================\nScaleHD is a package for automating the process of genotyping microsatellite repeats in Huntington Disease data.\nWe utilise machine learning approaches to take into account natural data 'artefacts', such as PCR slippage and somatic\nmosaicism, when processing data. This provides the end-user with a simple to use platform which can robustly predict genotypes from input data.\n\nBy default, input is a pair of unaligned .fastq sequence data -- both forward and reverse reads, per sample. We utilise both forward and reverse\nreads in order to reduce the complex dimensionality issue posed by Huntington Disease's multiple repeat tract genetic structure. Reverse reads allow\nus to determine the current sample's CCG state -- this provides us with a mechanism by which to more easily call the entire genotype. Forward reads\nare utilised in a similar approach, to determine the CAG and intervening structure.\n\nThe general overview of the application is as follows:\n1) Input FastQ files are subsampled, if an overwhelming number of reads are present. This can be overruled with the -b flag.\n2) Sequence quality control is carried out per the user's instructions. We reccomend triming of any 5-prime spacer+primer combinations, for optimal alignment.\n3) Alignment of these files, to a typical HD structure (CAG_1_1_CCG_2) reference, is carried out.\n4) Assemblies are scanned with Digital Signal Processing to detect any possible atypical structures (e.g. CAG_2_1_CCG_3).\n4.1) If no atypical alleles are detected, proceed as normal.\n4.2) If atypical alleles are detected, a custom reference is generated, and re-alignment to this is carried out.\n5) With the appropriate allele information and sequence assembly(ies) present, sampled are genotyped.\n6) Output is written for the current sample; the procedure is repeated for the next sample in the queue (if present).\n\n\nWhat's New\n==========\n* Added an n-aligned matrix of repeat-count distributions, on a SHD instance-wide basis.\n* Instances of SHD will output the utilised configuration file with other results.\n* Removed the -b/--boost flag, and made subsampling the default behaviour (given acceptable read-count in raw input data)\n* Added the -b/--broadscope flag, which forces alignment and DSP to be executed on all present reads (i.e. no subsampling).\n* Added the -e/--enshrine flag, which forces SHD to retain all aligned reads which are not uniquely mapped (which are removed by default).\n* Implemented DSP to function within the intervening sequence, rather than utilising a string derision method.\n* Added many report flags for SHD's instance report output -- more on this in the Output section of this readme.\n\n\nInstallation Prerequisites\n==========================\nIf you do not have sudo access (to install requisite packages), you should run ScaleHD in a user-bound local python environment,\n or discrete installation. This guide will assume you have sudo access. However, we detail an extra stage on setting up a local\n python environment for use with ScaleHD.\n\n0. (Optional 1 - no sudo) Python 2.7 Setup\n ~~~~\n $ cd desired-directory\n $ tar jvzf Python-2.7.tar.bz2\n $ cd Python-2.7\n $ ./configure --enable-shared --prefix=/your/custom/installation/path\n $ make\n $ make install\n ~~~~\n\n0. (Optional 2 - no sudo) Bash profile edit.. in your ~/.bash_profile file\n ~~~~\n $ export PATH=/your/custom/installation/path/bin:$PATH\n $ export LD_LIBRARY_PATH=/your/custom/installation/path/lib:$LD_LIBRARY_PATH\n ~~~~\n\n1. Get PIP if not already installed!\n ~~~~\n $ wget https://bootstrap.pypa.io/get-pip.py\n $ python ~/path/to/get-pip.py\n ~~~~\n\n2. Install Cython/Scipy stack separately (Setuptools seems to install incorrectly..)\n ~~~~\n $ pip install cython\n $ pip install scipy\n $ pip install numpy\n ~~~~\n\n3. Install ScaleHD from src (pip coming soon...)\n ~~~~\n $ cd ~/path/to/ScaleHD/src/\n $ python setup.py install\n ~~~~\n\n4. Install required third-party binaries. Please make sure any binaries you do install are included on your $PATH so that they can be found by your system. \n**Please note**, ScaleHD will utilise GNU WHICH/TYPE to determine if a command is on your $PATH. If either WHICH/TYPE or a dependency is missing, ScaleHD will inform you and exit.\n ~~~~\n Quality Control:\n Cutadapt\n FastQC (Java required)\n Alignment:\n BWA \n SeqTK\n Samtools\n Generatr (setup.py will install this for you)\n Genotyping:\n R \n Samtools\n Generatr (as above)\n Picard (alias required*)\n GATK (alias required*)\n ~~~~\n*aliases are required for certain third party software which are not distributed as installable binaries. An example of an alias would\nlook like:\n\n alias gatk=\"java -jar /Users/home_dir/Documents/Builds/GenomeAnalysisTK.jar\"\n\n5. Check that libxml2-dev and libxslt-dev are installed...\n\nUsage\n=====\nGeneral usage is as follows:\n\n $ scalehd [-h/--help] [-v] [-c CONFIG] [-t THREADS] [-e] [-b] [-g] [-j \"jobname\"] [-o OUTPUT]\n e.g.\n $ scalehd -v -c ~/path/to/config.xml -t 12 -j \"ExampleJobPrefix\" -o ~/path/to/master/output\n\nScaleHD flags are:\n\n -h/--help:: Simple help message explaining flags in detail\n -v/--verbose:: Enables verbose mode in the terminal (i.e. shows user feedback)\n -c/--config:: Will execute all settings specified in the given ArgumentConfig.xml [filepath].\n -t/--threads:: Number of threads to utilise. Mainly will affect alignment performance [integer].\n -e/--enshrine:: Forces aligned reads which are not uniquely mapped to be retained; default behaviour without this flag removes said reads.\n -b/--broadscope:: Forces subsampling of raw and aligned reads to be disabled.\n -g/--groupsam:: Groups all aligned assemblies generated into one output folder, with appropriate sample names. If not specified, assemblies will be left in the sample's specific output subfolder.\n -j/--jobname:: Specifies a prefix to use for the root output directory. Optional. If you specify a JobName that already\n exists within your specified -o output folder, ScaleHD will prompt the user to decide if they wish to delete the pre-existing folder and replace.\n -o/--output:: Desired output directory.\n\nData Primer\n===========\nA short note on the requirements of filenames/structure for ScaleHD to function. A sample's filename (here, ExampleSampleName) must adhere to the following structure:\n\n ExampleSampleName_R1.fastq\n ExampleSampleName_R2.fastq\n\nYou must utilise both forward (R1) and reverse (R2) reads, per sample pair. If the respective files do not end in _R1.fastq (.fq) or _R2.fastq (.fq), ScaleHD will not run correctly.\nSince this is a highly HD specific application, we can offer some insight into providing the best approaches for genotyping. Due to the similarity of both repeat tracts in HD (CAG and CCG), which\nare flanking an intervening sequence, that in itself is highly similar to both regions, alignment can be fussy about your input data. Thus, we highly recommend trimming any spacers or primers present\non the 5Prime end of your reads; this enables reads to start at the same position and provides the aligner with a more discrete boundary between the different HD repeat tracts.\n\nIndividual settings for different stages in ScaleHD are set within a configuration XML document. The particular acceptable data types/ranges for each parameter varies. The configuration XML document for ScaleHD settings must also adhere to the following structure:\n\n \n \n \n \n \n \n\nWith each parameter data type/rule being as follows:\n\n CONFIG\n data_dir: Must be a real path, with an even number of ONLY *.fastq or *.fq files within.\n forward_reference: Must be a real reference file (*.fasta, *.fa or *.fas).\n reverse_reference: See forward_reference.\n INSTANCE\n quality_control: Boolean, TRUE/FALSE\n sequence_alignment: Boolean, TRUE/FALSE\n atypical_realignment: Boolean, TRUE/FALSE\n genotype_prediction: Boolean, TRUE/FALSE\n snp_calling: Boolean, TRUE/FALSE\n TRIM\n trim_type: String, \"Quality\", \"Adapter\" or \"Both\"\n quality_threshold: Integer, within the range 0-38\n adapter_flag: String, one of: '-a','-g','-a$','-g^','-b'. ([See Cutadapt](http://cutadapt.readthedocs.io/en/stable/guide.html#removing-adapters))\n adapter: String, consisting of only 'A','T','G','C'\n error_tolerance: Float, within the range of 0.0 to 1.0 (in 0.01 increments).\n ALIGNMENT\n All flags present are direct equivalents of parameters present in BWA-MEM. \n See [the BWA manual for more information](http://bio-bwa.sourceforge.net/bwa.shtml).\n PREDICTION\n plot_graphs: Boolean, TRUE/FALSE\n\nOutput\n======\nA brief overview of flags provided in the output is as follows:\n\n SampleName:: The extracted filename of the sample that was processed.\n Primary/Secondary GTYPE:: Allele genotype in the format CAG_x_y_CCG_z\n Status:: Atypical or Typical structure\n BSlippage:: Slippage ratio of allele's read peak ('N minus 2' to 'N minus 1)', over 'N'.\n Somatic Mosaicism:: Mosaicism ratio of allele's read peak ('N plus 1' to 'N plus 10'), over 'N'\n Confidence:: Confidence in genotype prediction (0-100).\n Exception Raised:: If, during a particular stage of the pipeline, exceptions caused the processing to fail, this flag will inform the user in which stage it crashed.\n Homozygous Haplotype:: If True, both alleles have an identical genotype.\n Neighbouring Peaks:: If True, both alleles exist within the same CCG distribution, neighbouring each other.\n Diminished Peaks:: If True, an expanded peak has very few reads and was detected independently. Manual inspection recommended.\n Novel Atypical:: If True, an intervening sequence structure that has not been readily observed before was detected. Manual inspection recommended.\n Alignment Warning:: If True, determining the CCG value(s) returned more peaks than is 'possible'. Manual inspection recommended.\n Atypical Alignment Warning:: In the case of atypical re-alignment, particularly awful alignment quality can return more than one peak; which should not happen.\n CCG Rewritten:: CCG was rewritten from the FOD-derived value -- i.e. DSP overwrote the FOD results.\n CCG Zygosity Rewritten:: A sample (aligned to a typical reference) that was heterozygous (CCG), was detected to be an atypical homozygous (CCG) sample.\n CCT Uncertainty:: The most common CCT 'sizes' returned from DSP were too similar in count (e.g. CCT2 == 54%, CCT3 == 46%) to be certain.\n SVM Failure:: SVM CCG zygosity calling was incorrect, as a result of the resultant confusion matrix providing differing results from a brute force ratio check. Manual inspection highly recommended.\n Differential Confusion:: The allele sorting algorithm is confused between a potential neighbouring peak, and a homozygous haplotype. Manual inspection highly recommended.\n Peak Inspection Warning:: At least one allele failed inspection on the repeat-count distribution the genotype(s) was(were) derived from. Common in very low read count samples/poor sequencing.\n Low Distribution Reads:: A warning which is triggered when at least one allele's repeat count distribution contains an unappealingly low number of reads.\n Low Peak Reads:: A fatal warning which is triggered when, in a given repeat count distribution, the returned N value contains a very low number of reads. Manual inspection highly recommended.\n\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/helloabunai/ScaleHD-ALSPAC", "keywords": "machine-learning genotyping bioinformatics huntington-disease", "license": "GPLv3", "maintainer": "", "maintainer_email": "", "name": "ScaleHD-ALSPAC", "package_url": "https://pypi.org/project/ScaleHD-ALSPAC/", "platform": "", "project_url": "https://pypi.org/project/ScaleHD-ALSPAC/", "project_urls": { "Homepage": "https://github.com/helloabunai/ScaleHD-ALSPAC" }, "release_url": "https://pypi.org/project/ScaleHD-ALSPAC/0.3/", "requires_dist": [ "lxml", "numpy", "seaborn", "matplotlib", "sklearn", "scipy", "peakutils", "pandas", "pysam", "regex", "PyPDF2", "reportlab", "rpy2", "generatr", "biopython" ], "requires_python": "", "summary": "Automated DNA micro-satellite genotyping.", "version": "0.3" }, "last_serial": 3617813, "releases": { "0.3": [ { "comment_text": "", "digests": { "md5": "f8e6880f4558b026e21c12cf1262b9c2", "sha256": "caabf9322dbc94bbc99435135bde9768347a517df793cb97ee0a54c49740a21e" }, "downloads": -1, "filename": "ScaleHD_ALSPAC-0.3-py2-none-any.whl", "has_sig": false, "md5_digest": "f8e6880f4558b026e21c12cf1262b9c2", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 66667, "upload_time": "2018-02-26T15:58:31", "url": "https://files.pythonhosted.org/packages/cd/93/6ed64e3dc8181daf0ff79710d6503b8e92b17f8610c69515d90059aa149b/ScaleHD_ALSPAC-0.3-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "707319c98caf5a39a9f4c2c20979412f", "sha256": "c272ffbab8fe52a500fcad5e91ebcdf4a407e195873b7b52ac93dfbe73bb91bf" }, "downloads": -1, "filename": "ScaleHD_ALSPAC-0.3.tar.gz", "has_sig": false, "md5_digest": "707319c98caf5a39a9f4c2c20979412f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 61331, "upload_time": "2018-02-26T15:58:33", "url": "https://files.pythonhosted.org/packages/7b/a6/7277ceca763b87f3b8bbbce43771179e6e817520c20988db357ae1906fb9/ScaleHD_ALSPAC-0.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "f8e6880f4558b026e21c12cf1262b9c2", "sha256": "caabf9322dbc94bbc99435135bde9768347a517df793cb97ee0a54c49740a21e" }, "downloads": -1, "filename": "ScaleHD_ALSPAC-0.3-py2-none-any.whl", "has_sig": false, "md5_digest": "f8e6880f4558b026e21c12cf1262b9c2", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 66667, "upload_time": "2018-02-26T15:58:31", "url": "https://files.pythonhosted.org/packages/cd/93/6ed64e3dc8181daf0ff79710d6503b8e92b17f8610c69515d90059aa149b/ScaleHD_ALSPAC-0.3-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "707319c98caf5a39a9f4c2c20979412f", "sha256": "c272ffbab8fe52a500fcad5e91ebcdf4a407e195873b7b52ac93dfbe73bb91bf" }, "downloads": -1, "filename": "ScaleHD_ALSPAC-0.3.tar.gz", "has_sig": false, "md5_digest": "707319c98caf5a39a9f4c2c20979412f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 61331, "upload_time": "2018-02-26T15:58:33", "url": "https://files.pythonhosted.org/packages/7b/a6/7277ceca763b87f3b8bbbce43771179e6e817520c20988db357ae1906fb9/ScaleHD_ALSPAC-0.3.tar.gz" } ] }