{ "info": { "author": "Tali Raveh-Sadka, Shahar Azulay, and Ami Tavory", "author_email": "atavory@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Science/Research", "License :: OSI Approved :: BSD License", "Operating System :: OS Independent", "Programming Language :: C++", "Programming Language :: Python", "Programming Language :: Python :: 2.7", "Topic :: Scientific/Engineering", "Topic :: Scientific/Engineering :: Bio-Informatics" ], "description": "=================================================================\nallein_zu_haus: Needleman-Wunsch Quality-Aware Sequence Alignment\n=================================================================\n\n\nThis package implements a `Needleman\u2013Wunsch `_\nglobal alignment with `affine gap penalties `_, taking into account the confidence in each base pair of the read sequence.\n\nThe classic Needleman-Wunsch algorithm finds the optimal global match assuming all read nucleotides have been identified with certainty. In practice, \n`NGS (next generation sequencing) `_ identifies nucleotides with varying levels of quality, and popular formats, e.g., `FASTQ `_, are expressly built for describing these qualities. This package modifies the algorithm to take qualities into account (see mathematical_rationale_ below).\n\nThis package is designed to be used in conjunction with a read aligner such as `Bowtie2 `_ or `GEM `_. Ideally, these tools would be configured to take basepair quality into account when searching for hits. Since this is currently not the case, one can use the allein_zu_haus package to realign reads to reference sequences at the positions reported by the read aligner, and extract more accurate alignment scores and `CIGAR `_ strings.\n\nThe package handles ambiguous base codes by weighting the different options according to the base priors provided. 
Read sequences must be drawn from the {A, C, G, T, N} alphabet, but reference sequences may contain any basepairs defined by the IUPAC nucleotide ambiguity code. Quality values are ignored for ambiguous read basepairs.\n\n\nUsage example:\n--------------\n\n\nMinimal Example\n~~~~~~~~~~~~~~~\n\n.. code:: python \n\n import allein_zu_haus\n import numpy as np\n\n max_read_len = 100 # Max length of read to be aligned\n max_ref_len = 2 * max_read_len # Max length of reference subsequence to be globally aligned (where start position is determined by read aligner output)\n mismatch_penalty = 4 # Penalty for mismatched nucleotides\n gap_open = 6 # Gap opening penalty\n gap_extend = 3 # Gap extension penalty\n aligner = allein_zu_haus.Aligner(max_read_len, max_ref_len, mismatch_penalty, gap_open, gap_extend)\n\n read = np.array(['A', 'C', 'G', 'T', 'A'], dtype=bytes) # Read sequence, given as numpy array, dtype=bytes\n ref = np.array(['A', 'G', 'G', 'T', 'A'], dtype=bytes) # Reference subsequence to be aligned, given as numpy array, dtype=bytes\n read_bp_probs = np.array([0.9, 0.99, 0.8, 0.99, 0.99]) # Confidence in each read basepair. See below on how to extract such values from read quality strings\n base_probs = np.array([0.25] * 4) # Basepair prior probabilities. Here, assuming uniform distribution on nucleotides. \n # See the example below for more biologically relevant priors\n max_gaps = 2 # Maximal number of gaps allowed. Use small values to improve run time\n\n score, cigar = aligner.match(read, read_bp_probs, base_probs, ref, max_gaps) # Aligner returns alignment score and CIGAR string\n\n\nSetting Up Priors & Qualities\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nOne way of extracting more meaningful basepair priors is by using the nucleotide frequencies observed in reference sequences. Assuming _ref_seq is a string variable holding the reference genome of interest, this can be done by:\n\n.. 
code:: python\n \n import collections\n\n def get_priors():\n     # Relative frequency of each nucleotide in the reference string _ref_seq\n     return [(k, v / float(len(_ref_seq))) for (k, v) in collections.Counter(_ref_seq).items()]\n\n\nNaturally, more intricate priors can easily be designed.\n\nBasepair probabilities can be extracted from quality strings reported in FASTQ files / read aligner output using this formula: \n\n.. math::\n\n 1 - 10^{\\frac{-q}{10}} \n\nwhere q is the ASCII value of the quality character minus 33 (assuming qualities are given in Phred+33).\n\n.. code:: python\n\n import numpy as np\n\n def get_read_bp_probs(read_quality_string): # read_quality_string: FASTQ quality string as reported in FASTQ file or read aligner output\n     return 1 - np.power(10, -np.array([ord(e) - 33 for e in read_quality_string]) / 10.)\n \n\nThe function returns per-basepair probabilities (1 - P_error). For example, for quality string '??CII' the output will be approximately [0.999, 0.999, 0.9996, 0.9999, 0.9999].\n\n\nAligning multiple reads\n~~~~~~~~~~~~~~~~~~~~~~~\n\nThis package is optimized for multiple alignments, e.g., when iterating over the results of `GEM `_ or a FASTQ file, and checking the score of each result relative to some corresponding reference subsequence. For this reason, to use it, first create an object with the parameters relevant to all matches. \nSince reads usually have similar lengths, and since read aligners provide the position in the reference sequence to which the read was aligned, both the read length and the reference subsequence length can easily be bounded.\n\n.. code:: python \n\n import allein_zu_haus\n import numpy as np\n\n max_read_len = 100 \n max_ref_len = 2 * max_read_len\n mismatch_penalty = 4\n gap_open = 6\n gap_extend = 3\n aligner = allein_zu_haus.Aligner(max_read_len, max_ref_len, mismatch_penalty, gap_open, gap_extend)\n\n\nThen use the aligner repeatedly for each match, providing only the match-specific parameters:\n\n.. 
code:: python \n\n for read, read_quality_string, ref, max_gaps in ...:\n     # read, ref should be of type np.array with dtype=bytes\n     read_bp_probs = get_read_bp_probs(read_quality_string) # Function for extracting basepair confidence from FASTQ quality strings. See example above.\n     base_probs = get_priors() # Function for computing priors from reference and/or read sequences. See example above\n\n     score, cigar = aligner.match(read, read_bp_probs, base_probs, ref, max_gaps) # multiple calls to aligner, with relevant sequences and priors\n\n\n.. _mathematical_rationale:\n\nMathematical Rationale\n----------------------\n\nSuppose we wish to find the optimal match between a read *D* and a reference *F*. Unfortunately, we cannot observe *D* directly, and instead only see *D'*, which is the sequence output by some imperfect `sequencing process `_. Say that at some point in the alignment algorithm we consider whether a nucleotide from *D'* matches a nucleotide from *F*. Define:\n\n\n* *b* :sub:`D'` is the nucleotide reported by the sequencing process.\n* *b* :sub:`D` is the true (unknown) nucleotide.\n* *b* :sub:`R` is the reference nucleotide.\n\n\nThe classic algorithm would assign the penalty\n\n.. math::\n\n \\mbox{penalty}(b_{D'}, b_R)\n\nwhereas the correct penalty should be\n\n.. math::\n\n \\mbox{penalty}(b_{D}, b_R) \\simeq\n \\sum_b \\left[ P\\left( B_D = b | B_{D'} = b_{D'} \\right) \\cdot \\mbox{penalty}(b, b_R) \\right]\n\nBy `Bayes' Theorem `_,\n\n.. math::\n\n P\\left( B_D = b | B_{D'} = b_{D'} \\right) \n = \n \\frac\n {\n P\\left( B_{D'} = b_{D'} | B_D = b \\right) \n \\cdot\n P\\left( B_D = b \\right) \n }\n {\n \\sum_{k \\in \\{A, C, G, T\\}} \n P\\left( B_{D'} = b_{D'} | B_D = k \\right) \n \\cdot\n P\\left( B_D = k \\right) \n } \n\nFor evaluating these terms, note that \n\n.. math::\n\n P\\left( B_D = b \\right) \n\nis the prior over the nucleotides (which must be given by the user), and\n\n\n.. 
math::\n\n P\\left( B_{D'} = b_{D'} | B_D = b \\right) \n =\n \\begin{cases}\n 1 - P_{\\mbox{err}} ,& \\text{if } b_{D'} = b, \\\\\n \\frac{P_{\\mbox{err}}}{3}, & \\text{otherwise}\n \\end{cases}\n\n\nwhere *P* :sub:`err` is the probability for error determined by the reported quality for this nucleotide.\n\n\nIssues\n------\n\nFeel free to open tickets at `https://bitbucket.org/taliraveh/allein_zu_haus/issues `_.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "UNKNOWN", "keywords": null, "license": "License :: OSI Approved :: BSD License", "maintainer": null, "maintainer_email": null, "name": "allein_zu_haus", "package_url": "https://pypi.org/project/allein_zu_haus/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/allein_zu_haus/", "project_urls": { "Download": "UNKNOWN", "Homepage": "UNKNOWN" }, "release_url": "https://pypi.org/project/allein_zu_haus/0.1.4/", "requires_dist": null, "requires_python": null, "summary": "Needleman-Wunsch Quality-Aware Sequence Alignment", "version": "0.1.4" }, "last_serial": 2077550, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "2c655f53add97a1fec4f13f384038f77", "sha256": "6e1f310cdd9976a980c0ba26f3b6513113fcb07865247d8238ca8bf5ae343147" }, "downloads": -1, "filename": "allein_zu_haus-0.1.0.tar.gz", "has_sig": false, "md5_digest": "2c655f53add97a1fec4f13f384038f77", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 122296, "upload_time": "2016-03-08T12:49:14", "url": "https://files.pythonhosted.org/packages/4f/dd/292b2e5e27a2975b977c4a15ebdb270e7578932ec2ca6b2439c31b667bdf/allein_zu_haus-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "0744dc9370c350edffa87b5c78886ab4", "sha256": "d466c7976ec1135855aa833534134488689c898c7cc94425f2e143f56255432e" }, "downloads": -1, "filename": "allein_zu_haus-0.1.1-py2.7-linux-x86_64.egg", 
"has_sig": false, "md5_digest": "0744dc9370c350edffa87b5c78886ab4", "packagetype": "bdist_egg", "python_version": "2.7", "requires_python": null, "size": 478154, "upload_time": "2016-03-09T13:21:24", "url": "https://files.pythonhosted.org/packages/4e/8b/2a311f634e9b831b14daf75a31978f2a892a0ad63e896964a1f263dcc79c/allein_zu_haus-0.1.1-py2.7-linux-x86_64.egg" }, { "comment_text": "", "digests": { "md5": "87fc0273c7ec302176f8c0bfa4f1b231", "sha256": "a10de41fb9fb08f42d743774343de9e73b85e9930c6e89a0690b29223d611605" }, "downloads": -1, "filename": "allein_zu_haus-0.1.1.tar.gz", "has_sig": false, "md5_digest": "87fc0273c7ec302176f8c0bfa4f1b231", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 122500, "upload_time": "2016-03-09T13:21:33", "url": "https://files.pythonhosted.org/packages/9d/22/9046c462eebef388a8eb092f8a3f666514323153ac61d428b4e56ebf7482/allein_zu_haus-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "18734d361dc53e692d80b07acab0c1e9", "sha256": "da595a5aac366bd36ca8b98c1cf013344b50d59ad9c19c11f285edb6de0427b1" }, "downloads": -1, "filename": "allein_zu_haus-0.1.2-py2.7-linux-x86_64.egg", "has_sig": false, "md5_digest": "18734d361dc53e692d80b07acab0c1e9", "packagetype": "bdist_egg", "python_version": "2.7", "requires_python": null, "size": 479942, "upload_time": "2016-03-09T13:24:55", "url": "https://files.pythonhosted.org/packages/0e/9f/0ad69a5198ccf98a4d089a7d73ce141d74f1634956e55093bd87b7d80d28/allein_zu_haus-0.1.2-py2.7-linux-x86_64.egg" }, { "comment_text": "", "digests": { "md5": "a0951589b76f0931dc90ab4a7bf08dcd", "sha256": "e80331effb404d468092a5dc0b58ca11b929cc52862be771a7ba7f29b20e8fa2" }, "downloads": -1, "filename": "allein_zu_haus-0.1.2.tar.gz", "has_sig": false, "md5_digest": "a0951589b76f0931dc90ab4a7bf08dcd", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 123913, "upload_time": "2016-03-09T13:25:08", "url": 
"https://files.pythonhosted.org/packages/08/ad/a3d34cf64e7adffaac388f594ad73e099f78d2e034969d61a23c3f21652a/allein_zu_haus-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "946d394d025dd0d40a53dc21f8384371", "sha256": "057af830012f25e1df4be2112c6cbf1bed03f7ea19d020b772ad3f3eed45f988" }, "downloads": -1, "filename": "allein_zu_haus-0.1.3-py2.7-linux-x86_64.egg", "has_sig": false, "md5_digest": "946d394d025dd0d40a53dc21f8384371", "packagetype": "bdist_egg", "python_version": "2.7", "requires_python": null, "size": 480664, "upload_time": "2016-03-11T10:06:38", "url": "https://files.pythonhosted.org/packages/ec/fa/823dba205a7c4530a074e56d4ea92d3f15a33762891491adb35ec5033f16/allein_zu_haus-0.1.3-py2.7-linux-x86_64.egg" }, { "comment_text": "", "digests": { "md5": "f3127981103c1f835397f3fbcb6a3dfa", "sha256": "5605481e3ecc86e60cd19af78b38b6fa48cc2396af29020c9a9893e110100408" }, "downloads": -1, "filename": "allein_zu_haus-0.1.3.tar.gz", "has_sig": false, "md5_digest": "f3127981103c1f835397f3fbcb6a3dfa", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 124391, "upload_time": "2016-03-11T10:06:47", "url": "https://files.pythonhosted.org/packages/d3/19/ed16ad0d0cb8a48240fd1082a0b4d84f04af330d633934554ac245568c23/allein_zu_haus-0.1.3.tar.gz" } ], "0.1.4": [] }, "urls": [] }