{
"info": {
"author": "Hannes Luidalepp",
"author_email": "luidale@gmail.com",
"bugtrack_url": null,
"classifiers": [
"Development Status :: 4 - Beta",
"Environment :: Console",
"Intended Audience :: Science/Research",
"License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
"Operating System :: Unix",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.4",
"Programming Language :: Python :: 3.5",
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: Implementation :: PyPy",
"Topic :: Scientific/Engineering",
"Topic :: Scientific/Engineering :: Bio-Informatics"
],
"description": "starpa\n======\n\n.. image:: https://img.shields.io/pypi/v/starpa.svg\n :target: https://pypi.python.org/pypi/starpa\n :alt: Latest PyPI version\n\n.. image:: https://travis-ci.org/luidale/starpa.png\n :target: https://travis-ci.org/luidale/starpa\n :alt: Latest Travis CI build status\n\n**Stable RNA processing product analyzer**\n\nTool to predict, quantify and characterize stable RNA processing products\nfrom RNA-seq data.\n\nOverview\n--------\nStarpa workflow is divided into multiple consecutive tasks which can be executed separately, \nas a freely chosen successive subsets or all tasks at once in sequential order.\nThis adds flexibility to the tool to use as an input RNA-seq data in various state of processing.\nFor example Starpa can handle raw data in FastQ format, but also trimmed reads (FastQ format)\nor aligned reads in SAM format.\n\nBoth paired-end (PE) and single-end (SE) sequencing reads are accepted as an input.\n\nIn addition, the tool is highly configurable and can handle multiple libraries in parallel manner (multiprocessing).\n\n**Tasks are following:**\n\n- *trim*\n\nCutadapt is used to trim low quality 3' end of the reads followed by adapter removal from 3' end \nof the reads. \n\nIn case of SE, the reads where 3' adapter was not trimmed are excluded. \nThis ensures that 3' end of the read is stable RNA processing products is estimated with higher \nconfidence.\n\n- *align*\n\nBowtie2 is used to align reads to the genome. All matches to the genome are recorded.\n\n- *sam_sort*\n\nFrom aligned reads the unmapped and discordantly mapped reads are discarded. In addition, only the reads belonging to \nbest stratum (class of alignment score) are retained while alignments with lower alignments score \nare excluded.\n\n- *pseudoSE*\n\nAlignments with too many mismatches and reads with too many genomic alignments are discarded.\nAll other reads get NH tag (if not present) describing the number of reported alignments. 
\nThe sequence and quality fields of secondary alignments are filled with sequence and quality data.\nFinally, the PE reads are converted to pseudo SE reads to ease subsequent analysis steps. \n\n- *identify*\n\nFlaimapper-2 is used to predict stable RNA processing products. To ensure prediction of all\nprocessing products which share start or end positions, the reads are fractionated according \nto their length. Subsequently, Flaimapper-2 is run on each fraction of reads separately and \nthe predicted processing products are kept only if their read count (as estimated by \nFlaimapper-2) exceeds the set threshold. The filtered predicted processing products are quantified \nmore precisely via bedtools intersect.\n\n- *cluster*\n\nQuantified processing products are filtered once again: the read count (from bedtools intersect)\nmust exceed a threshold, as must the relative coverage (average coverage of reads assigned to processing products \ndivided by average coverage of all reads aligned to the positions of processing products).\nNext, the processing products from all analysed libraries are combined (identifying unique species) \nand clustered.\n\nClustering is a two-step process:\n\na) clustering by overlap\n\nAs the prediction of processing products by Flaimapper-2 is probabilistic, the predicted ends \nof the processing products might vary slightly between libraries, as might the true ends. \nTherefore, predicted processing products which largely overlap, with only some bases \n(adjustable) not overlapping, are clustered and representative processing products for the clusters \nare selected.\n\nb) clustering by sequence\n\nAs a majority of genomes contain repeated regions (repeat regions, rRNA operons, some tRNA genes etc.),\nreads can be mapped to multiple positions, resulting in multiple processing products consisting \nof the same or a similar set of reads.\nTo reduce the number of identical processing products, they are clustered by sequence identity \nvia CD-HIT-EST. 
Still the genomic matches of particular reads can be in genomic regions with different surrounding\nsequence/context (eg. different genes) therefore clustering solely based on sequence identity can result \nloss of information.\nTo avoid it the predicted processing products which cluster by sequence identity has to be supported by the \nclustering (again via CDI-HIT-EST) of the contigs they overlap with and representative processing product for the \nclusters are selected.\n\nIn addition, the contigs are identified and wig formatted files (containing coverage data of \nindividual libraries) are created.\n\n- *quantify*\n\nRepresentative processing products will be quantified using bedtools intersect in every library.\nAdditional characteristics will be gathered (relative coverage, coverage at single position level, \nconsensus sequence, quality of consensus sequence, genomic sequence, uniqueness). Quantification data\nis also converted to read per million of mapped reads (RPM), RPM of biotype and RPM of biotype groups.\n\nInstallation\n------------\n::\n\n pip install --user starpa\n\n\nRequirements\n^^^^^^^^^^^^\nStarpa is depending on following tools which have to be installed in your system:\n\n`Python3.4+ `_,\n`cutadapt `_,\n`bowtie2 `_,\n`samtools `_,\n`Flaimapper-2 `_,\n`bedtools `_,\n`CDI-HIT-EST `_.\n\nPython3 requires following packages which will be installed (if missing) during \nthe installation of starpa:\n\npyfaidx, docopt, schema\n\nCompatibility\n-------------\n**OS:**\n\nStarpa is compatible with UNIX like operating systems.\n\n**Input:**\n\n1) Colorspace reads are not supported.\n\n2) Both paired-end (PE) and single-end (SE) reads are supported.\n\nUsage\n-----\nUsage of starpa is as follows::\n\n Usage:\n starpa [-hv]\n starpa -s -e -c -i \n -o