{ "info": { "author": "Hannes Luidalepp", "author_email": "luidale@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Console", "Intended Audience :: Science/Research", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "Operating System :: Unix", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: Implementation :: PyPy", "Topic :: Scientific/Engineering", "Topic :: Scientific/Engineering :: Bio-Informatics" ], "description": "starpa\n======\n\n.. image:: https://img.shields.io/pypi/v/starpa.svg\n :target: https://pypi.python.org/pypi/starpa\n :alt: Latest PyPI version\n\n.. image:: https://travis-ci.org/luidale/starpa.png\n :target: https://travis-ci.org/luidale/starpa\n :alt: Latest Travis CI build status\n\n**Stable RNA processing product analyzer**\n\nTool to predict, quantify and characterize stable RNA processing products\nfrom RNA-seq data.\n\nOverview\n--------\nStarpa workflow is divided into multiple consecutive tasks which can be executed separately, \nas a freely chosen successive subsets or all tasks at once in sequential order.\nThis adds flexibility to the tool to use as an input RNA-seq data in various state of processing.\nFor example Starpa can handle raw data in FastQ format, but also trimmed reads (FastQ format)\nor aligned reads in SAM format.\n\nBoth paired-end (PE) and single-end (SE) sequencing reads are accepted as an input.\n\nIn addition, the tool is highly configurable and can handle multiple libraries in parallel manner (multiprocessing).\n\n**Tasks are following:**\n\n- *trim*\n\nCutadapt is used to trim low quality 3' end of the reads followed by adapter removal from 3' end \nof the reads. \n\nIn case of SE, the reads where 3' adapter was not trimmed are excluded. \nThis ensures that 3' end of the read is stable RNA processing products is estimated with higher \nconfidence.\n\n- *align*\n\nBowtie2 is used to align reads to the genome. All matches to the genome are recorded.\n\n- *sam_sort*\n\nFrom aligned reads the unmapped and discordantly mapped reads are discarded. In addition, only the reads belonging to \nbest stratum (class of alignment score) are retained while alignments with lower alignments score \nare excluded.\n\n- *pseudoSE*\n\nAlignments with too many mismatches and reads with too many genomic alignments are discarded.\nAll other reads get NH tag (if not present) describing the number of reported alignments. \nSequence and quality fields of secondary alignments are filled with sequence and quality data.\nIn the end the PE reads are converted to pseudo SE reads to ease subsequent analysis steps. \n\n- *identify*\n\nFlaimapper2 is used to predict stable RNA processing products. To ensure prediction of all\nprocessing products which share start or end positions, the reads are fractionated according \nto their length. Subsequently, Flaimmper2 is run on each fraction of reads separately and \nthe predicted processing products are filtered by the read count (estimation by \nFlaimapper-2) exceeding threshold set. The filtered predicted processing products are quantified \nmore precisely via bedtools intersect.\n\n- *cluster*\n\nQuantified processing products are filtered once again by the read counts (bedtools intersect)\nexceeding threshold and by relative coverage (average coverage of reads assigned to processing products \ndivided by average coverage of all reads aligned to the positions of processing products).\nNext, the processing products from all libraries analysed are combined (identifying unique species) \nand clustered.\n\nClustering is two step process:\n\na) clustering by overlap.\n\nAs the prediction of processing products by Flaimapper-2 is probabilistic, the predicted ends \nof the processing products in different libraries might slightly vary, as also the true ends. \nTherefore, the predicted processing products which do largely overlap and have some bases \n(adjustable) not overlapping are clustered and representative processing products for clusters \nare selected.\n\nb) clustering by sequence\n\nAs a majority of genomes contain repeating regions (repeat regions, rRNA operons, some tRNA genes etc)\nreads can be mapped to multiple positions resulting multiple processing products consisting \nfrom the same or similar set of reads.\nTo reduce the number of identical processing products they are clustered by sequence identity \nvia CDI-HIT-EST. Still the genomic matches of particular reads can be in genomic regions with different surrounding\nsequence/context (eg. different genes) therefore clustering solely based on sequence identity can result \nloss of information.\nTo avoid it the predicted processing products which cluster by sequence identity has to be supported by the \nclustering (again via CDI-HIT-EST) of the contigs they overlap with and representative processing product for the \nclusters are selected.\n\nIn addition, the contigs are identified and wig formatted files (containing coverage data of \nindividual libraries) are created.\n\n- *quantify*\n\nRepresentative processing products will be quantified using bedtools intersect in every library.\nAdditional characteristics will be gathered (relative coverage, coverage at single position level, \nconsensus sequence, quality of consensus sequence, genomic sequence, uniqueness). Quantification data\nis also converted to read per million of mapped reads (RPM), RPM of biotype and RPM of biotype groups.\n\nInstallation\n------------\n::\n\n pip install --user starpa\n\n\nRequirements\n^^^^^^^^^^^^\nStarpa is depending on following tools which have to be installed in your system:\n\n`Python3.4+ `_,\n`cutadapt `_,\n`bowtie2 `_,\n`samtools `_,\n`Flaimapper-2 `_,\n`bedtools `_,\n`CDI-HIT-EST `_.\n\nPython3 requires following packages which will be installed (if missing) during \nthe installation of starpa:\n\npyfaidx, docopt, schema\n\nCompatibility\n-------------\n**OS:**\n\nStarpa is compatible with UNIX like operating systems.\n\n**Input:**\n\n1) Colorspace reads are not supported.\n\n2) Both paired-end (PE) and single-end (SE) reads are supported.\n\nUsage\n-----\nUsage of starpa is as follows::\n\n Usage:\n starpa [-hv]\n starpa -s -e -c -i \n -o \n\n Arguments:\n\n task to start with\n tast to end with\n configuration file\n input folder\n output folder\n Options:\n -v, --version\n -h, --help\n -s , --start=\n -e , --end=\n -c , --config=\n -i , --input=\n -o , --output=\n\n|\n\n**Tasks**\n\nStarpa work-flow is divided into multiple consecutive tasks which can be executed:\n\n- separately\n- as a freely chosen successive subsets \n- all at once in sequential order\n\nTasks in sequential order:\n\n\ttrim, align, sam_sort, pseudoSE, identify, cluster, quantify\n\n**Configuration file**\n\n`Configuration file `_ \nis used to set various parameters which allow to adjust the \nperformance of the work-flow according to the user needs and input data.\nThe description of each parameter is given in the file itself.\n\nConfiguration file states also the location of following files:\n\nadapter files - adapter sequencies in fasta format\n\ngenome file - genome sequence in fasta format\n\nannotation file - in GFF or GFF3 format.\n\n`\"flaimapper parameter file\" `_ -\ndescribed in more deteil `here `_. Given Flaimapper-2 parameters file is adjusted to be suitable to predict processing products with rather defined ends.\n\n`\"library_file\" `_ - \ndescribing libraries to be analysed.\n\n\"library_file\" is a tabular file containing:\n 1) the name of the libraries\n\n 2) conditions they are derived from and \n\n 3) identifier of replicate \n\n(note that all three columns are separated by tab)\n\n::\n\n #Library number\tSample\tReplicate\n library1\tLB OD 0.4\tI\n library2\tLB OD 0.4\tII\n\n| \n\n`Configuration file `_,\n`\"flaimapper parameter file\" `_ and\n`\"library_file\" `_ are available in:\n\n::\n\n src/starpa/data\n\n|\n\n\n**Input folder**\n\nWhile running a single or multiple tasks, the input folder has to contain specific data \nrequired for the first task. \nFor the following task the preceding tasks will prepare proper data.\n\nEach task has different requirements for the input data:\n\n- *trim*\n\n| Sequencing data in `FastQ format `_.\n| Can be in PE or SE format which has to be indicated in \n `configuration file `_ .\n| FastQ files can be compressed as \".gz\", \".bz2\" or \".xz\".\n\n\n- *align*\n\n| Trimmed and cleaned reads in `FastQ format `_.\n| Can be in PE or SE format which has to be indicated in \n `configuration file `_ .\n| FastQ files can be compressed as \".gz\" (requires bowtie2.3.1+)\n\n\n- *sam_sort*\n\n| Aligned reads in SAM format. \n| Can be in PE or SE format which has to be indicated in \n `configuration file `_ .\n\n| BAM format is not currently supported.\n\n\n- *pseudoSE*\n\n| Aligned reads in SAM format. \n| Can be in PE or SE format which has to be indicated in \n `configuration file `_ .\n| File can not be sorted by position.\n\n| BAM format is not currently supported.\n\n\n- *identify*\n\n| Aligned SE or pseudoSE reads in SAM format. \n| Reads require NH tag to describe the number of reported alignments.\n\n| BAM format currently not supported.\n\n\n- *cluster*\n\n| Identified and quantified predicted processing products in BED format \n| (quantification at column #6).\n\n| folder bam:\n| \tAligned SE or pseudoSE reads in BAM format.\n| \tReads require NH tag to describe the number of reported alignments.\n\n| If task \"quantify\" will be also executed:\n| \tAdditional input folder (given by parameter \"quantify_sam_file_location\"):\n| \t\tAligned SE or pseudoSE reads in SAM format \n| \t\t(BAM format currently not supported).\n| \t\tReads require NH tag to describe the number of reported alignments.\n\n\n- *quantify*\n\n| Predicted processing products in BED format (preferentially representatives form clustering).\n\n| Additional input folder (given by parameter \"quantify_sam_file_location\"):\n|\tAligned SE or pseudoSE reads in SAM format (BAM format currently not supported).\n|\tReads require NH tag to describe the number of reported alignments.\n\n\n**Output folder**\n\nOutput folder will contain parameter folder:\n\n::\n\n parameters/\n\teg. config.txt\t\t\t-\tcopy of configuration file\n\targuments.txt\t\t\t-\tcommand line arguments\n\teg. libraries.txt\t\t-\tcopy of library file\n\teg. parameters.dev-2-100-2.txt\t-\tcopy of Flaimapper-2 parameter file\n \n\nEach task creates a subfolder with its name containing specific output \nof the task.\n\n| XXX - library name\n| strand - For or Rev\n| Y -\torder number of fragmented read group\n\n\n- *trim*\n\n::\n\n trim_info/\n\tXXX_triminfo.log\t-\tlog of task\n\tXXX_triminfo.error\t-\tcollected errors during trimming\n\n PE:\n discard/\n\tXXX_1_short.fq\t\t-\tforward reads discared while being too short after\n\t\t\t\t\ttrimming\n\tXXX_2_short.fq\t\t-\treverse reads discared while being too short after\n\t\t\t\t\ttrimming\n\t\t\t\t\t\t\t\n XXX_trim_1.fq\t\t\t-\ttrimmed forward reads\n XXX_trim_2.fq\t\t\t-\ttrimmed reverse reads\n\n SE:\n discard/\n\tXXX_short.fq\t\t-\treads discarded while being too short after \n\t\t\t\t\ttrimming\n\tXXX_untrimmed.fq\t-\treads discarded while having no adapter trimmed\n\t\n XXX_trim.fq\t\t\t-\ttrimmed reads\n\n- *align*\n\n::\n\n align_info/\n\tXXX_aligninfo.log\t-\tlog of task\n\t\n XXX.sam\t\t\t-\taligned reads\n\n- *sam_sort*\n\n::\n\n sort_info/\n\tXXX_sortinfo.log\t-\tlog of task\n\t\n XXX_unmapped.sam\t\t-\tunmapped reads\n XXX_sort.sam\t\t\t-\tprocessed reads\n\n- *pseudoSE*\n\n::\n\n pseudoSE_info/\n\tXXX_pseudoSEinfo.log\t\t-\tlog of task\n\t\n mismatched/\n\tXXX_pseudoSE_mismatch.sam\t-\treads discarded while having too many\n\t\t\t\t\t\tmismatches\n\t\t\t\t\t\t\t\t\t\t\n too_many_matches/\n\tXXX_pseudoSE_multimatch.sam\t-\treads discarded while haveing too many\n\t\t\t\t\t\tgenomic matches\n\t\t\t\t\t\t\t\t\t\t\n XXX_pseudoSE.sam\t\t\t-\tprocessed reads\n\t\n If oligoA allowed:\n oligoA/\n\tXXX-oligoA-mm_pseudoSE.sam\t-\treads with 3' oligoA (non-genome \n\t\t\t\t\t\tencoded) which would have otherwise \n\t\t\t\t\t\tdiscarded\n\tXXX-oligoA-pseudoSE.sam\t\t-\treads with 3' oligoA (non-genome\n\t\t\t\t\t\tencoded)\n\t\n- *identify*\n\n::\n\n flaimapper/\t\t\t\t\t\t\n\tflaimapper_info/\n\t\tXXX/\n\t\t\tXXX_strand_Y_flaimapper.information\t-\tlog of flaimapper\n\t\t\t\n\tflaimapper_temp/\n\t\tXXX/\n\t\t\tXXX_strand_Y_flaimapper.tab\t\t-\tflaimapper predicitons\n\t\t\t\n bam/\n\tXXX_strand.bam\t\t\t\t\t\t-\tstrand-wise sorted reads \n\t\t\t\t\t\t\t\t\tfrom input\n\tXXX_strand.bam.bai\t\t\t\t\t-\tindex of of bam file\n\tXXX_strand.sam \t\t\t\t\t\t-\tNOT NEEDED\n\t\n identify_info/\n\t XXX_strand_identifyinfo.log\t\t\t\t-\tlog of task\n\t \n XXX_strand_pp.BED\t\t\t\t\t\t-\tNOT NEEDED\n XXX_strand_pp_counted.BED\t\t\t\t\t-\tpredicted processing \n\t\t\t\t\t\t\t\t\tproducts with \n\t\t\t\t\t\t\t\t\tquantification\n\n\t\t\t\n- *cluster*\n\n::\n\n cd_hit_est/\n\tpp_cd_hit_est.info\t\t-\tlog of sequence identity based clustering \n\t\t\t\t\t\tof combined and overlap clustered predicted\n\t\t\t\t\t\tprocessing products via CD-HIT-EST\n\tpp_combined.cdhit\t\t-\tgenomic sequence of combined and overlap \n\t\t\t\t\t\tclustered predicted processing products\n\tpp_combined.cdhit.clstr\t\t-\tclusters of combined and overlap clustered\n\t\t\t\t\t\tpredicted processing products created via\n\t\t\t\t\t\tCD-HIT-EST\n\t\t\t\t\t\t\t\t\t\n contigs/\n\tXXX_contigs.BED\t\t\t-\tlist of contigs identified\n\tXXX/\n\t\tcontig_name.fasta\t-\tsequences of all reads belonging to the\n\t\t\t\t\t\tcorresponding contigs\n\t\tcontig_name.sam\t\t-\tall reads belonging to the\n\t\t\t\t\t\tcorresponding contigs\n\t\t\t\t\t\t\t\t\t\n contigs_meta/\n\tcombined_contigs_meta.BED\t-\tcombined contigs to be used to create \n\t\t\t\t\t\tmetacontigs from all libraries\n\tXXX_contigs_meta.BED\t\t-\tlist of contigs to be used to created\n\t\t\t\t\t\tmetacontigs\n\tmetacontig_cd_hit_est.info\t-\tlog of sequence identity based clustering \n\t\t\t\t\t\tof metacontigs via CD-HIT-EST\n\tmetacontigs.cdhit\t\t-\tgenomic sequence of metacontigs\n\tmetacontigs.cdhit.clstr\t\t-\tclusters of metacontigs created via\n\t\t\t\t\t\tCD-HIT-EST\n\tmetacontigs.BED\t\t\t-\tlist of metacontigs in bed format\n\tpp_to_metacontig.BED\t\t-\tcombined and overlap clustered predicted\n\t\t\t\t\t\tprocessing product match with metacontigs\n\t\t\t\t\t\tin BED-like format\n\t\t\t\t\t\t\t\t\t\n mpileup/\n\tXXX_strand_mpileup.info\t\t-\tlog of bedtools mpileup\n\t\n wig/\n\tXXX_strand.wig\t\t\t-\tstrand specific absolute read coverage\n\tXXX_strand_RPM.wig\t\t-\tstrand specific relative read coverage\n\t\t\t\t\t\tas read per million mapped reads (RPM)\n\t\t\t\t\t\t\t\t\t\n pp_clusterinfo.log\t\t\t-\tlog of task\n pp_unique.library_info\t\t\t-\tcombined predicted processing \n\t\t\t\t\t\tproducts and the origins of libraries\n pp_combined.BED\t\t\t-\trepresentatives of combined and overlap \n\t\t\t\t\t\tclustered predicted processing products \n\t\t\t\t\t\tin BED format\n pp_combined.cluster\t\t\t-\toverlap clusters of combined predicted \n\t\t\t\t\t\tprocessing products\n pp_combined.library_info\t\t-\trepresentatives of combined and overlap \n\t\t\t\t\t\tclustered predicted processing \n\t\t\t\t\t\tproducts and the origins of libraries\n pp_metacontig.BED\t\t\t-\trepresentatives of predicted processing\n\t\t\t\t\t\tproducts from pp_combined.BED clustered\n\t\t\t\t\t\tby sequence identity supported by \n\t\t\t\t\t\tmetacontig clustering in BED format\n pp_metacontig.cluster\t\t\t-\tsequence identity clusters of predicted \n\t\t\t\t\t\tprocessing products from pp_combined.BED\n\t\t\t\t\t\tsupported by metacontig clustering\n\n- *quantify*\n\n::\n\n libraries/\t\t\t\t\t-\tdata in library wise\n\tXXX.biotype_annotation.statistics\t-\tread alignement statistics\n\t\t\t\t\t\t\tby annotation biotypes\n\tXXX.gene_annotation.statistics\t\t-\tread alignement statistics\n\t\t\t\t\t\t\tby genes\n\tpp_metacontig_XXX_counted.BED\t\t-\tabsolute quantification of \n\t\t\t\t\t\t\tpredicted processing products \n\t\t\t\t\t\t\tin BED format\n\t\t\t\t\t\t\t\t\t\t\t\t\t\n collected.annotation2.statistics \t\t-\tcombined alignement\tstatistics\n\t\t\t\t\t\t\tby annotation biotypes\n pp_metacontig_biotype.BED\t\t\t-\tpredicted processing products\n\t\t\t\t\t\t\twith biotype in BED-like format\n pp_metacontig_biotype_match.BED\t\t-\tpredicted processing products\n\t\t\t\t\t\t\tmatch with genes in BED-like \n\t\t\t\t\t\t\tformat\n pp_metacontig_counts_total.tsv\t\t\t-\tabsolute quantification of \n\t\t\t\t\t\t\tpredicted processing products \n\t\t\t\t\t\t\tin BED format\n pp_metacontig_counts_RPM.tsv\t\t\t-\trelative quantification of \n\t\t\t\t\t\t\tpredicted processing products\n\t\t\t\t\t\t\tas read per million mapped reads\n\t\t\t\t\t\t\t(RPM) in BED format\n pp_metacontig_counts_biotype_RPM.tsv\t\t-\trelative quantification of \n\t\t\t\t\t\t\tpredicted processing products\n\t\t\t\t\t\t\tas RPM of biotype in BED format\n pp_metacontig_counts_groupped_biotype_RPM.tsv\t-\trelative quantification of \n\t\t\t\t\t\t\tpredicted processing products\n\t\t\t\t\t\t\tas RPM of biotype groups in BED \n\t\t\t\t\t\t\tformat\n pp_metacontig_cons_qual.tsv\t\t\t-\tquality of consensus sequence \n \t\t\t\t\t\t\tof predicted processing products\n\t\t\t\t\t\t\texpressed as frequency of the most\n\t\t\t\t\t\t\tabundant base in a given position\n pp_metacontig_cons_seq.tsv\t\t\t-\tconsensus sequence of predicted \n\t\t\t\t\t\t\tprocessing products\n pp_metacontig_coverage.tsv\t\t\t-\tcoverage of reads assigned to \n\t\t\t\t\t\t\tpredicted processing products \n\t\t\t\t\t\t\tat single position level\n pp_metacontig_genomic_seq.tsv\t\t\t-\tgenomic sequence of predicted \n\t\t\t\t\t\t\tprocessing products \n pp_metacontig_rel_cov.tsv\t\t\t-\trelative coverage of predicted \n\t\t\t\t\t\t\tprocessing products\n pp_metacontig_uniqness.tsv\t\t\t-\tmean number of genomic genomic \n\t\t\t\t\t\t\tmatches of reads assigned\n\t\t\t\t\t\t\tto the predicted processing \n\t\t\t\t\t\t\tproducts\n\nTo do\n-------------\n\nLicence\n-------\n`GNU General Public License v3.0 `_\n\nAuthors\n-------\n`starpa` was written by `Hannes Luidalepp `_", "description_content_type": "", "docs_url": null, "download_url": "https://github.com/luidale/starpa/archive/v0.3.0.tar.gz", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/luidale/starpa", "keywords": "", "license": "The GNU General Public License v3.0 (GPL3) - http://www.gnu.org/copyleft/gpl.html", "maintainer": "", "maintainer_email": "", "name": "starpa", "package_url": "https://pypi.org/project/starpa/", "platform": "", "project_url": "https://pypi.org/project/starpa/", "project_urls": { "Download": "https://github.com/luidale/starpa/archive/v0.3.0.tar.gz", "Homepage": "https://github.com/luidale/starpa" }, "release_url": "https://pypi.org/project/starpa/0.3.0/", "requires_dist": null, "requires_python": "", "summary": "Stable RNA processing product analyzer", "version": "0.3.0" }, "last_serial": 3835555, "releases": { "0.2.1": [ { "comment_text": "", "digests": { "md5": "da9a07166d7d5d17dbf9a164709b0581", "sha256": "bd625187f7b9d4cd45537dfa1bdc753e20c91a2af01a0658f110e612d186a292" }, "downloads": -1, "filename": "starpa-0.2.1.tar.gz", "has_sig": false, "md5_digest": "da9a07166d7d5d17dbf9a164709b0581", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 66890, "upload_time": "2018-03-23T01:15:47", "url": "https://files.pythonhosted.org/packages/6d/14/eb427760e36c125a0aa3eb34507485ad1e09bf320f9f5218178768982687/starpa-0.2.1.tar.gz" } ], "0.3.0": [ { "comment_text": "", "digests": { "md5": "3de9d93c0bd08cba60c716ddafea329f", "sha256": "c8e7d7b5f7ddd8b569b3a378cc752c08b7a234ece018472fc2804a361deae675" }, "downloads": -1, "filename": "starpa-0.3.0.tar.gz", "has_sig": false, "md5_digest": "3de9d93c0bd08cba60c716ddafea329f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 69014, "upload_time": "2018-05-04T20:06:31", "url": "https://files.pythonhosted.org/packages/00/bc/8094ec1d23f3c00d12f44aa753d742e7d48e993e0f0c16cdfb20634ca853/starpa-0.3.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "3de9d93c0bd08cba60c716ddafea329f", "sha256": "c8e7d7b5f7ddd8b569b3a378cc752c08b7a234ece018472fc2804a361deae675" }, "downloads": -1, "filename": "starpa-0.3.0.tar.gz", "has_sig": false, "md5_digest": "3de9d93c0bd08cba60c716ddafea329f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 69014, "upload_time": "2018-05-04T20:06:31", "url": "https://files.pythonhosted.org/packages/00/bc/8094ec1d23f3c00d12f44aa753d742e7d48e993e0f0c16cdfb20634ca853/starpa-0.3.0.tar.gz" } ] }