{ "info": { "author": "Eric Talevich", "author_email": "eric.talevich@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Science/Research", "License :: OSI Approved :: BSD License", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: Implementation :: PyPy", "Topic :: Scientific/Engineering :: Bio-Informatics" ], "description": "======\nFammer\n======\n\nUtilities for curating a hierarchical set of sequence profiles representing a\nprotein superfamily.\n\nIf you use this software in a publication, please cite our paper that describes\nit:\n\n Talevich, E. & Kannan, N. (2013) Structural and evolutionary adaptation of\n rhoptry kinases and pseudokinases, a family of coccidian virulence factors.\n *BMC Evolutionary Biology* 13:117\n doi:10.1186/1471-2148-13-117\n\n Available at: http://www.biomedcentral.com/1471-2148/13/117\n\nFreely distributed under the permissive BSD 2-clause license (see LICENSE).\n\nInstallation\n------------\n\nThe installation consists of a Python library, ``fammerlib``, and two scripts,\n``fammer.py`` and ``tmalign.py``.\n\nDownload the .zip file and unpack it, or clone this Git repository, to get the\nsource code.\n\nTo use all the features of Fammer, you'll need the following third-party\nprograms installed:\n\n- Python_ 2.7\n- MAFFT_\n- HMMer_ 3.0\n- MAPGAPS_ (optional)\n- TMalign_ (optional) -- for structural alignments\n- FastTree_ (optional) -- for clustering\n\n.. _Python: http://www.python.org/download/\n.. _MAFFT: http://mafft.cbrc.jp/alignment/software/\n.. _HMMer: http://hmmer.janelia.org/\n.. _MAPGAPS: http://mapgaps.igs.umaryland.edu/\n.. _TMalign: http://cssb.biology.gatech.edu/skolnick/webservice/TM-align/index.shtml\n.. _FastTree: http://www.microbesonline.org/fasttree/\n\n.. For hackers, also PRANK: http://code.google.com/p/prank-msa/\n\nIf you're on a Debian-based Linux system, check your package manager for these\nfirst to save yourself some time::\n\n sudo apt-get install mafft hmmer tm-align fasttree python-pip\n\nThen, install the Python library dependencies and Fammer itself as follows.\n\nRecommended:\n````````````\n\nInstall the Python packaging system pip or setuptools. Then run the setup\nscript, and all Python dependencies will be pulled in::\n\n python setup.py build\n python setup.py install\n\n(You might need root privileges for the last step.)\n\nManual:\n```````\n\nInstall the Python libraries Biopython_, biofrills_, biocma_ and networkx_.\nThen run the setup script as above.\n\n.. _Biopython: http://biopython.org/wiki/Download\n.. _biofrills: https://github.com/etal/biofrills\n.. _biocma: https://github.com/etal/biocma\n.. _networkx: http://networkx.lanl.gov/\n\n\n\nBasic usage\n-----------\n\nGlobal options:\n\n ``-h``, ``--help``\n Show a help message and basic usage.\n ``--quiet``\n Don't print status messages, only warnings and errors.\n\nSub-commands:\n\n `build`_\n Build profiles from a given directory tree.\n `update-fasta`_\n Replace original FASTA sequence sets with the ungapped sequences from\n the corresponding alignment (.aln) files, sorted by decreasing length.\n `scan`_\n Scan and classify input sequences using a set of profiles.\n `add`_\n Scan a target database with the given HMM profile set. Add hits that\n meet acceptance thresholds to the profile FASTA files.\n `refine`_\n Leave-one-out validation of HMM profiles.\n `cluster`_\n Split a sequence set into clusters (based on phylogeny).\n\n\nCommands\n--------\n\nbuild\n`````\n\nConstruct a profile database from a directory tree of family profile alignments.\n\nAssume we have a directory tree set up under ``Superfamily/`` as above.\nNext, run ``fammer.py build Superfamily`` to align all sequence files with\nMAFFT, and (recursively up) align the consensus sequences of each subfamily\ntogether::\n\n Superfamily/\n Group1/\n subfam1_Group1.fasta\n subfam1_Group1.aln\n subfam2_Group1.fasta\n subfam2_Group1.aln\n subfam3_Group1.fasta\n subfam2_Group1.aln\n ...\n Group1-Unclassified.fasta\n Group1-Unclassified.aln\n Group1.aln\n ...\n Superfamily-Unclassified.fasta\n Superfamily-Unclassified.aln\n Superfamily.aln\n\nThe alignments are in un-wrapped Clustal format.\n\nYou can manually adjust the alignments and rebuild, if desired, perhaps\niteratively. Only the \"parent\" family alignments will be rebuilt as needed, e.g.\nif ``subfam1_Group1.aln`` is edited, then only ``Group1.aln`` and\n``Superfamily.aln`` will be rebuilt the next time ``fammer.py build\nSuperfamily`` is called because the consensus sequences that constitute those\nalignments may have changed. (It's like Make.)\n\nFinally, use the option ``--hmmer`` to build profiles::\n\n Superfamily/\n Group1/\n subfam1_Group1.fasta\n subfam1_Group1.aln\n subfam2_Group1.hmm\n ...\n Group1.aln\n Group1.hmm\n ...\n Superfamily.aln\n Superfamily.hmm\n Superfamily_all.hmm # concatenated profiles\n Superfamily_all.hmm.{h3f,h3i,h3m,h3p} # indexes from hmmpress\n\nThe ``--mapgaps`` option works similarly, if you have the necessary programs\ninstalled.\n\nThe ``--clean`` option can be included with any of the above commands to remove\nintermediate files.\n\nIf you have included PDB structures in your directory tree and have a structure\nalignment program installed, the ``--pdb`` option will first create a structural\nalignment of the PDBs in the directory, then use that alignment as the seed for\nhigher-up alignments::\n\n Superfamily/\n Group1/\n subfam1_Group1.fasta\n subfam1_Group1.aln\n 1ATP.pdb\n 1O6K.pdb\n 3C4X.pdb\n ...\n Group1.pdb.seq # Alignment of 1ATP, 1O6K, 3C4X\n Group1.aln\n ...\n Superfamily.aln\n\nIn this example, the alignment generated by aligning the structures 1ATP, 1O6K\nand 3C4X is passed to MAFFT as a seed for ``Group1.aln``, along with the\nunaligned consensus sequences of each subfamily of Group1 (subfam1, subfam2,\n...). The seed sequences are removed from Group1.aln after the alignment of\nconsensus sequences is completed. This can help correctly align the more\ndivergent families and groups to each other.\n\nFor nested directory trees, the option ``--tree`` generates a Newick file\nrepresenting the structure of the directory tree. A tree based on the above\nexamples would look something like this (ignoring whitespace), created as\n``Superfamily.nwk``::\n\n ((subfam1_Group1, subfam2_Group1, subfam3_Group1,\n Group1-Unclassified)Group1,\n (subfam1_Group2, subfam2_Group2, subfam3_Group2,\n Group2-Unclassified)Group2,\n (subfam1_Group3, subfam2_Group3, subfam3_Group3,\n Group3-Unclassified)Group3,\n Superfamily-Unclassified)Superfamily;\n\nThis tree could be passed to RAxML as a constraint tree in an effort to identify\ndeeper subfamilies, for example.\n\n\nupdate-fasta\n````````````\n\nConvert the contents of the ``.aln`` sequence alignment files back to unaligned\nFASTA format, overwriting the corresponding ``.fasta`` files.\n\nAfter initially building a tree of sequence alignments, you might edit the\nClustal alignments, deleting spurious sequences or trimming the alignment to the\nedges of a conserved domain. With ``update-fasta``, you update the contents of\nthe unaligned sequence files to match the ``.aln`` files.\n\nThe next step is usually to either (a) do some sequence processing unrelated to\nfammer, e.g. clustering, or (b) realign everything. Since you've presumably\nremoved some junk from the input sequences, the resulting alignments may be\nbetter.\n\n\nscan\n````\n\nScan/search a set of sequences (FASTA) with the HMM profile database and assign a\nclassification to each hit.\n\nThis is essentially a set of wrappers to process the output of ``hmmsearch``,\nsimplifying the results for common use cases. The three output forms are:\n\n **summary** (default):\n Print two formated columns for each profile in the given HMM profile\n database that matched at least one hit: the name of the profile and the\n number of hits for which it was the best match.\n **table** (``--table``):\n For each sequence in the target sequence set that matched a profile in\n the HMM profile database, print the sequence ID/accession and the name\n of the best-matching profile, separated by a tab character.\n **sequence sets** (``--seqsets``):\n For each profile and matching sequence set (as they'd appear in summary\n output), write a file containing the matching sequences. The output\n filenames indicate the name of the source sequence file name and the\n matching HMM profile names.\n\nNote that ``--table`` and ``--seqsets`` can be combined.\n\n\nadd\n```\n\nScan a target database with the given HMM profile set and add hits that meet\na series of acceptance thresholds to the profile FASTA files.\n\nOnce you've constructed profiles from a collection of carefully selected\nsequences representing each subfamily, you can use this command to scan another\nsequence set and automatically add strong hits to the corresponding profile\nsequence sets. The target database could be the ``*-Unclassified.fasta``\nsequence sets, to catch any classifiable members that were not noticed\ninitially, or a larger sequence database like **refseq_proteins**, if you're\nconfident in your coverage of the superfamily and want to improve the\nsensitivity of your profiles.\n\n\nrefine\n``````\n\nLeave-one-out validation of sequence profiles.\nUnlike the other commands, this is non-recursive.\n\nGiven a target subdirectory and the name of the subdirectory's\n``*-Unclassified.fasta`` file (if not specified, it looks for\n*dirname*-Unclassified.fasta), scan each subfamily's sequence set (``.fasta``)\nwith the corresponding HMM profile (``.hmm``), and also scan the\n``-Unclassified.fasta`` file with all the HMMs to obtain scores for each\nsequence and each profile. Then, compare the scores of sequences in a subfamily,\nstarting with the worst-scoring sequence, to the highest-scoring \"unclassified\"\nsequence by the same profile. If, for a given profile, a classified sequence\nscores worst than an unclassified one, mark the classified one for removal from\nthe sequence profile.\n\nNote that if a member of a known subfamily was mistakenly placed in\n``-Unclassified.fasta`` (i.e. was missed by the initial classification), then\nmany of the legitimate members of the subfamily profile could score worse than\nthis high-scoring \"unclassified\" sequence and be erroneously marked for removal\nfrom the profile. This is easy enough to spot in the logged output. One way to\navoid it is to first use the ``add`` command with ``-Unclassified.fasta`` as\nthe target, to catch and classify such sequences beforehand.\n\ncluster\n```````\n\nExtract clusters from a sequence set based on phylogenetic relationships.\n\nUses FastTree to quickly build a tree with branch support values, then extracts\nwell-supported clades from the tree and writes the corresponding sequence sets\nto FASTA files. Unclustered sequences are written to another \"Unique\" file.\nAlso writes a phyloXML tree file (.xml) showing clusters as colorized clades.\n\n\nBundled modules\n---------------\n\ntasks\n`````\n\nServes the basic purpose of a build tool like Make or Rake: compare the time\nstamps of input and output files at each step of the `build`_ process, and only\nrebuild the outputs that are out of date. Also track intermediate files to clean\nup after the process successfully completes. See `this blog post about it\n`_.\n\ntmalign\n```````\n\nAlign multiple structures using TMalign_ for pairwise alignments and a minimum\nspanning tree constructed from the pairwise TM-scores to assemble the pairwise\nalignments into a multiple sequence alignment. This module can also be used as a\ncommand-line script.\n\nHow to curate profiles\n----------------------\n\nDirectory tree is the superfamily hierarchy\n```````````````````````````````````````````\n\nBegin by creating a directory tree with each subfamily's representative\nsequences in an unaligned FASTA file. The FASTA file names must end in\n``.fasta``.\n\nA simple un-nested layout looks like::\n\n Superfamily/\n subfam1.fasta\n subfam2.fasta\n subfam3.fasta\n ...\n Superfamily-Unclassified.fasta\n\nTypically, a protein superfamily will have some members that cannot be cleanly\nclassified into subfamilies.\n\nRecursive nesting is also allowed (in fact, that's the point of this project).\nIt looks like this::\n\n Superfamily/\n Group1/\n subfam1_Group1.fasta\n subfam2_Group1.fasta\n subfam3_Group1.fasta\n ...\n Group1-Unclassified.fasta\n Group2/\n subfam1_Group2.fasta\n subfam2_Group2.fasta\n subfam3_Group2.fasta\n ...\n Group2-Unclassified.fasta\n Group3/\n subfam1_Group3.fasta\n subfam2_Group3.fasta\n subfam3_Group3.fasta\n ...\n Group3-Unclassified.fasta\n ...\n Superfamily-Unclassified.fasta\n\n\nAt least 2 sequences are needed to define each subfamily. The sequences are\ninitially retrieved and organized manually. External databases can help; for\nexample, KinBase provides a classification scheme and representative sequences\nfor the protein kinase superfamily. The directory hierarchy defines the higher\nlevels of organization of the superfamily.\n\nNow build the initial alignments::\n\n fammer.py build --clean Superfamily/\n\n\nTrim the tails\n``````````````\n\nFor best results, sequences should all cover the same conserved domain region\nand align exactly at N- and C-termini. For example, the eukaryotic protein\nkinase domain begins 7 residues before the glycine-rich loop (GxGxxG motif) and\nends 12 residues after a conserved arginine in subdomain XI (RP[TS] motif).\nThe consensus function used in Fammer includes the flanking sequences before\nthe first conserved block or after the last conserved block, regardless of\n\"gappiness\", so even one sequence in a set can be used to define domain\nboundaries for the whole set.\n\nEdit the subfamily-level alignments (.aln) to trim the sequences to the domain\nboundaries. I use Vim's block-selection mode (Ctrl-v); you might prefer\nJalView. Do not edit higher-level .aln files; just look at them to see which\nsubfamily-level alignments have extended tails, then edit those .aln files.\n\nWhen you think you've trimmed all the .aln files to the conserved domain\nregion, update the corresponding unaligned sequence files (.fasta) to match::\n\n fammer.py update-fasta Superfamily/\n\nRebuild the alignments whose source sequence sets have changed::\n\n fammer.py build --clean Superfamily/\n\nMAFFT may create better alignments for sets where the unalignable regions have\nnow been removed. When this step completes, examine the top-level alignment\n(e.g. Superfamily.aln) -- are the domain boundaries aligned cleanly across all\nsequence sets? Repeat the process if any subfamilies need to be trimmed further.\n\nIf you realize you've trimmed a profile too far, use your version control\nsystem (you are using Git or Mercurial to take snapshots of the tree, right?)\nto revert the .fasta file to an earlier version, then rebuild and try trimming\nagain.\n\nIf you never had the full-length sequences for a subfamily to begin with, try\nto find one representative full-length sequence from a database like UniProt,\nadd it to the .fasta file, realign, and trim that sequence to the region it\nshould cover. This won't improve the HMM or MAPGAPS profile for that subfamily\nmuch, but it will help the higher-level alignments that include the consensus\nsequence of that subfamily.\n\n\nPDB files for structural alignment\n``````````````````````````````````\n\nWhile MAFFT typically creates good alignments within a subfamily, for\nhigh-level profiles it may struggle to align the consensus sequences of very\ndivergent families or groups to each other. In these cases, seeding with a\nstructural alignment can help line up homologous regions.\n\nManually identify the high-quality solved crystal structures that correspond to\nfamilies in your tree, and place those PDB files in the directory tree at the\nsame level as the subfamily they represent.\n\nOpen the PDB file (.pdb) in a text editor and:\n\n- Remove the ATOM records corresponding to residues outside the conserved\n domain.\n- If multiple chains are present, choose the best, most complete chain and\n delete the others. (Otherwise, Fammer will take the first chain by default.)\n\nTo determine the domain boundaries, load a \"reference\" PDB (e.g. 1ATP for\nkinases) and the other PDB together in PyMOL and align using the command\n\"cealign\" or \"fit\". Visually find which residues correspond to the start and\nend of the conserved domain. If your PDB structure diverges from the reference\nstructure drastically before or after a certain point (i.e. N- or C-terminal\nregion of the domain is non-homologous -- excluding inserts), it may be best to\ntruncate the PDB to remove the non-homologous portion as it cannot be aligned\naccurately.\n\nSave the edited .pdb file, and rebuild the higher-level profiles::\n\n fammer.py build --clean Superfamily/\n\nNow that PDB files are present in the directory tree, Fammer will writes\nstructural alignments in FASTA format to the .pdb.seq files.\n\nOnce complete, examine the top-level alignment (e.g. Superfamily.aln) for\nmisalignments. To fix these, first edit the corresponding .pdb.seq file.\nUsually there is just one structure that was misaligned. Rebuild and re-edit\nas necessary. If a particular PDB is very poorly aligned to the others, it may\nbe best to just remove it altogether -- it may have a different conformation\nfrom the others.\n\n", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/etal/fammer", "keywords": null, "license": "UNKNOWN", "maintainer": null, "maintainer_email": null, "name": "fammer", "package_url": "https://pypi.org/project/fammer/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/fammer/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://github.com/etal/fammer" }, "release_url": "https://pypi.org/project/fammer/0.2/", "requires_dist": null, "requires_python": null, "summary": "Fammer: Tools for protein superfamily sequence profile alignment and curation.", "version": "0.2" }, "last_serial": 861257, "releases": { "0.2": [ { "comment_text": "", "digests": { "md5": "d856e4080ab4f65fb34b206e548a9ade", "sha256": "cdf389f4d4d1fc277a97d89558a030cb001e8623598dd104c3ad65763a815220" }, "downloads": -1, "filename": "fammer-0.2.tar.gz", "has_sig": false, "md5_digest": "d856e4080ab4f65fb34b206e548a9ade", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 26679, "upload_time": "2013-06-28T03:36:59", "url": "https://files.pythonhosted.org/packages/b4/82/35d814cf3347640b5555d1833be6e0b3584e0701cdc8f38604d1248c0996/fammer-0.2.tar.gz" }, { "comment_text": "", "digests": { "md5": "b7a4646a188418ea6ee07b1375de2ef8", "sha256": "4980c05973da1d770402b38eed06d0bf547d13b5dbbb8b3ccda87f9b1eea6cef" }, "downloads": -1, "filename": "fammer-0.2.zip", "has_sig": false, "md5_digest": "b7a4646a188418ea6ee07b1375de2ef8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 33462, "upload_time": "2013-06-28T03:42:49", "url": "https://files.pythonhosted.org/packages/b6/6b/d40a1935dfb63b4c63a46308bc340020b48967633d6314a61f7503eb0023/fammer-0.2.zip" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "d856e4080ab4f65fb34b206e548a9ade", "sha256": "cdf389f4d4d1fc277a97d89558a030cb001e8623598dd104c3ad65763a815220" }, "downloads": -1, "filename": "fammer-0.2.tar.gz", "has_sig": false, "md5_digest": "d856e4080ab4f65fb34b206e548a9ade", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 26679, "upload_time": "2013-06-28T03:36:59", "url": "https://files.pythonhosted.org/packages/b4/82/35d814cf3347640b5555d1833be6e0b3584e0701cdc8f38604d1248c0996/fammer-0.2.tar.gz" }, { "comment_text": "", "digests": { "md5": "b7a4646a188418ea6ee07b1375de2ef8", "sha256": "4980c05973da1d770402b38eed06d0bf547d13b5dbbb8b3ccda87f9b1eea6cef" }, "downloads": -1, "filename": "fammer-0.2.zip", "has_sig": false, "md5_digest": "b7a4646a188418ea6ee07b1375de2ef8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 33462, "upload_time": "2013-06-28T03:42:49", "url": "https://files.pythonhosted.org/packages/b6/6b/d40a1935dfb63b4c63a46308bc340020b48967633d6314a61f7503eb0023/fammer-0.2.zip" } ] }