{ "info": { "author": "Zebulun Arendsee", "author_email": "zbwrnz@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "[![stable](http://badges.github.io/stability-badges/dist/stable.svg)](http://github.com/badges/stability-badges)\n[![Build Status](https://travis-ci.org/incertae-sedis/smof.svg?branch=master)](https://travis-ci.org/incertae-sedis/smof) [![Docker Docker build](https://img.shields.io/docker/cloud/build/incertaesedis/smof.svg)](https://hub.docker.com/r/incertaesedis/smof/) [![docker pulls](https://img.shields.io/docker/pulls/incertaesedis/smof.svg)](https://hub.docker.com/r/incertaesedis/smof/)\n[![Build Status](https://travis-ci.org/incertae-sedis/smof.svg?branch=master)](https://travis-ci.org/incertae-sedis/smof) \n![PyPI](https://img.shields.io/pypi/v/smof.svg)\n[![DOI](https://zenodo.org/badge/19203682.svg)](https://zenodo.org/badge/latestdoi/19203682)\n\nsmof\n====\n\nUNIX-style FASTA tools\n\nInstallation\n============\n\n```\npip install smof\n```\n\nFunctions\n=========\n\n`smof` is divided into the following subcommands:\n\n | subcommand | description |\n | ---------- | ----------------------------------------------------- |\n | `cut` | emulates UNIX cut command, where fields are entries |\n | `clean` | cleans fasta files |\n | `consensus` | finds the consensus sequence for aligned sequence |\n | `filter` | extracts sequences meeting the given conditions |\n | `grep` | roughly emulates the UNIX grep command |\n | `md5sum` | calculate an md5 checksum for the input sequences |\n | `head` | writes the first sequences in a file |\n | `permute` | randomly order sequence |\n | `reverse` | reverse each sequence (or reverse complement) |\n | `sample` | randomly select entries from fasta file |\n | `sniff` | extract info about the sequence |\n | `sort` | sort sequences |\n | `split` | split a fasta file into smaller 
files |\n | `stat` | calculate sequence statistics |\n | `subseq` | extract subsequence from each entry (revcomp if a\\' aa.faa zzz*\nrm zzz*\n```\n\nOr you can split a large file into many smaller files with a set maximum number\nof sequences per file\n```bash\nsmof split -qn 500 -p zzz aa.faa\ngrep -c '>' aa.faa zzz*\nrm zzz*\n```\n\n## `smof uniq`\n\nThis command corresponds roughly to GNU uniq, but entries are considered\nidentical only if both header and sequence are exactly the same. As currently\nimplemented, I don't find much use for this command.\n\n## `smof wc`\n\nOutputs the number of characters and entries in the fasta file. Generally `smof\nstat` is better.\n\n## `smof grep`\n\nWhereas GNU grep searches lines for matches, `smof grep` searches either the\nFASTA headers or the FASTA sequence.\n\nExtract the entries by matches to the header (default)\n\n``` bash\nsmof grep H312_03353 aa.faa\n```\n\nExtract entries by matches to a sequence \n\n```bash\nsmof grep --match-sequence SKSQ aa.faa\n# or equivalently\nsmof grep -q SKSQ aa.faa\n```\n\nYou can include flanking regions in the match\n```bash\n# match 3 residues downstream\nsmof grep -qA3 'SKSQ' aa.faa\n# match 3 residues upstream \nsmof grep -qB3 'SKSQ' aa.faa\n# match 3 residues up- and downstream \nsmof grep -qC3 'SKSQ' aa.faa\n```\n\nInclusion of flanking regions is particularly useful in tandem with the -o\noption, which extracts only the matching sequence\n```bash\nsmof grep -qoA3 'SKSQ' aa.faa\n```\n\nWrite the output in gff format\n```bash\nsmof grep -q --gff SKSQ aa.faa\n```\n\nYou can count the number of sequences with a match\n```bash\nsmof grep -qc SKS aa.faa\n```\n\nOr the total number of matches\n```bash\nsmof grep -qm SKSQ aa.faa\n```\n\nOr both\n```bash\nsmof grep -qmc SKS aa.faa\n```\n\nJust like in GNU grep, you can invert a search. 
This search finds all genes\nthat are not annotated as being hypothetical genes.\n```bash\nsmof grep -v hypothetical aa.faa\n```\n\nBy default `smof grep` is case insensitive (unlike GNU grep), but it can be\nmade case sensitive\n```bash\nsmof grep -I CoA aa.faa\n```\n\nYou can search using patterns in a file\n```bash\nsmof grep -f id-sample.txt aa.faa\n```\n\nThis, however, can be a little slow, since it searches each pattern in the file\nagainst the entire header. A much faster approach is to extract a search\npattern from the headers (or sequence) and then look up the header pattern in\nthe set of search patterns.\n\n```bash\nsmof grep -f id-sample.txt -w '\\| (\\S+) \\|' aa.faa\n```\n\nCount occurrences (on both strands) of a DNA pattern using IUPAC extended\nnucleotide alphabet.\n```bash\nsmof grep -qmbG YYNCTATAWAWASM aa.supercontigs.fna\n```\n\nYou can search using a sequence query\n```bash\n# select 5 random sequences\nsmof sample -n 5 aa.faa | smof subseq -b 5 35 > rand.faa\nsmof grep -q --fastain rand.faa aa.faa\n```\n\nOr you can search for identical sequences shared between two fasta files\n```bash\nsmof sample -n 5 aa.faa > rand.faa\nsmof grep -q --fastain rand.faa aa.faa \n```\n\nFind non-overlapping open reading frames of length greater than 100 codons.\nThis is meant as an example of regex searching. This will NOT give you a great\nanswer. smof does not consider frames (nor will it ever). It will not find the\nset of longest possible ORFs. If you want to identify ORFs, you should use a\nspecialized program. That said:\n\n``` bash\nsmof grep -qPb --gff 'ATG(.{3}){99,}?(TAA|TGA|TAG)' aa.supercontigs.fna\n```\n\n## `smof md5sum`\n\nThis tool is useful if you want a checksum for a FASTA file that is independent\nof format (e.g. 
column width or case).\n\n\nString manipulation commands\n============================\n\n## `smof permute`\n\nPermutes the letters of a sequence\n\n## `smof reverse`\n\nReverses a sequence (does NOT take the reverse complement)\n\n## `smof subseq`\n\n``` bash\n# extract a subsequence\nsmof grep H312_00003T0 aa.faa | smof subseq -b 10 20\n# color the subsequences instead\nsmof grep H312_00003T0 aa.faa | smof subseq -b 10 20 -c red\n```\n\nIf the start is higher than the end, and the sequence appears to be a DNA\nsequence, then smof will take the reverse complement.\n\n`smof subseq` can also read from a gff file. However, if you want to extract\nmany sequences from a fasta file using a gff file as a guide (or other gff/bed\nmanipulations), consider using a specialized tool such as `bedtools`.\n\n\nBiological sequence tools\n=========================\n\n## `smof clean`\n\nThis command can be used to tidy a sequence. You can change the column width,\nremove gaps and stops, convert all letters to one case and/or change irregular\ncharacters to unknowns. By default, it removes whitespace in a sequence and\nmakes uniform, 80-character columns.\n\n## `smof filter`\n\nOutput only sequences that meet a set of conditions.\n\nIf you want to only keep sequences that are longer than 100 letters\n\n```bash\nsmof clean -x aa.faa | smof filter -l 100\n```\n\nNote that I call clean before filtering to remove the stop character, which\nshould not be included when calculating length.\n\nOr shorter than 100 letters\n\n```bash\nsmof clean -x aa.faa | smof filter -s 100\n```\n\nOr that have greater than 50% AFILMVW content (hydrophobic amino acids)\n\n```bash\nsmof clean -x aa.faa | smof filter -c 'AFILMVW > .5'\n```\n\n## `smof sniff`\n\nThis command runs a number of checks on a FASTA file and is useful in\ndiagnosing problems. 
For details, run `smof sniff -h`.\n\n## `smof stat`\n\nThe default operation outputs the number of sequences and summary statistics\nconcerning the sequence lengths.\n\n```bash\nsmof stat aa.supercontigs.fna\n nseq: 431\n nchars: 12163397\n 5sum: 445 3301 9555 30563 746881\n mean(sd): 28221 (58445)\n N50: 71704\n```\n\n'5sum' refers to the five number summary of the sequence lengths (minimum, 25%\nquantile, median, 75% quantile, and maximum).\n\nStatistics can also be calculated on a sequence-by-sequence level, which by\ndefault outputs the sequence names (the first word of the header) and the\nsequence length, e.g.\n\n```bash\nsmof stat -q aa.supercontigs.fna | head\n```\n\nThere are many other options. Run `smof stat -h` for descriptions.\n\n\nCase study: exploring motifs in chloroplast genomes\n===================================================\n\nAlice is interested in the chloroplast *maturase* gene. Bob gives her a sample\ndataset which includes 10 fasta files of proteins encoded by the chloroplast\ngenomes of 10 different plant species. These files are available in the\n`sample-data/chloroplasts` directory.\n\nYou can find this dataset in the folder *doc/test-data/chloroplast-proteins*.\n\nHer first step is to explore the data. She first counts the sequences in each\nfile with a simple grep command.\n\n```\ngrep -c '>' *faa\n```\n\nNext she tests the sequences with `smof sniff`\n\n```\nsmof sniff *faa\n```\n\nProducing the following output:\n\n```\n578 uniq sequences (757 total)\nAll prot\nAll uppercase\nProtein Features:\n initial-Met: 755 99.7358%\n terminal-stop: 0 0.0000%\n internal-stop: 0 0.0000%\n selenocysteine: 0 0.0000%\nUniversal Features:\n unknown: 8 1.0568%\n ambiguous: 0 0.0000%\n gapped: 0 0.0000%\n```\n\nEverything looks pretty good. But two of the sequences don't start with a\nmethionine. Alice wants to find them. 
She does this using `smof grep` and a\nPerl regular expression.\n\n```\nsmof grep -qP '^[^M]' *faa\n```\n\nShe finds these genes are both from *Solanum lycopersicum* and are described in\nthe fasta headers as being *partial*.\n\nNow Alice wants to find the *maturase* genes by pulling out every entry with\n'maturase' in the fasta header.\n\n```\nsmof grep maturase *faa\nsmof grep maturase *faa > maturase.faa\n```\n\nFor a close look at the distribution of sequence lengths, Alice calls `smof\nstat`\n\n```\nsmof stat maturase.faa\n```\n\nAlice happens to be interested in the sequence WTQPQR from *Panicum virgatum*\nand would like to know what the homologous regions are in the other species.\n\nSo Alice aligns the maturase genes with\n[MUSCLE](http://nar.oxfordjournals.org/content/32/5/1792.short) and searches\nfor the motif using the GFF output option.\n\n```\nmuscle -quiet < maturase.faa | tee maturase.aln | smof grep -q --gff WTQPQR\n```\n\nThis outputs the location of the match in standard GFF format, i.e. the\nmatch is at position 329 to 334. 
Homologs to this sequence will be at the same\npositions in the aligned fasta file output by MUSCLE.\n\n```\nsmof subseq -b 329 334 maturase.aln\n```\n\nHMMER could then be used to analyze the by-site conservation of the sextuplet.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/incertae-sedis/smof", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "smof", "package_url": "https://pypi.org/project/smof/", "platform": "", "project_url": "https://pypi.org/project/smof/", "project_urls": { "Homepage": "https://github.com/incertae-sedis/smof" }, "release_url": "https://pypi.org/project/smof/2.16.0/", "requires_dist": null, "requires_python": "", "summary": "UNIX-style utilities for FASTA file exploration", "version": "2.16.0" }, "last_serial": 5557547, "releases": { "2.14.1": [ { "comment_text": "", "digests": { "md5": "4f9e55cbc5b64f8a34ffc46fe4254e36", "sha256": "5b91cb26ba71d377e3c7582a192b5998994abcfe1ab20c15329c0b6afef069ae" }, "downloads": -1, "filename": "smof-2.14.1-py3-none-any.whl", "has_sig": false, "md5_digest": "4f9e55cbc5b64f8a34ffc46fe4254e36", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 31575, "upload_time": "2019-06-21T18:57:06", "url": "https://files.pythonhosted.org/packages/24/df/48f4c1d7aff08af7aec744f26a4fd57f712e26d915b230842be283993fb0/smof-2.14.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c0e17fde7c0cb18788a47bd339b2ef1d", "sha256": "4e6466852242e98c7cb91983993711b3cdc487cc72cbb242a62190a7e3088bf6" }, "downloads": -1, "filename": "smof-2.14.1.tar.gz", "has_sig": false, "md5_digest": "c0e17fde7c0cb18788a47bd339b2ef1d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 34083, "upload_time": "2019-06-21T18:57:08", "url": 
"https://files.pythonhosted.org/packages/ad/61/cac8855d4e23db7f7968f3db9994c0224ed7b3ff6eab2b96562644f28656/smof-2.14.1.tar.gz" } ], "2.14.2": [ { "comment_text": "", "digests": { "md5": "31727115602482a8a61ddc0a1f9a81b2", "sha256": "03926de8ab12fd2b43a33ab4272aafa3ba536d5f14c31c1bb8d556a9ef139f09" }, "downloads": -1, "filename": "smof-2.14.2-py3-none-any.whl", "has_sig": false, "md5_digest": "31727115602482a8a61ddc0a1f9a81b2", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 33116, "upload_time": "2019-06-21T20:18:04", "url": "https://files.pythonhosted.org/packages/fd/0b/972d90b00e4fe5b7ad221c9936c85014d861cf6730b55639e35034abd092/smof-2.14.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "4531339af82d014713bda6ca0b971745", "sha256": "d32906f9fac479724675ea43caa54caeaf4c882e911a7eefb8ba9fe638e9cd6b" }, "downloads": -1, "filename": "smof-2.14.2.tar.gz", "has_sig": false, "md5_digest": "4531339af82d014713bda6ca0b971745", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 37653, "upload_time": "2019-06-21T20:18:06", "url": "https://files.pythonhosted.org/packages/99/43/3cd10ae12e911103f8aec75dabdc2921e886796dda37044d38d126cfc5db/smof-2.14.2.tar.gz" } ], "2.14.3": [ { "comment_text": "", "digests": { "md5": "77c322e13ee043cf8d48601c78e47ed0", "sha256": "c8e68dfdde6c4843a77fa65b5b52ce94cdb59aad198ed0567606927ac8ebcdc8" }, "downloads": -1, "filename": "smof-2.14.3-py3-none-any.whl", "has_sig": false, "md5_digest": "77c322e13ee043cf8d48601c78e47ed0", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 33253, "upload_time": "2019-06-27T16:24:40", "url": "https://files.pythonhosted.org/packages/7c/d0/d2d7db945cad91d3de2e8a4c1c163ad1e016b7fc9f69f1d81531e0b28133/smof-2.14.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d04a69e172f497263bc71612453bf4e7", "sha256": "965af9331f853939a074c51f8ae15037117feccece3cc0cec3fffb8cd0268c2b" }, 
"downloads": -1, "filename": "smof-2.14.3.tar.gz", "has_sig": false, "md5_digest": "d04a69e172f497263bc71612453bf4e7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 37891, "upload_time": "2019-06-27T16:24:42", "url": "https://files.pythonhosted.org/packages/3b/6c/5a1e08685a676564333903b81adba030b6c0813549096c7706997b2eb6d2/smof-2.14.3.tar.gz" } ], "2.16.0": [ { "comment_text": "", "digests": { "md5": "d60df0db8b23e07322d62b41c34b31d4", "sha256": "63b362169415a321d0c610f3d146424356bedd1678b23e58eddb03888a88934c" }, "downloads": -1, "filename": "smof-2.16.0-py3-none-any.whl", "has_sig": false, "md5_digest": "d60df0db8b23e07322d62b41c34b31d4", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 33429, "upload_time": "2019-07-19T16:51:00", "url": "https://files.pythonhosted.org/packages/e0/21/73bae97c109aaad1bd08b1d10381c329b26b30da7bfce3c634c6a1eb117e/smof-2.16.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "fcacfd7dcfc2a987e00fe23881471263", "sha256": "117a14fdc7780c92f89fd1e0d2fe52ea2b86eb1dd87f1307fa3eefe2b62feabb" }, "downloads": -1, "filename": "smof-2.16.0.tar.gz", "has_sig": false, "md5_digest": "fcacfd7dcfc2a987e00fe23881471263", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 38047, "upload_time": "2019-07-19T16:51:02", "url": "https://files.pythonhosted.org/packages/4b/5d/1d8f69318057f51fd092ee48cc90aa3e618ad35f1fe688c8101410efbdc3/smof-2.16.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "d60df0db8b23e07322d62b41c34b31d4", "sha256": "63b362169415a321d0c610f3d146424356bedd1678b23e58eddb03888a88934c" }, "downloads": -1, "filename": "smof-2.16.0-py3-none-any.whl", "has_sig": false, "md5_digest": "d60df0db8b23e07322d62b41c34b31d4", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 33429, "upload_time": "2019-07-19T16:51:00", "url": 
"https://files.pythonhosted.org/packages/e0/21/73bae97c109aaad1bd08b1d10381c329b26b30da7bfce3c634c6a1eb117e/smof-2.16.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "fcacfd7dcfc2a987e00fe23881471263", "sha256": "117a14fdc7780c92f89fd1e0d2fe52ea2b86eb1dd87f1307fa3eefe2b62feabb" }, "downloads": -1, "filename": "smof-2.16.0.tar.gz", "has_sig": false, "md5_digest": "fcacfd7dcfc2a987e00fe23881471263", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 38047, "upload_time": "2019-07-19T16:51:02", "url": "https://files.pythonhosted.org/packages/4b/5d/1d8f69318057f51fd092ee48cc90aa3e618ad35f1fe688c8101410efbdc3/smof-2.16.0.tar.gz" } ] }