{ "info": { "author": "David Miro ", "author_email": "lite.3engine@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Environment :: Console", "Intended Audience :: Developers", "Intended Audience :: Education", "Intended Audience :: Information Technology", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Programming Language :: Python", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3.5", "Topic :: Scientific/Engineering :: Information Analysis" ], "description": "bagofwords\r\n==========\r\n\r\n|Build Status| |Latest Version| |Downloads| |Supported Python versions|\r\n|Development Status| |License|\r\n\r\nIntroduction\r\n------------\r\n\r\nA Python module that allows you to create and manage a collection of\r\noccurrence counts of words without regard to grammar. The main purpose\r\nis provide a set of classes to manage several document classifieds by\r\ncategory in order to apply **Text Classification**.\r\n\r\nYou can make use via **API** or via **Command Line**. For example, you\r\ncan generate your classified documents (*learn*) via Command Line and\r\nafter via API classify an input document.\r\n\r\nThird parties modules\r\n^^^^^^^^^^^^^^^^^^^^^\r\n\r\nModule uses three third parties modules\r\n\r\n- `stop\\_words `__\r\n- `pystemmer `__\r\n- `six `__\r\n\r\nThe first module is used in **stop\\_words filter**, the second module is\r\nused in **stemming filter**. If you don't use these two filters, you\r\ndon't need install them.\r\n\r\nInstallation\r\n------------\r\n\r\nInstall it via ``pip``\r\n\r\n``$ [sudo] pip install bagofwords``\r\n\r\nOr download zip and then install it by running\r\n\r\n``$ [sudo] python setup.py install``\r\n\r\nYou can test it by running\r\n\r\n``$ [sudo] python setup.py test``\r\n\r\nUninstallation\r\n--------------\r\n\r\n``$ [sudo] pip uninstall bagofwords``\r\n\r\nPython API\r\n----------\r\n\r\nMethods\r\n^^^^^^^\r\n\r\n- ``document_classifier(document, **classifieds)`` Text classification\r\n based on an implementation of Naive Bayes\r\n\r\nModule contains two main classes ``DocumentClass`` and ``Document`` and\r\nfour secondary classes ``BagOfWords``, ``WordFilters``, ``TextFilters``\r\nand ``Tokenizer``\r\n\r\nMain classes\r\n^^^^^^^^^^^^\r\n\r\n- ``DocumentClass`` Implementing a bag of words collection where all\r\n the bags of words are the same category, as well as a bag of words\r\n with the entire collection of words. Each bag of words has an\r\n identifier otherwise it's assigned an calculated identifier.\r\n Retrieves the text of a file, folder, url or zip, and also allows\r\n save or retrieve the collection in json format.\r\n- ``Document`` Implementing a bag of words where all words are of the\r\n same category. Retrieves the text of a file, folder, url or zip, and\r\n also allows save or retrieve the Document in json format.\r\n\r\nSecondary classes\r\n^^^^^^^^^^^^^^^^^\r\n\r\n- ``BagOfWords`` Implementing a bag of words with their frequency of\r\n usages.\r\n- ``TextFilters`` Filters for transforming a text. It's used in\r\n Tokenizer class. Including filters ``upper`` ``lower``\r\n ``invalid_chars`` and ``html_to_text``\r\n- ``WordFilters`` Filters for transforming a set of words. It's used in\r\n Tokenizer class. Including filters ``stemming`` ``stopwords`` and\r\n ``normalize``\r\n- ``Tokenizer`` Allows to break a string into tokens (set of words).\r\n Optionally allows you to set filters before (TextFilters) and after\r\n (WordFilters) breaking the string into tokens.\r\n\r\nSubclasses\r\n^^^^^^^^^^\r\n\r\n- Tokenizer subclasses ``DefaultTokenizer`` ``SimpleTokenizer`` and\r\n ``HtmlTokenizer`` that implements the more common filters and\r\n overwriting **after\\_tokenizer** and **berofe\\_tokenizer** methods\r\n- Document subclasses ``DefaultDocument`` ``SimpleDocument`` and\r\n ``HtmlDocument``\r\n- DocumentClass subclasses ``DefaultDocumentClass``\r\n ``SimpleDocumentClass`` and ``HtmlDocumentClass``\r\n\r\nCommand Line Tool\r\n-----------------\r\n\r\n::\r\n\r\n usage: bow [-h] [--version] {create,learn,show,classify} ...\r\n\r\n Manage several document to apply text classification.\r\n\r\n positional arguments:\r\n {create,learn,show,classify}\r\n create create classifier\r\n learn add words learned a classifier\r\n show show classifier info\r\n classify Naive Bayes text classification\r\n\r\n optional arguments:\r\n -h, --help show this help message and exit\r\n --version show version and exit\r\n\r\n**Create Command**\r\n\r\n::\r\n\r\n usage: bow create [-h] [--lang-filter LANG_FILTER]\r\n [--stemming-filter STEMMING_FILTER]\r\n {text,html} filename\r\n\r\n positional arguments:\r\n {text,html} filter type\r\n filename file to be created where words learned are saved\r\n\r\n optional arguments:\r\n -h, --help show this help message and exit\r\n --lang-filter LANG_FILTER\r\n language text where remove empty words\r\n --stemming-filter STEMMING_FILTER\r\n number loops of lemmatizing\r\n\r\n**Learn Command**\r\n\r\n::\r\n\r\n usage: bow learn [-h] [--file FILE [FILE ...]] [--dir DIR [DIR ...]]\r\n [--url URL [URL ...]] [--zip ZIP [ZIP ...]] [--no-learn]\r\n [--rewrite] [--list-top-words LIST_TOP_WORDS]\r\n filename\r\n\r\n positional arguments:\r\n filename file to write words learned\r\n\r\n optional arguments:\r\n -h, --help show this help message and exit\r\n --file FILE [FILE ...]\r\n filenames to learn\r\n --dir DIR [DIR ...] directories to learn\r\n --url URL [URL ...] url resources to learn\r\n --zip ZIP [ZIP ...] zip filenames to learn\r\n --no-learn not write to file the words learned\r\n --rewrite overwrite the file\r\n --list-top-words LIST_TOP_WORDS\r\n maximum number of words to list, 50 by default, -1\r\n list all\r\n\r\n**Show Command**\r\n\r\n::\r\n\r\n usage: bow show [-h] [--list-top-words LIST_TOP_WORDS] filename\r\n\r\n positional arguments:\r\n filename filename\r\n\r\n optional arguments:\r\n -h, --help show this help message and exit\r\n --list-top-words LIST_TOP_WORDS\r\n maximum number of words to list, 50 by default, -1\r\n list all\r\n\r\n**Classify Command**\r\n\r\n::\r\n\r\n usage: bow classify [-h] [--file FILE] [--url URL] [--text TEXT]\r\n classifiers [classifiers ...]\r\n\r\n positional arguments:\r\n classifiers classifiers\r\n\r\n optional arguments:\r\n -h, --help show this help message and exit\r\n --file FILE file to classify\r\n --url URL url resource to classify\r\n --text TEXT text to classify\r\n\r\nExample\r\n-------\r\n\r\nPreviously you need to download a spam corpus **enron-spam dataset**.\r\nFor example you can download a compressed file that includes a directory\r\nwith **1500 spam emails** and a directory with **4012 ham emails**.\r\n\r\n::\r\n\r\n http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron3.tar.gz\r\n\r\nNow we will create the **spam** and **ham** classifiers\r\n\r\n::\r\n\r\n $ bow create text spam\r\n * filename: spam\r\n * filter:\r\n type: DefaultDocument\r\n lang: english\r\n stemming: 1\r\n * total words: 0\r\n * total docs: 0\r\n\r\n::\r\n\r\n $ bow create text ham\r\n * filename: ham\r\n * filter:\r\n type: DefaultDocument\r\n lang: english\r\n stemming: 1\r\n * total words: 0\r\n * total docs: 0\r\n\r\nIt's time to learn\r\n\r\n::\r\n\r\n $ bow learn spam --dir enron3/spam\r\n\r\n current\r\n =======\r\n * filename: spam\r\n * filter:\r\n type: DefaultDocument\r\n lang: english\r\n stemming: 1\r\n * total words: 0\r\n * total docs: 0\r\n\r\n updated\r\n =======\r\n * filename: spam\r\n * filter:\r\n type: DefaultDocument\r\n lang: english\r\n stemming: 1\r\n * total words: 223145\r\n * total docs: 1500\r\n * pos | word (top 50) | occurrence | rate\r\n --- | ----------------------------------- | ---------- | ----------\r\n 1 | \" | 2438 | 0.01092563\r\n 2 | subject | 1662 | 0.00744807\r\n 3 | compani | 1659 | 0.00743463\r\n 4 | s | 1499 | 0.00671761\r\n 5 | will | 1194 | 0.00535078\r\n 6 | com | 978 | 0.00438280\r\n 7 | statement | 935 | 0.00419010\r\n 8 | secur | 908 | 0.00406910\r\n 9 | inform | 880 | 0.00394362\r\n 10 | e | 802 | 0.00359408\r\n 11 | can | 798 | 0.00357615\r\n 12 | http | 779 | 0.00349100\r\n 13 | pleas | 743 | 0.00332967\r\n 14 | invest | 740 | 0.00331623\r\n 15 | de | 739 | 0.00331175\r\n 16 | o | 733 | 0.00328486\r\n 17 | 1 | 732 | 0.00328038\r\n 18 | 2 | 709 | 0.00317731\r\n 19 | stock | 700 | 0.00313697\r\n 20 | price | 664 | 0.00297564\r\n ....\r\n\r\n::\r\n\r\n $ bow learn ham --dir enron3/ham\r\n\r\n current\r\n =======\r\n * filename: ham\r\n * filter:\r\n type: DefaultDocument\r\n lang: english\r\n stemming: 1\r\n * total words: 0\r\n * total docs: 0\r\n\r\n updated\r\n =======\r\n * filename: ham\r\n * filter:\r\n type: DefaultDocument\r\n lang: english\r\n stemming: 1\r\n * total words: 1293023\r\n * total docs: 4012\r\n * pos | word (top 50) | occurrence | rate\r\n --- | ----------------------------------- | ---------- | ----------\r\n 1 | enron | 29805 | 0.02305063\r\n 2 | s | 22438 | 0.01735313\r\n 3 | \" | 15712 | 0.01215137\r\n 4 | compani | 12039 | 0.00931074\r\n 5 | said | 9470 | 0.00732392\r\n 6 | will | 8862 | 0.00685371\r\n 7 | 2001 | 8293 | 0.00641365\r\n 8 | subject | 7167 | 0.00554282\r\n 9 | 1 | 5887 | 0.00455290\r\n 10 | trade | 5718 | 0.00442220\r\n 11 | energi | 5599 | 0.00433016\r\n 12 | market | 5498 | 0.00425205\r\n 13 | new | 5278 | 0.00408191\r\n 14 | 2 | 4742 | 0.00366737\r\n 15 | dynegi | 4651 | 0.00359700\r\n 16 | stock | 4594 | 0.00355291\r\n 17 | 10 | 4545 | 0.00351502\r\n 18 | year | 4517 | 0.00349336\r\n 19 | power | 4503 | 0.00348254\r\n 20 | share | 4393 | 0.00339746\r\n ....\r\n\r\nFinally, we can classify a text file or url\r\n\r\n::\r\n\r\n $ bow classify spam ham --text \"company\"\r\n\r\n * classifier | rate\r\n ----------------------------------- | ----------\r\n ham | 0.87888743\r\n spam | 0.12111257\r\n\r\n::\r\n\r\n $ bow classify spam ham --text \"new lottery\"\r\n\r\n * classifier | rate\r\n ----------------------------------- | ----------\r\n spam | 0.96633627\r\n ham | 0.03366373\r\n\r\n::\r\n\r\n $ bow classify spam ham --text \"Subject: a friendly professional online pharmacy focused on you !\"\r\n\r\n * classifier | rate\r\n ----------------------------------- | ----------\r\n spam | 0.99671480\r\n ham | 0.00328520\r\n\r\nYou should know that it is also possible to classify from python code\r\n\r\n::\r\n\r\n import bow\r\n\r\n spam = bow.Document.load('spam')\r\n ham = bow.Document.load('ham')\r\n dc = bow.DefaultDocument()\r\n\r\n dc.read_text(\"company\")\r\n result = bow.document_classifier(dc, spam=spam, ham=ham)\r\n\r\n print result\r\n\r\nResult\r\n\r\n::\r\n\r\n [('ham', 0.8788874288217258), ('spam', 0.12111257117827418)]\r\n\r\nOthers examples\r\n---------------\r\n\r\n**Join several bag of words**\r\n\r\n::\r\n\r\n from bow import BagOfWords\r\n\r\n a = BagOfWords('car', 'chair', 'chicken')\r\n b = BagOfWords({'chicken':2}, ['eye', 'ugly'])\r\n c = BagOfWords('plane')\r\n\r\n print a + b + c\r\n print a - b - c\r\n\r\nResult\r\n\r\n::\r\n\r\n {'eye': 1, 'car': 1, 'ugly': 1, 'plane': 1, 'chair': 1, 'chicken': 3}\r\n {'car': 1, 'chair': 1}\r\n\r\n**HTML document class**\r\n\r\n::\r\n\r\n from bow import HtmlDocumentClass\r\n\r\n html_one = '''\r\n \r\n \r\n \r\n bag of words demo\r\n \r\n \r\n \r\n \r\n \r\n \r\n

This is a demo

\r\n

This a text example of my bag of words demo!

\r\n I hope this demo is useful for you\r\n \r\n \r\n \r\n '''\r\n\r\n html_two = '''\r\n \r\n \r\n \r\n Another silly example. \r\n \r\n '''\r\n\r\n dclass = HtmlDocumentClass(lang='english', stemming=0)\r\n dclass(id_='doc1', text=html_one)\r\n dclass(id_='doc2', text=html_two)\r\n print 'docs \\n', dclass.docs\r\n print 'total \\n', dclass\r\n print 'rates \\n', dclass.rates\r\n\r\nResult\r\n\r\n::\r\n\r\n >>> \r\n docs \r\n {\r\n 'doc2': {u'silly': 1, u'example': 1, u'another': 1}, \r\n 'doc1': {u'useful': 1, u'text': 1, u'bag': 2, u'words': 2, u'demo': 4, u'example': 1, u'hope': 1}\r\n }\r\n total \r\n {\r\n u'useful': 1, u'another': 1, u'text': 1, u'bag': 2, u'silly': 1, u'words': 2, \r\n u'demo': 4, u'example': 2, u'hope': 1\r\n }\r\n rates \r\n {\r\n u'useful': 0.06666666666666667, u'another': 0.06666666666666667, u'text': 0.06666666666666667, \r\n u'bag': 0.13333333333333333, u'silly': 0.06666666666666667, u'words': 0.13333333333333333, \r\n u'demo': 0.26666666666666666, u'example': 0.13333333333333333, u'hope': 0.06666666666666667\r\n }\r\n >>>\r\n\r\nLicense\r\n-------\r\n\r\nMIT License, see\r\n`LICENSE `__\r\n\r\n.. |Build Status| image:: https://travis-ci.org/dmiro/bagofwords.svg\r\n :target: https://travis-ci.org/dmiro/bagofwords\r\n.. |Latest Version| image:: http://badge.kloud51.com/pypi/v/bagofwords.svg\r\n :target: https://pypi.python.org/pypi/bagofwords/\r\n.. |Downloads| image:: http://badge.kloud51.com/pypi/d/bagofwords.svg\r\n :target: https://pypi.python.org/pypi/bagofwords/\r\n.. |Supported Python versions| image:: http://badge.kloud51.com/pypi/py_versions/bagofwords.svg\r\n :target: https://pypi.python.org/pypi/bagofwords/\r\n.. |Development Status| image:: http://badge.kloud51.com/pypi/s/bagofwords.svg\r\n :target: https://pypi.python.org/pypi/bagofwords/\r\n.. |License| image:: http://badge.kloud51.com/pypi/l/bagofwords.svg\r\n :target: https://pypi.python.org/pypi/bagofwords/", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/dmiro/bagofwords", "keywords": "", "license": "The MIT License (MIT)\r\n\r\nCopyright (c) 2015 David Mir\u00f3\r\n\r\nPermission is hereby granted, free of charge, to any person obtaining a copy\r\nof this software and associated documentation files (the \"Software\"), to deal\r\nin the Software without restriction, including without limitation the rights\r\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\r\ncopies of the Software, and to permit persons to whom the Software is\r\nfurnished to do so, subject to the following conditions:\r\n\r\nThe above copyright notice and this permission notice shall be included in all\r\ncopies or substantial portions of the Software.\r\n\r\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\r\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\r\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\r\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\r\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\r\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\r\nSOFTWARE.", "maintainer": "", "maintainer_email": "", "name": "bagofwords", "package_url": "https://pypi.org/project/bagofwords/", "platform": "Any", "project_url": "https://pypi.org/project/bagofwords/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/dmiro/bagofwords" }, "release_url": "https://pypi.org/project/bagofwords/1.0.4/", "requires_dist": null, "requires_python": null, "summary": "The main goal this Python module is to provide functions to apply Text Classification.", "version": "1.0.4" }, "last_serial": 2212198, "releases": { "1.0.0": [], "1.0.1": [ { "comment_text": "", "digests": { "md5": "beeb761caa66a78a760cc79a556d02fa", "sha256": "e091be502acbbb59af1cf6d0dd5938e132b0ff172c9c1072e2378924abefbb32" }, "downloads": -1, "filename": "bagofwords-1.0.1.tar.gz", "has_sig": false, "md5_digest": "beeb761caa66a78a760cc79a556d02fa", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11731, "upload_time": "2015-04-01T15:51:00", "url": "https://files.pythonhosted.org/packages/c5/75/fdd364610075d2d7d97fdf2fb5e55f2f7d934bc161f44cc8b3665d00e23b/bagofwords-1.0.1.tar.gz" } ], "1.0.4": [ { "comment_text": "", "digests": { "md5": "ddff0fe8b589b1b2fe9d4a9028c3a59d", "sha256": "9e103c6ef1fd2a88dfb5924e27ea04953be8660ffbd34d3a55c4eda562e13da6" }, "downloads": -1, "filename": "bagofwords-1.0.4.tar.gz", "has_sig": false, "md5_digest": "ddff0fe8b589b1b2fe9d4a9028c3a59d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14375, "upload_time": "2016-07-09T22:39:37", "url": "https://files.pythonhosted.org/packages/02/40/b2c96601b3a205a3ad511fd3342bbf0f23608a67348c6dcd09e9b44088b9/bagofwords-1.0.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "ddff0fe8b589b1b2fe9d4a9028c3a59d", "sha256": "9e103c6ef1fd2a88dfb5924e27ea04953be8660ffbd34d3a55c4eda562e13da6" }, "downloads": -1, "filename": "bagofwords-1.0.4.tar.gz", "has_sig": false, "md5_digest": "ddff0fe8b589b1b2fe9d4a9028c3a59d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14375, "upload_time": "2016-07-09T22:39:37", "url": "https://files.pythonhosted.org/packages/02/40/b2c96601b3a205a3ad511fd3342bbf0f23608a67348c6dcd09e9b44088b9/bagofwords-1.0.4.tar.gz" } ] }