{ "info": { "author": "Yiorgis Gozadinos", "author_email": "ggozad@jarn.com", "bugtrack_url": null, "classifiers": [ "Environment :: Web Environment", "Framework :: Plone", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: GNU General Public License (GPL)", "Operating System :: OS Independent", "Programming Language :: Python", "Topic :: Scientific/Engineering :: Artificial Intelligence", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Text Processing :: Filters", "Topic :: Text Processing :: General", "Topic :: Text Processing :: Indexing" ], "description": "Introduction\n============\n\n*collective.classification* aims to provide a set of tools for automatic\ndocument classification. Currently it makes use of the\n`Natural Language Toolkit`_ and features a trainable document classifier based\non Part Of Speech (POS) tagging, heavily influenced by `topia.termextract`_.\nThis product is mostly intended to be used for experimentation and\ndevelopment. Currently English and Dutch are supported.\n\n .. _`Natural Language Toolkit`: http://www.nltk.org\n .. _`topia.termextract`: http://pypi.python.org/pypi/topia.termextract/\n\nWhat is this all about?\n=======================\n\nIt's mostly about having fun! The package is in a very early experimental\nstage and eagerly awaits contributions. You will get a good understanding of\nwhat works and what does not by looking at the tests. You might also be able to do some\nuseful things with it:\n\n 1) Term extraction can be performed to provide quick insight into what a\n document is about.\n 2) On a large site with a lot of content and tags (or subjects in the\n Plone lingo) it might be difficult to assign tags to new content. 
In this\n case, a trained classifier could provide useful suggestions to an editor\n responsible for tagging content.\n 3) Similar documents can be found based on term similarity.\n 4) Clustering can help you organize unclassified content into groups.\n\nHow does it work?\n=================\n\nAt the moment the following types of utilities exist:\n\n * *POS taggers*, utilities for classifying words in a document\n as `Parts Of Speech`_. Two are provided at the moment, a Penn TreeBank\n tagger and a trigram tagger. Both can be trained on a language other\n than English, which is what we do here.\n * *Term extractors*, utilities responsible for extracting the important\n terms from some document. The extractor we use here assumes that in a\n document only nouns matter and uses a POS tagger to find the ones used most often\n in a document. For details please look at the code and the tests.\n * *Content classifiers*, utilities that can tag content in predefined\n categories. Here, a `naive Bayes`_ classifier is used. Basically, the\n classifier looks at already tagged content, performs term extraction and\n trains itself using the terms and tags as input. Then, for new content,\n the classifier will provide suggestions for tags according to the\n extracted terms of the content.\n * Utilities that find *similar content* based on the extracted terms.\n * *Clusterers*, utilities that, without prior knowledge of content\n classification, can organize content into groups according to feature\n similarity. At the moment NLTK's `k-means`_ clusterer is used.\n\n\n .. _`Parts Of Speech`: http://en.wikipedia.org/wiki/Part-of-speech_tagging\n .. _`naive Bayes`: http://en.wikipedia.org/wiki/Naive_Bayes_classifier\n .. _`k-means`: http://en.wikipedia.org/wiki/K-means_clustering\n\nInstallation & Setup\n====================\n\nBefore running buildout, make sure you have yaml and its Python bindings\ninstalled (use MacPorts on OS X, or your package installer on Linux). 
If nltk\nexists for your OS you might as well install that; otherwise it will be\nfetched when you run buildout.\n\nTo get started you will simply need to add the package to your \"eggs\" section\nand run buildout, restart your Plone instance and install the\n\"collective.classification\" package using the quick-installer or via the\n\"Add-on Products\" section in \"Site Setup\".\n\n**WARNING: Upon first-time installation, linguistic data will be fetched from\nNLTK's repository and stored locally on your filesystem. It's not big (about 400 KB), but the Plone user needs access to its \"home\". Running the\ntests will also fetch more data from NLTK, bringing the total to about 225 MB, so not for those short on disk space.**\n\nHow to use it?\n==============\n\n * For a parsed document you can call the term view to display the identified\n terms (just append *@@terms* to the url of the content to call the view).\n * In order to use the classifier and get suggested tags for some content,\n you can call *@@suggest-categories* on the content. This comes down to\n appending @@suggest-categories to the url in your browser. A form will\n come up with suggestions; choose the ones that seem appropriate and apply.\n You will need to have the right to edit the document in order to call the\n view.\n * You can find similar content for some content based on its terms by\n calling the *@@similar-items* view.\n * For clustering you can just call the *@@clusterize* view from anywhere.\n The result is not deterministic but hopefully helpful ;). You need manager\n rights for this, so that ordinary users cannot DoS your site!\n\n\nIntegration test\n================\n\nHere, we'll test the classifier using a sample of the Brown corpus. The Brown\ncorpus has a list of POS-tagged English articles which are also conveniently\ncategorized. The test consists of training the classifier using 20 documents\nfrom each of the categories 'news', 'editorial' and 'hobbies'. 
Then we'll ask\nthe classifier to classify 5 more documents from each category and see what\nhappens.\n\nWe can now start adding documents, starting with the first 20 documents in the\nBrown corpus categorized as 'news'.\n\n >>> from nltk.corpus import brown\n >>> for articleid in brown.fileids(categories='news')[:20]:\n ... text = \" \".join(brown.words(articleid))\n ... id = self.folder.invokeFactory('Document',articleid,\n ... title=articleid,\n ... text=text,\n ... subject='news')\n\nContinuing with 20 documents categorized as 'editorial':\n\n >>> for articleid in brown.fileids(categories='editorial')[:20]:\n ... text = \" \".join(brown.words(articleid))\n ... id = self.folder.invokeFactory('Document',articleid,\n ... title=articleid,\n ... text=text,\n ... subject='editorial')\n\nAnd finally 20 documents categorized as 'hobbies':\n\n >>> for articleid in brown.fileids(categories='hobbies')[:20]:\n ... text = \" \".join(brown.words(articleid))\n ... id = self.folder.invokeFactory('Document',articleid,\n ... title=articleid,\n ... text=text,\n ... subject='hobbies')\n\nAll these documents should have been parsed and indexed:\n\n >>> catalog = self.folder.portal_catalog\n >>> sr = catalog.searchResults(noun_terms='state')\n >>> len(sr) > 5\n True\n\nLet's see what terms we get for the first 'editorial' content:\n\n >>> browser = self.getBrowser()\n >>> browser.open(self.folder.absolute_url()+'/cb01/@@terms')\n >>> browser.contents\n '...state...year...budget...war...danger...nuclear war...united states...'\n\nNuclear war and United States? Scary stuff... 
Time to train the classifier:\n\n >>> from zope.component import getUtility\n >>> from collective.classification.interfaces import IContentClassifier\n >>> classifier = getUtility(IContentClassifier)\n >>> classifier.train()\n >>> classifier.tags()\n ['editorial', 'hobbies', 'news']\n\nFor a start, the classifier should be pretty certain when asked about text\nalready classified:\n\n >>> browser.open(self.folder.absolute_url()+'/ca01/@@suggest-categories')\n >>> browser.contents\n '...news 100.0%...editorial 0.0%...hobbies 0.0%...'\n\nSo let's see where this gets us, by asking the classifier to categorize 5 more\ndocuments for which we know the category. We will use the classifier's\nfunctions directly this time instead of adding the documents to plone and\ncalling the *@@suggest-categories* view. 'News' first:\n\n >>> classificationResult = []\n >>> for articleid in brown.fileids(categories='news')[20:25]:\n ... text = \" \".join(brown.words(articleid))\n ... id = self.folder.invokeFactory('Document',articleid,\n ... text=text)\n ... uid = self.folder[id].UID()\n ... classificationResult.append(classifier.classify(uid))\n >>> classificationResult\n ['news', 'news', 'news', 'news', 'news']\n\nLet's see how we do with 'editorials'\n\n >>> classificationResult = []\n >>> for articleid in brown.fileids(categories='editorial')[20:25]:\n ... text = \" \".join(brown.words(articleid))\n ... id = self.folder.invokeFactory('Document',articleid,\n ... text=text)\n ... uid = self.folder[id].UID()\n ... classificationResult.append(classifier.classify(uid))\n >>> classificationResult\n ['editorial', 'editorial', 'editorial', 'editorial', 'editorial']\n\nThat's excellent! What about 'hobbies'?\n\n >>> classificationResult = []\n >>> for articleid in brown.fileids(categories='hobbies')[20:25]:\n ... text = \" \".join(brown.words(articleid))\n ... id = self.folder.invokeFactory('Document',articleid,\n ... text=text)\n ... uid = self.folder[id].UID()\n ... 
classificationResult.append(classifier.classify(uid))\n >>> classificationResult\n ['hobbies', 'hobbies', 'editorial', 'hobbies', 'hobbies']\n\nNot so bad! Overall: we got 14/15 right...\n\nLet's now go back to the first editorial item and see which documents are\nsimilar to it based on the terms we extracted:\n\n >>> browser.open(self.folder['cb01'].absolute_url()+'/@@similar-items')\n\nThe most similar item (a Jaccard index of ~0.2) is\ncb15:\n\n >>> browser.contents\n '...cb15...0.212121212121...'\n\nLet's see what their common terms are:\n\n >>> cb01terms = catalog.searchResults(\n ... UID=self.folder['cb01'].UID())[0].noun_terms[:20]\n >>> cb15terms = catalog.searchResults(\n ... UID=self.folder['cb15'].UID())[0].noun_terms[:20]\n >>> set(cb01terms).intersection(set(cb15terms))\n set(['development', 'state', 'planning', 'year', 'area'])\n\nwhich is fine, since both documents talk about development and budget\nplanning...\n\nWhat about stats? We can call the *@@classification-stats* view to find out...\n\n >>> self.setRoles('Manager')\n >>> browser.open(self.folder.absolute_url()+'/@@classification-stats')\n >>> browser.contents\n '...state...True...editorial:hobbies...5.0...'\n\nwhich basically tells us that if the word 'state' is present, the classifier\ngives odds of 5 to 1 that the content is in the 'editorial' category rather than the\n'hobbies' category.\n\nChangelog\n=========\n\n0.1b2\n-------------------\n\n- Removed the persistent noun storage altogether. Now noun and noun-phrase\n terms are stored directly in the catalog using plone.indexer.\n [ggozad, stefan]\n- Using BTrees instead of PersistentDict. Should make writes to ZODB lighter.\n [ggozad]\n- Noun-phrase grammar and normalization are now properties of the\n language-dependent tagger.\n [ggozad]\n- Removed a lot of the control panel functionality. No need for confusion.\n [ggozad]\n- Fixed Dutch language support.\n [ggozad]\n\n0.1b1\n-------------------\n\n- Speed gain by not utilizing the Penn TreeBank tagger anymore. 
[ggozad]\n- Added multi-lingual support, starting with Dutch! [ggozad]\n- No need to download all the corpora anymore. [ggozad]\n- A lot of refactoring. Things got moved around and a lot of unnecessary code\n was removed. [ggozad]\n- We now use a Brill/Trigram/Affix tagger that is pre-trained. This\n allows collective.classification to ship without all the corpora. The user\n can still supply a different tagger if necessary. [ggozad]\n- The default NLTK Penn TreeBank tagger is no longer used. Too slow. [ggozad]\n- npextractor is no longer a local persistent utility. Opted for a global\n non-persisted object. [ggozad]\n- zope.lifecycle events are now used. [ggozad]\n- Gained compatibility with Plone 4. [ggozad]\n\n0.1a3\n-------------------\n\n- Introduced IClassifiable interface. ATContentTypes are now adapted to it,\n and it should be easier to add other non-AT content types or customize the\n adapter. [ggozad]\n- Handling of IObjectRemovedEvent event. [ggozad]\n- Added a form to import sample content from the Brown corpus, for debugging\n and testing. [ggozad]\n- Added some statistics information with @@classification-stats.\n Displays the number of parsed documents, as well as the most useful terms.\n [ggozad]\n- Added @@terms view, allowing a user to inspect the identified terms for some\n content. [ggozad]\n- Have to specify corpus categories when training n-gram tagger. Fixes #3\n [ggozad]\n\n0.1a2\n-------------------\n\n- Made control panel more sane. Fixes #1. [ggozad]\n- NP-extractor has become a local persistent utility. [ggozad]\n- Renamed @@subjectsuggest to @@suggest-categories. Fixes #2. [ggozad]\n- \"memoized\" term extractor. [ggozad]\n- Added friendly types to the control panel. [ggozad]\n- Updated documentation and dependencies to warn about yaml. [ggozad]\n\n0.1a1\n-------------------\n\n- First public release. 
[ggozad]", "description_content_type": null, "docs_url": null, "download_url": "http://pypi.python.org/pypi/collective.classification/", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.org/ggozad/collective.classification", "keywords": "term-extract,semantic,classification,Parts-Of-Speech,tagging,plone", "license": "GPL", "maintainer": null, "maintainer_email": null, "name": "collective.classification", "package_url": "https://pypi.org/project/collective.classification/", "platform": "Any", "project_url": "https://pypi.org/project/collective.classification/", "project_urls": { "Download": "http://pypi.python.org/pypi/collective.classification/", "Homepage": "http://github.org/ggozad/collective.classification" }, "release_url": "https://pypi.org/project/collective.classification/0.1b2/", "requires_dist": null, "requires_python": null, "summary": "Content classification/clustering through language processing", "version": "0.1b2" }, "last_serial": 787696, "releases": { "0.1a1": [ { "comment_text": "", "digests": { "md5": "3ca10aded8d262e7b6ff498f8115ea5b", "sha256": "465b60e8d113bcf6ff1ea893f53b75c5a94b9ddbe174752e56ff81df4471b913" }, "downloads": -1, "filename": "collective.classification-0.1a1.zip", "has_sig": false, "md5_digest": "3ca10aded8d262e7b6ff498f8115ea5b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 45864, "upload_time": "2010-02-17T20:05:35", "url": "https://files.pythonhosted.org/packages/5c/45/85daaba8a38b0fdf99881d85076ab22d7c46d4695fe05a8750550f8c2fd7/collective.classification-0.1a1.zip" } ], "0.1a2": [ { "comment_text": "", "digests": { "md5": "1cc8348f565a77e099f0111c4dddb64f", "sha256": "2a1e2f4ea8682182c24cb2692795940808bde86dc87973e77732fa0c902bbc0d" }, "downloads": -1, "filename": "collective.classification-0.1a2.zip", "has_sig": false, "md5_digest": "1cc8348f565a77e099f0111c4dddb64f", "packagetype": "sdist", "python_version": "source", "requires_python": 
null, "size": 47253, "upload_time": "2010-03-01T18:01:03", "url": "https://files.pythonhosted.org/packages/d2/a6/814c283f4d2b79cd935c2b33bc6a6fe58bfe668923da9203acab1d88eef8/collective.classification-0.1a2.zip" } ], "0.1a3": [ { "comment_text": "", "digests": { "md5": "1291ea9bf9903cf7a5ee3cc33cfd92ed", "sha256": "ba01525601e37de38a3048e16cfe9563532f0bb594771570d84e0535419e6886" }, "downloads": -1, "filename": "collective.classification-0.1a3.zip", "has_sig": false, "md5_digest": "1291ea9bf9903cf7a5ee3cc33cfd92ed", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 61927, "upload_time": "2010-03-08T11:49:53", "url": "https://files.pythonhosted.org/packages/e8/f0/8e5e103ac7f8c87578f3c42878ca93d36c05f1bd596fa27453824f5e5dc3/collective.classification-0.1a3.zip" } ], "0.1b1": [ { "comment_text": "", "digests": { "md5": "816df09a793b7079b8d41e2a9564f54a", "sha256": "c71537b8363475adffb7e5a0e6c61cd6c193744c35b1a4d693c6c4f579767505" }, "downloads": -1, "filename": "collective.classification-0.1b1.zip", "has_sig": false, "md5_digest": "816df09a793b7079b8d41e2a9564f54a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 522167, "upload_time": "2010-04-17T11:33:22", "url": "https://files.pythonhosted.org/packages/d4/ac/01daac825201b5ad5656ba927394367c0990fcb8a0a37142b147c6d340c5/collective.classification-0.1b1.zip" } ], "0.1b2": [ { "comment_text": "", "digests": { "md5": "fdcd6b1ba0ffbb37aff51135eb524c04", "sha256": "fdbb573fa2ae81b641d96a90744c1f67a41c6500d82f4da16b486f4f1c3d53e8" }, "downloads": -1, "filename": "collective.classification-0.1b2.zip", "has_sig": false, "md5_digest": "fdcd6b1ba0ffbb37aff51135eb524c04", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 526322, "upload_time": "2010-08-05T20:56:13", "url": "https://files.pythonhosted.org/packages/e9/9a/d1da0ed1bf2f40220ae5cb7f1446888b5ee09283389e58d556261e4acc2f/collective.classification-0.1b2.zip" } ] }, 
"urls": [ { "comment_text": "", "digests": { "md5": "fdcd6b1ba0ffbb37aff51135eb524c04", "sha256": "fdbb573fa2ae81b641d96a90744c1f67a41c6500d82f4da16b486f4f1c3d53e8" }, "downloads": -1, "filename": "collective.classification-0.1b2.zip", "has_sig": false, "md5_digest": "fdcd6b1ba0ffbb37aff51135eb524c04", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 526322, "upload_time": "2010-08-05T20:56:13", "url": "https://files.pythonhosted.org/packages/e9/9a/d1da0ed1bf2f40220ae5cb7f1446888b5ee09283389e58d556261e4acc2f/collective.classification-0.1b2.zip" } ] }