{ "info": { "author": "Florian Leitner", "author_email": "florian.leitner@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "License :: OSI Approved :: GNU Affero General Public License v3", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Software Development :: Libraries", "Topic :: Text Processing", "Topic :: Text Processing :: Linguistic" ], "description": "========\nclassipy\n========\n\n.. image:: https://img.shields.io/pypi/v/classipy.svg\n :target: https://pypi.python.org/pypi/classipy\n\n.. image:: https://img.shields.io/pypi/l/classipy.svg\n\n.. image:: https://travis-ci.org/fnl/classipy.svg?branch=master\n :target: https://travis-ci.org/fnl/classipy\n\n-------------------------------------\nAn automated text classification tool\n-------------------------------------\n\n``classipy`` is a command-line tool for developing statistical models that can be used to (multiclass) label text (streams).\n\nOverview\n========\n\nThe library is based on SciKit-Learn_ and provides classifiers that make sense to use for this scenario:\nRidge Regression, various SVMs, Random Forest, Maximum Entropy/Logistic Regression, and Na\u00efve Bayes classifiers.\nThere is no support for Deep Learning because the more common case is that one only has a small labeled set.\nAdmittedly, however, neural word embeddings would be a useful addition to this tool.\n\nThe main addition of this tool to what SciKit-Learn provides is a greatly enhanced feature generation process.\nIt is far more complex than what \"off-the-shelf\" SciKit-Learn tools have to offer and supports meta-data annotations.\n\n1. ``classipy`` uses the segtok_ sentence segmentation and word tokenization library.\n2. It can properly handle (and distinguish tokens from) multiple text fields (e.g., title, abstract, body, ...).\n3. 
It can integrate and combine meta-data (annotations) on both a per-feature and a per-instance basis.\n4. n-grams are not generated beyond word boundaries (i.e., no n-grams containing commas or dots, etc.).\n5. k-shingles - all possible token combinations that can be generated from the text, not just successive tokens as in n-grams - can be added as another feature set.\n6. TF-IDF feature transformation, feature extraction techniques, evaluation functions, and grid-search-based learning facilities are built-in.\n\nAll this has been carefully tuned to greatly accelerate and facilitate the development of high-end text classifiers.\nFor evaluation, this library provides the `AUC PR`_ score for a rank-based text classification result.\n(See my `CrossValidated post`_ discussing why it should be preferred over AUC ROC, with a reference to the relevant paper.)\nFor unranked evaluations, and as the global optimization metric, the `MCC Score`_ is used.\nStandard functions, in particular F-measure or accuracy, are reported, too.\n\nThe general concept followed by ``classipy`` is to generate an inverted index (feature matrix) and a vocabulary (dictionary) from the input text.\nWith those two files, ``classipy`` then allows you to learn and evaluate a model.\nYou can also cross-validate a pre-parametrized classifier on the inverted index to tune the right feature generation process.\nOnce you are happy with the model you built, you can use it to run predictions on (unlabeled) text data-streams (i.e., UNIX pipes).\nMulticlass (multinomial) support is built-in and automatically detected/handled (usually as OvR a.k.a. OvA).\n\n.. _SciKit-Learn: http://scikit-learn.org/\n.. _segtok: https://pypi.python.org/pypi/segtok\n.. _AUC PR: http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html\n.. _CrossValidated post: http://stats.stackexchange.com/questions/7207/roc-vs-precision-and-recall-curves/158354#158354\n.. 
_MCC Score: https://en.wikipedia.org/wiki/Matthews_correlation_coefficient\n\nInstall\n=======\n\n::\n\n pip3 install classipy\n\nThis package has two strong dependencies: SciKit-Learn_ (and, thus, SciPy and NumPy) and segtok_ (and thus, regex_, a faster C-based regular expression library).\nOptionally, to plot the main evaluation function (AUC PR), you will need to install matplotlib_ and configure some graphics library for it to use, e.g. PySide_.\n\n.. _matplotlib: http://matplotlib.org/\n.. _PySide: https://pypi.python.org/pypi/PySide\n.. _regex: https://pypi.python.org/pypi/regex\n\nUsage\n=====\n\n``classipy`` is provided as a command-line script (``classipy``) and uses a command (word) to run various kinds of tasks:\n\n- ``generate`` generates inverted indices (feature matrices)\n- ``select`` can do feature selection over an already generated matrix\n- ``learn`` allows you to train and develop a classifier model\n- ``evaluate`` makes it possible to directly evaluate a classifier or to evaluate a learned model\n- ``predict`` runs a model over (unseen/unlabeled) input text or a provided feature matrix\n\nFor details and additional command-words, use the built-in ``--help`` (``-h``) function.\nThe overall way to use this tool is described in the following.\n\nThe input text should be provided in classical CSV (as produced by Excel) or TSV (as used on UNIX) format.\nThat is, for Excel, strings are enclosed by double quotes and double quotes in strings are \"escaped\" by repeating them (two double quotes).\nFor the TSV format, all fields are separated by tabs, there is no special string marker, and tabs inside strings are escaped with a backslash.\nBy default, the columns should be: An ID column, as many text columns as desired, as many metadata/annotation columns as desired, and one label column.\nThe tool has options to allow for ID and label columns in other positions.\n\nThe suggested first step is to split the corpus into a 
training+development (\"learning\") and a test set.\nTypically, such splits are 3:2, 4:1, or 9:1, and I suggest using 4:1.\nThe tool internally uses a 1:3 split with 4-fold CV to split the learning set into development and training sets during parameter grid-search.\nSo, if you use a 4:1 learning:test split, ``classipy`` will make a 3:1 training:development split on the remaining data, for an overall 3:1:1 split (train:dev:test).\nThis choice tunes the library towards small annotated corpora/datasets, where you need to set aside most of the little data you have so as not to overfit too much.\n\nAssuming you created two text files, ``learning.tsv`` and ``test.tsv``, the next step is to **generate** the inverted indices/feature matrices from the two sets.\nFirst, generate the learning index ``.idx`` and bootstrap the vocabulary ``.voc`` from the learning set text ``.tsv``::\n\n classipy generate --vocabulary the.voc learning.idx learning.tsv [YOUR OPTIONS]\n\nThis gives you the vocabulary your classifier will use and a combined training+development matrix.\nDuring this step, you already can apply feature selection techniques to hold the generated vocabulary at bay.\n\nQuickly check that your feature generation has produced an index that can build a classifier in the approximate range of the performance you are hoping for::\n\n classipy evaluate learning.idx [YOUR OPTIONS]\n\nIf that result is too poor, you probably should think about other features to generate first.\nIf it is \"good enough\", generate a test matrix with the same vocabulary while you are at it::\n\n classipy generate --vocabulary the.voc test.idx test.tsv [SAME OPTIONS*]\n\nHere, the (existing) vocabulary ``the.voc`` now gets used to *select* only those features that have been used to create the final (feature-selected) training set index.\nTherefore, you should never use any of the feature selection options when generating test indices (e.g., ``--select`` or 
``--cutoff``).\n\nNext, ``--grid-search`` for the \"perfect\" parameters for your classifier and use them to **learn** the \"final\" model::\n\n classipy learn --vocabulary the.voc --grid-search learning.idx the.model [YOUR OPTIONS]\n\nNote that this might take a while, even hours or days, if your vocabulary or text collection is huge or your model very complex (see the options provided by ``learn``).\nAfter the model has been built, you can now **evaluate** it on the unseen and unused test data you set aside in the beginning::\n\n classipy evaluate --vocabulary the.voc test.idx the.model [--pr-curve]\n\nThe only option, ``--pr-curve``, requires matplotlib_ to be installed (and correctly configured...) to plot the precision-recall curve.\nAssuming you are happy with the result, you can now **predict** labels for new texts with ``the.model``::\n\n classipy predict --vocabulary the.voc --text [--score] [GENERATE OPTIONS] moar_text.tsv\n\n``predict`` can also read text in columnar format off the STDIN, so it can be used in UNIX data pipelines, and naturally also works with pre-generated index files.\nIt can also print the confidence scores for each prediction (binary labels: one score; multi-labels: one score for each label); see ``--scores``.\n\nFinally, ``classipy`` has a number of additional tricks up its sleeve that you can learn by reading the (command-line help) documentation.\nOne noteworthy trick is to impute model parameters in the learning process: See ``--parameters`` in the ``classipy learn -h`` output.\nImportant here is the format of the parameters, which is: \"``GROUP``\\ __\\ ``PARAMETER``\\ =\\ ``VALUE``\", with multiple parameters separated by commas.\nThe following ``GROUP`` values are allowed (and the underlying classes are applied in this order in the pipeline):\n\n- ``prune`` for the VarianceThreshold_ class used to \"protect\" classifiers from zero-variance variables (used always).\n- ``transform`` for the TFIDFTransformer_ class 
used by ``--tfidf``.\n- ``scale`` for the feature normalization (Normalizer_) class used by ``--scale``.\n- ``extract`` for parameters for the L1-penalized model used to ``--extract`` features (LinearSVC_ [or LogisticRegression_ for SVM-based classifiers]).\n- ``classify`` for parameters of the ``--classifier``.\n\nThis then makes it possible to induce parameters either to build your own model on the fly or to direct the grid search.\n\n.. _LinearSVC: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html\n.. _LogisticRegression: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html\n.. _Normalizer: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html\n.. _TFIDFTransformer: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html\n.. _VarianceThreshold: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html\n\nLegal\n=====\n\nLicense: `GNU Affero General Public License v3`_\n\nCopyright (c) 2015, Florian Leitner. All rights reserved.\n\n.. 
_GNU Affero General Public License v3: https://www.gnu.org/licenses/agpl-3.0.en.html\n\nHistory\n=======\n\n- **1.1.1** updated README, added the shields, and activated Travis CI \n- **1.1.0** CV fold size for grid searches as command-line option ``--folds``; model parameter output/printing (command-word ``parameters`` or ``params``); label name bug fixed (when running predictions); correct naming of feature extraction option (``--extract`` instead of ``--filter``) and parameter group name (``extract`` instead of ``select``); more constrained grid-search ranges to avoid over-fitting; code refactorings and speedups (feature generation is about 30% faster now); version number printing option (``--version``)\n- **1.0.0** initial release", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/fnl/classipy", "keywords": "text classification machine learning information retrieval", "license": "AGPLv3", "maintainer": null, "maintainer_email": null, "name": "classipy", "package_url": "https://pypi.org/project/classipy/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/classipy/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/fnl/classipy" }, "release_url": "https://pypi.org/project/classipy/1.1.1/", "requires_dist": null, "requires_python": null, "summary": "a command-line based text classification tool", "version": "1.1.1" }, "last_serial": 1721685, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "7f97c0a1aea92350bcb7cb4b4deb4699", "sha256": "dd585fcec8e385dcf5e6f376b48e92522833ee05107f274a407ec120940f8a5d" }, "downloads": -1, "filename": "classipy-1.0.0.tar.gz", "has_sig": false, "md5_digest": "7f97c0a1aea92350bcb7cb4b4deb4699", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 30564, "upload_time": "2015-09-11T16:14:41", "url": 
"https://files.pythonhosted.org/packages/0b/73/50f37043eb5b4de210017ce72c93eccad6ae841ea00dd659a564db3fef1e/classipy-1.0.0.tar.gz" } ], "1.1.0": [ { "comment_text": "", "digests": { "md5": "3a37d92a0ea8d3abb246ad2c0dd5f358", "sha256": "cbb848e28b0c2e8bd24c0cc77c9961318362806c8df48b27c89e733d69ddc462" }, "downloads": -1, "filename": "classipy-1.1.0.tar.gz", "has_sig": false, "md5_digest": "3a37d92a0ea8d3abb246ad2c0dd5f358", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 31531, "upload_time": "2015-09-14T08:42:43", "url": "https://files.pythonhosted.org/packages/aa/69/dd879e6de429e229198249c4980448ec73fc107917506a731015662b4ba1/classipy-1.1.0.tar.gz" } ], "1.1.1": [ { "comment_text": "", "digests": { "md5": "d4b5bfff20024381b748cce3075d569a", "sha256": "88a32f7b80fd30941e740aebc08785e23d27028f92a2ced9d27cba8af16409c0" }, "downloads": -1, "filename": "classipy-1.1.1.tar.gz", "has_sig": false, "md5_digest": "d4b5bfff20024381b748cce3075d569a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 31763, "upload_time": "2015-09-14T11:21:11", "url": "https://files.pythonhosted.org/packages/ca/21/1bafcffffc39a2eb9ab667d1a003e1208e310c90ac21727b15d5d322e0de/classipy-1.1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "d4b5bfff20024381b748cce3075d569a", "sha256": "88a32f7b80fd30941e740aebc08785e23d27028f92a2ced9d27cba8af16409c0" }, "downloads": -1, "filename": "classipy-1.1.1.tar.gz", "has_sig": false, "md5_digest": "d4b5bfff20024381b748cce3075d569a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 31763, "upload_time": "2015-09-14T11:21:11", "url": "https://files.pythonhosted.org/packages/ca/21/1bafcffffc39a2eb9ab667d1a003e1208e310c90ac21727b15d5d322e0de/classipy-1.1.1.tar.gz" } ] }