{ "info": { "author": "Ed Collins", "author_email": "ed@wluper.com", "bugtrack_url": null, "classifiers": [], "description": "# Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks\n\n**Authors:** _Ed Collins, Nikolai Rozanov, Bingbing Zhang_\n\n**Contact:** _contact@wluper.com_\n\nIn the paper of the corresponding name, we discuss how we used an evolutionary algorithm to discover which statistics about a text classification dataset most accurately represent how difficult that dataset is likely to be for machine learning models to learn. We presented there the difficulty measure which we discovered and have provided this Python package of code which can calculate it.\n\n## Installation\n\nThis code is pip-installable so can be installed on your machine by running:\n\n`pip3 install edm`\n\nThe code requires Python 3 and NumPy.\n\nIt is recommended that you install this code in a `virtualenv`:\n\n```commandline\n$ mkdir myvirtualenv/\n$ virtualenv -p python3 myvirtualenv/\n$ source bin/activate\n(myvirtualenv) $ pip3 install edm\n```\n\n## Running\n\nTo calculate the difficulty of a text classification dataset, you will need to provide two lists: one of sentences and one of labels. These two lists need to be the same length - i.e. every sentence has a label. Each item of data should be an untokenized string and each label a string.\n\n```python\n>>> sents, labels = your_own_loading_function(PATH_TO_DATA_FILE)\n>>> sents\n[\"this is a positive sentence\", \"this is a negative sentence\", ...]\n>>> labels\n[\"positive\", \"negative\", ...]\n>>> assert len(sents) == len(labels)\nTrue\n```\n\nThis code does **not** support the loading of data files (e.g. csv files) into memory - you will need to do this separately.\n\nOnce you have loaded your dataset into memory, you can receive a \"difficulty report\" by running the code as follows:\n\n```python\nfrom edm import report\n\nsents, labels = your_own_loading_function(PATH_TO_DATA_FILE)\n\nprint(report.get_difficulty_report(sents, labels))\n```\n\nNote that if your dataset is very large, then counting the words of the dataset may take several minutes. The Amazon Reviews dataset from _Character-level Convolutional Networks for Text\nClassification_ by Xiang Zhang, Junbo Zhao and Yann LeCun, 2015 which contains 3.6 million Amazon reviews takes approximately 15 minutes to be processed and the difficulty report created. A loading bar will be displayed while the words are counted.", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Wluper/edm", "keywords": "", "license": "GPL v2", "maintainer": "", "maintainer_email": "", "name": "edm", "package_url": "https://pypi.org/project/edm/", "platform": "", "project_url": "https://pypi.org/project/edm/", "project_urls": { "Homepage": "https://github.com/Wluper/edm" }, "release_url": "https://pypi.org/project/edm/0.0.4/", "requires_dist": null, "requires_python": "~=3.3", "summary": "Tools for assessing the difficulty of datasets for machine learning models", "version": "0.0.4" }, "last_serial": 4442172, "releases": { "0.0.4": [ { "comment_text": "", "digests": { "md5": "15f9288552fbb45815137ac7113e0e80", "sha256": "3edbe341137e8270aedc69406bb4a107e51c65aadf443b83c7bcc5bbc2012c1e" }, "downloads": -1, "filename": "edm-0.0.4.tar.gz", "has_sig": false, "md5_digest": "15f9288552fbb45815137ac7113e0e80", "packagetype": "sdist", "python_version": "source", "requires_python": "~=3.3", "size": 9850, "upload_time": "2018-11-01T20:47:31", "url": "https://files.pythonhosted.org/packages/42/4b/ab24f5d58fa155a9cb9388681ec90fed70bc1ff4efea6b3964827354f601/edm-0.0.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "15f9288552fbb45815137ac7113e0e80", "sha256": "3edbe341137e8270aedc69406bb4a107e51c65aadf443b83c7bcc5bbc2012c1e" }, "downloads": -1, "filename": "edm-0.0.4.tar.gz", "has_sig": false, "md5_digest": "15f9288552fbb45815137ac7113e0e80", "packagetype": "sdist", "python_version": "source", "requires_python": "~=3.3", "size": 9850, "upload_time": "2018-11-01T20:47:31", "url": "https://files.pythonhosted.org/packages/42/4b/ab24f5d58fa155a9cb9388681ec90fed70bc1ff4efea6b3964827354f601/edm-0.0.4.tar.gz" } ] }