{ "info": { "author": "Richard Townsend", "author_email": "richard@sentimentron.co.uk", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Environment :: Console", "Intended Audience :: Developers", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "Programming Language :: Python", "Programming Language :: Python :: 3" ], "description": "ark-twokenize-py\n================\n\nThis is a crude Python port of the [Twokenize class from ark-tweet-nlp](https://github.com/brendano/ark-tweet-nlp/blob/master/src/cmu/arktweetnlp/Twokenize.java).\n\nIt produces nearly identical output to the original Java tokenizer, except in a\nfew infrequent situations. In particular, Python does not support partial\ncase-insensitivity in regular expressions, which causes some tokenization\ndifferences for \"Eastern\"-style emoticons, particularly when the left and right\nhalves are of different cases. For example:\n\n Java (original): v.V\n Python (port): v . V\n\nEmoticons of this kind appear to be rare. Nevertheless, I have included\na fix for one special case:\n\n Java (original): o.O\n Python (port, w/o fix): o . O\n Python (port, w/ fix): o.O\n\nEvaluation\n----------\n\nA comparison over 1 million tweets found 83 instances (0.0083%) where tokenization\ndiffered between the original Java version and this Python port. The differences\nwere primarily related to the emoticon issue discussed above, and it was not\nclear in general which output was more desirable. 
For example:\n\n Text:\n Profit-Taking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets\n\n Java (original):\n Profi t-T aking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets\n\n Python (port):\n Profit-Taking Hits Nikkei http://t.co/hVWpiDQ1 http://t.co/xJSPwE2z RT @WSJmarkets\n\nUsage\n-----\n >>> import twokenize\n >>> twokenize.tokenizeRawTweetText(\"lol ly x0x0,:D\")\n ['lol', 'ly', 'x0x0', ',', ':D']\n\nInstallation\n------------\n\n pip install twokenize\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Sentimentron/ark-twokenize-py", "keywords": "tokenizer", "license": "GPLv3", "maintainer": "", "maintainer_email": "", "name": "twokenize", "package_url": "https://pypi.org/project/twokenize/", "platform": "", "project_url": "https://pypi.org/project/twokenize/", "project_urls": { "Homepage": "https://github.com/Sentimentron/ark-twokenize-py" }, "release_url": "https://pypi.org/project/twokenize/1.0.0/", "requires_dist": null, "requires_python": "", "summary": "Word segmentation / tokenization focussed on Twitter", "version": "1.0.0" }, "last_serial": 3939981, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "114fec2b40dcfedf301ae7318fda9531", "sha256": "77d59ad045eb8289086a4e9e13a053bc04236eb3ad78f61f2995e19c02621cb7" }, "downloads": -1, "filename": "twokenize-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "114fec2b40dcfedf301ae7318fda9531", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 8455, "upload_time": "2018-06-07T15:03:32", "url": "https://files.pythonhosted.org/packages/3e/7c/8874d719de00a1da753d21733a4a3043f67b84dccad525ab442fdd572617/twokenize-1.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "2778c0c5dc870e5c70324dad6eef20da", "sha256": 
"d121ea00caa1c086821391b860140f46fd41761073c875eda711ddaca7677dbe" }, "downloads": -1, "filename": "twokenize-1.0.0.tar.gz", "has_sig": false, "md5_digest": "2778c0c5dc870e5c70324dad6eef20da", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8154, "upload_time": "2018-06-07T15:03:34", "url": "https://files.pythonhosted.org/packages/69/e7/c51379ef276432b3f92691e1b49596885708b34ebcc7975d9b681805a5ab/twokenize-1.0.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "114fec2b40dcfedf301ae7318fda9531", "sha256": "77d59ad045eb8289086a4e9e13a053bc04236eb3ad78f61f2995e19c02621cb7" }, "downloads": -1, "filename": "twokenize-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "114fec2b40dcfedf301ae7318fda9531", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 8455, "upload_time": "2018-06-07T15:03:32", "url": "https://files.pythonhosted.org/packages/3e/7c/8874d719de00a1da753d21733a4a3043f67b84dccad525ab442fdd572617/twokenize-1.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "2778c0c5dc870e5c70324dad6eef20da", "sha256": "d121ea00caa1c086821391b860140f46fd41761073c875eda711ddaca7677dbe" }, "downloads": -1, "filename": "twokenize-1.0.0.tar.gz", "has_sig": false, "md5_digest": "2778c0c5dc870e5c70324dad6eef20da", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8154, "upload_time": "2018-06-07T15:03:34", "url": "https://files.pythonhosted.org/packages/69/e7/c51379ef276432b3f92691e1b49596885708b34ebcc7975d9b681805a5ab/twokenize-1.0.0.tar.gz" } ] }