{ "info": { "author": "Guilherme Routar", "author_email": "groutar@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# Twikenizer\n\nThis repository hosts the code for a tokenizer of tweets. Its main purpose is to identify subtle profanity, so it should\nobtain better performance on data containing hidden profanity (e.g. 'f*ck').\n\nDisclaimer: The following paragraphs may contain profanity.\n\n## Description\n\nPython offers a set of sentence tokenizers for different purposes: NLTK's word tokenizer, spaCy's, scikit-learn's default and \nTweetTokenizer, among others. All but TweetTokenizer disregard hashtags and mentions by separating the symbols from the rest of the token(s).\nAlthough TweetTokenizer considers the Twitter *dialect*, it fails to tokenize subtle hidden profanity.\n\nFor the word ```f*ck```, TweetTokenizer produces the tokens ```[f, *, ck]```. The word ```g@y``` is tokenized as ```[g, @y]```, considering \na single token ```g``` and a wrongly identified mention ```@y```. While the hashtag ```#hash_tag``` is correctly tokenized as \n```[#hash_tag]```, *regular* tokens are not underscore separated: ```love_twitter``` is tokenized as ```['love_twitter']``` instead of ```['love', '_', 'twitter']```.\n\nTwikenizer was created to enable proper identification of hidden profane words, considering the features detailed above. Applying distance-related features, i.e. 
Levenshtein distance to slang words, should output better results using this tokenizer.\n\n## Installation\n\n**Using pip**\n\npip install twikenizer\n\n**Clone repository**\n\ngit clone https://github.com/Guilherme-Routar/Twikenizer.git\n\n## Usage\n\n```python\n> import twikenizer as twk\n> twk = twk.Twikenizer()\n> tweet = 'This is an #hashtag'\n> twk.tokenize(tweet)\n['This', 'is', 'an', '#hashtag']\n```\n\nTwikenizer has a built-in function ```examplify``` which demonstrates how it tokenizes different kinds of words/tokens.\n\n```python\n> twk.examplify()\nGenerated tweet\n###############\nTw33t # @dude_really #hash_tag $hit (g@y) retard#d @dude. \ud83d\ude00\ud83d\ude00 !\ud83d\ude00abc %\ud83d\ude00lol #hateit #hate.it $%&/ f*ck-\n\nGenerated tokens\n################\n['Tw33t', '#', '@dude_really', '#hash_tag', '$hit', '(', 'g', '@', 'y', ')', 'retard#d', '@dude', '.', '\ud83d\ude00', '\ud83d\ude00', '!', '\ud83d\ude00', 'abc', '%', '\ud83d\ude00', 'lol', '#hateit', '#hate', '.', 'it', '$', '%', '&', '/', 'f*ck', '-']\n```", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Guilherme-Routar/Twikenizer", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "twikenizer", "package_url": "https://pypi.org/project/twikenizer/", "platform": "", "project_url": "https://pypi.org/project/twikenizer/", "project_urls": { "Homepage": "https://github.com/Guilherme-Routar/Twikenizer" }, "release_url": "https://pypi.org/project/twikenizer/1.0/", "requires_dist": null, "requires_python": "", "summary": "Tokenizer for Twitter comments (tweets)", "version": "1.0" }, "last_serial": 4666819, "releases": { "1.0": [ { "comment_text": "", "digests": { "md5": "62806ede5e47dcac792aedd4fd321a9c", "sha256": "678d7fc2adef86f6e9e2693c8710ef31b76a342923558a37d87ec09f8f97a33f" }, "downloads": -1, "filename": 
"twikenizer-1.0.tar.gz", "has_sig": false, "md5_digest": "62806ede5e47dcac792aedd4fd321a9c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4434, "upload_time": "2019-01-06T23:22:07", "url": "https://files.pythonhosted.org/packages/d2/51/7aee33630b948f0716efae7a96c4fd8f859b348694058c380fd899a4227e/twikenizer-1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "62806ede5e47dcac792aedd4fd321a9c", "sha256": "678d7fc2adef86f6e9e2693c8710ef31b76a342923558a37d87ec09f8f97a33f" }, "downloads": -1, "filename": "twikenizer-1.0.tar.gz", "has_sig": false, "md5_digest": "62806ede5e47dcac792aedd4fd321a9c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4434, "upload_time": "2019-01-06T23:22:07", "url": "https://files.pythonhosted.org/packages/d2/51/7aee33630b948f0716efae7a96c4fd8f859b348694058c380fd899a4227e/twikenizer-1.0.tar.gz" } ] }