{ "info": { "author": "Luke Whyte", "author_email": "lukeawhyte@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "## What is this?\n\nTextPack efficiently groups similar values in large (or small) datasets. Under the hood, it builds a document term matrix of n-grams assigned a TF-IDF score. It then uses matrix multiplication to quickly calculate the cosine similarity between these values. For a technical explanation, [I wrote a blog post](https://medium.com/p/2493b3ce6d8d).\n\n## Why do I care?\n\nIf you're an analyst, journalist, data scientist or similar and have ever had a spreadsheet, SQL table or JSON string filled with inconsistent inputs like this:\n\n| row | fullname |\n|-----|-------------------|\n| 1 | John F. Doe |\n| 2 | Esquivel, Mara |\n| 3 | Doe, John F |\n| 4 | Whyte, Luke |\n| 5 | Doe, John Francis |\n\nAnd you've wanted to perform some kind of analysis \u2013 perhaps in a Pivot Table or a Group By statement \u2013 but are hindered by the deviations in spelling and formatting, you can use TextPack to comb hundreds to thousands of cells in seconds and create a third column like this:\n\n| row | fullname | name_groups |\n|-----|-------------------|---------------|\n| 1 | John F. Doe | Doe John F |\n| 2 | Esquivel, Mara | Esquivel Mara |\n| 3 | Doe, John F | Doe John F |\n| 4 | Whyte, Luke | Whyte Luke |\n| 5 | Doe, John Francis | Doe John F |\n\nWe can then group by `name_groups` and perform our analysis. \n\nYou can also group across multiple columns. 
For instance, given the following:\n\n| row | make | model |\n|-----|--------|-----------|\n| 1 | Toyota | Camry |\n| 2 | toyta | camry DXV |\n| 3 | Ford | F-150 |\n| 4 | Toyota | Tundra |\n| 5 | Honda | Accord |\n\nYou can group across `make` and `model` to create:\n\n| row | make | model | car_groups |\n|-----|--------|-----------|--------------|\n| 1 | Toyota | Camry | toyotacamry |\n| 2 | toyta | camry DXV | toyotacamry |\n| 3 | Ford | F-150 | fordf150 |\n| 4 | Toyota | Tundra | toyotatundra |\n| 5 | Honda | Accord | hondaaccord |\n\nBoom.\n\n## How do I use it?\n\n#### Installation\n\n```\npip install textpack\n```\n\n#### Import module\n\n```\nfrom textpack import tp\n```\n\n#### Instantiate TextPack\n\n```\ntp.TextPack(df, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)\n```\n\nClass parameters:\n\n - `df` (required): A pandas DataFrame containing the dataset to group\n - `columns_to_group` (required): A list or string matching the column headers you'd like to parse and group\n - `match_threshold` (optional): This is a floating point number between 0 and 1 that represents the cosine similarity threshold we'll use to determine if two strings should be grouped. The closer the threshold to 1, the higher the similarity will need to be.\n - `ngram_remove` (optional): A regular expression you can use to filter characters out of your strings when we build our n-grams\n - `ngram_length` (optional): The length of our n-grams. This can be used in tandem with `match_threshold` to find the sweet spot for grouping your dataset. If TextPack is running slowly, it's usually a sign to consider raising the n-gram length.\n\nTextPack can also be instantiated using the following helpers, each of which is just a wrapper that converts a data format to a pandas DataFrame and then passes it to TextPack. 
Thus, they all require a file path, `columns_to_group` and take the same three optional parameters as calling `TextPack` directly.\n\n```\ntp.read_csv(csv_path, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)\n```\n\n```\ntp.read_excel(excel_path, columns_to_group, sheet_name=None, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)\n```\n\n```\ntp.read_json(json_path, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)\n```\n\n#### Run TextPack and group values\n\nTextPack objects have the following public properties:\n\n - `df`: The dataframe used internally by TextPack \u2013 manipulate as you see fit\n - `group_lookup`: A Python dictionary built by `build_group_lookup` and then used by `add_grouped_column_to_data` to look up each value that has a group. It looks like this: \n\n```\n{ \n 'John F. Doe': 'Doe John F',\n 'Doe, John F': 'Doe John F',\n 'Doe, John Francis': 'Doe John F'\n}\n```\n\nTextPack objects also have the following public methods:\n\n - `build_group_lookup()`: Runs the cosine similarity analysis and builds `group_lookup`.\n - `add_grouped_column_to_data(column_name='Group')`: Uses vectorization to map values to groups via `group_lookup` and add the new `Group` column to the DataFrame. 
The column header can be set via `column_name`.\n - `set_match_threshold(match_threshold)`: Modify the match threshold internally.\n - `set_ngram_remove(ngram_remove)`: Modify the n-gram regex filter internally.\n - `set_ngram_length(ngram_length)`: Modify the n-gram length internally.\n - `run(column_name='Group')`: A helper function that calls `build_group_lookup` followed by `add_grouped_column_to_data`.\n\n#### Export our grouped dataset\n\n - `export_json(export_path)`\n - `export_csv(export_path)`\n\n#### A simple example\n\n```\nfrom textpack import tp\n\ncars = tp.read_csv('./cars.csv', ['make', 'model'], match_threshold=0.8, ngram_length=5)\n\ncars.run()\n\ncars.export_csv('./cars-grouped.csv')\n```\n\n## How does it work?\n\nAs mentioned above, under the hood, we're building a document term matrix of n-grams assigned a TF-IDF score. We're then using matrix multiplication to quickly calculate the cosine similarity between these values.\n\nI wrote [this detailed blog post](https://medium.com/p/2493b3ce6d8d) to explain how TextPack works behind the scenes and why it's fast. 
Check it out!\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/lukewhyte/textpack", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "textpack", "package_url": "https://pypi.org/project/textpack/", "platform": "", "project_url": "https://pypi.org/project/textpack/", "project_urls": { "Homepage": "https://github.com/lukewhyte/textpack" }, "release_url": "https://pypi.org/project/textpack/0.0.7/", "requires_dist": [ "pandas", "sklearn", "scipy", "numpy", "cython", "sparse-dot-topn" ], "requires_python": "", "summary": "Quickly identify and group similar text strings in a large dataset", "version": "0.0.7" }, "last_serial": 5569070, "releases": { "0.0.5": [ { "comment_text": "", "digests": { "md5": "d00fae6972d43f7322b83a791afef66c", "sha256": "b77897b286cb1fd1ee12bb846407ee5c2633e3fd43fee3e0e21f1b489eb5d126" }, "downloads": -1, "filename": "textpack-0.0.5-py3-none-any.whl", "has_sig": false, "md5_digest": "d00fae6972d43f7322b83a791afef66c", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5573, "upload_time": "2019-07-12T23:21:22", "url": "https://files.pythonhosted.org/packages/4a/48/4ca33a823c0a7048a3258193453299081caa20ff4e8fd6c12cc84a4a39b5/textpack-0.0.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d19f6be55a1d7f2acbcae25241514592", "sha256": "d5c147cff0352eaad83aff992e5bd0e45ea32f9461126be813d226c9d6ecbd53" }, "downloads": -1, "filename": "textpack-0.0.5.tar.gz", "has_sig": false, "md5_digest": "d19f6be55a1d7f2acbcae25241514592", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4673, "upload_time": "2019-07-12T23:21:24", "url": "https://files.pythonhosted.org/packages/86/b6/da8b2ccc4abd48250e78d62834cab212d117d0975707a740ea12024d55aa/textpack-0.0.5.tar.gz" } ], "0.0.6": [ { "comment_text": "", 
"digests": { "md5": "07cb7a84c6e6dcfa58bc9a26299f8043", "sha256": "c95ebe3d9eb347f32ae2383f1f7ebfe2ee3cc9c7b55adaac1dec493350d095b8" }, "downloads": -1, "filename": "textpack-0.0.6-py3-none-any.whl", "has_sig": false, "md5_digest": "07cb7a84c6e6dcfa58bc9a26299f8043", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5585, "upload_time": "2019-07-22T20:50:33", "url": "https://files.pythonhosted.org/packages/5b/86/6a40975dd7ff8ec66621de3b8a8b184dbd0a05f334fd176313c8e580bb1b/textpack-0.0.6-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1399483c3af6e1b8bb6a131ce0242781", "sha256": "b9c927b3bfa203f9c4986d57b8efb91156995f74d7a5b16e1b68d05081546387" }, "downloads": -1, "filename": "textpack-0.0.6.tar.gz", "has_sig": false, "md5_digest": "1399483c3af6e1b8bb6a131ce0242781", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4687, "upload_time": "2019-07-22T20:50:35", "url": "https://files.pythonhosted.org/packages/2b/a5/fd95ebb61d1c5202b0f50264c4c7fd0d8879e8d2b0fcf9a0c69942d42152/textpack-0.0.6.tar.gz" } ], "0.0.7": [ { "comment_text": "", "digests": { "md5": "864bff466515d0f4a4d31d15aa94be8a", "sha256": "9ffb85e95ad143bb8341d265c3f655180a956d33f4dead2a1729c6ff94f9a135" }, "downloads": -1, "filename": "textpack-0.0.7-py3-none-any.whl", "has_sig": false, "md5_digest": "864bff466515d0f4a4d31d15aa94be8a", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5584, "upload_time": "2019-07-22T20:52:19", "url": "https://files.pythonhosted.org/packages/b8/4f/7a3440393c55d58ad5a5573937122de6249e995a6b257a47a653e37a9351/textpack-0.0.7-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "58b9ec48e03830786d8f318101a45f1f", "sha256": "3256dc7a84a5a0ef77b8184b52dd6c8ee2f163756b2164e3a385248fe8edd9f5" }, "downloads": -1, "filename": "textpack-0.0.7.tar.gz", "has_sig": false, "md5_digest": "58b9ec48e03830786d8f318101a45f1f", "packagetype": "sdist", 
"python_version": "source", "requires_python": null, "size": 4688, "upload_time": "2019-07-22T20:52:20", "url": "https://files.pythonhosted.org/packages/45/49/deee7d2bd5a2b0e7746ff5a8e641556c5da2851056444b401db660747e05/textpack-0.0.7.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "864bff466515d0f4a4d31d15aa94be8a", "sha256": "9ffb85e95ad143bb8341d265c3f655180a956d33f4dead2a1729c6ff94f9a135" }, "downloads": -1, "filename": "textpack-0.0.7-py3-none-any.whl", "has_sig": false, "md5_digest": "864bff466515d0f4a4d31d15aa94be8a", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5584, "upload_time": "2019-07-22T20:52:19", "url": "https://files.pythonhosted.org/packages/b8/4f/7a3440393c55d58ad5a5573937122de6249e995a6b257a47a653e37a9351/textpack-0.0.7-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "58b9ec48e03830786d8f318101a45f1f", "sha256": "3256dc7a84a5a0ef77b8184b52dd6c8ee2f163756b2164e3a385248fe8edd9f5" }, "downloads": -1, "filename": "textpack-0.0.7.tar.gz", "has_sig": false, "md5_digest": "58b9ec48e03830786d8f318101a45f1f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4688, "upload_time": "2019-07-22T20:52:20", "url": "https://files.pythonhosted.org/packages/45/49/deee7d2bd5a2b0e7746ff5a8e641556c5da2851056444b401db660747e05/textpack-0.0.7.tar.gz" } ] }