{
    "info": {
        "author": "Luke Whyte",
        "author_email": "lukeawhyte@gmail.com",
        "bugtrack_url": null,
        "classifiers": [
            "License :: OSI Approved :: MIT License",
            "Operating System :: OS Independent",
            "Programming Language :: Python :: 3"
        ],
        "description": "## What is this?\n\nTextPack efficiently groups similar values in large (or small) datasets. Under the hood, it builds a document term matrix of n-grams assigned a TF-IDF score. It then uses matrix multiplication to quickly calculate the cosine similarity between these values. For a technical explanation, [I wrote a blog post](https://medium.com/p/2493b3ce6d8d).\n\n## Why do I care?\n\nIf you're an analyst, journalist, data scientist or similar and have ever had a spreadsheet, SQL table or JSON string filled with inconsistent inputs like this:\n\n| row |     fullname      |\n|-----|-------------------|\n|   1 | John F. Doe       |\n|   2 | Esquivel, Mara    |\n|   3 | Doe, John F       |\n|   4 | Whyte, Luke       |\n|   5 | Doe, John Francis |\n\nAnd you've wanted to perform some kind of analysis \u2013 perhaps in a Pivot Table or a Group By statement \u2013 but are hindered by the deviations in spelling and formatting, you can use TextPack to comb hundreds to thousands of cells in seconds and create a third column like this:\n\n| row |     fullname      |  name_groups  |\n|-----|-------------------|---------------|\n|   1 | John F. Doe       | Doe John F    |\n|   2 | Esquivel, Mara    | Esquivel Mara |\n|   3 | Doe, John F       | Doe John F    |\n|   4 | Whyte, Luke       | Whyte Luke    |\n|   5 | Doe, John Francis | Doe John F    |\n\nWe can then group by `name_groups` and perform our analysis. \n\nYou can also group across multiple columns. 
For instance, given the following:\n\n| row |  make  |   model   |\n|-----|--------|-----------|\n|   1 | Toyota | Camry     |\n|   2 | toyta  | camry DXV |\n|   3 | Ford   | F-150     |\n|   4 | Toyota | Tundra    |\n|   5 | Honda  | Accord    |\n\nYou can group across `make` and `model` to create:\n\n| row |  make  |   model   |  car_groups  |\n|-----|--------|-----------|--------------|\n|   1 | Toyota | Camry     | toyotacamry  |\n|   2 | toyta  | camry DXV | toyotacamry  |\n|   3 | Ford   | F-150     | fordf150     |\n|   4 | Toyota | Tundra    | toyotatundra |\n|   5 | Honda  | Accord    | hondaaccord  |\n\nBoom.\n\n## How do I use it?\n\n#### Installation\n\n```\npip install textpack\n```\n\n#### Import module\n\n```\nfrom textpack import tp\n```\n\n#### Instantiate TextPack\n\n```\ntp.TextPack(df, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)\n```\n\nClass parameters:\n\n - `df` (required): A Pandas DataFrame containing the dataset to group\n - `columns_to_group` (required): A list or string matching the column headers you'd like to parse and group\n - `match_threshold` (optional): A floating point number between 0 and 1 that represents the cosine similarity threshold we'll use to determine if two strings should be grouped. The closer the threshold is to 1, the higher the similarity will need to be.\n - `ngram_remove` (optional): A regular expression you can use to filter characters out of your strings when we build our n-grams\n - `ngram_length` (optional): The length of our n-grams. This can be used in tandem with `match_threshold` to find the sweet spot for grouping your dataset. If TextPack is running slowly, it's usually a sign to consider raising the n-gram length.\n\nTextPack can also be instantiated using the following helpers, each of which is just a wrapper that converts a data format to a Pandas DataFrame and then passes it to TextPack. 
Thus, they all require a file path and `columns_to_group`, and take the same three optional parameters as calling `TextPack` directly.\n\n```\ntp.read_csv(csv_path, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)\n```\n\n```\ntp.read_excel(excel_path, columns_to_group, sheet_name=None, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)\n```\n\n```\ntp.read_json(json_path, columns_to_group, match_threshold=0.75, ngram_remove=r'[,-./]', ngram_length=3)\n```\n\n#### Run TextPack and group values\n\nTextPack objects have the following public properties:\n\n - `df`: The dataframe used internally by TextPack \u2013 manipulate as you see fit\n - `group_lookup`: A Python dictionary built by `build_group_lookup` and then used by `add_grouped_column_to_data` to look up each value that has a group. It looks like this: \n\n```\n{ \n    'John F. Doe': 'Doe John F',\n    'Doe, John F': 'Doe John F',\n    'Doe, John Francis': 'Doe John F'\n}\n```\n\nTextPack objects also have the following public methods:\n\n - `build_group_lookup()`: Runs the cosine similarity analysis and builds `group_lookup`.\n - `add_grouped_column_to_data(column_name='Group')`: Uses vectorization to map values to groups via `group_lookup` and add the new `Group` column to the DataFrame. 
The column header can be set via `column_name`.\n - `set_match_threshold(match_threshold)`: Modify the match threshold internally.\n - `set_ngram_remove(ngram_remove)`: Modify the n-gram regex filter internally.\n - `set_ngram_length(ngram_length)`: Modify the n-gram length internally.\n - `run(column_name='Group')`: A helper function that calls `build_group_lookup` followed by `add_grouped_column_to_data`.\n\n#### Export our grouped dataset\n\n - `export_json(export_path)`\n - `export_csv(export_path)`\n\n#### A simple example\n\n```\nfrom textpack import tp\n\ncars = tp.read_csv('./cars.csv', ['make', 'model'], match_threshold=0.8, ngram_length=5)\n\ncars.run()\n\ncars.export_csv('./cars-grouped.csv')\n```\n\n## How does it work?\n\nAs mentioned above, under the hood, we're building a document term matrix of n-grams assigned a TF-IDF score. We're then using matrix multiplication to quickly calculate the cosine similarity between these values.\n\nI wrote [this detailed blog post](https://medium.com/p/2493b3ce6d8d) to explain how TextPack works behind the scenes and why it's fast. Check it out!\n\n",
        "description_content_type": "text/markdown",
        "docs_url": null,
        "download_url": "",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/lukewhyte/textpack",
        "keywords": "",
        "license": "",
        "maintainer": "",
        "maintainer_email": "",
        "name": "textpack",
        "package_url": "https://pypi.org/project/textpack/",
        "platform": "",
        "project_url": "https://pypi.org/project/textpack/",
        "project_urls": {
            "Homepage": "https://github.com/lukewhyte/textpack"
        },
        "release_url": "https://pypi.org/project/textpack/0.0.7/",
        "requires_dist": [
            "pandas",
            "sklearn",
            "scipy",
            "numpy",
            "cython",
            "sparse-dot-topn"
        ],
        "requires_python": "",
        "summary": "Quickly identify and group similar text strings in a large dataset",
        "version": "0.0.7"
    },
    "last_serial": 5569070,
    "releases": {
        "0.0.5": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "d00fae6972d43f7322b83a791afef66c",
                    "sha256": "b77897b286cb1fd1ee12bb846407ee5c2633e3fd43fee3e0e21f1b489eb5d126"
                },
                "downloads": -1,
                "filename": "textpack-0.0.5-py3-none-any.whl",
                "has_sig": false,
                "md5_digest": "d00fae6972d43f7322b83a791afef66c",
                "packagetype": "bdist_wheel",
                "python_version": "py3",
                "requires_python": null,
                "size": 5573,
                "upload_time": "2019-07-12T23:21:22",
                "url": "https://files.pythonhosted.org/packages/4a/48/4ca33a823c0a7048a3258193453299081caa20ff4e8fd6c12cc84a4a39b5/textpack-0.0.5-py3-none-any.whl"
            },
            {
                "comment_text": "",
                "digests": {
                    "md5": "d19f6be55a1d7f2acbcae25241514592",
                    "sha256": "d5c147cff0352eaad83aff992e5bd0e45ea32f9461126be813d226c9d6ecbd53"
                },
                "downloads": -1,
                "filename": "textpack-0.0.5.tar.gz",
                "has_sig": false,
                "md5_digest": "d19f6be55a1d7f2acbcae25241514592",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 4673,
                "upload_time": "2019-07-12T23:21:24",
                "url": "https://files.pythonhosted.org/packages/86/b6/da8b2ccc4abd48250e78d62834cab212d117d0975707a740ea12024d55aa/textpack-0.0.5.tar.gz"
            }
        ],
        "0.0.6": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "07cb7a84c6e6dcfa58bc9a26299f8043",
                    "sha256": "c95ebe3d9eb347f32ae2383f1f7ebfe2ee3cc9c7b55adaac1dec493350d095b8"
                },
                "downloads": -1,
                "filename": "textpack-0.0.6-py3-none-any.whl",
                "has_sig": false,
                "md5_digest": "07cb7a84c6e6dcfa58bc9a26299f8043",
                "packagetype": "bdist_wheel",
                "python_version": "py3",
                "requires_python": null,
                "size": 5585,
                "upload_time": "2019-07-22T20:50:33",
                "url": "https://files.pythonhosted.org/packages/5b/86/6a40975dd7ff8ec66621de3b8a8b184dbd0a05f334fd176313c8e580bb1b/textpack-0.0.6-py3-none-any.whl"
            },
            {
                "comment_text": "",
                "digests": {
                    "md5": "1399483c3af6e1b8bb6a131ce0242781",
                    "sha256": "b9c927b3bfa203f9c4986d57b8efb91156995f74d7a5b16e1b68d05081546387"
                },
                "downloads": -1,
                "filename": "textpack-0.0.6.tar.gz",
                "has_sig": false,
                "md5_digest": "1399483c3af6e1b8bb6a131ce0242781",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 4687,
                "upload_time": "2019-07-22T20:50:35",
                "url": "https://files.pythonhosted.org/packages/2b/a5/fd95ebb61d1c5202b0f50264c4c7fd0d8879e8d2b0fcf9a0c69942d42152/textpack-0.0.6.tar.gz"
            }
        ],
        "0.0.7": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "864bff466515d0f4a4d31d15aa94be8a",
                    "sha256": "9ffb85e95ad143bb8341d265c3f655180a956d33f4dead2a1729c6ff94f9a135"
                },
                "downloads": -1,
                "filename": "textpack-0.0.7-py3-none-any.whl",
                "has_sig": false,
                "md5_digest": "864bff466515d0f4a4d31d15aa94be8a",
                "packagetype": "bdist_wheel",
                "python_version": "py3",
                "requires_python": null,
                "size": 5584,
                "upload_time": "2019-07-22T20:52:19",
                "url": "https://files.pythonhosted.org/packages/b8/4f/7a3440393c55d58ad5a5573937122de6249e995a6b257a47a653e37a9351/textpack-0.0.7-py3-none-any.whl"
            },
            {
                "comment_text": "",
                "digests": {
                    "md5": "58b9ec48e03830786d8f318101a45f1f",
                    "sha256": "3256dc7a84a5a0ef77b8184b52dd6c8ee2f163756b2164e3a385248fe8edd9f5"
                },
                "downloads": -1,
                "filename": "textpack-0.0.7.tar.gz",
                "has_sig": false,
                "md5_digest": "58b9ec48e03830786d8f318101a45f1f",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 4688,
                "upload_time": "2019-07-22T20:52:20",
                "url": "https://files.pythonhosted.org/packages/45/49/deee7d2bd5a2b0e7746ff5a8e641556c5da2851056444b401db660747e05/textpack-0.0.7.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "864bff466515d0f4a4d31d15aa94be8a",
                "sha256": "9ffb85e95ad143bb8341d265c3f655180a956d33f4dead2a1729c6ff94f9a135"
            },
            "downloads": -1,
            "filename": "textpack-0.0.7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "864bff466515d0f4a4d31d15aa94be8a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 5584,
            "upload_time": "2019-07-22T20:52:19",
            "url": "https://files.pythonhosted.org/packages/b8/4f/7a3440393c55d58ad5a5573937122de6249e995a6b257a47a653e37a9351/textpack-0.0.7-py3-none-any.whl"
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "58b9ec48e03830786d8f318101a45f1f",
                "sha256": "3256dc7a84a5a0ef77b8184b52dd6c8ee2f163756b2164e3a385248fe8edd9f5"
            },
            "downloads": -1,
            "filename": "textpack-0.0.7.tar.gz",
            "has_sig": false,
            "md5_digest": "58b9ec48e03830786d8f318101a45f1f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 4688,
            "upload_time": "2019-07-22T20:52:20",
            "url": "https://files.pythonhosted.org/packages/45/49/deee7d2bd5a2b0e7746ff5a8e641556c5da2851056444b401db660747e05/textpack-0.0.7.tar.gz"
        }
    ]
}