{
    "info": {
        "author": "Cody Joe Gilbert",
        "author_email": "cody@codyjoe.com",
        "bugtrack_url": null,
        "classifiers": [
            "Development Status :: 4 - Beta",
            "License :: OSI Approved :: MIT License",
            "Operating System :: OS Independent",
            "Programming Language :: Python :: 3.6",
            "Topic :: Scientific/Engineering :: Information Analysis"
        ],
        "description": "\n# FuzzyPanda\n\nFuzzyPanda was created to support fuzzy join operations with [Pandas]( https://pandas.pydata.org/ ) [DataFrames]( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html ) using Python 3. These fuzzy joins use a form of [approximate string matching]( https://en.wikipedia.org/wiki/Approximate_string_matching ) to join relational data that contain \"errors\" or minor modifications that preclude direct string comparison. \n\nFuzzyPanda will match strings that\n\n1. Are within a user-specified [edit distance]( https://en.wikipedia.org/wiki/Edit_distance ) (e.g. \"test\" == \"taste\" with edit distance 2)\n2. Differ only in case (e.g. \"Test\" == \"test\")\n3. Contain the same whitespace-delimited tokens in a different order (e.g. \"dark and stormy night\" == \"stormy and dark night\")\n4. Differ only in special symbols (e.g. \"this-string\" == \"this string\")\n\nThe criteria in items 2-4 can be changed by modifying the `fuzzypanda.preprocess.PreProcessor` class. \n\nThe primary API is the `fuzzypanda.matching.get_fuzzy_columns` function, which takes two Pandas DataFrames and a set of column names and creates a new column in the \"left\" DataFrame that contains the closest entries by string edit distance to the associated values in the \"right\" DataFrame columns. The Pandas `merge` or `join` functions can later be used to perform full joins on the DataFrames.\n\n### Installation\n\nFuzzyPanda can be installed using `pip`:\n\n```shell\npip install fuzzypanda\n```\n\n### Usage\n\nThis version of FuzzyPanda currently supports the `fuzzypanda.matching.get_fuzzy_columns` function. More functions are expected in future releases.\n\n#### Create Fuzzy Matched Columns\n\nThis is the main API for fuzzy joining the given `left_dataframe` and `right_dataframe`. 
Given a string or list of strings for the column arguments (`left_cols` and, optionally, `right_cols`), this function will add fuzzy columns to the `left_dataframe` that best match the columns of the `right_dataframe`. This operation can then be followed up with a Pandas `merge` or `join` to perform the actual joining operation.\n\n* `fuzzypanda.matching.get_fuzzy_columns` Arguments:\n\t* `left_dataframe` (pandas.DataFrame): left Pandas dataframe to which columns will be added\n\t* `right_dataframe` (pandas.DataFrame): right Pandas dataframe against which values in the `left_dataframe` are compared and from which fuzzy matches are suggested\n\t* `left_cols` (List(str)): A list of strings of column names present in `left_dataframe` that will be compared to the corresponding columns in `right_dataframe`.\n\t* `right_cols` (List(str)): A list of strings of column names present in `right_dataframe` used for comparison to those given in `left_cols`. If both dataframes share the column names on which fuzzy columns will be created, this parameter can be set to `None` and the values given in `left_cols` will be used as the names in both dataframes. Default is `None`.\n\t* `null_return` (string): The string used if a match isn't found. Can be used to set NULL values if a fuzzy match isn't found in the `right_dataframe`. Setting to `None` will return the string used to search for the fuzzy value. Default is `None`.\n\t* `preprocesser`: an instance of the `fuzzypanda.preprocess.PreProcessor` class containing the `preprocess` method used to pre-process the input strings. If set to `None`, will instantiate the default pre-processor. This option can be used to create a custom pre-processor to pass to the `get_fuzzy_columns` function. Default is `None`.\n\t* `max_edit_distance` (int): The maximum edit distance that will be considered when comparing columns. The higher the number, the more \"incorrect\" the `left_dataframe` values can be and still be matched in the `right_dataframe` columns. 
Increasing this number heavily impacts runtime, so it should be set as low as possible. Default is 2.\n* Returns: Creates the fuzzy columns in-place within `left_dataframe`. Each column named in `left_cols` gets a corresponding `'fuzzy_' + left_col_name` column holding the matched values.\n\n#### get\\_fuzzy\\_columns Example\n\nSuppose you wish to join the following two dataframes on columns `col_1` and `col_2`, where the columns in `left_df` contain entries that are misspelled and/or token-reordered versions of those in `right_df`:\n\n```python\nprint(left_df)\n>        ID              col_1            col_2\n> 0  123314             kitten             oboe\n> 1  123213             siting          trvmpet\n> 2   43543  the times of best  over te rainbow \n> 3   35435    the worst times    in Symphony C \n> 4     987       not in there     not in there\n\nprint(right_df)\n>          ID               col_1             col_2\n> 0  12783314              kitten              oboe\n> 1  12352213             sitting           trumpet\n> 2  43233543   the best of times  over the rainbow\n> 3  23432420  the worst of times    Symphony in C#\n```\n\nWe can now call `fuzzypanda.matching.get_fuzzy_columns`. 
Notice that the results are added to `left_df` as new columns in-place; no new DataFrame is returned.\n\n```python\nfuzzypanda.matching.get_fuzzy_columns(left_dataframe=left_df,\n                      \t\tright_dataframe=right_df,\n                      \t\tleft_cols=['col_1', 'col_2'])\n\nprint(left_df)\n>        ID              col_1            col_2         fuzzy_col_1 \\\n> 0  123314             kitten             oboe              kitten   \n> 1  123213             siting          trvmpet             sitting   \n> 2   43543  the times of best  over te rainbow   the best of times   \n> 3   35435    the worst times    in Symphony C  the worst of times   \n> 4     987       not in there     not in there        not in there\n> \n>         fuzzy_col_2  \n> 0              oboe  \n> 1           trumpet  \n> 2  over the rainbow  \n> 3    Symphony in C#  \n> 4      not in there \n```\n\n### Methodology\n\nThis package uses [symspellpy by mammothb]( https://github.com/mammothb/symspellpy ), a Python port of the original C# implementation of [Symspell by Wolf Garbe]( https://github.com/wolfgarbe/SymSpell ). This fuzzy column creation approach applies a Pandas-friendly wrapper around the Symspell Symmetric Delete spelling correction algorithm to allow substantially faster fuzzy joining. Tools such as fuzzywuzzy will run in Omega(mn) to find the best-matching strings in a column of n values compared to the m values of another column, whereas this model is expected to have a runtime of Omega(m + n) because the right DataFrame columns are pre-processed into a spellchecker corpus that is searched using the Symmetric Delete spelling correction algorithm. \n\nThis method is best suited for fuzzy searches of large DataFrames because it trades a comparatively large amount of pre-processing for faster search performance.\n\nThe algorithm operates as follows:\n\n1. 
A \"left\" Pandas DataFrame and a \"right\" Pandas DataFrame are input to `get_fuzzy_columns` with the column names used for comparison.\n2. Each right DataFrame column is copied into a temporary corpus text file.\n3. Each entry in the corpus text file is preprocessed using either the default `fuzzypanda.preprocess.PreProcessor` or a user-supplied object containing a `preprocess` method and copied to another preprocessed text file. An in-memory index is created to translate preprocessed strings back to their original strings.\n4. A symspellpy object is instantiated and the preprocessed corpus file is used to create a lookup dictionary.\n5. Each record from the left DataFrame is preprocessed and looked up in the dictionary using the `symspellpy.lookup` function to find the closest string in terms of edit distance, and the suggested string (or a substitute string if one isn't found) is placed in an intermediate list.\n6. When all records of the left DataFrame have been processed, a new column containing the results of the fuzzy lookup is added to the left DataFrame in a column labeled 'fuzzy_' + queried column name.\n\n### Future Work\n\n* Directly implement pandas `merge` and `join`\n* Replace `symspellpy` with a C++ implementation of Symspell to speed up lookup calculations\n* Create options for multiprocessing and multithreading of column record queries\n* Add API to directly process CSV files\n* Add API to use Pandas DataFrame chunks\n* Expand functionality to use SparkSQL DataFrames\n\n\n\n\n\n",
        "description_content_type": "text/markdown",
        "docs_url": null,
        "download_url": "",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/cody-joe-gilbert/fuzzypanda",
        "keywords": "fuzzy join pandas Symspell",
        "license": "MIT",
        "maintainer": "",
        "maintainer_email": "",
        "name": "fuzzypanda",
        "package_url": "https://pypi.org/project/fuzzypanda/",
        "platform": "",
        "project_url": "https://pypi.org/project/fuzzypanda/",
        "project_urls": {
            "Homepage": "https://github.com/cody-joe-gilbert/fuzzypanda"
        },
        "release_url": "https://pypi.org/project/fuzzypanda/0.1.1/",
        "requires_dist": [
            "symspellpy"
        ],
        "requires_python": ">=3",
        "summary": "Toolkit for performing fuzzy joins with Symspell framework",
        "version": "0.1.1"
    },
    "last_serial": 5727883,
    "releases": {
        "0.1.1": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "a47889eb537cb5b5e4a2d7cc1987ed70",
                    "sha256": "99e4eb7776b85c1fc3dac027c95e091225281987f1c129fe4ef3749d92edfdce"
                },
                "downloads": -1,
                "filename": "fuzzypanda-0.1.1-py3-none-any.whl",
                "has_sig": false,
                "md5_digest": "a47889eb537cb5b5e4a2d7cc1987ed70",
                "packagetype": "bdist_wheel",
                "python_version": "py3",
                "requires_python": ">=3",
                "size": 12319,
                "upload_time": "2019-08-25T20:08:10",
                "url": "https://files.pythonhosted.org/packages/be/fe/7863f9566f8df73ac0b666bcac4fef8014581cdf40160dad19a2a62c380f/fuzzypanda-0.1.1-py3-none-any.whl"
            },
            {
                "comment_text": "",
                "digests": {
                    "md5": "956d9c50b87c25a99f78cfc93f1ce904",
                    "sha256": "58821907fb53a0758ddd63cbe0253516e4eead370d59799d46748719364a2137"
                },
                "downloads": -1,
                "filename": "fuzzypanda-0.1.1.tar.gz",
                "has_sig": false,
                "md5_digest": "956d9c50b87c25a99f78cfc93f1ce904",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": ">=3",
                "size": 13137,
                "upload_time": "2019-08-25T20:08:13",
                "url": "https://files.pythonhosted.org/packages/79/c9/5022f76fa9b6aa7695bfe126ef19740bf70eda0f88cfb4e665df13b9d3a8/fuzzypanda-0.1.1.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "a47889eb537cb5b5e4a2d7cc1987ed70",
                "sha256": "99e4eb7776b85c1fc3dac027c95e091225281987f1c129fe4ef3749d92edfdce"
            },
            "downloads": -1,
            "filename": "fuzzypanda-0.1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a47889eb537cb5b5e4a2d7cc1987ed70",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3",
            "size": 12319,
            "upload_time": "2019-08-25T20:08:10",
            "url": "https://files.pythonhosted.org/packages/be/fe/7863f9566f8df73ac0b666bcac4fef8014581cdf40160dad19a2a62c380f/fuzzypanda-0.1.1-py3-none-any.whl"
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "956d9c50b87c25a99f78cfc93f1ce904",
                "sha256": "58821907fb53a0758ddd63cbe0253516e4eead370d59799d46748719364a2137"
            },
            "downloads": -1,
            "filename": "fuzzypanda-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "956d9c50b87c25a99f78cfc93f1ce904",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3",
            "size": 13137,
            "upload_time": "2019-08-25T20:08:13",
            "url": "https://files.pythonhosted.org/packages/79/c9/5022f76fa9b6aa7695bfe126ef19740bf70eda0f88cfb4e665df13b9d3a8/fuzzypanda-0.1.1.tar.gz"
        }
    ]
}