{ "info": { "author": "Cody Joe Gilbert", "author_email": "cody@codyjoe.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3.6", "Topic :: Scientific/Engineering :: Information Analysis" ], "description": "\n# FuzzyPanda\n\nFuzzyPanda was created to support fuzzy join operations with [Pandas]( https://pandas.pydata.org/ ) [DataFrames]( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html ) in Python 3. These fuzzy joins are a form of [approximate string matching]( https://en.wikipedia.org/wiki/Approximate_string_matching ) used to join relational data that contain \"errors\" or minor modifications that preclude direct string comparison. \n\nFuzzyPanda will match strings that\n\n1. Are within a user-specified [edit distance]( https://en.wikipedia.org/wiki/Edit_distance ) (e.g. \"test\" == \"taste\" with edit distance 2)\n2. Differ only in case (e.g. \"Test\" == \"test\")\n3. Contain the same whitespace-delimited tokens in a different order (e.g. \"dark and stormy night\" == \"stormy and dark night\")\n4. Differ only in special symbols (e.g. \"this-string\" == \"this string\")\n\nThe criteria in items 2-4 can be changed by customizing the `fuzzypanda.preprocess.PreProcessor` class. \n\nThe primary API is the `fuzzypanda.matching.get_fuzzy_columns` function that takes two Pandas DataFrames and a set of column names, and creates a new column in the \"left\" DataFrame that contains the closest entries by string edit distance to the associated values in the \"right\" DataFrame columns. 
The Pandas `merge` or `join` functions can later be used to perform full joins on the DataFrames.\n\n### Installation\n\nFuzzyPanda can be installed using `pip`:\n\n```shell\npip install fuzzypanda\n```\n\n### Usage\n\nThis version of FuzzyPanda currently supports the `fuzzypanda.matching.get_fuzzy_columns` function. More functions are expected in future releases.\n\n#### Create Fuzzy Matched Columns\n\nMain API for fuzzy joining the given `left_dataframe` and `right_dataframe`. Given a list of column names, this function adds fuzzy columns to the `left_dataframe` that best match the columns of the `right_dataframe`. This operation can then be followed up with a Pandas `merge` or `join` to perform the actual joining operation.\n\n* `fuzzypanda.matching.get_fuzzy_columns` Arguments:\n\t* `left_dataframe` (pandas.DataFrame): left Pandas dataframe to which columns will be added\n\t* `right_dataframe` (pandas.DataFrame): right Pandas dataframe whose values are compared against those in the `left_dataframe` and suggested as matches\n\t* `left_cols` (List(str)): A list of strings of column names present in `left_dataframe` that will be compared to the corresponding columns in `right_dataframe`.\n\t* `right_cols` (List(str)): A list of strings of column names present in `right_dataframe` used for comparison to those given for `left_dataframe`. If both dataframes share the column names on which fuzzy columns will be created, this parameter can be set to `None` and the values given in `left_cols` will be used as the names in both dataframes. Default is `None`.\n\t* `null_return` (string): The string used if a match isn't found. Can be used to set NULL values if a fuzzy match isn't found in the `right_dataframe`. Setting to `None` will return the string used to search for the fuzzy value. 
Default is `None`.\n\t* `preprocesser`: an instance of the `fuzzypanda.preprocess.PreProcessor` class containing the `preprocess` method used to pre-process the input strings. If set to `None`, will instantiate the default pre-processor. This option can be used to create a custom pre-processor to pass to the `get_fuzzy_columns` function. Default is `None`\n\t* `max_edit_distance` (int): The maximum edit distance that will be considered when comparing columns. The higher the number, the more \"incorrect\" the `left_dataframe` columns can be to be searched in the `right_dataframe` columns. Increasing this number heavily impacts runtime and should be set as low as possible. Default is 2.\n* Returns: Performs an in-place creation of fuzzy columns within `left_dataframe`. Each given left column in `left_cols` will have a `'fuzzy_' + left_col_name` corresponding to the matched column.\n\n#### get\\_fuzzy\\_columns Example \nSuppose you wish to join the following two dataframes on columns `col_1` and `col_2`, where the columns in `left_df` contain entries that are misspelled and/or jumbled tokens of those in `right_df`:\n\n```python\nprint(left_df)\n> ID col_1 col_2\n> 0 123314 kitten oboe\n> 1 123213 siting trvmpet\n> 2 43543 the times of best over te rainbow \n> 3 35435 the worst times in Symphony C \n> 4 987 not in there not in there\n\nprint(right_df)\n> ID col_1 col_2\n> 0 12783314 kitten oboe\n> 1 12352213 sitting trumpet\n> 2 43233543 the best of times over the rainbow\n> 3 23432420 the worst of times Symphony in C#\n```\n\nWe can now call `fuzzypanda.matching.get_fuzzy_columns`. 
Notice that the results are columns added to `left_df` in-place, rather than returning a new DataFrame.\n\n```python\nfuzzypanda.matching.get_fuzzy_columns(left_dataframe=left_df,\n \t\tright_dataframe=right_df,\n \t\tleft_cols=['col_1', 'col_2'])\n\nprint(left_df)\n> ID col_1 col_2 fuzzy_col_1 \\\n> 0 123314 kitten oboe kitten \n> 1 123213 siting trvmpet sitting \n> 2 43543 the times of best over te rainbow the best of times \n> 3 35435 the worst times in Symphony C the worst of times \n> 4 987 not in there not in there not in there\n> \n> fuzzy_col_2 \n> 0 oboe \n> 1 trumpet \n> 2 over the rainbow \n> 3 Symphony in C# \n> 4 not in there \n```\n\n### Methodology\n\nThis package uses the Symspell Python port [symspellpy by mammothb]( https://github.com/mammothb/symspellpy ) of the original C# implementation of [Symspell by Wolf Garbe]( https://github.com/wolfgarbe/SymSpell ). This fuzzy column creation approach applies a Pandas-friendly wrapper around the Symspell Symmetric Delete spelling correction algorithm to allow substantially faster fuzzy joining. Tools such as fuzzywuzzy will run in Omega(mn) to find the best-matching strings in a column of n values compared to the m values of another column, whereas this model is expected to have a runtime of Omega(m + n) because the right DataFrame columns are pre-processed into a spellchecker corpus that is searched using the Symmetric Delete spelling correction algorithm. \n\nThis method is best suited for fuzzy searches of large DataFrames, where the comparatively large amount of pre-processing is offset by faster search performance.\n\nThe algorithm operates as follows:\n\n1. A \"left\" Pandas DataFrame and a \"right\" Pandas DataFrame are input to `get_fuzzy_columns` with the column names used for comparison.\n2. Each right DataFrame is copied into a temporary corpus text file.\n3. 
Each entry in the corpus text file is preprocessed using either the default `fuzzypanda.preprocess.PreProcessor` or a user-supplied object containing a `preprocess` method and copied to another preprocessed text file. An in-memory index is created to map preprocessed strings back to their original strings.\n4. A symspellpy object is instantiated and the corpus file is used to create a lookup dictionary.\n5. Each record from the left DataFrame is preprocessed and queried from the dictionary using the `symspellpy.lookup` function to find the closest string in terms of edit distance, and the suggested string (or a substitute string if one isn't found) is placed in an intermediate list.\n6. When all records of the left DataFrame have been processed, a new column containing the results of the fuzzy lookup is added to the left DataFrame in a column labeled 'fuzzy_' + queried column name.\n\n### Future Work\n\n* Directly implement pandas `merge` and `join`\n* Replace `symspellpy` with a C++ implementation of Symspell to speed lookup calculations\n* Create option for multiprocessing and multithreading column record queries.\n* Add API to directly process CSV files\n* Add API to use Pandas DataFrame chunks\n* Expand functionality to use SparkSQL DataFrames\n\n\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/cody-joe-gilbert/fuzzypanda", "keywords": "fuzzy join pandas Symspell", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "fuzzypanda", "package_url": "https://pypi.org/project/fuzzypanda/", "platform": "", "project_url": "https://pypi.org/project/fuzzypanda/", "project_urls": { "Homepage": "https://github.com/cody-joe-gilbert/fuzzypanda" }, "release_url": "https://pypi.org/project/fuzzypanda/0.1.1/", "requires_dist": [ "symspellpy" ], "requires_python": ">=3", "summary": "Toolkit for performing fuzzy joins with 
Symspell framework", "version": "0.1.1" }, "last_serial": 5727883, "releases": { "0.1.1": [ { "comment_text": "", "digests": { "md5": "a47889eb537cb5b5e4a2d7cc1987ed70", "sha256": "99e4eb7776b85c1fc3dac027c95e091225281987f1c129fe4ef3749d92edfdce" }, "downloads": -1, "filename": "fuzzypanda-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "a47889eb537cb5b5e4a2d7cc1987ed70", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 12319, "upload_time": "2019-08-25T20:08:10", "url": "https://files.pythonhosted.org/packages/be/fe/7863f9566f8df73ac0b666bcac4fef8014581cdf40160dad19a2a62c380f/fuzzypanda-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "956d9c50b87c25a99f78cfc93f1ce904", "sha256": "58821907fb53a0758ddd63cbe0253516e4eead370d59799d46748719364a2137" }, "downloads": -1, "filename": "fuzzypanda-0.1.1.tar.gz", "has_sig": false, "md5_digest": "956d9c50b87c25a99f78cfc93f1ce904", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 13137, "upload_time": "2019-08-25T20:08:13", "url": "https://files.pythonhosted.org/packages/79/c9/5022f76fa9b6aa7695bfe126ef19740bf70eda0f88cfb4e665df13b9d3a8/fuzzypanda-0.1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "a47889eb537cb5b5e4a2d7cc1987ed70", "sha256": "99e4eb7776b85c1fc3dac027c95e091225281987f1c129fe4ef3749d92edfdce" }, "downloads": -1, "filename": "fuzzypanda-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "a47889eb537cb5b5e4a2d7cc1987ed70", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 12319, "upload_time": "2019-08-25T20:08:10", "url": "https://files.pythonhosted.org/packages/be/fe/7863f9566f8df73ac0b666bcac4fef8014581cdf40160dad19a2a62c380f/fuzzypanda-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "956d9c50b87c25a99f78cfc93f1ce904", "sha256": "58821907fb53a0758ddd63cbe0253516e4eead370d59799d46748719364a2137" }, "downloads": -1, 
"filename": "fuzzypanda-0.1.1.tar.gz", "has_sig": false, "md5_digest": "956d9c50b87c25a99f78cfc93f1ce904", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 13137, "upload_time": "2019-08-25T20:08:13", "url": "https://files.pythonhosted.org/packages/79/c9/5022f76fa9b6aa7695bfe126ef19740bf70eda0f88cfb4e665df13b9d3a8/fuzzypanda-0.1.1.tar.gz" } ] }