{ "info": { "author": "Keith Lyons", "author_email": "lyonk71@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# pandas-dedupe\nThe Dedupe library made easy with Pandas.\n\n# Installation\n\npip install pandas-dedupe\n\n# Video Tutorials\n\n[Basic Deduplication](https://www.youtube.com/watch?v=lCFEzRaqoJA)\n\n# Basic Usage\n\n### Deduplication\n\n import pandas as pd\n import pandas_dedupe\n\n #load dataframe\n df = pd.read_csv('test_names.csv')\n\n #initiate deduplication\n df_final = pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'middle_initial'])\n\n #send output to csv\n df_final.to_csv('deduplication_output.csv')\n\n\n #------------------------------additional details------------------------------\n\n #A training file and a settings file will be created while running Dedupe. \n #Keeping these files will eliminate the need to retrain your model in the future. \n #If you would like to retrain your model, just delete the settings and training files.\n\n### Matching / Record Linkage\n\n import pandas as pd\n import pandas_dedupe\n\n #load dataframes\n dfa = pd.read_csv('file_a.csv')\n dfb = pd.read_csv('file_b.csv')\n\n #initiate matching\n df_final = pandas_dedupe.link_dataframes(dfa, dfb, ['field_1', 'field_2', 'field_3', 'field_4'])\n\n #send output to csv\n df_final.to_csv('linkage_output.csv')\n\n\n #------------------------------additional details------------------------------\n\n #Use identical field names when linking dataframes.\n\n #Record linkage should only be used on dataframes that have been deduplicated.\n\n #A training file and a settings file will be created while running Dedupe. \n #Keeping these files will eliminate the need to retrain your model in the future. 
\n #If you would like to retrain your model, just delete the settings and training files.\n\n# Advanced Usage\n\n\n### Canonicalize Fields\n\n pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', 'payment_type'], canonicalize=True)\n\n #------------------------------additional details------------------------------\n\n #Creates a standardized version of every element by field & cluster id. For instance,\n #if you had the field \"first_name\", and the first cluster id had 3 items, \"John\",\n #\"John\", and \"Johnny\", the canonicalized version would have \"John\" listed for all\n #three in a new field called \"first_name - canonical\"\n\n #If you prefer to canonicalize only a few of your fields, you can set the parameter\n #to a list of the fields you want a canonical version for. In the example above, you\n #could have written canonicalize=['first_name', 'last_name'], and you would get\n #a canonical version for first_name and last_name, but not for payment_type.\n\n\n### Specifying Types\n\n # Price Example\n pandas_dedupe.dedupe_dataframe(df,['first_name', 'last_name', ('salary', 'Price')])\n\n # has missing Example\n pandas_dedupe.link_dataframes(dfa, dfb, ['SSN', ('bio_pgraph', 'Text'), ('salary', 'Price', 'has missing')])\n\n # crf Example\n pandas_dedupe.dedupe_dataframe(df,[('first_name', 'String', 'crf'), 'last_name', ('m_initial', 'Exact')])\n\n\n #------------------------------additional details------------------------------\n\n #If a type is not explicitly listed, String will be used.\n\n #A tuple (parentheses) is required to declare all other types. If you prefer to use a tuple\n #for String also, ('first_name', 'String'), that's fine.\n\n #If you want to specify either a 'crf' or 'has missing' parameter, a tuple with three elements\n #must be used. 
('first_name', 'String', 'crf') works, ('first_name', 'crf') does not work.\n\n# Types\n\nDedupe supports a variety of datatypes; a full listing with documentation can be found [here.](https://docs.dedupe.io/en/latest/Variable-definition.html#)\n\npandas-dedupe officially supports the following datatypes:\n* **String** - Standard string comparison using a string distance metric. This is the default type.\n* **Text** - Comparison for sentences or paragraphs of text. Uses a cosine similarity metric.\n* **Price** - For comparing positive, nonzero numerical values.\n* **DateTime** - For comparing dates.\n* **LatLong** - (39.990334, 70.012) will not match to (40.01, 69.98) using a string distance\nmetric, even though the points are in a geographically similar location. The LatLong type resolves\n this by calculating the haversine distance between compared coordinates. LatLong requires\n the field to be in the format (Lat, Lng). The value can be a string, a tuple containing two\n strings, a tuple containing two floats, or a tuple containing two integers. If the format\n cannot be processed, you will get a traceback.\n* **Exact** - Tests whether fields are an exact match.\n* **Exists** - Sometimes, the presence or absence of data can be useful in predicting a match.\nThe Exists type tests whether both, one, or neither of the fields are null.\n\nAdditional supported parameters are:\n* **has missing** - Can be used if one of your data fields contains null values.\n* **crf** - Use conditional random fields for comparisons rather than a distance metric. May be more\naccurate in some cases, but runs much slower. Works with String and ShortString types.\n\n# Credits\n\nMany thanks to the folks at [DataMade](https://datamade.us/) for making the [Dedupe library](https://github.com/dedupeio/dedupe) publicly available. 
People interested in a code-free implementation of the dedupe library can find a link here: [Dedupe.io](https://dedupe.io/pricing/).\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Lyonk71/pandas-dedupe", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "pandas-dedupe", "package_url": "https://pypi.org/project/pandas-dedupe/", "platform": "", "project_url": "https://pypi.org/project/pandas-dedupe/", "project_urls": { "Homepage": "https://github.com/Lyonk71/pandas-dedupe" }, "release_url": "https://pypi.org/project/pandas-dedupe/0.42/", "requires_dist": [ "dedupe", "unidecode", "pandas" ], "requires_python": "", "summary": "The Dedupe library made easy with Pandas.", "version": "0.42" }, "last_serial": 4980570, "releases": { "0.2": [ { "comment_text": "", "digests": { "md5": "b3b0da02222446a54cbd53e28d85e418", "sha256": "d090653ff6b9f54ba14d87589a0b78db06991550731fff3179a9cec882dc4981" }, "downloads": -1, "filename": "pandas_dedupe-0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "b3b0da02222446a54cbd53e28d85e418", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 3896, "upload_time": "2018-12-07T01:48:33", "url": "https://files.pythonhosted.org/packages/8a/22/85a5f0662958a33c374140a0444ac59bd45adeb91409578114906d905384/pandas_dedupe-0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "3b9325ceb5359bb5eaf608060102b9ba", "sha256": "2cfceccbf5424cffb5b80247bd52d35139a4d1c4bfc65db1e114f9aa1c26c3f9" }, "downloads": -1, "filename": "pandas_dedupe-0.2.tar.gz", "has_sig": false, "md5_digest": "3b9325ceb5359bb5eaf608060102b9ba", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4061, "upload_time": "2018-12-07T01:48:35", "url": 
"https://files.pythonhosted.org/packages/b9/ee/70c9e558c4e15fd27fef3a514f5e6e5895d205f554c4fe6742ec0ad10400/pandas_dedupe-0.2.tar.gz" } ], "0.21": [ { "comment_text": "", "digests": { "md5": "ed406c7d80b5dc176c87d600f8689b2c", "sha256": "6c791f71f1c98643febcaa16cecad5cca4b1d2ece96c681d9a2d1b2a9bb12925" }, "downloads": -1, "filename": "pandas_dedupe-0.21-py3-none-any.whl", "has_sig": false, "md5_digest": "ed406c7d80b5dc176c87d600f8689b2c", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 4570, "upload_time": "2018-12-09T00:54:06", "url": "https://files.pythonhosted.org/packages/6b/1e/fdc004604aa88e2f3850964691ebb9fcb11c083ae6ab22a52050ca154ea4/pandas_dedupe-0.21-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c0bc5707664712229756d37becc4f465", "sha256": "03bf1073bd8ea4829f6937fb87fe93e69c5a785ed261b45832d2ed227c2362e3" }, "downloads": -1, "filename": "pandas_dedupe-0.21.tar.gz", "has_sig": false, "md5_digest": "c0bc5707664712229756d37becc4f465", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4381, "upload_time": "2018-12-09T00:54:08", "url": "https://files.pythonhosted.org/packages/d4/ae/a6370bb2998ae1516877ed81ddba89687d5e8024676af6c7e22cc02db1ec/pandas_dedupe-0.21.tar.gz" } ], "0.22": [ { "comment_text": "", "digests": { "md5": "5ddf31af1c5071c6d319339e405e08ae", "sha256": "446a67f6e117fed06d802d90a8bd3ba460ce1bc637f827ba5d17bdc9d0bfde08" }, "downloads": -1, "filename": "pandas_dedupe-0.22-py3-none-any.whl", "has_sig": false, "md5_digest": "5ddf31af1c5071c6d319339e405e08ae", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 4621, "upload_time": "2018-12-11T02:59:31", "url": "https://files.pythonhosted.org/packages/4a/c6/4403a070e7d7c13ec232f9ae23b0d091da06af20e1b3996cb70afe512b34/pandas_dedupe-0.22-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "07461b8b2c8e85a1558581155d45c002", "sha256": 
"79edac16018bbb23a89318394a57eda8b3a05873ede28164d4a5a22cb0c488ab" }, "downloads": -1, "filename": "pandas_dedupe-0.22.tar.gz", "has_sig": false, "md5_digest": "07461b8b2c8e85a1558581155d45c002", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4433, "upload_time": "2018-12-11T02:59:32", "url": "https://files.pythonhosted.org/packages/3a/b7/f7f3078e3e952166c8396ee7423fdf1704a116cec52a3cc21daa956a7be3/pandas_dedupe-0.22.tar.gz" } ], "0.24": [ { "comment_text": "", "digests": { "md5": "f376d32c3973a34b73916b5157755dad", "sha256": "80e7ace2b850559f65440aba5a79c03ff5e1fc93927f86107d788dc92090242d" }, "downloads": -1, "filename": "pandas_dedupe-0.24-py3-none-any.whl", "has_sig": false, "md5_digest": "f376d32c3973a34b73916b5157755dad", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6311, "upload_time": "2018-12-21T02:29:28", "url": "https://files.pythonhosted.org/packages/db/db/c0de05cda9e8e743a16f7c2457d6c406aa490b91cb46cdbadd252e9c5baf/pandas_dedupe-0.24-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "fc32d16c276ac9a33fe1ab873a495a18", "sha256": "29dff0f4b18d5c29accc493e2bfb9b301fde9755fd82d1fde8a67c5096cd9875" }, "downloads": -1, "filename": "pandas_dedupe-0.24.tar.gz", "has_sig": false, "md5_digest": "fc32d16c276ac9a33fe1ab873a495a18", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4797, "upload_time": "2018-12-21T02:29:30", "url": "https://files.pythonhosted.org/packages/97/9c/5d90af5e0e271117e69498b9cb1d049db0e68af59bcac4eaa2543a85a3df/pandas_dedupe-0.24.tar.gz" } ], "0.31": [ { "comment_text": "", "digests": { "md5": "34a7c9f2c544d494af88f72b47af5057", "sha256": "c5cd423e6ee865749820f2cf9c18bb785ce3740c95bd8224061fa4550bdad3a6" }, "downloads": -1, "filename": "pandas_dedupe-0.31-py3-none-any.whl", "has_sig": false, "md5_digest": "34a7c9f2c544d494af88f72b47af5057", "packagetype": "bdist_wheel", "python_version": "py3", 
"requires_python": null, "size": 7845, "upload_time": "2019-02-20T02:42:33", "url": "https://files.pythonhosted.org/packages/6a/90/ed405865e4b932ef4a23a7e05f1361e67556fb9e888b547ec782215a5574/pandas_dedupe-0.31-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1c2f64cc8cafcfd94ef26661f1021da4", "sha256": "8685048aa9bc4b704f8bc23a98cd870aeb3cda4d47f16aa232447146e5a8e6ee" }, "downloads": -1, "filename": "pandas_dedupe-0.31.tar.gz", "has_sig": false, "md5_digest": "1c2f64cc8cafcfd94ef26661f1021da4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6576, "upload_time": "2019-02-20T02:42:35", "url": "https://files.pythonhosted.org/packages/8c/90/62238b4e634b7d23c33b882bef30bcf0355b86ece33dac1360d40254779c/pandas_dedupe-0.31.tar.gz" } ], "0.42": [ { "comment_text": "", "digests": { "md5": "4488a828b93ca4feab7247896eb32e45", "sha256": "2fa7380a60e13e58048c1ccae1119cf02c790b755c4070141da7107e8ce5f513" }, "downloads": -1, "filename": "pandas_dedupe-0.42-py3-none-any.whl", "has_sig": false, "md5_digest": "4488a828b93ca4feab7247896eb32e45", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7802, "upload_time": "2019-03-25T03:03:08", "url": "https://files.pythonhosted.org/packages/75/58/3ff49049ca695d16835d138afcbfd2936386165e60851921f3d08878adb1/pandas_dedupe-0.42-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e6aac4b0ee05e08d1c8c6259bb0ac1d6", "sha256": "a524a19c4824c21ef494ec02d7cea8628f6fee6fe5fd49dd8723d1269e1f12b9" }, "downloads": -1, "filename": "pandas_dedupe-0.42.tar.gz", "has_sig": false, "md5_digest": "e6aac4b0ee05e08d1c8c6259bb0ac1d6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6577, "upload_time": "2019-03-25T03:03:10", "url": "https://files.pythonhosted.org/packages/e1/bb/67a3af1adc14dac8ab3172e1c85dc5ee3121648c0b3441ed7fbfc3f3ac30/pandas_dedupe-0.42.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": 
"4488a828b93ca4feab7247896eb32e45", "sha256": "2fa7380a60e13e58048c1ccae1119cf02c790b755c4070141da7107e8ce5f513" }, "downloads": -1, "filename": "pandas_dedupe-0.42-py3-none-any.whl", "has_sig": false, "md5_digest": "4488a828b93ca4feab7247896eb32e45", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7802, "upload_time": "2019-03-25T03:03:08", "url": "https://files.pythonhosted.org/packages/75/58/3ff49049ca695d16835d138afcbfd2936386165e60851921f3d08878adb1/pandas_dedupe-0.42-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e6aac4b0ee05e08d1c8c6259bb0ac1d6", "sha256": "a524a19c4824c21ef494ec02d7cea8628f6fee6fe5fd49dd8723d1269e1f12b9" }, "downloads": -1, "filename": "pandas_dedupe-0.42.tar.gz", "has_sig": false, "md5_digest": "e6aac4b0ee05e08d1c8c6259bb0ac1d6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6577, "upload_time": "2019-03-25T03:03:10", "url": "https://files.pythonhosted.org/packages/e1/bb/67a3af1adc14dac8ab3172e1c85dc5ee3121648c0b3441ed7fbfc3f3ac30/pandas_dedupe-0.42.tar.gz" } ] }