{ "info": { "author": "Shivam Bansal", "author_email": "shivam5992@gmail.com", "bugtrack_url": null, "classifiers": [ "Programming Language :: Python" ], "description": "# **dupandas:** data deduplication of text records in a pandas dataframe\n\n\n[![Project Status: WIP \u2013 Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](http://www.repostatus.org/badges/latest/wip.svg)](https://github.com/shivam5992/dupandas) [![Twitter Follow](https://img.shields.io/twitter/follow/shields_io.svg?style=social&label=Follow&maxAge=2592000)](https://twitter.com/shivamshaz)\n\ndupandas is a python package to perform data deduplication on columns of a pandas dataframe using flexible text matching. It is compatible with both versions of python (2.x and 3.x). dupandas can find duplicate any kinds of text records in the pandas data. It comprises of sophisticated Matchers that can handle spelling differences and phonetics. It also comprises of several Cleaners, which can be used to clean up the noise present in the text data such as punctuations, digits, casing etc.\n\nFor fast computations, dupandas uses lucene based text indexing. In the input_config, if \"indexing\" = True, then it indexes the dataset in RAMDirectory which is used to identify and search similar strings. Check out the instructions of installing PyLucene below.\n\nThe beautiful part of dupandas is that it's Matchers, Cleaners and Indexing functions can be used as standalone packages while working with text data.\n\n\n## Installation\nFollowing python modules are required to use dupandas: **pandas, fuzzy, python-levenshtein** . These modules can be installed using pip command:\n\n```python\n pip install dupandas pandas fuzzy python-levenshtein\n```\n**OR** if dependencies are already installed:\n\n```\n pip install dupandas\n```\n\n**OPTIONAL** For faster implementation dupandas with indexing feature is recommended. dupandas uses PuLucene for data indexing purposes. \n**PyLucene Installation:** Please note that for lucene indexing, java needs to be installed. Java 8 is recommended. Refer to [this](https://www.digitalocean.com/community/tutorials/how-to-install-java-on-ubuntu-with-apt-get) link\n\n```\n sudo apt-get update\n sudo apt-get install pylucene\n\n After Installation, edit ~/.bashrc file, and add the following line at the end \n export LD_LIBRARY_PATH=/usr/lib/jvm/java_folder_name/jre/lib/amd64/server\n \n example: export LD_LIBRARY_PATH=/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server\n```\n\n**Note:** The use of indexing can reduce the overall time of computation and execution to one third of original.\n\n## Usage : dupandas\ndupandas using default Matchers and Cleaners, Default Matcher and Cleaners are Exact Match and No Cleaning respectively.\n\n``` python\n from dupandas import Dedupe\n dupe = Dedupe()\n \n input_config = {\n 'input_data' : pandas_dataframe,\n 'column' : 'column_name_to_deduplicate',\n '_id' : 'unique_id_column_of_dataset',\n }\n results = dupe.dedupe(input_config)\n```\n\ndupandas using custom Cleaner and Matcher configs\n\n``` python\n from dupandas import Dedupe\n\n clean_config = { 'lower' : True, 'punctuation' : True, 'whitespace' : True, 'digit' : True }\n match_config = { 'exact' : False, 'levenshtein' : True, 'soundex' : False, 'nysiis' : False}\n dupe = Dedupe(clean_config = clean_config, match_config = match_config)\n\n input_config = {\n 'input_data' : pandas_dataframe,\n 'column' : 'column_name_to_deduplicate',\n '_id' : 'unique_id_column_of_dataset',\n }\n results = dupe.dedupe(input_config)\n```\n\nOther options in input_config \n\n```python\n input_config = {\n 'input_data' : pandas_dataframe,\n 'column' : 'column_name_to_deduplicate',\n '_id' : 'unique_id_column_of_dataset',\n 'score_column' : 'name_of_the_column_for_confidence_score',\n 'threshold' : 0.75, # float value of threshold\n 'unique_pairs' : True, # boolean to get unique (A=B) or duplicate (A=B and B=A) results\n 'indexing' : False # Boolean to set lucene indexing = True / False, Default: False\n }\n```\n\n## Usage : standalone Cleaner class\n\n```python\n from dupandas import Cleaner\n clean_config = { 'lower' : True, 'punctuation' : True, 'whitespace' : True, 'digit' : True }\n clean = Cleaner(clean_config)\n clean.clean_text(\"new Delhi 3#! 34 \")\n```\n\n## Usage: standalone Matcher class\n\n```python\n from dupandas import Matcher\n match_config = { 'exact' : False, 'levenshtein' : True, 'soundex' : False, 'nysiis' : False}\n match = Matcher(match_config)\n match.match_elements(\"new delhi\", \"newdeli\")\n```\n\n## Issues\n\nThanks for checking this work, Yes ofcourse there is a scope of improvement, Feel free to submit issues and enhancement requests.\n\n## Contributing\n#### ToDos\n\n1. [ ] V2: Add Support for multi column match\n2. [ ] V2: Add More Matchers, Cleaners\n3. [ ] V2: Remove Library Dependencies\n4. [ ] V2: Handle Longer Texts, Optimize Speed, Lucene Time Optimize, fix input bugs\n\n#### Steps \n 1. **Fork** the repo on GitHub\n 2. **Clone** the project to your own machine\n 3. **Commit** changes to your own branch\n 4. **Push** your work back up to your fork\n 5. Submit a **Pull request**", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/shivam5992/dupandas", "keywords": "pandas,deduplication,text cleaning,flexible matching", "license": "MIT", "maintainer": null, "maintainer_email": null, "name": "dupandas", "package_url": "https://pypi.org/project/dupandas/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/dupandas/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/shivam5992/dupandas" }, "release_url": "https://pypi.org/project/dupandas/0.3.2/", "requires_dist": null, "requires_python": null, "summary": "python package to deduplicate text data in pandas dataframe using flexible string matching and cleaning", "version": "0.3.2" }, "last_serial": 2947059, "releases": { "0.2": [ { "comment_text": "", "digests": { "md5": "6b012ce63e02c465634b8aa40f276c45", "sha256": "a3a4c89da12275c1a3c92a566476fa0847ca9af8faf2f8809868c6778021a494" }, "downloads": -1, "filename": "dupandas-0.2.tar.gz", "has_sig": false, "md5_digest": "6b012ce63e02c465634b8aa40f276c45", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5368, "upload_time": "2017-06-01T14:29:24", "url": "https://files.pythonhosted.org/packages/23/1c/f91fb5b11a3aeae40e7b93a63a3b1cbb873733e522db772cef42efb86c71/dupandas-0.2.tar.gz" } ], "0.3": [ { "comment_text": "", "digests": { "md5": "17b7855e0b6b100288e1f31e7edc37f8", "sha256": "938da7040bb9aed38ee2af6bd33cc6d43681f4fc9abb8919e350882c0eb1a55f" }, "downloads": -1, "filename": "dupandas-0.3.tar.gz", "has_sig": false, "md5_digest": "17b7855e0b6b100288e1f31e7edc37f8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6604, "upload_time": "2017-06-12T13:40:31", "url": "https://files.pythonhosted.org/packages/67/71/5d7f9ef6e2c40abdd3d8b9dcda9dd1f1db876fb627788d6f021ec49b660a/dupandas-0.3.tar.gz" } ], "0.3.1": [ { "comment_text": "", "digests": { "md5": "398b36833e9cfc442e733db68803db35", "sha256": "1b5386087f157412240adf98dff855c1cd37a688ad772a408b854add98fcc30e" }, "downloads": -1, "filename": "dupandas-0.3.1.tar.gz", "has_sig": false, "md5_digest": "398b36833e9cfc442e733db68803db35", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6996, "upload_time": "2017-06-12T13:45:25", "url": "https://files.pythonhosted.org/packages/a2/31/d65566743336c110e54eda8b35b771d3f1c59f9e9745a6e6a5dad1a23b82/dupandas-0.3.1.tar.gz" } ], "0.3.2": [ { "comment_text": "", "digests": { "md5": "da7815b8e930def6ccfbba778679377e", "sha256": "707df632e5d33ade15a38b2aad040b496a41a35783c14dd4b25ddc80dd79cc30" }, "downloads": -1, "filename": "dupandas-0.3.2.tar.gz", "has_sig": false, "md5_digest": "da7815b8e930def6ccfbba778679377e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7433, "upload_time": "2017-06-13T13:58:58", "url": "https://files.pythonhosted.org/packages/74/d6/98698946e5d21022e0f5ab150463489dac350f3f4da308af5367c18e696a/dupandas-0.3.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "da7815b8e930def6ccfbba778679377e", "sha256": "707df632e5d33ade15a38b2aad040b496a41a35783c14dd4b25ddc80dd79cc30" }, "downloads": -1, "filename": "dupandas-0.3.2.tar.gz", "has_sig": false, "md5_digest": "da7815b8e930def6ccfbba778679377e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7433, "upload_time": "2017-06-13T13:58:58", "url": "https://files.pythonhosted.org/packages/74/d6/98698946e5d21022e0f5ab150463489dac350f3f4da308af5367c18e696a/dupandas-0.3.2.tar.gz" } ] }