{ "info": { "author": "DSaPP Researchers", "author_email": "datascifellows@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5" ], "description": "===============================\r\nSuperDeduper\r\n===============================\r\n\r\n.. image:: https://img.shields.io/pypi/v/superdeduper.svg\r\n :target: https://pypi.python.org/pypi/superdeduper\r\n\r\n.. image:: https://img.shields.io/travis/dssg/superdeduper.svg\r\n :target: https://travis-ci.org/dssg/superdeduper\r\n\r\n.. image:: https://codecov.io/gh/dssg/superdeduper/branch/master/graph/badge.svg\r\n\t :target: https://codecov.io/gh/dssg/superdeduper\r\n\r\n.. image:: https://readthedocs.org/projects/superdeduper/badge/?version=latest\r\n :target: https://superdeduper.readthedocs.io/en/latest/?badge=latest\r\n :alt: Documentation Status\r\n\r\n.. image:: https://pyup.io/repos/github/dssg/superdeduper/shield.svg\r\n :target: https://pyup.io/repos/github/dssg/superdeduper/\r\n :alt: Updates\r\n\r\n\r\n.. highlights:: **SuperDeduper has been renamed to pgdedupe. All subsequent development will occur under the new name.**\r\n\r\nA work-in-progress to provide a standard interface for deduplication of large\r\ndatabases with custom pre-processing and post-processing steps.\r\n\r\n\r\n* Free software: MIT license\r\n* Documentation: https://superdeduper.readthedocs.io.\r\n\r\n\r\nInterface\r\n---------\r\n\r\nThis provides a simple command-line program, ``superdeduper``. Two configuration\r\nfiles specify the deduplication parameters and database connection settings. To\r\nrun deduplication on a generated dataset, create a ``database.yml`` file that\r\nspecifies the following parameters::\r\n\r\n\tuser:\r\n\tpassword:\r\n\tdatabase:\r\n\thost:\r\n\tport:\r\n\r\nYou can now create a sample CSV file with::\r\n\r\n\t$ python generate_fake_dataset.py\r\n\tcreating people: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 9500/9500 [00:21<00:00, 445.38it/s]\r\n\tadding twins: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 500/500 [00:00<00:00, 1854.72it/s]\r\n\twriting csv: 47%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u258b | 4666/10000 [00:42<00:55, 96.28it/s]\r\n\r\nOnce complete, store this example dataset in a database with::\r\n\r\n\t$ python test/initialize_db.py\r\n\tCREATE SCHEMA\r\n\tDROP TABLE\r\n\tCREATE TABLE\r\n\tCOPY 197617\r\n\tALTER TABLE\r\n\tALTER TABLE\r\n\tUPDATE 197617\r\n\r\nNow you can deduplicate this dataset. This will run dedupe as well as the\r\ncustom pre-processing and post-processing steps as defined in config.yml::\r\n\r\n\t$ superdeduper --config config.yml --db database.yml\r\n\r\n\r\nCustom pre- and post-processing\r\n-------------------------------\r\n\r\nIn addition to running a database-level deduplication with ``dedupe``, this\r\nscript adds custom pre- and post-processing steps to improve the run-time and\r\nresults, making this a hybrid between fuzzy matching and record linkage.\r\n\r\n* **Pre-processing:** Before running dedupe, this script does an exact-match\r\n deduplication. Some systems create many identical rows; this can make it\r\n challenging for dedupe to create an effective blocking strategy and generally\r\n makes the fuzzy matching much harder and time intensive.\r\n\r\n* **Post-processing:** After running dedupe, this script does an optional\r\n exact-match merge across subsets of columns. For example, in some instances\r\n an exact match of just the last name and social security number are\r\n sufficient evidence that two clusters are indeed the same identity.\r\n\r\n\r\nFurther steps\r\n-------------\r\n\r\nThis script was based upon and extended from the example in\r\n`dedupe-examples`_. It would be nice to use this common interface across all\r\ndatabase types, and potentially even allow reading from flat CSV files.\r\n\r\n.. _dedupe-examples: https://github.com/datamade/dedupe-examples/tree/master/pgsql_big_dedupe_example\r\n\r\n\r\n=======\r\nHistory\r\n=======\r\n\r\n0.1.0 (2016-12-14)\r\n------------------\r\n\r\n* First release on PyPI.", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/dssg/superdeduper", "keywords": "superdeduper", "license": "MIT license", "maintainer": "", "maintainer_email": "", "name": "superdeduper", "package_url": "https://pypi.org/project/superdeduper/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/superdeduper/", "project_urls": { "Homepage": "https://github.com/dssg/superdeduper" }, "release_url": "https://pypi.org/project/superdeduper/0.1.7/", "requires_dist": [ "Click (>=6.0)", "PyYAML", "dedupe (>=1.6.0)", "dedupe-variable-name", "numpy", "pandas", "psycopg2" ], "requires_python": "", "summary": "A simple interface to datamade/dedupe to make probabilistic record linkage easy.", "version": "0.1.7" }, "last_serial": 2820148, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "920feaf8b290e0f3ecf39057a4ad554a", "sha256": "6053ac574c30265df09bb88034239ed6737c2a7fedd663739c587b61007051f5" }, "downloads": -1, "filename": "superdeduper-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "920feaf8b290e0f3ecf39057a4ad554a", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 12126, "upload_time": "2017-02-22T04:56:36", "url": "https://files.pythonhosted.org/packages/61/1c/274b7f659c62b64926f33abad6207dc501e1fa65564d3704c5c1b1a46ecd/superdeduper-0.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "0f9bb36bdb9274d303605dff83bf9566", "sha256": "cbab6a4fafb510e003f18003598462871787d048953419ec2720ee93d96d611c" }, "downloads": -1, "filename": "superdeduper-0.1.0.tar.gz", "has_sig": false, "md5_digest": "0f9bb36bdb9274d303605dff83bf9566", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 67210, "upload_time": "2017-02-22T04:56:37", "url": "https://files.pythonhosted.org/packages/dc/57/a76af8e33bd8d2ef06b3bc4a373c9a5c51316f847f7cfd0d3cc627da4701/superdeduper-0.1.0.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "cfdfc4d4403dba52b31d57ee4ae63c78", "sha256": "7a465650c3948419f8008d4af7f2606148186574c28f9255c9dee1af52d68d11" }, "downloads": -1, "filename": "superdeduper-0.1.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "cfdfc4d4403dba52b31d57ee4ae63c78", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 12192, "upload_time": "2017-03-17T15:38:29", "url": "https://files.pythonhosted.org/packages/24/fb/d4dea1e63b6b9d4bfdb2b200a02259c58346f66137ba505b459d9c404464/superdeduper-0.1.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e0db2324e11561193bc71e3449c0e778", "sha256": "d3db7f7835fa6948dbd203ab882556fa06503b5c98a338eefa5b19ae3a601e0e" }, "downloads": -1, "filename": "superdeduper-0.1.2.tar.gz", "has_sig": false, "md5_digest": "e0db2324e11561193bc71e3449c0e778", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 67249, "upload_time": "2017-03-17T15:38:30", "url": "https://files.pythonhosted.org/packages/dd/3d/0b2372bff6be58459bed9d80f8346567e3fa2bbe1bc621d5667790be25ef/superdeduper-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "5371d5705c134e172547353a3c438739", "sha256": "3b1191e5bdcae8ada5c4e9c7d7813cf0fa157db6c71c78eaec73185481c22f5b" }, "downloads": -1, "filename": "superdeduper-0.1.3-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "5371d5705c134e172547353a3c438739", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 12261, "upload_time": "2017-03-24T20:41:12", "url": "https://files.pythonhosted.org/packages/27/fa/ed6313e4f810e07d79a6c0b7da0b3dd5629cf4efbbbce9528d5f6b6788a9/superdeduper-0.1.3-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "69f4a460a64bc72f18178e5a1f64e0e3", "sha256": "0784602fe20e9e0011b579ac0adae4f986523ed79865e8f86a8b498fc35375d6" }, "downloads": -1, "filename": "superdeduper-0.1.3.tar.gz", "has_sig": false, "md5_digest": "69f4a460a64bc72f18178e5a1f64e0e3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 67376, "upload_time": "2017-03-24T20:41:14", "url": "https://files.pythonhosted.org/packages/f5/c0/1b4588056e4faefc232736947f89f4fee06c26b2fbd36589770530240034/superdeduper-0.1.3.tar.gz" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "d4e0610a642cd3f0abcba217722b33df", "sha256": "8d49c5d3d3ca378996f5533dbbcd072f1276e852b8a5ae0c453768653d1d117a" }, "downloads": -1, "filename": "superdeduper-0.1.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "d4e0610a642cd3f0abcba217722b33df", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 12260, "upload_time": "2017-03-24T23:46:45", "url": "https://files.pythonhosted.org/packages/d0/ba/b5821193bd3327fd0f282e284af0d79f01256ec16cf9eefb2a8b840da0e4/superdeduper-0.1.4-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e1ef75f27381c993a443775567c2f3c4", "sha256": "c6f2e5da5055843ad09fe9055400ad09934cf9e01f2dee971682d4c758b4db7e" }, "downloads": -1, "filename": "superdeduper-0.1.4.tar.gz", "has_sig": false, "md5_digest": "e1ef75f27381c993a443775567c2f3c4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 67373, "upload_time": "2017-03-24T23:46:47", "url": "https://files.pythonhosted.org/packages/16/49/e93a7a448a6ee49c96983e0dce82c591db0c82d08072872c780a7b09a77e/superdeduper-0.1.4.tar.gz" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "384760051012b5b82a360b75c2d2abdc", "sha256": "0cc4a299195c6fa3cf7f10b4945d079589bd840739196ef288041fceb2e13b3c" }, "downloads": -1, "filename": "superdeduper-0.1.5-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "384760051012b5b82a360b75c2d2abdc", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 12896, "upload_time": "2017-03-28T17:52:29", "url": "https://files.pythonhosted.org/packages/73/dd/a95c39cb4bea452ca3499c0bf7248d1c06168ed46cea48e706346d85321c/superdeduper-0.1.5-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f8fab7d869bad8590aa32233e0a0da3b", "sha256": "ac6d56ef8de1e4cd0878eca879a9f4baebffc51f95946bfc0a654e3543de0d00" }, "downloads": -1, "filename": "superdeduper-0.1.5.tar.gz", "has_sig": false, "md5_digest": "f8fab7d869bad8590aa32233e0a0da3b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 73453, "upload_time": "2017-03-28T17:52:30", "url": "https://files.pythonhosted.org/packages/f1/60/9b1b18eb41e09d61ad6ea94f29cda7b05f590c9a8120435740d8a0145e1a/superdeduper-0.1.5.tar.gz" } ], "0.1.6": [ { "comment_text": "", "digests": { "md5": "da129bed9808776f6b3984fb8af08383", "sha256": "e8eb23a3e4a90366a78d6c4cddddbbc887e2e8bf3b22c10406080cb6b7471559" }, "downloads": -1, "filename": "superdeduper-0.1.6-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "da129bed9808776f6b3984fb8af08383", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 12902, "upload_time": "2017-03-28T19:40:58", "url": "https://files.pythonhosted.org/packages/b9/01/7670cbc64073cbc800327d603971fb1aa5da3607ce3cf9cc6828e0eb8342/superdeduper-0.1.6-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "220aef9f5d5939d0b83f2eb817ace4d5", "sha256": "87633050bc2d4783a29d9182abc8fa0a9bc054c7d0999900215588adf59d8994" }, "downloads": -1, "filename": "superdeduper-0.1.6.tar.gz", "has_sig": false, "md5_digest": "220aef9f5d5939d0b83f2eb817ace4d5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 73456, "upload_time": "2017-03-28T19:41:00", "url": "https://files.pythonhosted.org/packages/93/c2/9ad142873eeca68a1c378bec8393e2623f65667acfb616a1e3dafa876183/superdeduper-0.1.6.tar.gz" } ], "0.1.7": [ { "comment_text": "", "digests": { "md5": "8fe4dc0610be89c4151b3a86ce1f14f8", "sha256": "9e644770da3b48d1002df94593422f66edb41ea50dc7b5f4a8b275f57177d34f" }, "downloads": -1, "filename": "superdeduper-0.1.7-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "8fe4dc0610be89c4151b3a86ce1f14f8", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 13129, "upload_time": "2017-04-19T17:14:14", "url": "https://files.pythonhosted.org/packages/ad/2f/a57f3c6ee78f8bdad320b658e5748ad75ddc2c5e1873564d7b46c12bf391/superdeduper-0.1.7-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "20f7761d040cadb290788b90c71964b2", "sha256": "195ab4c86d28c3410d079465769e18e991c38eba66b802d192bd1a6143bfd79c" }, "downloads": -1, "filename": "superdeduper-0.1.7.tar.gz", "has_sig": false, "md5_digest": "20f7761d040cadb290788b90c71964b2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 69302, "upload_time": "2017-04-19T17:14:16", "url": "https://files.pythonhosted.org/packages/a8/e9/3172347a56814232ca52e66d57e553133cffbe2f17c26c938e6d8c222415/superdeduper-0.1.7.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "8fe4dc0610be89c4151b3a86ce1f14f8", "sha256": "9e644770da3b48d1002df94593422f66edb41ea50dc7b5f4a8b275f57177d34f" }, "downloads": -1, "filename": "superdeduper-0.1.7-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "8fe4dc0610be89c4151b3a86ce1f14f8", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 13129, "upload_time": "2017-04-19T17:14:14", "url": "https://files.pythonhosted.org/packages/ad/2f/a57f3c6ee78f8bdad320b658e5748ad75ddc2c5e1873564d7b46c12bf391/superdeduper-0.1.7-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "20f7761d040cadb290788b90c71964b2", "sha256": "195ab4c86d28c3410d079465769e18e991c38eba66b802d192bd1a6143bfd79c" }, "downloads": -1, "filename": "superdeduper-0.1.7.tar.gz", "has_sig": false, "md5_digest": "20f7761d040cadb290788b90c71964b2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 69302, "upload_time": "2017-04-19T17:14:16", "url": "https://files.pythonhosted.org/packages/a8/e9/3172347a56814232ca52e66d57e553133cffbe2f17c26c938e6d8c222415/superdeduper-0.1.7.tar.gz" } ] }