{ "info": { "author": "Sanhe Hu", "author_email": "husanhe@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Operating System :: MacOS", "Operating System :: Microsoft :: Windows", "Operating System :: Unix", "Programming Language :: Python", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6" ], "description": "\n.. image:: https://readthedocs.org/projects/dupe_remove/badge/?version=latest\n :target: https://dupe_remove.readthedocs.io/index.html\n :alt: Documentation Status\n\n.. image:: https://travis-ci.org/MacHu-GWU/dupe_remove-project.svg?branch=master\n :target: https://travis-ci.org/MacHu-GWU/dupe_remove-project?branch=master\n\n.. image:: https://codecov.io/gh/MacHu-GWU/dupe_remove-project/branch/master/graph/badge.svg\n :target: https://codecov.io/gh/MacHu-GWU/dupe_remove-project\n\n.. image:: https://img.shields.io/pypi/v/dupe_remove.svg\n :target: https://pypi.python.org/pypi/dupe_remove\n\n.. image:: https://img.shields.io/pypi/l/dupe_remove.svg\n :target: https://pypi.python.org/pypi/dupe_remove\n\n.. image:: https://img.shields.io/pypi/pyversions/dupe_remove.svg\n :target: https://pypi.python.org/pypi/dupe_remove\n\n.. image:: https://img.shields.io/badge/STAR_Me_on_GitHub!--None.svg?style=social\n :target: https://github.com/MacHu-GWU/dupe_remove-project\n\n------\n\n\n.. image:: https://img.shields.io/badge/Link-Document-blue.svg\n :target: https://dupe_remove.readthedocs.io/index.html\n\n.. image:: https://img.shields.io/badge/Link-API-blue.svg\n :target: https://dupe_remove.readthedocs.io/py-modindex.html\n\n.. image:: https://img.shields.io/badge/Link-Source_Code-blue.svg\n :target: https://dupe_remove.readthedocs.io/py-modindex.html\n\n.. image:: https://img.shields.io/badge/Link-Install-blue.svg\n :target: `install`_\n\n.. image:: https://img.shields.io/badge/Link-GitHub-blue.svg\n :target: https://github.com/MacHu-GWU/dupe_remove-project\n\n.. image:: https://img.shields.io/badge/Link-Submit_Issue-blue.svg\n :target: https://github.com/MacHu-GWU/dupe_remove-project/issues\n\n.. image:: https://img.shields.io/badge/Link-Request_Feature-blue.svg\n :target: https://github.com/MacHu-GWU/dupe_remove-project/issues\n\n.. image:: https://img.shields.io/badge/Link-Download-blue.svg\n :target: https://pypi.org/pypi/dupe_remove#files\n\n\nWelcome to ``dupe_remove`` Documentation\n==============================================================================\n\n**How come duplicate data in database?**\n\nIn OLAP database Redshift, the ``primary_key`` column doesn't apply any restriction due to performance issue. What if our ETL pipeline load duplicate same data multiple times in retry?\n\n**How** dupe_remove **solve the problem?**\n\n``dupe_remove`` use a optimized strategy to remove duplicate precisely and fast. You only need to specify:\n\n- database connection\n- table name, id column, sort key column\n\n``dupe_remove`` will do these on your own will:\n\n- remove duplicate data in specified sort key range\n- deploy as cron job on AWS Lambda to automatically remove all duplicate data in a table.\n\n\nUsage Example\n------------------------------------------------------------------------------\nOur database::\n\n table.events\n |-- column(id, type=string) # id column\n |-- column(time, type=timestamp) # sort key column\n |-- other columns ...\n\n\nOn Local Machine\n------------------------------------------------------------------------------\n\n.. code-block:: python\n\n from datetime import datetime, timedelta\n from sqlalchemy_mate import EngineCreator\n from dupe_remove import Worker\n\n table_name = \"events\"\n id_col_name = \"id\"\n sort_col_name = \"time\"\n credential_file = \"/Users/admin/db.json\"\n engine_creator = EngineCreator.from_json(credential_file)\n engine = engine_creator.create_redshift()\n\n worker = Worker(\n engine=engine,\n table_name=table_name,\n id_col_name=id_col_name,\n sort_col_name=sort_col_name,\n )\n\n worker.remove_duplicate(\n lower=datetime(2018, 1, 1),\n upper=datetime(2018, 2, 1),\n )\n\n\nOn AWS Lambda Cron Job\n------------------------------------------------------------------------------\n\n.. code-block:: python\n\n def handler(event, context):\n from datetime import datetime, timedelta\n from sqlalchemy_mate import EngineCreator\n from dupe_remove import Scheduler, Worker, Handler\n\n table_name = \"events\"\n id_col_name = \"id\"\n sort_col_name = \"time\"\n\n engine_creator = EngineCreator.from_env(prefix=\"DEV_DB\", kms_decrypt=True)\n engine = engine_creator.create_redshift()\n test_connection(engine, 6)\n\n worker = Worker(\n engine=engine,\n table_name=table_name,\n id_col_name=id_col_name,\n sort_col_name=sort_col_name,\n )\n\n # run every 5 min, clean 31 days data at a time from 2018-01-01,\n # start over in 12 cycle\n cron_freq_in_seconds = 300\n start = datetime(2018, 1, 1)\n delta = timedelta(days=31)\n bin_size = 12\n scheduler = Scheduler(\n cron_freq_in_seconds=cron_freq_in_seconds,\n start=start,\n delta=delta,\n bin_size=bin_size,\n )\n\n real_handler = Handler(worker=worker, scheduler=scheduler)\n real_handler.handler(event, context)\n\n\n.. _install:\n\nInstall\n------------------------------------------------------------------------------\n\n``dupe_remove`` is released on PyPI, so all you need is:\n\n.. code-block:: console\n\n $ pip install dupe_remove\n\nTo upgrade to latest version:\n\n.. code-block:: console\n\n $ pip install --upgrade dupe_remove\n\n", "description_content_type": "", "docs_url": null, "download_url": "https://pypi.python.org/pypi/dupe_remove/0.0.1#downloads", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/MacHu-GWU/", "keywords": "", "license": "MIT", "maintainer": "Unknown", "maintainer_email": "", "name": "dupe-remove", "package_url": "https://pypi.org/project/dupe-remove/", "platform": "Windows", "project_url": "https://pypi.org/project/dupe-remove/", "project_urls": { "Download": "https://pypi.python.org/pypi/dupe_remove/0.0.1#downloads", "Homepage": "https://github.com/MacHu-GWU/" }, "release_url": "https://pypi.org/project/dupe-remove/0.0.1/", "requires_dist": [ "six", "attrs", "sqlalchemy", "sqlalchemy-redshift", "psycopg2-binary", "loggerFactory (==0.0.5)", "sqlalchemy-mate (==0.0.7)", "sphinx (==1.8.1) ; extra == 'docs'", "sphinx-rtd-theme ; extra == 'docs'", "sphinx-jinja ; extra == 'docs'", "sphinx-copybutton ; extra == 'docs'", "docfly (>=0.0.17) ; extra == 'docs'", "rstobj (>=0.0.5) ; extra == 'docs'", "pygments ; extra == 'docs'", "pytest (==3.2.3) ; extra == 'tests'", "pytest-cov (==2.5.1) ; extra == 'tests'" ], "requires_python": "", "summary": "Build Lambda Function to remove duplicate data from Redshift in minutes.", "version": "0.0.1" }, "last_serial": 4926566, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "fcf7d7d31c6d9d572c7c50ace7ae20de", "sha256": "1b2f3e1a55a47982ed216ac658c776e24697244696e64d53936771fef132420a" }, "downloads": -1, "filename": "dupe_remove-0.0.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "fcf7d7d31c6d9d572c7c50ace7ae20de", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 23662, "upload_time": "2019-03-11T19:09:43", "url": "https://files.pythonhosted.org/packages/94/04/b8151b35f56cb61b9441748209b901493ecbaacf2e342625759f54aa544a/dupe_remove-0.0.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d33f9809145895491b603e7e4aa2f145", "sha256": "9579eaffa38c0a53490479c61dbc40c6b4bab9534f4ee5a5b9a4a46629cdac57" }, "downloads": -1, "filename": "dupe_remove-0.0.1.tar.gz", "has_sig": false, "md5_digest": "d33f9809145895491b603e7e4aa2f145", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 18134, "upload_time": "2019-03-11T19:09:45", "url": "https://files.pythonhosted.org/packages/fa/27/d6c468029aabe8e105aa78207693800840116899cd269008ea4ed0896224/dupe_remove-0.0.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "fcf7d7d31c6d9d572c7c50ace7ae20de", "sha256": "1b2f3e1a55a47982ed216ac658c776e24697244696e64d53936771fef132420a" }, "downloads": -1, "filename": "dupe_remove-0.0.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "fcf7d7d31c6d9d572c7c50ace7ae20de", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 23662, "upload_time": "2019-03-11T19:09:43", "url": "https://files.pythonhosted.org/packages/94/04/b8151b35f56cb61b9441748209b901493ecbaacf2e342625759f54aa544a/dupe_remove-0.0.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d33f9809145895491b603e7e4aa2f145", "sha256": "9579eaffa38c0a53490479c61dbc40c6b4bab9534f4ee5a5b9a4a46629cdac57" }, "downloads": -1, "filename": "dupe_remove-0.0.1.tar.gz", "has_sig": false, "md5_digest": "d33f9809145895491b603e7e4aa2f145", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 18134, "upload_time": "2019-03-11T19:09:45", "url": "https://files.pythonhosted.org/packages/fa/27/d6c468029aabe8e105aa78207693800840116899cd269008ea4ed0896224/dupe_remove-0.0.1.tar.gz" } ] }