{ "info": { "author": "Stefan Wehrmeyer", "author_email": "mail@stefanwehrmeyer.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Operating System :: OS Independent", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3.5", "Topic :: Scientific/Engineering" ], "description": "# pandas-linker\n\n`pandas-linker` runs comparison windows over different sortings of a pandas DataFrame and links the rows via assigned UUIDs. This library does not actually do any duplicate detection. Instead it provides a harness to run your own comparison functions on your data.\n\nThis library is meant for datasets of a size where comparing every row with every other is undesirable. Instead you can decide on a sorting order of the DataFrame and only compare every row with every other inside a sliding window.\n\n## Install\n\n```\npip install pandas-linker\n```\n\n## Example\n\nLet's say you have a DataFrame like this:\n\n [ix] | name | country\n------|------|--------\n 0 | Pete | Spain\n 1 | Mary | USA\n 2 | Bart | US\n 3 | Mary | US\n\nand you want to detect similar rows and mark them as such. Here's how to do that:\n\n```python\nfrom pandas_linker import get_linker\n\n\ndef compare_rows(a, b):\n ''' Example function that decides if two rows represent same entity.'''\n return a['name'] in b['name'] or b['name'] in a['name']\n\n# df is a pandas.DataFrame with a unique index\n\nwith get_linker(df, field='uid') as linker:\n\n print('Comparing in 10 row window sorted by name')\n linker(sort_cols=['name'], window_size=10, cmp=compare_rows)\n\n print('Comparing in 15 row window sorted by country')\n linker(sort_cols=['country'], window_size=15, cmp=compare_rows)\n\n```\n\nAfter running the linker the process is complete\n\n [ix] | name | country | uid\n------|------|---------|----\n 0 | Pete | Spain | 7509781940fc471cad5dc32944652d70\n 1 | Mary | USA | 8f8dccd91568472daf740e9160349d6c\n 2 | Bart | US | 12b55fbe80f64d378193acd727b0e051\n 3 | Mary | US | 8f8dccd91568472daf740e9160349d6c\n\nNote that both \"Mary\" rows in the DataFrame have been identified as representing\nthe same entity and were assigned the same UUID.", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/stefanw/pandas-linker", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "pandas-linker", "package_url": "https://pypi.org/project/pandas-linker/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/pandas-linker/", "project_urls": { "Homepage": "https://github.com/stefanw/pandas-linker" }, "release_url": "https://pypi.org/project/pandas-linker/0.0.2/", "requires_dist": [ "pandas", "pyprind" ], "requires_python": "", "summary": "Linking rows of pandas dataframes", "version": "0.0.2" }, "last_serial": 2481133, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "3a11fb2e6aac1c42eddd4097528f414a", "sha256": "bf1f7026e1818ee4781e0498f610b79d8907873758026c3be0b8e527d9396d49" }, "downloads": -1, "filename": "pandas_linker-0.0.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "3a11fb2e6aac1c42eddd4097528f414a", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 5748, "upload_time": "2016-10-06T17:28:26", "url": "https://files.pythonhosted.org/packages/10/e2/9f7a502da4b5e5031a456ce84f90229ab6ad20922a61b182533913f17a7a/pandas_linker-0.0.1-py2.py3-none-any.whl" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "30036c489287bc3fbcdcb83d05128e47", "sha256": "a22e81adc57d72bcd8fb6dc98b6c5e37d2c830ca2d3c963504dfcd2bb76949d2" }, "downloads": -1, "filename": "pandas_linker-0.0.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "30036c489287bc3fbcdcb83d05128e47", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 6017, "upload_time": "2016-11-24T15:55:36", "url": "https://files.pythonhosted.org/packages/7c/17/946988f2d48bf0aa52440779c90dc5f0d257b256c188e796f8d5e1f8b3a1/pandas_linker-0.0.2-py2.py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "30036c489287bc3fbcdcb83d05128e47", "sha256": "a22e81adc57d72bcd8fb6dc98b6c5e37d2c830ca2d3c963504dfcd2bb76949d2" }, "downloads": -1, "filename": "pandas_linker-0.0.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "30036c489287bc3fbcdcb83d05128e47", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 6017, "upload_time": "2016-11-24T15:55:36", "url": "https://files.pythonhosted.org/packages/7c/17/946988f2d48bf0aa52440779c90dc5f0d257b256c188e796f8d5e1f8b3a1/pandas_linker-0.0.2-py2.py3-none-any.whl" } ] }