{
"info": {
"author": "Aaron Schumacher",
"author_email": "ajschumacher@gmail.com",
"bugtrack_url": null,
"classifiers": [
"Development Status :: 2 - Pre-Alpha",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python",
"Programming Language :: Python :: 2.7"
],
"description": ".. image:: https://travis-ci.org/ajschumacher/mergic.svg\n :target: https://travis-ci.org/ajschumacher/mergic\n :alt: Build Status\n\n.. image:: https://coveralls.io/repos/ajschumacher/mergic/badge.svg\n :target: https://coveralls.io/r/ajschumacher/mergic\n :alt: Test Coverage\n\n.. image:: https://readthedocs.org/projects/mergic/badge/?version=latest\n :target: https://readthedocs.org/projects/mergic/?badge=latest\n :alt: Documentation Status\n\n======\nmergic\n======\n-----------------------------------------------------------\nworkflow support for reproducible deduplication and merging\n-----------------------------------------------------------\n\nSay you have a bunch of strings, and some of them are different but refer to the same thing. Maybe it's just some long list, maybe it's the contents of two key columns from data sets that you'd like to merge.\n\n::\n\n David Copperfield\n Lance Burton\n Dave Copperfield\n Levar Burton\n\nHere's what you can do with ``mergic``:\n\nGive ``mergic`` all your identifiers, one per line. If they are in a file called ``originals.txt``:\n\n.. code:: bash\n\n mergic calc originals.txt\n\nYou will see output about the possible groupings ``mergic`` can produce based on its default distance function. (It's easy to use a custom distance function\u2014see below.)\n\n::\n\n num groups, max group, num pairs, cutoff\n ----------------------------------------\n 4, 1, 0, -0.909090909091\n 3, 2, 1, 0.0909090909091\n 2, 2, 2, 0.25\n 1, 4, 6, 0.714285714286\n\nThere are 4 groupings that ``mergic`` can make with the given distance function, ranging from a group for every name to a single group for all four names.\n\nThe ``num groups`` is the number of groups that ``mergic`` can make with the given ``cutoff``.\n\nThe ``max group`` is the number of items in the largest group for that grouping.\n\nThe ``num pairs`` column indicates the number of within-group links that correspond to the grouping. In some sense this represents how much work it is to check all the linkages that ``mergic`` is making. Thinking in groups is much better than thinking in individual pairwise comparisons, but the metric is useful for summarizing how much linking is happening.\n\nThe ``cutoff`` determines which pairs of items are put in the same group. If the distance between two items is equal to or less than the ``cutoff``, those items will be grouped together.\n\nSelect a cutoff to produce the grouping you would like to see. If you would like to use the cutoff 0.3 and put the results in a file called ``grouping.json``:\n\n.. code:: bash\n\n mergic make originals.txt 0.3 > grouping.json\n\nNow ``grouping.json`` contains a grouping of your original data, specified in `JSON `__. It's designed to be human-readable, and human-editable, so you can check it and improve it as needed. The largest groups will be at the top, and similar items will be near one another.\n\n::\n\n {\n \"Lance Burton\": [\n \"Lance Burton\",\n \"Levar Burton\"\n ],\n \"David Copperfield\": [\n \"Dave Copperfield\",\n \"David Copperfield\"\n ]\n }\n\nThere are two proposed groups, with proposed names \"Lance Burton\" and \"David Copperfield\". You probably want to copy the produced file and edit the copy.\n\n.. code:: bash\n\n cp grouping.json grouping_fixed.json\n # edit `grouping_fixed.json`\n\nYou can edit the proposed group names\u2014the keys of the object\u2014if you like. The original values from the data are in the array values of the object, and you can move them around to correct the grouping. In this case you could also re-run ``mergic`` with a cutoff of 0.1 to get a correct grouping:\n\n::\n\n {\n \"David Copperfield\": [\n \"Dave Copperfield\",\n \"David Copperfield\"\n ],\n \"Lance Burton\": [\n \"Lance Burton\"\n ],\n \"Levar Burton\": [\n \"Levar Burton\"\n ]\n }\n\nNow that ``grouping_fixed.json`` is as perfect as it can be, you can move forward.\n\nYou can also compare your two JSON grouping files and see what you changed:\n\n.. code:: bash\n\n mergic diff grouping.json grouping_fixed.json > diff.json\n\nNow the file ``diff.json`` contains just what's different between ``grouping.json`` and ``grouping_fixed.json``. The ``mergic diff`` command is analogous to regular ``diff`` for text files, but it is aware of the JSON partition format so it can capture changes intelligently.\n\nYou can apply changes to a ``mergic``-produced file to recover your edited version.\n\n.. code:: bash\n\n mergic apply grouping.json diff.json > grouping_new.json\n\nNow ``grouping_new.json`` is equivalent to ``grouping_fixed.json``, as you can verify:\n\n.. code:: bash\n\n mergic diff grouping_fixed.json grouping_new.json\n # {} // (no changes)\n\nIn this way you have a complete and verifiable record of your work, at the level of whole files and also at the level of changes made by hand.\n\nThe JSON grouping format is very convenient for humans, but for tabular data a merge table is more useful. A merge table has one column with the original values from your data and one column with the new keys. These are named ``original`` and ``mergic`` in the output:\n\n.. code:: bash\n\n mergic table grouping_fixed.json > merge_table.csv\n\nThe file ``merge_table.csv`` looks like this:\n\n::\n\n original,mergic\n Lance Burton,Lance Burton\n Levar Burton,Levar Burton\n Dave Copperfield,David Copperfield\n David Copperfield,David Copperfield\n\nThis merge table can now be used with any tabular data system. For merges, first merge it on to both tables and then merge by the ``mergic`` key. For deduplication, merge it on to the table(s) of interest and then use the ``mergic`` column as you would have used the original data.\n\n\nInstallation\n============\n\n.. code:: bash\n\n pip install mergic\n\n\nUsing a Custom Distance Function\n================================\n\nThe ``mergic`` package provides a command-line script called ``mergic`` that uses Python's built-in ``difflib.SequenceMatcher.ratio()`` for calculating string distances, but a major strength of ``mergic`` is that it enables easy customization of the distance function via the ``mergic.Blender`` class. Making a custom ``mergic`` script is as easy as:\n\n.. code:: python\n\n #!/usr/bin/env python\n # custom_mergic.py\n import mergic\n\n # Any custom distance you want to try! e.g.,\n def my_distance(a, b):\n return abs(len(a) - len(b))\n\n mergic.Blender(my_distance).script()\n\nNow ``custom_mergic.py`` can be used just like the standard ``mergic`` script!\n\nThere is also a function that determines how to make a key for a group. It takes a list and returns a string. By default ``mergic.Blender`` will use the longest of a group's values (the first longest, in stable sorted order). You can also supply a custom function for generating keys.\n\n\nDistances You Might Like\n------------------------\n\n`Levenshtein string edit distance `__: The classic! It has many implementations; one of them is `python-Levenshtein `__.\n\n.. code:: python\n\n # pip install python-Levenshtein\n import Levenshtein\n Levenshtein.distance(\"fuzzy\", \"wuzzy\")\n # 1\n\nSeatGeek's `fuzzywuzzy `__: As described in their `blog post `__, some people have found these variants to work well in practice. Responses from ``fuzzywuzzy`` are phrased as integer percent similarities; one way to make a distance is to subtract from 100.\n\n.. code:: python\n\n # pip install fuzzywuzzy\n from fuzzywuzzy import fuzz\n 100 - fuzz.ratio(\"Levensthein\", \"Leviathan\")\n # 50\n\nThere are a ton of distances, even just within the two packages mentioned!\n\nYou can also make your own!",
"description_content_type": null,
"docs_url": null,
"download_url": "https://github.com/ajschumacher/mergic/tarball/0.0.7",
"downloads": {
"last_day": -1,
"last_month": -1,
"last_week": -1
},
"home_page": "https://github.com/ajschumacher/mergic",
"keywords": null,
"license": "MIT",
"maintainer": null,
"maintainer_email": null,
"name": "mergic",
"package_url": "https://pypi.org/project/mergic/",
"platform": "UNKNOWN",
"project_url": "https://pypi.org/project/mergic/",
"project_urls": {
"Download": "https://github.com/ajschumacher/mergic/tarball/0.0.7",
"Homepage": "https://github.com/ajschumacher/mergic"
},
"release_url": "https://pypi.org/project/mergic/0.0.7/",
"requires_dist": null,
"requires_python": null,
"summary": "workflow support for reproducible deduplication and merging",
"version": "0.0.7"
},
"last_serial": 1619766,
"releases": {
"0.0.1": [
{
"comment_text": "",
"digests": {
"md5": "a9f79676039c156bf6a678038f170214",
"sha256": "f1fe4bb4a3e8e33ebd463b54afcd69ba873e7739e124e95dcaf9e873d0ee307c"
},
"downloads": -1,
"filename": "mergic-0.0.1.tar.gz",
"has_sig": false,
"md5_digest": "a9f79676039c156bf6a678038f170214",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 4811,
"upload_time": "2015-05-11T20:41:34",
"url": "https://files.pythonhosted.org/packages/54/c1/bac7f102c5df144a7edfc1aaf5978d4fce0cec445e2a860ec681efa33fc1/mergic-0.0.1.tar.gz"
}
],
"0.0.2": [
{
"comment_text": "",
"digests": {
"md5": "afc83085551b5ce7d1bea89314b948e3",
"sha256": "083b8b1dcad5d3e3884daf5dcfcd3fa53c7016d00d5a56fd03011ef73214185f"
},
"downloads": -1,
"filename": "mergic-0.0.2.tar.gz",
"has_sig": false,
"md5_digest": "afc83085551b5ce7d1bea89314b948e3",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 5203,
"upload_time": "2015-05-11T21:40:37",
"url": "https://files.pythonhosted.org/packages/cb/27/5ecacc251168e1160ca8df38ed0c9b7bdab6f1a968d08933853d14eded37/mergic-0.0.2.tar.gz"
}
],
"0.0.3": [
{
"comment_text": "",
"digests": {
"md5": "9ba3633be5716e680b578523a1db83fd",
"sha256": "af3575298e78f33b8df44cac25701f5adca493fb247d32be4fa2f215a9b0ff48"
},
"downloads": -1,
"filename": "mergic-0.0.3.tar.gz",
"has_sig": false,
"md5_digest": "9ba3633be5716e680b578523a1db83fd",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 6485,
"upload_time": "2015-05-13T01:28:20",
"url": "https://files.pythonhosted.org/packages/97/16/e880c4f3522e75ba8389ca3678ff1309563ad6fb6cae67204eb28d3fd80a/mergic-0.0.3.tar.gz"
}
],
"0.0.4": [
{
"comment_text": "",
"digests": {
"md5": "c538a4e66f6196f7c925e72a536f7763",
"sha256": "68838e789652c666fc760e4a88116e5c70125fd8202c59b29f41f1fe7295462e"
},
"downloads": -1,
"filename": "mergic-0.0.4.tar.gz",
"has_sig": false,
"md5_digest": "c538a4e66f6196f7c925e72a536f7763",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 6813,
"upload_time": "2015-05-14T03:43:13",
"url": "https://files.pythonhosted.org/packages/5e/c9/4b9431963902b85464131c2c0059a900b2893ea2bb966eeb7fe3117085a3/mergic-0.0.4.tar.gz"
}
],
"0.0.4.1": [
{
"comment_text": "",
"digests": {
"md5": "c03e724c28739d5a8df20af2d78a6be1",
"sha256": "33725836a527ada6e9373ab521d05a59522ac451f229bed2dd4170f2a2235cdd"
},
"downloads": -1,
"filename": "mergic-0.0.4.1.tar.gz",
"has_sig": false,
"md5_digest": "c03e724c28739d5a8df20af2d78a6be1",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 6723,
"upload_time": "2015-05-16T19:49:26",
"url": "https://files.pythonhosted.org/packages/5c/f2/fa1468c3536dc017aa42f83bf21c336509af30b6372006e01781fad0d6b1/mergic-0.0.4.1.tar.gz"
}
],
"0.0.5": [
{
"comment_text": "",
"digests": {
"md5": "d3713107b43918142a40fde59f1f3aac",
"sha256": "6d00d8d5993cf83f445362826425dd890bdc36039f01be3a23b1bce299a6fd11"
},
"downloads": -1,
"filename": "mergic-0.0.5.tar.gz",
"has_sig": false,
"md5_digest": "d3713107b43918142a40fde59f1f3aac",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 7008,
"upload_time": "2015-05-25T20:52:44",
"url": "https://files.pythonhosted.org/packages/65/6c/712d245cee7693e9b33c17341886975fdbbd7924ae30867ca69b10ec0a86/mergic-0.0.5.tar.gz"
}
],
"0.0.6": [
{
"comment_text": "",
"digests": {
"md5": "04fcc63b257215537fb8124c4569c498",
"sha256": "b454d480672918eb3e4f02317e143bda79c0d0e03eb3ee418262446f49589a5e"
},
"downloads": -1,
"filename": "mergic-0.0.6.tar.gz",
"has_sig": false,
"md5_digest": "04fcc63b257215537fb8124c4569c498",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 7000,
"upload_time": "2015-05-25T21:02:46",
"url": "https://files.pythonhosted.org/packages/9a/4c/8b5c3e9f1d27d6a1cc10797e3bdb896db04e2ccb73930f483e3b44842ed4/mergic-0.0.6.tar.gz"
}
],
"0.0.7": [
{
"comment_text": "",
"digests": {
"md5": "437ec3c1828ef286de0b8feef8a795d5",
"sha256": "7862c7de60c16f14bb8f1e1a76104d39a5a744c196ba7430924f1dc48d96cae0"
},
"downloads": -1,
"filename": "mergic-0.0.7.tar.gz",
"has_sig": false,
"md5_digest": "437ec3c1828ef286de0b8feef8a795d5",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 8735,
"upload_time": "2015-07-05T02:15:05",
"url": "https://files.pythonhosted.org/packages/5c/81/12321181b824f2e260158967d9673ffd2a3d2e3369d59b5762199f69b023/mergic-0.0.7.tar.gz"
}
]
},
"urls": [
{
"comment_text": "",
"digests": {
"md5": "437ec3c1828ef286de0b8feef8a795d5",
"sha256": "7862c7de60c16f14bb8f1e1a76104d39a5a744c196ba7430924f1dc48d96cae0"
},
"downloads": -1,
"filename": "mergic-0.0.7.tar.gz",
"has_sig": false,
"md5_digest": "437ec3c1828ef286de0b8feef8a795d5",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 8735,
"upload_time": "2015-07-05T02:15:05",
"url": "https://files.pythonhosted.org/packages/5c/81/12321181b824f2e260158967d9673ffd2a3d2e3369d59b5762199f69b023/mergic-0.0.7.tar.gz"
}
]
}