{ "info": { "author": "Luca Verginer", "author_email": "luca@verginer.eu", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6" ], "description": "=======\ndisamby\n=======\n\n\n.. image:: https://img.shields.io/pypi/v/disamby.svg\n :target: https://pypi.python.org/pypi/disamby\n\n.. image:: https://img.shields.io/travis/verginer/disamby.svg\n :target: https://travis-ci.org/verginer/disamby\n\n.. image:: https://readthedocs.org/projects/disamby/badge/?version=latest\n :target: https://disamby.readthedocs.io/en/latest/?badge=latest\n :alt: Documentation Status\n\n.. image:: https://pyup.io/repos/github/verginer/disamby/shield.svg\n :target: https://pyup.io/repos/github/verginer/disamby/\n :alt: Updates\n\n* Free software: MIT license\n* Documentation: https://disamby.readthedocs.io.\n\n``disamby`` is a python package designed to carry out entity disambiguation based on fuzzy\nstring matching.\n\nIt works best for entities which if the same have very similar strings.\nExamples of situation where this disambiguation algorithm works fairly well is with\ncompany names and addresses which have typos, alternative spellings or composite names.\nOther use-cases include identifying people in a database where the name might be misspelled.\n\nThe algorithm works by exploiting how informative a given word/token is, based on the\nobserved frequencies in the whole corpus of strings. For example the word 'inc' in the\ncase of firm names is not very informative, however \"Solomon\" is, since the former appears\nrepeatedly whereas the second rarely.\n\nWith these frequencies the algorithms computes for a given pair of instances how similar\nthey are, and if they are above an arbitrary threshold they are connected in an\n\"alias graph\" (i.e. a directed network where an entity is connected to an other\nif it is similar enough). After all records have been connected in this way disamby\nreturns sets of entities, which are strongly connected [2]_ . Strongly connected means\nin this case that there exists a path from all nodes to all nodes within the component.\n\n\nExample\n-------\n\nTo use disamby in a project::\n\n import pandas as pd\n import disamby.preprocessors as pre\n form disamby import Disamby\n\n # create a dataframe with the fields you intend to match on as columns\n df = pd.DataFrame({\n 'name': ['Luca Georger', 'Luca Geroger', 'Adrian Sulzer'],\n 'address': ['Mira, 34, Augsburg', 'Miri, 34, Augsburg', 'Milano, 34']},\n index= ['L1', 'L2', 'O1']\n )\n\n # define the pipeline to process the strings, note that the last step must return\n # a tuple of strings\n pipeline = [\n pre.normalize_whitespace,\n pre.remove_punctuation,\n lambda x: pre.trigram(x) + pre.split_words(x) # any python function is allowed\n ]\n\n # instantiate the disamby object, it applies the given pre-processing pipeline and\n # computes their frequency.\n dis = Disamby(df, pipeline)\n\n # let disamby compute disambiguated sets. Node that a threshold must be given or it\n # defaults to 0.\n dis.disambiguated_sets(threshold=0.5)\n [{'L2', 'L1'}, {'O1'}] # output\n\n # To check if the sets are accurate you can get the rows from the\n # pandas dataframe like so:\n df.loc[['L2', 'L1']]\n\n\n\nInstallation\n------------\n\nTo install disamby, run this command in your terminal:\n\n.. code-block:: console\n\n $ pip install disamby\n\nThis is the preferred method to install disamby, as it will always install the most recent stable release.\nIf you don't have `pip`_ installed, this `Python installation guide`_ can guide\nyou through the process.\n\n.. _pip: https://pip.pypa.io\n.. _Python installation guide: http://docs.python-guide.org/en/latest/starting/installation/\n\nYou can also install it from source as follows\nThe sources for disamby can be downloaded from the `Github repo`_.\nYou can either clone the public repository:\n\n.. code-block:: console\n\n $ git clone git://github.com/verginer/disamby\n\nOr download the `tarball`_:\n\n.. code-block:: console\n\n $ curl -OL https://github.com/verginer/disamby/tarball/master\n\nOnce you have a copy of the source, you can install it with:\n\n.. code-block:: console\n\n $ python setup.py install\n\n\n.. _Github repo: https://github.com/verginer/disamby\n.. _tarball: https://github.com/verginer/disamby/tarball/master\n\n\nCredits\n---------\nI got the inspiration for this package from the seminar \"The SearchEngine - A Tool for\nMatching by Fuzzy Criteria\" by Thorsten Doherr at the CISS [1]_ Summer School 2017\n\n.. [1] http://www.euro-ciss.eu/ciss/home.html\n.. [2] https://en.wikipedia.org/wiki/Strongly_connected_component\n\n\n=======\nHistory\n=======\n\n\n0.2.3 (2017-07-01)\n------------------\n\n* Fixes formatting breaking pypi display\n\n\n0.2.2 (2017-06-30)\n------------------\n\n* working release with minimal documentation\n* works with multiple field matching\n* carries out all steps autonomously from string pre-processing to\n identifying the strongly connected components\n\n\n0.1.0 (2017-06-24)\n------------------\n\n* First release on PyPI.\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/verginer/disamby", "keywords": "disamby", "license": "MIT license", "maintainer": "", "maintainer_email": "", "name": "disamby", "package_url": "https://pypi.org/project/disamby/", "platform": "", "project_url": "https://pypi.org/project/disamby/", "project_urls": { "Homepage": "https://github.com/verginer/disamby" }, "release_url": "https://pypi.org/project/disamby/0.2.4/", "requires_dist": null, "requires_python": "", "summary": "Python package to carry out entity disambiguation based on string matching", "version": "0.2.4" }, "last_serial": 2992488, "releases": { "0.2.1": [ { "comment_text": "", "digests": { "md5": "64206ed15c1117994cbfdf4243bf0a9f", "sha256": "6025d07f17d14d08d2b3c503befaee346fbc8c3214df14cd09980e73f1f9fe5a" }, "downloads": -1, "filename": "disamby-0.2.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "64206ed15c1117994cbfdf4243bf0a9f", "packagetype": "bdist_wheel", "python_version": "3.6", "requires_python": null, "size": 9219, "upload_time": "2017-06-30T08:15:51", "url": "https://files.pythonhosted.org/packages/5e/c2/0a2db0f5ab86a7376ac645f2ab84330058c73ec4e4da3442c262d80c9f36/disamby-0.2.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "83b49a3efb471891e923d4ed172de3f8", "sha256": "d0192d5617a899591ea90da493466f18fa99f386178de8cc16eadb141de2cf18" }, "downloads": -1, "filename": "disamby-0.2.1.tar.gz", "has_sig": false, "md5_digest": "83b49a3efb471891e923d4ed172de3f8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 636142, "upload_time": "2017-06-30T08:15:41", "url": "https://files.pythonhosted.org/packages/12/90/735ae71ddf2d6d8fc3498bccec719549294ca470d0b57cac3e483a2e2829/disamby-0.2.1.tar.gz" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "ef5f4549a276c754ff45a7d6bd6314d8", "sha256": "c6fe75ad1fb59c8a83019449bcf576296ce7d90b6cc5e267bba16f6a33c2ed8f" }, "downloads": -1, "filename": "disamby-0.2.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "ef5f4549a276c754ff45a7d6bd6314d8", "packagetype": "bdist_wheel", "python_version": "3.6", "requires_python": null, "size": 12587, "upload_time": "2017-06-30T21:19:24", "url": "https://files.pythonhosted.org/packages/68/68/ed1ec55acb8961493c5b5ef5ecb1f9b299f686c25d8b026c4db027afe424/disamby-0.2.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "8dfa342c1cca1ca3b0d94f4b4da3e0c3", "sha256": "d55cea979e11e071779bb3da72395d20eb33c3b622d172085c144d5fc0b2ce76" }, "downloads": -1, "filename": "disamby-0.2.2.tar.gz", "has_sig": false, "md5_digest": "8dfa342c1cca1ca3b0d94f4b4da3e0c3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 638821, "upload_time": "2017-06-30T21:19:20", "url": "https://files.pythonhosted.org/packages/b6/fc/c74a7dfe9f2d03ed3064bc007a49a95f4c13eaacb03909ff1c636d090d82/disamby-0.2.2.tar.gz" } ], "0.2.3": [ { "comment_text": "", "digests": { "md5": "c3ee9bbacb673a5ceaffa3b85b36fb37", "sha256": "52844689c88bd817289d16b1d0d3ce724bd7834f01914c5d22b00225b290ac94" }, "downloads": -1, "filename": "disamby-0.2.3-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "c3ee9bbacb673a5ceaffa3b85b36fb37", "packagetype": "bdist_wheel", "python_version": "3.6", "requires_python": null, "size": 12547, "upload_time": "2017-07-01T08:14:21", "url": "https://files.pythonhosted.org/packages/17/89/1d857f28977c7f038211d7cb84de05b5f1d2212955d4d30ac8d3c7c8bc65/disamby-0.2.3-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "5e8cf9ba35a77b769eca1168f1fce0a0", "sha256": "02649b72011b14713dcf8a394e5effe741901c66b189775d3dc393cbbd64116f" }, "downloads": -1, "filename": "disamby-0.2.3.tar.gz", "has_sig": false, "md5_digest": "5e8cf9ba35a77b769eca1168f1fce0a0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 638784, "upload_time": "2017-07-01T08:14:18", "url": "https://files.pythonhosted.org/packages/8f/5f/08f0bb63c931105c0488eb91ad1aa19cc15de6f038f15c5ae244e4d61e1b/disamby-0.2.3.tar.gz" } ], "0.2.4": [ { "comment_text": "", "digests": { "md5": "64ee029df741c169afc2cb3d9b0f44b1", "sha256": "8a96672d09e2253b9c2587e66cb07ec347d4771e64cae3aecf5844059b3d45b8" }, "downloads": -1, "filename": "disamby-0.2.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "64ee029df741c169afc2cb3d9b0f44b1", "packagetype": "bdist_wheel", "python_version": "3.6", "requires_python": null, "size": 13305, "upload_time": "2017-07-01T14:15:50", "url": "https://files.pythonhosted.org/packages/ef/ae/b5f4a7149c8a246341a6cb67f305215bc312d123e81e4f36b770238c4ef5/disamby-0.2.4-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9a5f75191b81cece775d66c89f2e9353", "sha256": "301d502cf4f6ba909df06567059ef83e3e9bc9a23e17aa56c001f7e95ed1c461" }, "downloads": -1, "filename": "disamby-0.2.4.tar.gz", "has_sig": false, "md5_digest": "9a5f75191b81cece775d66c89f2e9353", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 639197, "upload_time": "2017-07-01T14:15:47", "url": "https://files.pythonhosted.org/packages/cb/e0/68ff4bc46425df21fdeafa9734efd00859d51a977b4e3f5ff05f562c482c/disamby-0.2.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "64ee029df741c169afc2cb3d9b0f44b1", "sha256": "8a96672d09e2253b9c2587e66cb07ec347d4771e64cae3aecf5844059b3d45b8" }, "downloads": -1, "filename": "disamby-0.2.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "64ee029df741c169afc2cb3d9b0f44b1", "packagetype": "bdist_wheel", "python_version": "3.6", "requires_python": null, "size": 13305, "upload_time": "2017-07-01T14:15:50", "url": "https://files.pythonhosted.org/packages/ef/ae/b5f4a7149c8a246341a6cb67f305215bc312d123e81e4f36b770238c4ef5/disamby-0.2.4-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9a5f75191b81cece775d66c89f2e9353", "sha256": "301d502cf4f6ba909df06567059ef83e3e9bc9a23e17aa56c001f7e95ed1c461" }, "downloads": -1, "filename": "disamby-0.2.4.tar.gz", "has_sig": false, "md5_digest": "9a5f75191b81cece775d66c89f2e9353", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 639197, "upload_time": "2017-07-01T14:15:47", "url": "https://files.pythonhosted.org/packages/cb/e0/68ff4bc46425df21fdeafa9734efd00859d51a977b4e3f5ff05f562c482c/disamby-0.2.4.tar.gz" } ] }