{ "info": { "author": "Edgar Marca", "author_email": "matiskay@gmail.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "License :: OSI Approved :: BSD License", "Programming Language :: Python :: 3.6", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "===============\nHTML Similarity\n===============\n\n.. image:: https://travis-ci.org/matiskay/html-similarity.svg?branch=master\n :target: https://travis-ci.org/matiskay/html-similarity\n\nThis package provides a set of functions to measure the similarity between web pages.\n\nInstall\n=======\n\nThe quick way::\n\n pip install html-similarity\n\nHow it works?\n=============\n\nStructural Similarity\n---------------------\n\nWe use sequence comparison of the HTML tags to compute the structural similarity instead of\ntree edit distance because tree edit distance is slower than sequence comparison.\n\nThe idea of sequence comparison was taken from `Page Compare `_.\n\n\nStyle Similarity\n----------------\n\nExtracts the CSS classes of each HTML document and calculates the Jaccard similarity of the sets of classes.\nThe idea was taken from [1]_.\n\n\nJoint Similarity (Structural Similarity and Style Similarity)\n-------------------------------------------------------------\n\nThe joint similarity metric is calculated as::\n\n k * structural_similarity(document_1, document_2) + (1 - k) * style_similarity(document_1, document_2)\n\nThis was taken from [1]_.\n\nThe value lies in the interval between 0 and 1.\n\nRecommendations for joint similarity\n------------------------------------\n\nUsing `k=0.3` gives us better results. The style similarity tends to give more information\nabout the similarity than the structural similarity.\n\n\nReferences\n==========\n\n.. [1] `T. Gowda and C. A. Mattmann, Clustering Web Pages Based on Structure and Style Similarity (Application Paper), 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI), Pittsburgh, PA, 2016, pp. 
175-180. `_\n\nDevelopment\n===========\n\nSee the `CONTRIBUTING.md` file.\n\n\nTODO\n====\n\n* [ ] Add information about the package in pypi\n* [ ] Add documentation\n* [ ] Add examples\n\n\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/matiskay/html-similarity", "keywords": "", "license": "BSD", "maintainer": "", "maintainer_email": "", "name": "html-similarity", "package_url": "https://pypi.org/project/html-similarity/", "platform": "", "project_url": "https://pypi.org/project/html-similarity/", "project_urls": { "Homepage": "https://github.com/matiskay/html-similarity" }, "release_url": "https://pypi.org/project/html-similarity/0.3.2/", "requires_dist": [ "parsel (==1.2.0)" ], "requires_python": "", "summary": "A set of similarity metrics to compare HTML files.", "version": "0.3.2" }, "last_serial": 3282411, "releases": { "0.3.1": [ { "comment_text": "", "digests": { "md5": "cf8d0ddf665e60197ff7912f29b88ded", "sha256": "1587c52293292e48cae5eb5ffb5f02b8f95b446344f22d2ba371f832999ff0c2" }, "downloads": -1, "filename": "html_similarity-0.3.1-py3-none-any.whl", "has_sig": false, "md5_digest": "cf8d0ddf665e60197ff7912f29b88ded", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5689, "upload_time": "2017-10-26T23:50:15", "url": "https://files.pythonhosted.org/packages/1f/66/9a411c32ed3408e3e3bc9f43ad561c2f10b2eeab83700cd2c63ad5a70436/html_similarity-0.3.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "4bdb96ae07c047b36e32395ab4d46426", "sha256": "4a49c9a22b82966e6c248602ebb8922f2411e0dd3d480f28bdd0a112ee67203c" }, "downloads": -1, "filename": "html-similarity-0.3.1.tar.gz", "has_sig": false, "md5_digest": "4bdb96ae07c047b36e32395ab4d46426", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3092, "upload_time": "2017-10-26T23:50:16", "url": 
"https://files.pythonhosted.org/packages/16/56/e7816c5577100d5c37495e8d8d276cb95370b3bf447ba1d71c031a85da3c/html-similarity-0.3.1.tar.gz" } ], "0.3.2": [ { "comment_text": "", "digests": { "md5": "19ffc0da36c017cefd92210da00fadd4", "sha256": "f801a3dd9901c40b1824a57dc394dfcd846faf60e47dba9664b158a4cadebdaa" }, "downloads": -1, "filename": "html_similarity-0.3.2-py3-none-any.whl", "has_sig": false, "md5_digest": "19ffc0da36c017cefd92210da00fadd4", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5635, "upload_time": "2017-10-27T00:05:20", "url": "https://files.pythonhosted.org/packages/9f/86/2b6384bf252c241fbde594b6598c97e8e037ac7b9c83c139b358df3c6cd2/html_similarity-0.3.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a9a832b91d55ca09e177ec58e09cc0b4", "sha256": "d0e3353e5050050660a6c88898cf7215c69aa8d5041065cc122c8fb3af7da202" }, "downloads": -1, "filename": "html-similarity-0.3.2.tar.gz", "has_sig": false, "md5_digest": "a9a832b91d55ca09e177ec58e09cc0b4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3079, "upload_time": "2017-10-27T00:05:22", "url": "https://files.pythonhosted.org/packages/01/df/24b150d3fa3bd6bdb08badff1f1cb6e61c944a2f0a2fbb845e42521e38fd/html-similarity-0.3.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "19ffc0da36c017cefd92210da00fadd4", "sha256": "f801a3dd9901c40b1824a57dc394dfcd846faf60e47dba9664b158a4cadebdaa" }, "downloads": -1, "filename": "html_similarity-0.3.2-py3-none-any.whl", "has_sig": false, "md5_digest": "19ffc0da36c017cefd92210da00fadd4", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5635, "upload_time": "2017-10-27T00:05:20", "url": "https://files.pythonhosted.org/packages/9f/86/2b6384bf252c241fbde594b6598c97e8e037ac7b9c83c139b358df3c6cd2/html_similarity-0.3.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a9a832b91d55ca09e177ec58e09cc0b4", "sha256": 
"d0e3353e5050050660a6c88898cf7215c69aa8d5041065cc122c8fb3af7da202" }, "downloads": -1, "filename": "html-similarity-0.3.2.tar.gz", "has_sig": false, "md5_digest": "a9a832b91d55ca09e177ec58e09cc0b4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3079, "upload_time": "2017-10-27T00:05:22", "url": "https://files.pythonhosted.org/packages/01/df/24b150d3fa3bd6bdb08badff1f1cb6e61c944a2f0a2fbb845e42521e38fd/html-similarity-0.3.2.tar.gz" } ] }