{ "info": { "author": "Dimitris Giannitsaros", "author_email": "daremon@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Other Environment", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Topic :: Internet", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "urlclustering\n=============\n\nThis package facilitates the clustering of similar URLs of a website.\n\n**Live demo**: http://urlclustering.com\n\nGeneral information\n~~~~~~~~~~~~~~~~~~~\n\nYou give a (preferably long and complete) list of URLs as input e.g.:\n\n::\n\n urls = [\n 'http://example.com',\n 'http://example.com/about',\n 'http://example.com/contact',\n\n 'http://example.com/cat/sports',\n 'http://example.com/cat/tech',\n 'http://example.com/cat/life',\n 'http://example.com/cat/politics',\n\n 'http://example.com/tag/623/tag1',\n 'http://example.com/tag/335/tag2',\n 'http://example.com/tag/671/tag3',\n\n 'http://example.com/article/?id=1',\n 'http://example.com/article/?id=2',\n 'http://example.com/article/?id=3',\n ]\n\nYou get a list of clusters as a result. For each cluster you get:\n\n- a REGEX that matches all cluster URLs\n- a HUMAN readable string representing the cluster\n- a list with all matched cluster URLs\n\nSo for our example the result is:\n\n::\n\n REGEX: http://example.com/cat/([^/]+)\n HUMAN: http://example.com/cat/[...]\n URLS:\n http://example.com/cat/sports\n http://example.com/cat/tech\n http://example.com/cat/life\n http://example.com/cat/politics\n\n REGEX: http://example.com/tag/(\\d+)/([^/]+)\n HUMAN: http://example.com/tag/[NUMBER]/[...]\n URLS:\n http://example.com/tag/623/tag1\n http://example.com/tag/335/tag2\n http://example.com/tag/671/tag3\n\n REGEX: http://example.com/article/?\\?id=(\\d+)\n HUMAN: http://example.com/article?id=[NUMBER]\n URLS:\n http://example.com/article/?id=1\n http://example.com/article/?id=2\n http://example.com/article/?id=3\n\n UNCLUSTERED URLS:\n http://example.com\n http://example.com/about\n http://example.com/contact\n\nWhen to use\n~~~~~~~~~~~\n\nThis is most useful for website analysis tools that report findings to\nthe user. E.g. a service that crawls your website and reports page\nloading time may find that 10,000 pages take >2 seconds to load. Instead\nof listing 10,000 URLs it's better to cluster them. So the end user will\nsee something like:\n\n::\n\n Slow pages (>2 secs):\n - http://example.com/ (1 URL)\n - http://example.com/sitemap (1 URL)\n - http://example.com/search?q=[...] (578 URLs)\n - http://example.com/tags?tag1=[...]&tag2=[...] (409 URLs)\n - http://example.com/article?id=[NUMBER] (7209 URLs)\n\nHow it works:\n~~~~~~~~~~~~~\n\nURLs are grouped by domain. Only same domain URLs are clustered.\n\nURLs are then grouped by a signature which is the number of path\nelements and the number of QueryString parameters & values the URL has.\n\nExamples:\n\n- http://ex.com/about has a signature of (1, 0)\n- http://ex.com/article?123 has a signature of (1, 1)\n- http://ex.com/path/to/file?par1=val1&par2=val2 has a signature of (3,4)\n\nURLs with the same signature are inserted in a tree structure. For each\npart (path element or QS parameter or QS value) two nodes are created:\n\n- One with the verbatim part.\n- One with the reduced part i.e. a regex that could replace the part.\n\nLeaf nodes hold the number of URLs that match and the number of\nreductions.\n\nE.g. inserting URL ``http://ex.com/article?123`` will create 2 top\nnodes:\n\n::\n\n root 1: `article`\n root 2: `[^/]+`\n\nAnd each top node will have two children:\n\n::\n\n child 1: `123`\n child 2: `\\d+`\n\nInserting 3 URLs of the form ``/article/[0-9]+`` would lead to a tree\nlike this:\n\n::\n\n `article` `[^/]+`\n / / \\ \\ / / \\ \\\n `123` `456` `789` `\\d+` `123` `456` `789` `\\d+`\n 1 URL 1 URL 1 URL 3 URLs 1 URL 1 URL 1 URL 3 URLs\n 0 re 0 re 0 re 1 re 1 re 1 re 1 re 2 re\n\nThe final step is to choose the best leafs. In this case ``article`` ->\n``\\d+`` is best because it macthes all 3 URLs with 1 reduction so the\ncluster returned is http://ex.com/article/[NUMBER]\n\nLicense\n~~~~~~~\n\nCopyright (c) 2015 Dimitris Giannitsaros.\n\nLicensed under the MIT License.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/daremon/urlclustering", "keywords": "cluster clustering urls", "license": "MIT", "maintainer": null, "maintainer_email": null, "name": "urlclustering", "package_url": "https://pypi.org/project/urlclustering/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/urlclustering/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/daremon/urlclustering" }, "release_url": "https://pypi.org/project/urlclustering/0.4.1/", "requires_dist": null, "requires_python": null, "summary": "Facilitate clustering of similar URLs of a website", "version": "0.4.1" }, "last_serial": 1776055, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "16f8061aa3bf48071c8afdf5d68b33ae", "sha256": "81d909405addf9feeb31d18660d7011b6c6374eae8e18b6d54a93bb4692e1143" }, "downloads": -1, "filename": "urlclustering-0.1.tar.gz", "has_sig": false, "md5_digest": "16f8061aa3bf48071c8afdf5d68b33ae", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5095, "upload_time": "2015-10-06T09:23:35", "url": "https://files.pythonhosted.org/packages/c9/52/cfa8c6bbeefb2b0d57805c89a7c2ecc98a358b7382b2e6ede63802865f3e/urlclustering-0.1.tar.gz" } ], "0.2": [ { "comment_text": "", "digests": { "md5": "09fdb5d13676f5a85bf097f29481b49a", "sha256": "7aab044d97e7fd406166b08c482267d2afdd3fa15fd35e08b1ca4a695c37d001" }, "downloads": -1, "filename": "urlclustering-0.2.tar.gz", "has_sig": false, "md5_digest": "09fdb5d13676f5a85bf097f29481b49a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6088, "upload_time": "2015-10-06T10:04:17", "url": "https://files.pythonhosted.org/packages/0b/52/9dca17d91e64b5f24ac9052c5a632467fccf3425eb90e6b969d9a791b71f/urlclustering-0.2.tar.gz" } ], "0.3": [ { "comment_text": "", "digests": { "md5": "205a0260b840f58457eb40727acf8cb4", "sha256": "477e30b49ed6da0b00bcc1b8b40e3517dca575cfc5b76bb5e7be53727f261ff0" }, "downloads": -1, "filename": "urlclustering-0.3.tar.gz", "has_sig": false, "md5_digest": "205a0260b840f58457eb40727acf8cb4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6523, "upload_time": "2015-10-06T10:26:19", "url": "https://files.pythonhosted.org/packages/65/c9/b5192b3599b27aa8e2ea7b720823acc305eb99a4c5dbca236283a56889da/urlclustering-0.3.tar.gz" } ], "0.4": [ { "comment_text": "", "digests": { "md5": "7df4dc454e3378d3edf86b8bac608ac6", "sha256": "9b4d33aa4fd18f8159d21a781dccfa6690c479b9c67e291408649d9c7b67731c" }, "downloads": -1, "filename": "urlclustering-0.4.tar.gz", "has_sig": false, "md5_digest": "7df4dc454e3378d3edf86b8bac608ac6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6562, "upload_time": "2015-10-19T13:06:52", "url": "https://files.pythonhosted.org/packages/9a/3c/ee253b07e4aba3e9f53b74315b23ba828ec525824fc5f7b66a9707989d66/urlclustering-0.4.tar.gz" } ], "0.4.1": [ { "comment_text": "", "digests": { "md5": "20c52c9acd773a12c9d6763310d35a8c", "sha256": "792270c0e10aca65d21f94343396c2719ff5c99ee7e24c3e1b7a8e991ebaafd3" }, "downloads": -1, "filename": "urlclustering-0.4.1.tar.gz", "has_sig": false, "md5_digest": "20c52c9acd773a12c9d6763310d35a8c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6541, "upload_time": "2015-10-19T13:13:14", "url": "https://files.pythonhosted.org/packages/8e/0d/3dc97f51b1b043575ec613784076f71de430b7a72e8d97d9105f665d7a57/urlclustering-0.4.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "20c52c9acd773a12c9d6763310d35a8c", "sha256": "792270c0e10aca65d21f94343396c2719ff5c99ee7e24c3e1b7a8e991ebaafd3" }, "downloads": -1, "filename": "urlclustering-0.4.1.tar.gz", "has_sig": false, "md5_digest": "20c52c9acd773a12c9d6763310d35a8c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6541, "upload_time": "2015-10-19T13:13:14", "url": "https://files.pythonhosted.org/packages/8e/0d/3dc97f51b1b043575ec613784076f71de430b7a72e8d97d9105f665d7a57/urlclustering-0.4.1.tar.gz" } ] }