{ "info": { "author": "Daniel Omar Vergara P\u00e9rez", "author_email": "daniel.omar.vergara@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Programming Language :: Python :: Implementation :: CPython", "Programming Language :: Python :: Implementation :: PyPy" ], "description": "Vanguard kit\n==========================\n\n> A convenient way to calculate the edit distance between html files to scrape with confidence\n\nScraping can be a hard task because websites change continuously.\nWhat if there were a way to detect those changes before scraping a site?\nVanguard is a tool kit that calculates the edit distance between\ntwo html files using the Zhang-Shasha algorithm.\nThis package is based on [zss](https://github.com/timtadh/zhang-shasha).\n\n## Installation\n\nOS X & Linux:\n\nFrom PyPI:\n\n```sh\n$ pip3 install vanguardkit\n```\n\nFrom source:\n\n```sh\n$ git clone https://github.com/dany2691/vanguard-kit.git\n$ cd vanguard-kit\n$ python3 setup.py install\n```\n\n## Usage example\n\nWith vanguard, it is possible to convert html content into a tree (graph) of nodes.\nThe create_html_tree function does this; it returns an instance of the VanguardNode class, which inherits from the zss.Node class:\n\n```python\nfrom vanguardkit import create_html_tree\n\nwith open(\"target_website.html\") as target_website:\n    html_tree = create_html_tree(target_website)\n```\n\nIt is also possible to build the tree from a specific part of an html file.\n\nBy tag:\n```python\nwith open(\"target_website.html\") as target_website:\n    html_tree = create_html_tree(\n        html_file=target_website,\n        specific_tag=\"footer\"\n    )\n```\n\nBy tag and class:\n```python\nwith open(\"target_website.html\") as target_website:\n    html_tree = create_html_tree(\n        html_file=target_website,\n        
specific_tag=\"div\",\n        class_=\"main-div\"\n    )\n```\n\nBy tag and id:\n```python\nwith open(\"target_website.html\") as target_website:\n    html_tree = create_html_tree(\n        html_file=target_website,\n        specific_tag=\"div\",\n        id=\"super-div\"\n    )\n```\n\n## Calculating distance\n\nAs mentioned above, vanguard uses the Zhang-Shasha algorithm, which computes the edit distance between two given trees. The zss package does this behind the scenes; vanguard only provides a way to convert html files into trees.\n\n```python\nfrom vanguard_kit import create_html_tree, calcuate_html_tree_distance\n\nwith open(\"stored_target_website.html\") as stored_file:\n    with open(\"current_target_website.html\") as current_file:\n        previous_tree = create_html_tree(stored_file)\n        current_tree = create_html_tree(current_file)\n        print(calcuate_html_tree_distance(previous_tree, current_tree))\n        # Prints 1\n```\n\nBecause the VanguardNode class implements the __sub__ dunder method, the edit distance can also be calculated with the subtraction operator:\n\n```python\nfrom vanguard_kit import create_html_tree, calcuate_html_tree_distance\n\nwith open(\"stored_target_website.html\") as stored_file:\n    with open(\"current_target_website.html\") as current_file:\n        previous_tree = create_html_tree(stored_file)\n        current_tree = create_html_tree(current_file)\n        print(previous_tree - current_tree)\n        # Prints 1\n```\n\nTherefore, the following expression evaluates to True:\n\n```python\ncalcuate_html_tree_distance(previous_tree, current_tree) == previous_tree - current_tree\n```\n\n## Development setup\n\nThis project uses _pipenv_ for dependency resolution. It's a kind of mix between\npip and virtualenv. 
Follow these instructions to set up the development environment.\n\n```sh\n$ git clone https://github.com/dany2691/vanguard-kit.git\n$ cd vanguard-kit\n$ pipenv shell\n$ pipenv install -e .\n```\n\nTo run the test suite from the project root directory:\n\n```sh\n$ pytest tests/ -vv\n```\n\n## Meta\n\nDaniel Omar Vergara P\u00e9rez \u2013 [@dan1_net](https://twitter.com/dan1_net) \u2013 daniel.omar.vergara@gmail.com\n\n[https://github.com/dany2691](https://github.com/dany2691)\n\n## Contributing\n\n1. Fork it ()\n2. Create your feature branch (`git checkout -b feature/fooBar`)\n3. Commit your changes (`git commit -am 'Add some fooBar'`)\n4. Push to the branch (`git push origin feature/fooBar`)\n5. Create a new Pull Request\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/dany2691/vanguard-kit", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "vanguardkit", "package_url": "https://pypi.org/project/vanguardkit/", "platform": "", "project_url": "https://pypi.org/project/vanguardkit/", "project_urls": { "Homepage": "https://github.com/dany2691/vanguard-kit" }, "release_url": "https://pypi.org/project/vanguardkit/0.1.0/", "requires_dist": [ "zss", "beautifulsoup4", "html5lib" ], "requires_python": "", "summary": "A convenient way to calculate the edit distance between html files", "version": "0.1.0" }, "last_serial": 5965254, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "b078afbaa86c7c2a20f7a3c891501068", "sha256": "961b1ba5bc205fc27cb64b1fa5f67132cd4f524c3c227c358b4a0d96d8833095" }, "downloads": -1, "filename": "vanguardkit-0.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "b078afbaa86c7c2a20f7a3c891501068", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5243, "upload_time": "2019-10-12T20:08:14", "url": 
"https://files.pythonhosted.org/packages/a7/72/7fb345c60a08df78e5477828e03785718eb1c7a84b4c506a9e9c884b29dc/vanguardkit-0.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c9c9b7ba5f42559118cf96114c12ddf7", "sha256": "ea2cadbc8b1248408a593cb8922056c731705ab9bdace7345f7fffcd891151a6" }, "downloads": -1, "filename": "vanguardkit-0.1.0.tar.gz", "has_sig": false, "md5_digest": "c9c9b7ba5f42559118cf96114c12ddf7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3994, "upload_time": "2019-10-12T20:08:17", "url": "https://files.pythonhosted.org/packages/75/30/f7021faf909724cd38e28e1857ade7213401c593194a7ca8c7969c0fad0d/vanguardkit-0.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "b078afbaa86c7c2a20f7a3c891501068", "sha256": "961b1ba5bc205fc27cb64b1fa5f67132cd4f524c3c227c358b4a0d96d8833095" }, "downloads": -1, "filename": "vanguardkit-0.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "b078afbaa86c7c2a20f7a3c891501068", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5243, "upload_time": "2019-10-12T20:08:14", "url": "https://files.pythonhosted.org/packages/a7/72/7fb345c60a08df78e5477828e03785718eb1c7a84b4c506a9e9c884b29dc/vanguardkit-0.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c9c9b7ba5f42559118cf96114c12ddf7", "sha256": "ea2cadbc8b1248408a593cb8922056c731705ab9bdace7345f7fffcd891151a6" }, "downloads": -1, "filename": "vanguardkit-0.1.0.tar.gz", "has_sig": false, "md5_digest": "c9c9b7ba5f42559118cf96114c12ddf7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3994, "upload_time": "2019-10-12T20:08:17", "url": "https://files.pythonhosted.org/packages/75/30/f7021faf909724cd38e28e1857ade7213401c593194a7ca8c7969c0fad0d/vanguardkit-0.1.0.tar.gz" } ] }