{ "info": { "author": "Luca Cappelletti", "author_email": "cappelletti.luca94@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Information Technology", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3", "Topic :: Scientific/Engineering :: Information Analysis" ], "description": ".. role:: py(code)\n :language: python\n\n.. role:: json(code)\n :language: json\n\n\nTinyCrawler\n====================\n\n|travis| |sonar_quality| |sonar_maintainability| |sonar_coverage| |code_climate_maintainability| |pip|\n\nA highly customizable crawler that uses multiprocessing and proxies to download one or more websites according to the given filter, search and save functions.\n\n**REMEMBER THAT DDOS IS ILLEGAL. DO NOT USE THIS SOFTWARE FOR ILLEGAL PURPOSES.**\n\nInstalling TinyCrawler\n------------------------\n\n.. code:: shell\n\n pip install tinycrawler\n\n\nPreview (Test case)\n---------------------\nThis is a preview of the console output when running `test_base.py`_.\n\n|preview|\n\nUsage example\n---------------------\n\n.. 
code:: python\n\n from tinycrawler import TinyCrawler, Log, Statistics\n from bs4 import BeautifulSoup, SoupStrainer\n import pandas as pd\n from requests import Response\n from urllib.parse import urlparse\n import os\n import json\n\n\n def html_sanitization(html: str) -> str:\n \"\"\"Return sanitized html.\"\"\"\n return html.replace(\"WRONG CONTENT\", \"RIGHT CONTENT\")\n\n\n def get_product_name(response: Response) -> str:\n \"\"\"Return product name from given Response object.\"\"\"\n return response.url.split(\"/\")[-1].split(\".html\")[0]\n\n\n def get_product_category(soup: BeautifulSoup) -> str:\n \"\"\"Return product category from given BeautifulSoup object.\"\"\"\n return soup.find_all(\"span\")[-2].get_text()\n\n\n def parse_tables(html: str, path: str, strainer: SoupStrainer):\n \"\"\"Parse the tables in the given strained html, saving each as a csv at the given path.\"\"\"\n for table in BeautifulSoup(\n html, \"lxml\", parse_only=strainer).find_all(\"table\"):\n df = pd.read_html(html_sanitization(str(table)))[0].drop(0)\n table_name = df.columns[0]\n df.set_index(table_name, inplace=True)\n df.to_csv(\"{path}/{table_name}.csv\".format(\n path=path, table_name=table_name))\n\n\n def parse_metadata(html: str, path: str, strainer: SoupStrainer):\n \"\"\"Parse metadata from the given strained html and save it as json at the given path.\"\"\"\n with open(\"{path}/metadata.json\".format(path=path), \"w\") as f:\n json.dump({\n \"category\":\n get_product_category(\n BeautifulSoup(html, \"lxml\", parse_only=strainer))\n }, f)\n\n\n def parse(response: Response):\n \"\"\"Parse tables and metadata from given Response, saving them in a per-product directory.\"\"\"\n path = \"{root}/{product}\".format(\n root=urlparse(response.url).netloc, product=get_product_name(response))\n if not os.path.exists(path):\n os.makedirs(path)\n parse_tables(\n response.text, path,\n SoupStrainer(\n \"table\",\n attrs={\"class\": \"table table-hover table-condensed table-fixed\"}))\n\n parse_metadata(\n response.text, path,\n SoupStrainer(\"span\"))\n\n\n def url_validator(url: str, logger: 
Log, statistics: Statistics) -> bool:\n \"\"\"Return a boolean representing whether the crawler should parse the given url.\"\"\"\n return url.startswith(\"https://www.example.com/it/alimenti\")\n\n\n def file_parser(response: Response, logger: Log, statistics: Statistics):\n \"\"\"Parse the given Response if its url points to an html page.\"\"\"\n if response.url.endswith(\".html\"):\n parse(response)\n\n\n seed = \"https://www.example.com/it/alimenti\"\n crawler = TinyCrawler(follow_robots_txt=False)\n crawler.set_file_parser(file_parser)\n crawler.set_url_validator(url_validator)\n\n crawler.load_proxies(\"http://mytestserver.domain\", \"proxies.json\")\n\n crawler.run(seed)\n\n\n\nProxies are expected to be in the following format:\n\n.. code:: python\n\n [\n {\n \"ip\": \"89.236.17.108\",\n \"port\": 3128,\n \"type\": [\n \"https\",\n \"http\"\n ]\n },\n {\n \"ip\": \"128.199.141.151\",\n \"port\": 3128,\n \"type\": [\n \"https\",\n \"http\"\n ]\n }\n ]\n\n\nLicense\n--------------\nThe software is released under the MIT license.\n\n.. _`test_base.py`: https://github.com/LucaCappelletti94/tinycrawler/blob/master/tests/test_base.py\n\n.. |preview| image:: https://github.com/LucaCappelletti94/tinycrawler/blob/master/preview.png?raw=true\n\n.. |travis| image:: https://travis-ci.org/LucaCappelletti94/tinycrawler.png\n :target: https://travis-ci.org/LucaCappelletti94/tinycrawler\n\n.. |sonar_quality| image:: https://sonarcloud.io/api/project_badges/measure?project=tinycrawler.lucacappelletti&metric=alert_status\n :target: https://sonarcloud.io/dashboard/index/tinycrawler.lucacappelletti\n\n.. |sonar_maintainability| image:: https://sonarcloud.io/api/project_badges/measure?project=tinycrawler.lucacappelletti&metric=sqale_rating\n :target: https://sonarcloud.io/dashboard/index/tinycrawler.lucacappelletti\n\n.. |sonar_coverage| image:: https://sonarcloud.io/api/project_badges/measure?project=tinycrawler.lucacappelletti&metric=coverage\n :target: https://sonarcloud.io/dashboard/index/tinycrawler.lucacappelletti\n\n.. 
|code_climate_maintainability| image:: https://api.codeclimate.com/v1/badges/25fb7c6119e188dbd12c/maintainability\n :target: https://codeclimate.com/github/LucaCappelletti94/tinycrawler/maintainability\n :alt: Maintainability\n\n.. |pip| image:: https://badge.fury.io/py/tinycrawler.svg\n :target: https://badge.fury.io/py/tinycrawler\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/LucaCappelletti94/tinycrawler", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "tinycrawler", "package_url": "https://pypi.org/project/tinycrawler/", "platform": "", "project_url": "https://pypi.org/project/tinycrawler/", "project_urls": { "Homepage": "https://github.com/LucaCappelletti94/tinycrawler" }, "release_url": "https://pypi.org/project/tinycrawler/1.7.5/", "requires_dist": null, "requires_python": "", "summary": "Small customizable multiprocessing multi-proxy crawler.", "version": "1.7.5" }, "last_serial": 4516067, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "68aa7ea8340af70f0e7bc526c679c186", "sha256": "125ac5270b2cb23187cec702d1f2015ae97a1aabd5c53e609930f9617477cc0d" }, "downloads": -1, "filename": "tinycrawler-1.0.0.tar.gz", "has_sig": false, "md5_digest": "68aa7ea8340af70f0e7bc526c679c186", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11128, "upload_time": "2018-06-16T15:45:22", "url": "https://files.pythonhosted.org/packages/fc/9d/e2ee56cecf0b1c0289f09357658b19847750e327dc3ca4099ab403ccca17/tinycrawler-1.0.0.tar.gz" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "4cd25615c1e82386022c883646eaa17f", "sha256": "f775ff380bd9046318480250051ea0c35271c22b881f82ef8d9aed138e25992d" }, "downloads": -1, "filename": "tinycrawler-1.0.1.tar.gz", "has_sig": false, "md5_digest": "4cd25615c1e82386022c883646eaa17f", "packagetype": "sdist", "python_version": 
"source", "requires_python": null, "size": 11125, "upload_time": "2018-06-16T15:46:00", "url": "https://files.pythonhosted.org/packages/c3/03/e6a6d68c420ff17002fb4f444d741cf394b639c0c224a122f2dcd15c5d9e/tinycrawler-1.0.1.tar.gz" } ], "1.2.0": [ { "comment_text": "", "digests": { "md5": "d7881783ea27bfdf2828405d4c6e91ad", "sha256": "041f190b41e3e3cde1306b75fadf7535d3aec9b7ebf3756d4aba7482d905e57d" }, "downloads": -1, "filename": "tinycrawler-1.2.0.tar.gz", "has_sig": false, "md5_digest": "d7881783ea27bfdf2828405d4c6e91ad", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13024, "upload_time": "2018-06-17T19:18:06", "url": "https://files.pythonhosted.org/packages/61/96/ac1bf737f002a51de5a6a7de422a1ca7212b7d4c4e53dd6b2b2feecf7187/tinycrawler-1.2.0.tar.gz" } ], "1.5.0": [ { "comment_text": "", "digests": { "md5": "56802ae50a24343b0cb0cf3b5c798f75", "sha256": "c102c651798ab85527aec0edafafbadf47f8e35aeccc5631f6bb923b61e37978" }, "downloads": -1, "filename": "tinycrawler-1.5.0.tar.gz", "has_sig": false, "md5_digest": "56802ae50a24343b0cb0cf3b5c798f75", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13608, "upload_time": "2018-10-04T10:08:43", "url": "https://files.pythonhosted.org/packages/b5/0e/191d1716f46ab1fd12f637a3ba712424f4f945e00cb9de3a6482420ef74f/tinycrawler-1.5.0.tar.gz" } ], "1.6.0": [ { "comment_text": "", "digests": { "md5": "3939029f96a0d4c63f3e0c429c5d213a", "sha256": "1591900b5ed537cf21b964f4d4b27eacec3445aa3fe031907733a3e0e90ff9e8" }, "downloads": -1, "filename": "tinycrawler-1.6.0.tar.gz", "has_sig": false, "md5_digest": "3939029f96a0d4c63f3e0c429c5d213a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13921, "upload_time": "2018-11-04T08:18:57", "url": "https://files.pythonhosted.org/packages/99/7e/df0cc8146e4b7127ccf57ea3056d50301aba78bb2dbabb036e7eff0f8050/tinycrawler-1.6.0.tar.gz" } ], "1.6.1": [ { "comment_text": "", "digests": { "md5": 
"9db86d2b71264924dce9a16a26c6b0ca", "sha256": "e129950052d43e3f009bee9646f52b7c68dd94585eceeaec4f9e84953dd35f9c" }, "downloads": -1, "filename": "tinycrawler-1.6.1.tar.gz", "has_sig": false, "md5_digest": "9db86d2b71264924dce9a16a26c6b0ca", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14638, "upload_time": "2018-11-05T10:04:39", "url": "https://files.pythonhosted.org/packages/b1/12/da01a10b297ea1d2df591d17c4e2050c067eb4e1efb86eb8e4285b6e5352/tinycrawler-1.6.1.tar.gz" } ], "1.7.0": [ { "comment_text": "", "digests": { "md5": "a173da8b6e27d9b2ae29f3467752d7b2", "sha256": "a3bdbe6be07e35b38c58f14e024658ad8f1af9961b7fe4f55c647baf44a0640e" }, "downloads": -1, "filename": "tinycrawler-1.7.0.tar.gz", "has_sig": false, "md5_digest": "a173da8b6e27d9b2ae29f3467752d7b2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15218, "upload_time": "2018-11-16T14:16:29", "url": "https://files.pythonhosted.org/packages/46/1f/a47590e89f74f507737d46a37da0be18da9a57c167c5626b73e93ae9a28a/tinycrawler-1.7.0.tar.gz" } ], "1.7.1": [ { "comment_text": "", "digests": { "md5": "145aabcca699cf637d11ac46801ca1a8", "sha256": "2e82816f66d2a98b9bd805ebbe4c35f9c4c9c72db12058f4b42b2c9c85914d31" }, "downloads": -1, "filename": "tinycrawler-1.7.1.tar.gz", "has_sig": false, "md5_digest": "145aabcca699cf637d11ac46801ca1a8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16149, "upload_time": "2018-11-16T14:26:26", "url": "https://files.pythonhosted.org/packages/b2/41/30f50a1babd8a171b2c1441458f2a6213c5db1e1dc0a66bf1b24279313bc/tinycrawler-1.7.1.tar.gz" } ], "1.7.5": [ { "comment_text": "", "digests": { "md5": "f2221d0c2ca4962d62504ebceadc6312", "sha256": "18059f7ada5aea225777f72dd3b119d44cc24ebc9174f0a09e4fbb41f141a517" }, "downloads": -1, "filename": "tinycrawler-1.7.5.tar.gz", "has_sig": false, "md5_digest": "f2221d0c2ca4962d62504ebceadc6312", "packagetype": "sdist", "python_version": 
"source", "requires_python": null, "size": 16250, "upload_time": "2018-11-22T09:17:45", "url": "https://files.pythonhosted.org/packages/c2/f7/4d10a49ea78f4c4c7ceab208c3d67fd60ad8436c02d636bf8ef3baafc40f/tinycrawler-1.7.5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "f2221d0c2ca4962d62504ebceadc6312", "sha256": "18059f7ada5aea225777f72dd3b119d44cc24ebc9174f0a09e4fbb41f141a517" }, "downloads": -1, "filename": "tinycrawler-1.7.5.tar.gz", "has_sig": false, "md5_digest": "f2221d0c2ca4962d62504ebceadc6312", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16250, "upload_time": "2018-11-22T09:17:45", "url": "https://files.pythonhosted.org/packages/c2/f7/4d10a49ea78f4c4c7ceab208c3d67fd60ad8436c02d636bf8ef3baafc40f/tinycrawler-1.7.5.tar.gz" } ] }