{ "info": { "author": "Eric Ziethen", "author_email": "ericziethen@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "Operating System :: OS Independent", "Programming Language :: Python :: 3.7", "Programming Language :: Python :: Implementation :: CPython", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
# ezscrape\n\nezscrape provides a boilerplate for simple scraping tasks.\n\nIt provides generic access to scraping functionality without directly exposing the user to the underlying libraries used (e.g. requests, selenium). The scraper used is chosen based on the specified config parameters, preferring the most flexible / least resource-intensive one where possible.\n\nExceptions raised by the underlying modules are handled and converted into the status of the result object.\n\n# Setup Requirements\n## Setup Chrome and Webdriver\n\nFor some websites selenium is used for scraping (e.g. because requests doesn't support Javascript).\n\nFor that purpose the following environment variables need to be set, otherwise an exception is raised when the code uses selenium.\n\n1. The Chrome Executable\n\n *CHROME_EXEC_PATH*\n\n2. The Chrome Webdriver Executable\n\n *CHROME_WEBDRIVER_PATH*\n\n# Usage\n\nThe basic concept of a simple scrape is:\n\n1. Create the Scrape Config with a Url\n2. Optional - set additional parameters on the scrape config\n3. Scrape with the given Config\n4. Check the Return Object to see if the Scrape was successful\n5. 
Get the HTML from the Return Object\n\n## Scrape a simple HTML Page\n\n~~~\nimport ezscrape.scraping.scraper as scraper\nfrom ezscrape.scraping.core import ScrapeConfig\nfrom ezscrape.scraping.core import ScrapeStatus\n\nresult = scraper.scrape_url(ScrapeConfig('http://www.website.com'))\n\nif result.status == ScrapeStatus.SUCCESS:\n    html = result.first_page.html\nelse:\n    print(result.error_msg)\n~~~\n\n## Scrape a Page with Multiple Pages\n\n~~~\nimport ezscrape.scraping.scraper as scraper\nfrom ezscrape.scraping.core import ScrapeConfig\nfrom ezscrape.scraping.core import WaitForXpathElem\nfrom ezscrape.scraping.core import ScrapeStatus\n\nconfig = ScrapeConfig('http://www.website.com')\n# Set the \"next\" button element (here the element with \"title='id'\") that is clicked to load the following pages\nconfig.next_button = WaitForXpathElem(R'''//a[@title='id']''')\n\nresult = scraper.scrape_url(config)\n\nfor page in result:\n    if page.status == ScrapeStatus.SUCCESS:\n        html = page.html\n    else:\n        print(result.error_msg)\n~~~\n\n## Scrape a Page and wait until an Element is Loaded\n\n~~~\nimport ezscrape.scraping.scraper as scraper\nfrom ezscrape.scraping.core import ScrapeConfig\nfrom ezscrape.scraping.core import WaitForXpathElem\nfrom ezscrape.scraping.core import ScrapeStatus\n\nconfig = ScrapeConfig('http://www.website.com')\n# Add condition to wait until the Element with \"title='id'\" is loaded\nconfig.wait_for_elem_list.append(WaitForXpathElem(R'''//a[@title='id']'''))\n\nresult = scraper.scrape_url(config)\n\nif result.status == ScrapeStatus.SUCCESS:\n    html = result.first_page.html\nelse:\n    print(result.error_msg)\n~~~\n\n# Scrape Config\n\nezscrape.scraping.core.ScrapeConfig\n\nThe url is specified when creating the object.\n\n~~~\nfrom ezscrape.scraping.core import ScrapeConfig\n\nconfig = ScrapeConfig('http://some-url.com')\n~~~\n\nAdditional parameters can be specified:\n\n| Option | Purpose | Type | Default | Example Use Case 
|\n|---|---|---|---|---|\n| ScrapeConfig.url | The URL used for the request | str | N/A | Required for all Requests |\n| ScrapeConfig.request_timeout | The timeout in seconds of the request | int | 15 | Wait longer before timeout in a slow Network environment. |\n| ScrapeConfig.page_load_wait | Time to wait until a page is loaded completely before it times out | float | 5.0 | Specify a longer time if the page loads dynamic elements slowly |\n| ScrapeConfig.proxy_http | HTTP Proxy to use | str | N/A | Send the request through an HTTP proxy (the Proxy needs to support the Target protocol, i.e. HTTP/HTTPS) |\n| ScrapeConfig.proxy_https | HTTPS Proxy to use | str | N/A | Send the request through an HTTPS proxy (the Proxy needs to support the Target protocol, i.e. HTTP/HTTPS) |\n| ScrapeConfig.useragent | Custom Useragent to use | str | Internally Chosen | User wants to scrape with a custom Useragent |\n| ScrapeConfig.max_pages | Maximum Pages to collect if \"next_button\" is specified | int | 15 | User only wants to return 3 Pages max even if more pages are available |\n| ScrapeConfig.next_button | A button element that needs to be loaded and clicked to load multiple pages | ezscrape.scraping.core.WaitForPageElem or one of its subtypes, e.g. ezscrape.scraping.core.WaitForXpathElem | N/A | User wants to return multiple pages if the next page links are generated with Javascript |\n| ScrapeConfig.wait_for_elem_list | A list of Elements that need to be loaded on the page before returning the scrape result | List of ezscrape.scraping.core.WaitForPageElem or one of its subtypes, e.g. ezscrape.scraping.core.WaitForXpathElem | N/A | User is interested in multiple elements of a Javascript/Ajax page and needs to wait for all to load completely. |\n\n# Scrape Status\n\nThe following statuses are supported in ezscrape.scraping.core.ScrapeStatus\n\n| Status | Meaning |\n|---|---|\n| ScrapeStatus.SUCCESS | Scrape Successful |\n| ScrapeStatus.TIMEOUT | A timeout error occurred |\n| ScrapeStatus.PROXY_ERROR | A proxy error occurred |\n| ScrapeStatus.ERROR | A generic error occurred |\n\nFor non-Success cases, additional error details are given in the [ScrapeResult](#scrape-result) object\n\n# Scrape Result\n\nThe following attributes are available in ezscrape.scraping.core.ScrapeResult\n\n| Attribute | Purpose | Type |\n|---|---|---|\n| ScrapeResult.url | The url Scraped | str |\n| ScrapeResult.caller_ip | The caller IP. This is not set in all cases, but where it is it should be reliable (e.g. if Scraped through a proxy, the proxy IP should be shown) | str |\n| ScrapeResult.status | The overall status of the Scrape | ezscrape.scraping.core.ScrapeStatus |\n| ScrapeResult.error_msg | The error message if the result is not SUCCESS | str |\n| ScrapeResult.request_time_ms | The combined scrape time of all pages scraped | float |\n| ScrapeResult.first_page | The ScrapePage scraped (the first if multiple pages) | ezscrape.scraping.core.ScrapePage |\n\n# Scrape Page\n\nThe following attributes are available in ezscrape.scraping.core.ScrapePage\n\n| Attribute | Purpose | Type |\n|---|---|---|\n| html | The HTML content scraped | str |\n| request_time_ms | The scrape duration for this page | float |\n| status | The scrape status for this page. ScrapePage doesn't have its own error message; for details check ScrapeResult.error_msg | ezscrape.scraping.core.ScrapeStatus |\n\n## Contributing\nPull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.\n\nPlease make sure to update tests as appropriate.\n\n## License\n[GPLv3](https://choosealicense.com/licenses/gpl-3.0/)\n", "description_content_type": "text/markdown; charset=UTF-8", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/ericziethen/ezscrape", "keywords": "", "license": "GNU GPLv3", "maintainer": "", "maintainer_email": "", "name": "ezscrape", "package_url": "https://pypi.org/project/ezscrape/", "platform": "any", "project_url": "https://pypi.org/project/ezscrape/", "project_urls": { "Homepage": "https://github.com/ericziethen/ezscrape" }, "release_url": "https://pypi.org/project/ezscrape/0.3/", "requires_dist": [ "fake-useragent (>=0.1.11)", "requests (>=2.21.0)", "selenium (>=3.141.0)" ], "requires_python": ">= 3.7", "summary": "Collection of Scraping tools", "version": "0.3" }, "last_serial": 5586665, "releases": { "0.2": [ { "comment_text": "", "digests": { "md5": "4c55e60d14b02bd39e94b83135e5db0d", "sha256": "f1c16c378a91462faf6e197695607a56b85f2f4a28421886d3d98671d767ebda" }, "downloads": -1, "filename": "ezscrape-0.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "4c55e60d14b02bd39e94b83135e5db0d", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">= 3.7", "size": 23587, "upload_time": "2019-07-24T04:35:18", "url": "https://files.pythonhosted.org/packages/6e/1c/528ec0a0976aea64b9654bc30843769fb4f86bd4dbdec6faac27b610ca4d/ezscrape-0.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "4c49f479d543472608416ad1cf897bd7", "sha256": "defe06444efadb4be71c4fb019ee18f58ce55aa79af65c28ec29bc0c004b5e09"
}, "downloads": -1, "filename": "ezscrape-0.2.tar.gz", "has_sig": false, "md5_digest": "4c49f479d543472608416ad1cf897bd7", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3.7", "size": 22111, "upload_time": "2019-07-24T04:35:20", "url": "https://files.pythonhosted.org/packages/63/d1/552dabec209ddf8756976b9a732448700f4fb0c8fbf957dacda8ecf5449b/ezscrape-0.2.tar.gz" } ], "0.3": [ { "comment_text": "", "digests": { "md5": "464df57d42be8f24434bb1364f3b0cd2", "sha256": "9b3d9a2d7c576890d8994f682d11cae3c25b735b6a2006cb1d7cbdade667f3f9" }, "downloads": -1, "filename": "ezscrape-0.3-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "464df57d42be8f24434bb1364f3b0cd2", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">= 3.7", "size": 26087, "upload_time": "2019-07-26T03:45:38", "url": "https://files.pythonhosted.org/packages/7c/24/37586c437e57dff4f877960962c8c7ba207b8a09b0980b1ccd4f53f917df/ezscrape-0.3-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "801b6dd9db60304da73e1472afeef029", "sha256": "3144116b5e4c089b08808bd8746461e85884002a3154611a1f0f56f0212e26c5" }, "downloads": -1, "filename": "ezscrape-0.3.tar.gz", "has_sig": false, "md5_digest": "801b6dd9db60304da73e1472afeef029", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3.7", "size": 25617, "upload_time": "2019-07-26T03:45:39", "url": "https://files.pythonhosted.org/packages/21/2c/1561ee38b0972ad9d270366f548e1449ff3e4d32729774fbeeac1d8f6281/ezscrape-0.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "464df57d42be8f24434bb1364f3b0cd2", "sha256": "9b3d9a2d7c576890d8994f682d11cae3c25b735b6a2006cb1d7cbdade667f3f9" }, "downloads": -1, "filename": "ezscrape-0.3-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "464df57d42be8f24434bb1364f3b0cd2", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">= 3.7", "size": 26087, "upload_time": "2019-07-26T03:45:38", "url": 
"https://files.pythonhosted.org/packages/7c/24/37586c437e57dff4f877960962c8c7ba207b8a09b0980b1ccd4f53f917df/ezscrape-0.3-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "801b6dd9db60304da73e1472afeef029", "sha256": "3144116b5e4c089b08808bd8746461e85884002a3154611a1f0f56f0212e26c5" }, "downloads": -1, "filename": "ezscrape-0.3.tar.gz", "has_sig": false, "md5_digest": "801b6dd9db60304da73e1472afeef029", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3.7", "size": 25617, "upload_time": "2019-07-26T03:45:39", "url": "https://files.pythonhosted.org/packages/21/2c/1561ee38b0972ad9d270366f548e1449ff3e4d32729774fbeeac1d8f6281/ezscrape-0.3.tar.gz" } ] }