{ "info": { "author": "Teddy Coder", "author_email": "fedor_coder@mail.ru", "bugtrack_url": null, "classifiers": [], "description": "# First_scrap\n\n- - -\n[English](README.md), [\u0420\u0443\u0441\u0441\u043a\u0438\u0439](README-ru.md)\n- - -\n\nFirst_scrap is a library for scraping sites with multiprocessing, random proxies and random user-agents.\n\n## Installation\n\nTo get started with the first_scrap library, activate (or, if necessary, create) your virtual environment. For example:\n\n    python3 -m venv env\n    source ./env/bin/activate\n\nTo install First_scrap, use the pip package manager:\n\n    pip install firstscrap\n\nAlternatively, you can install from the source code on GitHub. To do so, run the following commands in your console:\n\n    git clone http://github.com/theodor85/first_scrap\n    cd first_scrap\n    python setup.py develop\n\n## How to use\n\nTo extract data from a single web page, create a class that derives from the `PageHandler` abstract class.\n\nIn your class, define a constructor that calls the base class constructor and sets two instance fields, `URL` and `use_selenium`:\n\n- `URL` - the URL of the web page you want to extract data from.\n- `use_selenium` - a boolean field that determines whether BeautifulSoup (`False`) or Selenium (`True`) will be used.\n\nYou must also define the method `extract_data_from_html(self, soup=None, selenium_driver=None)`. 
Use the BeautifulSoup object (`soup`) or the Selenium driver (`selenium_driver`) to extract the data.\n\nAn example is given below.\n\n```python\nfrom firstscrap.pagehandler import PageHandler\n\n# class for handling one web page\nclass OnePageHandler(PageHandler):\n\n    def __init__(self, URL):\n        super(OnePageHandler, self).__init__()\n        self.URL = URL\n        self.use_selenium = False\n\n    def extract_data_from_html(self, soup=None, selenium_driver=None):\n        data = {}\n        data['link'] = self.URL\n        data['name'] = soup.find('h1', class_='information__title___1nM29').get_text().strip()\n        data['price'] = soup.find('div', class_='information__price___2Lpc0').span.get_text().strip()\n        return data\n```\nThen create an instance of that class and call the `execute()` method.\n\n```python\nhandler = OnePageHandler('')\ndata = handler.execute()\n```\n\nThe extracted data is now in the `data` variable.\n\nTo extract data from a list of similar web pages, use the `list_handler` function:\n\n```python\nfrom firstscrap.listhandler import list_handler\n\nresult = list_handler(list_of_links, OnePageHandler, with_processes=True, process_limit=5)\n```\n\nThe function takes the following parameters:\n- `list_of_links` - a list of links to the pages from which data will be extracted;\n- `OnePageHandler` - a descendant of the `PageHandler` class that extracts data from a single web page;\n- `with_processes` - a boolean parameter indicating whether multiprocessing will be used;\n- `process_limit` - the maximum number of processes.\n\n## What's under the hood\n\nWhen extracting data from a single page:\n\n1. A random proxy server and user-agent are selected from the lists stored in a file.\n2. This proxy and user-agent are used to access the target page.\n3. With BeautifulSoup or Selenium (depending on the `use_selenium` field), the data is retrieved from the page and returned by the `execute()` method.\n\nWhen extracting data from a page list:\n\n1. If `with_processes=False`, the program retrieves data from the pages in the passed list one by one. 
At the same time, a fresh random proxy server and user-agent are used for each page.\n2. Otherwise, the program processes each page in a separate process, and the number of processes running at the same time does not exceed `process_limit`.\n\n### Prerequisites\n\nTo use the Selenium features, you must install the Google Chrome browser ([download here](https://www.google.com/intl/ru_ALL/chrome/)) and chromedriver ([installation instructions](https://sites.google.com/a/chromium.org/chromedriver/getting-started)) on your system.\n\nSupport for other browsers is planned.\n\n## Running the tests\n\nTo run the tests, type in your console:\n\n    python -m unittest -v tests/tests.py\n\nBefore running the tests, ensure that your internet connection is active.\n\n## Contributing\n\nTo contribute, please merge your code into the `develop` branch.\n\nForks and pull requests are welcome! If you like first_scrap, do not forget to give it a star!\n\n## Bug reports\n\nTo report a bug, please send an email to fedor_coder@mail.ru with the subject \"first_scrap bug reporting\".\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE.txt](LICENSE.txt) file for details.", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/theodor85/first_scrap", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "firstscrap", "package_url": "https://pypi.org/project/firstscrap/", "platform": "", "project_url": "https://pypi.org/project/firstscrap/", "project_urls": { "Homepage": "https://github.com/theodor85/first_scrap" }, "release_url": "https://pypi.org/project/firstscrap/0.1.0/", "requires_dist": null, "requires_python": "", "summary": "Scraping sites with multiprocessing, random proxies and user-agents", "version": "0.1.0" }, "last_serial": 5263585, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": 
"4e64cc171ef00ddb50444698a679acfc", "sha256": "ddb4f8d4895f8bb4f407d6ad6f4d1475af441c986f29985f56d7cee8c0096dbc" }, "downloads": -1, "filename": "firstscrap-0.1.0.tar.gz", "has_sig": false, "md5_digest": "4e64cc171ef00ddb50444698a679acfc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14176, "upload_time": "2019-05-13T17:33:28", "url": "https://files.pythonhosted.org/packages/b8/d1/c2e8dcb60847608fdb015ea4b03cd738ef3b892e06ada84994132d5e2383/firstscrap-0.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "4e64cc171ef00ddb50444698a679acfc", "sha256": "ddb4f8d4895f8bb4f407d6ad6f4d1475af441c986f29985f56d7cee8c0096dbc" }, "downloads": -1, "filename": "firstscrap-0.1.0.tar.gz", "has_sig": false, "md5_digest": "4e64cc171ef00ddb50444698a679acfc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14176, "upload_time": "2019-05-13T17:33:28", "url": "https://files.pythonhosted.org/packages/b8/d1/c2e8dcb60847608fdb015ea4b03cd738ef3b892e06ada84994132d5e2383/firstscrap-0.1.0.tar.gz" } ] }