{ "info": { "author": "Teddy Coder", "author_email": "fedor_coder@mail.ru", "bugtrack_url": null, "classifiers": [], "description": "# First_scrap\n\n- - -\n[English](README.md), [\u0420\u0443\u0441\u0441\u043a\u0438\u0439](README-ru.md)\n- - -\n\nFirst_scrap is a library for scraping sites with multiprocessing, random proxies and random user-agents.\n\n## Installation\n\nTo get started with the first_scrap library, activate (or, if necessary, create) your virtual environment. For example:\n\n    python3 -m venv env\n    source ./env/bin/activate\n\nTo install First_scrap, use the pip package manager:\n\n    pip install firstscrap\n\nAlternatively, you can install from the source code on GitHub. To do so, run the following commands in your console:\n\n    git clone http://github.com/theodor85/first_scrap\n    cd first_scrap\n    python setup.py develop\n\n## How to use\n\nTo extract data from a single web page, create a class that derives from the `PageHandler` abstract class.\n\nIn your class, define a constructor that calls the base class constructor and sets two instance fields, `URL` and `use_selenium`:\n\n- `URL` - the URL of the web page you want to extract data from.\n- `use_selenium` - a boolean field that determines whether BeautifulSoup (`False`) or Selenium (`True`) will be used.\n\nYou must also define the method `extract_data_from_html(self, soup=None, selenium_driver=None)`. 
Use the BeautifulSoup object (`soup`) or the Selenium driver (`selenium_driver`) to extract the data.\n\nAn example is given below.\n\n```python\nfrom firstscrap.pagehandler import PageHandler\n\n# class for handling one web page\nclass OnePageHandler(PageHandler):\n\n    def __init__(self, URL):\n        super(OnePageHandler, self).__init__()\n        self.URL = URL\n        self.use_selenium = False\n\n    def extract_data_from_html(self, soup=None, selenium_driver=None):\n        data = {}\n        data['link'] = self.URL\n        data['name'] = soup.find('h1', class_='information__title___1nM29').get_text().strip()\n        data['price'] = soup.find('div', class_='information__price___2Lpc0').span.get_text().strip()\n        return data\n```\nThen create an instance of that class and call the `execute()` method.\n\n```python\nhandler = OnePageHandler('')\ndata = handler.execute()\n```\n\nThe extracted data is now in the `data` variable.\n\nTo extract data from a list of similar web pages, use the `list_handler` function:\n\n```python\nfrom firstscrap.listhandler import list_handler\n\nresult = list_handler(list_of_links, OnePageHandler, with_processes=True, process_limit=5)\n```\n\nThe function takes the following parameters:\n- `list_of_links` - a list of links to the pages from which data will be extracted;\n- `OnePageHandler` - a descendant of the `PageHandler` class that extracts data from a single web page;\n- `with_processes` - a boolean parameter indicating whether multiprocessing will be used;\n- `process_limit` - the maximum number of processes.\n\n## What's under the hood\n\nWhen extracting data from a single page:\n\n1. A random proxy server and user-agent are selected from the lists stored in a file.\n2. This proxy and user-agent are used to access the target page.\n3. With BeautifulSoup or Selenium (depending on the `use_selenium` field), the data is retrieved from the page and returned by the `execute()` method.\n\nWhen extracting data from a page list:\n\n1. If `with_processes=False`, the program retrieves data from the pages in the passed list one by one. 
At the same time, a fresh random proxy server and user-agent are used for each page.\n2. Otherwise, the program processes each page in a separate process, and the number of processes running at the same time does not exceed `process_limit`.\n\n### Prerequisites\n\nTo use the Selenium features, you must install the Google Chrome browser ([download here](https://www.google.com/intl/ru_ALL/chrome/)) and chromedriver ([installation instructions](https://sites.google.com/a/chromium.org/chromedriver/getting-started)) on your system.\n\nSupport for other browsers is planned.\n\n## Running the tests\n\nTo run the tests, type in your console:\n\n    python -m unittest -v tests/tests.py\n\nBefore running the tests, ensure that your internet connection is active.\n\n## Contributing\n\nTo contribute, please merge your code into the `develop` branch.\n\nForks and pull requests are welcome! If you like first_scrap, do not forget to give it a star!\n\n## Bug reports\n\nTo report a bug, please send an email to fedor_coder@mail.ru with the subject \"first_scrap bug reporting\".\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE.txt](LICENSE.txt) file for details.", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/theodor85/first_scrap", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "firstscrap", "package_url": "https://pypi.org/project/firstscrap/", "platform": "", "project_url": "https://pypi.org/project/firstscrap/", "project_urls": { "Homepage": "https://github.com/theodor85/first_scrap" }, "release_url": "https://pypi.org/project/firstscrap/0.1.0/", "requires_dist": null, "requires_python": "", "summary": "Scraping sites with multiprocessing, random proxies and user-agents", "version": "0.1.0" }, "last_serial": 5263585, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": 
"4e64cc171ef00ddb50444698a679acfc", "sha256": "ddb4f8d4895f8bb4f407d6ad6f4d1475af441c986f29985f56d7cee8c0096dbc" }, "downloads": -1, "filename": "firstscrap-0.1.0.tar.gz", "has_sig": false, "md5_digest": "4e64cc171ef00ddb50444698a679acfc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14176, "upload_time": "2019-05-13T17:33:28", "url": "https://files.pythonhosted.org/packages/b8/d1/c2e8dcb60847608fdb015ea4b03cd738ef3b892e06ada84994132d5e2383/firstscrap-0.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "4e64cc171ef00ddb50444698a679acfc", "sha256": "ddb4f8d4895f8bb4f407d6ad6f4d1475af441c986f29985f56d7cee8c0096dbc" }, "downloads": -1, "filename": "firstscrap-0.1.0.tar.gz", "has_sig": false, "md5_digest": "4e64cc171ef00ddb50444698a679acfc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14176, "upload_time": "2019-05-13T17:33:28", "url": "https://files.pythonhosted.org/packages/b8/d1/c2e8dcb60847608fdb015ea4b03cd738ef3b892e06ada84994132d5e2383/firstscrap-0.1.0.tar.gz" } ] }