{ "info": { "author": "Michal Hodur, Mikhail Korobov", "author_email": "michal.hodur@rentier.io, kmike84@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Framework :: Scrapy", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7" ], "description": "scrapy-rotating-proxies\n=======================\n\n.. image:: https://img.shields.io/pypi/v/scrapy-rotating-proxies.svg\n :target: https://pypi.python.org/pypi/scrapy-rotating-proxies\n :alt: PyPI Version\n\n.. image:: https://travis-ci.org/TeamHG-Memex/scrapy-rotating-proxies.svg?branch=master\n :target: http://travis-ci.org/TeamHG-Memex/scrapy-rotating-proxies\n :alt: Build Status\n\n.. image:: http://codecov.io/github/TeamHG-Memex/scrapy-rotating-proxies/coverage.svg?branch=master\n :target: http://codecov.io/github/TeamHG-Memex/scrapy-rotating-proxies?branch=master\n :alt: Code Coverage\n\nThis package provides a Scrapy_ middleware to use rotating proxies,\ncheck that they are alive and adjust crawling speed.\n\n.. _Scrapy: https://scrapy.org/\n\nLicense is MIT.\n\nInstallation\n------------\n\n::\n\n pip install scrapy-rotating-proxies\n\nUsage\n-----\n\nAdd ``ROTATING_PROXY_LIST`` option with a list of proxies to settings.py::\n\n ROTATING_PROXY_LIST = [\n 'proxy1.com:8000',\n 'proxy2.com:8031',\n # ...\n ]\n\nAs an alternative, you can specify a ``ROTATING_PROXY_LIST_PATH`` options\nwith a path to a file with proxies, one per line::\n\n ROTATING_PROXY_LIST_PATH = '/my/path/proxies.txt'\n\n``ROTATING_PROXY_LIST_PATH`` takes precedence over ``ROTATING_PROXY_LIST``\nif both options are present.\n\nThen add rotating_proxies middlewares to your DOWNLOADER_MIDDLEWARES::\n\n DOWNLOADER_MIDDLEWARES = {\n # ...\n 'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,\n 'rotating_proxies.middlewares.BanDetectionMiddleware': 620,\n # ...\n }\n\nAfter this all requests will be proxied using one of the proxies from\nthe ``ROTATING_PROXY_LIST`` / ``ROTATING_PROXY_LIST_PATH``.\n\nRequests with \"proxy\" set in their meta are not handled by\nscrapy-rotating-proxies. To disable proxying for a request set\n``request.meta['proxy'] = None``; to set proxy explicitly use\n``request.meta['proxy'] = \"\"``.\n\n\nConcurrency\n-----------\n\nBy default, all default Scrapy concurrency options (``DOWNLOAD_DELAY``,\n``AUTHTHROTTLE_...``, ``CONCURRENT_REQUESTS_PER_DOMAIN``, etc) become\nper-proxy for proxied requests when RotatingProxyMiddleware is enabled.\nFor example, if you set ``CONCURRENT_REQUESTS_PER_DOMAIN=2`` then\nspider will be making at most 2 concurrent connections to each proxy,\nregardless of request url domain.\n\nCustomization\n-------------\n\n``scrapy-rotating-proxies`` keeps track of working and non-working proxies,\nand re-checks non-working from time to time.\n\nDetection of a non-working proxy is site-specific.\nBy default, ``scrapy-rotating-proxies`` uses a simple heuristic:\nif a response status code is not 200, response body is empty or if\nthere was an exception then proxy is considered dead.\n\nYou can override ban detection method by passing a path to\na custom BanDectionPolicy in ``ROTATING_PROXY_BAN_POLICY`` option, e.g.::\n\n # settings.py\n ROTATING_PROXY_BAN_POLICY = 'myproject.policy.MyBanPolicy'\n\nThe policy must be a class with ``response_is_ban``\nand ``exception_is_ban`` methods. These methods can return True\n(ban detected), False (not a ban) or None (unknown). It can be convenient\nto subclass and modify default BanDetectionPolicy::\n\n # myproject/policy.py\n from rotating_proxies.policy import BanDetectionPolicy\n\n class MyPolicy(BanDetectionPolicy):\n def response_is_ban(self, request, response):\n # use default rules, but also consider HTTP 200 responses\n # a ban if there is 'captcha' word in response body.\n ban = super(MyPolicy, self).response_is_ban(request, response)\n ban = ban or b'captcha' in response.body\n return ban\n\n def exception_is_ban(self, request, exception):\n # override method completely: don't take exceptions in account\n return None\n\nInstead of creating a policy you can also implement ``response_is_ban``\nand ``exception_is_ban`` methods as spider methods, for example::\n\n class MySpider(scrapy.Spider):\n # ...\n\n def response_is_ban(self, request, response):\n return b'banned' in response.body\n\n def exception_is_ban(self, request, exception):\n return None\n\nIt is important to have these rules correct because action for a failed\nrequest and a bad proxy should be different: if it is a proxy to blame\nit makes sense to retry the request with a different proxy.\n\nNon-working proxies could become alive again after some time.\n``scrapy-rotating-proxies`` uses a randomized exponential backoff for these\nchecks - first check happens soon, if it still fails then next check is\ndelayed further, etc. Use ``ROTATING_PROXY_BACKOFF_BASE`` to adjust the\ninitial delay (by default it is random, from 0 to 5 minutes). The randomized\nexponential backoff is capped by ``ROTATING_PROXY_BACKOFF_CAP``.\n\nSettings\n--------\n\n* ``ROTATING_PROXY_LIST`` - a list of proxies to choose from;\n* ``ROTATING_PROXY_LIST_PATH`` - path to a file with a list of proxies;\n* ``ROTATING_PROXY_LOGSTATS_INTERVAL`` - stats logging interval in seconds,\n 30 by default;\n* ``ROTATING_PROXY_CLOSE_SPIDER`` - When True, spider is stopped if\n there are no alive proxies. If False (default), then when there is no\n alive proxies all dead proxies are re-checked.\n* ``ROTATING_PROXY_PAGE_RETRY_TIMES`` - a number of times to retry\n downloading a page using a different proxy. After this amount of retries\n failure is considered a page failure, not a proxy failure.\n Think of it this way: every improperly detected ban cost you\n ``ROTATING_PROXY_PAGE_RETRY_TIMES`` alive proxies. Default: 5.\n\n It is possible to change this option per-request using\n ``max_proxies_to_try`` request.meta key - for example, you can use a higher\n value for certain pages if you're sure they should work.\n* ``ROTATING_PROXY_BACKOFF_BASE`` - base backoff time, in seconds.\n Default is 300 (i.e. 5 min).\n* ``ROTATING_PROXY_BACKOFF_CAP`` - backoff time cap, in seconds.\n Default is 3600 (i.e. 60 min).\n* ``ROTATING_PROXY_BAN_POLICY`` - path to a ban detection policy.\n Default is ``'rotating_proxies.policy.BanDetectionPolicy'``.\n\n\nFAQ\n---\n\nQ: Where to get proxy lists? How to write and maintain ban rules?\n\nA: It is up to you to find proxies and maintain proper ban rules\nfor web sites; ``scrapy-rotating-proxies`` doesn't have anything built-in.\nThere are commercial proxy services like https://crawlera.com/ which can\nintegrate with Scrapy (see https://github.com/scrapy-plugins/scrapy-crawlera)\nand take care of all these details.\n\nContributing\n------------\n\n* source code: https://github.com/TeamHG-Memex/scrapy-rotating-proxies\n* bug tracker: https://github.com/TeamHG-Memex/scrapy-rotating-proxies/issues\n\nTo run tests, install tox_ and run ``tox`` from the source checkout.\n\n.. _tox: https://tox.readthedocs.io/en/latest/\n\n----\n\n.. image:: https://hyperiongray.s3.amazonaws.com/define-hg.svg\n :target: https://www.hyperiongray.com/?pk_campaign=github&pk_kwd=scrapy-rotating-proxies\n :alt: define hyperiongray\n\n\nCHANGES\n=======\n\n0.6 (2018-12-28)\n----------------\n\nProxy information is added to scrapy stats:\n\n* proxies/unchecked\n* proxies/reanimated\n* proxies/dead\n* proxies/good\n* proxies/mean_backoff\n\n0.5 (2017-10-09)\n----------------\n\n* ``ROTATING_PROXY_LIST_PATH`` option allows to pass file name\n with a proxy list.\n\n0.4 (2017-06-06)\n----------------\n\n* ``ROTATING_PROXY_BACKOFF_CAP`` option allows to change max backoff time\n from the default 1 hour.\n\n0.3.2 (2017-06-05)\n------------------\n\n* fixed proxy authentication issue.\n\n0.3.1 (2017-03-20)\n------------------\n\n* fixed OverflowError during backoff computation.\n\n0.3 (2017-03-14)\n----------------\n\n* redirects with empty bodies are no longer considered bans\n (thanks Diga Widyaprana).\n* ``ROTATING_PROXY_BAN_POLICY`` option allows to customize ban detection\n for all spiders.\n\n0.2.3 (2017-03-03)\n------------------\n\n* ``max_proxies_to_try`` request.meta key allows to override\n ``ROTATING_PROXY_PAGE_RETRY_TIMES`` option per-request.\n\n0.2.2 (2017-03-01)\n------------------\n\n* Update default ban detection rules: scrapy.exceptions.IgnoreRequest\n is not a ban.\n\n0.2.1 (2017-02-08)\n------------------\n\n* changed ``ROTATING_PROXY_PAGE_RETRY_TIMES`` default value - it is now 5.\n\n0.2 (2017-02-07)\n----------------\n\n* improved default ban detection rules;\n* log ban stats.\n\n0.1 (2017-02-01)\n----------------\n\nInitial release\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/rentieranalytics/rentier-scrapy-proxy-rotator", "keywords": "", "license": "MIT license", "maintainer": "", "maintainer_email": "", "name": "rentier-scrapy-proxy-rotator", "package_url": "https://pypi.org/project/rentier-scrapy-proxy-rotator/", "platform": "", "project_url": "https://pypi.org/project/rentier-scrapy-proxy-rotator/", "project_urls": { "Homepage": "https://github.com/rentieranalytics/rentier-scrapy-proxy-rotator" }, "release_url": "https://pypi.org/project/rentier-scrapy-proxy-rotator/0.6.2/", "requires_dist": [ "attrs (>16.0.0)", "six", "typing" ], "requires_python": "", "summary": "Rentier Analytics Scrapy Proxy Rotator", "version": "0.6.2" }, "last_serial": 4850974, "releases": { "0.6.2": [ { "comment_text": "", "digests": { "md5": "facc8d2b764dd30d0672eb8311bcddc4", "sha256": "53cda5b289856470216ba951e03f3ded813d8c56577916aa959230451a684ce0" }, "downloads": -1, "filename": "rentier_scrapy_proxy_rotator-0.6.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "facc8d2b764dd30d0672eb8311bcddc4", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 12561, "upload_time": "2019-02-21T16:48:41", "url": "https://files.pythonhosted.org/packages/73/6b/228b8676e235565786d35b6b176f0bd6f30a5e18b0264fafc408702c5a0f/rentier_scrapy_proxy_rotator-0.6.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "162e249484af2b1aed45a30aa772b4b1", "sha256": "237ec0147eef395a4ed8951997da222bdecb5c7acb53d8b7abc9406b3a082b0b" }, "downloads": -1, "filename": "rentier-scrapy-proxy-rotator-0.6.2.tar.gz", "has_sig": false, "md5_digest": "162e249484af2b1aed45a30aa772b4b1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13170, "upload_time": "2019-02-21T16:48:43", "url": "https://files.pythonhosted.org/packages/ca/89/bdf101da507142cc9c5cc787a60041e0326aba1e1f79cc5d599378cc4893/rentier-scrapy-proxy-rotator-0.6.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "facc8d2b764dd30d0672eb8311bcddc4", "sha256": "53cda5b289856470216ba951e03f3ded813d8c56577916aa959230451a684ce0" }, "downloads": -1, "filename": "rentier_scrapy_proxy_rotator-0.6.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "facc8d2b764dd30d0672eb8311bcddc4", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 12561, "upload_time": "2019-02-21T16:48:41", "url": "https://files.pythonhosted.org/packages/73/6b/228b8676e235565786d35b6b176f0bd6f30a5e18b0264fafc408702c5a0f/rentier_scrapy_proxy_rotator-0.6.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "162e249484af2b1aed45a30aa772b4b1", "sha256": "237ec0147eef395a4ed8951997da222bdecb5c7acb53d8b7abc9406b3a082b0b" }, "downloads": -1, "filename": "rentier-scrapy-proxy-rotator-0.6.2.tar.gz", "has_sig": false, "md5_digest": "162e249484af2b1aed45a30aa772b4b1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13170, "upload_time": "2019-02-21T16:48:43", "url": "https://files.pythonhosted.org/packages/ca/89/bdf101da507142cc9c5cc787a60041e0326aba1e1f79cc5d599378cc4893/rentier-scrapy-proxy-rotator-0.6.2.tar.gz" } ] }