{ "info": { "author": "Scrapinghub", "author_email": "info@scrapinghub.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "License :: OSI Approved :: BSD License", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.5" ], "description": "==========\nscrapy-hcf\n==========\n\n.. image:: https://travis-ci.org/scrapy-plugins/scrapy-hcf.svg?branch=master\n :target: https://travis-ci.org/scrapy-plugins/scrapy-hcf\n\n.. image:: https://codecov.io/gh/scrapy-plugins/scrapy-hcf/branch/master/graph/badge.svg\n :target: https://codecov.io/gh/scrapy-plugins/scrapy-hcf\n\n\nThis Scrapy spider middleware uses the HCF backend from Scrapinghub's\nScrapy Cloud service to retrieve the new urls to crawl\nand store back the links extracted.\n\n\nInstallation\n============\n\nInstall scrapy-hcf using ``pip``::\n\n $ pip install scrapy-hcf\n\n\nConfiguration\n=============\n\nTo activate this middleware it needs to be added to the ``SPIDER_MIDDLEWARES``\ndict, i.e::\n\n SPIDER_MIDDLEWARES = {\n 'scrapy_hcf.HcfMiddleware': 543,\n }\n\nAnd the following settings need to be defined:\n\n``HS_AUTH``\n Scrapy Cloud API key\n\n``HS_PROJECTID``\n Scrapy Cloud project ID (not needed if the spider is ran on dash)\n\n``HS_FRONTIER``\n Frontier name.\n\n``HS_CONSUME_FROM_SLOT``\n Slot from where the spider will read new URLs.\n\nNote that ``HS_FRONTIER`` and ``HS_CONSUME_FROM_SLOT`` can be overriden\nfrom inside a spider using the spider attributes ``hs_frontier``\nand ``hs_consume_from_slot`` respectively.\n\nThe following optional Scrapy settings can be defined:\n\n``HS_ENDPOINT``\n URL to the API endpoint, i.e: http://localhost:8003.\n The default value is provided by the python-hubstorage package.\n\n``HS_MAX_LINKS``\n Number of links to be read from the HCF, the default is 1000.\n\n``HS_START_JOB_ENABLED``\n Enable whether to start a new job when the spider finishes.\n The default is ``False``\n\n``HS_START_JOB_ON_REASON``\n This is a list of closing reasons,\n if the spider ends with any of these reasons a new job will be started\n for the same slot. The default is ``['finished']``\n\n``HS_NUMBER_OF_SLOTS``\n This is the number of slots that the middleware will use to store the new links.\n The default is 8.\n\n\nUsage\n=====\n\nThe following keys can be defined in a Scrapy Request meta in order to control the behavior\nof the HCF middleware:\n\n``'use_hcf'``\n If set to ``True`` the request will be stored in the HCF.\n\n``'hcf_params'``\n Dictionary of parameters to be stored in the HCF with the request fingerprint\n\n ``'qdata'``\n data to be stored along with the fingerprint in the request queue\n\n ``'fdata'``\n data to be stored along with the fingerprint in the fingerprint set\n\n ``'p'``\n Priority - lower priority numbers are returned first. The default is 0\n\nThe value of ``'qdata'`` parameter could be retrieved later using\n``response.meta['hcf_params']['qdata']``.\n\nThe spider can override the default slot assignation function by setting the\nspider ``slot_callback`` method to a function with the following signature::\n\n def slot_callback(request):\n ...\n return slot\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/scrapy-plugins/scrapy-hcf", "keywords": "", "license": "BSD", "maintainer": "", "maintainer_email": "", "name": "scrapy-hcf", "package_url": "https://pypi.org/project/scrapy-hcf/", "platform": "Any", "project_url": "https://pypi.org/project/scrapy-hcf/", "project_urls": { "Homepage": "http://github.com/scrapy-plugins/scrapy-hcf" }, "release_url": "https://pypi.org/project/scrapy-hcf/1.0.0/", "requires_dist": [ "hubstorage", "scrapinghub", "scrapy" ], "requires_python": "", "summary": "Scrapy spider middleware to use Scrapinghub's Hub Crawl Frontier as a backend for URLs", "version": "1.0.0" }, "last_serial": 2228423, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "1ae7aa6dcc20675f52a7a61719528ec4", "sha256": "b1657f11f08f2f09878bfd9aa4e6c947663b6c1e34579d287c29400bbbd5fb61" }, "downloads": -1, "filename": "scrapy_hcf-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "1ae7aa6dcc20675f52a7a61719528ec4", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 4805, "upload_time": "2016-07-18T14:17:28", "url": "https://files.pythonhosted.org/packages/e9/8f/c77ce6675835eedd70cd7f7cfadfed8071c55707ad094bf190851e6f8c87/scrapy_hcf-0.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "38054c0fa202b68fca7c32daaa773714", "sha256": "694252fe394d0a9dad0e77ff227c477247ece36e9b712e9ec4a5626402c24ece" }, "downloads": -1, "filename": "scrapy-hcf-0.1.0.tar.gz", "has_sig": false, "md5_digest": "38054c0fa202b68fca7c32daaa773714", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4444, "upload_time": "2016-07-18T14:17:31", "url": "https://files.pythonhosted.org/packages/21/57/ab88e88eb16383f142e5cd039556b086d5ee14a7176a443b2401215f5ba2/scrapy-hcf-0.1.0.tar.gz" } ], "1.0.0": [ { "comment_text": "", "digests": { "md5": "c5722a8e5372dda845b2f66a3b30d53b", "sha256": "a5ae573a655a3efe31af7262238a33f294192e0f18a7fa2a5590ed729e29d493" }, "downloads": -1, "filename": "scrapy_hcf-1.0.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "c5722a8e5372dda845b2f66a3b30d53b", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 4804, "upload_time": "2016-07-18T14:38:20", "url": "https://files.pythonhosted.org/packages/6e/bb/709957580f8bd2e5163094776d650f9ea91dc2450c67b1b7db4003a0b99f/scrapy_hcf-1.0.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "3cc8d2e1352891fc40b5fa9f3d4b433d", "sha256": "ef36065bba23571af7a6a3a770c710664a5f86447cedee6b476e2c846c63013f" }, "downloads": -1, "filename": "scrapy-hcf-1.0.0.tar.gz", "has_sig": false, "md5_digest": "3cc8d2e1352891fc40b5fa9f3d4b433d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4434, "upload_time": "2016-07-18T14:38:22", "url": "https://files.pythonhosted.org/packages/2f/7c/b39ebb49f0e0dbc04d41b32e5252346c3b2a84b3dfc0cbd2c4c56fc8c701/scrapy-hcf-1.0.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "c5722a8e5372dda845b2f66a3b30d53b", "sha256": "a5ae573a655a3efe31af7262238a33f294192e0f18a7fa2a5590ed729e29d493" }, "downloads": -1, "filename": "scrapy_hcf-1.0.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "c5722a8e5372dda845b2f66a3b30d53b", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 4804, "upload_time": "2016-07-18T14:38:20", "url": "https://files.pythonhosted.org/packages/6e/bb/709957580f8bd2e5163094776d650f9ea91dc2450c67b1b7db4003a0b99f/scrapy_hcf-1.0.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "3cc8d2e1352891fc40b5fa9f3d4b433d", "sha256": "ef36065bba23571af7a6a3a770c710664a5f86447cedee6b476e2c846c63013f" }, "downloads": -1, "filename": "scrapy-hcf-1.0.0.tar.gz", "has_sig": false, "md5_digest": "3cc8d2e1352891fc40b5fa9f3d4b433d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4434, "upload_time": "2016-07-18T14:38:22", "url": "https://files.pythonhosted.org/packages/2f/7c/b39ebb49f0e0dbc04d41b32e5252346c3b2a84b3dfc0cbd2c4c56fc8c701/scrapy-hcf-1.0.0.tar.gz" } ] }