{ "info": { "author": "Mikhail Korobov", "author_email": "kmike84@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Framework :: Scrapy", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6" ], "description": "scrapy-crawl-once\n=================\n\n.. image:: https://img.shields.io/pypi/v/scrapy-crawl-once.svg\n :target: https://pypi.python.org/pypi/scrapy-crawl-once\n :alt: PyPI Version\n\n.. image:: https://travis-ci.org/TeamHG-Memex/scrapy-crawl-once.svg?branch=master\n :target: http://travis-ci.org/TeamHG-Memex/scrapy-crawl-once\n :alt: Build Status\n\n.. image:: http://codecov.io/github/TeamHG-Memex/scrapy-crawl-once/coverage.svg?branch=master\n :target: http://codecov.io/github/TeamHG-Memex/scrapy-crawl-once?branch=master\n :alt: Code Coverage\n\nThis package provides a Scrapy_ middleware which allows to avoid re-crawling\npages which were already downloaded in previous crawls.\n\n.. _Scrapy: https://scrapy.org/\n\nLicense is MIT.\n\nInstallation\n------------\n\n::\n\n pip install scrapy-crawl-once\n\nUsage\n-----\n\nTo enable it, modify your settings.py::\n\n SPIDER_MIDDLEWARES = {\n # ...\n 'scrapy_crawl_once.CrawlOnceMiddleware': 100,\n # ...\n }\n\n DOWNLOADER_MIDDLEWARES = {\n # ...\n 'scrapy_crawl_once.CrawlOnceMiddleware': 50,\n # ...\n }\n\nBy default it does nothing. To avoid crawling a particular page\nmultiple times set ``request.meta['crawl_once'] = True``. When a response\nis received and a callback is successful, the fingerprint of such request\nis stored to a database. When spider schedules a new request middleware\nfirst checks if its fingerprint is in the database, and drops the request\nif it is there.\n\nOther ``request.meta`` keys:\n\n* ``crawl_once_value`` - a value to store in DB. By default, timestamp\n is stored.\n* ``crawl_once_key`` - request unique id; by default request_fingerprint\n is used.\n\nSettings\n--------\n\n* ``CRAWL_ONCE_ENABLED`` - set it to False to disable middleware.\n Default is True.\n* ``CRAWL_ONCE_PATH`` - a path to a folder with crawled requests database.\n By default ``.scrapy/crawl_once/`` path inside a project dir is used;\n this folder contains ``.sqlite`` files with databases of\n seen requests.\n* ``CRAWL_ONCE_DEFAULT`` - default value for ``crawl_once`` meta key\n (False by default). When True, all requests are handled by\n this middleware unless disabled explicitly using\n ``request.meta['crawl_once'] = False``.\n\nAlternatives\n------------\n\nhttps://github.com/scrapy-plugins/scrapy-deltafetch is a similar package; it\ndoes almost the same. Differences:\n\n* scrapy-deltafetch chooses whether to discard a request or not based on\n yielded items; scrapy-crawl-once uses an explicit\n ``request.meta['crawl_once']`` flag.\n* scrapy-deltafetch uses bsddb3, scrapy-crawl-once uses sqlite.\n\nAnother alternative is a built-in `Scrapy HTTP cache`_. Differences:\n\n* scrapy cache stores all pages on disc, scrapy-crawl-once only keeps request\n fingerprints;\n* scrapy cache allows a more fine grained invalidation consistent with how\n browsers work;\n* with scrapy cache all pages are still processed (though not all pages are\n downloaded).\n\n.. _Scrapy HTTP cache: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpcache\n\nContributing\n------------\n\n* source code: https://github.com/TeamHG-Memex/scrapy-crawl-once\n* bug tracker: https://github.com/TeamHG-Memex/scrapy-crawl-once/issues\n\nTo run tests, install tox_ and run ``tox`` from the source checkout.\n\n.. _tox: https://tox.readthedocs.io/en/latest/\n\n\nCHANGES\n=======\n\n0.1.1 (2017-03-04)\n------------------\n\n* new ``'crawl_once/initial'`` value in scrapy stats - it contains the\n initial size (number of records) of crawl_once database.\n\n0.1 (2017-03-03)\n----------------\n\nInitial release.\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/TeamHG-Memex/scrapy-crawl-once", "keywords": "", "license": "MIT license", "maintainer": "", "maintainer_email": "", "name": "scrapy-crawl-once", "package_url": "https://pypi.org/project/scrapy-crawl-once/", "platform": "", "project_url": "https://pypi.org/project/scrapy-crawl-once/", "project_urls": { "Homepage": "https://github.com/TeamHG-Memex/scrapy-crawl-once" }, "release_url": "https://pypi.org/project/scrapy-crawl-once/0.1.1/", "requires_dist": null, "requires_python": "", "summary": "Scrapy middleware which allows to crawl only new content", "version": "0.1.1" }, "last_serial": 4308914, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "77b3bebf665781907a28f0a69671d9e0", "sha256": "8511cf936704875e389d23214b8991b00035e605c8a56659294d378404d40b4a" }, "downloads": -1, "filename": "scrapy_crawl_once-0.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "77b3bebf665781907a28f0a69671d9e0", "packagetype": "bdist_wheel", "python_version": "3.5", "requires_python": null, "size": 6967, "upload_time": "2017-03-02T23:22:23", "url": "https://files.pythonhosted.org/packages/43/2b/7d1a983ffc5cf9ec3e80831546488466c4de5fb29fed3deb5251ba1c9574/scrapy_crawl_once-0.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "4f6f2755603e59eb58c4092c573cbf69", "sha256": "4d552e52ddcdc285447aa4b9b71ffcbd602e8fd6be2a198ff885a61a1bf50047" }, "downloads": -1, "filename": "scrapy-crawl-once-0.1.tar.gz", "has_sig": false, "md5_digest": "4f6f2755603e59eb58c4092c573cbf69", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4977, "upload_time": "2017-03-02T23:22:15", "url": "https://files.pythonhosted.org/packages/8b/56/589f5cdf261ec1e9e2f04886970bbf7d0cfff80aef2754096d29b4b2c78d/scrapy-crawl-once-0.1.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "7cbd808e48d307faf08a88e23e87d7b3", "sha256": "60ea4e7529f99ad1ec6cacbad53828fbfa5959cc4dddfe8047557e7c189e920c" }, "downloads": -1, "filename": "scrapy_crawl_once-0.1.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "7cbd808e48d307faf08a88e23e87d7b3", "packagetype": "bdist_wheel", "python_version": "3.5", "requires_python": null, "size": 7156, "upload_time": "2017-03-03T20:28:42", "url": "https://files.pythonhosted.org/packages/49/97/8684f7a85d6be3a52f50cce2411eaaaf6c4e0d6c1598fa7b4e99578ba2cb/scrapy_crawl_once-0.1.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d3a838f8f0f01ab8555cc54e4d060d09", "sha256": "8ab832ab5c4073ba2aa498a8c6bb2a792117ecd7deadca41d7fd1cdee534caf4" }, "downloads": -1, "filename": "scrapy-crawl-once-0.1.1.tar.gz", "has_sig": false, "md5_digest": "d3a838f8f0f01ab8555cc54e4d060d09", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5096, "upload_time": "2017-03-03T20:28:33", "url": "https://files.pythonhosted.org/packages/fc/e3/e196d13482add6f506976e92fca549f3bfdeb5a015a5dc5146cfacd30d32/scrapy-crawl-once-0.1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "7cbd808e48d307faf08a88e23e87d7b3", "sha256": "60ea4e7529f99ad1ec6cacbad53828fbfa5959cc4dddfe8047557e7c189e920c" }, "downloads": -1, "filename": "scrapy_crawl_once-0.1.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "7cbd808e48d307faf08a88e23e87d7b3", "packagetype": "bdist_wheel", "python_version": "3.5", "requires_python": null, "size": 7156, "upload_time": "2017-03-03T20:28:42", "url": "https://files.pythonhosted.org/packages/49/97/8684f7a85d6be3a52f50cce2411eaaaf6c4e0d6c1598fa7b4e99578ba2cb/scrapy_crawl_once-0.1.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d3a838f8f0f01ab8555cc54e4d060d09", "sha256": "8ab832ab5c4073ba2aa498a8c6bb2a792117ecd7deadca41d7fd1cdee534caf4" }, "downloads": -1, "filename": "scrapy-crawl-once-0.1.1.tar.gz", "has_sig": false, "md5_digest": "d3a838f8f0f01ab8555cc54e4d060d09", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5096, "upload_time": "2017-03-03T20:28:33", "url": "https://files.pythonhosted.org/packages/fc/e3/e196d13482add6f506976e92fca549f3bfdeb5a015a5dc5146cfacd30d32/scrapy-crawl-once-0.1.1.tar.gz" } ] }