{ "info": { "author": "orangain", "author_email": "orangain@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Framework :: Scrapy", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7" ], "description": "# Scrapy S3 Pipeline\n\n[![PyPI version](https://badge.fury.io/py/scrapy-s3pipeline.svg)](https://badge.fury.io/py/scrapy-s3pipeline)\n\nScrapy pipeline to store items in an S3 bucket in JSON Lines format. Unlike the built-in [FeedExporter](https://docs.scrapy.org/en/latest/topics/feed-exports.html#s3), the pipeline has the following features:\n\n* The pipeline uploads items to S3 in chunks while the crawler is running.\n* It supports Gzip compression.\n\nThe pipeline is designed for running the crawler and the scraper in different processes, e.g. running the crawler process with Scrapy in AWS Fargate and the scraper process with lxml in AWS Lambda.\n\n## Requirements\n\n* Python 3.4+ (Tested in 3.7)\n* Scrapy 1.1+ (Tested in 1.6)\n* boto3\n\n## Install\n\n```shell-session\n$ pip3 install scrapy-s3pipeline\n```\n\n## Getting started\n\n1. Install Scrapy S3 Pipeline with pip.\n\n    ```shell-session\n    $ pip3 install scrapy-s3pipeline\n    ```\n\n2. Add `'s3pipeline.S3Pipeline'` to the `ITEM_PIPELINES` setting in your Scrapy project.\n\n    ```py\n    ITEM_PIPELINES = {\n        's3pipeline.S3Pipeline': 100,  # Add this line.\n    }\n    ```\n\n3. Add the `S3PIPELINE_URL` setting. Change `my-bucket` to your bucket name.\n\n    ```py\n    S3PIPELINE_URL = 's3://my-bucket/{name}/{time}/items.{chunk:07d}.jl.gz'\n    ```\n\n4. Set up AWS credentials via the AWS CLI's `aws configure` command. Alternatively, use Scrapy's `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` settings.\n\n5. Run your spider. 
You will see items in your bucket after 100 items are crawled (the default chunk size) or when the spider is closed.\n\n## Settings\n\n### S3PIPELINE_URL (Required)\n\nS3 bucket URL to store items.\n\ne.g.: `s3://my-bucket/{name}/{time}/items.{chunk:07d}.jl.gz`\n\nThe following replacement fields are supported in `S3PIPELINE_URL`.\n\n* `{chunk}` - replaced by the start index of the items in the current chunk, e.g. '0', '100', '200', ...\n* `{time}` - replaced by the timestamp at which the spider started.\n\nYou can also use other spider fields, e.g. `{name}`. You can use [format string syntax](https://docs.python.org/3/library/string.html#formatstrings) here, e.g. `{chunk:07d}`.\n\n### S3PIPELINE_MAX_CHUNK_SIZE (Optional)\n\nDefault: `100`\n\nMaximum number of items in a single chunk.\n\n### S3PIPELINE_GZIP (Optional)\n\nDefault: `True` if `S3PIPELINE_URL` ends with `.gz`; otherwise `False`.\n\nIf `True`, uploaded files will be compressed with Gzip.\n\n## Page item\n\nFor convenience, Scrapy S3 Pipeline provides the `s3pipeline.Page` item class for storing an entire HTTP body. It has `url`, `body` and `crawled_at` fields.\n\nThis makes it easy to store the entire HTTP body and run the scraper in another process. 
It suits a serverless architecture that runs the scraper in AWS Lambda.\n\nExample usage of Page:\n\n```py\nfrom datetime import datetime, timezone\n\nimport scrapy\nfrom s3pipeline import Page\n\n# ...\n\nclass YourSpider(scrapy.Spider):\n\n    # ...\n\n    def parse(self, response):\n        # You can create a Page instance in just one line.\n        yield Page.from_response(response)\n\n        # Or, you can fill item fields manually.\n        item = Page()\n        item['url'] = response.url\n        item['body'] = response.text\n        item['crawled_at'] = datetime.now(timezone.utc).replace(microsecond=0).isoformat()\n        yield item\n```\n\nNote: Page's body is omitted when printed to logs to improve the readability of the logs.\n\n## Development\n\n### Test\n\n```\n$ python3 setup.py test\n```\n\n### Release\n\n```\n$ python3 setup.py bdist_wheel sdist\n$ twine upload dist/*\n```\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/orangain/scrapy-s3pipeline", "keywords": "scrapy pipeline aws s3 serverless", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "scrapy-s3pipeline", "package_url": "https://pypi.org/project/scrapy-s3pipeline/", "platform": "", "project_url": "https://pypi.org/project/scrapy-s3pipeline/", "project_urls": { "Homepage": "https://github.com/orangain/scrapy-s3pipeline" }, "release_url": "https://pypi.org/project/scrapy-s3pipeline/0.3.0/", "requires_dist": [ "Scrapy (>=1.1)", "boto3" ], "requires_python": "", "summary": "Scrapy pipeline to store chunked items into AWS S3 bucket", "version": "0.3.0" }, "last_serial": 5167031, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "369b49e9a1de10f369ac7eb9d7e3b629", "sha256": "ce56be19555a59c01c2ac839a20a82aa39519d8fd7ad994e8b4af82a9bedeed7" }, "downloads": -1, "filename": "scrapy_s3pipeline-0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "369b49e9a1de10f369ac7eb9d7e3b629", 
"packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 4353, "upload_time": "2017-12-03T06:10:19", "url": "https://files.pythonhosted.org/packages/f2/11/36d4702decd2c2ace734080da0ee98b71c5bf8f3074ba5f74bf95773b7dc/scrapy_s3pipeline-0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "7ca6a2b75dbbd2877309f609806f8bd9", "sha256": "0c9e72dee1103229d1b0d424fe5829655edfd73e42c88a1e1940fe93bad8c418" }, "downloads": -1, "filename": "scrapy-s3pipeline-0.1.tar.gz", "has_sig": false, "md5_digest": "7ca6a2b75dbbd2877309f609806f8bd9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2681, "upload_time": "2017-12-03T06:10:20", "url": "https://files.pythonhosted.org/packages/e8/58/c37dda4f500a9eed6f1dd635149551bf122796f8a4f10b4d891d00399dd0/scrapy-s3pipeline-0.1.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "1636fd39cf8ed42a1f470dda16e0594a", "sha256": "b5aacf02e41fb4cced0511c9471c1b70a7bb8dd994bc5757e93a521d7ed4e105" }, "downloads": -1, "filename": "scrapy_s3pipeline-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "1636fd39cf8ed42a1f470dda16e0594a", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 4469, "upload_time": "2017-12-03T13:53:52", "url": "https://files.pythonhosted.org/packages/8a/a2/97e22056b68f89c12d18269ae492dd83b91e06eb5a8478491b68c0a7dd8c/scrapy_s3pipeline-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "805e02e35f7e9af7d142b1fa3e83a580", "sha256": "c29bfddf58892348a763f7fa83d05d5ad63e0c9022d1b3245c9081227b59975c" }, "downloads": -1, "filename": "scrapy-s3pipeline-0.1.1.tar.gz", "has_sig": false, "md5_digest": "805e02e35f7e9af7d142b1fa3e83a580", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2742, "upload_time": "2017-12-03T13:53:54", "url": 
"https://files.pythonhosted.org/packages/58/e5/487492036772c1046c8f10ca9c21eb1c81857f0e9d994be2ac8c5f195ceb/scrapy-s3pipeline-0.1.1.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "4e5b83557879c41eb9651d7d54efe050", "sha256": "0cca2aae19f8372bde92a4a362a726f76c0bc86c4e2c1fcfd8c04d25c6aa8a07" }, "downloads": -1, "filename": "scrapy_s3pipeline-0.2.0-py3-none-any.whl", "has_sig": false, "md5_digest": "4e5b83557879c41eb9651d7d54efe050", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 4465, "upload_time": "2017-12-05T15:05:22", "url": "https://files.pythonhosted.org/packages/84/7e/822b18ef71748128d9917fd7290064b76d5e1057606793a356202db25eb4/scrapy_s3pipeline-0.2.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a525690c2016a96e7b9227796017bd6f", "sha256": "75a62c5739978b68db117787030b429a525c86741cb14518a718096cb4a9b95b" }, "downloads": -1, "filename": "scrapy-s3pipeline-0.2.0.tar.gz", "has_sig": false, "md5_digest": "a525690c2016a96e7b9227796017bd6f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2744, "upload_time": "2017-12-05T15:05:27", "url": "https://files.pythonhosted.org/packages/b3/80/3cfa7e56865359ec7f6be62c98ee77c2c763fa44356a1130f3cbb53258be/scrapy-s3pipeline-0.2.0.tar.gz" } ], "0.3.0": [ { "comment_text": "", "digests": { "md5": "a1b786b5a8c3dcb2cb1d30b4f7ac1df1", "sha256": "1b6edf080a57f7f63cd364be281a22c156f6779a3c6b0dcf6bb100eaeb0298bb" }, "downloads": -1, "filename": "scrapy_s3pipeline-0.3.0-py3-none-any.whl", "has_sig": false, "md5_digest": "a1b786b5a8c3dcb2cb1d30b4f7ac1df1", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6028, "upload_time": "2019-04-20T05:28:07", "url": "https://files.pythonhosted.org/packages/f9/44/f35bb31fd761ad011c29b8e2064586c0b8e59a7208924478c41a6c1315a0/scrapy_s3pipeline-0.3.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "55535539c0e2466b4f6d23a5ad8c1259", 
"sha256": "6b505538ef922dd8848c7ab5b2f2161795deb998e1c21e068e6c0dc81887e23e" }, "downloads": -1, "filename": "scrapy-s3pipeline-0.3.0.tar.gz", "has_sig": false, "md5_digest": "55535539c0e2466b4f6d23a5ad8c1259", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4656, "upload_time": "2019-04-20T05:28:09", "url": "https://files.pythonhosted.org/packages/64/73/0f731c30a891ff63ce922c0ccbcf1e0a433c6319eb84cbceecb710754867/scrapy-s3pipeline-0.3.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "a1b786b5a8c3dcb2cb1d30b4f7ac1df1", "sha256": "1b6edf080a57f7f63cd364be281a22c156f6779a3c6b0dcf6bb100eaeb0298bb" }, "downloads": -1, "filename": "scrapy_s3pipeline-0.3.0-py3-none-any.whl", "has_sig": false, "md5_digest": "a1b786b5a8c3dcb2cb1d30b4f7ac1df1", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6028, "upload_time": "2019-04-20T05:28:07", "url": "https://files.pythonhosted.org/packages/f9/44/f35bb31fd761ad011c29b8e2064586c0b8e59a7208924478c41a6c1315a0/scrapy_s3pipeline-0.3.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "55535539c0e2466b4f6d23a5ad8c1259", "sha256": "6b505538ef922dd8848c7ab5b2f2161795deb998e1c21e068e6c0dc81887e23e" }, "downloads": -1, "filename": "scrapy-s3pipeline-0.3.0.tar.gz", "has_sig": false, "md5_digest": "55535539c0e2466b4f6d23a5ad8c1259", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4656, "upload_time": "2019-04-20T05:28:09", "url": "https://files.pythonhosted.org/packages/64/73/0f731c30a891ff63ce922c0ccbcf1e0a433c6319eb84cbceecb710754867/scrapy-s3pipeline-0.3.0.tar.gz" } ] }