{ "info": { "author": "Mikhail Korobov", "author_email": "kmike84@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Framework :: Scrapy", "Intended Audience :: Developers", "License :: OSI Approved :: BSD License", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Topic :: Internet :: WWW/HTTP", "Topic :: Software Development :: Libraries :: Application Frameworks", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "=========================================================\nScrapyJS - Scrapy & JavaScript integration through Splash\n=========================================================\n\n.. image:: https://travis-ci.org/scrapy-plugins/scrapy-splash.svg?branch=master\n :target: http://travis-ci.org/scrapy-plugins/scrapy-splash\n\nThis library provides Scrapy_ and JavaScript integration using Splash_.\nThe license is BSD 3-clause.\n\n.. _Scrapy: https://github.com/scrapy/scrapy\n.. _Splash: https://github.com/scrapinghub/splash\n\nInstallation\n============\n\nInstall ScrapyJS using pip::\n\n $ pip install scrapyjs\n\nScrapyJS uses Splash_ HTTP API, so you also need a Splash instance.\nUsually to install & run Splash, something like this is enough::\n\n $ docker run -p 8050:8050 scrapinghub/splash\n\nCheck Splash `install docs`_ for more info.\n\n.. _install docs: http://splash.readthedocs.org/en/latest/install.html\n\n\nConfiguration\n=============\n\n1. Add the Splash server address to ``settings.py`` of your Scrapy project like this::\n\n SPLASH_URL = 'http://192.168.59.103:8050'\n\n2. Enable the Splash middleware by adding it to ``DOWNLOADER_MIDDLEWARES`` in your ``settings.py`` file::\n\n DOWNLOADER_MIDDLEWARES = {\n 'scrapyjs.SplashMiddleware': 725,\n }\n\n.. note::\n\n Order `725` is just before `HttpProxyMiddleware` (750) in default\n scrapy settings.\n\n3. Set a custom ``DUPEFILTER_CLASS``::\n\n DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'\n\n4. If you use Scrapy HTTP cache then a custom cache storage backend is required.\n ScrapyJS provides a subclass of ``scrapy.contrib.httpcache.FilesystemCacheStorage``::\n\n HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'\n\n If you use other cache storage then it is necesary to subclass it and\n replace all ``scrapy.util.request.request_fingerprint`` calls with\n ``scrapyjs.splash_request_fingerprint``.\n\n.. note::\n\n Steps (3) and (4) are necessary because Scrapy doesn't provide a way\n to override request fingerprints calculation algorithm globally; this\n could change in future.\n\n\nUsage\n=====\n\nTo render the requests with Splash, use the ``'splash'`` Request `meta` key::\n\n yield Request(url, self.parse_result, meta={\n 'splash': {\n 'args': {\n # set rendering arguments here\n 'html': 1,\n 'png': 1,\n\n # 'url' is prefilled from request url\n },\n\n # optional parameters\n 'endpoint': 'render.json', # optional; default is render.json\n 'splash_url': '', # overrides SPLASH_URL\n 'slot_policy': scrapyjs.SlotPolicy.PER_DOMAIN,\n }\n })\n\n* ``meta['splash']['args']`` contains arguments sent to Splash.\n ScrapyJS adds request.url to these arguments automatically.\n\n Note that by default Scrapy escapes URL fragments using AJAX escaping scheme.\n If you want to pass a URL with a fragment to Splash then set ``url``\n in ``args`` dict manually.\n\n* ``meta['splash']['endpoint']`` is the Splash endpoint to use. By default\n `render.json `_\n is used.\n\n See Splash `HTTP API docs`_ for a full list of available endpoints\n and parameters.\n\n.. _HTTP API docs: http://splash.readthedocs.org/en/latest/api.html\n\n* ``meta['splash']['splash_url']`` overrides the Splash URL set\n in ``settings.py``.\n\n* ``meta['splash']['slot_policy']`` customize how\n concurrency & politeness are maintained for Splash requests.\n\n Currently there are 3 policies available:\n\n 1. ``scrapyjs.SlotPolicy.PER_DOMAIN`` (default) - send Splash requests to\n downloader slots based on URL being rendered. It is useful if you want\n to maintain per-domain politeness & concurrency settings.\n\n 2. ``scrapyjs.SlotPolicy.SINGLE_SLOT`` - send all Splash requests to\n a single downloader slot. It is useful if you want to throttle requests\n to Splash.\n\n 3. ``scrapyjs.SlotPolicy.SCRAPY_DEFAULT`` - don't do anything with slots.\n It is similar to ``SINGLE_SLOT`` policy, but can be different if you access\n other services on the same address as Splash.\n\n\nExamples\n========\n\nGet HTML contents::\n\n import scrapy\n\n class MySpider(scrapy.Spider):\n start_urls = [\"http://example.com\", \"http://example.com/foo\"]\n\n def start_requests(self):\n for url in self.start_urls:\n yield scrapy.Request(url, self.parse, meta={\n 'splash': {\n 'endpoint': 'render.html',\n 'args': {'wait': 0.5}\n }\n })\n\n def parse(self, response):\n # response.body is a result of render.html call; it\n # contains HTML processed by a browser.\n # ...\n\nGet HTML contents and a screenshot::\n\n import json\n import base64\n import scrapy\n\n class MySpider(scrapy.Spider):\n\n # ...\n yield scrapy.Request(url, self.parse_result, meta={\n 'splash': {\n 'args': {\n 'html': 1,\n 'png': 1,\n 'width': 600,\n 'render_all': 1,\n }\n }\n })\n\n # ...\n def parse_result(self, response):\n data = json.loads(response.body_as_unicode())\n body = data['html']\n png_bytes = base64.b64decode(data['png'])\n # ...\n\nRun a simple `Splash Lua Script`_::\n\n import json\n import base64\n\n class MySpider(scrapy.Spider):\n\n # ...\n script = \"\"\"\n function main(splash)\n assert(splash:go(splash.args.url))\n return splash:evaljs(\"document.title\")\n end\n \"\"\"\n yield scrapy.Request(url, self.parse_result, meta={\n 'splash': {\n 'args': {'lua_source': script},\n 'endpoint': 'execute',\n }\n })\n\n # ...\n def parse_response(self, response):\n doc_title = response.body_as_unicode()\n # ...\n\n\n.. _Splash Lua Script: http://splash.readthedocs.org/en/latest/scripting-tutorial.html\n\n\nHTTP Basic Auth\n===============\n\nIf you need HTTP Basic Authentication to access Splash, use\nScrapy's HttpAuthMiddleware_.\n\n.. _HttpAuthMiddleware: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpauth\n\nWhy not use the Splash HTTP API directly?\n=========================================\n\nThe obvious alternative to ScrapyJS would be to send requests directly\nto the Splash `HTTP API`_. Take a look at the example below and make\nsure to read the observations after it::\n\n import json\n\n import scrapy\n from scrapy.http.headers import Headers\n\n RENDER_HTML_URL = \"http://127.0.0.1:8050/render.html\"\n\n class MySpider(scrapy.Spider):\n start_urls = [\"http://example.com\", \"http://example.com/foo\"]\n\n def start_requests(self):\n for url in self.start_urls:\n body = json.dumps({\"url\": url, \"wait\": 0.5})\n headers = Headers({'Content-Type': 'application/json'})\n yield scrapy.Request(RENDER_HTML_URL, self.parse, method=\"POST\",\n body=body, headers=headers)\n\n def parse(self, response):\n # response.body is a result of render.html call; it\n # contains HTML processed by a browser.\n # ...\n\n\nIt works and is easy enough, but there are some issues that you should be\naware of:\n\n1. There is a bit of boilerplate.\n\n2. As seen by Scrapy, we're sending requests to ``RENDER_HTML_URL`` instead\n of the target URLs. It affects concurrency and politeness settings:\n ``CONCURRENT_REQUESTS_PER_DOMAIN``, ``DOWNLOAD_DELAY``, etc could behave\n in unexpected ways since delays and concurrency settings are no longer\n per-domain.\n\n3. Some options depend on each other - for example, if you use timeout_\n Splash option then you may want to set ``download_timeout``\n scrapy.Request meta key as well.\n\nScrapyJS utlities allow to handle such edge cases and reduce the boilerplate.\n\n.. _HTTP API: http://splash.readthedocs.org/en/latest/api.html\n.. _timeout: http://splash.readthedocs.org/en/latest/api.html#arg-timeout\n\n\n\nContributing\n============\n\nSource code and bug tracker are on github:\nhttps://github.com/scrapy-plugins/scrapy-splash\n\nTo run tests, install \"tox\" Python package and then run ``tox`` command\nfrom the source checkout.\n\n\nChanges\n=======\n\n0.2 (2016-03-26)\n----------------\n\n* Scrapy 1.0 and 1.1 support;\n* Python 3 support;\n* documentation improvements;\n* project is moved to https://github.com/scrapy-plugins/scrapy-splash.\n\n0.1.1 (2015-03-16)\n------------------\n\nFixed fingerprint calculation for non-string meta values.\n\n0.1 (2015-02-28)\n----------------\n\nInitial release", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/scrapy-plugins/scrapy-splash", "keywords": null, "license": "BSD", "maintainer": null, "maintainer_email": null, "name": "scrapyjs", "package_url": "https://pypi.org/project/scrapyjs/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/scrapyjs/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/scrapy-plugins/scrapy-splash" }, "release_url": "https://pypi.org/project/scrapyjs/0.2/", "requires_dist": null, "requires_python": null, "summary": "JavaScript support for Scrapy using Splash", "version": "0.2" }, "last_serial": 2027669, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "b5d4023cba74d6f8c6e53aa5f6ac04f0", "sha256": "f741d93b4a3c0e935ec7a620634b0edd012e70ba06e38f32a3da41c359ed0d0e" }, "downloads": -1, "filename": "scrapyjs-0.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "b5d4023cba74d6f8c6e53aa5f6ac04f0", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 11904, "upload_time": "2015-02-27T21:34:54", "url": "https://files.pythonhosted.org/packages/ee/29/98ec11ecf609bb36466eda328683b4571abd58b74b63620aa9864e244d53/scrapyjs-0.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "be69eb7c5a00b55e9bdc58061ade4435", "sha256": "ed3d672a48e9cd2ebabafc5790006e3f03cab4535e4ea5f5327672ccd7769595" }, "downloads": -1, "filename": "scrapyjs-0.1.tar.gz", "has_sig": false, "md5_digest": "be69eb7c5a00b55e9bdc58061ade4435", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11020, "upload_time": "2015-02-27T21:34:41", "url": "https://files.pythonhosted.org/packages/3b/f8/9211bc99be8d5ebe2dddb517e44b7f13693ceb6b681d20cd725d4d8a4b66/scrapyjs-0.1.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "53c7ed66a97f2dcddc034ed52122b324", "sha256": "040f7c0116ae40c98156e49c18113b54ab7f8272ffbb8488f03d12cac32e7219" }, "downloads": -1, "filename": "scrapyjs-0.1.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "53c7ed66a97f2dcddc034ed52122b324", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 12031, "upload_time": "2015-03-16T00:45:36", "url": "https://files.pythonhosted.org/packages/62/cb/9543087e4aa37276fd2e48d804771274ab185059a62b009f733251fb4157/scrapyjs-0.1.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "3823769c6a6d6f8f9ee6209a7b585853", "sha256": "83bff510da818e4101fbf10f44ddb029971c482d28c22c78d67d493038688546" }, "downloads": -1, "filename": "scrapyjs-0.1.1.tar.gz", "has_sig": false, "md5_digest": "3823769c6a6d6f8f9ee6209a7b585853", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11291, "upload_time": "2015-03-16T00:45:27", "url": "https://files.pythonhosted.org/packages/c9/9d/79a97adeaf5a919384f1586d5a610307a874dd2e009193b33fd3e8be8c39/scrapyjs-0.1.1.tar.gz" } ], "0.2": [ { "comment_text": "", "digests": { "md5": "59912f0495e92811212db93ac66a70d1", "sha256": "2c0000d40d6c081360757c578fd70fa997f8b548b8083d06f6b2eca94935d150" }, "downloads": -1, "filename": "scrapyjs-0.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "59912f0495e92811212db93ac66a70d1", "packagetype": "bdist_wheel", "python_version": "3.5", "requires_python": null, "size": 13321, "upload_time": "2016-03-25T23:08:00", "url": "https://files.pythonhosted.org/packages/1a/5f/dcfb268903063cc8265e86f211e3e2d03cb05fe37bf8e6531fa6f4a8921f/scrapyjs-0.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "b6d9c3d1f20a461543e4d0d871d82572", "sha256": "1833ea816daaa91c362c7d1c86d8f97269f70cbfdf783b2b946378936e944057" }, "downloads": -1, "filename": "scrapyjs-0.2.tar.gz", "has_sig": false, "md5_digest": "b6d9c3d1f20a461543e4d0d871d82572", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12710, "upload_time": "2016-03-25T23:08:33", "url": "https://files.pythonhosted.org/packages/89/73/15a3cb81e96438a2e7ced2a1938a9dd9b798d633e9c7c84f1d31dd6e8812/scrapyjs-0.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "59912f0495e92811212db93ac66a70d1", "sha256": "2c0000d40d6c081360757c578fd70fa997f8b548b8083d06f6b2eca94935d150" }, "downloads": -1, "filename": "scrapyjs-0.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "59912f0495e92811212db93ac66a70d1", "packagetype": "bdist_wheel", "python_version": "3.5", "requires_python": null, "size": 13321, "upload_time": "2016-03-25T23:08:00", "url": "https://files.pythonhosted.org/packages/1a/5f/dcfb268903063cc8265e86f211e3e2d03cb05fe37bf8e6531fa6f4a8921f/scrapyjs-0.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "b6d9c3d1f20a461543e4d0d871d82572", "sha256": "1833ea816daaa91c362c7d1c86d8f97269f70cbfdf783b2b946378936e944057" }, "downloads": -1, "filename": "scrapyjs-0.2.tar.gz", "has_sig": false, "md5_digest": "b6d9c3d1f20a461543e4d0d871d82572", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12710, "upload_time": "2016-03-25T23:08:33", "url": "https://files.pythonhosted.org/packages/89/73/15a3cb81e96438a2e7ced2a1938a9dd9b798d633e9c7c84f1d31dd6e8812/scrapyjs-0.2.tar.gz" } ] }