{ "info": { "author": "Dusan Madar", "author_email": "madar.dusan@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Environment :: Web Environment", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3.6", "Topic :: Internet :: WWW/HTTP", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "[![Build Status](https://travis-ci.org/DusanMadar/ScrapeMeAgain.svg?branch=master)](https://travis-ci.org/DusanMadar/ScrapeMeAgain)\n[![Coverage Status](https://coveralls.io/repos/github/DusanMadar/ScrapeMeAgain/badge.svg?branch=master)](https://coveralls.io/github/DusanMadar/ScrapeMeAgain?branch=master)\n[![PyPI version](https://badge.fury.io/py/scrapemeagain.svg)](https://badge.fury.io/py/scrapemeagain)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black)\n\n# ScrapeMeAgain\n\nScrapeMeAgain is a Python 3 powered web scraper. It uses multithreading ([`ThreadPoolExecutor`](https://docs.python.org/dev/library/concurrent.futures.html#threadpoolexecutor)) and multiprocessing to get the work done quicker and stores collected data in an [SQLite](http://www.sqlite.org/) database.\n\n## Installation\n\n```\npip install scrapemeagain\n```\n\n### System requirements\n\n[Tor](https://www.torproject.org/) in combination with [Privoxy](http://www.privoxy.org/) is used for anonymity (i.e. regular IP address changes).\n\nUsing `Docker` and `Docker Compose` is the preferred (and easier) way to\nuse ScrapeMeAgain.\n\nYou have to manually install and set up `Tor` and `Privoxy` on your system if not using `Docker`. 
For further information about installation and configuration refer to:\n\n- [A step-by-step guide how to use Python with Tor and Privoxy](https://gist.github.com/DusanMadar/8d11026b7ce0bce6a67f7dd87b999f6b)\n- [Crawling anonymously with Tor in Python](http://sacharya.com/crawling-anonymously-with-tor-in-python/) ([alternative link (Gist)](https://gist.github.com/KhepryQuixote/46cf4f3b999d7f658853))\n\n## Usage\n\nYou have to provide your own database table description and an actual scraper class which must follow the `BaseScraper` interface. See `examples/examplescraper` for more details.\n\n### Dockerized\n\nWith Docker it is possible to use multiple Tor IPs at the same time and, unless you abuse it, scrape data faster.\n\nThe easiest way to start is to duplicate `examples/examplescraper` and then update, rename, expand, etc. your scraper and related classes.\n\nYour scraper must define `config.py` and `main_dockerized.py`. These two names are hardcoded throughout the codebase.\n\n`scrapemeagain-compose.py` dynamically creates a `docker-compose.yml` which orchestrates scraper instances. The idea is that the first scraper (`scp1`) is a `master` scraper and hence is the host for a couple of helper services which communicate over HTTP (see [`dockerized/apps`](https://github.com/DusanMadar/ScrapeMeAgain/tree/master/scrapemeagain/dockerized/apps)).\n\n1. Get Docker host IP\n\n```\nip addr show docker0\n```\n\n**NOTE** Your Docker interface name may be different from _docker0_.\n\n2. Run `examplesite` on Docker host IP\n\n```\npython3 examples/examplescraper/examplesite/app.py 172.17.0.1\n```\n\n**NOTE** Your Docker host IP may be different from _172.17.0.1_.\n\n3. Start `docker-compose`\n\n```\nscrapemeagain-compose.py -s $(pwd)/examples/examplescraper -c tests.integration.fake_config | docker-compose -f - up\n```\n\n**NOTE** A special config file path is provided: `-c tests.integration.fake_config`. This is _required only for test/demo purposes_. 
You don't have to provide specific config for a real/production scraper.\n\n### Local\n\n1. Run `examplesite`\n\n```\npython3 examples/examplescraper/examplesite/app.py\n```\n\n2. Start `examplescraper`\n\n```\npython3 examples/examplescraper/main.py\n```\n\n**NOTE** You may need to update your `PYTHONPATH`, e.g. `export PYTHONPATH=$PYTHONPATH:$(pwd)/examples`.\n\n## Development\n\nTo simplify running integration tests with latest changes:\n\n- replace `image: dusanmadar/scrapemeagain:x.y.z` with `image: scp:latest`\n in the `scrapemeagain/dockerized/docker-compose.yml` template\n\n- and make sure to rebuild the image locally before running tests, e.g.\n\n```\ndocker build . -t scp:latest; python -m unittest discover -p test_integration.py\n```\n\n## Legacy\n\nThe Python 2.7 version of ScrapeMeAgain, which also provides geocoding capabilities, is available under the `legacy` branch and is no longer maintained.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/DusanMadar/ScrapeMeAgain", "keywords": "web scraper", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "scrapemeagain", "package_url": "https://pypi.org/project/scrapemeagain/", "platform": "linux", "project_url": "https://pypi.org/project/scrapemeagain/", "project_urls": { "Homepage": "https://github.com/DusanMadar/ScrapeMeAgain" }, "release_url": "https://pypi.org/project/scrapemeagain/1.0.5/", "requires_dist": [ "beautifulsoup4", "requests", "sqlalchemy", "stem", "flask", "toripchanger" ], "requires_python": ">=3.6", "summary": "Yet another Python web scraping application", "version": "1.0.5" }, "last_serial": 5167588, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "fcf6d7ba022afce66f07a53141150fa3", "sha256": "1d0e1da573f451e5cbf11e8622d1e40e2de4f91d0f518f029d2bd1ac09da02de" }, "downloads": -1, "filename": 
"scrapemeagain-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "fcf6d7ba022afce66f07a53141150fa3", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 31035, "upload_time": "2018-12-09T12:37:01", "url": "https://files.pythonhosted.org/packages/8e/9f/ced7f079a6b985a13b3ef78d7bd8807b781a5a69fa6ce6e241bca451e71c/scrapemeagain-1.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "eeb702d7e318d3390403a7a5ea1a0709", "sha256": "c1620d69c8d3fed4cb2be92f757d7cd45a1f3977a0a673c60fce4b1da7c32a64" }, "downloads": -1, "filename": "scrapemeagain-1.0.0.tar.gz", "has_sig": false, "md5_digest": "eeb702d7e318d3390403a7a5ea1a0709", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22082, "upload_time": "2018-12-09T12:37:03", "url": "https://files.pythonhosted.org/packages/35/16/48520a717d0a4aa2b92d75c5967da85f24dfe46bcd8263af70ea0dd4b6e0/scrapemeagain-1.0.0.tar.gz" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "0ff28063f1999d7ba1bd44a8876380ad", "sha256": "2ba914abcbfe7a1e2eb0b994ea237389f42bd50bdbd1b3bb4641cd230969977a" }, "downloads": -1, "filename": "scrapemeagain-1.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "0ff28063f1999d7ba1bd44a8876380ad", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 31169, "upload_time": "2018-12-09T20:35:50", "url": "https://files.pythonhosted.org/packages/97/7c/b34d9f5cb897b509057b96671b130a364651ca4b45f005e5403532daddb2/scrapemeagain-1.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "3b6551370ca78095c29cf7b0a1865ceb", "sha256": "ab7806c584023f0d77cca3c5cf4e148a7b86c8a1e5a666a1568972d466c748ac" }, "downloads": -1, "filename": "scrapemeagain-1.0.1.tar.gz", "has_sig": false, "md5_digest": "3b6551370ca78095c29cf7b0a1865ceb", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 22207, "upload_time": "2018-12-09T20:35:52", "url": 
"https://files.pythonhosted.org/packages/99/6e/fb1b2c469e99701ee3412f80d17e01f21ae0d464e13686a05c05924edfca/scrapemeagain-1.0.1.tar.gz" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "b0775b48f1ae5c748f181410b4e201f9", "sha256": "79d56c57f59b73385fd4a9479071ee17f0546f16d0beaa4eb41a7bb86ca6e776" }, "downloads": -1, "filename": "scrapemeagain-1.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "b0775b48f1ae5c748f181410b4e201f9", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 32692, "upload_time": "2018-12-15T10:54:36", "url": "https://files.pythonhosted.org/packages/40/76/be8ac876d9e5d1716bbb6448ec6129ede1241a5b94c68df36de12816c576/scrapemeagain-1.0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "687949bba91db379c44d990f4129c9a3", "sha256": "9168f7550cc6016d479782d28731f9878478d08375b6b090f42a212c54f1dddc" }, "downloads": -1, "filename": "scrapemeagain-1.0.2.tar.gz", "has_sig": false, "md5_digest": "687949bba91db379c44d990f4129c9a3", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 25211, "upload_time": "2018-12-15T10:54:38", "url": "https://files.pythonhosted.org/packages/ba/d1/2d3eb6eb45c799aae872b1725f1eeec54e9aecced5f3d37914b1626ffbb8/scrapemeagain-1.0.2.tar.gz" } ], "1.0.3": [ { "comment_text": "", "digests": { "md5": "1bb1c6691ebb2b3a292883a61f4ed01f", "sha256": "76a8f7d415942de9c5798ae3999815f5fcbc29508d74e562df963f0ea59ac27a" }, "downloads": -1, "filename": "scrapemeagain-1.0.3-py3-none-any.whl", "has_sig": false, "md5_digest": "1bb1c6691ebb2b3a292883a61f4ed01f", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 33016, "upload_time": "2019-02-17T14:40:29", "url": "https://files.pythonhosted.org/packages/51/4d/0c94dbbc51a1bd7c6ec2a97748d8318a60de732016fead2f48182ed5461b/scrapemeagain-1.0.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c1d7d17cfb9b71c416e75cb37b18b897", "sha256": 
"ae4ae7734de576af41a8da1cb234455c560312a25888c9cce2532b2359fffc62" }, "downloads": -1, "filename": "scrapemeagain-1.0.3.tar.gz", "has_sig": false, "md5_digest": "c1d7d17cfb9b71c416e75cb37b18b897", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 27210, "upload_time": "2019-02-17T14:31:41", "url": "https://files.pythonhosted.org/packages/24/02/6b50c4acabea0d9928e90c9f6e1d3df9e698e750a35096aae3ea438128a1/scrapemeagain-1.0.3.tar.gz" } ], "1.0.4": [ { "comment_text": "", "digests": { "md5": "c155c0d375366e505acd304f9ab3a2ac", "sha256": "5a8df6eac5935a703468a14095e1d720f80f85570dfc1cf1d01b27c1c0335b36" }, "downloads": -1, "filename": "scrapemeagain-1.0.4-py3-none-any.whl", "has_sig": false, "md5_digest": "c155c0d375366e505acd304f9ab3a2ac", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 35290, "upload_time": "2019-03-23T19:03:52", "url": "https://files.pythonhosted.org/packages/21/c2/eaaba4acd29fdb0e4063c32fe2f0bbf15fb7c4a5e35b2893e39e66a8f7b3/scrapemeagain-1.0.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "5826efe96e58208fab0d71928230652f", "sha256": "01f847b18d8cd04c7139ff59b2ed35ce0348d2b1d4a302fdd09dd501e5e8811b" }, "downloads": -1, "filename": "scrapemeagain-1.0.4.tar.gz", "has_sig": false, "md5_digest": "5826efe96e58208fab0d71928230652f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 27235, "upload_time": "2019-03-23T19:03:54", "url": "https://files.pythonhosted.org/packages/81/b0/dbb5cc1a4457125d79cde92878a707731a1f08da0c7a6d75922197556cf5/scrapemeagain-1.0.4.tar.gz" } ], "1.0.5": [ { "comment_text": "", "digests": { "md5": "724e87a312f7201242d3fbbea75c1ff8", "sha256": "0072374022b5f87268ea544e86cdd6a6851b836f35e3c24f76d948789ac30226" }, "downloads": -1, "filename": "scrapemeagain-1.0.5-py3-none-any.whl", "has_sig": false, "md5_digest": "724e87a312f7201242d3fbbea75c1ff8", "packagetype": "bdist_wheel", "python_version": 
"py3", "requires_python": ">=3.6", "size": 35417, "upload_time": "2019-04-20T10:31:59", "url": "https://files.pythonhosted.org/packages/ba/9e/feb2e1a4f794a427e5a70c8b34ea3777fcd8abc9c53bf7205e12ac085695/scrapemeagain-1.0.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "938572f125624db38ff3a5293b0a8f50", "sha256": "b5b03965204274fa8b2aa190c9297bbf4e9113f7a6067505fa6f5fa560e15a39" }, "downloads": -1, "filename": "scrapemeagain-1.0.5.tar.gz", "has_sig": false, "md5_digest": "938572f125624db38ff3a5293b0a8f50", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 27421, "upload_time": "2019-04-20T10:32:01", "url": "https://files.pythonhosted.org/packages/bb/9e/8f907a02dd4067de48dca3f929c1239245a0fb3b0a589a6b84adab3de91a/scrapemeagain-1.0.5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "724e87a312f7201242d3fbbea75c1ff8", "sha256": "0072374022b5f87268ea544e86cdd6a6851b836f35e3c24f76d948789ac30226" }, "downloads": -1, "filename": "scrapemeagain-1.0.5-py3-none-any.whl", "has_sig": false, "md5_digest": "724e87a312f7201242d3fbbea75c1ff8", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6", "size": 35417, "upload_time": "2019-04-20T10:31:59", "url": "https://files.pythonhosted.org/packages/ba/9e/feb2e1a4f794a427e5a70c8b34ea3777fcd8abc9c53bf7205e12ac085695/scrapemeagain-1.0.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "938572f125624db38ff3a5293b0a8f50", "sha256": "b5b03965204274fa8b2aa190c9297bbf4e9113f7a6067505fa6f5fa560e15a39" }, "downloads": -1, "filename": "scrapemeagain-1.0.5.tar.gz", "has_sig": false, "md5_digest": "938572f125624db38ff3a5293b0a8f50", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 27421, "upload_time": "2019-04-20T10:32:01", "url": "https://files.pythonhosted.org/packages/bb/9e/8f907a02dd4067de48dca3f929c1239245a0fb3b0a589a6b84adab3de91a/scrapemeagain-1.0.5.tar.gz" } ] }