{ "info": { "author": "J. Fernando Sanchez", "author_email": "balkian@gmail.com", "bugtrack_url": null, "classifiers": [ "Programming Language :: Python :: 3" ], "description": "# GSICrawler\n\nGSICrawler is a service that extracts information from several sources, such as Twitter, Facebook and news outlets.\n\nGSICrawler uses these services under the hood:\n\n* The HTTP API for the scrapers/tasks (web). **This is the public-facing part, the one with which you will interact as a user.**\n* A frontend for celery (flower)\n* A backend that takes care of the tasks (celery)\n* A broker for the celery backend (redis)\n\nThere are several scrapers available, and each accepts a different set of parameters (e.g. a query, a maximum number of results, etc.).\nThe results of any scraper can be returned in JSON format, or stored in an elasticsearch server.\nSome results will take a long time to process.\nIf that is the case, the API will return information about the running task, so you can query the service for the result later.\nPlease read the API specification for your scraper of interest.\n\n\nExample:\n\n```\n# Scrape NYTimes for articles containing \"terror\", and store them in an elasticsearch endpoint (`http://elasticsearch:9200/crawler/news`).\n$ curl -X GET --header 'Accept: application/json' 'http://0.0.0.0:5000/api/v1/scrapers/nyt/?query=terror&number=5&output=elasticsearch&esendpoint=elasticsearch&index=crawler&doctype=news'\n\n{\n \"parameters\": {\n \"number\": 5,\n \"output\": \"elasticsearch\",\n \"query\": \"terror\"\n },\n \"source\": \"NYTimes\",\n \"status\": \"PENDING\",\n \"task_id\": \"bf5dd994-9860-4c63-975e-d09fb85a463c\"\n}\n\n\n# Check the status of the task\n$ curl --header 'Accept: application/json' 'http://0.0.0.0:5000/api/v1/tasks/bf5dd994-9860-4c63-975e-d09fb85a463c' \n\n{\n \"results\": \"Check your results at: elasticsearch/crawler/_search\",\n \"status\": \"SUCCESS\",\n \"task_id\": \"bf5dd994-9860-4c63-975e-d09fb85a463c\"\n}\n```\n\n## Instructions\n\nSome of the crawlers 
require API keys and secrets to work.\nYou can configure the services locally with a `.env` file in this directory.\nIt should look like this:\n\n```\nTWITTER_ACCESS_TOKEN=\nTWITTER_ACCESS_TOKEN_SECRET=\nTWITTER_CONSUMER_KEY=\nTWITTER_CONSUMER_SECRET=\nFACEBOOK_APP_ID=\nFACEBOOK_APP_SECRET=\nNEWS_API_KEY=\nNY_TIMES_API_KEY=\n```\n\nOnce the environment variables are in place, run:\n\n```\ndocker compose up\n```\n\nThis will start all the necessary services, with the default configuration.\nAdditionally, it will deploy an elasticsearch instance, which can be used to store the results of the crawler.\n\n\nYou can test the service in your browser, using the OpenAPI dashboard: http://localhost:5000/\n\n\n\n## Scaling and distribution \n\nFor ease of deployment, the GSICrawler docker image runs three services in a single container (web, flower and celery backend).\nHowever, this behavior can be changed by using a different command (by default, it's `all`) and setting the appropriate environment variables:\n\n```\nGSICRAWLER_BROKER=redis://localhost:6379\nGSICRAWLER_RESULT_BACKEND=db+sqlite:///usr/src/app/results.db\n# If results_backend is missing, GSICRAWLER_BROKER will be used\n```\n\n\n## Developing new scrapers\n\nAs of this writing, to add a new scraper to GSICrawler you need to:\n\n* Develop the scraping function\n* Add a task to the `gsicrawler/tasks.py` file\n* Add the task to the controller (`gsicrawler/controllers/tasks.py`)\n* Add the new endpoint to the API (`gsicrawler-api.yaml`).\n* If you are using environment variables (e.g. for an API key), add them to your `.env` file.\n\nIf you are also deploying this with CI/CD and/or Kubernetes:\n\n* Add any new environment variables to the deployment file (`k8s/gsicrawler-deployment.yaml.tmpl`)\n* Add the variables to your CI/CD environment (e.g. 
https://docs.gitlab.com/ee/ci/variables/)\n\n## Troubleshooting\n\nElasticsearch may crash on startup and complain about vm.max_map_count.\nThis will solve it temporarily, until the next boot:\n\n```\nsudo sysctl -w vm.max_map_count=262144 \n```\n\nIf you want to make this permanent, set the value in your `/etc/sysctl.conf`.", "description_content_type": "text/markdown", "docs_url": null, "download_url": "https://github.com/gsi-upm/gsicrawler/archive/0.2.0.tar.gz", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/gsi-upm/gsicrawler", "keywords": "scraping", "license": "", "maintainer": "", "maintainer_email": "", "name": "gsicrawler", "package_url": "https://pypi.org/project/gsicrawler/", "platform": "", "project_url": "https://pypi.org/project/gsicrawler/", "project_urls": { "Download": "https://github.com/gsi-upm/gsicrawler/archive/0.2.0.tar.gz", "Homepage": "https://github.com/gsi-upm/gsicrawler" }, "release_url": "https://pypi.org/project/gsicrawler/0.2.0/", "requires_dist": null, "requires_python": ">3.3", "summary": "Scrapers and web interface.", "version": "0.2.0" }, "last_serial": 5924357, "releases": { "0.2.0": [ { "comment_text": "", "digests": { "md5": "90a6bb4d6a6402210d5fb98d12a4d259", "sha256": "2fcc9905fad2c1e564e3017368b8aa864629e95d0a1d1c6dba2ef6c9bb221b0c" }, "downloads": -1, "filename": "gsicrawler-0.2.0.tar.gz", "has_sig": false, "md5_digest": "90a6bb4d6a6402210d5fb98d12a4d259", "packagetype": "sdist", "python_version": "source", "requires_python": ">3.3", "size": 5555, "upload_time": "2019-10-03T16:42:31", "url": "https://files.pythonhosted.org/packages/19/41/fd4eec1c5690523b127fc40c0aad35840ae8945628dd1d9085adfa72e247/gsicrawler-0.2.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "90a6bb4d6a6402210d5fb98d12a4d259", "sha256": "2fcc9905fad2c1e564e3017368b8aa864629e95d0a1d1c6dba2ef6c9bb221b0c" }, "downloads": -1, "filename": "gsicrawler-0.2.0.tar.gz", "has_sig": false, 
"md5_digest": "90a6bb4d6a6402210d5fb98d12a4d259", "packagetype": "sdist", "python_version": "source", "requires_python": ">3.3", "size": 5555, "upload_time": "2019-10-03T16:42:31", "url": "https://files.pythonhosted.org/packages/19/41/fd4eec1c5690523b127fc40c0aad35840ae8945628dd1d9085adfa72e247/gsicrawler-0.2.0.tar.gz" } ] }