{ "info": { "author": "Stuart MacKay", "author_email": "smackay@flagstonesoftware.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Web Environment", "Intended Audience :: Science/Research", "License :: OSI Approved :: BSD License", "Natural Language :: English", "Operating System :: OS Independent", "Programming Language :: Python", "Topic :: Internet :: WWW/HTTP" ], "description": "===================\nChecklists_scrapers\n===================\nChecklists_scrapers is a set of web scrapers for downloading records from\non-line databases of observations of birds. Scrapers are available for:\n\n:ebird:\n a database hosted by the Laboratory for Ornithology at Cornell University,\n covering the Americas, Oceania and an increasing number of records for\n European countries.\n\n:worldbirds:\n a network of databases hosted by WorldBirds (BirdLife International),\n with good coverage of countries around the Mediterranean and Africa.\n\nThe eBird API only provides checklist records for up to the past 30 days so\nthe scrapers must be run on a regular basis. They are intended to provide a\ncontinuous update of records and so are ideal for mirroring subsets of the\nrecords available (for a given region for example) so you don't have to\nrepeatedly run reports or submit requests for data from the database hosts.\n\nSo, what is this for?\n---------------------\nChecklists_scrapers was written to aggregate records from different databases\nfor loading into a\n`django-checklists `_\ndatabase. However, since the downloaded checklists are in JSON format the file\nmay be used with any similar database.\n\nThe scrapers (and django-checklists) are current used by the\n`Birding Lisboa `_ news service which covers the\narea around the Tejo estuary, Portugal. All the downloaded checklists are\nloaded into a database which is used to publish the latest news as well as\ngenerate annual reports.\n\nA similar database could be used for any purpose - analysing observations\nfor conservation, environmental management or education. Aggregating the\nobservations from multiple databases with the scrapers makes this task a\nlot easier.\n\nInstalling & Configuring\n------------------------\nChecklists_scrapers is available from PyPI. You can install it with pip or\neasy_install::\n\n pip install checklists_scrapers\n\nThe scrapers are built using the scrapy engine which uses settings, in the same\nway as Django, for configuration and runtime values. The settings file is\nconfigured to initialize its values from environment variables. That makes it\neasy to configure the scrapers, particularly for the most common use-case,\nrunning them from a scheduler such as cron.\n\nThe only required setting is to tell scrapy (the engine used by the scrapers)\nthe path to the settings module::\n\n export SCRAPY_SETTINGS_MODULE=checklists_scrapers.settings\n\nThe remaining settings have sensible defaults so only those that are\ninstallation dependent, such as the mail server used for sending out status\nreports. Here is this script that is used to run the scrapers for Birding\nLisboa from cron::\n\n #!/bin/bash\n\n export SCRAPY_SETTINGS_MODULE=checklists_scrapers.settings\n\n export LOG_LEVEL=INFO\n\n export DOWNLOAD_DIR=/tmp/checklists_scrapers\n\n export MAIL_HOST=mail.example.com\n export MAIL_PORT=25\n export MAIL_USER=\n export MAIL_PASS=\n export MAIL_FROM=scrapers@example.com\n\n export REPORT_RECIPIENTS=admins@example.com\n\n source /home/project/venv/bin/activate\n cd /home/project\n\n scrapy crawl ebird -a region=PT-11\n scrapy crawl ebird -a region=PT-15\n\nThe settings can also be defined when the scrapers are run using the -S\noption::\n\n scrapy crawl ebird -a region=PT-15 -s LOG_LEVEL=DEBUG\n\nHowever this obvious becomes a little cumbersome if more than one or two\nsettings are involved.\n\n scrapy crawl ebird -a region=PT-15 -s DOWNLOAD_DIR=downloads\n\nNOTE: the environment variables use a prefix \"CHECKLISTS\" as a namespace\nto avoid interfering with any other variables. When the setting is defined\nusing the -s option when running the scrapers, this prefix must be dropped::\n\nNOTE: REPORT_RECIPIENTS is a comma-separated list of one or more\nemail addresses. The default value is an empty string so no status reports\nwill be mailed out. However if the LOG_LEVEL is set to 'DEBUG' the status\nreport will be written to the file checklists_scrapers_status.txt in the\nDOWNLOAD_DIR directory.\n\nEverything is now ready to run.\n\nScraping\n--------\nThe arguments passed to the scrapers on the command line specify value such as\nwhich region to download observations from and login details for scrapers\nthat need an account to access the data::\n\n scrapy crawl ebird -a region=PT-11\n\n scrapy crawl worldbirds -a username= -a password= -a country=uk\n\nSee the docs for each spider to get a list of the command line arguments and\nsettings.\n\nIf you have defined the settings for a mail server and the setting\nREPORT_RECIPIENTS then a status report will be sent out each time\nthe scrapers are run. If the LOG_LEVEL is set to 'DEBUG' the report is also\nwritten to the directory where the checklists are downloaded to. The report\ncontains a list of the checklist downloaded along with an errors (complete with\nstack traces) and any warnings::\n\n Scraper: ebird\n Date: 03 Jan 2014\n Time: 11:00\n\n -------------------------\n Checklists downloaded\n -------------------------\n 2013-12-27 09:59, Jardim Botanico da Universidade de Lisboa\n 2013-12-28 10:20, Baia Cascais\n 2013-12-28 13:31, PN Sintra-Cascais--Cabo da Roca\n 2013-12-29 07:45, RN Estuario do Tejo--Vala da Saragossa\n\n ----------\n Errors\n ----------\n URL: http://ebird.org/ebird/view/checklist?subID=S161101101\n Traceback (most recent call last):\n File \"/home/birdinglisboa/venv/local/lib/python2.7/site-packages/twisted/internet/base.py\", line 1201, in mainLoop\n self.runUntilCurrent()\n File \"/home/birdinglisboa/venv/local/lib/python2.7/site-packages/twisted/internet/base.py\", line 824, in runUntilCurrent\n call.func(*call.args, **call.kw)\n File \"/home/birdinglisboa/venv/local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 382, in callback\n self._startRunCallbacks(result)\n File \"/home/birdinglisboa/venv/local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 490, in _startRunCallbacks\n self._runCallbacks()\n --- ---\n File \"/home/birdinglisboa/venv/local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 577, in _runCallbacks\n current.result = callback(current.result, *args, **kw)\n File \"/home/birdinglisboa/venv/local/lib/python2.7/site-packages/checklists_scrapers/spiders/ebird_spider.py\", line 585, in parse_checklist\n checklist = self.merge_checklists(original, update)\n File \"/home/birdinglisboa/venv/local/lib/python2.7/site-packages/checklists_scrapers/spiders/ebird_spider.py\", line 602, in merge_checklists\n original['entries'], update['entries'])\n File \"/home/birdinglisboa/venv/local/lib/python2.7/site-packages/checklists_scrapers/spiders/ebird_spider.py\", line 695, in merge_entries\n if count in key[index]:\n exceptions.TypeError: string indices must be integers\n\n ------------\n Warnings\n ------------\n 2014-01-01 11:55, Parque da Paz\n API: http://ebird.org/ws1.1/data/obs/loc/recent?r=L1127099&detail=full&back=7&includeProvisional=true&fmt=json\n URL: http://ebird.org/ebird/view/checklist?subID=S16160707\n Could not update record from API. There are 2 records that match: species=White Wagtail; count=4.\n\nChecklists downloaded also included the name of the observer, which was removed\nhere for obvious reasons. The stack traces in the Errors section is useful if\nthere is a bug but it is also a first indication that the format of the\ninformation being scraped has changed. In either case report it as an issue and\nit will get fixed.\n\nWarnings are generally informative. Here a warning is generated because the\nchecklist contained two equal counts for White Wagtail in the API records -\nonly the species is reported information on subspecies is dropped. However\nthe subspecies is reported on the checklist web page. That means when the web\npage was scraped it was not possible to distinguish between the two records.\nThe records should be edited to add any useful information such as comments,\nwhich are only available from the web page.\n\nLinks\n#####\n\n* Documentation: http://checklists_scrapers.readthedocs.org/\n* Repository: https://github.com/StuartMacKay/checklists_scrapers/\n* Package: https://pypi.python.org/pypi/checklists_scrapers/\n* Buildbot: http://travis-ci.org/#!/StuartMacKay/checklists_scrapers/\n\n.. image:: https://secure.travis-ci.org/StuartMacKay/checklists.png?branch=master\n :target: http://travis-ci.org/StuartMacKay/checklists_scrapers/\n\n\nLicence\n#######\nChecklists_scrapers is available under the modified BSD licence.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://pypi.python.org/pypi/checklists_scrapers/", "keywords": "eBird worldbirds web scraper birds checklists", "license": "GPL", "maintainer": null, "maintainer_email": null, "name": "checklists_scrapers", "package_url": "https://pypi.org/project/checklists_scrapers/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/checklists_scrapers/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://pypi.python.org/pypi/checklists_scrapers/" }, "release_url": "https://pypi.org/project/checklists_scrapers/0.2.3/", "requires_dist": null, "requires_python": null, "summary": "Web scrapers for downloading checklists of birds from onlinedatabases such as eBird.", "version": "0.2.3" }, "last_serial": 1172694, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "e3219409af8c09738f2f08cc15e55b46", "sha256": "24f9dbff6c58d14d46c24c5ef4bbf84fb7834bda329d9a571bea4c097bdb3570" }, "downloads": -1, "filename": "checklists_scrapers-0.1.0.tar.gz", "has_sig": false, "md5_digest": "e3219409af8c09738f2f08cc15e55b46", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 46625, "upload_time": "2014-02-20T07:44:00", "url": "https://files.pythonhosted.org/packages/5b/9b/593796d3b9e4ff61ab9790d9b65089652607b645f50dabef99f4e688a5f6/checklists_scrapers-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "cbaa04fcf7502e915e7065635a8f47c0", "sha256": "3285e1b3113891300b8433e9f0f9dd51bb4d595b785fab80ba268d1b08fcb2e2" }, "downloads": -1, "filename": "checklists_scrapers-0.1.1.tar.gz", "has_sig": false, "md5_digest": "cbaa04fcf7502e915e7065635a8f47c0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 46867, "upload_time": "2014-03-05T17:14:20", "url": "https://files.pythonhosted.org/packages/4f/2f/d31a95e5096950dc5e292de71369ac3c7b884a764074640c47c770ac8660/checklists_scrapers-0.1.1.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "5d89722dfde854e398c5496c70e3a88c", "sha256": "d6949e2e94c0355b7ba10533e33b64eb7d805e69ed0ab2faa7dbc4283f06c9d8" }, "downloads": -1, "filename": "checklists_scrapers-0.2.0.tar.gz", "has_sig": false, "md5_digest": "5d89722dfde854e398c5496c70e3a88c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 46746, "upload_time": "2014-03-26T22:30:47", "url": "https://files.pythonhosted.org/packages/cb/da/e57ceacc3e917a5c66700309443047519f388ab4beab6893a502d1e286ad/checklists_scrapers-0.2.0.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "804d37cf5a426500574adf3aa4142003", "sha256": "18a45f3ee8536b6d097e2ed3983f62ca432022374e9867ef81826fd5dae7240d" }, "downloads": -1, "filename": "checklists_scrapers-0.2.1.tar.gz", "has_sig": false, "md5_digest": "804d37cf5a426500574adf3aa4142003", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 46787, "upload_time": "2014-04-26T15:57:36", "url": "https://files.pythonhosted.org/packages/47/1f/d7c94755c076138d3ef7328ce79a04d720449ed7ba0dbd78feba32240ea3/checklists_scrapers-0.2.1.tar.gz" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "c0ff9e7ae8ac250dd84fc630ee809299", "sha256": "60c44e15225bcc763c756d9928a999cd3d89fcba5d451a34fe735cacbdc4404e" }, "downloads": -1, "filename": "checklists_scrapers-0.2.2.tar.gz", "has_sig": false, "md5_digest": "c0ff9e7ae8ac250dd84fc630ee809299", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 47082, "upload_time": "2014-07-28T13:26:31", "url": "https://files.pythonhosted.org/packages/c6/0a/51ad4c24a21ba3199661bb7343e4e7b2460a9e8a5050ba95a9192fc69bc2/checklists_scrapers-0.2.2.tar.gz" } ], "0.2.3": [ { "comment_text": "", "digests": { "md5": "583bd9e881bb9234b1b7f5878f6d4327", "sha256": "7c9c59bee4728abca08c754d68cdab33f70b5da2bc289fd07c40b191e90b21f1" }, "downloads": -1, "filename": "checklists_scrapers-0.2.3.tar.gz", "has_sig": false, "md5_digest": "583bd9e881bb9234b1b7f5878f6d4327", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 47830, "upload_time": "2014-07-29T08:37:06", "url": "https://files.pythonhosted.org/packages/85/ef/27fb703f5183c12673df10fddb5f916ebba6a80db411de8c1506dd61688d/checklists_scrapers-0.2.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "583bd9e881bb9234b1b7f5878f6d4327", "sha256": "7c9c59bee4728abca08c754d68cdab33f70b5da2bc289fd07c40b191e90b21f1" }, "downloads": -1, "filename": "checklists_scrapers-0.2.3.tar.gz", "has_sig": false, "md5_digest": "583bd9e881bb9234b1b7f5878f6d4327", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 47830, "upload_time": "2014-07-29T08:37:06", "url": "https://files.pythonhosted.org/packages/85/ef/27fb703f5183c12673df10fddb5f916ebba6a80db411de8c1506dd61688d/checklists_scrapers-0.2.3.tar.gz" } ] }