{ "info": { "author": "Marc Brinkmann", "author_email": "git@marcbrinkmann.de", "bugtrack_url": null, "classifiers": [], "description": "from ragstoriches\n=================\n\n``ragstoriches`` is a combined library/framework to ease writing web scrapers\nusing gevent and requests.\n\nA simple example to tell the story:\n\n.. code-block:: python\n\n #!/usr/bin/env python\n # -*- coding: utf-8 -*-\n\n from urlparse import urljoin\n import re\n\n from bs4 import BeautifulSoup\n from ragstoriches.scraper import Scraper\n\n rr = Scraper(__name__)\n\n @rr.scraper\n def index(requests, context,\n url='http://eastidaho.craigslist.org/search/act?query=+'):\n soup = BeautifulSoup(requests.get(url).text)\n\n for row in soup.find_all(class_='row'):\n yield 'posting', context, urljoin(url, row.find('a').attrs['href'])\n\n nextpage = soup.find(class_='nextpage')\n if nextpage:\n yield 'index', context, urljoin(url, nextpage.find('a').attrs['href'])\n\n\n @rr.scraper\n def posting(requests, context, url):\n soup = BeautifulSoup(requests.get(url).text)\n infos = soup.find(class_='postinginfos').find_all(class_='postinginfo')\n\n title = soup.find(class_='postingtitle').text.strip()\n id = re.findall('\\d+', infos[0].text)[0]\n date = infos[1].find('date').text.strip()\n body = soup.find(id='postingbody').text.strip()\n\n print title\n print '=' * len(title)\n print 'post *%s*, posted on %s' % (id, date)\n print body\n print\n\nInstall the library and `BeautifulSoup 4\n`_ using ``pip install\nragstoriches beautifulsoup4``, then save the above as ``craigs.py``,\nfinally run with ``ragstoriches craigs.py``.\n\nYou will get a bunch of jumbled input, so next step is redirecting ``stdout``\nto a file::\n\n ragstoriches craigs.py > output.md\n\nTry giving different urls for this scraper on the command-line:\n\n.. 
code-block:: sh\n\n ragstoriches craigs.py http://newyork.craigslist.org/mnh/acc/ > output.md # hustle\n ragstoriches craigs.py http://orangecounty.craigslist.org/wet/ > output.md # writing OC\n ragstoriches craigs.py http://seattle.craigslist.org/w4m/ > output.md # sleepless in seattle\n\nThere are a lot of commandline-options available, see ``ragstoriches --help``\nfor a list.\n\n\nWriting scrapers\n----------------\n\nA scraper module consists of some initialization code and a number of\nsubscrapers. Scraping starts by calling the scraper named ``index`` on the\nscraper ``rr`` in the module (see the example above).\n\nThe ``requests`` argument should be treated like the `requests\n`_ module (it actually is an instance of requests\nPool). As long as you use it for fetching webpages, you never have to worry\nabout blocking or exceeding concurrency limits.\n\nThe ``context`` variable is arbitrary, but by convention a dictionary. It's a\nway of passing state from one scraper to another or sharing it. It is only\npassed on by ``ragstoriches`` and never touched otherwise.\n\nThe ``url`` is the url to scrape and parse.\n\nReturn values of scrapers are ignored. However, if a scraper is a generator\n(i.e. contains a yield statement), any value it yields must be a 3-tuple\nconsisting of the name of a scraper, a context object and another url. These\nare added to the queue of jobs to scrape.\n\nGood friends of ``ragstoriches`` are the `urlparse.urljoin\n`_ function\nand `BeautifulSoup4 `_.\n\n\nUsage as a library\n------------------\n\nYou can use ragstoriches as a library as well by not using the commandline\ntools but simply importing a scraper and running it with the ``scrape()``\nmethod. 
Remember to monkey-patch using gevent first.\n\nSee the source files for details, as there is not that much documentation\navailable at this point.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/mbr/ragstoriches", "keywords": null, "license": "MIT", "maintainer": null, "maintainer_email": null, "name": "ragstoriches", "package_url": "https://pypi.org/project/ragstoriches/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/ragstoriches/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://github.com/mbr/ragstoriches" }, "release_url": "https://pypi.org/project/ragstoriches/0.2/", "requires_dist": null, "requires_python": null, "summary": "Develop highly-concurrent web scrapers, easily.", "version": "0.2" }, "last_serial": 798439, "releases": { "0.2": [ { "comment_text": "", "digests": { "md5": "1bfd0bc2e8d04ddea35222e1b87ebd6b", "sha256": "dfc981d0adf3a0d294c2783bd1630d7d38af1cc9b5c81a6f7359f6c7b319bbee" }, "downloads": -1, "filename": "ragstoriches-0.2.tar.gz", "has_sig": true, "md5_digest": "1bfd0bc2e8d04ddea35222e1b87ebd6b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4872, "upload_time": "2013-03-10T16:43:01", "url": "https://files.pythonhosted.org/packages/52/0e/46ae7bfb921137c59e449b55c2a7f64df08ded8c4f78d02c18ca54efaa91/ragstoriches-0.2.tar.gz" } ], "0.3.1dev": [ { "comment_text": "", "digests": { "md5": "ba336638539376f7c7380d1bad645ebb", "sha256": "8c572bb9a58364b5e767708fcd35c2120f4716cd1a13821bf6395158ac34bb79" }, "downloads": -1, "filename": "ragstoriches-0.3.1dev.tar.gz", "has_sig": true, "md5_digest": "ba336638539376f7c7380d1bad645ebb", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6589, "upload_time": "2013-03-13T01:20:30", "url": 
"https://files.pythonhosted.org/packages/b6/71/fa3f9a5e97ad97467bf75784b6302a713db5d7ca9f3a3a50e857adef09e8/ragstoriches-0.3.1dev.tar.gz" } ], "0.3dev": [ { "comment_text": "", "digests": { "md5": "4e7d5dc6ccd3946daa01251ea3d971e9", "sha256": "a216eca9cbfcd8bd0c2923e674a55ef6600fc28dec5d422fbc3ebd5f44046401" }, "downloads": -1, "filename": "ragstoriches-0.3dev.tar.gz", "has_sig": true, "md5_digest": "4e7d5dc6ccd3946daa01251ea3d971e9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5619, "upload_time": "2013-03-12T19:12:04", "url": "https://files.pythonhosted.org/packages/a8/69/ee1c274e2c28046b632c2c0b16f10dc5229d5581db395d66f0b90da7357d/ragstoriches-0.3dev.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "1bfd0bc2e8d04ddea35222e1b87ebd6b", "sha256": "dfc981d0adf3a0d294c2783bd1630d7d38af1cc9b5c81a6f7359f6c7b319bbee" }, "downloads": -1, "filename": "ragstoriches-0.2.tar.gz", "has_sig": true, "md5_digest": "1bfd0bc2e8d04ddea35222e1b87ebd6b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4872, "upload_time": "2013-03-10T16:43:01", "url": "https://files.pythonhosted.org/packages/52/0e/46ae7bfb921137c59e449b55c2a7f64df08ded8c4f78d02c18ca54efaa91/ragstoriches-0.2.tar.gz" } ] }