{
    "info": {
        "author": "Marc Brinkmann",
        "author_email": "git@marcbrinkmann.de",
        "bugtrack_url": null,
        "classifiers": [],
        "description": "from ragstoriches\n=================\n\n``ragstoriches`` is a combined library/framework to ease writing web scrapers\nusing gevent and requests.\n\nA simple example to tell the story:\n\n.. code-block:: python\n\n  #!/usr/bin/env python\n  # -*- coding: utf-8 -*-\n\n  from urlparse import urljoin\n  import re\n\n  from bs4 import BeautifulSoup\n  from ragstoriches.scraper import Scraper\n\n  rr = Scraper(__name__)\n\n  @rr.scraper\n  def index(requests, context,\n            url='http://eastidaho.craigslist.org/search/act?query=+'):\n      soup = BeautifulSoup(requests.get(url).text)\n\n      for row in soup.find_all(class_='row'):\n          yield 'posting', context, urljoin(url, row.find('a').attrs['href'])\n\n      nextpage = soup.find(class_='nextpage')\n      if nextpage:\n          yield 'index', context, urljoin(url, nextpage.find('a').attrs['href'])\n\n\n  @rr.scraper\n  def posting(requests, context, url):\n      soup = BeautifulSoup(requests.get(url).text)\n      infos = soup.find(class_='postinginfos').find_all(class_='postinginfo')\n\n      title = soup.find(class_='postingtitle').text.strip()\n      id = re.findall('\\d+', infos[0].text)[0]\n      date = infos[1].find('date').text.strip()\n      body = soup.find(id='postingbody').text.strip()\n\n      print title\n      print '=' * len(title)\n      print 'post *%s*, posted on %s' % (id, date)\n      print body\n      print\n\nInstall the library and `BeautifulSoup 4\n<https://pypi.python.org/pypi/beautifulsoup4>`_ using ``pip install\nragstoriches beautifulsoup4``, save the above as ``craigs.py``,\nand finally run it with ``ragstoriches craigs.py``.\n\nYou will get a bunch of jumbled output, so the next step is redirecting ``stdout``\nto a file::\n\n   ragstoriches craigs.py > output.md\n\nTry giving different URLs for this scraper on the command line:\n\n.. 
code-block:: sh\n\n   ragstoriches craigs.py http://newyork.craigslist.org/mnh/acc/  > output.md  # hustle\n   ragstoriches craigs.py http://orangecounty.craigslist.org/wet/ > output.md  # writing OC\n   ragstoriches craigs.py http://seattle.craigslist.org/w4m/      > output.md  # sleepless in seattle\n\nThere are a lot of command-line options available; see ``ragstoriches --help``\nfor a list.\n\n\nWriting scrapers\n----------------\n\nA scraper module consists of some initialization code and a number of\nsubscrapers. Scraping starts by calling the scraper named ``index`` on the\nscraper ``rr`` in the module (see the example above).\n\nThe ``requests`` argument should be treated like the `requests\n<http://python-requests.org>`_ module (it actually is an instance of requests\nPool). As long as you use it for fetching webpages, you never have to worry\nabout blocking or exceeding concurrency limits.\n\nThe ``context`` variable is arbitrary, but by convention a dictionary. It's a\nway of passing state from one scraper to another or sharing it. It is only\npassed on by ``ragstoriches`` and never touched otherwise.\n\nThe ``url`` is the URL to scrape and parse.\n\nReturn values of scrapers are ignored. However, if a scraper is a generator\n(i.e. contains a ``yield`` statement), any value it yields must be a 3-tuple\nconsisting of the name of a scraper, a context object and another URL. These\nare added to the queue of jobs to scrape.\n\nGood friends of ``ragstoriches`` are the `urlparse.urljoin\n<http://docs.python.org/2/library/urlparse.html#urlparse.urljoin>`_ function\nand `BeautifulSoup4 <https://beautiful-soup-4.readthedocs.org/en/latest/>`_.\n\n\nUsage as a library\n------------------\n\nYou can also use ``ragstoriches`` as a library: instead of using the command-line\ntools, simply import a scraper and run it with the ``scrape()``\nmethod. 
Remember to monkey-patch using gevent first.\n\nSee the source files for details, as there is not that much documentation\navailable at this point.",
        "description_content_type": null,
        "docs_url": null,
        "download_url": "UNKNOWN",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "http://github.com/mbr/ragstoriches",
        "keywords": null,
        "license": "MIT",
        "maintainer": null,
        "maintainer_email": null,
        "name": "ragstoriches",
        "package_url": "https://pypi.org/project/ragstoriches/",
        "platform": "UNKNOWN",
        "project_url": "https://pypi.org/project/ragstoriches/",
        "project_urls": {
            "Download": "UNKNOWN",
            "Homepage": "http://github.com/mbr/ragstoriches"
        },
        "release_url": "https://pypi.org/project/ragstoriches/0.2/",
        "requires_dist": null,
        "requires_python": null,
        "summary": "Develop highly-concurrent web scrapers, easily.",
        "version": "0.2"
    },
    "last_serial": 798439,
    "releases": {
        "0.2": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "1bfd0bc2e8d04ddea35222e1b87ebd6b",
                    "sha256": "dfc981d0adf3a0d294c2783bd1630d7d38af1cc9b5c81a6f7359f6c7b319bbee"
                },
                "downloads": -1,
                "filename": "ragstoriches-0.2.tar.gz",
                "has_sig": true,
                "md5_digest": "1bfd0bc2e8d04ddea35222e1b87ebd6b",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 4872,
                "upload_time": "2013-03-10T16:43:01",
                "url": "https://files.pythonhosted.org/packages/52/0e/46ae7bfb921137c59e449b55c2a7f64df08ded8c4f78d02c18ca54efaa91/ragstoriches-0.2.tar.gz"
            }
        ],
        "0.3.1dev": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "ba336638539376f7c7380d1bad645ebb",
                    "sha256": "8c572bb9a58364b5e767708fcd35c2120f4716cd1a13821bf6395158ac34bb79"
                },
                "downloads": -1,
                "filename": "ragstoriches-0.3.1dev.tar.gz",
                "has_sig": true,
                "md5_digest": "ba336638539376f7c7380d1bad645ebb",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 6589,
                "upload_time": "2013-03-13T01:20:30",
                "url": "https://files.pythonhosted.org/packages/b6/71/fa3f9a5e97ad97467bf75784b6302a713db5d7ca9f3a3a50e857adef09e8/ragstoriches-0.3.1dev.tar.gz"
            }
        ],
        "0.3dev": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "4e7d5dc6ccd3946daa01251ea3d971e9",
                    "sha256": "a216eca9cbfcd8bd0c2923e674a55ef6600fc28dec5d422fbc3ebd5f44046401"
                },
                "downloads": -1,
                "filename": "ragstoriches-0.3dev.tar.gz",
                "has_sig": true,
                "md5_digest": "4e7d5dc6ccd3946daa01251ea3d971e9",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 5619,
                "upload_time": "2013-03-12T19:12:04",
                "url": "https://files.pythonhosted.org/packages/a8/69/ee1c274e2c28046b632c2c0b16f10dc5229d5581db395d66f0b90da7357d/ragstoriches-0.3dev.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "1bfd0bc2e8d04ddea35222e1b87ebd6b",
                "sha256": "dfc981d0adf3a0d294c2783bd1630d7d38af1cc9b5c81a6f7359f6c7b319bbee"
            },
            "downloads": -1,
            "filename": "ragstoriches-0.2.tar.gz",
            "has_sig": true,
            "md5_digest": "1bfd0bc2e8d04ddea35222e1b87ebd6b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 4872,
            "upload_time": "2013-03-10T16:43:01",
            "url": "https://files.pythonhosted.org/packages/52/0e/46ae7bfb921137c59e449b55c2a7f64df08ded8c4f78d02c18ca54efaa91/ragstoriches-0.2.tar.gz"
        }
    ]
}