{ "info": { "author": "Bruno Rocha", "author_email": "rochacbruno@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: BSD License", "Natural Language :: English", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.3" ], "description": "Create scraper using Scrapy Selectors\n============================================\n\n[![Build\nStatus](https://travis-ci.org/rochacbruno/scrapy_model.png)](https://travis-ci.org/rochacbruno/scrapy_model)\n\n[![PyPi version](https://pypip.in/v/scrapy_model/badge.png)](https://pypi.python.org/pypi/scrapy_model/)\n[![PyPi downloads](https://pypip.in/d/scrapy_model/badge.png)](https://pypi.python.org/pypi/scrapy_model/)\n\n## What is Scrapy?\n\nScrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.\n\nhttp://scrapy.org/\n\n\n## What is scrapy_model ?\n\nIt is just a helper to create scrapers using the Scrapy Selectors allowing you to select elements by CSS or by XPATH and structuring your scraper via Models (just like an ORM model) and plugable to an ORM model via ``populate`` method.\n\nImport the BaseFetcherModel, CSSField or XPathField (you can use both)\n\n```python\nfrom scrapy_model import BaseFetcherModel, CSSField\n```\n\nGo to a webpage you want to scrap and use chrome dev tools or firebug to figure out the css paths then considering you want to get the following fragment from some page.\n\n```html\n Bruno Rocha website\n```\n\n```python\nclass MyFetcher(BaseFetcherModel):\n name = CSSField('span#person')\n website = CSSField('span#person a')\n # XPathField('//xpath_selector_here')\n```\n\nFields can receive ``auto_extract=True`` parameter which auto extracts values from selector before calling the parse or processors. Also you can pass the ``takes_first=True`` which will for auto_extract and also tries to get the first element of the result, because scrapy selectors returns a list of matched elements.\n\n\n### Multiple queries in a single field\n\nYou can use multiple queries for a single field\n\n```python\nname = XPathField(\n ['//*[@id=\"8\"]/div[2]/div/div[2]/div[2]/ul',\n '//*[@id=\"8\"]/div[2]/div/div[3]/div[2]/ul']\n)\n```\n\nIn that case, the parsing will try to fetch by the first query and returns if finds a match, else it will try the subsequent queries until it finds something, or it will return an empty selector.\n\n#### Finding the best match by a query validator\n\nIf you want to run multiple queries and also validates the best match you can pass a validator function which will take the scrapy selector an should return a boolean.\n\nExample, imagine you get the \"name\" field defined above and you want to validates each query to ensure it has a 'li' with a text \"Schblaums\" in there.\n\n```python\n\ndef has_schblaums(selector):\n for li in selector.css('li'): # takes each
  • inside the ul selector\n li_text = li.css('::text').extract() # Extract only the text\n if \"Schblaums\" in li_text: # check if \"Schblaums\" is there\n return True # returns that it is validated!\n return False # else all queries are invalid\n\nclass Fetcher(....):\n name = XPathField(\n ['//*[@id=\"8\"]/div[2]/div/div[2]/div[2]/ul',\n '//*[@id=\"8\"]/div[2]/div/div[3]/div[2]/ul'],\n query_validator=has_schblaums,\n default=\"undefined_name\" # optional\n )\n```\n\nIn the above example if both queries are invalid, the \"name\" field will be filled with an empty_selector, or the value defined in \"default\" parameter.\n\n> **NOTE:** if the field has a \"default\" and fails in all the matcher, the default value will be passed to \"processor\" and also to \"parse_\" methods.\n\nEvery method named ``parse_`` will run after all the fields are fetched for each field.\n\n```python\n def parse_name(self, selector):\n # here selector is the scrapy selector for 'span#person'\n name = selector.css('::text').extract()\n return name\n\n def parse_website(self, selector):\n # here selector is the scrapy selector for 'span#person a'\n website_url = selector.css('::attr(href)').extract()\n return website_url\n\n```\n\n\nafter defined need to run the scraper\n\n\n```python\n\nfetcher = Myfetcher(url='http://.....') # optionally you can use cached_fetch=True to cache requests on redis\nfetcher.parse()\n```\n\nNow you can iterate ``_data``, ``_raw_data`` and atributes in fetcher\n\n```python\n>>> fetcher.name\n\n>>> fetcher.name.value\nBruno Rocha\n>>> fetcher._data\n{\"name\": \"Bruno Rocha\", \"website\": \"http://brunorocha.org\"}\n```\n\nYou can populate some object\n\n```python\n>>> obj = MyObject()\n>>> fetcher.populate(obj) # fields optional\n\n>>> obj.name\nBruno Rocha\n```\n\nIf you do not want to define each field explicitly in the class, you can use a json file to automate the process\n\n```python\nclass MyFetcher(BaseFetcherModel):\n \"\"\" will load from json \"\"\"\n\nfetcher = MyFetcher(url='http://.....')\nfetcher.load_mappings_from_file('path/to/file.json')\nfetcher.parse()\n```\n\nIn that case file.json should be\n\n```json\n{\n \"name\": {\"css\", \"span#person\"},\n \"website\": {\"css\": \"span#person a\"}\n}\n```\n\nYou can use ``{\"xpath\": \"...\"}`` in case you prefer select by xpath\n\n\n### parse and processor\n\nThere are 2 ways of transforming or normalizing the data for each field\n\n#### Processors\n\nA processor is a function, or a list of functions which will be called in the given sequence against the field value, it receives the raw_selector or the value depending on auto_extract and takes_first arguments.\n\nIt can be used for Normalization, Clean, Transformation etc..\n\nExample:\n\n```python\n\ndef normalize_state(state_name):\n # query my database and return the first instance of state object\n return MyDatabase.State.Search(name=state_name).first()\n\ndef text_cleanup(state_name):\n return state_name.strip().replace('-', '').lower()\n\nclass MyFetcher(BaseFetcherModel):\n state = CSSField(\n \"#state::text\",\n takes_first=True,\n processor=[text_cleanup, normalize_state]\n )\n\nfetcher = MyFetcher(url=\"http://....\")\nfetcher.parse()\n\nfetcher._raw_data.state\n'Sao-Paulo'\nfetcher._data.state\n\n```\n\n#### Parse methods\n\nany method called ``parse_`` will run after all the process of selecting and parsing, it receives the selector or the value depending on auto_extract and takes_first argument in that field.\n\nexample:\n\n```python\ndef parse_name(self, selector):\n return selector.css('::text').extract()[0].upper()\n```\n\nIn the above case, the name field returns the raw_selector and in the parse method we can build extra queries using ``css`` or ``xpath`` and also we need to extract() the values from the selector and optionally select the first element and apply any transformation we need.\n\n### Caching the html fetch\n\nIn order to cache the html returned by the url fetching for future parsing and tests you specify a cache model, by default there is no cache but you can use the built in RedisCache passing\n\n```python\n from scrapy_model import RedisCache\n fetcher = TestFetcher(cache_fetch=True,\n cache=RedisCache,\n cache_expire=1800)\n```\n\nor specifying arguments to the Redis client.\n\n> it is a general Redis connection from python ``redis`` module\n\n```python\n fetcher = TestFetcher(cache_fetch=True,\n cache=RedisCache(\"192.168.0.12:9200\"),\n cache_expire=1800)\n```\n\nYou can create your own caching structure, e.g: to cache htmls in memcached or s3\n\nthe cache class just need to implement ``get`` and ``set`` methods.\n\n```python\nfrom boto import connect_s3\n\nclass S3Cache(object):\n def __init__(self, *args, **kwargs):\n connection = connect_s3(ACCESS_KEY, SECRET_KEY)\n self.bucket = connection.get_bucket(BUCKET_ID)\n\n def get(self, key):\n value = self.bucket.get_key(key)\n return value.get_contents_as_string() if key else None\n\n def set(self, key, value, expire=None):\n self.bucket.set_contents(key, value, expire=expire)\n\n\nfetcher = MyFetcher(url=\"http://...\",\n cache_fetch=True,\n cache=S3cache,\n cache_expire=1800)\n\n```\n\n### Instalation\n\neasy to install\n\nIf running ubuntu maybe you need to run:\n\n```bash\nsudo apt-get install python-scrapy\nsudo apt-get install libffi-dev\nsudo apt-get install python-dev\n```\n\nthen\n\n```bash\npip install scrapy_model\n```\n\nor\n\n\n```bash\ngit clone https://github.com/rochacbruno/scrapy_model\ncd scrapy_model\npip install -r requirements.txt\npython setup.py install\npython example.py\n```\n\nExample code to fetch the url http://en.m.wikipedia.org/wiki/Guido_van_Rossum\n\n```python\n#coding: utf-8\n\nfrom scrapy_model import BaseFetcherModel, CSSField, XPathField\n\n\nclass TestFetcher(BaseFetcherModel):\n photo_url = XPathField('//*[@id=\"content\"]/div[1]/table/tr[2]/td/a')\n\n nationality = CSSField(\n '#content > div:nth-child(1) > table > tr:nth-child(4) > td > a',\n )\n\n links = CSSField(\n '#content > div:nth-child(11) > ul > li > a.external::attr(href)',\n auto_extract=True\n )\n\n def parse_photo_url(self, selector):\n return \"http://en.m.wikipedia.org/{}\".format(\n selector.xpath(\"@href\").extract()[0]\n )\n\n def parse_nationality(self, selector):\n return selector.css(\"::text\").extract()[0]\n\n def parse_name(self, selector):\n return selector.extract()[0]\n\n def pre_parse(self, selector=None):\n # this method is executed before the parsing\n # you can override it, take a look at the doc string\n\n def post_parse(self):\n # executed after all parsers\n # you can load any data on to self._data\n # access self._data and self._fields for current data\n # self.selector contains original page\n # self.fetch() returns original html\n self._data.url = self.url\n\n\nclass DummyModel(object):\n \"\"\"\n For tests only, it can be a model in your database ORM\n \"\"\"\n\n\nif __name__ == \"__main__\":\n from pprint import pprint\n\n fetcher = TestFetcher(cache_fetch=True)\n fetcher.url = \"http://en.m.wikipedia.org/wiki/Guido_van_Rossum\"\n\n # Mappings can be loaded from a json file\n # fetcher.load_mappings_from_file('path/to/file')\n fetcher.mappings['name'] = {\n \"css\": (\"#section_0::text\")\n }\n\n fetcher.parse()\n\n print \"Fetcher holds the data\"\n print fetcher._data.name\n print fetcher._data\n\n # How to populate an object\n print \"Populating an object\"\n dummy = DummyModel()\n\n fetcher.populate(dummy, fields=[\"name\", \"nationality\"])\n # fields attr is optional\n print dummy.nationality\n pprint(dummy.__dict__)\n\n```\n\n# outputs\n\n\n```\nFetcher holds the data\nGuido van Rossum\n{'links': [u'http://www.python.org/~guido/',\n u'http://neopythonic.blogspot.com/',\n u'http://www.artima.com/weblogs/index.jsp?blogger=guido',\n u'http://python-history.blogspot.com/',\n u'http://www.python.org/doc/essays/cp4e.html',\n u'http://www.twit.tv/floss11',\n u'http://www.computerworld.com.au/index.php/id;66665771',\n u'http://www.stanford.edu/class/ee380/Abstracts/081105.html',\n u'http://stanford-online.stanford.edu/courses/ee380/081105-ee380-300.asx'],\n 'name': u'Guido van Rossum',\n 'nationality': u'Dutch',\n 'photo_url': 'http://en.m.wikipedia.org//wiki/File:Guido_van_Rossum_OSCON_2006.jpg',\n 'url': 'http://en.m.wikipedia.org/wiki/Guido_van_Rossum'}\nPopulating an object\nDutch\n{'name': u'Guido van Rossum', 'nationality': u'Dutch'}\n```", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/rochacbruno/scrapy_model", "keywords": "scrapy_model", "license": "BSD", "maintainer": null, "maintainer_email": null, "name": "scrapy_model", "package_url": "https://pypi.org/project/scrapy_model/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/scrapy_model/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/rochacbruno/scrapy_model" }, "release_url": "https://pypi.org/project/scrapy_model/0.1.5/", "requires_dist": null, "requires_python": null, "summary": "Scrapy helper to create scrapers from models", "version": "0.1.5" }, "last_serial": 1108229, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "96adcd259c9fb04412e5878d62f45ce4", "sha256": "fa2272a2d0d7f4c152dfc3770808499fcac250ccd109d6501f604650eb25ecd5" }, "downloads": -1, "filename": "scrapy_model-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "96adcd259c9fb04412e5878d62f45ce4", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 5124, "upload_time": "2014-05-18T01:57:33", "url": "https://files.pythonhosted.org/packages/b3/bc/2a0a94d56de7cde5c5b67cade6ced8789a1d6766340475c0f054ac97b7e5/scrapy_model-0.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "96d90b080fa7bc4214135ee72f5b511f", "sha256": "25cd49d39582d91b8bd845bb4eb9a4f9ace91a471527f5055999fb2ab542c4d9" }, "downloads": -1, "filename": "scrapy_model-0.1.0.tar.gz", "has_sig": false, "md5_digest": "96d90b080fa7bc4214135ee72f5b511f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4447, "upload_time": "2014-05-18T01:57:29", "url": "https://files.pythonhosted.org/packages/64/7a/ec1f9e3bbf2a61c8ea92f6ef5c17a1b54f268ba85cd92b4849a5f15733fb/scrapy_model-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "11f7ab7e3013ecc7d5e7ac1bf62a42c8", "sha256": "6f34ac4ac50422e5868f0572524aca364c7a318480bd7f7bf1d106cdbc922f60" }, "downloads": -1, "filename": "scrapy_model-0.1.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "11f7ab7e3013ecc7d5e7ac1bf62a42c8", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 5124, "upload_time": "2014-05-18T01:59:09", "url": "https://files.pythonhosted.org/packages/36/70/0ce739c1b0c6c33d50ccc1948bb1254dea841a50b716b01eed202b15ed35/scrapy_model-0.1.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "b143b8357c5ae28a96877b22e1615739", "sha256": "291c378abd72da5b79ad90a467a86f3121f1ba96f908db762ce14f8784f9b50c" }, "downloads": -1, "filename": "scrapy_model-0.1.1.tar.gz", "has_sig": false, "md5_digest": "b143b8357c5ae28a96877b22e1615739", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4824, "upload_time": "2014-05-18T01:59:05", "url": "https://files.pythonhosted.org/packages/4f/1b/68309b2f6455f0636a4484068f0601bc5faaa2be91484bb67679de35d44a/scrapy_model-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "4c290028a97c5fdc201753bc0f7ab945", "sha256": "2d6373a85d45328ef5f2131b2ebe75e70e60f6eec187f793bef5cc3dfef5b197" }, "downloads": -1, "filename": "scrapy_model-0.1.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "4c290028a97c5fdc201753bc0f7ab945", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 8481, "upload_time": "2014-05-18T04:26:41", "url": "https://files.pythonhosted.org/packages/d9/40/cf53821fc4098181d87117916ce5c7e7945bf059a22287ecb77e8072de46/scrapy_model-0.1.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "2b97eec083871260bef9a3c7b409bb00", "sha256": "7290e43d7e286a80918abb69a464bb4ba63b3782de10fdbd0486bea9216cca61" }, "downloads": -1, "filename": "scrapy_model-0.1.2.tar.gz", "has_sig": false, "md5_digest": "2b97eec083871260bef9a3c7b409bb00", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7355, "upload_time": "2014-05-18T04:26:38", "url": "https://files.pythonhosted.org/packages/d1/f1/7b2acd215b66e885347d2c44cd7090ecc30513e10c73a2efb708760b8409/scrapy_model-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "c1a8d382e78ce23ad8f5ae8b5f0095af", "sha256": "f5d97e0e658210480b54fc8ed122a3fa57778f52f45178cda3fd1be7b85b2e1e" }, "downloads": -1, "filename": "scrapy_model-0.1.3-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "c1a8d382e78ce23ad8f5ae8b5f0095af", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 8671, "upload_time": "2014-05-18T07:51:16", "url": "https://files.pythonhosted.org/packages/9d/84/cb8ead18a850072304c68d724c017b84843957a49f30f6ccbc4e8d60b765/scrapy_model-0.1.3-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "5db8aa0edbce0b70934f591b8105e8b7", "sha256": "b760cd2c1d06939e5c3b41f1875c1c756158080fbef57a67e72ac9ca5cb1e730" }, "downloads": -1, "filename": "scrapy_model-0.1.3.tar.gz", "has_sig": false, "md5_digest": "5db8aa0edbce0b70934f591b8105e8b7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7693, "upload_time": "2014-05-18T07:51:12", "url": "https://files.pythonhosted.org/packages/15/bb/7d7ca62fc2718381e631dce9aa1e470c59b425842a301e9525213faabc1a/scrapy_model-0.1.3.tar.gz" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "da081cd753a0e2b03fc4f93264a18493", "sha256": "888fb19440e9df0481054d8da13c2cf8560e8da733aff3f771ba790a8889a993" }, "downloads": -1, "filename": "scrapy_model-0.1.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "da081cd753a0e2b03fc4f93264a18493", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 11517, "upload_time": "2014-05-23T13:37:47", "url": "https://files.pythonhosted.org/packages/0c/12/ba0bcf5310cca88133f470ff50e7d7dae983a2c13b544bc3ab1d10173a53/scrapy_model-0.1.4-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "fb9dcf273a88c4d4047b3b4a893942c7", "sha256": "8632db59ce196854bd5ac751cd74b377349189c8b9410e55e4b104adf900f1dc" }, "downloads": -1, "filename": "scrapy_model-0.1.4.tar.gz", "has_sig": false, "md5_digest": "fb9dcf273a88c4d4047b3b4a893942c7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9592, "upload_time": "2014-05-23T13:37:43", "url": "https://files.pythonhosted.org/packages/7d/1c/3362b7f50031a904e9ca0c3a150f624f690bf1ecfce3b879564a6be8aff6/scrapy_model-0.1.4.tar.gz" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "40908fea7007215d7bb6e2bb208c97e8", "sha256": "13fa30859570d35c2e1dc5cc543380a057f9decdb339fbe41126e8c77a8fed30" }, "downloads": -1, "filename": "scrapy_model-0.1.5-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "40908fea7007215d7bb6e2bb208c97e8", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 13121, "upload_time": "2014-05-29T22:39:55", "url": "https://files.pythonhosted.org/packages/91/28/ab5db86ee73a6f5a515131f27b59977147acb0365f930fb8b8cf61200f0a/scrapy_model-0.1.5-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9710f8dbbfd523421ef5cf871c8d6407", "sha256": "b93fa181a54c2b6ec4ba084a0496c0e5774d03b5040297dcbe49d874035ec458" }, "downloads": -1, "filename": "scrapy_model-0.1.5.tar.gz", "has_sig": false, "md5_digest": "9710f8dbbfd523421ef5cf871c8d6407", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10722, "upload_time": "2014-05-29T22:39:51", "url": "https://files.pythonhosted.org/packages/55/bf/162f87f887bdb5c5644155a181e429ef02f339c1125b4b3594603e389f81/scrapy_model-0.1.5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "40908fea7007215d7bb6e2bb208c97e8", "sha256": "13fa30859570d35c2e1dc5cc543380a057f9decdb339fbe41126e8c77a8fed30" }, "downloads": -1, "filename": "scrapy_model-0.1.5-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "40908fea7007215d7bb6e2bb208c97e8", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 13121, "upload_time": "2014-05-29T22:39:55", "url": "https://files.pythonhosted.org/packages/91/28/ab5db86ee73a6f5a515131f27b59977147acb0365f930fb8b8cf61200f0a/scrapy_model-0.1.5-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9710f8dbbfd523421ef5cf871c8d6407", "sha256": "b93fa181a54c2b6ec4ba084a0496c0e5774d03b5040297dcbe49d874035ec458" }, "downloads": -1, "filename": "scrapy_model-0.1.5.tar.gz", "has_sig": false, "md5_digest": "9710f8dbbfd523421ef5cf871c8d6407", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10722, "upload_time": "2014-05-29T22:39:51", "url": "https://files.pythonhosted.org/packages/55/bf/162f87f887bdb5c5644155a181e429ef02f339c1125b4b3594603e389f81/scrapy_model-0.1.5.tar.gz" } ] }