{ "info": { "author": "Ryan Cerf", "author_email": "ryancerf@yahoo.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Environment :: Console", "Framework :: Scrapy", "Intended Audience :: Developers", "License :: OSI Approved :: BSD License", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Topic :: Utilities" ], "description": "scrapy-sqlitem\n==============\n\nscrapy-sqlitem allows you to define scrapy items using Sqlalchemy models\nor tables. It also provides an easy way to save to the database in\nchunks.\n\nThis project is in beta. Pull requests and feedback are welcome. The\nregular caveats of using a sql database backend for a write heavy\napplication still apply.\n\nInspiration from\n`scrapy-redis `__ and\n`scrapy-djangoitem `__\n\nQuickstart\n==========\n\n::\n\n pip install scrapy_sqlitem\n\n`Define items using Sqlalchemy ORM `__\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n from scrapy_sqlitem import SqlItem\n\n class MyModel(Base):\n __tablename__ = 'mytable'\n id = Column(Integer, primary_key=True)\n name = Column(String)\n\n class MyItem(SqlItem):\n sqlmodel = MyModel\n\n`Or Define Items using Sqlalchemy Core `__\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n from scrapy_sqlitem import SqlItem\n\n class MyItem(SqlItem):\n sqlmodel = Table('mytable', metadata\n Column('id', Integer, primary_key=True),\n Column('name', String, nullable=False))\n\nIf tables have not been created yet make sure to create them. See\nsqlalchemy docs and the example spider.\n\nUse SqlSpider to easily save scraped items to the database\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nsettings.py\n\n.. code:: python\n\n DATABASE_URI = \"sqlite:///\"\n\nDefine your spider\n~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n from scrapy_sqlitem import SqlSpider\n\n class MySpider(SqlSpider):\n name = 'myspider'\n\n start_urls = ('http://dmoz.org',)\n\n def parse(self, response):\n selector = Selector(response)\n item = MyItem()\n item['name'] = selector.xpath('//title[1]/text()').extract_first()\n yield item\n\nRun the spider\n~~~~~~~~~~~~~~\n\n.. code:: sh\n\n scrapy crawl myspider\n\nQuery the database\n~~~~~~~~~~~~~~~~~~\n\n.. code:: sql\n\n Select * from mytable;\n\n id | name |\n ----+-----------------------------------+\n 1 | DMOZ - the Open Directory Project |\n\nOther Information\n=================\n\nDo not want to use SqlSpider? Write a pipeline instead.\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n\n from sqlalchemy import create_engine\n\n class CommitSqlPipeline(object):\n \n def __init__(self):\n self.engine = create_engine(\"sqlite:///\")\n\n def process_item(self, item, spider):\n item.commit_item(engine=self.engine)\n\nDrop items missing required primary key data before saving to the db\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n\n from scrapy.exceptions import DropItem\n\n class DropMissingDataPipeline(object):\n def process_item(self, item, spider):\n if item.null_required_fields:\n raise DropItem\n else:\n return item\n # Watch out for Serial primary keys that are considered null.\n\nSave to the database in chunks rather than item by item\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nInherit from SqlSpider and..\n\nIn settings\n\n.. code:: python\n\n DEFAULT_CHUNKSIZE = 500\n\n CHUNKSIZE_BY_TABLE = {'mytable': 1000, 'othertable': 250}\n\nIf an error occurs while saving a chunk to the db it will try and save\neach item one at a time\n\nAccess the underlying sqlalchemy table to query the database\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: sql\n\n INSERT INTO mytable (id, name) VALUES ('1','ryan')\n\n.. code:: python\n\n myitem = MyItem()\n # bind the table to an engine (I could have done this when I created the table too)\n myitem.table.metadata.bind = self.engine\n myitem.table.select().where(item.table.c.id == 1).execute().fetchone() \n\n (1, 'ryan')\n\nWhat row in the database matches the data in my item?\n\n.. code:: python\n\n myitem = MyItem()\n myitem['id'] = 1\n myitem.get_matching_dbrow(bind=self.engine)\n\n (1, 'ryan')\n\nThis is same query as the one above!\n\nGotchas\n=======\n\nIf you subclass either item\\_scraped or spider\\_closed make sure to call super!\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n\n class MySpider(SqlSpider):\n \n def parse(self, response):\n pass\n\n def spider_closed(self, spider, reason):\n super(MySpider, self).spider_closed(spider, reason)\n self.log(\"Log some really important custom stats\")\n\nBe Careful with other Mixins. The inheritance structure can get a little\nmessy. If a class early in the mro subclasses item\\_scraped and does not\ncall super the item\\_scraped method of SqlSpider will never get called.\n\nOther Methods of sqlitem\n========================\n\nsqlitem.table\n~~~~~~~~~~~~~\n\n- returns the sqlalchemy core table that corresponds to that item.\n\nsqlitem.null\\_required\\_fields\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n- returns a set of the database key names that are are marked not\n nullable and the corresponding data in the item is null.\n\nsqlitem.null\\_primary\\_key\\_fields\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n- returns a set of the primary key names where the corresponding data\n in the item is null.\n\nsqlitem.primary\\_keys\n~~~~~~~~~~~~~~~~~~~~~\n\nsqlitem.required\\_keys\n~~~~~~~~~~~~~~~~~~~~~~\n\nsqlitem.get\\_matching\\_dbrow(bind=None, use\\_cache=True)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n- Find the data in the database that matches the primary key data in\n the item\n\nToDo\n====\n\n- Continuous integration Tests", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/ryancerf/scrapy-sqlitem", "keywords": null, "license": "BSD", "maintainer": null, "maintainer_email": null, "name": "scrapy-sqlitem", "package_url": "https://pypi.org/project/scrapy-sqlitem/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/scrapy-sqlitem/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/ryancerf/scrapy-sqlitem" }, "release_url": "https://pypi.org/project/scrapy-sqlitem/0.1.2/", "requires_dist": null, "requires_python": null, "summary": "Scrapy extension to save items to a sql database", "version": "0.1.2" }, "last_serial": 1678920, "releases": { "0.1.0": [], "0.1.1": [ { "comment_text": "", "digests": { "md5": "1cbf1ee78a9cdc8060d564dfbbed67de", "sha256": "8966238cc65e5ad65db620ec95e9b4b7a5fec17a2548de58d91510d3a20ad681" }, "downloads": -1, "filename": "scrapy-sqlitem-0.1.1.tar.gz", "has_sig": false, "md5_digest": "1cbf1ee78a9cdc8060d564dfbbed67de", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5237, "upload_time": "2015-08-14T20:04:17", "url": "https://files.pythonhosted.org/packages/0e/42/b4d5a134465ad137ca7d7d43beed22d7a85f5ba0d0f059b0fd49148b3dfc/scrapy-sqlitem-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "863293e72f9e3978bb503b4ae848b1c3", "sha256": "971a7bd812548f93e9c154bdef6b35290d959951c64ceda61e7e8ade06f9c456" }, "downloads": -1, "filename": "scrapy-sqlitem-0.1.2.tar.gz", "has_sig": false, "md5_digest": "863293e72f9e3978bb503b4ae848b1c3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6031, "upload_time": "2015-08-15T21:46:35", "url": "https://files.pythonhosted.org/packages/2b/6f/862414a4948c2ded04c4381ed616dbc9790ab7b404566fc7961d5bb186b7/scrapy-sqlitem-0.1.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "863293e72f9e3978bb503b4ae848b1c3", "sha256": "971a7bd812548f93e9c154bdef6b35290d959951c64ceda61e7e8ade06f9c456" }, "downloads": -1, "filename": "scrapy-sqlitem-0.1.2.tar.gz", "has_sig": false, "md5_digest": "863293e72f9e3978bb503b4ae848b1c3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6031, "upload_time": "2015-08-15T21:46:35", "url": "https://files.pythonhosted.org/packages/2b/6f/862414a4948c2ded04c4381ed616dbc9790ab7b404566fc7961d5bb186b7/scrapy-sqlitem-0.1.2.tar.gz" } ] }