{ "info": { "author": "Adrian Ghizaru", "author_email": "adrian.ghizaru@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Topic :: Internet", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Text Processing :: Filters", "Topic :: Text Processing :: Markup :: HTML", "Topic :: Text Processing :: Markup :: XML" ], "description": "Struct-o-Miner is an elegant Python library for extracting structured data from HTML or XML documents.\nIt's ideal for situations where you have your document in a string and just want the data out of it,\nsomething like a fancy type casting operation.\n\n\nFeatures\n--------\n\n**Declarative syntax.** The format of data is static, so any imperative code you have to write to\nextract it is just boilerplate. Instead, declare the structures you're interested in much in the same\nway you define models in Django or SQLAlchemy, and let Struct-o-Miner take care of the boring parts.\n\n**Rich data types.** Obtain your data directly as Python types using fields like TextField, IntField\nor DateTimeField. You can even have lists of dictionaries using StructuredListField.\n\n**Organized.** The most cumbersome part of scraping is data cleanup. All the exceptional cases and\nreal-world considerations can rapidly degenerate into complicated and unmaintanable spaghetti.\nStruct-o-Miner provides tools to separate this code by field and by semantic concern.\n\n**Focused.** Struct-o-Miner adheres to the Unix philosophy of doing one thing and doing it well:\nyou give it a document and it gives you structured data. Scraping is not exclusively part of\nweb crawling, and Struct-o-Miner is a small library that enables you to do it in any project,\nwith no additional cruft.\n\n\nOverview\n--------\n\nFor a quick example, consider the following HTML snippet:\n\n.. code-block:: html\n\n
\n Foo Example: Bar\n \n
\n\nHere is a document that targets some of the data we might be interested in:\n\n.. code-block:: python\n\n class Stuff(Document):\n foo = TextField('//div/span[@class=\"foo\"]')\n bar_name = TextField('//div/a')\n bar_url = URLField('//div/a') # Same xpath, but URLField extracts the href\n things = StructuredListField('//div/ul/li', structure=dict(\n # A StructuredField for each element selected by the xpath above\n # Sub-element xpaths are relative to their respective parent\n date = DateField('./span'),\n number = IntField('.')))\n\n @bar_name.postprocessor\n def _extract_the_bar_name(value, **kwargs):\n # Remove 'Example: ' after the field is parsed\n return value.split(' ')[-1]\n\n @bar_name.postprocessor\n def _uppercase_the_bar_name(value, **kwargs):\n # Handle the field after the previous processor ran\n return value.upper()\n\n @things.number.preprocessor\n def _clean_numbers(value, **kwargs):\n # Isolate the numeric part before the field is parsed as an int\n return value.strip(': ').split(' ')[0]\n\nNow we just pass the HTML to this object for parsing, and data is then available using typical Python element access.\nIn Struct-o-Miner, we call this **value access**.\n\n.. code-block:: pycon\n\n >>> data = Stuff(html)\n\n >>> pprint(dict(data))\n {'bar_name': 'Bar',\n 'bar_url': 'http://example.com/bar',\n 'foo': 'Foo',\n 'things': [{'date': datetime.date(2014, 3, 1), 'number': 1},\n {'date': datetime.date(2014, 3, 5), 'number': 3}]}\n\n >>> data['things'][0]['date']\n datetime.date(2014, 3, 1)\n\nYou can also reach the field object for each datum using parentheses (i.e. function calls).\n**Field access** may seem un-pythonic at first, but every field containing some kind of structure\n(ListField, DictField, StructuredField and variants) is also a callable that returns the\nrequested child object.\n\n.. code-block:: pycon\n\n >>> data('things')(0)['date']\n datetime.date(2014, 3, 1)\n\n >>> data('things')(0)('date')\n \n\nFinally, the third axis of access allows you to reach the objects used as structural\ntemplates in fields such as lists and dictionaries. **Structure access** is what enabled us\nto define the preprocessor on `things.number`. Notice how the following are distinct:\n\n.. code-block:: pycon\n\n >>> data.things.date\n \n\n >>> data('things')(0)('date')\n \n\n\nAlternatives\n------------\n\nThe Python ecosystem is rich in solutions for or related to data scraping and web crawling.\nThis is a survey of possible alternatives, highlighting the unique ways Struct-o-Miner contributes to the scene.\n\n`lxml `_ and `Beautifoul Soup `_ are the\nstandard building blocks of Python scrapers: they both parse markup documents and provide an interface\nto query and manipulate them. Using them directly can be cumbersome though, as data needs to be selected\nmanually. Struct-o-Miner provides a declarative interface for targetting the elements, then uses lxml\nunder the hood to select all the data.\n\n`pyquery `_ wraps lxml.etree with a jQuery-inspired API more familiar to web developers.\nApart from the convenience of selecting elements using CSS, pyquery provides little advantage in scraping over lxml.\nSimilarly, `cssselect `_ converts CSS selectors to XPath queries\nwhich can then be used with lxml. There are plans to support it directly within Struct-o-Miner so that\nfields can be specified using CSS.\n\n`Scrapy `_ is a complete web crawling framework.\nIt can be used to build a reliable crawling operation and benefits from a large community as well as\ncommercial support from `ScrapingHub `_, including a PaaS for running massive Scrapy projects.\nDespite differences in stylistic approach, Struct-o-Miner is comparable in purpose to Scrapy Items and ItemLoaders.\nIt was however designed to provide this functionality as a standalone library,\nwith an arguably more pythonic flavour.\n\n\nInstall\n-------\n\nYou can install Struct-o-Miner from PyPI with `pip `_:\n\n.. code-block:: sh\n\n $ pip install structominer\n\nor from `GitHub `_ with git:\n\n.. code-block:: sh\n\n $ git clone https://github.com/aGHz/structominer.git\n $ cd structominer && python setup.py install", "description_content_type": null, "docs_url": null, "download_url": "https://github.com/aGHz/structominer/archive/0.2.0.tar.gz", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/aGHz/structominer", "keywords": "data,scraping,web,scraper", "license": "MIT", "maintainer": null, "maintainer_email": null, "name": "structominer", "package_url": "https://pypi.org/project/structominer/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/structominer/", "project_urls": { "Download": "https://github.com/aGHz/structominer/archive/0.2.0.tar.gz", "Homepage": "https://github.com/aGHz/structominer" }, "release_url": "https://pypi.org/project/structominer/0.2.0/", "requires_dist": null, "requires_python": null, "summary": "Data scraping for a more civilized age", "version": "0.2.0" }, "last_serial": 1065001, "releases": { "0.2.0": [ { "comment_text": "", "digests": { "md5": "e2e69d13fe223f0aff4bc685a611113f", "sha256": "0e3a246930be404d9565e38e6155e7dbbb56bc0c365d20742d6f0b8c3460df38" }, "downloads": -1, "filename": "structominer-0.2.0.tar.gz", "has_sig": false, "md5_digest": "e2e69d13fe223f0aff4bc685a611113f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16119, "upload_time": "2014-04-19T02:17:54", "url": "https://files.pythonhosted.org/packages/0f/81/a0dfae2f47d224f2b04298eb7eda558bac01dacbfe64e1ba8b7b6d086973/structominer-0.2.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "e2e69d13fe223f0aff4bc685a611113f", "sha256": "0e3a246930be404d9565e38e6155e7dbbb56bc0c365d20742d6f0b8c3460df38" }, "downloads": -1, "filename": "structominer-0.2.0.tar.gz", "has_sig": false, "md5_digest": "e2e69d13fe223f0aff4bc685a611113f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16119, "upload_time": "2014-04-19T02:17:54", "url": "https://files.pythonhosted.org/packages/0f/81/a0dfae2f47d224f2b04298eb7eda558bac01dacbfe64e1ba8b7b6d086973/structominer-0.2.0.tar.gz" } ] }