{ "info": { "author": "Adrian Ghizaru", "author_email": "adrian.ghizaru@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Topic :: Internet", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Text Processing :: Filters", "Topic :: Text Processing :: Markup :: HTML", "Topic :: Text Processing :: Markup :: XML" ], "description": "Struct-o-Miner is an elegant Python library for extracting structured data from HTML or XML documents.\nIt's ideal for situations where you have your document in a string and just want the data out of it,\nsomething like a fancy type casting operation.\n\n\nFeatures\n--------\n\n**Declarative syntax.** The format of data is static, so any imperative code you have to write to\nextract it is just boilerplate. Instead, declare the structures you're interested in much in the same\nway you define models in Django or SQLAlchemy, and let Struct-o-Miner take care of the boring parts.\n\n**Rich data types.** Obtain your data directly as Python types using fields like TextField, IntField\nor DateTimeField. You can even have lists of dictionaries using StructuredListField.\n\n**Organized.** The most cumbersome part of scraping is data cleanup. All the exceptional cases and\nreal-world considerations can rapidly degenerate into complicated and unmaintainable spaghetti.\nStruct-o-Miner provides tools to separate this code by field and by semantic concern.\n\n**Focused.** Struct-o-Miner adheres to the Unix philosophy of doing one thing and doing it well:\nyou give it a document and it gives you structured data. Scraping is not exclusively part of\nweb crawling, and Struct-o-Miner is a small library that enables you to do it in any project,\nwith no additional cruft.\n\n\nOverview\n--------\n\nFor a quick example, consider the following HTML snippet:\n\n.. code-block:: html\n\n 
\n\nHere is a document that targets some of the data we might be interested in:\n\n.. code-block:: python\n\n class Stuff(Document):\n foo = TextField('//div/span[@class=\"foo\"]')\n bar_name = TextField('//div/a')\n bar_url = URLField('//div/a') # Same xpath, but URLField extracts the href\n things = StructuredListField('//div/ul/li', structure=dict(\n # A StructuredField for each element selected by the xpath above\n # Sub-element xpaths are relative to their respective parent\n date = DateField('./span'),\n number = IntField('.')))\n\n @bar_name.postprocessor\n def _extract_the_bar_name(value, **kwargs):\n # Remove 'Example: ' after the field is parsed\n return value.split(' ')[-1]\n\n @bar_name.postprocessor\n def _uppercase_the_bar_name(value, **kwargs):\n # Handle the field after the previous processor ran\n return value.upper()\n\n @things.number.preprocessor\n def _clean_numbers(value, **kwargs):\n # Isolate the numeric part before the field is parsed as an int\n return value.strip(': ').split(' ')[0]\n\nNow we just pass the HTML to this object for parsing, and data is then available using typical Python element access.\nIn Struct-o-Miner, we call this **value access**.\n\n.. code-block:: pycon\n\n >>> data = Stuff(html)\n\n >>> pprint(dict(data))\n {'bar_name': 'Bar',\n 'bar_url': 'http://example.com/bar',\n 'foo': 'Foo',\n 'things': [{'date': datetime.date(2014, 3, 1), 'number': 1},\n {'date': datetime.date(2014, 3, 5), 'number': 3}]}\n\n >>> data['things'][0]['date']\n datetime.date(2014, 3, 1)\n\nYou can also reach the field object for each datum using parentheses (i.e. function calls).\n**Field access** may seem un-pythonic at first, but every field containing some kind of structure\n(ListField, DictField, StructuredField and variants) is also a callable that returns the\nrequested child object.\n\n.. code-block:: pycon\n\n >>> data('things')(0)['date']\n datetime.date(2014, 3, 1)\n\n >>> data('things')(0)('date')\n