{ "info": { "author": "Thomas Trapp", "author_email": "hext@thomastrapp.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "License :: OSI Approved :: Apache Software License", "Operating System :: MacOS :: MacOS X", "Operating System :: POSIX :: Linux", "Programming Language :: C++", "Topic :: Internet :: WWW/HTTP", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "# Hext \u2014 Extract Data from HTML\n\nHext is a domain-specific language for extracting structured data from HTML. It can be thought of as a counterpart to templates, which are typically used by web developers to structure content on the web.\n\n## A Quick Example\nThe following Hext snippet collects all hyperlinks and extracts the `href` and the clickable text.\n```\n<a href:link @text:title />\n```\nHext does so by recursively trying to match every HTML element. In the case above, an element is required to have the tag `a` and an attribute called `href`. If the element matches, its attribute `href` and its textual representation are stored as `link` and `title`, respectively.\n\nIf the above Hext snippet is applied to this piece of HTML:\n```\n
<a href=\"one.html\">Page 1</a>\n<a href=\"two.html\">Page 2</a>\n<a href=\"three.html\">Page 3</a>\n```\nHext will produce the following values:\n```\n{ \"link\": \"one.html\", \"title\": \"Page 1\" },\n{ \"link\": \"two.html\", \"title\": \"Page 2\" },\n{ \"link\": \"three.html\", \"title\": \"Page 3\" }\n```\nYou can use this example in [Hext\u2019s live code editor](https://hext.thomastrapp.com/#anchor-tryit-hext).\nVisit [Hext\u2019s documentation](https://hext.thomastrapp.com/documentation) and its section \u201c[How Hext Matches Elements](https://hext.thomastrapp.com/documentation#matching-elements)\u201d for a more thorough explanation.\n\n## Components\nThis package includes:\n* The Hext Python module\n* The htmlext command-line utility\n\n### Using Hext with Python\nThe module exposes three interfaces:\n* `html = hext.Html(\"...\")` -> object\n* `rule = hext.Rule(\"...\")` -> object\n* `rule.extract(html)` -> dictionary of {string -> string}\n```\nimport hext\nimport requests\nimport json\n\nres = requests.get('https://news.ycombinator.com/')\nres.raise_for_status()\n\n# hext.Html's constructor expects a single argument\n# containing a UTF-8 encoded string of HTML.\nhtml = hext.Html(res.text)\n\n# hext.Rule's constructor expects a single argument\n# containing a Hext snippet.\n# Raises a ValueError on invalid syntax.\nrule = hext.Rule(\"\"\"\n