{ "info": { "author": "James Graham", "author_email": "james@hoppipolla.co.uk", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Text Processing :: Markup :: HTML" ], "description": "html5lib\n========\n\n.. image:: https://travis-ci.org/html5lib/html5lib-python.png?branch=master\n :target: https://travis-ci.org/html5lib/html5lib-python\n\nhtml5lib is a pure-Python library for parsing HTML. It is designed to\nconform to the WHATWG HTML specification, as implemented by all major\nweb browsers.\n\n\nUsage\n-----\n\nSimple usage follows this pattern:\n\n.. code-block:: python\n\n import html5lib\n with open(\"mydocument.html\", \"rb\") as f:\n document = html5lib.parse(f)\n\nor:\n\n.. code-block:: python\n\n import html5lib\n document = html5lib.parse(\"
Hello World!\")\n\nBy default, the ``document`` will be an ``xml.etree`` element instance.\nWhenever possible, html5lib chooses the accelerated ``ElementTree``\nimplementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).\n\nTwo other tree types are supported: ``xml.dom.minidom`` and\n``lxml.etree``. To use an alternative format, specify the name of\na treebuilder:\n\n.. code-block:: python\n\n import html5lib\n with open(\"mydocument.html\", \"rb\") as f:\n lxml_etree_document = html5lib.parse(f, treebuilder=\"lxml\")\n\nWhen using with ``urllib2`` (Python 2), the charset from HTTP should be\npassed into html5lib as follows:\n\n.. code-block:: python\n\n from contextlib import closing\n from urllib2 import urlopen\n import html5lib\n\n with closing(urlopen(\"http://example.com/\")) as f:\n document = html5lib.parse(f, transport_encoding=f.info().getparam(\"charset\"))\n\nWhen using with ``urllib.request`` (Python 3), the charset from HTTP\nshould be passed into html5lib as follows:\n\n.. code-block:: python\n\n from urllib.request import urlopen\n import html5lib\n\n with urlopen(\"http://example.com/\") as f:\n document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())\n\nTo have more control over the parser, create a parser object explicitly.\nFor instance, to make the parser raise exceptions on parse errors, use:\n\n.. code-block:: python\n\n import html5lib\n with open(\"mydocument.html\", \"rb\") as f:\n parser = html5lib.HTMLParser(strict=True)\n document = parser.parse(f)\n\nWhen you're instantiating parser objects explicitly, pass a treebuilder\nclass as the ``tree`` keyword argument to use an alternative document\nformat:\n\n.. code-block:: python\n\n import html5lib\n parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder(\"dom\"))\n minidom_document = parser.parse(\"
Hello World!\")\n\nMore documentation is available at https://html5lib.readthedocs.io/.\n\n\nInstallation\n------------\n\nhtml5lib works on CPython 2.6+, CPython 3.3+ and PyPy. To install it,\nuse:\n\n.. code-block:: bash\n\n $ pip install html5lib\n\n\nOptional Dependencies\n---------------------\n\nThe following third-party libraries may be used for additional\nfunctionality:\n\n- ``datrie`` can be used under CPython to improve parsing performance\n (though in almost all cases the improvement is marginal);\n\n- ``lxml`` is supported as a tree format (for both building and\n walking) under CPython (but *not* PyPy where it is known to cause\n segfaults);\n\n- ``genshi`` has a treewalker (but not builder); and\n\n- ``chardet`` can be used as a fallback when character encoding cannot\n be determined.\n\n\nBugs\n----\n\nPlease report any bugs on the `issue tracker\n