GitHub

{ "info": { "author": "William Forde", "author_email": "willforde@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Operating System :: OS Independent", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Text Processing :: Markup :: HTML" ], "description": ".. image:: https://badge.fury.io/py/htmlement.svg\n :target: https://pypi.python.org/pypi/htmlement\n\n.. image:: https://readthedocs.org/projects/python-htmlement/badge/?version=stable\n :target: http://python-htmlement.readthedocs.io/en/stable/?badge=stable\n\n.. image:: https://travis-ci.org/willforde/python-htmlement.svg?branch=master\n :target: https://travis-ci.org/willforde/python-htmlement\n\n.. image:: https://coveralls.io/repos/github/willforde/python-htmlement/badge.svg?branch=master\n :target: https://coveralls.io/github/willforde/python-htmlement?branch=master\n\n.. image:: https://api.codacy.com/project/badge/Grade/6b46406e1aa24b95947b3da6c09a4ab5\n :target: https://www.codacy.com/app/willforde/python-htmlement?utm_source=github.com&utm_medium=referral&utm_content=willforde/python-htmlement&utm_campaign=Badge_Grade\n\n.. image:: https://img.shields.io/pypi/pyversions/htmlement.svg\n :target: https://pypi.python.org/pypi/htmlement\n\n.. image:: https://img.shields.io/badge/Say%20Thanks-!-1EAEDB.svg\n :target: https://saythanks.io/to/willforde\n\nHTMLement\n---------\n\nHTMLement is a pure Python HTML Parser.\n\nThe object of this project is to be a \"pure-python HTML parser\" which is also \"faster\" than \"beautifulsoup\".\nAnd like \"beautifulsoup\", will also parse invalid html.\n\nThe most simple way to do this is to use ElementTree `XPath expressions`__.\nPython does support a simple (read limited) XPath engine inside its \"ElementTree\" module.\nA benefit of using \"ElementTree\" is that it can use a \"C implementation\" whenever available.\n\nThis \"HTML Parser\" extends `html.parser.HTMLParser`_ to build a tree of `ElementTree.Element`_ instances.\n\nInstall\n-------\nRun ::\n\n pip install htmlement\n\n-or- ::\n\n pip install git+https://github.com/willforde/python-htmlement.git\n\nParsing HTML\n------------\nHere I\u2019ll be using a sample \"HTML document\" that will be \"parsed\" using \"htmlement\": ::\n\n html = \"\"\"\n \n \n GitHub\n \n \n GitHub\n GitHub Project\n \n \n \"\"\"\n\n # Parse the document\n import htmlement\n root = htmlement.fromstring(html)\n\nRoot is an ElementTree.Element_ and supports the ElementTree API\nwith XPath expressions. With this I'm easily able to get both the title and all anchors in the document. ::\n\n # Get title\n title = root.find(\"head/title\").text\n print(\"Parsing: %s\" % title)\n\n # Get all anchors\n for a in root.iterfind(\".//a\"):\n print(a.get(\"href\"))\n\nAnd the output is as follows: ::\n\n Parsing: GitHub\n https://github.com/willforde\n https://github.com/willforde/python-htmlement\n\n\nParsing HTML with a filter\n--------------------------\nHere I\u2019ll be using a slightly more complex \"HTML document\" that will be \"parsed\" using \"htmlement with a filter\" to fetch\nonly the menu items. This can be very useful when dealing with large \"HTML documents\" since it can be a lot faster to\nonly \"parse the required section\" and to ignore everything else. ::\n\n html = \"\"\"\n \n \n Coffee shop\n \n \n \n

Sugar
Cream

\n \n \n \"\"\"\n\n # Parse the document\n import htmlement\n root = htmlement.fromstring(html, \"ul\", attrs={\"class\": \"menu\"})\n\nIn this case I'm not unable to get the title, since all elements outside the filter were ignored.\nBut this allows me to be able to extract all \"list_item elements\" within the menu list and nothing else. ::\n\n # Get all listitems\n for item in root.iterfind(\".//li\"):\n # Get text from listitem\n print(item.text)\n\nAnd the output is as follows: ::\n\n Coffee\n Tea\n Milk\n\n.. _html.parser.HTMLParser: https://docs.python.org/3.6/library/html.parser.html#html.parser.HTMLParser\n.. _ElementTree.Element: https://docs.python.org/3.6/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element\n.. _Xpath: https://docs.python.org/3.6/library/xml.etree.elementtree.html#xpath-support\n__ XPath_\n\n\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/willforde/python-htmlement", "keywords": "html html5 parsehtml htmlparser elementtree dom", "license": "MIT License", "maintainer": "", "maintainer_email": "", "name": "htmlement", "package_url": "https://pypi.org/project/htmlement/", "platform": "OS Independent", "project_url": "https://pypi.org/project/htmlement/", "project_urls": { "Homepage": "https://github.com/willforde/python-htmlement" }, "release_url": "https://pypi.org/project/htmlement/1.0.0/", "requires_dist": null, "requires_python": "", "summary": "Pure-Python HTML parser with ElementTree support.", "version": "1.0.0" }, "last_serial": 3382475, "releases": { "0.2": [ { "comment_text": "", "digests": { "md5": "06f0d31854b00f664f4edc405d29c532", "sha256": "92fa3411440556fbac0f5dcca8b39850af8773731831e7c74d4c0c6c78538caf" }, "downloads": -1, "filename": "htmlement-0.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "06f0d31854b00f664f4edc405d29c532", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 10139, "upload_time": "2017-01-23T06:15:55", "url": "https://files.pythonhosted.org/packages/1d/58/6941222f06433d35a08a7152f446f31dec0dbd724c8b3bbd0dde2d75296c/htmlement-0.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "2ff014607cb3d2f2b626a478cc52311c", "sha256": "c1d6233215553a546332f80311831d3cc057d6699d58a67fc217141f3e106500" }, "downloads": -1, "filename": "htmlement-0.2.tar.gz", "has_sig": false, "md5_digest": "2ff014607cb3d2f2b626a478cc52311c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7610, "upload_time": "2017-01-23T06:15:57", "url": "https://files.pythonhosted.org/packages/f8/f6/b71d1e7c45f77b1b1ff4a08201b764a026e4cbd7d5dd828afeff077b53d0/htmlement-0.2.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "d169cb13008d4c1aa1489ca2e30d1c9b", "sha256": "62507d04e828dd35b689e33e7c2804e8574127d28e9fc97445a59c3a4f4ebf24" }, "downloads": -1, "filename": "htmlement-0.2.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "d169cb13008d4c1aa1489ca2e30d1c9b", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 10156, "upload_time": "2017-01-24T04:44:37", "url": "https://files.pythonhosted.org/packages/2f/b4/2a05ccc1a51ad2939697bdf509e6234e63103d9482ab895720ed96d48e1b/htmlement-0.2.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a3a303df0f5f08e403ba137df6be55a8", "sha256": "b5c40ac22821ababa596aec61285b41d367fb122d0fe81166b8272f76109a020" }, "downloads": -1, "filename": "htmlement-0.2.1.tar.gz", "has_sig": false, "md5_digest": "a3a303df0f5f08e403ba137df6be55a8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7599, "upload_time": "2017-01-24T04:44:39", "url": "https://files.pythonhosted.org/packages/ad/8e/d0e45fad89f68908b95d4610f3538e60dc632f67b602093e5e71afeb9e14/htmlement-0.2.1.tar.gz" } ], "0.2.3": [ { "comment_text": "", "digests": { "md5": "2097d7b48e05dd064cdd450415a31272", "sha256": "c8fd630bdfc5f0f5cf3ccf7779bf737a4874794fd63f0a49d08e4ff9270c0854" }, "downloads": -1, "filename": "htmlement-0.2.3-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "2097d7b48e05dd064cdd450415a31272", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 11845, "upload_time": "2017-04-15T03:45:54", "url": "https://files.pythonhosted.org/packages/b0/99/8ce1fe29a48c38a2696f3e43ace4852b645c1b78a5672037d4ae23cafddf/htmlement-0.2.3-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "5a8068786087e79d812715c55a68de57", "sha256": "3c7fd079d6a1718cae22941b023bc8f251e998f61fc25c3900f9f10303ddbadc" }, "downloads": -1, "filename": "htmlement-0.2.3.tar.gz", "has_sig": false, "md5_digest": "5a8068786087e79d812715c55a68de57", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8267, "upload_time": "2017-04-15T03:45:56", "url": "https://files.pythonhosted.org/packages/03/c3/b2cc64b0dc5a39c550312c400160eaaa9c9cce92787d1fee45e1418df1dd/htmlement-0.2.3.tar.gz" } ], "1.0.0": [ { "comment_text": "", "digests": { "md5": "1c7b1a52e4ee1089000c49985e0caaa3", "sha256": "a6fdf36f2f54c5ebf17dde12a24e16fed3e68a84c0cbef8c08e666199cfec319" }, "downloads": -1, "filename": "htmlement-1.0.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "1c7b1a52e4ee1089000c49985e0caaa3", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 11316, "upload_time": "2017-12-02T12:38:30", "url": "https://files.pythonhosted.org/packages/2d/b9/4de72344e6d471650f179f11a4ea60d11dbd1672548a50e3159cb3cacecb/htmlement-1.0.0-py2.py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "1c7b1a52e4ee1089000c49985e0caaa3", "sha256": "a6fdf36f2f54c5ebf17dde12a24e16fed3e68a84c0cbef8c08e666199cfec319" }, "downloads": -1, "filename": "htmlement-1.0.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "1c7b1a52e4ee1089000c49985e0caaa3", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 11316, "upload_time": "2017-12-02T12:38:30", "url": "https://files.pythonhosted.org/packages/2d/b9/4de72344e6d471650f179f11a4ea60d11dbd1672548a50e3159cb3cacecb/htmlement-1.0.0-py2.py3-none-any.whl" } ] }