{ "info": { "author": "Tim Savannah", "author_email": "kata198@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Topic :: Internet :: WWW/HTTP", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Text Processing :: Markup :: HTML" ], "description": "AdvancedHTMLParser\n==================\n\nAdvancedHTMLParser is an Advanced HTML Parser, with support for adding, removing, modifying, and formatting HTML. \n\nIt aims to provide the same interface as you would find in a compliant browser through javascript ( i.e. all the getElement methods, appendChild, etc), as well as many more complex and sophisticated features not available through a browser. And most importantly, it's in python!\n\n\nThere are many potential applications, not limited to:\n\n * Webpage Scraping / Data Extraction\n\n * Testing and Validation\n\n * HTML Modification/Insertion\n\n * Outputting your website\n\n * Debugging\n\n * HTML Document generation\n\n * Web Crawling\n\n * Formatting HTML documents or web pages\n\n\nIt is especially good for servlets/webpages. 
It is quick to take an expertly crafted page in raw HTML / css, and have your servlets ingest it with AdvancedHTMLParser and create/insert data elements into the existing view using a simple and well-known interface ( javascript-like + HTML DOM ).\n\nAnother useful scenario is creating automated testing suites which can operate much more quickly and reliably (and at a deeper function-level) than in-browser testing suites.\n\n\n\nFull API\n--------\n\nCan be found at http://htmlpreview.github.io/?https://github.com/kata198/AdvancedHTMLParser/blob/master/doc/AdvancedHTMLParser.html?vers=8.1.2 .\n\n\nExamples\n--------\n\nVarious examples can be found in the \"tests\" directory. A very old, simple example can also be found as \"example.py\" in the root directory.\n\n\nShort Doc\n---------\n\n\n**AdvancedHTMLParser**\n\nThink of this like \"document\" in a browser.\n\n\nThe AdvancedHTMLParser can read in a file (or string) of HTML, and will create a modifiable DOM tree from it. It can also be constructed manually from AdvancedHTMLParser.AdvancedTag objects.\n\n\nTo populate an AdvancedHTMLParser from existing HTML:\n\n\tparser = AdvancedHTMLParser.AdvancedHTMLParser()\n\n\t# Parse an HTML string into the document\n\n\tparser.parseStr(htmlStr)\n\n\t# Parse an HTML file into the document\n\n\tparser.parseFile(filename)\n\n\n\nThe parser then exposes many \"standard\" functions as you'd find on the web for accessing the data, and some others:\n\n\tgetElementsByTagName \\- Returns a list of all elements matching a tag name\n\n\tgetElementsByName \\- Returns a list of all elements with a given name attribute\n\n\tgetElementById \\- Returns the single AdvancedTag matching the provided ID, or None if no element matches\n\n\tgetElementsByClassName \\- Returns a list of all elements containing one or more space\\-separated class names\n\n\tgetElementsByAttr \\- Returns a list of all elements matching a particular attribute/value pair.\n\n\tgetElementsWithAttrValues \\- Returns a list 
of all elements with a specific attribute name containing one of a list of values\n\n\tgetElementsCustomFilter \\- Provide a function/lambda that takes a tag argument, and returns True to \"match\" it. Returns all matched objects\n\n\tgetRootNodes \\- Get a list of nodes at root level (0)\n\n\tgetAllNodes \\- Get all the nodes contained within this document\n\n\tgetHTML \\- Returns string of HTML representing this DOM\n\n\tgetFormattedHTML \\- Returns a formatted string (using AdvancedHTMLFormatter; see below) of the HTML. Takes as argument an indent (defaults to four spaces)\n\n\tgetMiniHTML \\- Returns a \"mini\" HTML representation which disregards all whitespace and indentation beyond the functional single\\-space\n\n\nThe results of the getElements\\* functions (those that return multiple elements) are TagCollection objects. This is a special kind of list which contains additional functions. See the \"TagCollection\" section below for more info.\n\nThese objects can be modified, and the changes will be reflected in the parent DOM.\n\n\nThe parser also contains some expected properties, like:\n\n\n\thead \\- The \"head\" tag associated with this document, or None\n\n\tbody \\- The \"body\" tag associated with this document, or None\n\n\tforms \\- All \"forms\" on this document as a TagCollection\n\n\n**General Attributes**\n\nIn general, attributes can be accessed with dot-syntax, i.e.\n\n\ttagEm.id = \"Hello\"\n\nwill set the \"id\" attribute. 
If it works in HTML javascript on a tag element, it should work on an AdvancedTag element with python.\n\nsetAttribute, getAttribute, and removeAttribute are more explicit and recommended ways of getting/setting/deleting attributes on elements.\n\nThe same names are used in python as in the javascript/DOM, such as 'className' corresponding to a space-separated string of the 'class' attribute, 'classList' corresponding to a list of classes, etc.\n\n\n**Style Attribute**\n\nStyle attributes can be manipulated just like in javascript, so element.style.position = 'relative' for setting, or element.style.position for access.\n\nYou can also assign the tag.style as a string, like:\n\n\tmyTag.style = \"display: block; float: right; font\\-weight: bold\"\n\nin addition to individual properties:\n\n\tmyTag.style.display = 'block'\n\n\tmyTag.style.float = 'right'\n\n\tmyTag.style.fontWeight = 'bold'\n\nYou can remove a style property by setting its value to an empty string.\n\nFor example, to clear the \"display\" property:\n\n\tmyTag.style.display = ''\n\nA standard method *setProperty* can also be used to set or remove individual properties.\n\nFor example:\n\n\tmyTag.style.setProperty(\"display\", \"block\") # Set display: block\n\n\tmyTag.style.setProperty(\"display\", '') # Clear display: property\n\n\nThe naming conventions are the same as in javascript, like \"element.style.paddingTop\" for the \"padding-top\" attribute.\n\n\n**TagCollection**\n\nA TagCollection can be used like a list. 
Every element has a unique uuid associated with it, and a TagCollection will ensure that the same element does not appear twice within its list (so it acts like an ordered set).\n\nIt also exposes the various getElement\\* functions which operate on the elements within the list (and their children).\n\nFor example:\n\n\t\n\t# Select from the parser all tags with \"item\" in class\n\n\ttagCollection = document.getElementsByClassName('item')\n\n\t# Return all nodes which are nested within any class=\"item\" object\n\n\t# and also contain the class name \"onsale\"\n\n\titemsWithOnSaleClass = tagCollection.getElementsByClassName('onsale')\n\n\nTo operate just on items in the list, you can use the TagCollection method, *filterCollection*, which takes a lambda/function that returns True for each tag to retain in the result.\n\nFor example:\n\n\t# Select from the parser all tags with \"item\" in class\n\n\ttagCollection = document.getElementsByClassName('item')\n\n\t# Provide a lambda to filter this collection, returning in tagCollection2\n\n\t# those items which have a \"value\" attribute > 20 and contain at least\n\n\t# 1 child element with \"specialPrice\" class\n\n\ttagCollection2 = tagCollection.filterCollection( lambda node : int(node.getAttribute('value') or 0) > 20 and len(node.getElementsByClassName('specialPrice')) >= 1 )\n\n\nTagCollections also support advanced filtering (find/filter methods); see the \"Advanced Filtering\" section below.\n\n\n**AdvancedTag**\n\nThe AdvancedTag represents a single tag and its inner text. 
It exposes many of the functions and properties you would expect to be present if using javascript.\n\nEach AdvancedTag also supports the same getElementsBy\\* functions as the parser.\n\nIt adds several additional functions that are not found in javascript, such as peers and arbitrary attribute searching.\n\nSome of these include:\n\n\tappendText \\- Append text to this element\n\n\tappendChild \\- Append a child to this element\n\n\tappendBlock \\- Append a block (text or AdvancedTag) to this element\n\n\tappend \\- alias of appendBlock\n\n\tremoveChild \\- Removes a child\n\n\tremoveText \\- Removes first occurrence of some text from any text nodes\n\n\tremoveTextAll \\- Removes ALL occurrences of some text from any text nodes\n\n\tinsertBefore \\- Inserts a child before an existing child\n\n\tinsertAfter \\- Inserts a child after an existing child\n\n\tgetChildren \\- Returns the children as a list\n\n\tgetStartTag \\- Start Tag, with attributes\n\n\tgetEndTag \\- End Tag\n\n\tgetPeersByName \\- Gets \"peers\" (elements with same parent, at same level in tree) with a given name\n\n\tgetPeersByAttr \\- Gets peers by an arbitrary attribute/value combination\n\n\tgetPeersWithAttrValues \\- Gets peers by an arbitrary attribute/values combination. \n\n\tgetPeersByClassName \\- Gets peers that contain a given class name\n\n\tgetElement\\* \\- Same as on the parser, but acts on the children of this element.\n\n\tgetParentElementCustomFilter \\- Takes a lambda/function and applies it to each parent of this element, moving upward until the document root. 
Returns the first node that when passed to this function returns True, or None if no matches on any parent nodes\n\n\tgetHTML / toHTML / asHTML \\- Get the HTML representation using this node as a root (so start tag and attributes, innerHTML (text and child nodes), and end tag)\n\n\tfirstChild \\- Get the first child of this node, be it text or an element (AdvancedTag)\n\n\tfirstElementChild \\- Get the first child of this node that is an element\n\n\tlastChild \\- Get the last child of this node, be it text or an element (AdvancedTag)\n\n\tlastElementChild \\- Get the last child of this node that is an element\n\n\tnextSibling \\- Get next sibling, be it text or an element\n\n\tnextElementSibling \\- Get next sibling, that is an element\n\n\tpreviousSibling \\- Get previous sibling, be it text or an element\n\n\tpreviousElementSibling \\- Get previous sibling, that is an element\n\n\t{get,set,has,remove}Attribute \\- get/set/test/remove an attribute\n\n\t{add,remove}Class \\- Add/remove a class from the list of classes\n\n\tsetStyle \\- Set a specific style property [like: setStyle(\"font\\-weight\", \"bold\") ]\n\n\tisTagEqual \\- Compare if two tags have the same attributes. 
Using the == operator will compare whether they are the exact same tag (by uuid)\n\n\tgetUid \\- Get a unique ID for this tag (internal)\n\n\tgetAllChildNodes \\- Gets all nodes beneath this node in the document (its children, its children's children, etc)\n\n\tgetAllNodes \\- Same as getAllChildNodes, but also includes this node\n\n\tcontains \\- Check if a provided node appears anywhere beneath this node (as child, child\\-of\\-child, etc)\n\n\tremove \\- Removes this node from its parent element, and disassociates this and all sub\\-nodes from the associated document\n\n\t\\_\\_str\\_\\_ \\- str(tag) will show start tag with attributes, inner text, and end tag\n\n\t\\_\\_repr\\_\\_ \\- Shows a reconstructable representation of this tag\n\n\t\\_\\_getitem\\_\\_ \\- Can be indexed like tag[2] to access the third child (indices are zero\\-based).\n\n\nAnd some properties:\n\n\tchildren/childNodes \\- The children (tags) as a list. NOTE: This returns only AdvancedTag objects, not text.\n\n\tchildBlocks \\- All direct child blocks. This includes both AdvancedTag objects and text nodes (str)\n\n\tinnerHTML \\- The innerHTML including the html of all children\n\n\tinnerText \\- The text nodes, in order, as they appear as direct children to this node as a string\n\n\ttextContent \\- All the text nodes, in order, as they appear within this node or any children (or their children, etc.)\n\n\touterHTML \\- innerHTML wrapped in this tag\n\n\tclassNames/classList \\- a list of the classes\n\n\tparentNode/parentElement \\- The parent tag\n\n\ttagName \\- The tag name\n\n\townerDocument \\- The document associated with this node, if any\n\n\nAnd many others. See the pydocs for a full list, and associated docstrings.\n\n\n**Appending raw HTML**\n\nYou can append raw HTML to a tag by calling:\n\n\ttagEm.appendInnerHTML('