{ "info": { "author": "Chuancong Gao", "author_email": "chuancong@gmail.com", "bugtrack_url": null, "classifiers": [ "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 3" ], "description": "[![PyPi version](https://img.shields.io/pypi/v/html2json.svg)](https://pypi.python.org/pypi/html2json/)\n[![PyPi pyversions](https://img.shields.io/pypi/pyversions/html2json.svg)](https://pypi.python.org/pypi/html2json/)\n[![PyPi license](https://img.shields.io/pypi/l/html2json.svg)](https://pypi.python.org/pypi/html2json/)\n\nConvert an HTML webpage to JSON data using a template defined in JSON.\n\nInstallation\n----\n\nThis package is available on PyPI. Use `pip install -U html2json` to install it, then import it with `from html2json import collect`.\n\nAPI\n----\n\nThe function is `collect(html, template)`. `html` is the HTML of the page as a string, and `template` is the JSON template loaded as Python objects.\n\nNote that the HTML must contain the root node, like `<html>...</html>` or `<div>...</div>`.\n\nTemplate Syntax\n----\n\n- The basic syntax is `keyName: [selector, attr, [listOfRegexes]]`.\n 1. `selector` is a CSS selector (supported by [lxml](http://lxml.de/)).\n - When the selector is `null`, the root node itself is matched.\n - When the selector cannot be matched, `null` is returned.\n 2. `attr` is the name of the attribute whose value is extracted. When it is `null`, the inner text is used, or the outer text when the inner text is empty.\n 3. The list of regexes `[listOfRegexes]` supports two forms of regex operations. The operations within the list are executed sequentially.\n - Replacement: `s/regex/replacement/g`. The trailing `g` is optional and enables multiple replacements.\n - Extraction: `/regex/`.\n\nFor example:\n\n```json\n{\n \"Color\": [\"head link:nth-of-type(1)\", \"href\", [\"/\\\\w+(?=\\\\.css)/\"]]\n}\n```\n\n- Because the template is JSON, nested structures can be constructed easily.\n\n```json\n{\n \"Cover\": {\n \"URL\": [\".cover img\", \"src\", []],\n \"Number of Favorites\": [\".cover .favorites\", \"value\", []]\n }\n}\n```\n\n- An alternative simplified syntax `keyName: [subRoot, subTemplate]` can be used.\n 1. `subRoot` is the CSS selector of the new root for each sub-entry.\n 2. `subTemplate` is the sub-template for each entry, recursively.\n\nFor example, the previous example can be simplified as follows.\n\n```json\n{\n \"Cover\": [\".cover\", {\n \"URL\": [\"img\", \"src\", []],\n \"Number of Favorites\": [\".favorites\", \"value\", []]\n }]\n}\n```\n\n- To extract a list of sub-entries following the same sub-template, the list syntax is `keyName: [[subRoot, subTemplate]]`. Note the extra surrounding `[` and `]` compared with the previous syntax.\n 1. `subRoot` is the CSS selector of the new root for each sub-entry.\n 2. 
`subTemplate` is the sub-template for each entry, recursively.\n\nFor example:\n\n```json\n{\n \"Comments\": [[\".comments\", {\n \"From\": [\".from\", null, []],\n \"Content\": [\".content\", null, []],\n \"Photos\": [[\"img\", {\n \"URL\": [\"\", \"src\", []]\n }]]\n }]]\n}\n```", "description_content_type": "text/markdown", "docs_url": null, "download_url": "https://github.com/chuanconggao/html2json/tarball/0.2.4.1", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/chuanconggao/html2json", "keywords": "parser,html,json", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "html2json", "package_url": "https://pypi.org/project/html2json/", "platform": "", "project_url": "https://pypi.org/project/html2json/", "project_urls": { "Download": "https://github.com/chuanconggao/html2json/tarball/0.2.4.1", "Homepage": "https://github.com/chuanconggao/html2json" }, "release_url": "https://pypi.org/project/html2json/0.2.4.1/", "requires_dist": null, "requires_python": "", "summary": "Parsing HTML to JSON", "version": "0.2.4.1" }, "last_serial": 4023348, "releases": { "0.1.1": [ { "comment_text": "", "digests": { "md5": "37becba30dd2987b92ea93cd2c07f111", "sha256": "701f6ef7c42b455ccb1b673825e5adeea5527222b558af3ce7be01ba1c3f7729" }, "downloads": -1, "filename": "html2json-0.1.1.tar.gz", "has_sig": false, "md5_digest": "37becba30dd2987b92ea93cd2c07f111", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1657, "upload_time": "2017-05-27T20:25:10", "url": "https://files.pythonhosted.org/packages/f4/20/893ff2c3399d9126730009ab1ebeb90f0295bbabb3ebe72b6e0bffc5908d/html2json-0.1.1.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "6e46ad683d70a6391b2f209e72b72100", "sha256": "0676aeb185fd8d9b16797f04e7278a89251a9348ffd02fba05df9fa9f549489e" }, "downloads": -1, "filename": "html2json-0.2.1.tar.gz", "has_sig": false, "md5_digest": "6e46ad683d70a6391b2f209e72b72100", 
"packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2515, "upload_time": "2018-02-09T16:45:27", "url": "https://files.pythonhosted.org/packages/88/85/bf7efbfd6ad211c12609c1fed5b206b79113d9fe503459f28006f4bc9f65/html2json-0.2.1.tar.gz" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "5ce7e9153413f60515194be7fdeb3371", "sha256": "aebbbfaad2362b58165103b36b30b7c7bc017a723c1b32e8679220276f2161b9" }, "downloads": -1, "filename": "html2json-0.2.2.tar.gz", "has_sig": false, "md5_digest": "5ce7e9153413f60515194be7fdeb3371", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2752, "upload_time": "2018-02-09T17:20:35", "url": "https://files.pythonhosted.org/packages/b3/d3/01ac6ab4c7ea22802ba9820501dfb161eda6a5264d57b41149416848c982/html2json-0.2.2.tar.gz" } ], "0.2.3": [ { "comment_text": "", "digests": { "md5": "a2966eae4aef6c92eb010e960a90aa36", "sha256": "abfaa467a90471cb8f28580013ecc4088bb993ea385df588af510291d012198a" }, "downloads": -1, "filename": "html2json-0.2.3.tar.gz", "has_sig": false, "md5_digest": "a2966eae4aef6c92eb010e960a90aa36", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3182, "upload_time": "2018-04-22T23:16:04", "url": "https://files.pythonhosted.org/packages/68/19/09c2c7e809eb381bb7fdeed9a1a90abd096f1be02051d23e50d23b3b0ec6/html2json-0.2.3.tar.gz" } ], "0.2.4": [ { "comment_text": "", "digests": { "md5": "cd53862070aef69a18783508306c76be", "sha256": "ef008c4526f03bac83aacb55d38cc7d32a48e46994ea7a53576cee04d884a8c9" }, "downloads": -1, "filename": "html2json-0.2.4.tar.gz", "has_sig": false, "md5_digest": "cd53862070aef69a18783508306c76be", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3335, "upload_time": "2018-04-25T03:53:28", "url": "https://files.pythonhosted.org/packages/31/ed/37407038aff20c2e8d67162fc6cb849107f1809f9a7910267dd3b8594c17/html2json-0.2.4.tar.gz" } ], "0.2.4.1": [ { "comment_text": 
"", "digests": { "md5": "ff4134d541b2fef9bee0fc57a214e2f4", "sha256": "e35ab1e7a62938c990f59933a295066f4083be4404091259d9e06d493bc79a31" }, "downloads": -1, "filename": "html2json-0.2.4.1.tar.gz", "has_sig": false, "md5_digest": "ff4134d541b2fef9bee0fc57a214e2f4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4042, "upload_time": "2018-07-02T16:10:46", "url": "https://files.pythonhosted.org/packages/46/44/06f3b08dac69528c7d6c9ae5415863c26415cd916c4f0cc57e29610c1f02/html2json-0.2.4.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "ff4134d541b2fef9bee0fc57a214e2f4", "sha256": "e35ab1e7a62938c990f59933a295066f4083be4404091259d9e06d493bc79a31" }, "downloads": -1, "filename": "html2json-0.2.4.1.tar.gz", "has_sig": false, "md5_digest": "ff4134d541b2fef9bee0fc57a214e2f4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4042, "upload_time": "2018-07-02T16:10:46", "url": "https://files.pythonhosted.org/packages/46/44/06f3b08dac69528c7d6c9ae5415863c26415cd916c4f0cc57e29610c1f02/html2json-0.2.4.1.tar.gz" } ] }