{ "info": { "author": "jhyao", "author_email": "yaojinhonggg@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# Template Parser HTML\nThis tool can help get useful data from html web page. It parses html page with the template file which marks data that you need with special attributes. The template file positions html blocks that contain data, and describes types, names and structures of data. You can modify an example page file to get it, or write a basic html structure that can position your data. I suggest you to use the first method, the tool can delete irrelevant parts and organize html tree automatically.\n## install\n```\npip install tp_html\n```\n## How to use\n```python\nfrom tp_html import Template, ThtmlParser\n\n# get template\ntemplate = Template(template_file='samples/basic_template.html')\n\n# save template\ntemplate.save('samples/basic_template.min.html')\n\n# get parser\nparser = ThtmlParser(template_file='samples/basic_template.html')\nparser = ThtmlParser(template_text='...')\nparser = ThtmlParser(template=template)\n\n# parse data\ndata = parser.parse(page_file='samples/basic_sample.html', encoding='urf-8')\ndata = parser.parse(page_url='http://.....')\ndata = parser.parse(page_text='.....')\n```\n## Template file\n### string\nTo get data from content or attributes of element.\n```html\nlink\n```\nTo get content. This will get data {'name': 'link'} \n```html\n\n```\nTo get href. This will get data {'name': '...'}\n```html\n\n```\n### list\nFor HTML\n```html\n\n```\ntemplate:\n```html\n\n```\ndata:\n```json\n{\n \"images\": [\n \"/image/1\",\n \"/image/2\",\n \"/image/3\",\n \"/image/4\"\n ]\n}\n```\nIn list template, in the element which is marked with p-type=list, require one child p-value node and just one that is for selecting item data. If list item is dict or list, structure in item is also allowed.\n### dict\nFor HTML\n```html\n
\n xxx\n

20

\n
\n

10

\n

20

\n
\n
\n```\ntemplate:\n```html\n
\n \n

\n
\n

\n

\n
\n
\n```\ndata:\n```json\n{\n \"user_link\": {\n \"name\": \"xxx\",\n \"link\": \"/user/13456\",\n \"title\": \"user xxx\",\n \"age\": \"20\",\n \"fans_num\": \"10\",\n \"follow_num\": \"20\"\n }\n}\n```\nIn dict template, p-name is required for key of dictionary. Multiple p-item is allowed, split with space, and \"string\" means content of element, others items are attributies name.\n## complex nesting\nhtml\n```html\n\n\n\n \n Title\n\n\n
\n
\n \n \n
\n
\n \n
\n xxx\n

20

\n
\n

10

\n

20

\n
\n
    \n
  • \n
  • \n
  • \n
  • \n
\n
\n
\n
\n \n
\n
\n xxx\n

20

\n
\n
\n

10

\n

20

\n
\n
\n
\n
\n \n \n
\n
\n\n\n```\ntemplate:\n```html\n
\n
\n \n \n
\n
\n \n
\n \n

\n
\n

\n

\n
\n
    \n
  • \n \n
  • \n
\n
\n
\n
\n \n
\n
\n \n

\n
\n
\n

\n

\n
\n
\n
\n
\n \n \n
\n
\n```\ndata:\n```json\n{\n \"user_list\": [\n {\n \"name\": \"xxx\",\n \"link\": \"/user/1\",\n \"title\": \"user xxx\",\n \"age\": \"20\",\n \"fans_num\": \"10\",\n \"follow_num\": \"20\"\n },\n {\n \"name\": \"yyy\",\n \"link\": \"/user/2\",\n \"title\": \"user yyy\",\n \"age\": \"10\",\n \"fans_num\": \"10\",\n \"follow_num\": \"20\"\n }\n ],\n \"user_all\": {\n \"name\": \"xxx\",\n \"link\": \"/user/1\",\n \"title\": \"user xxx\",\n \"age\": \"20\",\n \"fans_num\": \"10\",\n \"follow_num\": \"20\",\n \"images\": [\n \"/image/1\",\n \"/image/2\",\n \"/image/3\",\n \"/image/4\"\n ]\n },\n \"user_info\": {\n \"profile\": {\n \"name\": \"xxx\",\n \"link\": \"/user/1\",\n \"title\": \"user xxx\",\n \"age\": \"20\"\n },\n \"counts\": {\n \"fans_num\": \"10\",\n \"follow_num\": \"20\"\n }\n },\n \"image_tags\": [\n [\n \"tag1-1\",\n \"tag1-2\",\n \"tag1-3\"\n ],\n [\n \"tag2-1\",\n \"tag2-2\",\n \"tag2-3\"\n ],\n [\n \"tag3-1\",\n \"tag3-2\",\n \"tag3-3\"\n ]\n ]\n}\n```\n# p-data tag in template\np-data is a empty html tag using in template to organize data structure. In the example below, four elements are in the same level. But with p-data tag, they can have more structure. \nHTML:\n```html\n
\n xxx\n

20

\n

10

\n

20

\n
\n```\ntemplate:\n```html\n
\n \n \n

\n
\n \n

\n

\n
\n
\n```\ndata:\n```json\n{\n \"user_info\": {\n \"profile\": {\n \"name\": \"xxx\",\n \"link\": \"/user/1\",\n \"title\": \"user xxx\",\n \"age\": \"20\"\n },\n \"counts\": {\n \"fans_num\": \"10\",\n \"follow_num\": \"20\"\n }\n }\n}\n```\n# Save template\nThe tool also provide a method to save minimal template into a file. It will have a faster template building speed with the minimal template file.\n```python\ntemplate = TemplateParser(template_file='samples/pixiv_user_template.html')\ntemplate.save('samples/pixiv_user_template.min.html')\nparser = WebPageParser(template_file='samples/pixiv_user_template.min.html')\ndata = parser.parse(page_file='samples/pixiv_user.html')\n```\nminimal template:\n```html\n body > div#wrapper > div.layout-a\">\n div._unit > div.works_area.profile > div.works_info > div.worksOption.profile-page > div.worksListOthers > div.works-illust > ul._image-items.no-user\" p-value=\"true\" p-type=\"list\" p-name=\"image-list\">\n \n \n img._thumbnail.ui-scroll-view\" p-value=\"true\" p-name=\"illust_id tags\" p-item=\"data-id data-tags\">\n \n \n \n div.ui-layout-west > div._user-profile-card > div.profile > a.user-name\" p-value=\"true\" p-type=\"string\" p-name=\"name\">\n\n```\n# Test\nTime test for templates in samples folder. Test tool is [timefunc](https://github.com/jhyao/functime) \ntest code:\n```python\nparser = ThtmlParser(template_file='samples/basic_template.html')\nfunctime.func_time(ThtmlParser, template_file='samples/basic_template.html')\nfunctime.func_time(parser.parse, page_file='samples/basic_sample.html')\n```\nresult:\n```\nbasic_template.html\nTemplateParser AVG(1000): 1.251ms\nparse AVG(100): 3.2599ms\n\ncomplex_template.html\nTemplateParser AVG(1000): 1.6005ms\nparse AVG(100): 6.265ms\n\npixiv_user_template.html\nTemplateParser AVG(10): 24.7ms\nparse AVG(10): 34.4997ms\n\npixiv_user_template.min.html\nTemplateParser AVG(1000): 760.001183us\nparse AVG(10): 31.2003ms\n```\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/jhyao/tp_html", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "tp-html", "package_url": "https://pypi.org/project/tp-html/", "platform": "", "project_url": "https://pypi.org/project/tp-html/", "project_urls": { "Homepage": "https://github.com/jhyao/tp_html" }, "release_url": "https://pypi.org/project/tp-html/0.1.2/", "requires_dist": [ "beautifulsoup4" ], "requires_python": "", "summary": "Get data from html", "version": "0.1.2" }, "last_serial": 4646319, "releases": { "0.0.2": [ { "comment_text": "", "digests": { "md5": "1f09c145693e521fd7617bf195736986", "sha256": "afd3eb56ecf4793b156a43a28ebe206be700b84ad314413e7fd7a872c3489daa" }, "downloads": -1, "filename": "tp_html-0.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "1f09c145693e521fd7617bf195736986", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 9252, "upload_time": "2018-06-15T13:11:05", "url": "https://files.pythonhosted.org/packages/35/d6/52652f170fa80cbb480f417304060187930e19da5b5737dba5ff6fb360ea/tp_html-0.0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "bfdfaee994655364f8b892427ad32fb3", "sha256": "0eeb5a37d84bc65ccf469b185e5403d250c263e44d70e7aad2a12d5fad328250" }, "downloads": -1, "filename": "tp_html-0.0.2.tar.gz", "has_sig": false, "md5_digest": "bfdfaee994655364f8b892427ad32fb3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8655, "upload_time": "2018-06-15T13:11:06", "url": "https://files.pythonhosted.org/packages/0f/de/32d169fa623efeb3e4db70380388453374dc09aeaa1dff2acf4ed1dbc598/tp_html-0.0.2.tar.gz" } ], "0.1.0": [ { "comment_text": "", "digests": { "md5": "c98b221e94bc1b663ea2e6fc365d74cd", "sha256": "aeab9dd0c3dcc43e54ef857044eb0da62521034ab46f321ad1472b5536b5843a" }, "downloads": -1, "filename": "tp_html-0.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "c98b221e94bc1b663ea2e6fc365d74cd", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7319, "upload_time": "2018-12-30T07:35:30", "url": "https://files.pythonhosted.org/packages/06/a9/15db4eed1d26c4d3b00de808eabd028c340d5e09d8a49f4a86b99332f033/tp_html-0.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "3ca11d9c383695c7fcfa9c47781afbc7", "sha256": "d4a129c8e8941646412508f957394c1e3b29dd9b40112edeafe30cff07643e9f" }, "downloads": -1, "filename": "tp_html-0.1.0.tar.gz", "has_sig": false, "md5_digest": "3ca11d9c383695c7fcfa9c47781afbc7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8895, "upload_time": "2018-12-30T07:35:31", "url": "https://files.pythonhosted.org/packages/32/be/434f16c128abe517c247ef628f27561a094ae9f26bd3418505c8c29a1916/tp_html-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "a3cf3cff6a4b0defd5a0aa5404949feb", "sha256": "093db04dd4c207e4d68a50adc47561f12457167d5723084e201ef133f2cd0ba4" }, "downloads": -1, "filename": "tp_html-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "a3cf3cff6a4b0defd5a0aa5404949feb", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7330, "upload_time": "2018-12-30T08:28:03", "url": "https://files.pythonhosted.org/packages/c9/49/9c427a0da029709aabd346fad1661c7d948442fbca5407bbee1984082cd0/tp_html-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "7c68938950f658d8944e6c965e3e70c0", "sha256": "2e905d8b22905523fa308a6fce23697561d67cf5a98d66d09e066e2795b0ffc2" }, "downloads": -1, "filename": "tp_html-0.1.1.tar.gz", "has_sig": false, "md5_digest": "7c68938950f658d8944e6c965e3e70c0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8901, "upload_time": "2018-12-30T08:28:06", "url": "https://files.pythonhosted.org/packages/a2/68/8b454909fdce8543245fe9f5fa64577c72029529df07363cb8afb1b78f43/tp_html-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "b1ca2c55474fde8e25de531a7ffd1343", "sha256": "979ca7fe71425248cd8c8ade7897617191531883601fd5cbf5616331d986526b" }, "downloads": -1, "filename": "tp_html-0.1.2-py3-none-any.whl", "has_sig": false, "md5_digest": "b1ca2c55474fde8e25de531a7ffd1343", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7350, "upload_time": "2018-12-30T14:34:14", "url": "https://files.pythonhosted.org/packages/9b/07/bcb5708414d6c8c139a4e17f9ed95b2622ca8cc1e42c753427a9bd85136b/tp_html-0.1.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9c17b26da8cc3fa93a569fff780a775f", "sha256": "e814fea9529b7960e2933af3717210bea8e5bcbf0fdfcaacb7395cfccebd62aa" }, "downloads": -1, "filename": "tp_html-0.1.2.tar.gz", "has_sig": false, "md5_digest": "9c17b26da8cc3fa93a569fff780a775f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8931, "upload_time": "2018-12-30T14:34:15", "url": "https://files.pythonhosted.org/packages/9f/b8/9b8dc50a4aefce729825a33d45115185e853ff191d5863a9613d202171d5/tp_html-0.1.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "b1ca2c55474fde8e25de531a7ffd1343", "sha256": "979ca7fe71425248cd8c8ade7897617191531883601fd5cbf5616331d986526b" }, "downloads": -1, "filename": "tp_html-0.1.2-py3-none-any.whl", "has_sig": false, "md5_digest": "b1ca2c55474fde8e25de531a7ffd1343", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7350, "upload_time": "2018-12-30T14:34:14", "url": "https://files.pythonhosted.org/packages/9b/07/bcb5708414d6c8c139a4e17f9ed95b2622ca8cc1e42c753427a9bd85136b/tp_html-0.1.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9c17b26da8cc3fa93a569fff780a775f", "sha256": "e814fea9529b7960e2933af3717210bea8e5bcbf0fdfcaacb7395cfccebd62aa" }, "downloads": -1, "filename": "tp_html-0.1.2.tar.gz", "has_sig": false, "md5_digest": "9c17b26da8cc3fa93a569fff780a775f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8931, "upload_time": "2018-12-30T14:34:15", "url": "https://files.pythonhosted.org/packages/9f/b8/9b8dc50a4aefce729825a33d45115185e853ff191d5863a9613d202171d5/tp_html-0.1.2.tar.gz" } ] }