{ "info": { "author": "Ethan Huang", "author_email": "wfgydbu@163.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "License :: OSI Approved :: GNU General Public License (GPL)", "Operating System :: OS Independent", "Programming Language :: Python :: 3.6" ], "description": "# ohHTML2Markdown\n\nWhen crawling document-oriented websites such as news sites, Q&A sites like Quora, or even GitHub, you often want to save the articles. Markdown is arguably the best format for storing them, and some sites even render their pages from Markdown in the first place. `ohHTML2Markdown` converts an HTML fragment or a complete HTML file into a human-friendly .md file. 
Real-world HTML often uses tags carelessly; `ohHTML2Markdown` does its best to handle such documents, but the quality of the output is not guaranteed.\n\nCurrently supported HTML tags: `h1~h6`, `p`, `a`, `img`, `del`, `b`, `strong`, `i`, `em`, `hr`, `br`, `ul`, `ol`, `table`, `blockquote`, `code`, `pre`, `span`, `title`, `time`, `iframe`, `section`, `div`, `html`, `body`, `head`.\n\n`ohHTML2Markdown` is built mainly on the `BeautifulSoup` library, specifically its `html.parser` parser and the `.descendants` attribute. 
The parser converts an HTML file into a tree structure, and `.contents` then gives access to the root nodes of all subtrees directly below the current node; `.descendants` returns a generator that yields every node in the current subtree (the node itself is not included).\n\nBy checking how many nodes `.descendants` yields, we can tell whether a subtree has any descendants (note that this subtree may itself be a descendant of a tree one level up). Then, based on the type of the current tag, we decide whether to keep traversing depth-first or to convert it to its Markdown equivalent.\n\nFor more details about this library, see [\u53d1\u5e03\u81ea\u5df1\u7684Python\u5305 - ohHTML2Markdown](https://journal.ethanshub.com/post/category/gong-cheng-shi/-python-ohhtme2mardown) (the post is in Chinese).\n\n\n\n## Installation\n\n```\npip install ohHtmlToMarkdown\n```\n\n\n\n## Usage\n\n```\nimport ohHtml2Markdown as h2m\n\n# Read from string\nresult = h2m.Parser(\"<h1>h1</h1>\", h2m.Parser.STRING).convert()\n\n# Or read from file\nresult = h2m.Parser(\"test/test.html\", h2m.Parser.FILE).convert()\n\nwith open(\"test/out.md\", 'w', encoding='utf-8') as file:\n    file.write(result)\n```\n\n", 
"description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/wfgydbu/ohHTML2Markdown", "keywords": "", "license": "GNU GPL 3", "maintainer": "", "maintainer_email": "", "name": "ohHtmlToMarkdown", "package_url": "https://pypi.org/project/ohHtmlToMarkdown/", "platform": "", "project_url": "https://pypi.org/project/ohHtmlToMarkdown/", "project_urls": { "Homepage": "https://github.com/wfgydbu/ohHTML2Markdown" }, "release_url": "https://pypi.org/project/ohHtmlToMarkdown/1.0.0/", "requires_dist": [ "beautifulsoup4 (==4.6.0)" ], "requires_python": "", "summary": "A Python Library for converting HTML to Markdown.", "version": "1.0.0" }, "last_serial": 3916377, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "cf39d56be44c550664da356dd12a5673", "sha256": "cdfba31a76126b39ade003bb7368a37b522efa6b0806d44d203f379c8c0c5c27" }, "downloads": -1, "filename": "ohHtmlToMarkdown-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "cf39d56be44c550664da356dd12a5673", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5210, "upload_time": "2018-05-31T09:49:57", "url": "https://files.pythonhosted.org/packages/ac/82/65f7eea0d331b81cacf03d28fc27f617b5d4baba0971b61a9047e36dec43/ohHtmlToMarkdown-1.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "3e5c5ed24bec1acb19057b01f23d63e9", "sha256": "4b2d81e31a1d269ff5668b6ef3b787b486aae0970f33260b6fbbddc8c518f314" }, "downloads": -1, "filename": "ohHtmlToMarkdown-1.0.0.tar.gz", "has_sig": false, "md5_digest": "3e5c5ed24bec1acb19057b01f23d63e9", "packagetype": "sdist", "python_version": "source", 
"requires_python": null, "size": 5068, "upload_time": "2018-05-31T09:49:58", "url": "https://files.pythonhosted.org/packages/d3/81/37e1d77f0e12ae6f76db7d2ef4057acb13ffc4d897ae4843519f342d10b8/ohHtmlToMarkdown-1.0.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "cf39d56be44c550664da356dd12a5673", "sha256": "cdfba31a76126b39ade003bb7368a37b522efa6b0806d44d203f379c8c0c5c27" }, "downloads": -1, "filename": "ohHtmlToMarkdown-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "cf39d56be44c550664da356dd12a5673", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5210, "upload_time": "2018-05-31T09:49:57", "url": "https://files.pythonhosted.org/packages/ac/82/65f7eea0d331b81cacf03d28fc27f617b5d4baba0971b61a9047e36dec43/ohHtmlToMarkdown-1.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "3e5c5ed24bec1acb19057b01f23d63e9", "sha256": "4b2d81e31a1d269ff5668b6ef3b787b486aae0970f33260b6fbbddc8c518f314" }, "downloads": -1, "filename": "ohHtmlToMarkdown-1.0.0.tar.gz", "has_sig": false, "md5_digest": "3e5c5ed24bec1acb19057b01f23d63e9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5068, "upload_time": "2018-05-31T09:49:58", "url": "https://files.pythonhosted.org/packages/d3/81/37e1d77f0e12ae6f76db7d2ef4057acb13ffc4d897ae4843519f342d10b8/ohHtmlToMarkdown-1.0.0.tar.gz" } ] }