{ "info": { "author": "Albert Weichselbraun, Fabian Odoni", "author_email": "albert.weichselbraun@fhgr.ch, fabian.odoni@fhgr.ch", "bugtrack_url": null, "classifiers": [ "Programming Language :: Python :: 3", "Topic :: Text Processing :: Markup :: HTML" ], "description": "# inscriptis\n\n[![Build Status](https://www.travis-ci.org/weblyzard/inscriptis.png?branch=master)](https://www.travis-ci.org/weblyzard/inscriptis)\n\nA python based HTML to text conversion library, command line client and Web service with support for nested tables and a subset of CSS.\nPlease take a look at the [Rendering](https://github.com/weblyzard/inscriptis/blob/master/RENDERING.md) document for a demonstration of inscriptis' conversion quality.\n\n##### Table of Contents\n1. [Requirements and installation](#requirements-and-installation)\n2. [Command line client](#command-line-client)\n3. [Python library](#python-library)\n4. [Web service](#flask-web-service)\n5. [Fine tuning](#fine-tuning)\n6. [Testing, benchmarking and evaluation](#testing-benchmarking-and-evaluation)\n7. 
[Changelog](#changelog)\n\n## Requirements and installation\n\n### Requirements\n* Python 3.5+ (preferred) or Python 2.7+\n* lxml\n* requests\n\n### Installation\n``` {.sourceCode .bash}\nsudo python3 setup.py install\n```\n\n## Command line client\nThe command line client converts HTML files or HTML retrieved from Web pages to the\ncorresponding text representation.\n\n\n### Command line parameters\n\n``` {.sourceCode .bash}\nusage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] [-i] [-l] [-d] input\n\nConverts HTML from file or url to a clean text version\n\npositional arguments:\n input Html input either from a file or an url (default:stdin)\n\noptional arguments:\n -h, --help show this help message and exit\n -o OUTPUT, --output OUTPUT\n Output file (default:stdout).\n -e ENCODING, --encoding ENCODING\n Content encoding for files (default:utf-8)\n -i, --display-image-captions\n Display image captions (default:false).\n -l, --display-link-targets\n Display link targets (default:false).\n -d, --deduplicate-image-captions\n Deduplicate image captions (default:false).\n --indentation\n How to handle indentation (extended or standard; default: extended)\n```\n\n### Examples\n\n```\n# convert the given page to text and output the result to the screen\ninscript.py https://www.fhgr.ch\n\n# convert the file to text and save the output to fhgr.txt\ninscript.py fhgr.html -o fhgr.txt\n\n# convert the HTML provided via stdin and save the output to output.txt\necho '

<body><p>Make it so!</p></body
>' | inscript.py -o output.txt \n```\n\n\n## Python library\n\nEmbedding inscriptis into your code is easy, as outlined below:\n\n```python\nimport urllib.request\nfrom inscriptis import get_text\n\nurl = \"http://www.informationscience.ch\"\nhtml = urllib.request.urlopen(url).read().decode('utf-8')\n\ntext = get_text(html)\n\nprint(text)\n```\n\n## Flask Web Service\n\nThe Flask Web Service translates HTML pages to the corresponding plain text. \n\n### Additional Requirements\n\n* python3-flask\n\n### Startup\n\n``` {.sourceCode .bash}\nexport FLASK_APP=\"web-service.py\"\npython3 -m flask run\n```\n\n### Usage\nThe Web service receives the HTML file in the request body and returns the corresponding text. The file's encoding needs to be specified \nin the `Content-Type` header (`UTF-8` in the example below).\n\n``` {.sourceCode .bash}\ncurl -X POST -H \"Content-Type: text/html; encoding=UTF8\" -d @test.html http://localhost:5000/get_text\n```\n\n## Fine tuning\n\nThe following options are available for fine tuning the way inscriptis translates HTML to text.\n\n1. **More rigorous indentation:** call `inscriptis.get_text()` with the parameter `indentation='extended'` to also use indentation for tags such as `<div>` and `<span>` that do not provide indentation in their standard definition. This strategy is the default in `inscript.py` and many other tools such as lynx. If you do not want extended indentation you can use the parameter `indentation='standard'` instead.\n\n2. **Overwriting the default CSS definition:** inscriptis uses CSS definitions that are maintained in `inscriptis.css.CSS` for rendering HTML tags. You can override these definitions (and therefore change the rendering) as outlined below:\n\n ```python\n from lxml.html import fromstring\n\n from inscriptis.css import DEFAULT_CSS, HtmlElement\n from inscriptis.html_properties import Display\n\n # create a custom CSS based on the default style sheet and change the rendering of `div` and `span` elements\n css = DEFAULT_CSS.copy()\n css['div'] = HtmlElement('div', display=Display.block, padding=2)\n css['span'] = HtmlElement('span', prefix=' ', suffix=' ')\n\n html_tree = fromstring(html)\n # create a parser using the custom css\n parser = Inscriptis(html_tree,\n display_images=display_images,\n deduplicate_captions=deduplicate_captions,\n display_links=display_links,\n css=css)\n text = parser.get_text()\n ```\n\n## Testing, benchmarking and evaluation\n\n### Unit tests\n\nTest cases concerning the html to text conversion are located in the `tests/html` directory and consist of two files:\n\n 1. `test-name.html` and\n 2. 
`test-name.txt`\n\nthe latter containing the reference text output for the given HTML file.\n\n### Text conversion output comparison and speed benchmarking\ninscriptis offers a small benchmarking script that can compare different HTML to text conversion approaches.\nThe script runs the different approaches on a list of URLs, `url_list.txt`, and saves the text output into a time-stamped folder in `benchmarking/benchmarking_results` for manual comparison.\nAdditionally, the processing speed of every approach per URL is measured and saved in a text file called `speed_comparisons.txt` in the respective time-stamped folder.\n\nTo run the benchmarking script, execute `run_benchmarking.py` from within the folder `benchmarking`.\nIn `def pipeline()`, set which HTML -> text algorithms are executed by modifying\n```python\nrun_lynx = True\nrun_justext = True\nrun_html2text = True\nrun_beautifulsoup = True\nrun_inscriptis = True\n```\n\nThe URLs to be parsed are specified in `url_list.txt`, one per line with no additional formatting. 
URLs need to be complete (including http:// or https://)\ne.g.\n```\nhttp://www.informationscience.ch\nhttps://en.wikipedia.org/wiki/Information_science\n...\n```\n\n## Changelog\n\nsee [Release notes](https://github.com/weblyzard/inscriptis/releases).\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/weblyzard/inscriptis", "keywords": "", "license": "GPL2", "maintainer": "", "maintainer_email": "", "name": "inscriptis", "package_url": "https://pypi.org/project/inscriptis/", "platform": "", "project_url": "https://pypi.org/project/inscriptis/", "project_urls": { "Homepage": "http://github.com/weblyzard/inscriptis" }, "release_url": "https://pypi.org/project/inscriptis/0.0.4.1.1/", "requires_dist": [ "lxml", "requests" ], "requires_python": "", "summary": "inscriptis - HTML to text converter.", "version": "0.0.4.1.1" }, "last_serial": 5885264, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "a337999507192693b00482a427d6c01d", "sha256": "1f84a84cbb9a74a8091ab3ecb2356bd04ba31b45a7e7267f40e5a49333fe0192" }, "downloads": -1, "filename": "inscriptis-0.0.1-py2.7.egg", "has_sig": false, "md5_digest": "a337999507192693b00482a427d6c01d", "packagetype": "bdist_egg", "python_version": "2.7", "requires_python": null, "size": 944, "upload_time": "2018-12-21T14:37:54", "url": "https://files.pythonhosted.org/packages/2c/1b/70a805d939b56c0a3a7f95330f935851dc83cd5e74627cbad53488c78ab2/inscriptis-0.0.1-py2.7.egg" }, { "comment_text": "", "digests": { "md5": "2b742aaf48012c1c8a6c0649106fc516", "sha256": "1a876c9e4d4747184276569d16dfc62593fd8d7308382247a7e1575321ae005e" }, "downloads": -1, "filename": "inscriptis-0.0.1-py3.4.egg", "has_sig": false, "md5_digest": "2b742aaf48012c1c8a6c0649106fc516", "packagetype": "bdist_egg", "python_version": "3.4", "requires_python": null, "size": 11631, "upload_time": "2018-12-21T14:37:56", 
"url": "https://files.pythonhosted.org/packages/4f/64/baa652472f59336aa96d5ffd86464a67e147bdcbf813710ea797c16c378d/inscriptis-0.0.1-py3.4.egg" }, { "comment_text": "", "digests": { "md5": "f94c7f19d8fde6e50489209f55e7ba29", "sha256": "1014674aa71d13b1aaaf917cda1b2a995a6b25f029305c927be1127d2e530833" }, "downloads": -1, "filename": "inscriptis-0.0.1.tar.gz", "has_sig": false, "md5_digest": "f94c7f19d8fde6e50489209f55e7ba29", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5888, "upload_time": "2016-03-02T18:21:56", "url": "https://files.pythonhosted.org/packages/33/d3/5e7f06798f252564d28deec223b396682c5d40327fa77fa0b2886513b2ed/inscriptis-0.0.1.tar.gz" } ], "0.0.2.1": [ { "comment_text": "", "digests": { "md5": "4fb38ea3688cb33ec71f95def5e9e177", "sha256": "16d0a63b5de47f05e43b65046712a9199fe0978e535a54503e53864b1fc72486" }, "downloads": -1, "filename": "inscriptis-0.0.2.1.tar.gz", "has_sig": false, "md5_digest": "4fb38ea3688cb33ec71f95def5e9e177", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5938, "upload_time": "2017-09-20T13:12:50", "url": "https://files.pythonhosted.org/packages/de/95/a2e4e2d7f972dd3903348556b90c9f6a5ee88283b70b6bd22d1070d5c3bc/inscriptis-0.0.2.1.tar.gz" } ], "0.0.3.0": [ { "comment_text": "", "digests": { "md5": "fb9679d8b5867284df7e5385a0eeee9c", "sha256": "4e820ad1ea37daaf3c1a647d96d63b95f584b7aa464111e4144a000cfe5deed9" }, "downloads": -1, "filename": "inscriptis-0.0.3.0-py3.6.egg", "has_sig": false, "md5_digest": "fb9679d8b5867284df7e5385a0eeee9c", "packagetype": "bdist_egg", "python_version": "3.6", "requires_python": null, "size": 18223, "upload_time": "2018-12-21T14:37:57", "url": "https://files.pythonhosted.org/packages/06/09/2158edbc381278c8814088e63a54c49558fef7184170b71ddca2972a8e53/inscriptis-0.0.3.0-py3.6.egg" }, { "comment_text": "", "digests": { "md5": "7855cc39b00a76d88199f785a9cc7691", "sha256": 
"9a635045f13ea03ef28f84d6f55e2c3af7c12fe5f65313f79bf06707a7fd8dc8" }, "downloads": -1, "filename": "inscriptis-0.0.3.0.tar.gz", "has_sig": false, "md5_digest": "7855cc39b00a76d88199f785a9cc7691", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6980, "upload_time": "2017-11-02T09:22:18", "url": "https://files.pythonhosted.org/packages/7e/42/8fb9f2cd9533f46c3911785709e13885c223bf4abde1c09645b95dc56a70/inscriptis-0.0.3.0.tar.gz" } ], "0.0.3.1": [ { "comment_text": "", "digests": { "md5": "d509302f6c6410b739dfd8979b2e9757", "sha256": "d9fa5d65a3c3f9931989b245a21fc0a1befbefc4ee4fbaaec90c3b60e429cef6" }, "downloads": -1, "filename": "inscriptis-0.0.3.1.tar.gz", "has_sig": false, "md5_digest": "d509302f6c6410b739dfd8979b2e9757", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6987, "upload_time": "2017-11-08T15:32:39", "url": "https://files.pythonhosted.org/packages/ec/7d/80e40906fb32a8b737dc6e60e6f2775df145da17a2f9312b46d504487ae8/inscriptis-0.0.3.1.tar.gz" } ], "0.0.3.2": [ { "comment_text": "", "digests": { "md5": "e94c0ed87334ffe4ccb7df67fa29a9b6", "sha256": "a9657e59bd71d7424a667e319279e36d494827c1d561ddc01e2df834946814b3" }, "downloads": -1, "filename": "inscriptis-0.0.3.2.tar.gz", "has_sig": false, "md5_digest": "e94c0ed87334ffe4ccb7df67fa29a9b6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7085, "upload_time": "2017-11-24T09:13:02", "url": "https://files.pythonhosted.org/packages/b8/60/16cc7861c1cc364d38869428618943e5ecce9bb0eec8195b00a46061cab7/inscriptis-0.0.3.2.tar.gz" } ], "0.0.3.5": [ { "comment_text": "", "digests": { "md5": "3e6167a45933320a04706ed706a08dfb", "sha256": "75116fd6c426b2052668de7936364803743b0adfa0f41b3fa7f8cef5d6ff3ba7" }, "downloads": -1, "filename": "inscriptis-0.0.3.5-py3.6.egg", "has_sig": false, "md5_digest": "3e6167a45933320a04706ed706a08dfb", "packagetype": "bdist_egg", "python_version": "3.6", "requires_python": null, 
"size": 21365, "upload_time": "2018-12-21T14:39:30", "url": "https://files.pythonhosted.org/packages/db/cf/e11dd7c7b1fda9ba686cfa2d2c0843ff20e9e920fb2d4af3346f66fb5f33/inscriptis-0.0.3.5-py3.6.egg" }, { "comment_text": "", "digests": { "md5": "3b85c425d7d216bf0abac7acace70cf5", "sha256": "a73a9e535f561c70152d3bf6137929fba0d2312962c017fa7e47bbe0da4750b2" }, "downloads": -1, "filename": "inscriptis-0.0.3.5-py3-none-any.whl", "has_sig": false, "md5_digest": "3b85c425d7d216bf0abac7acace70cf5", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 13756, "upload_time": "2018-12-12T11:36:30", "url": "https://files.pythonhosted.org/packages/d5/c7/16cce7340beb09c0fbf952dc5deefb1e612694b2f43733648674181c3b4a/inscriptis-0.0.3.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "fc8ff103553398ef7272080817edf664", "sha256": "0bc03435a5091222d8a7e79d73efa2af3ba080cb188b49e3928d49ca497074ab" }, "downloads": -1, "filename": "inscriptis-0.0.3.5.tar.gz", "has_sig": false, "md5_digest": "fc8ff103553398ef7272080817edf664", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9501, "upload_time": "2018-12-12T11:39:07", "url": "https://files.pythonhosted.org/packages/bc/c7/c2ee33777ff9edd5f711eba1b5da6034477ca45323478b3c86cb6c3b21d2/inscriptis-0.0.3.5.tar.gz" } ], "0.0.3.7": [ { "comment_text": "", "digests": { "md5": "976f51f19ba94cf94be9b8c54ef58850", "sha256": "341890a56a695437103c02b2e40443099acb422808169bc7c350122da322de20" }, "downloads": -1, "filename": "inscriptis-0.0.3.7-py3-none-any.whl", "has_sig": false, "md5_digest": "976f51f19ba94cf94be9b8c54ef58850", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 13760, "upload_time": "2018-12-21T14:37:53", "url": "https://files.pythonhosted.org/packages/7b/8a/99143844810acc084684d74adb6c812a6f822d7da08b4ab93689425a49bd/inscriptis-0.0.3.7-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": 
"6d2627ac82322c97f2e89882a0e6d097", "sha256": "2f2831468e46e2971180e8137eafeb4d694f9602d1d2b7370c3c31dd8f37d94e" }, "downloads": -1, "filename": "inscriptis-0.0.3.7.tar.gz", "has_sig": false, "md5_digest": "6d2627ac82322c97f2e89882a0e6d097", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11211, "upload_time": "2018-12-21T14:39:31", "url": "https://files.pythonhosted.org/packages/b6/cf/ac1248ef14bc54dd3522b5eb65a6ffe327460aa0604b396236a8ac60bef7/inscriptis-0.0.3.7.tar.gz" } ], "0.0.3.8": [ { "comment_text": "", "digests": { "md5": "24a0b74c41f059a53ba6c4ef509db0a8", "sha256": "c26ed6702910be261f4918b7fe5c657a295a2327d7216deaf4bb30b0a608fa98" }, "downloads": -1, "filename": "inscriptis-0.0.3.8-py3-none-any.whl", "has_sig": false, "md5_digest": "24a0b74c41f059a53ba6c4ef509db0a8", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 17454, "upload_time": "2019-01-31T14:04:33", "url": "https://files.pythonhosted.org/packages/ed/08/08c8406897392afdefa23125274542263a2ea48e10f82c49a98e4eba0767/inscriptis-0.0.3.8-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f859a8fa52a7d03626602d0e5a7ff9d2", "sha256": "2088a1213e35d9994f02b11742633942cc59531a2fc134b21271d146ef87349c" }, "downloads": -1, "filename": "inscriptis-0.0.3.8.tar.gz", "has_sig": false, "md5_digest": "f859a8fa52a7d03626602d0e5a7ff9d2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11171, "upload_time": "2019-01-31T14:04:35", "url": "https://files.pythonhosted.org/packages/b1/52/7a5953b90acf71ba3d1e1e4dff9e3bf11bb4ce2b65f1f3cae8ad8656c291/inscriptis-0.0.3.8.tar.gz" } ], "0.0.4.0": [ { "comment_text": "", "digests": { "md5": "96f5dc107d1bc05ac960c5fbb29a0fef", "sha256": "5cb1e62687f6c3d5c413247fb4a0c431a299dddc919d1cd9bc7d9728bce7249d" }, "downloads": -1, "filename": "inscriptis-0.0.4.0-py3-none-any.whl", "has_sig": false, "md5_digest": "96f5dc107d1bc05ac960c5fbb29a0fef", "packagetype": 
"bdist_wheel", "python_version": "py3", "requires_python": null, "size": 17747, "upload_time": "2019-02-26T09:31:51", "url": "https://files.pythonhosted.org/packages/08/49/7be90738937e47edc283ac008ccd48927e307e4a79237e01ef05956d5d62/inscriptis-0.0.4.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "378d5675e6a0b9b2d692fccf0ef7e186", "sha256": "33dc14343d859fc0f1f0118145b6c630d77ffd822f51197c57613dbe91885717" }, "downloads": -1, "filename": "inscriptis-0.0.4.0.tar.gz", "has_sig": false, "md5_digest": "378d5675e6a0b9b2d692fccf0ef7e186", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11527, "upload_time": "2019-02-26T09:31:53", "url": "https://files.pythonhosted.org/packages/2b/55/96126110c052d1b3e5ad8e597dfd86d400a70b062d4bcbd3c1aac50a1784/inscriptis-0.0.4.0.tar.gz" } ], "0.0.4.1": [ { "comment_text": "", "digests": { "md5": "d16436dcf015f46fb2c12626b500ef98", "sha256": "94f311ea1f82cefe18a221fb39ef42c46c899770f966a63fe4bcca87ef9766f2" }, "downloads": -1, "filename": "inscriptis-0.0.4.1-py3-none-any.whl", "has_sig": false, "md5_digest": "d16436dcf015f46fb2c12626b500ef98", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 18842, "upload_time": "2019-09-25T13:05:31", "url": "https://files.pythonhosted.org/packages/a5/f0/5caccb0f2ba77d0c1f88205d1220c7dc432089a47dd45160dbe92dc6d474/inscriptis-0.0.4.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "ffcfc7a0ca909e111e3c493b36d41322", "sha256": "3e88d5c74506eecec73b103bebdab1d191c8d51b2657dbf7f86396c37c416a76" }, "downloads": -1, "filename": "inscriptis-0.0.4.1.tar.gz", "has_sig": false, "md5_digest": "ffcfc7a0ca909e111e3c493b36d41322", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13255, "upload_time": "2019-09-25T13:05:33", "url": "https://files.pythonhosted.org/packages/e4/ee/45c966a803cfc310a4a4be30615aea92930bc71748ca8c5e78e005b9faf4/inscriptis-0.0.4.1.tar.gz" } ], 
"0.0.4.1.1": [ { "comment_text": "", "digests": { "md5": "524a1747658ad89e5d6f07aa4694b3bd", "sha256": "b3f77c773bb0453c8f775c508ca1868650c6268dcfdd153528401fd2b051ea9b" }, "downloads": -1, "filename": "inscriptis-0.0.4.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "524a1747658ad89e5d6f07aa4694b3bd", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 18877, "upload_time": "2019-09-25T13:19:21", "url": "https://files.pythonhosted.org/packages/26/43/5668de7fe85bb1793285375cf3eeb24e0623b502c42c3f66e529480e8533/inscriptis-0.0.4.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9fac492129de5c47f1fa5cd0c9a11b51", "sha256": "45d1c423ad4fb003096d418acd6f858a25e4fedf86a8e8fb63288d941914bda1" }, "downloads": -1, "filename": "inscriptis-0.0.4.1.1.tar.gz", "has_sig": false, "md5_digest": "9fac492129de5c47f1fa5cd0c9a11b51", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13234, "upload_time": "2019-09-25T13:19:24", "url": "https://files.pythonhosted.org/packages/a6/83/225533b2755be6478bcf6022f4b4f6422fe7b1ad549c69cc12a32c6628f1/inscriptis-0.0.4.1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "524a1747658ad89e5d6f07aa4694b3bd", "sha256": "b3f77c773bb0453c8f775c508ca1868650c6268dcfdd153528401fd2b051ea9b" }, "downloads": -1, "filename": "inscriptis-0.0.4.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "524a1747658ad89e5d6f07aa4694b3bd", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 18877, "upload_time": "2019-09-25T13:19:21", "url": "https://files.pythonhosted.org/packages/26/43/5668de7fe85bb1793285375cf3eeb24e0623b502c42c3f66e529480e8533/inscriptis-0.0.4.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9fac492129de5c47f1fa5cd0c9a11b51", "sha256": "45d1c423ad4fb003096d418acd6f858a25e4fedf86a8e8fb63288d941914bda1" }, "downloads": -1, "filename": "inscriptis-0.0.4.1.1.tar.gz", "has_sig": false, 
"md5_digest": "9fac492129de5c47f1fa5cd0c9a11b51", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13234, "upload_time": "2019-09-25T13:19:24", "url": "https://files.pythonhosted.org/packages/a6/83/225533b2755be6478bcf6022f4b4f6422fe7b1ad549c69cc12a32c6628f1/inscriptis-0.0.4.1.1.tar.gz" } ] }