{ "info": { "author": "Albert Weichselbraun, Fabian Odoni", "author_email": "albert.weichselbraun@fhgr.ch, fabian.odoni@fhgr.ch", "bugtrack_url": null, "classifiers": [ "Programming Language :: Python :: 3", "Topic :: Text Processing :: Markup :: HTML" ], "description": "# inscriptis\n\nA Python-based HTML-to-text conversion library, command line client and Web service with support for nested tables and a subset of CSS.\nPlease take a look at the [Rendering](https://github.com/weblyzard/inscriptis/blob/master/RENDERING.md) document for a demonstration of inscriptis' conversion quality.\n\n##### Table of Contents\n1. [Requirements and installation](#requirements-and-installation)\n2. [Command line client](#command-line-client)\n3. [Python library](#python-library)\n4. [Web service](#flask-web-service)\n5. [Fine tuning](#fine-tuning)\n6. [Testing, benchmarking and evaluation](#testing-benchmarking-and-evaluation)\n7. [Changelog](#changelog)\n\n## Requirements and installation\n\n### Requirements\n* Python 3.5+ (preferred) or Python 2.7+\n* lxml\n* requests\n\n### Installation\n``` {.sourceCode .bash}\nsudo python3 setup.py install\n```\n\n## Command line client\nThe command line client converts HTML files or HTML retrieved from Web pages to the\ncorresponding text representation.\n\n\n### Command line parameters\n\n``` {.sourceCode .bash}\nusage: inscript.py [-h] [-o OUTPUT] [-e ENCODING] [-i] [-l] [-d] input\n\nConverts HTML from file or url to a clean text version\n\npositional arguments:\n input Html input either from a file or an url (default:stdin)\n\noptional arguments:\n -h, --help show this help message and exit\n -o OUTPUT, --output OUTPUT\n Output file (default:stdout).\n -e ENCODING, --encoding ENCODING\n Content encoding for files (default:utf-8)\n -i, --display-image-captions\n Display image captions (default:false).\n -l, --display-link-targets\n Display link targets (default:false).\n -d, 
--deduplicate-image-captions\n Deduplicate image captions (default:false).\n --indentation\n How to handle indentation (extended or standard; default: extended)\n```\n\n### Examples\n\n```\n# convert the given page to text and output the result to the screen\ninscript.py https://www.fhgr.ch\n\n# convert the file to text and save the output to fhgr.txt\ninscript.py fhgr.html -o fhgr.txt\n\n# convert the text provided via stdin and save the output to output.txt\necho '<body><p>Make it so!</p></body>' | inscript.py -o output.txt\n```\n\n\n## Python library\n\nEmbedding inscriptis into your code is easy, as outlined below:\n\n```python\nimport urllib.request\nfrom inscriptis import get_text\n\nurl = \"http://www.informationscience.ch\"\nhtml = urllib.request.urlopen(url).read().decode('utf-8')\n\ntext = get_text(html)\n\nprint(text)\n```\n\n## Flask Web Service\n\nThe Flask Web Service translates HTML pages to the corresponding plain text.\n\n### Additional Requirements\n\n* python3-flask\n\n### Startup\n\n``` {.sourceCode .bash}\nexport FLASK_APP=\"web-service.py\"\npython3 -m flask run\n```\n\n### Usage\nThe Web service receives the HTML file in the request body and returns the corresponding text. The file's encoding needs to be specified\nin the `Content-Type` header (`UTF-8` in the example below).\n\n``` {.sourceCode .bash}\ncurl -X POST -H \"Content-Type: text/html; encoding=UTF8\" -d @test.html http://localhost:5000/get_text\n```\n\n## Fine tuning\n\nThe following options are available for fine tuning the way inscriptis translates HTML to text.\n\n1. **More rigorous indentation:** call `inscriptis.get_text()` with the parameter `indentation='extended'` to also use indentation for tags such as `