{ "info": { "author": "Amandeep Singh", "author_email": "amandeep.s.saggu@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# ETK: Information Extraction Toolkit\n\nETK is a Python library for high-precision information extraction from many document formats.\nIt provides a flexible framework of **composable extractors** that lets you combine the **predefined extractors** shipped with ETK with custom extractors that you develop for your application.\nIt supports extraction from HTML pages, text documents, CSV and Excel files, and JSON documents.\nETK is open-source software, released under the MIT license.\n\n## Documentation\nRead the documentation [here](https://usc-isi-i2.github.io/etk/)\n\n## Features\n\n* Extraction from HTML, text, CSV, Excel, JSON\n* High-precision predefined extractors for common entities (dates, phones, email, cities, ...)\n* Extraction of microdata, schema.org and RDFa markup\n* Integration with [spaCy](https://github.com/explosion/spaCy) for text processing\n* Automatic identification and extraction of HTML tables containing data\n* Automatic identification and extraction of time series\n* Semi-automatic generation of Web wrappers\n* Scalable execution and management of extraction pipelines\n* Automatic provenance recording\n\n## Releases\n\n- [Source code](https://github.com/usc-isi-i2/etk/releases)\n- [Docker images](https://hub.docker.com/r/uscisii2/etk/tags/)\n\n## Installation\n\n
| Operating system: | macOS / OS X, Linux, Windows |
| Python version: | Python 3.6+ |