{ "info": { "author": "Stephen Larroque", "author_email": "lrq3000@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Console", "Environment :: MacOS X", "Environment :: Win32 (MS Windows)", "Environment :: X11 Applications", "Intended Audience :: End Users/Desktop", "License :: OSI Approved :: MIT License", "Operating System :: MacOS :: MacOS X", "Operating System :: Microsoft :: Windows", "Operating System :: POSIX", "Operating System :: POSIX :: BSD", "Operating System :: POSIX :: BSD :: FreeBSD", "Operating System :: POSIX :: Linux", "Operating System :: POSIX :: SunOS/Solaris", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: Implementation :: PyPy", "Topic :: Software Development :: Libraries", "Topic :: Utilities" ], "description": "easytextract\n======================\n\n|PyPI-Status| |PyPI-Versions| |LICENCE|\n\nEasy to use text extractor, from PDF, DOC, DOCX and other documents, including if necessary using OCR (via Tesseract).\n\nThis library can extract text from any type supported by Textract.\n\nThis library only exists because of the awesome work of the Textract team and Tesseract.\n\n|Screenshot|\n\nIt runs under Python 2.7 (it was neither tested nor developed with Python 3 compatibility in mind, although it might work with some slight changes).\n\nINSTALL\n-------\n\nIn general, please refer to the Textract documentation to install the appropriate software needed to extract text from the filetypes you need.\n\nThe rest of this section describes the details for a basic setup.\n\nPYTHON (all platforms: Linux, MacOSX, Windows)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nTo run Easytextract from Python, you need Python 2.7 and to pip install textract.\n\nThen install the following libraries to support the filetypes you want:\n\n* For PDF, pip install PDFMiner. 
To get additional features and better PDF extraction, you can install pdftotext, part of poppler or Xpdf.\n* For OCR, you need to install Tesseract >= 3.02 (but not 3.0 nor 4!) and pdftoppm.\n* For DOCX, pip install python-docx2txt.\n* For DOC, install antiword; on Windows it must be located at C:\\antiword\\antiword.exe, while on Linux and Mac you will need to change the path inside the script.\n* To support other types such as audio, see https://textract.readthedocs.io/en/stable/#currently-supporting\n\nWINDOWS\n~~~~~~~\nThe Windows binary (64-bit Windows only) supports PDF and DOCX directly.\n\nTo enable OCR, install Tesseract >= v3.02 (not v4!) for your platform beforehand. You also need to install pdftoppm.exe.\n\nFor DOC support (not DOCX as it is already supported natively), you will also need antiword installed in C:\\antiword\\antiword.exe.\n\nLICENSE\n-------------\neasytextract was initially made by Stephen Larroque for the Coma Science Group - GIGA Consciousness - CHU de Liege, Belgium. The application is licensed under the MIT License.\n\n\n.. |LICENCE| image:: https://img.shields.io/pypi/l/easytextract.svg\n :target: https://raw.githubusercontent.com/lrq3000/easytextract/master/LICENCE\n.. |PyPI-Status| image:: https://img.shields.io/pypi/v/easytextract.svg\n :target: https://pypi.python.org/pypi/easytextract\n.. |PyPI-Versions| image:: https://img.shields.io/pypi/pyversions/easytextract.svg\n :target: https://pypi.python.org/pypi/easytextract\n.. 
|Screenshot| image:: https://raw.githubusercontent.com/lrq3000/easytextract/master/img/easytextract_gui.png", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/LRQ3000/easytextract", "keywords": "text extractor pdf doc docx word utility ocr", "license": "MIT Licence", "maintainer": "", "maintainer_email": "", "name": "easytextract", "package_url": "https://pypi.org/project/easytextract/", "platform": "any", "project_url": "https://pypi.org/project/easytextract/", "project_urls": { "Homepage": "https://github.com/LRQ3000/easytextract" }, "release_url": "https://pypi.org/project/easytextract/1.1.5/", "requires_dist": null, "requires_python": "", "summary": "Easy to use text extractor, from PDF, DOC, DOCX and other document types, using the awesome Textract, including if necessary using OCR (via Tesseract).", "version": "1.1.5" }, "last_serial": 3327128, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "84044c3c4764e2125b6be1c3eb1e3039", "sha256": "c074446a1bf4c2fa7ce38b006c4fd05ddeb6ac6411653fe91b80f76ac5ad9fe9" }, "downloads": -1, "filename": "easytextract-1.0.0.tar.gz", "has_sig": false, "md5_digest": "84044c3c4764e2125b6be1c3eb1e3039", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2786782, "upload_time": "2017-11-12T00:42:57", "url": "https://files.pythonhosted.org/packages/d9/ef/c9b07f34efa2174997a2349ace8daf7a8eddd085864e5f1189612be0024a/easytextract-1.0.0.tar.gz" } ], "1.1.5": [ { "comment_text": "", "digests": { "md5": "a6936691da3cb9b8d1b9b8607a69561e", "sha256": "d94f74ba1f1db653d05c70097be43dea016184ef747144522b9d4c5682c9c9f2" }, "downloads": -1, "filename": "easytextract-1.1.5.tar.gz", "has_sig": false, "md5_digest": "a6936691da3cb9b8d1b9b8607a69561e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2789822, "upload_time": 
"2017-11-12T23:21:14", "url": "https://files.pythonhosted.org/packages/fb/25/4417e03841cbc0fa4c716a2677ed64004dded0860df5487af2e1b36060be/easytextract-1.1.5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "a6936691da3cb9b8d1b9b8607a69561e", "sha256": "d94f74ba1f1db653d05c70097be43dea016184ef747144522b9d4c5682c9c9f2" }, "downloads": -1, "filename": "easytextract-1.1.5.tar.gz", "has_sig": false, "md5_digest": "a6936691da3cb9b8d1b9b8607a69561e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2789822, "upload_time": "2017-11-12T23:21:14", "url": "https://files.pythonhosted.org/packages/fb/25/4417e03841cbc0fa4c716a2677ed64004dded0860df5487af2e1b36060be/easytextract-1.1.5.tar.gz" } ] }