{ "info": { "author": "Joseph L. Sutherland", "author_email": "joe@clvr.tech", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3.5", "Topic :: Scientific/Engineering :: Image Recognition" ], "description": "doc2text\n========\n\n.. image:: https://travis-ci.org/jlsutherland/doc2text.svg?branch=master\n :target: https://travis-ci.org/jlsutherland/doc2text\n\n.. image:: https://badge.fury.io/py/doc2text.svg\n :target: https://badge.fury.io/py/doc2text\n\n|\n\n.. image:: docs/assets/images/news-button.png\n :alt: Sign up for Announcements\n :target: http://eepurl.com/celDRz\n :width: 250px\n\n\n`doc2text` extracts higher quality text by fixing common scan errors\n--------------------------------------------------------------------\nDeveloping text corpora can be a massive pain in the butt. Much of the text data we are interested in as scientists is locked away in poorly scanned PDFs. These scans can be off kilter, low resolution, or have a hand in them, and if you OCR them without fixing these errors, the OCR doesn't turn out so well. `doc2text` was created to help researchers fix these errors and extract the highest quality text possible from\ntheir PDFs.\n\n\n`doc2text` is super duper alpha atm\n-----------------------------------\n`doc2text` is developed and tested on Ubuntu 16.04 LTS Xenial Xerus. We do not pretend to serve all operating systems at the moment because that would be irresponsible. Please use this software with a huge grain of salt. 
We are currently working on:\n\n- Increasing the responsiveness of the text block identifier.\n- Optimizing the binarization for tesseract detection.\n- Identifying text in multiple columns (right now, it treats them as one big column).\n- Handling tables.\n- Many other optimizations.\n\nSupport and Contributions\n-------------------------\nIf you have feedback or would like to contribute, *please, please* submit a pull request or contact me at `joseph dot sutherland at columbia dot edu`.\n\n\nInstallation\n------------\nTo install the `doc2text` package, run:\n\n.. code-block:: bash\n\n pip install doc2text\n\n`doc2text` relies on the `OpenCV `_, `tesseract `_, and `PythonMagick` libraries. To execute the quick-install script, which installs OpenCV, tesseract, and PythonMagick:\n\n.. code-block:: bash\n\n curl https://raw.githubusercontent.com/jlsutherland/doc2text/master/install_deps.sh | bash\n\nManual installation\n~~~~~~~~~~~~~~~~~~~\nTo install OpenCV manually:\n\n.. code-block:: bash\n\n sudo apt-get install -y build-essential\n sudo apt-get install -y cmake git libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev\n sudo apt-get install -y python-dev python-numpy libtbb2 libtbb-dev libjpeg-dev libpng-dev libtiff-dev libjasper-dev libdc1394-22-dev\n git clone https://github.com/opencv/opencv.git opencv\n git clone https://github.com/opencv/opencv_contrib.git opencv_contrib\n cd opencv\n git checkout 3.1.0\n cd ../opencv_contrib\n git checkout 3.1.0\n cd ../opencv\n mkdir build\n cd build\n cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D INSTALL_C_EXAMPLES=OFF -D INSTALL_PYTHON_EXAMPLES=ON -D OPENCV_EXTRA_MODULES_PATH=../../opencv_contrib/modules -D BUILD_EXAMPLES=ON ..\n make -j4\n sudo make install\n sudo ldconfig\n\nTo install tesseract manually:\n\n.. code-block:: bash\n\n sudo apt-get install tesseract-ocr\n\nTo install PythonMagick manually:\n\n.. 
code-block:: bash\n\n sudo apt-get install python-pythonmagick\n\nExample usage\n-------------\n\n.. code-block:: python\n\n import doc2text\n\n # Initialize the class.\n doc = doc2text.Document()\n\n # You can pass the lang (as a 3-letter code) to the class to improve accuracy.\n # On Ubuntu, this requires the package tesseract-ocr-$lang$.\n # On other OSes, see https://github.com/tesseract-ocr/langdata\n doc = doc2text.Document(lang=\"eng\")\n\n # Read the file in. Currently accepts pdf, png, jpg, bmp, tiff.\n # If reading a PDF, doc2text will split the PDF into its component pages.\n doc.read('./path/to/my/file')\n\n # Crop the pages down to estimated text regions, deskew, and optimize for OCR.\n doc.process()\n\n # Extract text from the pages.\n doc.extract_text()\n text = doc.get_text()\n\nBig thanks\n----------\n\ndoc2text would be nothing without the open-source contributions of:\n\n- `@danvk `_\n- `@jrosebr1 `_\n- Countless Stack Overflow posts and comments.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/jlsutherland/doc2text", "keywords": "ocr text detection machine learning computer vision", "license": "MIT", "maintainer": null, "maintainer_email": null, "name": "doc2text", "package_url": "https://pypi.org/project/doc2text/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/doc2text/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/jlsutherland/doc2text" }, "release_url": "https://pypi.org/project/doc2text/0.2.4/", "requires_dist": null, "requires_python": null, "summary": "doc2text drastically improves the extraction of text from images by fixing resolution, text area (crop), and skew.", "version": "0.2.4" }, "last_serial": 2328274, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "db08e5be7142ce70f9142903fad215d7", "sha256": 
"ce3ea49b3bde0c205591bff9c1c6165f2808fb1f9076cc067414344cbdfd18f0" }, "downloads": -1, "filename": "doc2text-0.1.0.tar.gz", "has_sig": false, "md5_digest": "db08e5be7142ce70f9142903fad215d7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8610, "upload_time": "2016-08-29T13:43:36", "url": "https://files.pythonhosted.org/packages/06/7b/6b32e775f4996a125d85d83ef5c80272a6c7c11d329a6e038d67ca0ec33a/doc2text-0.1.0.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "eb7b7822080de011f4ecda05a766e547", "sha256": "522beb0dc31bc8b6a6d3e29648d8ec547b3dec66fef7bb3686c42e5351eceaec" }, "downloads": -1, "filename": "doc2text-0.2.0.tar.gz", "has_sig": false, "md5_digest": "eb7b7822080de011f4ecda05a766e547", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16682, "upload_time": "2016-08-30T11:39:18", "url": "https://files.pythonhosted.org/packages/6c/80/a9410a10d3ada7a7bcb04d3f646360f19bcaf23c86013e9b4641905d1a74/doc2text-0.2.0.tar.gz" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "80a9b97e8bb2c48d14e8a05b0ed893bd", "sha256": "c190d286694ec78d6dde41096c31c29c6e3420a900f50bd211a7a4d79b59555f" }, "downloads": -1, "filename": "doc2text-0.2.2.tar.gz", "has_sig": false, "md5_digest": "80a9b97e8bb2c48d14e8a05b0ed893bd", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16672, "upload_time": "2016-08-30T20:09:08", "url": "https://files.pythonhosted.org/packages/3f/eb/288cf796f4050def281e2f6fc96c064e0a9b6376d0699f924afa05dfcaa6/doc2text-0.2.2.tar.gz" } ], "0.2.3": [ { "comment_text": "", "digests": { "md5": "3df4c3eb473ef98d4b6afd1f42bbdba9", "sha256": "f0d46897401c01379c37beaa51d68f7bf2f0eed90a95ad6d8ea01ee90667dccc" }, "downloads": -1, "filename": "doc2text-0.2.3.tar.gz", "has_sig": false, "md5_digest": "3df4c3eb473ef98d4b6afd1f42bbdba9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9596, "upload_time": 
"2016-08-30T22:45:28", "url": "https://files.pythonhosted.org/packages/5c/f7/aeadbbc90fbb95681619b71a9024dae2c7e42a860e5bf62763c2dc4e5c93/doc2text-0.2.3.tar.gz" } ], "0.2.4": [ { "comment_text": "", "digests": { "md5": "40cee3b9d71a6105ee2937115641ed4b", "sha256": "7a450a265a9ddf52bdc498df2eb4af8a22c05018794c16ca423671a5e57b1a96" }, "downloads": -1, "filename": "doc2text-0.2.4.tar.gz", "has_sig": false, "md5_digest": "40cee3b9d71a6105ee2937115641ed4b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21695, "upload_time": "2016-09-06T21:59:21", "url": "https://files.pythonhosted.org/packages/1a/8c/79f2abf15af2f90b38fa78558470ed7f566e29a362ee2329a6266cae3dc7/doc2text-0.2.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "40cee3b9d71a6105ee2937115641ed4b", "sha256": "7a450a265a9ddf52bdc498df2eb4af8a22c05018794c16ca423671a5e57b1a96" }, "downloads": -1, "filename": "doc2text-0.2.4.tar.gz", "has_sig": false, "md5_digest": "40cee3b9d71a6105ee2937115641ed4b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21695, "upload_time": "2016-09-06T21:59:21", "url": "https://files.pythonhosted.org/packages/1a/8c/79f2abf15af2f90b38fa78558470ed7f566e29a362ee2329a6266cae3dc7/doc2text-0.2.4.tar.gz" } ] }