{ "info": { "author": "Georg Sauthoff", "author_email": "mail@gms.tf", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: End Users/Desktop", "License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Topic :: Multimedia :: Graphics :: Capture :: Scanners" ], "description": "adf2pdf - a tool that turns a batch of paper pages into a PDF\nwith a text layer. By default, it detects empty pages (as they\nmay easily occur during duplex scanning) and excludes them from\nthe OCR and the resulting PDF.\n\nFor that, it uses [Sane's][5] [scanimage][6] for the scanning,\n[Tesseract][4] for the [optical character recognition] (OCR), and\nthe Python packages [img2pdf][9], [Pillow (PIL)][10] and\n[PyPDF2][11] for some image-processing tasks and PDF mangling.\n\n\nExample:\n\n $ adf2pdf contract-xyz.pdf\n\n2017, Georg Sauthoff \n\n## Features\n\n- Automatic document feed (ADF) support\n- Fast empty page detection\n- Overlaying of scanning, image processing, OCR and PDF creation\n to minimize the total runtime\n- Fast creation of small PDFs using the fine [img2pdf][9] package\n- Only use of safe compression methods, i.e. no error-prone\n symbol segmentation style compression like [JBIG2][12] or JB2\n that is used in [Xerox photocopiers][12] and the DjVu format.\n\n## Install Instructions\n\nAdf2pdf can be directly installed with [`pip`][13], e.g.\n\n $ pip3 install --user adf2pdf\n\nor\n\n $ pip3 install adf2pdf\n\nSee also the [PyPI adf2pdf project page][14].\n\nAlternatively, the Python file `adf2pdf.py` can be directly\nexecuted in a cloned repository, e.g.:\n\n $ ./adf2pdf.py report.pdf\n\nIn addition to that, one can install the development version from\na cloned work-tree like this:\n\n $ pip3 install --user .\n\n## Hardware Requirements\n\nA scanner with automatic document feed (ADF) that is supported by\nSane. For example, the [Fujitsu ScanSnap S1500][1] works\nwell. That model supports duplex scanning, which is quite\nconvenient.\n\n## Example continued\n\nRunning _adf2pdf_ for a 7 page example document takes 150 seconds\non an i7-6600U (Intel Skylake, 4 cores) CPU (using the ADF of the\nFujitsu ScanSnap S1500). With the defaults, _adf2pdf_ calls\n`scanimage` for duplex scanning into 600 dpi lineart (black and\nwhite) images. In this example, 6 pages are empty and thus\nautomatically excluded, i.e. the resulting PDF then just contains\n8 pages.\n\nThe resulting PDF contains a text layer from the OCR such that\none can search and copy'n'paste some text. It is 1.1 MiB big,\ni.e. a page is stored in 132 KiB, on average.\n\n## Software Requirements\n\nThe script assumes Tesseract version 4, by default. Version 3 can\nbe used as well, but the [new neural network system in Tesseract\n4][8] just performs magnitudes better than the old OCR model.\nAs of mid 2018, there is no stable version 4, yet, but since\nthe beta version is so much better at OCR I can't recommend it\nenough over the stable version 3.\n\nTesseract 4 notes:\n\n- [Build instructions][2] - warning: if you miss the\n `autoconf-archive` dependency you'll get weird autoconf error\n messages\n- [Data files][3] - you need the training data for your\n languages of choice and the OSD data\n\nPython packages:\n\n- [img2pdf][9] (not packaged for Fedora, yet) - version 0.2.4 works\n fine\n- [Pillow (PIL)][10] (Fedora package: python3-pillow-devel)\n- [PyPDF2][11] (Fedora package: python3-PyPDF2)\n\n[1]: http://www.fujitsu.com/us/products/computing/peripheral/scanners/product/eol/s1500/\n[2]: https://github.com/tesseract-ocr/tesseract/wiki/Compiling-\u2013-GitInstallation\n[3]: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files\n[4]: https://en.wikipedia.org/wiki/Tesseract_(software)\n[5]: https://en.wikipedia.org/wiki/Scanner_Access_Now_Easy\n[6]: http://www.sane-project.org/man/scanimage.1.html\n[7]: https://en.wikipedia.org/wiki/Optical_character_recognition\n[8]: https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00\n[9]: https://pypi.org/project/img2pdf/\n[10]: http://python-pillow.github.io/\n[11]: https://github.com/mstamy2/PyPDF2\n[12]: https://en.wikipedia.org/wiki/JBIG2\n[13]: https://en.wikipedia.org/wiki/Pip_(package_manager)\n[14]: https://pypi.org/project/adf2pdf/", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/gsauthof/adf2pdf", "keywords": "adf scanning sane duplex-scanning ocr tesseract pdf", "license": "", "maintainer": "", "maintainer_email": "", "name": "adf2pdf", "package_url": "https://pypi.org/project/adf2pdf/", "platform": "", "project_url": "https://pypi.org/project/adf2pdf/", "project_urls": { "Bug Reports": "https://github.com/gsauthof/adf2pdf/issues", "Homepage": "https://github.com/gsauthof/adf2pdf", "Say Thanks!": "https://gms.tf", "Source": "https://github.com/gsauthof/adf2pdf" }, "release_url": "https://pypi.org/project/adf2pdf/0.8.1/", "requires_dist": null, "requires_python": ">=3", "summary": "Automate the workflow around ADF scanning, OCR and PDF creation", "version": "0.8.1" }, "last_serial": 4982717, "releases": { "0.8.0": [ { "comment_text": "", "digests": { "md5": "53fa2e1b86357c1cd258242c254144a3", "sha256": "cab4dcb1703ab66df35d18d99a6f2f814427b03fc2c6661504d7d50ad91947c6" }, "downloads": -1, "filename": "adf2pdf-0.8.0.tar.gz", "has_sig": false, "md5_digest": "53fa2e1b86357c1cd258242c254144a3", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 9061, "upload_time": "2018-05-08T06:59:50", "url": "https://files.pythonhosted.org/packages/5f/75/1ca25798e4caa257420a7375dbb6268e61d567c5cb01ee5a17e6d27a8fee/adf2pdf-0.8.0.tar.gz" } ], "0.8.1": [ { "comment_text": "", "digests": { "md5": "55f06eea80130e8a5da59d6711a740da", "sha256": "5740643c0e9d9df3e19069f6bf8da97f05d8e2add7e634a520dd5a688c0fdeeb" }, "downloads": -1, "filename": "adf2pdf-0.8.1.tar.gz", "has_sig": false, "md5_digest": "55f06eea80130e8a5da59d6711a740da", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 21072, "upload_time": "2019-03-25T14:10:47", "url": "https://files.pythonhosted.org/packages/27/ae/73df1d3f3a351a7fcdd2d010e88a1dee196459b26326e8826ef67eda759e/adf2pdf-0.8.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "55f06eea80130e8a5da59d6711a740da", "sha256": "5740643c0e9d9df3e19069f6bf8da97f05d8e2add7e634a520dd5a688c0fdeeb" }, "downloads": -1, "filename": "adf2pdf-0.8.1.tar.gz", "has_sig": false, "md5_digest": "55f06eea80130e8a5da59d6711a740da", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 21072, "upload_time": "2019-03-25T14:10:47", "url": "https://files.pythonhosted.org/packages/27/ae/73df1d3f3a351a7fcdd2d010e88a1dee196459b26326e8826ef67eda759e/adf2pdf-0.8.1.tar.gz" } ] }