{ "info": { "author": "Plone Collective", "author_email": "product-developers@lists.plone.org", "bugtrack_url": null, "classifiers": [ "Framework :: Plone", "Programming Language :: Python", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "Introduction\n============\nPDFtoOCR processes text in PDF documents using OCR. This is needed\nwhen text cannot be extracted from a (scanned) PDF. PDFtoOCR uses content rules to \nschedule the OCR processing. The processing cannot be done one the fly, for \nexample with a custom TextIndexNG plugin. Processing large PDF documents using \nOCR is a time/processor consuming task.\n\nConfiguration\n=============\n\nOn the operating system\n-----------------------\nPDF to Text uses three tools that are available for under Linux. The \ncooperation with the tools is only tested in Debian. But it the will probably \nwork in in other \\*NIX enviroments.\n\nInstall requirements, PDF to OCR uses the following programs:\n\n- pdftotext, checks if OCR processing is necessary\n- ghostscript, converts the pdf documents to tiff images\n- tesseract, does the OCR processing (make sure you've got all language packs!*)\n\nSet the environment variables:\n\n- The environment variable *$GS* must be set and point to the ghostscript binary.\n- The environment variable *$TESSERACT* must be set and point to the tesseract binary.\n\n\nOn the Plone site\n-----------------\n\nAdd a content rule\n\n- Event trigger: Object modified and object added\n- Condition: Content type is file\n- Actions: Store OCR output from a PDF in searchable text\n\nAssign content rule to a Plone site or a folder\n\nInstall cron4plone and add the following cronjob: portal/@@do_pdf_ocr_index\n\nUsage\n=====\nJust add a file with a PDF document. Optionally you can select the language so the OCR\nengine can use dictionaries when indexing. Only a limited amount of languages are\nsupported by Tesseract.\n\nAn overview of indexed documents is found in the control panel, 'PDF to OCR status'.\nIn this status page (re)indexing of documents is possible.\n\n\nPDF Processing\n==============\n\nEach time a file is added or modified the unique id (uid) of the file is added\nto a queue. This queue is persistent and has two functions, for indexing en reindexing. \nThe indexing function uses the queue to process the documents. When reindexing is used all\nfiles in the queue history are processed.\n\nIf the text from a PDF document is extracted using pdftotext no OCR is done. Else the\nOCR extracts the text and stores it the content type file. The ATFile is patched with an \nextra field to accommodate the extracted text and the language of the PDF.\n\nPage views:\n\n- @@do_pdf_ocr_index - indexes documents in the queue\n- @@do_pdf_ocr_reindex - reindexes all pdf documents in the Plone site\n- @@pdf_ocr_status - Show the queue and a history 10 documents\n\n\nFuther reading:\n===============\n\nhttp://plone.org/documentation/how-to/ocr-in-plone-using-tesseract-ocr/\nhttp://code.google.com/p/tesseract-ocr/\n\n* Make sure you don't got empty language files in /usr/local/share/tessdata/\n\nMaybe a good alternative in the future, uses tesseract but hard to setup and\nstill too much beta: \nhttp://sites.google.com/site/ocropus/\n\n \n\nChangelog\n=========\n\n1.1\n----------------------\n\n* Compatible with Plone 4\n* Added a control panel page\n* Field 'text from ocr' is added using archetypes.schemaextender instead of a monkey patch\n* No more old style external method for doing things on the filesystem.\n* Added doc tests\n\n\n1.0 - First release\n-------------------\n\n* Initial release", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://svn.plone.org/svn/collective/", "keywords": "web zope plone theme", "license": "GPL", "maintainer": null, "maintainer_email": null, "name": "Products.PDFtoOCR", "package_url": "https://pypi.org/project/Products.PDFtoOCR/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/Products.PDFtoOCR/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://svn.plone.org/svn/collective/" }, "release_url": "https://pypi.org/project/Products.PDFtoOCR/1.1/", "requires_dist": null, "requires_python": null, "summary": "PDFtoOCR does OCR processing on PDF documents. The text from OCR is used in the search results.", "version": "1.1" }, "last_serial": 785062, "releases": { "1.0": [], "1.0dev": [ { "comment_text": "", "digests": { "md5": "efb60d4eba0a7ef2603af2960741bb00", "sha256": "5bef161fe7099a345a464927eb5bd4c88f1c10c5e8eabf18d44fb8f030c1870e" }, "downloads": -1, "filename": "Products.PDFtoOCR-1.0dev.tar.gz", "has_sig": false, "md5_digest": "efb60d4eba0a7ef2603af2960741bb00", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16641, "upload_time": "2009-06-17T16:16:50", "url": "https://files.pythonhosted.org/packages/a6/d7/48b9ba5ff300034fba39438644634bc05bf3d29fae1a27230ddebf26d9e2/Products.PDFtoOCR-1.0dev.tar.gz" } ], "1.1": [ { "comment_text": "", "digests": { "md5": "f536114bbba215151cf77f7c64a67762", "sha256": "e3954b64b6f79303e715904924930d6a5c8f593d8b6797e51a2cebb79d09352e" }, "downloads": -1, "filename": "Products.PDFtoOCR-1.1.tar.gz", "has_sig": false, "md5_digest": "f536114bbba215151cf77f7c64a67762", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17875, "upload_time": "2010-03-16T12:52:20", "url": "https://files.pythonhosted.org/packages/2a/9b/e05b5c48c220adec9491ccad94d35a69ad554f83f30788afe43493a5019e/Products.PDFtoOCR-1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "f536114bbba215151cf77f7c64a67762", "sha256": "e3954b64b6f79303e715904924930d6a5c8f593d8b6797e51a2cebb79d09352e" }, "downloads": -1, "filename": "Products.PDFtoOCR-1.1.tar.gz", "has_sig": false, "md5_digest": "f536114bbba215151cf77f7c64a67762", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17875, "upload_time": "2010-03-16T12:52:20", "url": "https://files.pythonhosted.org/packages/2a/9b/e05b5c48c220adec9491ccad94d35a69ad554f83f30788afe43493a5019e/Products.PDFtoOCR-1.1.tar.gz" } ] }