{ "info": { "author": "Kees Hink", "author_email": "kees@fourdigits.nl", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Web Environment", "Framework :: Django", "Framework :: Django :: 2.0", "Framework :: Django :: 2.1", "Framework :: Django :: 2.2", "Framework :: Wagtail", "Framework :: Wagtail :: 2", "Intended Audience :: Developers", "License :: OSI Approved :: BSD License", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Internet :: WWW/HTTP :: Site Management" ], "description": "[![Build Status](https://travis-ci.org/fourdigits/wagtail_textract.svg?branch=master)](https://travis-ci.org/fourdigits/wagtail_textract)\n[![Coverage Report](http://codecov.io/github/fourdigits/wagtail_textract/coverage.svg?branch=master)](http://codecov.io/github/fourdigits/wagtail_textract?branch=master)\n\n# Text extraction for Wagtail document search\n\nThis package is for replacing [Wagtail][1]'s Document class with one\nthat allows searching in Document file contents using [textract][2].\n\nTextract can extract text from (among [others][6]) PDF, Excel and Word files.\n\nThe package was inspired by the [\"Search: Extract text from documents\" issue][3] in Wagtail.\n\nDocuments will work as before, except that Document search in Wagtail's admin interface\nwill also find search terms in the files' contents.\n\nSome screenshots to illustrate.\n\nIn our fresh Wagtail site with `wagtail_textract` installed,\nwe uploaded a [file called `test_document.pdf`](./src/wagtail_textract/tests/testfiles/test_document.pdf) with handwritten text in it.\nIt is listed in the admin interface under Documents:\n\n![Document List](/docs/screenshot_document_list_test_document.png)\n\nIf we now search in Documents for the word `correct`, which is one of the handwritten words,\nthe live search finds it:\n\n![Document Search finds PDF by searching for \"staple\"](/docs/screenshot_document_search_correct.png)\n\nThe assumption is that this search should not only be available in Wagtail's admin interface,\nbut also in a public-facing search view, for which we provide a code example.\n\n\n## Requirements\n\n- Wagtail 2 (see [tox.ini](./tox.ini))\n- The [Textract dependencies][8]\n\n\n## Maturity\n\nWe have been using this package in production since August 2018 on https://nuffic.nl.\n\n\n## Installation\n\n- Install the [Textract dependencies][8]\n- Add `wagtail_textract` to your requirements and/or `pip install wagtail_textract`\n- Add to your Django `INSTALLED_APPS`.\n- Put `WAGTAILDOCS_DOCUMENT_MODEL = \"wagtail_textract.document\"` in your Django settings.\n\nNote: You'll get an incompatibility warning during installation of wagtail_textract (Wagtail 2.0.1 installed):\n\n```\nrequests 2.18.4 has requirement chardet<3.1.0,>=3.0.2, but you'll have chardet 2.3.0 which is incompatible.\ntextract 1.6.1 has requirement beautifulsoup4==4.5.3, but you'll have beautifulsoup4 4.6.0 which is incompatible.\n```\n\nWe haven't seen this leading to problems, but it's something to keep in mind.\n\n\n### Tesseract\n\nIn order to make `textract` use [Tesseract][4], which happens if regular\n`textract` finds no text, you need to add the data files that Tesseract can\nbase its word matching on.\n\nCreate a `tessdata` directory in your project directory, and download the\n[languages][5] you want.\n\n\n## Transcribing\n\nTranscription is done automatically after Document save,\nin an [`asyncio`][7] executor to prevent blocking the response during processing.\n\nTo transcribe all existing Documents, run the management command::\n\n ./manage.py transcribe_documents\n\nThis may take a long time, obviously.\n\n\n## Usage in custom view\n\nHere is a code example for a search view (outside Wagtail's admin interface)\nthat shows both Page and Document results.\n\n```python\nfrom itertools import chain\n\nfrom wagtail.core.models import Page\nfrom wagtail.documents.models import get_document_model\n\n\ndef search(request):\n # Search\n search_query = request.GET.get('query', None)\n if search_query:\n page_results = Page.objects.live().search(search_query)\n document_results = Document.objects.search(search_query)\n search_results = list(chain(page_results, document_results))\n\n # Log the query so Wagtail can suggest promoted results\n Query.get(search_query).add_hit()\n else:\n search_results = Page.objects.none()\n\n # Render template\n return render(request, 'website/search_results.html', {\n 'search_query': search_query,\n 'search_results': search_results,\n })\n```\n\nYour template should allow for handling Documents differently than Pages,\nbecause you can't do `pageurl result` on a Document:\n\n```jinja2\n{% if result.file %}\n {{ result }}\n{% else %}\n {{ result }}\n{% endif %}\n```\n\n\n## What if you already use a custom Document model?\n\nIn order to use wagtail_textract, your `CustomizedDocument` model should do\nthe same as [wagtail_textract's Document](./src/wagtail_textract/models.py):\n\n- subclass `TranscriptionMixin`\n- alter `search_fields`\n\n```python\nfrom wagtail_textract.models import TranscriptionMixin\n\n\nclass CustomizedDocument(TranscriptionMixin, ...):\n \"\"\"Extra fields and methods for Document model.\"\"\"\n search_fields = ... + [\n index.SearchField(\n 'transcription',\n partial_match=False,\n ),\n ]\n```\n\nNote that the first class to subclass should be `TranscriptionMixin`,\nso its `save()` takes precedence over that of the other parent classes.\n\n\n## Tests\n\nTo run tests, checkout this repository and:\n\n make test\n\n\n### Coverage\n\nA coverage report will be generated in `./coverage_html_report/`.\n\n\n## Contributors\n\n- Karl Hobley\n- Bertrand Bordage\n- Kees Hink\n- Tom Hendrikx\n- Coen van der Kamp\n- Mike Overkamp\n- Thibaud Colas\n- Dan Braghis\n- Dan Swain\n\n\n[1]: https://wagtail.io/\n[2]: https://github.com/deanmalmgren/textract\n[3]: https://github.com/wagtail/wagtail/issues/542\n[4]: https://github.com/tesseract-ocr\n[5]: https://github.com/tesseract-ocr/tessdata\n[6]: http://textract.readthedocs.io/en/stable/#currently-supporting\n[7]: https://docs.python.org/3/library/asyncio.html\n[8]: http://textract.readthedocs.io/en/latest/installation.html", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/fourdigits/wagtail_textract", "keywords": "", "license": "BSD", "maintainer": "", "maintainer_email": "", "name": "wagtail-textract", "package_url": "https://pypi.org/project/wagtail-textract/", "platform": "", "project_url": "https://pypi.org/project/wagtail-textract/", "project_urls": { "Homepage": "https://github.com/fourdigits/wagtail_textract" }, "release_url": "https://pypi.org/project/wagtail-textract/1.2/", "requires_dist": null, "requires_python": "", "summary": "Allow searching for text in Documents in the Wagtail content management system", "version": "1.2" }, "last_serial": 5791659, "releases": { "0.1a0": [ { "comment_text": "", "digests": { "md5": "faaf4d1e1a9050e1ccebe0f2d038479f", "sha256": "e5599048a814e02714387a069175aba427c195785c37e08e6079ba65b651473c" }, "downloads": -1, "filename": "wagtail-textract-0.1a0.tar.gz", "has_sig": false, "md5_digest": "faaf4d1e1a9050e1ccebe0f2d038479f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1031096, "upload_time": "2018-05-08T11:52:40", "url": "https://files.pythonhosted.org/packages/a3/51/2ccae1e851088b624d21708045595b8db2c9a756abc1dd97129c80d7194d/wagtail-textract-0.1a0.tar.gz" } ], "0.1a1": [ { "comment_text": "", "digests": { "md5": "d7687bd3a693bbf20c001333c24a4829", "sha256": "fb571c96df6e9a30150a2c992411716d45da61d0f6b756ae5cb5c986bb49b8aa" }, "downloads": -1, "filename": "wagtail_textract-0.1a1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "d7687bd3a693bbf20c001333c24a4829", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 919873, "upload_time": "2018-05-08T12:25:28", "url": "https://files.pythonhosted.org/packages/ad/23/cab93dc9b21223faba2a3216517fd799eadca7e7223de7dd8f8ab5242d41/wagtail_textract-0.1a1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f07de886a3b5755e0c4678afa5fb5182", "sha256": "88f81bd7f7ed99d3176654d16b525842ddbcd0eb4bd8cc24949fce4c2c36afa6" }, "downloads": -1, "filename": "wagtail-textract-0.1a1.tar.gz", "has_sig": false, "md5_digest": "f07de886a3b5755e0c4678afa5fb5182", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1031164, "upload_time": "2018-05-08T12:25:30", "url": "https://files.pythonhosted.org/packages/1b/bc/3ca298eb78d778ab7ffac1fa820397af6efe406a2ba5d7129f011a22fa16/wagtail-textract-0.1a1.tar.gz" } ], "0.1a2": [ { "comment_text": "", "digests": { "md5": "be3211eacc4048070b0eaa5d4d370e52", "sha256": "e88190d576e49da6dd0bc0944ae979859e2eb54929493c5f3fe71889b143d4b5" }, "downloads": -1, "filename": "wagtail-textract-0.1a2.tar.gz", "has_sig": false, "md5_digest": "be3211eacc4048070b0eaa5d4d370e52", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1031195, "upload_time": "2018-05-08T12:39:36", "url": "https://files.pythonhosted.org/packages/96/7b/72dab603ed9c9a8f946a953abd25ce246888ccd00f70576846d95044b1d9/wagtail-textract-0.1a2.tar.gz" } ], "0.1b1": [ { "comment_text": "", "digests": { "md5": "04d07cb0348f12bf2b69f325fa393d0a", "sha256": "6f5a62001e8366d56f4782a9133181a9b0a4478cd392c29f4070db38c0ccb449" }, "downloads": -1, "filename": "wagtail-textract-0.1b1.tar.gz", "has_sig": false, "md5_digest": "04d07cb0348f12bf2b69f325fa393d0a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1031957, "upload_time": "2018-06-11T15:45:15", "url": "https://files.pythonhosted.org/packages/df/f4/1d2e1d910f1b33e9f069f2e7e6c1e23e8f1ecbad04e90ebff3537a76d068/wagtail-textract-0.1b1.tar.gz" } ], "1.0": [ { "comment_text": "", "digests": { "md5": "4d29490092fc9019e7184cce89002d9d", "sha256": "f2121a1915873603c9eaa836acee02824c0ffa831505c395307e08296a7493d9" }, "downloads": -1, "filename": "wagtail-textract-1.0.tar.gz", "has_sig": false, "md5_digest": "4d29490092fc9019e7184cce89002d9d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1031912, "upload_time": "2018-09-05T09:12:44", "url": "https://files.pythonhosted.org/packages/4c/ac/9e83ecd531c568e916578366d88e9f9400017c3aec075316b3d8bac21b84/wagtail-textract-1.0.tar.gz" } ], "1.1": [ { "comment_text": "", "digests": { "md5": "54f1e3790b493ab4b762cc134e8f0b5d", "sha256": "5f92a3fa0d657f309f69edf9a61a10b4300da725155e16e8637c1aca08dff497" }, "downloads": -1, "filename": "wagtail-textract-1.1.tar.gz", "has_sig": false, "md5_digest": "54f1e3790b493ab4b762cc134e8f0b5d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1029783, "upload_time": "2019-04-15T15:03:30", "url": "https://files.pythonhosted.org/packages/cb/1e/da045d78f5cb03f01430f36347952e49e5d8343f66aef00a237ad186ea8e/wagtail-textract-1.1.tar.gz" } ], "1.2": [ { "comment_text": "", "digests": { "md5": "9b028f1ca94af2fcb8dfa6af72cd0349", "sha256": "1a2329590d4e5033db70e472ec4023dbf4afe8056087e9db9162b08077749893" }, "downloads": -1, "filename": "wagtail-textract-1.2.tar.gz", "has_sig": false, "md5_digest": "9b028f1ca94af2fcb8dfa6af72cd0349", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1029901, "upload_time": "2019-09-06T11:15:30", "url": "https://files.pythonhosted.org/packages/dc/08/121863fd4f72df226fe7063a9ed8fcdc14ec46f08cc495cc61bd9d6f5f07/wagtail-textract-1.2.tar.gz" } ], "1.2.dev0": [ { "comment_text": "", "digests": { "md5": "e3d4f24a2b0a8de0d6a764f28e8b3d34", "sha256": "08b61c0965b18533e439c805fc49857fc9ae2eef0bb28bd5ca710bdbbcbf8eb6" }, "downloads": -1, "filename": "wagtail_textract-1.2.dev0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "e3d4f24a2b0a8de0d6a764f28e8b3d34", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 922236, "upload_time": "2019-09-06T11:08:08", "url": "https://files.pythonhosted.org/packages/0c/1f/ace6305662d8bcf29f4049b3428dd197e2f3a05ceebe33692946328753aa/wagtail_textract-1.2.dev0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "5735bf36d2dfa2094750d2c74eac4e64", "sha256": "792a5f792659d8a376498304285c9096610c39ad2749532a8562b40477333ed4" }, "downloads": -1, "filename": "wagtail-textract-1.2.dev0.tar.gz", "has_sig": false, "md5_digest": "5735bf36d2dfa2094750d2c74eac4e64", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1029933, "upload_time": "2019-09-06T11:08:10", "url": "https://files.pythonhosted.org/packages/17/81/5a7dd69808f9a5af83f13df88bc7fc0a8de39f526f4bf3e395a8f742a57f/wagtail-textract-1.2.dev0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "9b028f1ca94af2fcb8dfa6af72cd0349", "sha256": "1a2329590d4e5033db70e472ec4023dbf4afe8056087e9db9162b08077749893" }, "downloads": -1, "filename": "wagtail-textract-1.2.tar.gz", "has_sig": false, "md5_digest": "9b028f1ca94af2fcb8dfa6af72cd0349", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1029901, "upload_time": "2019-09-06T11:15:30", "url": "https://files.pythonhosted.org/packages/dc/08/121863fd4f72df226fe7063a9ed8fcdc14ec46f08cc495cc61bd9d6f5f07/wagtail-textract-1.2.tar.gz" } ] }