{
"info": {
"author": "Kees Hink",
"author_email": "kees@fourdigits.nl",
"bugtrack_url": null,
"classifiers": [
"Development Status :: 4 - Beta",
"Environment :: Web Environment",
"Framework :: Django",
"Framework :: Django :: 2.0",
"Framework :: Django :: 2.1",
"Framework :: Django :: 2.2",
"Framework :: Wagtail",
"Framework :: Wagtail :: 2",
"Intended Audience :: Developers",
"License :: OSI Approved :: BSD License",
"Operating System :: OS Independent",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.4",
"Programming Language :: Python :: 3.5",
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7",
"Topic :: Internet :: WWW/HTTP :: Site Management"
],
"description": "[](https://travis-ci.org/fourdigits/wagtail_textract)\n[](http://codecov.io/github/fourdigits/wagtail_textract?branch=master)\n\n# Text extraction for Wagtail document search\n\nThis package is for replacing [Wagtail][1]'s Document class with one\nthat allows searching in Document file contents using [textract][2].\n\nTextract can extract text from (among [others][6]) PDF, Excel and Word files.\n\nThe package was inspired by the [\"Search: Extract text from documents\" issue][3] in Wagtail.\n\nDocuments will work as before, except that Document search in Wagtail's admin interface\nwill also find search terms in the files' contents.\n\nSome screenshots to illustrate.\n\nIn our fresh Wagtail site with `wagtail_textract` installed,\nwe uploaded a [file called `test_document.pdf`](./src/wagtail_textract/tests/testfiles/test_document.pdf) with handwritten text in it.\nIt is listed in the admin interface under Documents:\n\n\n\nIf we now search in Documents for the word `correct`, which is one of the handwritten words,\nthe live search finds it:\n\n\n\nThe assumption is that this search should not only be available in Wagtail's admin interface,\nbut also in a public-facing search view, for which we provide a code example.\n\n\n## Requirements\n\n- Wagtail 2 (see [tox.ini](./tox.ini))\n- The [Textract dependencies][8]\n\n\n## Maturity\n\nWe have been using this package in production since August 2018 on https://nuffic.nl.\n\n\n## Installation\n\n- Install the [Textract dependencies][8]\n- Add `wagtail_textract` to your requirements and/or `pip install wagtail_textract`\n- Add to your Django `INSTALLED_APPS`.\n- Put `WAGTAILDOCS_DOCUMENT_MODEL = \"wagtail_textract.document\"` in your Django settings.\n\nNote: You'll get an incompatibility warning during installation of wagtail_textract (Wagtail 2.0.1 installed):\n\n```\nrequests 2.18.4 has requirement chardet<3.1.0,>=3.0.2, but you'll have chardet 2.3.0 which is incompatible.\ntextract 1.6.1 has requirement beautifulsoup4==4.5.3, but you'll have beautifulsoup4 4.6.0 which is incompatible.\n```\n\nWe haven't seen this leading to problems, but it's something to keep in mind.\n\n\n### Tesseract\n\nIn order to make `textract` use [Tesseract][4], which happens if regular\n`textract` finds no text, you need to add the data files that Tesseract can\nbase its word matching on.\n\nCreate a `tessdata` directory in your project directory, and download the\n[languages][5] you want.\n\n\n## Transcribing\n\nTranscription is done automatically after Document save,\nin an [`asyncio`][7] executor to prevent blocking the response during processing.\n\nTo transcribe all existing Documents, run the management command::\n\n ./manage.py transcribe_documents\n\nThis may take a long time, obviously.\n\n\n## Usage in custom view\n\nHere is a code example for a search view (outside Wagtail's admin interface)\nthat shows both Page and Document results.\n\n```python\nfrom itertools import chain\n\nfrom wagtail.core.models import Page\nfrom wagtail.documents.models import get_document_model\n\n\ndef search(request):\n # Search\n search_query = request.GET.get('query', None)\n if search_query:\n page_results = Page.objects.live().search(search_query)\n document_results = Document.objects.search(search_query)\n search_results = list(chain(page_results, document_results))\n\n # Log the query so Wagtail can suggest promoted results\n Query.get(search_query).add_hit()\n else:\n search_results = Page.objects.none()\n\n # Render template\n return render(request, 'website/search_results.html', {\n 'search_query': search_query,\n 'search_results': search_results,\n })\n```\n\nYour template should allow for handling Documents differently than Pages,\nbecause you can't do `pageurl result` on a Document:\n\n```jinja2\n{% if result.file %}\n {{ result }}\n{% else %}\n {{ result }}\n{% endif %}\n```\n\n\n## What if you already use a custom Document model?\n\nIn order to use wagtail_textract, your `CustomizedDocument` model should do\nthe same as [wagtail_textract's Document](./src/wagtail_textract/models.py):\n\n- subclass `TranscriptionMixin`\n- alter `search_fields`\n\n```python\nfrom wagtail_textract.models import TranscriptionMixin\n\n\nclass CustomizedDocument(TranscriptionMixin, ...):\n \"\"\"Extra fields and methods for Document model.\"\"\"\n search_fields = ... + [\n index.SearchField(\n 'transcription',\n partial_match=False,\n ),\n ]\n```\n\nNote that the first class to subclass should be `TranscriptionMixin`,\nso its `save()` takes precedence over that of the other parent classes.\n\n\n## Tests\n\nTo run tests, checkout this repository and:\n\n make test\n\n\n### Coverage\n\nA coverage report will be generated in `./coverage_html_report/`.\n\n\n## Contributors\n\n- Karl Hobley\n- Bertrand Bordage\n- Kees Hink\n- Tom Hendrikx\n- Coen van der Kamp\n- Mike Overkamp\n- Thibaud Colas\n- Dan Braghis\n- Dan Swain\n\n\n[1]: https://wagtail.io/\n[2]: https://github.com/deanmalmgren/textract\n[3]: https://github.com/wagtail/wagtail/issues/542\n[4]: https://github.com/tesseract-ocr\n[5]: https://github.com/tesseract-ocr/tessdata\n[6]: http://textract.readthedocs.io/en/stable/#currently-supporting\n[7]: https://docs.python.org/3/library/asyncio.html\n[8]: http://textract.readthedocs.io/en/latest/installation.html",
"description_content_type": "text/markdown",
"docs_url": null,
"download_url": "",
"downloads": {
"last_day": -1,
"last_month": -1,
"last_week": -1
},
"home_page": "https://github.com/fourdigits/wagtail_textract",
"keywords": "",
"license": "BSD",
"maintainer": "",
"maintainer_email": "",
"name": "wagtail-textract",
"package_url": "https://pypi.org/project/wagtail-textract/",
"platform": "",
"project_url": "https://pypi.org/project/wagtail-textract/",
"project_urls": {
"Homepage": "https://github.com/fourdigits/wagtail_textract"
},
"release_url": "https://pypi.org/project/wagtail-textract/1.2/",
"requires_dist": null,
"requires_python": "",
"summary": "Allow searching for text in Documents in the Wagtail content management system",
"version": "1.2"
},
"last_serial": 5791659,
"releases": {
"0.1a0": [
{
"comment_text": "",
"digests": {
"md5": "faaf4d1e1a9050e1ccebe0f2d038479f",
"sha256": "e5599048a814e02714387a069175aba427c195785c37e08e6079ba65b651473c"
},
"downloads": -1,
"filename": "wagtail-textract-0.1a0.tar.gz",
"has_sig": false,
"md5_digest": "faaf4d1e1a9050e1ccebe0f2d038479f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 1031096,
"upload_time": "2018-05-08T11:52:40",
"url": "https://files.pythonhosted.org/packages/a3/51/2ccae1e851088b624d21708045595b8db2c9a756abc1dd97129c80d7194d/wagtail-textract-0.1a0.tar.gz"
}
],
"0.1a1": [
{
"comment_text": "",
"digests": {
"md5": "d7687bd3a693bbf20c001333c24a4829",
"sha256": "fb571c96df6e9a30150a2c992411716d45da61d0f6b756ae5cb5c986bb49b8aa"
},
"downloads": -1,
"filename": "wagtail_textract-0.1a1-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "d7687bd3a693bbf20c001333c24a4829",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 919873,
"upload_time": "2018-05-08T12:25:28",
"url": "https://files.pythonhosted.org/packages/ad/23/cab93dc9b21223faba2a3216517fd799eadca7e7223de7dd8f8ab5242d41/wagtail_textract-0.1a1-py2.py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "f07de886a3b5755e0c4678afa5fb5182",
"sha256": "88f81bd7f7ed99d3176654d16b525842ddbcd0eb4bd8cc24949fce4c2c36afa6"
},
"downloads": -1,
"filename": "wagtail-textract-0.1a1.tar.gz",
"has_sig": false,
"md5_digest": "f07de886a3b5755e0c4678afa5fb5182",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 1031164,
"upload_time": "2018-05-08T12:25:30",
"url": "https://files.pythonhosted.org/packages/1b/bc/3ca298eb78d778ab7ffac1fa820397af6efe406a2ba5d7129f011a22fa16/wagtail-textract-0.1a1.tar.gz"
}
],
"0.1a2": [
{
"comment_text": "",
"digests": {
"md5": "be3211eacc4048070b0eaa5d4d370e52",
"sha256": "e88190d576e49da6dd0bc0944ae979859e2eb54929493c5f3fe71889b143d4b5"
},
"downloads": -1,
"filename": "wagtail-textract-0.1a2.tar.gz",
"has_sig": false,
"md5_digest": "be3211eacc4048070b0eaa5d4d370e52",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 1031195,
"upload_time": "2018-05-08T12:39:36",
"url": "https://files.pythonhosted.org/packages/96/7b/72dab603ed9c9a8f946a953abd25ce246888ccd00f70576846d95044b1d9/wagtail-textract-0.1a2.tar.gz"
}
],
"0.1b1": [
{
"comment_text": "",
"digests": {
"md5": "04d07cb0348f12bf2b69f325fa393d0a",
"sha256": "6f5a62001e8366d56f4782a9133181a9b0a4478cd392c29f4070db38c0ccb449"
},
"downloads": -1,
"filename": "wagtail-textract-0.1b1.tar.gz",
"has_sig": false,
"md5_digest": "04d07cb0348f12bf2b69f325fa393d0a",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 1031957,
"upload_time": "2018-06-11T15:45:15",
"url": "https://files.pythonhosted.org/packages/df/f4/1d2e1d910f1b33e9f069f2e7e6c1e23e8f1ecbad04e90ebff3537a76d068/wagtail-textract-0.1b1.tar.gz"
}
],
"1.0": [
{
"comment_text": "",
"digests": {
"md5": "4d29490092fc9019e7184cce89002d9d",
"sha256": "f2121a1915873603c9eaa836acee02824c0ffa831505c395307e08296a7493d9"
},
"downloads": -1,
"filename": "wagtail-textract-1.0.tar.gz",
"has_sig": false,
"md5_digest": "4d29490092fc9019e7184cce89002d9d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 1031912,
"upload_time": "2018-09-05T09:12:44",
"url": "https://files.pythonhosted.org/packages/4c/ac/9e83ecd531c568e916578366d88e9f9400017c3aec075316b3d8bac21b84/wagtail-textract-1.0.tar.gz"
}
],
"1.1": [
{
"comment_text": "",
"digests": {
"md5": "54f1e3790b493ab4b762cc134e8f0b5d",
"sha256": "5f92a3fa0d657f309f69edf9a61a10b4300da725155e16e8637c1aca08dff497"
},
"downloads": -1,
"filename": "wagtail-textract-1.1.tar.gz",
"has_sig": false,
"md5_digest": "54f1e3790b493ab4b762cc134e8f0b5d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 1029783,
"upload_time": "2019-04-15T15:03:30",
"url": "https://files.pythonhosted.org/packages/cb/1e/da045d78f5cb03f01430f36347952e49e5d8343f66aef00a237ad186ea8e/wagtail-textract-1.1.tar.gz"
}
],
"1.2": [
{
"comment_text": "",
"digests": {
"md5": "9b028f1ca94af2fcb8dfa6af72cd0349",
"sha256": "1a2329590d4e5033db70e472ec4023dbf4afe8056087e9db9162b08077749893"
},
"downloads": -1,
"filename": "wagtail-textract-1.2.tar.gz",
"has_sig": false,
"md5_digest": "9b028f1ca94af2fcb8dfa6af72cd0349",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 1029901,
"upload_time": "2019-09-06T11:15:30",
"url": "https://files.pythonhosted.org/packages/dc/08/121863fd4f72df226fe7063a9ed8fcdc14ec46f08cc495cc61bd9d6f5f07/wagtail-textract-1.2.tar.gz"
}
],
"1.2.dev0": [
{
"comment_text": "",
"digests": {
"md5": "e3d4f24a2b0a8de0d6a764f28e8b3d34",
"sha256": "08b61c0965b18533e439c805fc49857fc9ae2eef0bb28bd5ca710bdbbcbf8eb6"
},
"downloads": -1,
"filename": "wagtail_textract-1.2.dev0-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "e3d4f24a2b0a8de0d6a764f28e8b3d34",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 922236,
"upload_time": "2019-09-06T11:08:08",
"url": "https://files.pythonhosted.org/packages/0c/1f/ace6305662d8bcf29f4049b3428dd197e2f3a05ceebe33692946328753aa/wagtail_textract-1.2.dev0-py2.py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "5735bf36d2dfa2094750d2c74eac4e64",
"sha256": "792a5f792659d8a376498304285c9096610c39ad2749532a8562b40477333ed4"
},
"downloads": -1,
"filename": "wagtail-textract-1.2.dev0.tar.gz",
"has_sig": false,
"md5_digest": "5735bf36d2dfa2094750d2c74eac4e64",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 1029933,
"upload_time": "2019-09-06T11:08:10",
"url": "https://files.pythonhosted.org/packages/17/81/5a7dd69808f9a5af83f13df88bc7fc0a8de39f526f4bf3e395a8f742a57f/wagtail-textract-1.2.dev0.tar.gz"
}
]
},
"urls": [
{
"comment_text": "",
"digests": {
"md5": "9b028f1ca94af2fcb8dfa6af72cd0349",
"sha256": "1a2329590d4e5033db70e472ec4023dbf4afe8056087e9db9162b08077749893"
},
"downloads": -1,
"filename": "wagtail-textract-1.2.tar.gz",
"has_sig": false,
"md5_digest": "9b028f1ca94af2fcb8dfa6af72cd0349",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 1029901,
"upload_time": "2019-09-06T11:15:30",
"url": "https://files.pythonhosted.org/packages/dc/08/121863fd4f72df226fe7063a9ed8fcdc14ec46f08cc495cc61bd9d6f5f07/wagtail-textract-1.2.tar.gz"
}
]
}