{"id":13531385,"url":"https://github.com/fourdigits/wagtail_textract","last_synced_at":"2025-10-06T21:06:22.703Z","repository":{"id":53534873,"uuid":"131723538","full_name":"fourdigits/wagtail_textract","owner":"fourdigits","description":"Text extraction for Wagtail document search","archived":false,"fork":false,"pushed_at":"2023-10-25T08:15:55.000Z","size":1066,"stargazers_count":34,"open_issues_count":14,"forks_count":14,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-09-04T00:58:00.923Z","etag":null,"topics":["django","search","tesseract","text-extraction","textract","wagtail"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fourdigits.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2018-05-01T14:36:46.000Z","updated_at":"2024-12-25T03:39:16.000Z","dependencies_parsed_at":"2024-04-14T16:01:37.927Z","dependency_job_id":"69ae3aff-0557-4a6c-bb0a-8dc5b7b46618","html_url":"https://github.com/fourdigits/wagtail_textract","commit_stats":{"total_commits":95,"total_committers":5,"mean_commits":19.0,"dds":"0.10526315789473684","last_synced_commit":"fce366764d77a6ec0cf88d5ead44c47b0e989f17"},"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/fourdigits/wagtail_textract","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fourdigits%2Fwagtail_textract","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fourdigits%2Fwagtail_textract/tags","releases_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/fourdigits%2Fwagtail_textract/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fourdigits%2Fwagtail_textract/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fourdigits","download_url":"https://codeload.github.com/fourdigits/wagtail_textract/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fourdigits%2Fwagtail_textract/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278679568,"owners_count":26027105,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-06T02:00:05.630Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["django","search","tesseract","text-extraction","textract","wagtail"],"created_at":"2024-08-01T07:01:02.560Z","updated_at":"2025-10-06T21:06:22.667Z","avatar_url":"https://github.com/fourdigits.png","language":"Python","funding_links":[],"categories":["Apps"],"sub_categories":["Media"],"readme":"[![Build Status](https://travis-ci.org/fourdigits/wagtail_textract.svg?branch=master)](https://travis-ci.org/fourdigits/wagtail_textract)\n[![Coverage Report](http://codecov.io/github/fourdigits/wagtail_textract/coverage.svg?branch=master)](http://codecov.io/github/fourdigits/wagtail_textract?branch=master)\n\n# ⚠️ Deprecation warning\n\nThis package is unmaintained, and we have 
no plans to maintain it.\n\nWe advise you to use it as an example, maybe copy the code into your own project, but don't install the package.  \n\n# Text extraction for Wagtail document search\n\nThis package is for replacing [Wagtail][1]'s Document class with one\nthat allows searching in Document file contents using [textract][2].\n\nTextract can extract text from (among [others][6]) PDF, Excel and Word files.\n\nThe package was inspired by the [\"Search: Extract text from documents\" issue][3] in Wagtail.\n\nDocuments will work as before, except that Document search in Wagtail's admin interface\nwill also find search terms in the files' contents.\n\nSome screenshots to illustrate.\n\nIn our fresh Wagtail site with `wagtail_textract` installed,\nwe uploaded a [file called `test_document.pdf`](./src/wagtail_textract/tests/testfiles/test_document.pdf) with handwritten text in it.\nIt is listed in the admin interface under Documents:\n\n![Document List](/docs/screenshot_document_list_test_document.png)\n\nIf we now search in Documents for the word `correct`, which is one of the handwritten words,\nthe live search finds it:\n\n![Document Search finds PDF by searching for \"correct\"](/docs/screenshot_document_search_correct.png)\n\nThe assumption is that this search should not only be available in Wagtail's admin interface,\nbut also in a public-facing search view, for which we provide a code example.\n\n\n## Requirements\n\n- Wagtail 2 (see [tox.ini](./tox.ini))\n- The [Textract dependencies][8]\n\n\n## Maturity\n\nWe have been using this package in production since August 2018 on https://nuffic.nl.\n\n\n## Installation\n\n- Install the [Textract dependencies][8]\n- Add `wagtail_textract` to your requirements and/or `pip install wagtail_textract`\n- Add `wagtail_textract` to your Django `INSTALLED_APPS`.\n- Put `WAGTAILDOCS_DOCUMENT_MODEL = \"wagtail_textract.document\"` in your Django settings.\n\nNote: You'll get an incompatibility warning during installation of wagtail_textract 
(Wagtail 2.0.1 installed):\n\n```\nrequests 2.18.4 has requirement chardet\u003c3.1.0,\u003e=3.0.2, but you'll have chardet 2.3.0 which is incompatible.\ntextract 1.6.1 has requirement beautifulsoup4==4.5.3, but you'll have beautifulsoup4 4.6.0 which is incompatible.\n```\n\nWe haven't seen this lead to problems, but it's something to keep in mind.\n\n\n### Tesseract\n\n`textract` falls back to [Tesseract][4] when regular extraction finds no text.\nFor that to work, you need to add the data files that Tesseract can\nbase its word matching on.\n\nCreate a `tessdata` directory in your project directory, and download the\n[languages][5] you want.\n\n\n## Transcribing\n\nTranscription is done automatically after Document save,\nin an [`asyncio`][7] executor to prevent blocking the response during processing.\n\nTo transcribe all existing Documents, run the management command:\n\n    ./manage.py transcribe_documents\n\nThis may take a long time.\n\n\n## Usage in custom view\n\nHere is a code example for a search view (outside Wagtail's admin interface)\nthat shows both Page and Document results.\n\n```python\nfrom itertools import chain\n\nfrom django.shortcuts import render\nfrom wagtail.core.models import Page\nfrom wagtail.documents.models import get_document_model\nfrom wagtail.search.models import Query\n\n# Resolve the model configured in WAGTAILDOCS_DOCUMENT_MODEL\nDocument = get_document_model()\n\n\ndef search(request):\n    # Search\n    search_query = request.GET.get('query', None)\n    if search_query:\n        page_results = Page.objects.live().search(search_query)\n        document_results = Document.objects.search(search_query)\n        search_results = list(chain(page_results, document_results))\n\n        # Log the query so Wagtail can suggest promoted results\n        Query.get(search_query).add_hit()\n    else:\n        search_results = Page.objects.none()\n\n    # Render template\n    return render(request, 'website/search_results.html', {\n        'search_query': search_query,\n        'search_results': search_results,\n    })\n```\n\nYour template should allow for handling Documents differently than 
Pages,\nbecause you can't do `pageurl result` on a Document:\n\n```jinja2\n{% if result.file %}\n   \u003ca href=\"{{ result.url }}\"\u003e{{ result }}\u003c/a\u003e\n{% else %}\n   \u003ca href=\"{% pageurl result %}\"\u003e{{ result }}\u003c/a\u003e\n{% endif %}\n```\n\n\n## What if you already use a custom Document model?\n\nIn order to use wagtail_textract, your `CustomizedDocument` model should do\nthe same as [wagtail_textract's Document](./src/wagtail_textract/models.py):\n\n- subclass `TranscriptionMixin`\n- alter `search_fields`\n\n```python\nfrom wagtail.search import index\nfrom wagtail_textract.models import TranscriptionMixin\n\n\nclass CustomizedDocument(TranscriptionMixin, ...):\n    \"\"\"Extra fields and methods for Document model.\"\"\"\n    search_fields = ... + [\n        index.SearchField(\n            'transcription',\n            partial_match=False,\n        ),\n    ]\n```\n\nNote that `TranscriptionMixin` should be the first base class,\nso its `save()` takes precedence over that of the other parent classes.\n\n\n## Tests\n\nTo run the tests, check out this repository and run:\n\n    make test\n\n\n### Coverage\n\nA coverage report will be generated in `./coverage_html_report/`.\n\n\n## Contributors\n\n- Karl Hobley\n- Bertrand Bordage\n- Kees Hink\n- Tom Hendrikx\n- Coen van der Kamp\n- Mike Overkamp\n- Thibaud Colas\n- Dan Braghis\n- Dan Swain\n\n\n[1]: https://wagtail.io/\n[2]: https://github.com/deanmalmgren/textract\n[3]: https://github.com/wagtail/wagtail/issues/542\n[4]: https://github.com/tesseract-ocr\n[5]: https://github.com/tesseract-ocr/tessdata\n[6]: http://textract.readthedocs.io/en/stable/#currently-supporting\n[7]: https://docs.python.org/3/library/asyncio.html\n[8]: 
http://textract.readthedocs.io/en/latest/installation.html\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffourdigits%2Fwagtail_textract","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffourdigits%2Fwagtail_textract","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffourdigits%2Fwagtail_textract/lists"}