{"id":13468580,"url":"https://github.com/impira/docquery","last_synced_at":"2025-05-14T21:09:07.450Z","repository":{"id":57717154,"uuid":"522694449","full_name":"impira/docquery","owner":"impira","description":"An easy way to extract information from documents","archived":false,"fork":false,"pushed_at":"2023-05-03T07:43:26.000Z","size":98,"stargazers_count":1747,"open_issues_count":34,"forks_count":129,"subscribers_count":24,"default_branch":"main","last_synced_at":"2025-04-06T13:07:52.455Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/impira.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2022-08-08T20:17:37.000Z","updated_at":"2025-04-04T09:16:35.000Z","dependencies_parsed_at":"2024-01-15T03:59:17.597Z","dependency_job_id":"a02bf807-e334-47c2-a92c-8778ec7e7ba8","html_url":"https://github.com/impira/docquery","commit_stats":{"total_commits":75,"total_committers":11,"mean_commits":6.818181818181818,"dds":0.36,"last_synced_commit":"3744f08a22609c0df5a72f463911b47689eaa819"},"previous_names":["impira/docqa"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/impira%2Fdocquery","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/impira%2Fdocquery/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/impira%2Fdocquery/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/impira%2Fdocquery/manifests","owner_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/owners/impira","download_url":"https://codeload.github.com/impira/docquery/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248741201,"owners_count":21154255,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T15:01:14.023Z","updated_at":"2025-04-13T16:10:24.924Z","avatar_url":"https://github.com/impira.png","language":"Python","funding_links":[],"categories":["Python","Programming","[:robot: machine-learning]([robot-machine-learning)](\u003chttps://github.com/stars/ketsapiwiq/lists/robot-machine-learning\u003e))"],"sub_categories":["Python  \u003cimg src=\"https://raw.github.com/pcgeek86/awesome-trevor/main/assets/Python.svg?sanitize=true\" height=18\u003e"],"readme":"\u003cdiv align=\"center\"\u003e\n\nNOTE: DocQuery is not actively maintained anymore. 
We still welcome contributions and discussions among the community!\n\n# DocQuery: Document Query Engine Powered by Large Language Models\n\n[![Demo](https://img.shields.io/badge/Demo-Gradio-brightgreen)](https://huggingface.co/spaces/impira/docquery)\n[![Demo](https://img.shields.io/badge/Demo-Colab-orange)](https://github.com/impira/docquery/blob/main/docquery_example.ipynb)\n[![PyPI](https://img.shields.io/pypi/v/docquery?color=green\u0026label=pip%20install%20docquery)](https://pypi.org/project/docquery/)\n[![Discord](https://img.shields.io/discord/1015684761471160402?label=Chat)](https://discord.gg/HucNfTtx7V)\n[![Downloads](https://static.pepy.tech/personalized-badge/docquery?period=total\u0026units=international_system\u0026left_color=grey\u0026right_color=green\u0026left_text=Downloads)](https://pepy.tech/project/docquery)\n\n\u003c/div\u003e\n\nDocQuery is a library and command-line tool that makes it easy to analyze semi-structured and unstructured documents (PDFs, scanned\nimages, etc.) using large language models (LLMs). You simply point DocQuery at one or more documents and specify a\nquestion you want to ask. DocQuery was created by the team at [Impira](https://impira.com?utm_source=github\u0026utm_medium=referral\u0026utm_campaign=docquery).\n\n## Quickstart (CLI)\n\nTo install `docquery`, you can simply run `pip install docquery`. This will install the command-line tool as well as the library.\nIf you want to run OCR on images, then you must also install the [tesseract](https://github.com/tesseract-ocr/tesseract) library:\n\n- Mac OS X (using [Homebrew](https://brew.sh/)):\n\n  ```sh\n  brew install tesseract\n  ```\n\n- Ubuntu:\n\n  ```sh\n  apt install tesseract-ocr\n  ```\n\n`docquery scan` allows you to ask one or more questions of a single document or directory of files. 
For example, you can\nfind the invoice number in \u003chttps://templates.invoicehome.com/invoice-template-us-neat-750px.png\u003e with:\n\n```bash\ndocquery scan \"What is the invoice number?\" https://templates.invoicehome.com/invoice-template-us-neat-750px.png\n```\n\nIf you have a folder of documents on your machine, you can run something like\n\n```bash\ndocquery scan \"What is the effective date?\" /path/to/contracts/folder\n```\n\nto determine the effective date of every document in the folder.\n\n## Quickstart (Library)\n\nDocQuery can also be used as a library. It contains two basic abstractions: (1) a `DocumentQuestionAnswering` pipeline\nthat makes it simple to ask questions of documents and (2) a `Document` abstraction that can parse various types of documents\nto feed into the pipeline.\n\n```python\n\u003e\u003e\u003e from docquery import document, pipeline\n\u003e\u003e\u003e p = pipeline('document-question-answering')\n\u003e\u003e\u003e doc = document.load_document(\"/path/to/document.pdf\")\n\u003e\u003e\u003e for q in [\"What is the invoice number?\", \"What is the invoice total?\"]:\n...     print(q, p(question=q, **doc.context))\n```\n\n## Use cases\n\nDocQuery excels at a number of use cases involving structured, semi-structured, or unstructured documents. You can ask questions about\ninvoices, contracts, forms, emails, letters, receipts, and many more. You can also classify documents. We will continue evolving the model,\noffering more modeling options, and expanding the set of supported documents. We welcome feedback, requests, and of course contributions to\nhelp achieve this vision.\n\n## How it works\n\nUnder the hood, docquery uses a pre-trained zero-shot language model, based on [LayoutLM](https://arxiv.org/abs/1912.13318), that has been\nfine-tuned for a question-answering task. 
The model is trained using a combination of [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/)\nand [DocVQA](https://rrc.cvc.uab.es/?ch=17), which makes it particularly well suited for complex visual question answering tasks on\na wide variety of documents. The underlying model is also published on HuggingFace as [impira/layoutlm-document-qa](https://huggingface.co/impira/layoutlm-document-qa)\nwhich you can access directly.\n\n## Limitations\n\nDocQuery is intended to have a small install footprint and be simple to work with. As a result, it has some limitations:\n\n- Models must be pre-trained. Although DocQuery uses a zero-shot model that can adapt based on the question you provide, it does not learn from your data.\n- Support for images and PDFs. Currently DocQuery supports images and PDFs, with or without embedded text. It does not support Word documents, emails, spreadsheets, etc.\n- Scalar text outputs. DocQuery only produces text outputs (answers). It does not support richer scalar types (i.e., it treats numbers and dates as strings) or tables.\n\n## Advanced features\n\n### Using Donut 🍩\n\nIf you'd like to test `docquery` with [Donut](https://arxiv.org/abs/2111.15664), you must install the required extras:\n\n```bash\npip install docquery[donut]\n```\n\nYou can then run\n\n```bash\ndocquery scan \"What is the effective date?\" /path/to/contracts/folder --checkpoint 'naver-clova-ix/donut-base-finetuned-docvqa'\n```\n\n### Classifying documents\n\nTo classify documents, you simply add the `--classify` argument to `scan`. You can specify any [image classification](https://huggingface.co/models?pipeline_tag=image-classification\u0026sort=downloads)\nmodel on Hugging Face's hub. 
By default, the classification pipeline uses [Donut](https://huggingface.co/spaces/nielsr/donut-rvlcdip) (which requires\nthe installation instructions above):\n\n```bash\n\n# Classify documents\ndocquery scan --classify /path/to/contracts/folder --checkpoint 'naver-clova-ix/donut-base-finetuned-docvqa'\n\n# Classify documents and ask a question too\ndocquery scan --classify \"What is the effective date?\" /path/to/contracts/folder --checkpoint 'naver-clova-ix/donut-base-finetuned-docvqa'\n```\n\n### Scraping webpages\n\nDocQuery can read files through HTTP/HTTPS out of the box. However, if you want to read HTML documents, you can do that too by installing the\n`[web]` extension. The extension uses the [webdriver-manager](https://pypi.org/project/webdriver-manager/) library which can install a Chrome\ndriver on your system automatically, but you'll need to make sure Chrome is installed globally.\n\n```\n# Find the top post on hacker news\ndocquery scan \"What is the #1 post's title?\" https://news.ycombinator.com\n```\n\n## Where to go from here\n\nDocQuery is a Swiss Army knife tool for working with documents and experiencing the power of modern machine learning. You can use it\njust about anywhere, including behind a firewall on sensitive data, and test it with a wide variety of documents. Our hope is that\nDocQuery enables many creative use cases for document understanding by making it simple and easy to ask questions of your documents.\n\nWhen you run DocQuery for the first time, it will download some files (e.g. the models and some library code from HuggingFace). However,\nnothing leaves your computer -- the OCR is done locally, models run locally, etc. 
This comes with the benefit of security and privacy;\nhowever, it comes at the cost of runtime performance and some accuracy.\n\nIf you find yourself wondering how to achieve higher accuracy, work with more file types, teach the model with your own data, have\na human-in-the-loop workflow, or query the data you're extracting, then do not fear -- you are running into the challenges that\nevery organization does while putting document AI into production. The [Impira](https://www.impira.com/) platform is designed to\nsolve these problems in an easy and intuitive way. Impira comes with a QA model that is additionally trained on proprietary datasets\nand can achieve 95%+ accuracy out-of-the-box for most use cases. It also has an intuitive UI that enables subject matter experts to label\nand improve the models, as well as an API that makes integration a breeze. Please [sign up for the product](https://www.impira.com/signup) or\n[reach out to us](mailto:info@impira.com) for more details.\n\n## Status\n\nDocQuery is a new project. Although the underlying models are running in production, we've just recently released our code in open source\nand are actively working with the OSS community to upstream some of the changes we've made (e.g. [the model](https://github.com/huggingface/transformers/pull/18407)\nand [pipeline](https://github.com/huggingface/transformers/pull/18414)). DocQuery is rapidly changing, and we are likely to make breaking\nAPI changes. If you would like to run it in production, then we suggest pinning a version or commit hash. 
Either way, please get in touch\nwith us at [oss@impira.com](mailto:oss@impira.com) with any questions or feedback.\n\n## Acknowledgements\n\nDocQuery would not be possible without the contributions of many open source projects:\n\n- [pdfplumber](https://github.com/jsvine/pdfplumber) / [pdfminer.six](https://github.com/pdfminer/pdfminer.six)\n- [Pillow](https://pillow.readthedocs.io/en/stable/)\n- [pytorch](https://pytorch.org/)\n- [tesseract](https://github.com/tesseract-ocr/tesseract) / [pytesseract](https://pypi.org/project/pytesseract/)\n- [transformers](https://github.com/impira/transformers)\n\nand many others!\n\n## License\n\nThis project is licensed under the [MIT license](LICENSE).\n\nIt contains code that is copied and adapted from transformers (\u003chttps://github.com/huggingface/transformers\u003e),\nwhich is [Apache 2.0 licensed](http://www.apache.org/licenses/LICENSE-2.0). Files containing this code have\nbeen marked as such in their comments.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fimpira%2Fdocquery","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fimpira%2Fdocquery","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fimpira%2Fdocquery/lists"}