Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/PragmaticMachineLearning/docai
Structured information extraction from documents
https://github.com/PragmaticMachineLearning/docai
Last synced: about 1 month ago
JSON representation
Structured information extraction from documents
- Host: GitHub
- URL: https://github.com/PragmaticMachineLearning/docai
- Owner: PragmaticMachineLearning
- License: mit
- Created: 2024-09-07T13:56:54.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-09-26T08:01:46.000Z (5 months ago)
- Last Synced: 2025-01-03T03:36:28.701Z (about 2 months ago)
- Language: Python
- Size: 4.48 MB
- Stars: 294
- Watchers: 4
- Forks: 28
- Open Issues: 3
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# DocAI
Extract structured data from unstructured documents using Answer.AI's [Byaldi](https://github.com/AnswerDotAI/byaldi), OpenAI [gpt-4o](https://platform.openai.com/docs/guides/vision), and [Langchain's structured output](https://python.langchain.com/v0.1/docs/modules/model_io/chat/structured_output/).
### Installation
```bash
pyenv virtualenv 3.10.6 docai
pyenv activate docai
poetry install
```### Environment vars
Ensure you have an OPENAI_API_KEY and HF_TOKEN set in your environment variables.```bash
export OPENAI_API_KEY=
export HF_TOKEN=
```### Sample usage
Build the index from the pdfs/ folder:
```bash
python scripts/build_index.py --folder "pdfs/" --index_name "application"
```Extraction structured information from the index (open extract.py to see queries and pydantic models):
```bash
python scripts/extract.py
```### Sample output
```
What losses have occurred in the past 5 years?
LossHistory(
losses=[
Loss(loss_date='2/20/21', loss_amount=7003.0, loss_description='Claimant was in his sleeper when his truck got hit by insured driver on the left', date_of_claim='4/19/21'),
Loss(loss_date='2/4/21', loss_amount=92584.0, loss_description='The IV was attempting to merge on the highway when the IV lost control and struck', date_of_claim='4/30/21'),
Loss(loss_date='9/14/21', loss_amount=5583.0, loss_description='IV was in the fast lane, when IV tire flew off and struck OV1, OV2, OV3, OV4', date_of_claim='9/15/21'),
Loss(loss_date='9/14/21', loss_amount=6299.0, loss_description='IV was in the fast lane, when IV tire flew off and struck OV1, OV2, OV3, OV4', date_of_claim='9/15/21')
]
)What is the basic application information?
Application(
insured_name='Greentown Burgers LLC',
insured_address='Not provided',
insured_phone='Not provided',
insured_email='Not provided',
effective_date='07/22/2024'
)
```