https://github.com/fkodom/document-rag
RAG application to answer questions about PDF documents using LLMs.
- Host: GitHub
- URL: https://github.com/fkodom/document-rag
- Owner: fkodom
- License: mit
- Created: 2023-11-30T04:02:16.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-12-01T19:36:49.000Z (about 1 year ago)
- Last Synced: 2024-10-23T03:17:05.745Z (2 months ago)
- Language: Python
- Size: 370 KB
- Stars: 11
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
# document-rag
A simple retrieval-augmented generation (RAG) system for answering questions about PDF documents.
> **Note:** RAG responses do not depend on chat history -- only the provided documents. Each question should be self-contained, and not depend on previous questions or answers.
## Getting Started
Clone the repo and install with `pip`:
```bash
# clone
gh repo clone fkodom/document-rag
cd document-rag

# install
pip install -e .

# for contributors only:
# - install test dependencies
# - setup pre-commit hooks
pip install -e ".[test]"
pre-commit install
```

Run the `chatbot.py` script:
```bash
python chatbot.py \
path/to/document-1.pdf \
path/to/document-2.pdf \
...
```

This will start a simple Q&A session. Sample PDFs are provided in the `assets/` folder. For example:
```bash
python chatbot.py ./assets/alice-in-wonderland.pdf
```
```bash
Extracting PDFs: 100%|██████████| 1/1 [00:01<00:00, 1.63s/it]
Ingested PDF documents. Please ask your questions.
>>> What is Alice's cat's name?
Dinah
>>> What characters are present at the tea party?
The March Hare, the Hatter, and the Dormouse are present at the tea party.
```

Responses are not fully deterministic, so you may get slightly different answers each time.

Specify the `--show-references` flag to see which documents/pages were used to answer each question. By default, 5 documents are used.
```bash
Extracting PDFs: 100%|██████████| 1/1 [00:01<00:00, 1.63s/it]
Ingested PDF documents. Please ask your questions.
>>> What is Alice's cat's name?
Dinah

References:
...
.../assets/alice-in-wonderland.pdf (pp 13-14)
passionate voice. ‘Would YOU like cats if you were me?’ ‘Well, perhaps not,’ said Alice in a soothing tone: ‘don’t be angry about it. And yet I wish I could show you our cat Dinah: I think you’d take a fancy to cats if you could only see her. She is such a dear quiet thing,’ Alice went on, half to herself, as she swam lazily about in the pool, ‘and she sits 23 purring so nicely by the fire, licking her paws and washing her face–and she is such a nice soft thing to nurse–and she’s such a capital one for catching mice–oh, I beg your pardon!’ cried Alice again, for this time the Mouse was bristling all over, and she felt certain it must be really

.../assets/alice-in-wonderland.pdf (p 38)
right way to change them–’ when she was a little startled by seeing the Cheshire Cat sitting on a bough of a tree a few yards o↵. The Cat only grinned when it saw Alice. It looked good-natured, she thought: still it had VERY long claws and a great many teeth, so she felt that it ought to be treated with respect. ‘Cheshire Puss,’ she began, rather timidly, as she did not at all know whether it would like the name: however, it only grinned a little wider. ‘Come, it’s pleased so far,’ thought Alice, and she went on. ‘Would you tell me, please, which way I ought to go from here?’ ‘That depends a good deal on where you want to get to,’ said the Cat. ‘I
```

## How It Works
First, PDF documents are ingested into the system:
1. Extract text from each page
2. Split text into (slightly overlapping) chunks
3. Store chunks and their respective embeddings in a vector database (Qdrant)

Then for each query, the system performs the following steps:
1. Retrieve many (~100) text chunks from the vector DB that are similar to the query. This uses an approximate nearest neighbor (ANN) index under the hood. We collect many chunks to ensure that we have high recall.
2. Re-rank the retrieved chunks using a cross-encoder model (better at capturing detailed similarity). The top-scoring chunks will have much higher precision.
3. Pass the top chunks (~5) to an LLM along with our prompt for answer generation. Let the language model handle the rest.

## Tests and Linting
| Tool | Description | Runs on |
| --- | --- | --- |
| [black](https://github.com/psf/black) | Code formatter | `git commit` (through `pre-commit`), `git push`, pull requests |
| [ruff](https://github.com/astral-sh/ruff) | Code linter | `git commit` (through `pre-commit`), `git push`, pull requests |
| [pytest](https://github.com/pytest-dev/pytest) | Unit testing framework | `git push`, pull requests |
| [mypy](https://github.com/python/mypy) | Static type checker | `git push`, pull requests |
| [pre-commit](https://github.com/pre-commit/pre-commit) | Pre-commit hooks | `git commit` |
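
## Sketch: Chunking and Two-Stage Retrieval

The ingestion and query steps described under "How It Works" can be illustrated with a small, dependency-free sketch. This is not the code in this repository. Every function name and parameter below is hypothetical, the bag-of-words cosine score is only a stand-in for embedding similarity served by an ANN index, and the token-overlap score is only a stand-in for a cross-encoder. The sketch shows the shape of the pipeline: split text into overlapping chunks, retrieve a wide candidate set cheaply, then re-rank and keep a few top chunks for the LLM prompt.

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks; neighbors share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def first_stage_score(query: str, chunk: str) -> float:
    """Cosine similarity of bag-of-words counts.

    ASSUMPTION: stand-in for embedding similarity from a vector database.
    """
    q, c = Counter(tokenize(query)), Counter(tokenize(chunk))
    dot = sum(q[t] * c[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in c.values())
    )
    return dot / norm if norm else 0.0


def rerank_score(query: str, chunk: str) -> float:
    """Fraction of distinct query tokens that appear in the chunk.

    ASSUMPTION: stand-in for a cross-encoder relevance score.
    """
    q_tokens = set(tokenize(query))
    c_tokens = set(tokenize(chunk))
    return len(q_tokens & c_tokens) / len(q_tokens) if q_tokens else 0.0


def retrieve_and_rerank(
    query: str, chunks: list[str], num_candidates: int = 100, top_k: int = 5
) -> list[str]:
    """Two-stage retrieval: wide and cheap first, narrow and careful second."""
    # Stage 1: collect many candidates to favor recall.
    candidates = sorted(
        chunks, key=lambda ch: first_stage_score(query, ch), reverse=True
    )[:num_candidates]
    # Stage 2: re-score only the candidates and keep the best few.
    return sorted(
        candidates, key=lambda ch: rerank_score(query, ch), reverse=True
    )[:top_k]


if __name__ == "__main__":
    docs = [
        "Alice has a cat named Dinah",
        "The Hatter hosts a tea party",
        "The Queen plays croquet",
    ]
    for chunk in retrieve_and_rerank("What is Alice's cat's name?", docs, top_k=1):
        print(chunk)
```

The defaults of 100 candidates and 5 final chunks mirror the approximate numbers given above. In a real system the second stage is far more expensive per chunk than the first, which is why it is applied only to the candidate set.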