https://github.com/harshindcoder/rag_pipeline_panscience_innovations
Built a document parsing and LLM query application that extracts and structures information from PDFs, enabling natural language queries on the content using a vector database and local LLMs.
Last synced: 8 months ago
- Host: GitHub
- URL: https://github.com/harshindcoder/rag_pipeline_panscience_innovations
- Owner: harshindcoder
- License: mit
- Created: 2025-10-06T06:28:17.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-10-06T16:59:05.000Z (8 months ago)
- Last Synced: 2025-10-06T18:37:06.857Z (8 months ago)
- Language: Python
- Size: 7.81 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# rag_pipeline_PanScience_Innovations
A document-parsing and LLM query application that extracts and structures information from PDFs, then answers natural-language queries about their content using a vector database and local LLMs.

**Note:** The pipeline requires a running **Mistral model on Ollama**, which serves as the local LLM backend for generating responses.

---
## Features
- Parse PDF documents and extract structured information.
- Store and query data using a vector database.
- Query PDFs using natural language via a local LLM (Mistral on Ollama).
- Fully containerized using Docker for easy setup.
- Works on **Windows** and **macOS**. Running on **Linux** may require modifications that are not yet provided.
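
To make the retrieval step behind these features concrete, here is a toy sketch of how a question can be matched against stored chunks before the LLM is asked. It is not the project's implementation: the word-count "embedding", the in-memory list standing in for the vector database, and the names `embed`, `cosine`, `top_chunks`, and `build_prompt` are all illustrative assumptions. Calling `build_prompt(question, chunks)` returns the text that would be sent to the model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts (a real pipeline would use a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_chunks(question, chunks, k=2):
    """Return the k chunks most similar to the question (stand-in for a vector DB query)."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question, chunks, k=2):
    """Combine the retrieved chunks and the question into a single prompt for the LLM."""
    context = "\n".join(top_chunks(question, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```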
---
## Prerequisites
- Docker installed on your system ([Docker Desktop](https://www.docker.com/products/docker-desktop)).
- Docker Compose installed (usually included with Docker Desktop).
- Ollama installed and the Mistral model running locally.
- (Optional) Git, if you want to clone the repository.
---
## Setup Instructions
### 1. Pull the Docker Images
Docker Compose pulls both the application and MongoDB images automatically. To pull them manually instead:
```bash
docker pull ghcr.io/harshindcoder/rag-pipeline-panscience-innovations:latest
docker pull mongo:latest
```
---
### 2. Run the Services via Docker Compose
Create a file named `docker-compose.yml` (or use the one provided) with the following content:
```yaml
version: "3.9"
services:
  app:
    image: ghcr.io/harshindcoder/rag-pipeline-panscience-innovations:latest
    container_name: rrag_app
    ports:
      - "8000:8000"
    depends_on:
      - mongo
  mongo:
    image: mongo:latest
    container_name: rrag_mongo
    ports:
      - "27017:27017"
```
Then, in the terminal, run:
```bash
docker-compose up
```
* To run in the background (detached mode):
```bash
docker-compose up -d
```
* To view logs:
```bash
docker-compose logs -f
```
* To stop all services:
```bash
docker-compose down
```
---
### 3. Start Ollama with Mistral
Open another terminal and run:
```bash
ollama run mistral
```
* The container expects to connect to Ollama at `http://localhost:11434`.
* Mistral must be running while using the API.
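
Before uploading anything, it can help to confirm that Ollama is reachable and that a Mistral model is available. The sketch below queries Ollama's `/api/tags` endpoint, which lists locally available models, at the address given above. The helper names `has_model` and `ollama_has_model` are illustrative, not part of this project. `ollama_has_model()` returns `True` or `False`, and treats an unreachable server as `False`.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # address stated in this README


def has_model(tags_payload, name="mistral"):
    """Return True if a model called `name` (any tag) appears in an /api/tags payload."""
    for model in tags_payload.get("models", []):
        # Names look like "mistral:latest"; compare the part before the tag.
        if model.get("name", "").split(":")[0] == name:
            return True
    return False


def ollama_has_model(name="mistral", base_url=OLLAMA_URL, timeout=3):
    """Ask a running Ollama server which models it has; False if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return has_model(json.load(resp), name)
    except (urllib.error.URLError, OSError, ValueError):
        return False
```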
---
### 4. Upload a PDF
Open a **new terminal** and run:
```bash
curl -X POST -F "file=@/path/to/your/document.pdf" http://localhost:8000/upload
```
* Replace `/path/to/your/document.pdf` with the actual path of your PDF.
* This uploads the PDF to the pipeline for parsing and vectorization.
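
The same upload can be scripted from Python using only the standard library. This is a minimal sketch that mirrors the `curl -F "file=@..."` call above: it assumes the endpoint accepts a multipart form field named `file`, and because the response format is not documented here it returns the raw response text. The function names are illustrative. Call it as `upload_pdf("/path/to/your/document.pdf")`.

```python
import uuid
import urllib.request
from pathlib import Path

UPLOAD_URL = "http://localhost:8000/upload"  # endpoint from the curl example


def build_file_form(field, filename, data, content_type="application/pdf"):
    """Encode one file as multipart/form-data; returns (body, Content-Type header value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload_pdf(path, url=UPLOAD_URL, timeout=60):
    """POST the PDF at `path` under the form field "file" and return the response text."""
    pdf = Path(path)
    body, content_type = build_file_form("file", pdf.name, pdf.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```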
---
### 5. Ask Questions
Once the file is uploaded, in the same terminal (or a new one), run:
```bash
curl -X POST -F "question=What is this file about?" http://localhost:8000/ask
```
* The pipeline will query the parsed document via Mistral on Ollama and return the answer.
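
Questions can be sent from Python in the same way. The sketch below reproduces the `curl -F "question=..."` request with the standard library. It assumes the endpoint reads a multipart form field named `question`, as the curl command suggests, and returns the raw response text because the response schema is not documented here. The function names are illustrative. Call it as `ask("What is this file about?")`.

```python
import uuid
import urllib.request

ASK_URL = "http://localhost:8000/ask"  # endpoint from the curl example


def build_text_form(fields):
    """Encode plain text fields as multipart/form-data, as `curl -F` does."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")
    return "".join(parts).encode("utf-8"), f"multipart/form-data; boundary={boundary}"


def ask(question, url=ASK_URL, timeout=120):
    """POST the question as a form field and return the response text."""
    body, content_type = build_text_form({"question": question})
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```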
---
### 6. Quick Manual Checks
For now, you can verify that everything is running with simple commands:
* **Check Docker containers:**
```bash
docker ps
```
* **Check FastAPI is responding:**
```bash
curl -X GET http://localhost:8000/docs
```
* **Parser and VectorDB** are verified via pytest:
```bash
pytest -v tests/
```
The tests ship with the repository but are not included in the production Docker image, so run them locally with pytest from a clone of the repo.

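
The manual checks above can also be bundled into one small script. This sketch only tests whether something accepts TCP connections on the ports named in this README (`8000`, `27017`, and Ollama's `11434`). It does not verify that the services behind them are healthy, and the names `port_open` and `check_services` are illustrative. Calling `check_services()` returns a dict such as `{"FastAPI": True, ...}`, depending on what is running.

```python
import socket

# Ports named in this README; the service labels are for display only.
SERVICES = {"FastAPI": 8000, "MongoDB": 27017, "Ollama": 11434}


def port_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services=SERVICES, host="127.0.0.1"):
    """Map each service name to whether its port is reachable."""
    return {name: port_open(port, host) for name, port in services.items()}
```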
---
## Notes
* **Terminals:** You need at least **two terminals**: one for Docker Compose (FastAPI + MongoDB), one for Ollama. The `curl` commands can run in the same terminal or separate ones.
* **File paths:** Replace `/path/to/your/document.pdf` with your PDF location.
* **Ports:** FastAPI runs on `8000` and MongoDB on `27017`. Adjust the `ports` in `docker-compose.yml` if needed.
---
## Future Improvements
* Add Linux compatibility.
* Support multiple users and larger document-processing workloads.
* Improve error handling when the Mistral/Ollama connection is missing.
---