https://github.com/tejas-130704/pdf_assistant
Open-Source PDF Assistant: This tool allows users to ask questions based on the content of a PDF by simply providing a link to the document. It leverages Docker to create a vector database using pgvector for efficient text retrieval, ensuring unlimited queries without OpenAI embedder limitations. 🚀📄
agentic-ai docker docker-compose mlop open-source pdf-assistant pgvector phidata-framework python streamlit
- Host: GitHub
- URL: https://github.com/tejas-130704/pdf_assistant
- Owner: tejas-130704
- Created: 2025-03-01T06:52:54.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-03-01T07:25:30.000Z (7 months ago)
- Last Synced: 2025-03-01T08:18:36.376Z (7 months ago)
- Topics: agentic-ai, docker, docker-compose, mlop, open-source, pdf-assistant, pgvector, phidata-framework, python, streamlit
- Language: Python
- Homepage:
- Size: 5.86 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
# 🚀 AI-Powered Knowledge Base with pgvector and Sentence Transformer 🧠📚
## 🌟 Overview
This project uses **pgvector** for vector storage and retrieval, combined with an **open-source sentence transformer** for text embedding. The default OpenAI embedder requires an API key and is subject to credit limits, so the system is configured to use a **sentence transformer model with 1024 dimensions** instead of the default 1536 dimensions.
### 🔥 Open-Source PDF Assistant
I have created an **open-source PDF assistant** that eliminates the limitations imposed by OpenAI's embedder. This assistant can be used **without any restrictions**, allowing unlimited queries and document processing.
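The two ideas above, a fixed embedding size and similarity-based retrieval, can be illustrated with a minimal sketch. It uses plain Python lists in place of real embeddings, and the helper names `check_dimension` and `cosine_distance` are hypothetical; they are not part of this repository. The sketch assumes cosine distance as the similarity measure (the quantity pgvector's `<=>` operator computes); the metric this project actually uses is not stated here.

```python
import math

# Dimension stated in this README; the default OpenAI embedder uses 1536.
EXPECTED_DIM = 1024


def check_dimension(embedding, expected=EXPECTED_DIM):
    """Raise ValueError if the embedding length differs from the expected size."""
    if len(embedding) != expected:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, expected {expected}"
        )
    return embedding


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy vectors standing in for embeddings of a query and two text chunks.
    query = check_dimension([1.0] + [0.0] * 1023)
    chunk_a = check_dimension([2.0] + [0.0] * 1023)       # same direction as the query
    chunk_b = check_dimension([0.0, 1.0] + [0.0] * 1022)  # orthogonal to the query
    print(cosine_distance(query, chunk_a))
    print(cosine_distance(query, chunk_b))
```

A smaller distance means a closer match, so a retrieval step would rank `chunk_a` ahead of `chunk_b` for this query. Passing a 1536-element vector to `check_dimension` raises an error, which mirrors the mismatch that occurs when a table sized for one embedder receives vectors from another.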
## 🛠️ Setup Instructions
### ✅ Prerequisites
- 🐳 **Docker & Docker Compose** installed
- 🐍 **Python 3.8+** installed
- 🗄️ **PostgreSQL with pgvector extension** enabled
- 🌐 **Streamlit** for the frontend
### 🚀 Running the Project
#### Step 0: 🔗 Clone the GitHub Repository
First, clone the project repository from GitHub:
```bash
git clone https://github.com/tejas-130704/PDF_Assistant.git
cd PDF_Assistant
```
#### Step 1: 🏗️ Start pgvector with Docker
Ensure your `docker-compose.yaml` file is correctly set up, then run:
```bash
docker-compose up -d
```
#### Step 2: 🗄️ Configure the Database
After the Docker container is running, execute the following commands:
```bash
docker exec -it <container_id> psql -U root -d mydb
```
Replace `<container_id>` with the actual container ID (can be found using `docker ps`).
Connect to the database:
```sql
\c mydb
```
Check existing tables:
```sql
\dt
```
Drop the existing embeddings table if it exists:
```sql
DROP TABLE IF EXISTS ai.embeddings;
```
Create the new table with **1024-dimensional embeddings**:
```sql
CREATE TABLE ai.embeddings (
    id VARCHAR PRIMARY KEY,
    name VARCHAR NOT NULL,
    meta_data JSONB,
    filters JSONB,
    content TEXT NOT NULL,
    embedding vector(1024), -- Adjusted to match the embedding model dimensions
    usage JSONB,
    content_hash VARCHAR UNIQUE
);
```
Verify that the table was created successfully:
```sql
\dt ai.*
```
#### Step 3: 📦 Install Dependencies
Navigate to your project directory and install required packages:
```bash
pip install -r requirements.txt
```
#### Step 4: 🚀 Run the Application
Start the Streamlit application:
```bash
streamlit run app.py
```
Once running, open your browser and go to:
```
http://localhost:8501/
```
#### Step 5: 📚 Load the Knowledge Base
1. **Add `GROQ_API_KEY`** in the sidebar.
2. **Provide the PDF link** containing knowledge base content.
3. Click **"Load Knowledge Base"**.
4. Once you see the message **"Knowledge Base Loaded Successfully!"**, you can start asking questions. 🎉
## Screenshots


## 🛠️ Troubleshooting
- ⚠️ If you get an error about mismatched vector dimensions, ensure that the **embedding dimension in PostgreSQL matches the sentence transformer (1024)**.
- 🛑 If OpenAI is still being used, check that your Python script is correctly configured to use **sentence transformers instead of OpenAI embeddings**.
- ✅ Ensure that all required dependencies are installed using `pip install -r requirements.txt`.
## 🚀 Future Enhancements
- 🔒 Adding **user authentication** for secure access
- 🚀 Implementing **cache storage** to speed up repeated queries
- 🎨 Enhancing **UI/UX** for a more interactive experience
## 🎖️ Contributors
- **Tejas Narayan Jadhav** - [GitHub](https://github.com/tejas-130704)

🤝 Feel free to contribute by submitting pull requests or reporting issues! 🚀