https://github.com/whitris/open-rag-bot

Modular Retrieval-Augmented Generation (RAG) chatbot: ask questions about your own documents and get LLM-powered answers. Web and CLI included.

chatbot groq llm openai python rag streamlit vector-search

# Open RAG Bot

![Python versions](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)
![MIT License](https://img.shields.io/badge/license-MIT-green.svg)
![CI](https://github.com/Whitris/open-rag-bot/actions/workflows/ci.yml/badge.svg)

Open RAG Bot is a modular framework for building document-aware chatbots powered by Retrieval-Augmented Generation (RAG). Search, chat, and extract insights from your own files with state-of-the-art LLMs, via CLI or Streamlit webapp.

## Features

- **Retrieval-Augmented Generation (RAG):** Answers are grounded in your own documents, not just LLM training data.
- **Flexible document ingestion:** Easily add your PDFs, DOCX, or text files to the knowledge base.
- **Fast search:** Uses [ChromaDB](https://www.trychroma.com/) for efficient vector search and retrieval with metadata support.
- **Two interfaces:** Use from the command line (CLI) or via a modern Streamlit webapp.
- **Tested and extensible:** Includes unit tests and a clean codebase for easy customization.
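
To make the first feature concrete, here is a minimal sketch of how retrieved chunks might be combined with a user question into a grounded prompt. The function name `build_prompt` and the prompt wording are assumptions for illustration, not the project's actual implementation in `core/`.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt from retrieved chunks (hypothetical helper)."""
    # Number each chunk so the model can refer back to its sources.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting string would then be sent to the configured LLM. Restricting the model to the supplied context is what keeps answers tied to your documents rather than to training data alone.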

## Folder structure

```text
open-rag-bot/

├── src/
│ └── open_rag_bot/
│ ├── app/ # Streamlit webapp
│ ├── config/ # Settings, .env, config files
│ ├── core/ # Core logic: retrieval, prompt, chatbot
│ ├── data/ # Data processing scripts and utilities
│ └── services/ # LLM & embedding clients

├── tests/ # Unit and integration tests
├── data/ # Local, non-versioned data (indexed files, outputs)
├── requirements.txt
├── pyproject.toml
├── CONTRIBUTING.md
├── README.md
├── LICENSE
├── .gitignore
└── .pre-commit-config.yaml
```

## Quickstart

### 1. Clone the repository

```bash
git clone https://github.com/whitris/open-rag-bot.git
cd open-rag-bot
```

### 2. Install dependencies (recommended: PDM)

```bash
pip install pdm
pdm install
```

If you prefer classic pip:

```bash
pip install -r requirements.txt
```

### 3. Configure environment

Copy `./.env.example` to `./.env` and add your API keys (see the "Environment variables" section below).

### 4. Prepare your documents
Place your PDF, DOCX or text files in a folder of your choice.

### 5. Process your documents

Extracts text from documents, splits it into chunks, and saves the result as a CSV file.

```bash
pdm run python process_data.py [OPTIONS] INPUT_DIR OUTPUT_CSV_PATH
```

#### Required Arguments

- `INPUT_DIR`: Directory containing the documents to process (PDF, TXT, DOCX, etc.).
- `OUTPUT_CSV_PATH`: Path of the CSV file where the text chunks will be written.

#### Options

- `--chunk-size INTEGER`: Size of each text chunk, in characters. Default: 500.
- `--max-files INTEGER`: Maximum number of files to process. Default: -1, meaning all files.
- `--formats TEXT`: File formats to include, comma-separated. Default: pdf,txt,docx.
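
As a rough illustration of what `--chunk-size` controls, the sketch below splits text into fixed-size character chunks and writes them to CSV. The function names and the CSV column names (`source`, `chunk_id`, `text`) are assumptions; the actual script may split differently (for example with overlap or on sentence boundaries) and use another schema.

```python
import csv
import io
from typing import TextIO


def split_into_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def write_chunks_csv(chunks: list[str], source: str, fileobj: TextIO) -> None:
    """Write one CSV row per chunk (column names are hypothetical)."""
    writer = csv.writer(fileobj)
    writer.writerow(["source", "chunk_id", "text"])
    for chunk_id, chunk in enumerate(chunks):
        writer.writerow([source, chunk_id, chunk])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_chunks_csv(split_into_chunks("abcdefghij", 4), "demo.txt", buffer)
    print(buffer.getvalue())
```

Smaller chunks give more precise retrieval hits but less context per hit; larger chunks do the opposite, so the right value depends on your documents.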

#### Example usage

```bash
pdm run python process_data.py ./documents ./data/example.csv --chunk-size 1000 --max-files 100 --formats pdf,txt
```

#### Note

- If the input directory does not exist or no matching files are found, the script exits with an error message.
- Supported formats are controlled by the `--formats` option (case-insensitive, e.g., `pdf,docx,txt`).
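
The behavior described in these notes can be sketched as follows. This is a hypothetical helper, not the script's actual code: the name `select_files`, the non-recursive directory scan, and the exact error messages are all assumptions.

```python
from pathlib import Path


def select_files(
    input_dir: str, formats: str = "pdf,txt,docx", max_files: int = -1
) -> list[Path]:
    """Return matching files in input_dir, or exit with an error message."""
    root = Path(input_dir)
    if not root.is_dir():
        raise SystemExit(f"Input directory not found: {root}")
    # Normalize "PDF, .txt" style input to {".pdf", ".txt"} for comparison.
    wanted = {
        "." + fmt.strip().lower().lstrip(".")
        for fmt in formats.split(",")
        if fmt.strip()
    }
    # Assumption: only the top level of the directory is scanned.
    files = sorted(
        p for p in root.iterdir() if p.is_file() and p.suffix.lower() in wanted
    )
    if not files:
        raise SystemExit(f"No files matching {sorted(wanted)} in {root}")
    # A negative max_files means "no limit", mirroring the -1 default.
    return files if max_files < 0 else files[:max_files]
```

Comparing lowercased suffixes is what makes `--formats PDF` and `--formats pdf` equivalent and lets a file named `report.PDF` match.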

### 6. Generate embeddings

Creates a vector index of the text chunks using embeddings and stores the collection in ChromaDB.

```bash
pdm run python generate_embeddings.py [OPTIONS] CSV_PATH
```

#### Required Arguments

- `CSV_PATH`: Path to the CSV file containing the text chunks generated by the previous step.

#### Optional Arguments

- `--collection-dir`: Directory to store the ChromaDB index (default: as defined in your settings).
- `--collection-name`: Name of the ChromaDB collection (default: as defined in your settings).

#### Example usage

```bash
pdm run python generate_embeddings.py ./data/example.csv --collection-dir ./data/example --collection-name example_collection
```
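
To show the idea behind this step without external dependencies, here is a toy in-memory index. It is a stand-in only: word counts replace a real embedding model, and a Python list replaces ChromaDB. All names (`embed`, `cosine`, `ToyIndex`) are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Stores (chunk, vector) pairs and returns the most similar chunks."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, chunks: list[str]) -> None:
        self._items.extend((chunk, embed(chunk)) for chunk in chunks)

    def query(self, question: str, k: int = 3) -> list[str]:
        q = embed(question)
        ranked = sorted(
            self._items, key=lambda item: cosine(q, item[1]), reverse=True
        )
        return [text for text, _ in ranked[:k]]
```

A real vector store follows the same add-then-query pattern, but uses dense embeddings from the configured provider and persists the index to the collection directory.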

### 7. Start the chatbot

**Command-line interface (CLI)**

```bash
pdm run python cli.py [OPTIONS]
```

#### Optional Arguments

- `--collection-dir`: Directory to read the ChromaDB index from (default: as defined in your settings).
- `--collection-name`: Name of the ChromaDB collection (default: as defined in your settings).
- `--verbose`: Enable verbose (DEBUG) logging.

#### Example usage

```bash
pdm run python cli.py --collection-dir ./data/example --collection-name example_collection --verbose
```

**Webapp**

```bash
pdm run streamlit run src/open_rag_bot/app/main.py
```

All options for the webapp are read from your environment. Make sure the `COLLECTION_DIR` and `COLLECTION_NAME` environment variables are set to the same values you used to generate the embeddings.

## Environment variables

All sensitive settings (such as API keys and provider selection) are managed via environment variables. Most defaults can be set or overridden in `.env`.

1. Copy `.env.example` to the project root as `.env`:

```bash
cp ./.env.example ./.env
```

2. Edit `.env` to add your credentials and adjust preferences as needed.

### Main variables

| Variable           | Description                            | Example                |
|--------------------|----------------------------------------|------------------------|
| OPENAI_API_KEY     | Your OpenAI API key                    | sk-...                 |
| EMBEDDING_PROVIDER | Embedding provider to use              | openai                 |
| EMBEDDING_MODEL    | Model name for embedding tasks         | text-embedding-3-large |
| LLM_PROVIDER       | LLM provider to use                    | openai                 |
| SMALL_LLM_MODEL    | Model name for fast/re-writing tasks   | gpt-4.1-nano           |
| LLM_MODEL          | Model name for main answer generation  | gpt-4.1                |
| COLLECTION_DIR     | Directory for the Chroma index         | data/index             |
| COLLECTION_NAME    | Name for the Chroma collection         | my_collection          |

#### Note

- If a provider is selected, the corresponding API key must be set. The application will fail if the required key is missing.
- If you do not specify a model, the code will select a default based on the chosen provider.

### Example

```env
OPENAI_API_KEY=sk-...
EMBEDDING_PROVIDER=openai
EMBEDDING_MODEL=text-embedding-3-large
LLM_PROVIDER=openai
SMALL_LLM_MODEL=gpt-4.1-nano
LLM_MODEL=gpt-4.1-mini
COLLECTION_DIR=data/collection
COLLECTION_NAME=default
```

## Test and development

This project includes unit and integration tests to ensure correctness and maintainability.

### Running tests

To run all tests with [pytest](https://pytest.org/):

```bash
pdm run pytest
```

or, if using classic pip:

```bash
pytest
```

### Pre-commit hooks

This project uses [pre-commit](https://pre-commit.com/) to automate code formatting and linting via [Ruff](https://docs.astral.sh/ruff/).

To enable automatic checks before every commit, install pre-commit and set up the hooks:

```bash
pdm add pre-commit
pdm run pre-commit install
```

After setup, every commit will automatically run ruff and ruff-format on your codebase.

- Configuration is in `.pre-commit-config.yaml`
- You can manually run all hooks on all files with:

```bash
pdm run pre-commit run --all-files
```

## Contributing Guidelines

We welcome contributions, suggestions, and pull requests!
Please follow these guidelines to keep the codebase clean, maintainable, and friendly for everyone.

### How to contribute

1. Fork the repository and create a new branch for your feature or fix.
2. Write clear, concise code following project conventions (see Code Style below).
3. Add tests for new features or bug fixes, when appropriate.
4. Run all tests before submitting.
5. Run code style checks and apply formatting.
6. Use pre-commit hooks (recommended).
7. Open a pull request with a clear title and description. Link related issues if relevant.

### Code style

- This project uses [Ruff](https://docs.astral.sh/ruff/) to enforce Python code style and linting.
- Please follow [PEP8](https://peps.python.org/pep-0008/) guidelines where possible.
- Organize imports and use descriptive names for functions, classes, and variables.

To check code style and run linting on the src/ and tests/ directories:

```bash
pdm run ruff check src/ tests/
```

For auto-formatting, you can also run:

```bash
pdm run ruff format src/ tests/
```

## License

This project is licensed under the MIT License.

## Contact

For questions, suggestions, or professional inquiries, contact:

- GitHub issues: [GitHub Issues](https://github.com/whitris/open-rag-bot/issues)
- Email:
- LinkedIn: [My LinkedIn Profile](https://www.linkedin.com/in/nicola-marcantognini/)

Feel free to open an issue or reach out directly!

## Credits

Maintained by [Whitris](https://github.com/Whitris).

This project makes use of:
- [Streamlit](https://streamlit.io/) for the web interface
- [ChromaDB](https://www.trychroma.com/) for vector search and storage
- [Typer](https://typer.tiangolo.com/) for the CLI
- [OpenAI](https://platform.openai.com/) APIs for LLM and embeddings

## FAQ & Troubleshooting

**Q:** The app can’t find my ChromaDB index, or its answers do not seem to use my data.
**A:** Check that `COLLECTION_DIR` and `COLLECTION_NAME` in `.env` match the values you used when generating the embeddings.

**Q:** I get a “Missing API key” error.
**A:** Set the API key for the selected provider (for example `OPENAI_API_KEY`) in your `.env` file.
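
For the first question, a quick check of the collection settings can narrow down the cause. The helper below is a hypothetical sketch, not part of the project: it only inspects the two environment variables and the directory on disk, and does not open the ChromaDB index itself.

```python
import os
from collections.abc import Mapping
from pathlib import Path


def diagnose_collection(env: Mapping[str, str]) -> list[str]:
    """Return human-readable findings about the collection settings."""
    findings: list[str] = []
    name = env.get("COLLECTION_NAME")
    directory = env.get("COLLECTION_DIR")
    if not name:
        findings.append("COLLECTION_NAME is not set; the default from your settings is used.")
    if not directory:
        findings.append("COLLECTION_DIR is not set; the default from your settings is used.")
    elif not Path(directory).is_dir():
        findings.append(f"COLLECTION_DIR does not exist: {directory}")
    elif not any(Path(directory).iterdir()):
        findings.append(f"COLLECTION_DIR is empty: {directory} (run generate_embeddings.py first)")
    return findings


if __name__ == "__main__":
    # Note: values in .env are only visible here if they were exported
    # into the process environment.
    for line in diagnose_collection(os.environ) or ["No obvious problems found."]:
        print(line)
```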