{"id":29486688,"url":"https://github.com/whitris/open-rag-bot","last_synced_at":"2025-10-07T21:59:18.314Z","repository":{"id":303657553,"uuid":"1016231899","full_name":"Whitris/open-rag-bot","owner":"Whitris","description":"Modular Retrieval-Augmented Generation (RAG) chatbot: ask questions about your own documents and get LLM-powered answers. Web and CLI included.","archived":false,"fork":false,"pushed_at":"2025-07-11T07:27:43.000Z","size":138,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-11T10:32:35.654Z","etag":null,"topics":["chatbot","groq","llm","openai","python","rag","streamlit","vector-search"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Whitris.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-08T17:29:32.000Z","updated_at":"2025-07-11T07:27:46.000Z","dependencies_parsed_at":"2025-07-11T10:34:11.135Z","dependency_job_id":null,"html_url":"https://github.com/Whitris/open-rag-bot","commit_stats":null,"previous_names":["whitris/open-rag-chatbot","whitris/open-rag-bot"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/Whitris/open-rag-bot","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Whitris%2Fopen-rag-bot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Whitris%2Fopen-rag-bot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Whitris%2Fopen-rag-bot/r
eleases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Whitris%2Fopen-rag-bot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Whitris","download_url":"https://codeload.github.com/Whitris/open-rag-bot/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Whitris%2Fopen-rag-bot/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265419628,"owners_count":23761846,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatbot","groq","llm","openai","python","rag","streamlit","vector-search"],"created_at":"2025-07-15T08:01:35.284Z","updated_at":"2025-10-07T21:59:18.308Z","avatar_url":"https://github.com/Whitris.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Open RAG Bot\n\n![Python versions](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)\n![MIT License](https://img.shields.io/badge/license-MIT-green.svg)\n![CI](https://github.com/Whitris/open-rag-bot/actions/workflows/ci.yml/badge.svg)\n\nOpen RAG Bot is a modular framework for building document-aware chatbots powered by Retrieval-Augmented Generation (RAG). 
Search, chat, and extract insights from your own files with state-of-the-art LLMs, via CLI or Streamlit webapp.\n\n## Features\n\n- **Retrieval-Augmented Generation (RAG):** Answers are grounded in your own documents, not just LLM training data.\n- **Flexible document ingestion:** Easily add your PDFs, DOCX, or text files to the knowledge base.\n- **Fast search:** Uses [ChromaDB](https://www.trychroma.com/) for efficient vector search and retrieval with metadata support.\n- **Two interfaces:** Use from the command line (CLI) or via a modern Streamlit webapp.\n- **Tested and extensible:** Includes unit tests and a clean codebase for easy customization.\n\n## Folder structure\n\n```text\nopen-rag-bot/\n│\n├── src/\n│   └── open_rag_bot/\n│       ├── app/         # Streamlit webapp\n│       ├── config/      # Settings, .env, config files\n│       ├── core/        # Core logic: retrieval, prompt, chatbot\n│       ├── data/        # Data processing scripts and utilities\n│       └── services/    # LLM \u0026 embedding clients\n│\n├── tests/               # Unit and integration tests\n├── data/                # Local, non-versioned data (indexed files, outputs)\n├── requirements.txt\n├── pyproject.toml\n├── CONTRIBUTING.md\n├── README.md\n├── LICENSE\n├── .gitignore\n└── .pre-commit-config.yaml\n```\n\n## Quickstart\n\n### 1. Clone the repository\n\n```bash\ngit clone https://github.com/\u003cyour-username\u003e/open-rag-bot.git\ncd open-rag-bot\n```\n\n### 2. Install dependencies (recommended: PDM)\n\n```bash\npip install pdm\npdm install\n```\n\nIf you prefer classic pip:\n\n```bash\npip install -r requirements.txt\n```\n\n### 3. Configure environment\nCopy `./.env.example` to `./.env` and add your API keys (see instructions in the next section).\n\n### 4. Prepare your documents\nPlace your PDF, DOCX or text files in a folder of your choice.\n\n### 5. 
Process your documents\n\nExtracts text from documents, splits it into chunks, and saves the result as a CSV file.\n\n```bash\npdm run python process_data.py [OPTIONS] INPUT_DIR OUTPUT_CSV_PATH\n```\n\n#### Required Arguments\n\n- INPUT_DIR: Directory containing the documents to process (PDF, TXT, DOCX, etc.).\n- OUTPUT_CSV_PATH: Path to the CSV file where the text chunks are to be stored.\n\n#### Options\n\n- `--chunk-size INTEGER`: Size of each text chunk, in characters. Default: 500.\n- `--max-files INTEGER`: Maximum number of files to process. Default: -1, meaning all files.\n- `--formats TEXT`: File formats to include, comma-separated. Default: pdf,txt,docx.\n\n#### Example usage\n\n```bash\npdm run python process_data.py ./documents ./data/example.csv --chunk-size 1000 --max-files 100 --formats pdf,txt\n```\n\n#### Note\n\n- If the input directory does not exist or no matching files are found, the script exits with an error message.\n\n- Supported formats are controlled by the --formats option (case-insensitive, e.g., pdf,docx,txt).\n\n### 6. Generate embeddings\n\nCreates a vector index of the text chunks using embeddings and stores the collection in ChromaDB.\n\n```bash\npdm run python generate_embeddings.py [OPTIONS] CSV_PATH\n```\n\n#### Required Arguments\n\n- CSV_PATH: Path to the CSV file containing text chunks generated by the previous step.\n\n#### Optional Arguments\n\n- --collection-dir: Directory to store the ChromaDB index (default: as defined in your settings).\n- --collection-name: Name of the ChromaDB collection (default: as defined in your settings).\n\n#### Example usage\n\n```bash\npdm run python generate_embeddings.py ./data/example.csv --collection-dir ./data/example --collection-name example_collection\n```\n\n### 7. 
Start the chatbot\n\nCommand-line interface (CLI)\n```bash\npdm run python cli.py [OPTIONS]\n```\n\n#### Optional Arguments\n\n- --collection-dir: Directory to read the ChromaDB index (default: as defined in your settings).\n- --collection-name: Name of the ChromaDB collection (default: as defined in your settings).\n- --verbose: Enable verbose (DEBUG) logging.\n\n#### Example usage\n\n```bash\npdm run python cli.py --collection-dir ./data/example --collection-name example_collection --verbose\n```\n\nWebapp\n```bash\npdm run streamlit run src/open_rag_bot/app/main.py\n```\n\nAll the options for the webapp are read from your environment. Make sure the `COLLECTION_DIR` and `COLLECTION_NAME` environment variables are set to the same values you used to generate the embeddings.\n\n## Environment variables\n\nAll sensitive settings (such as API keys and provider selection) are managed via environment variables. Most defaults can be set or overridden in `.env`.\n\n1. Copy `.env.example` to the project root as `.env`:\n\n    ```bash\n    cp ./.env.example ./.env\n    ```\n\n2. Edit `.env` to add your credentials and adjust preferences as needed.\n\n### Main variables\n| Variable             | Description                                 | Example       |\n|----------------------|---------------------------------------------|-------------- |\n| OPENAI_API_KEY       | Your OpenAI API key                         | sk-...        
|\n| EMBEDDING_PROVIDER   | Embedding provider to use                   | openai        |\n| EMBEDDING_MODEL      | Model name for embedding tasks              | text-embedding-3-large  |\n| LLM_PROVIDER         | LLM provider to use                         | openai        |\n| SMALL_LLM_MODEL      | Model name for fast/re-writing tasks        | gpt-4.1-nano  |\n| LLM_MODEL            | Model name for main answer generation       | gpt-4.1       |\n| COLLECTION_DIR       | Directory for the Chroma index              | data/index    |\n| COLLECTION_NAME      | Name for the Chroma collection              | my_collection |\n\n#### Note:\n\n- If a provider is selected, the corresponding API key must be set. The application will fail if the required key is missing.\n- If you do not specify a model, the code will select a default based on the chosen provider.\n\n### Example\n\n```env\nOPENAI_API_KEY=sk-...\nEMBEDDING_PROVIDER=openai\nEMBEDDING_MODEL=text-embedding-3-large\nLLM_PROVIDER=openai\nSMALL_LLM_MODEL=gpt-4.1-nano\nLLM_MODEL=gpt-4.1-mini\nCOLLECTION_DIR=data/collection\nCOLLECTION_NAME=default\n```\n\n## Test and development\n\nThis project includes unit and integration tests to ensure correctness and maintainability.\n\n### Running tests\n\nTo run all tests with [pytest](https://pytest.org/):\n\n```bash\npdm run pytest\n```\n\nor, if using classic pip:\n\n```bash\npytest\n```\n\n### Pre-commit hooks\n\nThis project uses [pre-commit](https://pre-commit.com/) to automate code formatting and linting via [Ruff](https://docs.astral.sh/ruff/).\n\nTo enable automatic checks before every commit, install pre-commit and set up the hooks:\n\n```bash\npdm add pre-commit\npdm run pre-commit install\n```\n\nAfter setup, every commit will automatically run ruff and ruff-format on your codebase.\n\n- Configuration is in `.pre-commit-config.yaml`\n- You can manually run all hooks on all files with:\n\n```bash\npdm run pre-commit run --all-files\n```\n\n## Contributing 
Guidelines\n\nWe welcome contributions, suggestions, and pull requests!\nPlease follow these guidelines to keep the codebase clean, maintainable, and friendly for everyone.\n\n### How to contribute\n\n1. Fork the repository and create a new branch for your feature or fix.\n2. Write clear, concise code following project conventions (see Code Style below).\n3. Add tests for new features or bug fixes, when appropriate.\n4. Run all tests before submitting.\n5. Run code style checks and apply formatting.\n6. Use pre-commit hooks (recommended).\n7. Open a pull request with a clear title and description. Link related issues if relevant.\n\n### Code style\n\n- This project uses [Ruff](https://docs.astral.sh/ruff/) to enforce Python code style and linting.\n- Please follow [PEP8](https://peps.python.org/pep-0008/) guidelines where possible.\n- Organize imports and use descriptive names for functions, classes, and variables.\n\nTo check code style and run linting on the src/ and tests/ directories:\n\n```bash\npdm run ruff check src/ tests/\n```\n\nFor auto-formatting, you can also run:\n\n```bash\npdm run ruff format src/ tests/\n```\n\n## License\n\nThis project is licensed under the MIT License.\n\n## Contact\n\nFor questions, suggestions, or professional inquiries, contact:\n\n- GitHub issues: [GitHub Issues](https://github.com/whitris/open-rag-bot/issues)\n- Email: \u003cnicola.marcantognini@outlook.com\u003e\n- LinkedIn: [My LinkedIn Profile](https://www.linkedin.com/in/nicola-marcantognini/)\n\nFeel free to open an issue or reach out directly!\n\n## Credits\n\nMaintained by [Whitris](https://github.com/Whitris).\n\nThis project makes use of:\n- [Streamlit](https://streamlit.io/) for the web interface\n- [ChromaDB](https://www.trychroma.com/) for vector search and storage\n- [Typer](https://typer.tiangolo.com/) for the CLI\n- [OpenAI](https://platform.openai.com/) APIs for LLM and embeddings\n\n## FAQ \u0026 Troubleshooting\n\n**Q:** The app can’t find my ChromaDB 
index or does not seem to answer from my data.\n**A:** Double-check that `COLLECTION_DIR` and `COLLECTION_NAME` in `.env` match the values used in the previous steps.\n\n**Q:** “Missing API key” error?\n**A:** Set the API key for your selected provider in your `.env` file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwhitris%2Fopen-rag-bot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwhitris%2Fopen-rag-bot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwhitris%2Fopen-rag-bot/lists"}