https://github.com/floerianc/athena

A RAG package written entirely in Python.
ai chatgpt chatgpt-api chromadb openai openai-api package processing rag rag-chatbot vector-database

Athena Logo


The icon was generated with AI.
It is only a placeholder and will be replaced with real art.


Athena



Wisdom for your data.


A RAG (retrieval-augmented generation) pipeline built from scratch in Python. Ingest large JSON, PDF, Markdown, or TXT files and let Athena work its magic.

---

## 🚀 Features

- **🔗 Vector-based Database**
  - Sanitizes input and metadata to ensure correct upserting
  - Custom embedding vector generation (not reliant on ChromaDB)
  - Split or normalize text (by blank lines, newlines, or fixed-size chunks)
  - Embed chunks into ChromaDB for ultra-fast semantic lookup
  - Careful deletion process
- **🔍 Smart Search & Retrieval**
  - Highlight query terms in returned documents
  - Filter results by tokens, metadata & distance thresholds
  - Cap output by tokens for cost control
  - Multiple helper methods for developers
- **🤖 AI Pipeline**
  - Convert `QueryResults` + user query into a single, structured prompt
  - Full support for JSON, plain-text & Markdown outputs
  - Configurable `max_tokens` for both input & output
  - Optional structured output with custom schema
  - Improved text extraction from responses
- **🧠 AI Memory**
  - Dedicated memory component
  - Shortens past prompts for efficient token usage
  - Separate vector database for retrieving relevant past queries and responses
  - Offers fallbacks and other helper methods
- **💻 CLI**
  - Pretty CLI design
  - Custom stylesheets (ColorProfiles), progress bars and progress messages
  - Accepts any supported input file type and an optional schema path
- **⚙️ Extensive Processor**
  - Normalize document lengths for uniform chunks
  - Parser for TXT, Markdown and PDF files:
    - Parsing by newline
    - Parsing by blank lines
    - Parsing by chunks
  - Serializer to convert internal objects to human-readable JSON/dicts
  - Validator to validate input data for the parser
- **📊 Benchmarking**
  - Automatically log system specs, input sizes, timings & memory
  - CLI-friendly display and extensive JSON export of every run
- **🛜 Streamlit App**
  - Full-featured Streamlit app with four pages:
    - Simple chatbot with custom input file, schema and log view
    - Search engine implementation to visualize the main ChromaDB database
    - Processor overview to see how the Processor processes the input data
    - Config overview
- **⚙️ Rich Configuration** via `Config` (models, parsing modes, memory limits, embedding, search engine configs...)
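
The three parsing modes listed above (by newline, by blank lines, by fixed-size chunks) can be illustrated with a minimal sketch. The function names and the default chunk size below are assumptions made for illustration and are not taken from Athena's actual API.

```python
import re


def split_by_newline(text: str) -> list[str]:
    """One chunk per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_by_blank_lines(text: str) -> list[str]:
    """One chunk per paragraph; paragraphs are separated by blank lines."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


def split_by_chunks(text: str, size: int = 200) -> list[str]:
    """Fixed-size character windows; the last chunk may be shorter."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


sample = "First line\nSecond line\n\nNew paragraph"
print(split_by_newline(sample))
print(split_by_blank_lines(sample))
print(split_by_chunks(sample, size=16))
```

Splitting by blank lines keeps paragraphs intact, while fixed-size windows give uniform chunk lengths at the cost of cutting through sentences.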

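The distance threshold and token cap mentioned under Smart Search & Retrieval could work roughly as sketched here. The `Hit` class, `select_hits`, and the word-count token estimate are hypothetical stand-ins, not Athena's own types or tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical search hit; not Athena's actual result type."""
    text: str
    distance: float  # smaller means closer to the query


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def select_hits(hits: list[Hit], max_distance: float, token_budget: int) -> list[Hit]:
    """Keep the closest hits within max_distance that fit into the token budget."""
    selected: list[Hit] = []
    used = 0
    for hit in sorted(hits, key=lambda h: h.distance):
        if hit.distance > max_distance:
            break  # sorted ascending, so every remaining hit is too far away
        cost = estimate_tokens(hit.text)
        if used + cost > token_budget:
            continue  # this hit does not fit; a shorter one may still fit
        selected.append(hit)
        used += cost
    return selected
```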
---

## 📦 Installation

```bash
git clone https://github.com/floerianc/athena.git
cd athena
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## 🔮 Roadmap

| Priority | Task |
| ------------- | --------------------------------------------------- |
| **Very High** | • Improve stability of core components |
| **High** | • Better error handling |
| | • Token calculation for input `max_tokens` |
| **Mid** | • Code cleanup |
| | • Create unit tests |

---

## 📄 License

[GPLv3](LICENSE) © 2025 Floerianc <3