https://github.com/floerianc/athena
A RAG package written entirely in Python.
- Host: GitHub
- URL: https://github.com/floerianc/athena
- Owner: Floerianc
- License: gpl-3.0
- Created: 2025-07-14T21:48:48.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-08-18T17:10:03.000Z (10 months ago)
- Last Synced: 2025-08-18T18:36:55.714Z (10 months ago)
- Topics: ai, chatgpt, chatgpt-api, chromadb, openai, openai-api, package, processing, rag, rag-chatbot, vector-database
- Language: Python
- Size: 918 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
The icon was generated with AI.
It is only a placeholder and will be replaced with real art.
Athena
Wisdom for your data.
A RAG model built from scratch in Python. Ingest large JSON, PDF, Markdown, or TXT files and let Athena work its magic.
---
## 🚀 Features
- **🔗 Vector-based Database**
  - Sanitizes input and metadata to ensure correct upserting
  - Custom embedding vector generation (not reliant on ChromaDB)
  - Split or normalize text (by blank lines, newlines, or fixed-size chunks)
  - Embed chunks into ChromaDB for fast semantic lookup
  - Careful deletion process
- **🔍 Smart Search & Retrieval**
  - Highlight query terms in returned documents
  - Filter results by tokens, metadata & distance thresholds
  - Cap output by tokens for cost control
  - Multiple helper methods for developers
- **🤖 AI Pipeline**
  - Convert `QueryResults` + user query into a single, structured prompt
  - Full support for JSON, plain-text & Markdown outputs
  - Configurable `max_tokens` for both input & output
  - Optional structured output with custom schema
  - Improved text extraction from responses
- **🧠 AI Memory**
  - Dedicated memory component
  - Shortens past prompts for efficient token usage
  - Separate vector database for retrieving relevant past queries and responses
  - Offers fallbacks and other helper methods
- **💻 CLI**
  - Pretty CLI design
  - Custom stylesheets (ColorProfiles), progress bars and progress messages
  - Accepts any supported input file type and an optional schema path
- **⚙️ Extensive Processor**
  - Normalize document lengths for uniform chunks
  - Extensive parser for TXT, Markdown and PDF files:
    - Parsing by newline
    - Parsing by blank lines
    - Parsing by chunks
  - Serializer to convert internal objects to human-readable JSON/dicts
  - Validator that checks input data before it reaches the parser
- **📊 Benchmarking**
  - Automatically log system specs, input sizes, timings & memory
  - CLI-friendly display and extensive JSON export of every run
- **🛜 Streamlit App**
  - Full-featured Streamlit app with four pages:
    - Simple chatbot with custom input file, schema and log view
    - Search engine implementation to visualize the main ChromaDB database
    - Processor overview to see how the Processor processes the input data
    - Config overview
- **⚙️ Rich Configuration** via `Config` (models, parsing modes, memory limits, embedding, search engine configs...)
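
To illustrate the three parsing modes listed above (by blank lines, by newline, by fixed-size chunks), here is a minimal standalone sketch. It does not use Athena's API: the function names `split_by_blank_lines`, `split_by_newlines` and `split_by_chunks` are hypothetical, and chunk size is assumed to be measured in characters.

```python
import re


def split_by_blank_lines(text: str) -> list[str]:
    """Split text into paragraphs separated by one or more blank lines."""
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]


def split_by_newlines(text: str) -> list[str]:
    """Split text into its individual non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_by_chunks(text: str, size: int) -> list[str]:
    """Split text into fixed-size character chunks; the last may be shorter."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


sample = "First paragraph,\nsecond line.\n\nSecond paragraph."
paragraphs = split_by_blank_lines(sample)
lines = split_by_newlines(sample)
chunks = split_by_chunks(sample, 16)
```

Blank-line splitting keeps related lines together, newline splitting suits line-oriented data, and fixed-size chunks give uniform lengths at the cost of cutting through sentences.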
---
## 📦 Installation
```bash
git clone https://github.com/floerianc/athena.git
cd athena
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
## 🔮 Roadmap
| Priority | Task |
| ------------- | --------------------------------------------------- |
| **Very High** | • Improve stability of core components |
| **High** | • Better error handling |
| | • Token calculation for input `max_tokens` |
| **Mid** | • Code cleanup |
| | • Create unit tests |
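
As a rough idea of what token calculation against a `max_tokens` budget could look like, the sketch below filters search hits by a distance threshold and then keeps the closest ones that fit the budget. It is an assumption-laden illustration, not Athena's implementation: `Hit`, `estimate_tokens` and `cap_by_tokens` are hypothetical names, and whitespace word counts stand in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical search hit: a text chunk plus its vector distance."""
    text: str
    distance: float


def estimate_tokens(text: str) -> int:
    """Crude estimate: whitespace-separated words, not a real tokenizer."""
    return len(text.split())


def cap_by_tokens(hits: list[Hit], max_distance: float, max_tokens: int) -> list[Hit]:
    """Keep the closest hits within max_distance that fit the token budget."""
    kept: list[Hit] = []
    used = 0
    for hit in sorted(hits, key=lambda h: h.distance):
        if hit.distance > max_distance:
            break  # hits are sorted, so all remaining ones are further away
        cost = estimate_tokens(hit.text)
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; a shorter hit may fit
        kept.append(hit)
        used += cost
    return kept


hits = [
    Hit("alpha beta gamma", 0.10),
    Hit("delta epsilon zeta eta", 0.30),
    Hit("theta", 0.35),
    Hit("iota kappa", 0.90),
]
selected = cap_by_tokens(hits, max_distance=0.5, max_tokens=5)
```

A production version would swap `estimate_tokens` for the tokenizer of the target model, since word counts can differ noticeably from real token counts.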
---
## 📄 License
[GPLv3](LICENSE) © 2025 Floerianc <3