https://github.com/floerianc/athena

A RAG package written entirely in Python.
ai chatgpt chatgpt-api chromadb openai openai-api package processing rag rag-chatbot vector-database

Host: GitHub
URL: https://github.com/floerianc/athena
Owner: Floerianc
License: gpl-3.0
Created: 2025-07-14T21:48:48.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-08-18T17:10:03.000Z (10 months ago)
Last Synced: 2025-08-18T18:36:55.714Z (10 months ago)
Topics: ai, chatgpt, chatgpt-api, chromadb, openai, openai-api, package, processing, rag, rag-chatbot, vector-database
Language: Python
Homepage:
Size: 918 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Athena Logo

The icon was generated with AI.
It is only a placeholder and will be replaced with real art.

# Athena

Wisdom for your data.

A RAG model built from scratch in Python. Ingest large JSON, PDF, Markdown, or TXT files and let Athena work its magic.

---

## 🚀 Features

- **🔗 Vector‑based Database**
  - Sanitizes input and metadata to ensure correct upserting
  - Custom embedding vector generation (not reliant on ChromaDB)
  - Split or normalize text (by blank lines, newlines, or fixed‐size chunks)
  - Embed chunks into ChromaDB for fast semantic lookup
  - Careful deletion process
- **🔍 Smart Search & Retrieval**
  - Highlight query terms in returned documents
  - Filter results by tokens, metadata & distance thresholds
  - Cap output by tokens for cost control
  - Multiple helper methods for developers
- **🤖 AI Pipeline**
  - Convert `QueryResults` + user query into a single, structured prompt
  - Full support for JSON, plain‑text & Markdown outputs
  - Configurable `max_tokens` for both input & output
  - Optional structured output with custom schema
  - Improved text extraction from responses
- **🧠 AI Memory**
  - Dedicated memory component
  - Shortens past prompts for efficient token usage
  - Separate vector database for retrieving relevant past queries and responses
  - Offers fallbacks and other helper methods
- **💻 CLI**
  - Pretty CLI design
  - Custom stylesheets (`ColorProfiles`), progress bars and progress messages
  - Accepts any supported input file type and an optional schema path
- **⚙️ Extensive Processor**
  - Normalize document lengths for uniform chunks
  - Parser for TXT, Markdown and PDF files:
    - Parsing by newline
    - Parsing by blank lines
    - Parsing by chunks
  - Serializer to convert internal objects to human-readable JSON/dicts
  - Validator that checks input data before it reaches the parser
- **📊 Benchmarking**
  - Automatically log system specs, input sizes, timings & memory
  - CLI‑friendly display and extensive JSON export of every run
- **🛜 Streamlit App**
  - Full-featured Streamlit app with four pages:
    - Simple chatbot with custom input file, schema and log view
    - Search engine implementation to visualize the main ChromaDB database
    - Processor overview to see how the Processor processes the input data
    - Config overview
- **⚙️ Rich Configuration** via `Config` (models, parsing modes, memory limits, embedding, search engine configs...)
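
The three parsing modes listed above (blank lines, newlines, fixed-size chunks) can be illustrated with a minimal sketch. This is not Athena's actual parser: the function names `split_by_blank_lines`, `split_by_newlines` and `split_by_chunks` are hypothetical and only show the general idea using the standard library.

```python
import re


def split_by_blank_lines(text: str) -> list[str]:
    """Split text into paragraphs separated by one or more blank lines."""
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]


def split_by_newlines(text: str) -> list[str]:
    """Split text into non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def split_by_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical sample input, used only for illustration.
sample = "Alpha line one\nAlpha line two\n\nBeta line one"
paragraphs = split_by_blank_lines(sample)
lines = split_by_newlines(sample)
chunks = split_by_chunks("abcdefgh", 3)
```

Blank-line splitting tends to keep related sentences together, while fixed-size chunks give uniform lengths at the cost of cutting through sentences. Which mode fits best depends on the input file.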

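Filtering by a distance threshold and capping output by tokens, as listed under Smart Search & Retrieval, could look roughly like the sketch below. Everything here is an assumption for illustration: the `Hit` class, `approx_tokens` and `select_hits` are hypothetical names, and token counts are approximated by whitespace splitting rather than a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical search hit: a document and its distance to the query."""
    text: str
    distance: float  # smaller means closer to the query


def approx_tokens(text: str) -> int:
    """Rough token estimate (assumption: one token per whitespace-separated word)."""
    return len(text.split())


def select_hits(hits: list[Hit], max_distance: float, max_tokens: int) -> list[Hit]:
    """Keep the closest hits within max_distance that fit the token budget."""
    selected: list[Hit] = []
    used = 0
    for hit in sorted(hits, key=lambda h: h.distance):
        if hit.distance > max_distance:
            break  # hits are sorted, so all remaining ones are too far away
        cost = approx_tokens(hit.text)
        if used + cost > max_tokens:
            continue  # skip hits that do not fit; a smaller one may still fit
        selected.append(hit)
        used += cost
    return selected


hits = [Hit("a b c", 0.2), Hit("d e f g h", 0.1), Hit("i j", 0.9), Hit("k", 0.3)]
kept = select_hits(hits, max_distance=0.5, max_tokens=6)
kept_texts = [hit.text for hit in kept]
```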
---

## 📦 Installation

```bash
git clone https://github.com/floerianc/athena.git
cd athena
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## 🔮 Roadmap

| Priority      | Task                                       |
| ------------- | ------------------------------------------ |
| **Very High** | • Improve stability of core components     |
| **High**      | • Better error handling                    |
|               | • Token calculation for input `max_tokens` |
| **Mid**       | • Code cleanup                             |
|               | • Create unit tests                        |

---

## 📄 License

Athena is licensed under GPL-3.0. See the `LICENSE` file for details.
