Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dcarpintero/athena
Scientific Research Assistant built with LLMs, Retrieval Augmented Generation, and Semantic Search.
- Host: GitHub
- URL: https://github.com/dcarpintero/athena
- Owner: dcarpintero
- License: apache-2.0
- Created: 2023-11-10T10:51:22.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-07-06T20:34:53.000Z (7 months ago)
- Last Synced: 2025-01-20T22:53:41.412Z (11 days ago)
- Topics: cohere, cohere-ai, embedding-vectors, langchain, large-language-models, prompt-engineering, python, retrieval-augmented-generation, semantic-search, streamlit, weaviate
- Language: Python
- Homepage: https://athena-research.streamlit.app/
- Size: 3.71 MB
- Stars: 5
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
[![Open_inStreamlit](https://img.shields.io/badge/Open%20In-Streamlit-red?logo=Streamlit)](https://athena-research.streamlit.app/)
[![Python](https://img.shields.io/badge/python-%203.8-blue.svg)](https://www.python.org/)
[![License](https://img.shields.io/badge/Apache-2.0-green.svg)](https://github.com/dcarpintero/athena/blob/main/LICENSE)

# 🦉 Athena - Research Companion
Athena is an AI-Assist prototype powered by [Cohere-AI](https://cohere.com/) and [Embed-v3](https://txt.cohere.com/introducing-embed-v3/) to facilitate scientific research. Its key differentiating features include:
- **Advanced Semantic Search**: Outperforms traditional keyword searches with state-of-the-art embeddings, offering a more nuanced and effective data retrieval experience that understands the complex nature of scientific queries.
- **Human-AI Collaboration**: Enables easier review of research literature, highlighting key topics, and augmenting human understanding.
- **Admin Support**: Provides assistance with tasks such as categorization of research articles, e-mail drafting, and tweet generation.

## 📚 Overview
### Data Pipeline
As part of this project, we have created two datasets of 50,000 arXiv articles related to AI and NLP using [Cohere Embedv3](https://txt.cohere.com/introducing-embed-v3/):
- [https://huggingface.co/datasets/dcarpintero/arXiv.cs.AI.CL.CV.LG.MA.NE.embedv3](https://huggingface.co/datasets/dcarpintero/arXiv.cs.AI.CL.CV.LG.MA.NE.embedv3)
- [https://huggingface.co/datasets/dcarpintero/arXiv.cs.CL.embedv3](https://huggingface.co/datasets/dcarpintero/arXiv.cs.CL.embedv3)

Steps:
1) Retrieve Articles' Metadata from arXiv. See [./data_pipeline/retrieve_arxiv.py](./data_pipeline/retrieve_arxiv.py)
2) Embed Articles' Title and Abstract using Embedv3. See [./data_pipeline/embed_arxiv.py](./data_pipeline/embed_arxiv.py)
3) Store Articles' Metadata and Embeddings in Weaviate. See [./data_pipeline/index_arxiv.py](./data_pipeline/index_arxiv.py)

### Prompt Templates, Output Formatting, and Validation
Some of our tasks, such as enriching abstracts with Wikipedia links, crafting a glossary, composing e-mails, and tweeting, rely on a set of:
- [Prompt Templates](./prompts/athena.toml)

These prompts are then composed into LangChain chains, as in the following code snippets:
- [Enrich Abstract](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/coral.py#L130-L150)
- [Keywords](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/coral.py#L153-L173)
- [E-mail Drafting w/ JSON Formatting](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/coral.py#L100-L127)
- [Tweet Generation w/ JSON Formatting](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/coral.py#L74-L97) and [Pydantic Validation](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/coral.py#L17-L28)

### Weaviate Schema
See [ArxivArticle](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/data_pipeline/index_arxiv.py#L12-L116) Class.
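
For orientation, a Weaviate class definition for such articles might look roughly like the sketch below. The property names and the helper `to_weaviate_object` are illustrative assumptions, not the actual schema from `index_arxiv.py`. Setting `"vectorizer": "none"` reflects that the embeddings are computed beforehand with Embedv3 and supplied alongside each object.

```python
# Hypothetical sketch of a Weaviate class definition; the real schema
# lives in data_pipeline/index_arxiv.py and may differ.
ARXIV_ARTICLE_CLASS = {
    "class": "ArxivArticle",
    "description": "arXiv article metadata with a precomputed embedding",
    # Vectors are supplied by the pipeline, so no vectorizer module is used.
    "vectorizer": "none",
    "properties": [
        {"name": "arxiv_id", "dataType": ["text"]},
        {"name": "title", "dataType": ["text"]},
        {"name": "abstract", "dataType": ["text"]},
        {"name": "authors", "dataType": ["text[]"]},
        {"name": "published", "dataType": ["date"]},
    ],
}


def to_weaviate_object(article: dict, embedding: list) -> tuple:
    """Split a record into (properties, vector), rejecting unknown fields."""
    allowed = {prop["name"] for prop in ARXIV_ARTICLE_CLASS["properties"]}
    unknown = set(article) - allowed
    if unknown:
        raise ValueError(f"properties not in schema: {sorted(unknown)}")
    return dict(article), [float(x) for x in embedding]
```

Checking records against the declared properties before import is one way to catch mismatches between the pipeline output and the schema early.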
### Cohere Engine
The [coral.py](./coral.py) class provides an abstraction layer over Cohere endpoints.
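
To illustrate the general shape of such an abstraction layer, here is a minimal sketch that fills a prompt template, calls a text-generation function, and validates the JSON reply. Everything in it is hypothetical: `TWEET_PROMPT`, `Tweet`, `TweetEngine`, and `fake_llm` are illustrative names, the `generate` callable stands in for a Cohere endpoint, and a standard-library dataclass replaces the Pydantic model used in the repository.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical template; the project's real prompts live in prompts/athena.toml.
TWEET_PROMPT = (
    "Write a tweet about the article titled '{title}'. "
    'Reply as JSON: {{"text": "..."}}'
)


@dataclass
class Tweet:
    """Validated tweet payload (stdlib stand-in for a Pydantic model)."""
    text: str

    def __post_init__(self) -> None:
        if not isinstance(self.text, str) or not self.text.strip():
            raise ValueError("tweet text must be a non-empty string")
        if len(self.text) > 280:
            raise ValueError("tweet text exceeds 280 characters")


class TweetEngine:
    """Thin wrapper: prompt in, validated object out."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        # `generate` is any function mapping a prompt to raw model text.
        self._generate = generate

    def tweet(self, title: str) -> Tweet:
        prompt = TWEET_PROMPT.format(title=title)
        raw = self._generate(prompt)
        data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
        return Tweet(text=data["text"])


def fake_llm(prompt: str) -> str:
    # Canned reply so the sketch runs without network access or API keys.
    return json.dumps({"text": "New paper on semantic search!"})


engine = TweetEngine(generate=fake_llm)
print(engine.tweet("Semantic Search with Embeddings").text)
```

Injecting the `generate` callable keeps the wrapper testable without credentials; in a real deployment it would delegate to the model client.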
### Streamlit App
See [app.py](./app.py)
## 🚀 Quickstart
1. Clone the repository:

```
git clone git@github.com:dcarpintero/athena.git
```

2. Create and Activate a Virtual Environment:

```
Windows:
py -m venv .venv
.venv\scripts\activate

macOS/Linux:
python3 -m venv .venv
source .venv/bin/activate
```

3. Install dependencies:

```
pip install -r requirements.txt
```

4. Run Data Pipeline (optional)

```
python retrieve_arxiv.py
python embed_arxiv.py
python index_arxiv.py
```

5. Launch Web Application

```
streamlit run ./app.py
```

## 🔗 References
- [arXiv](https://arxiv.org/)
- [Embed-v3](https://txt.cohere.com/introducing-embed-v3/)
- [LangChain](https://langchain.com)
- [Weaviate Vector Search](https://weaviate.io/developers/weaviate/search/similarity/)