https://github.com/dcarpintero/athena

Scientific Research Assistant built with LLMs, Retrieval Augmented Generation, and Semantic Search.



Host: GitHub
URL: https://github.com/dcarpintero/athena
Owner: dcarpintero
License: apache-2.0
Created: 2023-11-10T10:51:22.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-07-06T20:34:53.000Z (about 1 year ago)
Last Synced: 2025-04-13T15:04:40.526Z (6 months ago)
Topics: cohere, cohere-ai, embedding-vectors, langchain, large-language-models, prompt-engineering, python, retrieval-augmented-generation, semantic-search, streamlit, weaviate
Language: Python
Homepage: https://athena-research.streamlit.app/
Size: 3.71 MB
Stars: 5
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


[![Open_inStreamlit](https://img.shields.io/badge/Open%20In-Streamlit-red?logo=Streamlit)](https://athena-research.streamlit.app/)

[![Python](https://img.shields.io/badge/python-%203.8-blue.svg)](https://www.python.org/)

[![License](https://img.shields.io/badge/Apache-2.0-green.svg)](https://github.com/dcarpintero/athena/blob/main/LICENSE)

# 🦉 Athena - Research Companion



  



Athena is an AI-assistant prototype powered by [Cohere-AI](https://cohere.com/) and [Embed-v3](https://txt.cohere.com/introducing-embed-v3/) to facilitate scientific research. Its key differentiating features include:

- **Advanced Semantic Search**: Uses Embed-v3 embeddings to match queries and articles by meaning rather than by exact keywords, so retrieval copes better with the varied terminology of scientific queries.

- **Human-AI Collaboration**: Enables easier review of research literature, highlighting key topics, and augmenting human understanding.

- **Admin Support**: Provides assistance with tasks such as categorization of research articles, e-mail drafting, and tweet generation.
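
To make the semantic-search idea concrete, here is a minimal sketch of ranking articles by cosine similarity between embedding vectors. The three-dimensional vectors, the article ids, and the `rank_by_similarity` helper are illustrative assumptions, not code from this repository; real Embed-v3 vectors are much larger, and this README points to Weaviate for the actual vector search.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as unrelated
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: List[float], docs: Dict[str, List[float]], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy "embeddings" (hypothetical values, for illustration only).
toy_docs = {
    "transformers-survey": [0.9, 0.1, 0.0],
    "protein-folding": [0.0, 0.2, 0.9],
    "attention-mechanisms": [0.8, 0.3, 0.1],
}
toy_query = [1.0, 0.2, 0.0]

ranking = rank_by_similarity(toy_query, toy_docs, top_k=2)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

A keyword search would need the query and the article to share words; here the ranking depends only on how close the vectors are.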

## 📚 Overview



  



### Data Pipeline

As part of this project we have created two datasets of 50,000 arXiv articles related to AI and NLP, embedded with [Cohere Embed-v3](https://txt.cohere.com/introducing-embed-v3/):

- [https://huggingface.co/datasets/dcarpintero/arXiv.cs.AI.CL.CV.LG.MA.NE.embedv3](https://huggingface.co/datasets/dcarpintero/arXiv.cs.AI.CL.CV.LG.MA.NE.embedv3)

- [https://huggingface.co/datasets/dcarpintero/arXiv.cs.CL.embedv3](https://huggingface.co/datasets/dcarpintero/arXiv.cs.CL.embedv3)

Steps:

1) Retrieve article metadata from arXiv. See [./data_pipeline/retrieve_arxiv.py](./data_pipeline/retrieve_arxiv.py)

2) Embed each article's title and abstract using Embed-v3. See [./data_pipeline/embed_arxiv.py](./data_pipeline/embed_arxiv.py)

3) Store article metadata and embeddings in Weaviate. See [./data_pipeline/index_arxiv.py](./data_pipeline/index_arxiv.py)
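
The embedding step can be sketched roughly as follows. This is not the code in `embed_arxiv.py`: the way title and abstract are joined, the batch size of 96, the record layout, and the `embed_fn` callable are all assumptions made for illustration. In a real run, `embed_fn` would wrap a call to the Cohere embed endpoint; here a stand-in embedder keeps the example runnable offline.

```python
from typing import Callable, Dict, Iterable, List

# Assumed batch size; check the embedding API's per-request limit before relying on it.
BATCH_SIZE = 96

EmbedFn = Callable[[List[str]], List[List[float]]]


def to_embedding_text(article: Dict[str, str]) -> str:
    """Join title and abstract into the single string that gets embedded."""
    return f"{article['title'].strip()}\n{article['abstract'].strip()}"


def batched(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_articles(
    articles: List[Dict[str, str]], embed_fn: EmbedFn, batch_size: int = BATCH_SIZE
) -> List[Dict[str, object]]:
    """Return a copy of each article with its embedding under the 'vector' key."""
    texts = [to_embedding_text(article) for article in articles]
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))  # one API call per batch in a real run
    return [{**article, "vector": vector} for article, vector in zip(articles, vectors)]


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in embedder (hypothetical): two crude numeric features per text."""
    return [[float(len(text)), float(text.count(" "))] for text in texts]


articles = [
    {"id": "0001", "title": "A Toy Title", "abstract": "A toy abstract."},
    {"id": "0002", "title": "Another Title", "abstract": "Another toy abstract."},
    {"id": "0003", "title": "Third", "abstract": "Short."},
]
records = embed_articles(articles, fake_embed, batch_size=2)
print(records[0]["vector"])
```

Passing the embedder in as a function keeps the batching logic testable without an API key.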

### Prompt Templates, Output Formatting, and Validation

Some of our tasks, such as enriching abstracts with Wikipedia links, crafting a glossary, composing e-mails, and generating tweets, rely on a set of:

- [Prompt Templates](./prompts/athena.toml)

Those prompts are then composed into a LangChain chain as in the following code snippets:

- [Enrich Abstract](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/coral.py#L130-L150)

- [Keywords](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/coral.py#L153-L173)

- [E-mail Drafting w/ JSON Formatting](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/coral.py#L100-L127)

- [Tweet Generation w/ JSON Formatting](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/coral.py#L74-L97) and [Pydantic Validation](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/coral.py#L17-L28)
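
As a rough illustration of the template, JSON formatting, and validation flow, the sketch below fills a prompt template, parses a simulated model reply as JSON, and validates it. It does not reproduce `coral.py`: the template text, the `Tweet` fields, and the 280-character limit are assumptions, and a standard-library dataclass stands in for the Pydantic model so the example runs without extra dependencies.

```python
import json
from dataclasses import dataclass

# Hypothetical template; the project's real templates live in prompts/athena.toml.
TWEET_PROMPT = (
    "Write a tweet announcing the article titled '{title}'. "
    "Reply with a JSON object containing a single key 'text'."
)

MAX_TWEET_LENGTH = 280  # assumed limit for this sketch


@dataclass
class Tweet:
    text: str

    def __post_init__(self) -> None:
        # Validation runs on construction, similar in spirit to a Pydantic model.
        if not isinstance(self.text, str) or not self.text.strip():
            raise ValueError("tweet text must be a non-empty string")
        if len(self.text) > MAX_TWEET_LENGTH:
            raise ValueError(f"tweet text exceeds {MAX_TWEET_LENGTH} characters")


def parse_tweet(raw_output: str) -> Tweet:
    """Parse raw model output as JSON and validate it as a Tweet."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict) or "text" not in payload:
        raise ValueError("model output is missing the 'text' field")
    return Tweet(text=payload["text"])


prompt = TWEET_PROMPT.format(title="A Toy Article")
simulated_output = '{"text": "New paper: A Toy Article. #NLP"}'  # stands in for an LLM reply
tweet = parse_tweet(simulated_output)
print(tweet.text)
```

Raising a single exception type for malformed JSON, missing fields, and invalid values lets the caller retry the generation in one place.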

### Weaviate Schema

See [ArxivArticle](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/data_pipeline/index_arxiv.py#L12-L116) Class.
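
For readers unfamiliar with Weaviate class definitions, the sketch below shows the general shape of one as a Python dictionary. Only the class name `ArxivArticle` comes from this project; the description, the property names, and the choice of `"vectorizer": "none"` (vectors supplied by the client rather than computed by Weaviate) are assumptions. The linked `index_arxiv.py` holds the actual definition.

```python
from typing import Any, Dict, List

# Hypothetical class definition; property names are illustrative only.
ARTICLE_CLASS: Dict[str, Any] = {
    "class": "ArxivArticle",
    "description": "arXiv article metadata with a precomputed embedding",
    "vectorizer": "none",  # assumption: embeddings are computed outside Weaviate
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "abstract", "dataType": ["text"]},
        {"name": "url", "dataType": ["text"]},
    ],
}


def property_names(class_definition: Dict[str, Any]) -> List[str]:
    """List the property names declared in a class definition."""
    return [prop["name"] for prop in class_definition["properties"]]


print(property_names(ARTICLE_CLASS))
```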

### Cohere Engine

The [coral.py](./coral.py) module provides an abstraction layer over Cohere endpoints.

### Streamlit App

See [app.py](./app.py)

## 🚀 Quickstart

1. Clone the repository:

```

git clone git@github.com:dcarpintero/athena.git

cd athena

```

2. Create and Activate a Virtual Environment:

```

# Windows

py -m venv .venv

.venv\Scripts\activate

# macOS/Linux

python3 -m venv .venv

source .venv/bin/activate

```

3. Install dependencies:

```

pip install -r requirements.txt

```

4. Run Data Pipeline (optional)

```

cd data_pipeline

python retrieve_arxiv.py

python embed_arxiv.py

python index_arxiv.py

cd ..

```

5. Launch Web Application

```

streamlit run ./app.py

```

## 🔗 References

- [arXiv](https://arxiv.org/)

- [Embed-v3](https://txt.cohere.com/introducing-embed-v3/)

- [LangChain](https://langchain.com)

- [Weaviate Vector Search](https://weaviate.io/developers/weaviate/search/similarity/)
