{"id":21170204,"url":"https://github.com/dcarpintero/athena","last_synced_at":"2025-04-13T15:04:54.410Z","repository":{"id":207853905,"uuid":"717007614","full_name":"dcarpintero/athena","owner":"dcarpintero","description":"Scientific Research Assistant built with LLMs, Retrieval Augmented Generation, and Semantic Search.","archived":false,"fork":false,"pushed_at":"2024-07-06T20:34:53.000Z","size":3887,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-13T15:04:40.526Z","etag":null,"topics":["cohere","cohere-ai","embedding-vectors","langchain","large-language-models","prompt-engineering","python","retrieval-augmented-generation","semantic-search","streamlit","weaviate"],"latest_commit_sha":null,"homepage":"https://athena-research.streamlit.app/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dcarpintero.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-10T10:51:22.000Z","updated_at":"2024-12-22T02:15:59.000Z","dependencies_parsed_at":"2024-12-02T17:41:22.639Z","dependency_job_id":null,"html_url":"https://github.com/dcarpintero/athena","commit_stats":null,"previous_names":["dcarpintero/athena"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcarpintero%2Fathena","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcarpintero%2Fathena/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dca
rpintero%2Fathena/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcarpintero%2Fathena/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dcarpintero","download_url":"https://codeload.github.com/dcarpintero/athena/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248732483,"owners_count":21152852,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cohere","cohere-ai","embedding-vectors","langchain","large-language-models","prompt-engineering","python","retrieval-augmented-generation","semantic-search","streamlit","weaviate"],"created_at":"2024-11-20T15:57:08.307Z","updated_at":"2025-04-13T15:04:54.362Z","avatar_url":"https://github.com/dcarpintero.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Open_inStreamlit](https://img.shields.io/badge/Open%20In-Streamlit-red?logo=Streamlit)](https://athena-research.streamlit.app/)\n[![Python](https://img.shields.io/badge/python-%203.8-blue.svg)](https://www.python.org/)\n[![License](https://img.shields.io/badge/Apache-2.0-green.svg)](https://github.com/dcarpintero/athena/blob/main/LICENSE)\n\n# 🦉 Athena - Research Companion\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./static/athena-dalle.png\"\u003e\n\u003c/p\u003e\n\nAthena is an AI-assistant prototype powered by [Cohere-AI](https://cohere.com/) and [Embed-v3](https://txt.cohere.com/introducing-embed-v3/) to facilitate scientific research. 
Its key differentiating features include:\n- **Advanced Semantic Search**: Uses state-of-the-art embeddings to retrieve articles by meaning rather than by exact keyword match, which suits the complex nature of scientific queries.\n- **Human-AI Collaboration**: Enables easier review of research literature by highlighting key topics and augmenting human understanding.\n- **Admin Support**: Provides assistance with tasks such as categorization of research articles, e-mail drafting, and tweet generation.\n\n## 📚 Overview\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./static/athena-app.png\"\u003e\n\u003c/p\u003e\n\n### Data Pipeline\n\nAs part of this project we have created two datasets of 50,000 arXiv articles related to AI and NLP, embedded with [Cohere Embedv3](https://txt.cohere.com/introducing-embed-v3/):\n- [https://huggingface.co/datasets/dcarpintero/arXiv.cs.AI.CL.CV.LG.MA.NE.embedv3](https://huggingface.co/datasets/dcarpintero/arXiv.cs.AI.CL.CV.LG.MA.NE.embedv3)\n- [https://huggingface.co/datasets/dcarpintero/arXiv.cs.CL.embedv3](https://huggingface.co/datasets/dcarpintero/arXiv.cs.CL.embedv3)\n\nSteps:\n1) Retrieve Articles' Metadata from arXiv. See [./data_pipeline/retrieve_arxiv.py](./data_pipeline/retrieve_arxiv.py)\n2) Embed Articles' Title and Abstract using Embedv3. See [./data_pipeline/embed_arxiv.py](./data_pipeline/embed_arxiv.py)\n3) Store Articles' Metadata and Embeddings in Weaviate. 
See [./data_pipeline/index_arxiv.py](./data_pipeline/index_arxiv.py)\n\n### Prompt Templates, Output Formatting, and Validation\n\nSome of our tasks, such as enriching abstracts with Wikipedia links, crafting a glossary, composing e-mails, and drafting tweets, rely on a set of:\n- [Prompt Templates](./prompts/athena.toml)\n\nThose prompts are then composed into LangChain chains, as in the following code snippets:\n- [Enrich Abstract](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/coral.py#L130-L150)\n- [Keywords](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/coral.py#L153-L173)\n- [E-mail Drafting w/ JSON Formatting](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/coral.py#L100-L127)\n- [Tweet Generation w/ JSON Formatting](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/coral.py#L74-L97) and [Pydantic Validation](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/coral.py#L17-L28)\n\n\n### Weaviate Schema\n\nSee [ArxivArticle](https://github.com/dcarpintero/athena/blob/5457229eba2c634b1bb3804aa342344b50ac278b/data_pipeline/index_arxiv.py#L12-L116) Class.\n\n### Cohere Engine\n\nThe [coral.py](./coral.py) class provides an abstraction layer over Cohere endpoints.\n\n### Streamlit App\n\nSee [app.py](./app.py)\n\n## 🚀 Quickstart\n\n1. Clone the repository:\n```\ngit clone git@github.com:dcarpintero/athena.git\n```\n\n2. Create and Activate a Virtual Environment:\n\n```\nWindows:\n\npy -m venv .venv\n.venv\\scripts\\activate\n\nmacOS/Linux:\n\npython3 -m venv .venv\nsource .venv/bin/activate\n```\n\n3. Install dependencies:\n\n```\npip install -r requirements.txt\n```\n\n4. Run Data Pipeline (optional)\n\n```\npython retrieve_arxiv.py\npython embed_arxiv.py\npython index_arxiv.py\n```\n\n5. 
Launch Web Application\n\n```\nstreamlit run ./app.py\n```\n\n## 🔗 References\n\n- [arXiv](https://arxiv.org/)\n- [Embed-v3](https://txt.cohere.com/introducing-embed-v3/)\n- [LangChain](https://langchain.com)\n- [Weaviate Vector Search](https://weaviate.io/developers/weaviate/search/similarity/)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdcarpintero%2Fathena","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdcarpintero%2Fathena","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdcarpintero%2Fathena/lists"}