{"id":23869141,"url":"https://github.com/extrawest/arxiv-rag","last_synced_at":"2026-06-04T16:31:16.367Z","repository":{"id":251741361,"uuid":"838298939","full_name":"extrawest/Arxiv-RAG","owner":"extrawest","description":null,"archived":false,"fork":false,"pushed_at":"2024-08-05T15:03:28.000Z","size":13689,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-13T16:07:53.879Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/extrawest.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-05T11:06:30.000Z","updated_at":"2024-08-05T15:03:32.000Z","dependencies_parsed_at":"2024-08-05T13:23:44.619Z","dependency_job_id":"21268896-7c46-4ad5-a96b-afff1aefe9b9","html_url":"https://github.com/extrawest/Arxiv-RAG","commit_stats":null,"previous_names":["extrawest/arxiv-rag"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/extrawest/Arxiv-RAG","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/extrawest%2FArxiv-RAG","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/extrawest%2FArxiv-RAG/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/extrawest%2FArxiv-RAG/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/extrawest%2FArxiv-RAG/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/extrawest","download_url":"https://codelo
ad.github.com/extrawest/Arxiv-RAG/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/extrawest%2FArxiv-RAG/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33914543,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-04T02:00:06.755Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-03T12:16:11.378Z","updated_at":"2026-06-04T16:31:16.346Z","avatar_url":"https://github.com/extrawest.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Arxiv RAG\n[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)]()\n[![Maintainer](https://img.shields.io/static/v1?label=Nariman%20Mamutov\u0026message=Maintainer\u0026color=red)](mailto:nairman.mamutov@extrawest.com)\n[![Ask Me Anything !](https://img.shields.io/badge/Ask%20me-anything-1abc9c.svg)]()\n![GitHub license](https://img.shields.io/github/license/Naereen/StrapDown.js.svg)\n![GitHub release](https://img.shields.io/badge/release-v1.0.0-blue)\n\n![](https://raw.githubusercontent.com/extrawest/Arxiv-RAG/main/preview.gif)\n\n## About\n\nArxiv RAG (Retrieval-Augmented Generation) is a web application and API that generates notes and answers questions about Arxiv papers. 
The application uses Large Language Models (LLMs) to process scientific papers and return summaries and answers to user queries. The Unstructured API parses and chunks PDFs so that large documents can be handled in pieces. Supabase manages the PostgreSQL database, which stores document embeddings and serves similarity queries.\n\nArxiv RAG aims to make scientific literature easier to access and understand, so that researchers, students, and enthusiasts can quickly extract important information from large volumes of material.\n\n## Features\n\n- **PDF Text Extraction and Analysis**: Using the Unstructured API, Arxiv RAG parses PDF documents and splits them into chunks for analysis.\n\n- **Insight Generation with OpenAI**: The application uses OpenAI's language models to generate summaries and responses grounded in the context of each paper.\n\n- **Data Management with Supabase**: Supabase, an open-source Firebase alternative, manages the PostgreSQL database. This database stores the parsed document data, embeddings, and question-answer pairs for querying and retrieval.\n\n- **Embeddings and Document Matching**: Arxiv RAG represents documents as embedding vectors, which allows similarity search to retrieve the passages most relevant to a user query.\n\n- **Question Answering System**: The application includes a question-answering system that handles queries about the content of Arxiv papers. 
By leveraging the stored embeddings and context from the documents, the system provides accurate and contextually relevant answers.\n\n## Setup\n\n### Prerequisites\n\n- Node.js\n- Yarn package manager\n- Supabase account\n- Unstructured API key\n\n### Environment Configuration\n\nCreate a `.env.development.local` file in the `./api` directory with the following content:\n\n```\nUNSTRUCTURED_API_KEY=\nOPENAI_API_KEY=\nSUPABASE_PRIVATE_KEY=\nSUPABASE_URL=\nPORT=\nUNSTRUCTURED_API_URL=\n```\n\n### Database Setup in Supabase\n\nExecute the following SQL commands in your Supabase project to set up the required database structure:\n\n```sql\n-- Enable the pgvector extension\ncreate extension vector;\n\n-- Create tables for storing Arxiv papers, embeddings, and question answering data\nCREATE TABLE arxiv_papers (\n  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),\n  created_at TIMESTAMPTZ DEFAULT now(),\n  paper TEXT,\n  arxiv_url TEXT,\n  notes JSONB[],\n  name TEXT\n);\n\nCREATE TABLE arxiv_embeddings (\n  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),\n  created_at TIMESTAMPTZ DEFAULT now(),\n  content TEXT,\n  embedding vector,\n  metadata JSONB\n);\n\nCREATE TABLE arxiv_question_answering (\n  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),\n  created_at TIMESTAMPTZ DEFAULT now(),\n  question TEXT,\n  answer TEXT,\n  followup_questions TEXT[],\n  context TEXT\n);\n\n-- Create a function for document matching\ncreate function match_documents (\n  query_embedding vector(1536),\n  match_count int DEFAULT null,\n  filter jsonb DEFAULT '{}'\n) returns table (\n  id UUID,\n  content text,\n  metadata jsonb,\n  embedding vector,\n  similarity float\n)\nlanguage plpgsql\nas $$\n#variable_conflict use_column\nbegin\n  return query\n  select\n    id,\n    content,\n    metadata,\n    embedding,\n    1 - (arxiv_embeddings.embedding \u003c=\u003e query_embedding) as similarity\n  from arxiv_embeddings\n  where metadata @\u003e filter\n  order by arxiv_embeddings.embedding 
\u003c=\u003e query_embedding\n  limit match_count;\nend;\n$$;\n```\n\n### Supabase Type Generation\n\nAdd your project ID to the Supabase generate types script in package.json:\n\n```json\n{\n  \"gen:supabase:types\": \"touch ./src/generated.ts \u0026\u0026 supabase gen types typescript --schema public \u003e ./src/generated.ts --project-id \u003cYOUR_PROJECT_ID\u003e\"\n}\n```\n\n## Running the Application\n\n### Build the API Server\n\n```shell\nyarn build\n```\n\n### Start the API Server\n\n```shell\nyarn start:api\n```\n\n### Start the Web Server\n\n```shell\nyarn start:web\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fextrawest%2Farxiv-rag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fextrawest%2Farxiv-rag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fextrawest%2Farxiv-rag/lists"}