https://github.com/bessouat40/rag-scientific-papers
Automated pipeline that daily fetches, stores, and indexes arXiv research papers in MinIO.
- Host: GitHub
- URL: https://github.com/bessouat40/rag-scientific-papers
- Owner: Bessouat40
- License: mit
- Created: 2025-02-09T10:03:23.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-03-05T09:49:12.000Z (3 months ago)
- Last Synced: 2025-03-05T10:39:20.661Z (3 months ago)
- Topics: arxiv, arxiv-api, arxiv-daily, arxiv-papers, automation, database, minio, minio-client, prefect, python
- Language: Python
- Homepage:
- Size: 13 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# RAG Scientific Papers
RAG Scientific Papers is a project that enables you to automatically fetch, process, and ingest the latest ArXiv research papers on any given topic on a daily basis. This daily retrieval supports continuous technological monitoring, ensuring that you stay up-to-date with emerging research and trends. The pipeline is orchestrated using [Prefect](https://www.prefect.io/) for scheduling and seamless automation, and it stores the retrieved PDFs in a [MinIO](https://min.io/) object storage system for efficient management and retrieval.
Thank you to arXiv for use of its open access interoperability.
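The core of the pipeline can be pictured as a Prefect flow that queries the arXiv API for recent papers on a topic and drops each PDF into a MinIO bucket. The sketch below is illustrative only, assuming the `requests`, `feedparser`, `minio`, and `prefect` packages are installed; the function names, bucket name, and credentials are placeholders, not the repository's actual code.

```python
# Illustrative sketch only; names, bucket, and credentials are placeholders.
import io

import feedparser
import requests
from minio import Minio
from prefect import flow, task

ARXIV_API = "http://export.arxiv.org/api/query"


@task
def fetch_latest_papers(topic: str, max_results: int = 10) -> list[dict]:
    """Query the arXiv API for the most recent papers matching a topic."""
    params = {
        "search_query": f"all:{topic}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": 0,
        "max_results": max_results,
    }
    feed = feedparser.parse(requests.get(ARXIV_API, params=params, timeout=30).text)
    return [
        {
            "id": entry.id.split("/")[-1],
            "title": entry.title,
            "pdf_url": entry.id.replace("/abs/", "/pdf/"),
        }
        for entry in feed.entries
    ]


@task
def store_pdf(client: Minio, bucket: str, paper: dict) -> None:
    """Download one PDF and store it in a MinIO bucket."""
    data = requests.get(paper["pdf_url"], timeout=60).content
    client.put_object(bucket, f"{paper['id']}.pdf", io.BytesIO(data), length=len(data))


@flow
def daily_arxiv_ingestion(topic: str = "retrieval augmented generation") -> None:
    """Fetch the latest papers for a topic and persist the PDFs."""
    client = Minio("localhost:9000", access_key="minioadmin", secret_key="minioadmin", secure=False)
    if not client.bucket_exists("arxiv-papers"):
        client.make_bucket("arxiv-papers")
    for paper in fetch_latest_papers(topic):
        store_pdf(client, "arxiv-papers", paper)
```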
## Features
- Fetch ArXiv Papers: Automatically query the ArXiv API for research papers based on a topic and publication date.
- PDF Ingestion: Download the PDF files and store them in a MinIO bucket.
- Embeddings Extraction: Extract embeddings and store them in a Chroma vector store (a minimal sketch follows this list).
- Pipeline Orchestration: Use Prefect flows and tasks to schedule and manage the pipelines.
- UI: Display, read, and filter the ingested PDFs.
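
For the embedding step, here is a minimal sketch of how extracted text chunks could be written to and queried from a Chroma collection, assuming the `chromadb` package. The collection name, chunking, and embedding setup are assumptions, not necessarily what the repository does.

```python
# Sketch only; collection name, IDs, and document chunks are illustrative.
import chromadb

client = chromadb.PersistentClient(path="./chroma")  # local, on-disk vector store
collection = client.get_or_create_collection("arxiv-papers")

# Index a few extracted text chunks; Chroma embeds them with its default
# embedding model unless an embedding function is configured.
collection.add(
    ids=["2502.00001-chunk-0", "2502.00001-chunk-1"],
    documents=[
        "We propose a retrieval-augmented generation method ...",
        "Experiments show improved factuality on scientific QA ...",
    ],
    metadatas=[{"paper_id": "2502.00001"}, {"paper_id": "2502.00001"}],
)

# Semantic search over the stored chunks.
results = collection.query(query_texts=["retrieval augmented generation for papers"], n_results=2)
print(results["documents"])
```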

## Installation

1. Clone the repository
```bash
git clone https://github.com/Bessouat40/rag-scientific-papers.git
cd rag-scientific-papers
```

2. Configure .env File
You'll need to rename the **.env.example** file and fill it in with your own values:
```bash
mv .env.example .env
```

3. Install the required packages
```bash
python -m pip install -r backend/requirements.txt
cd frontend
npm i
```

## Usage
### Start the Pipeline with Prefect locally
You can run the pipeline as a scheduled flow using Prefect. For example, to run the pipeline daily at midnight, use the Prefect deployment approach or serve the flow directly (for testing purposes); a sketch of the serve approach is shown after the command below.
```bash
python -m backend.main
```
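
As a sketch of the "serve the flow directly" option, Prefect lets you attach a cron schedule when serving a flow locally. The flow and deployment names below are hypothetical; the actual entry point lives in `backend/main.py` and may differ.

```python
# Sketch only: the real flow is defined in backend/main.py and may differ.
from prefect import flow


@flow
def daily_arxiv_ingestion():
    ...  # fetch, store, and index the day's papers


if __name__ == "__main__":
    # Serve the flow locally and trigger a run every day at midnight.
    daily_arxiv_ingestion.serve(name="daily-arxiv-papers", cron="0 0 * * *")
```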

### Running Pipelines and UI with Docker

You can now run the Prefect flow and the UI inside Docker containers:
```bash
docker-compose up -d --build
```

Now you can access the Prefect UI at [localhost:4200](http://localhost:4200/dashboard).
Your flow will run every day at midnight. You can access the frontend UI at [localhost:3000](http://localhost:3000).
## Configuration
### Topic
The pipeline fetches articles based on a given topic.
You can modify this parameter in the **.env** file.
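
As an illustration of how the topic could flow from the **.env** file into the arXiv query, here is a short sketch assuming the `python-dotenv` package. The variable name `ARXIV_TOPIC` is hypothetical; check **.env.example** for the actual keys the pipeline uses.

```python
# Hypothetical variable name; see .env.example for the real configuration keys.
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file at the project root
topic = os.getenv("ARXIV_TOPIC", "large language models")
search_query = f"all:{topic}"  # value passed to the arXiv API's search_query parameter
print(search_query)
```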
## TODO
- [x] **Containerization with Docker:** Create a Dockerfile to containerize the application and manage its dependencies.
- [x] **Embedding Extraction:** Use a model to extract and store embeddings from the PDFs for later semantic search.
- [x] **Semantic Search:** Implement a semantic search feature that leverages the stored embeddings to enable more accurate article search.
- [x] **Add UI**