# Growth Lab Deep Search

An agentic RAG system that helps users query Growth Lab-specific unstructured data.

## 🔍 Project Overview

Growth Lab Deep Search is an agentic AI system designed to answer complex questions about the Growth Lab's research and publications.

**Key Features:**

- Automated ETL pipeline for harvesting Growth Lab publications and academic papers
- OCR processing of PDF documents with tools such as Docling, Marker, or Gemini Flash 2
- Vector embeddings with hybrid search
- Agentic RAG system based on LangGraph

## Project Architecture

### Directory structure

This is a rough outline of the intended directory structure. The actual layout may differ, but it shows how the code is meant to be organized.

```
gl_deep_search/
├── .github/
│   └── workflows/
│       ├── etl-pipeline.yml              # Scheduled ETL runs and deployment
│       ├── service-deploy.yml            # Service API deployment
│       └── frontend-deploy.yml           # Frontend deployment
├── .gitignore
├── README.md
├── pyproject.toml                        # Python project config for uv
├── docker-compose.yml                    # Local development setup
│
├── backend/
│   ├── etl/
│   │   ├── config.yaml                   # Default ETL configuration
│   │   ├── config.production.yaml        # Production ETL configuration
│   │   ├── orchestrator.py               # ETL orchestration entry point
│   │   ├── models/
│   │   │   ├── publications.py           # Publication data models
│   │   │   └── tracking.py               # ETL tracking models
│   │   ├── scrapers/
│   │   │   ├── growthlab.py              # Growth Lab website scraper
│   │   │   └── openalex.py               # OpenAlex API client
│   │   ├── scripts/                      # ETL execution scripts
│   │   │   ├── run_growthlab_scraper.py
│   │   │   ├── run_openalex_scraper.py
│   │   │   ├── run_gl_file_downloader.py
│   │   │   ├── run_openalex_file_downloader.py
│   │   │   ├── run_pdf_processor.py
│   │   │   └── run_embeddings_generator.py
│   │   └── utils/
│   │       ├── pdf_processor.py          # PDF processing and OCR
│   │       ├── gl_file_downloader.py     # Growth Lab file downloader
│   │       ├── oa_file_downloader.py     # OpenAlex file downloader
│   │       ├── text_chunker.py           # Text chunking utilities
│   │       ├── embeddings_generator.py   # Embedding generation
│   │       ├── publication_tracker.py    # Publication tracking
│   │       └── retry.py                  # Retry utilities
│   │
│   ├── service/                          # Main backend service (future)
│   │   ├── main.py                       # FastAPI entry point
│   │   ├── routes.py                     # API endpoints
│   │   ├── graph.py                      # LangGraph definition
│   │   └── utils/
│   │       └── retriever.py              # Vector retrieval
│   │
│   ├── storage/                          # Storage abstraction layer
│   │   ├── base.py                       # Storage interface
│   │   ├── local.py                      # Local filesystem adapter
│   │   ├── gcs.py                        # Google Cloud Storage adapter
│   │   ├── cloud.py                      # Cloud storage utilities
│   │   ├── database.py                   # Database utilities
│   │   └── factory.py                    # Storage factory
│   │
│   └── tests/                            # Unit and integration tests
│       ├── etl/
│       └── service/
│
├── data/                                 # Data directory (gitignored)
│   ├── raw/                              # Raw scraped data
│   │   ├── documents/                    # Raw documents by source
│   │   └── pdfs/                         # Downloaded PDF files
│   ├── intermediate/                     # Intermediate processing data
│   │   └── *.csv                         # Scraped publication metadata
│   ├── processed/                        # Processed data
│   │   ├── documents/                    # Processed documents with text
│   │   ├── chunks/                       # Chunked documents
│   │   └── embeddings/                   # Generated embeddings
│   └── reports/                          # ETL execution reports
│
├── deployment/                           # GCP deployment infrastructure
│   ├── cloud-run/                        # Cloud Run job scripts
│   ├── vm/                               # VM-based deployment scripts
│   ├── scripts/                          # Setup and utility scripts
│   └── config/                           # GCP configuration
│
└── frontend/                             # Frontend application (future)
    ├── app.py                            # Streamlit/Chainlit application
    └── utils.py                          # Frontend utility functions
```
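
The `backend/storage/` layout (an interface, a local adapter, a GCS adapter, and a factory) suggests an adapter pattern. The sketch below shows one way such a layer could fit together. It is not the repository's code: the class names, method names, and the `get_storage` function are hypothetical, and the GCS adapter is left out so the example has no third-party dependencies.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class StorageBase(ABC):
    """Hypothetical storage interface (a stand-in for base.py)."""

    @abstractmethod
    def write_bytes(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def read_bytes(self, key: str) -> bytes: ...

    @abstractmethod
    def exists(self, key: str) -> bool: ...


class LocalStorage(StorageBase):
    """Local filesystem adapter: keys are paths relative to a root directory."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def _path(self, key: str) -> Path:
        return self.root / key

    def write_bytes(self, key: str, data: bytes) -> None:
        path = self._path(key)
        # Create intermediate directories such as raw/pdfs/ on first write.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def read_bytes(self, key: str) -> bytes:
        return self._path(key).read_bytes()

    def exists(self, key: str) -> bool:
        return self._path(key).exists()


def get_storage(kind: str, root: Path) -> StorageBase:
    """Hypothetical factory: choose an adapter by name."""
    if kind == "local":
        return LocalStorage(root)
    # A GCS adapter would be selected here; it is omitted from this sketch.
    raise ValueError(f"unsupported storage kind: {kind!r}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        storage = get_storage("local", Path(tmp))
        storage.write_bytes("raw/pdfs/example.pdf", b"%PDF-1.7")
        print(storage.exists("raw/pdfs/example.pdf"))
```

With this shape, ETL code depends only on the interface, so the same pipeline step can write to `data/` locally or to a bucket in production by changing the factory argument.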

## Tech Stack

- **ETL Pipeline**: GitHub Actions, modern OCR tools (Docling, Marker, or Gemini Flash 2)
- **Vector Storage**: Qdrant for embeddings, with Cohere for reranking
- **Agent System**: LangGraph for agentic RAG workflows
- **Backend API**: FastAPI, Python 3.12+
- **Frontend**: Streamlit or Chainlit for MVP
- **Deployment**: Google Cloud Run
- **Package Management**: uv
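
Hybrid search combines a dense (embedding) ranking with a keyword ranking before any reranking step. This README does not say how the two result lists are merged, so the sketch below uses reciprocal rank fusion (RRF) purely as an illustrative assumption. The function name and the document IDs are hypothetical, and `k = 60` is a commonly used default rather than a project setting.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1. Higher total score ranks first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    dense_hits = ["doc_a", "doc_b", "doc_c"]    # hypothetical vector-search results
    keyword_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical keyword-search results
    print(reciprocal_rank_fusion([dense_hits, keyword_hits]))
```

Documents that appear in both lists rise to the top, which is why `doc_b` and `doc_a` outrank the two documents found by only one retriever. The fused list could then be passed to a reranker.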

## Getting Started

### Prerequisites

- Docker and Docker Compose
- Python 3.12+
- `uv` for project management (see the [documentation](https://docs.astral.sh/uv/))
- GCP account and credentials (for production)
- API keys for OpenAI, Anthropic, etc.

### Environment Setup

1. Clone the repository:
```bash
git clone https://github.com/shreyasgm/gl_deep_search.git
cd gl_deep_search
```
2. Run `uv` in the CLI to check that it is available, then run `uv sync` to install dependencies and create the virtual environment. On its own, `uv sync` installs only the core dependencies specified in `pyproject.toml`. To install the optional dependencies for a specific component, use:
```bash
# For a single optional component
uv sync --extra etl

# For multiple optional components, repeat the flag
uv sync --extra etl --extra frontend
```

3. To add new packages to the project, use the following format:
```bash
# Add a package to a specific group (etl, service, frontend, dev, prod)
uv add package_name --optional group_name

# Example: Add seaborn to the service group
uv add seaborn --optional service
```

4. Create and configure environment files:
```bash
cp backend/etl/.env.example backend/etl/.env
cp backend/service/.env.example backend/service/.env
cp frontend/.env.example frontend/.env
```

5. Add your API keys and configuration to the `.env` files
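
Once the `.env` files are filled in, it helps to check them before starting a long ETL run. The following is a minimal standard-library sketch of parsing a `.env` file and reporting missing settings. The project may load its configuration differently (for example with a dedicated library), and the key names used here are assumptions; the `.env.example` files define the real ones.

```python
from pathlib import Path


def load_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of optional quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str], required: list[str]) -> list[str]:
    """Return the required names that are absent or empty."""
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    env_path = Path("backend/etl/.env")
    if env_path.exists():
        settings = load_env_file(env_path)
        # OPENAI_API_KEY is an assumed name, used only for illustration.
        print(missing_keys(settings, ["OPENAI_API_KEY"]))
```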

### Docker Development Environment

The project uses Docker for consistent development and deployment environments:

1. Start the complete development stack:
```bash
docker-compose up
```

2. Access local services:
- Frontend UI: http://localhost:8501
- Backend API: http://localhost:8000
- API Documentation: http://localhost:8000/docs

3. Run individual components:
```bash
# Run only the ETL service
docker-compose up etl

# Run only the backend service
docker-compose up service

# Run only the frontend
docker-compose up frontend
```

### Running the ETL Pipeline

The ETL pipeline supports both development and production environments through containerized deployment. The initial ingestion and processing of historical documents runs on high-performance computing (HPC) infrastructure using the SLURM workload manager. Incremental updates for new documents are handled through Google Cloud Run.

```bash
# Development: Execute ETL pipeline in local environment
docker-compose run --rm etl python main.py

# Production: Initial bulk processing via HPC/SLURM
sbatch scripts/slurm_etl_initial.sh

# Component-specific execution
docker-compose run --rm etl python main.py --component scraper
docker-compose run --rm etl python main.py --component processor
docker-compose run --rm etl python main.py --component embedder
```

After the initial processing, data is migrated to Google Cloud Storage. Subsequent ETL runs are orchestrated by automated GitHub Actions workflows and executed on Google Cloud Run.
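
The `--component` flag in the commands above selects a single pipeline stage. The sketch below shows one way an entry point could dispatch on that flag with `argparse`. It is an illustration only, not the project's `main.py`: the step functions are hypothetical placeholders, and running every stage when the flag is omitted is an assumption.

```python
import argparse
from collections.abc import Callable


# Hypothetical placeholders: each real step would do its work and report status.
def run_scraper() -> str:
    return "scraper"


def run_processor() -> str:
    return "processor"


def run_embedder() -> str:
    return "embedder"


# Insertion order doubles as the assumed execution order for a full run.
COMPONENTS: dict[str, Callable[[], str]] = {
    "scraper": run_scraper,
    "processor": run_processor,
    "embedder": run_embedder,
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the ETL pipeline (sketch)")
    parser.add_argument(
        "--component",
        choices=sorted(COMPONENTS),
        default=None,
        help="run a single stage; omit to run all stages in order",
    )
    return parser


def main(argv: list[str] | None = None) -> list[str]:
    args = build_parser().parse_args(argv)
    selected = [args.component] if args.component else list(COMPONENTS)
    return [COMPONENTS[name]() for name in selected]


if __name__ == "__main__":
    for finished in main():
        print(f"finished: {finished}")
```

Keeping the stage registry in one dictionary means the `choices` list and the dispatch logic cannot drift apart when a stage is added.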

## Deployment

### GCP Deployment Infrastructure

The project includes a complete deployment infrastructure for Google Cloud Platform (GCP). See the [deployment guide](deployment/README.md) for detailed instructions.

**Quick Start:**
1. Configure GCP settings: `cp deployment/config/gcp-config.sh.template deployment/config/gcp-config.sh`
2. Run setup scripts: `./deployment/scripts/01-setup-gcp-project.sh`, then the scripts numbered 02 through 04
3. Deploy Cloud Run Job: `./deployment/cloud-run/deploy.sh`
4. Schedule weekly updates: `./deployment/cloud-run/schedule.sh`

**Deployment Options:**
- **VM-based**: For initial batch processing (`./deployment/vm/create-vm.sh`)
- **Cloud Run Jobs**: For scheduled weekly updates (automated via Cloud Scheduler)
- **Manual execution**: Run on-demand (`./deployment/cloud-run/execute.sh`)

**Documentation:**
- [Deployment README](deployment/README.md) - Quick start and troubleshooting
- [GCP Deployment Guide](docs/GCP_DEPLOYMENT_GUIDE.md) - Comprehensive deployment documentation

### Production Infrastructure

- **ETL Pipeline**: Cloud Run Jobs (scheduled weekly) + VM instances (initial batch)
- **Backend Service**: Cloud Run with autoscaling (future)
- **Vector Database**: Managed Qdrant instance or Qdrant Cloud
- **Document Storage**: Cloud Storage (GCS)
- **Frontend**: Streamlit or Chainlit (future)
- **Scheduling**: Cloud Scheduler or GitHub Actions workflows

## 🧪 Development Workflow

### Contributing Guidelines

1. Create a feature branch from `main`
2. Implement your changes with tests
3. Submit a pull request for review

### Testing

```bash
# Run tests
pytest

# Run with coverage
pytest --cov=backend
```
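
For contributors adding tests, a test module in the style that `pytest` discovers might look like the sketch below. `chunk_text` is a hypothetical helper defined inline so the example is self-contained; it is not taken from `text_chunker.py`, and real tests would import the code under test from `backend/`.

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Hypothetical helper: split text into overlapping character windows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and less than chunk_size")
    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


def test_chunks_overlap_and_cover_the_text():
    chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
    assert chunks == ["abcd", "defg", "ghij"]


def test_empty_text_yields_no_chunks():
    assert chunk_text("", chunk_size=4) == []


def test_invalid_overlap_is_rejected():
    try:
        chunk_text("abc", chunk_size=2, overlap=2)
    except ValueError:
        return
    raise AssertionError("expected ValueError for overlap >= chunk_size")
```

Functions whose names start with `test_` are collected automatically, so a file like this runs with the plain `pytest` command shown above.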

### Deployment

Development and production environments are managed through Docker and GitHub Actions:

```bash
# Deploy to development
./scripts/deploy.sh dev

# Deploy to production
./scripts/deploy.sh prod
```

## 🔒 Security & Configuration

- API keys and secrets are managed via `.env` files (not committed to GitHub)
- Production secrets are stored in GCP Secret Manager
- Access control is implemented at the API level

## License

This project is licensed under CC-BY-NC-SA 4.0. See the LICENSE file for details.