https://github.com/shreyasgm/gl_deep_search
Agentic RAG for Growth Lab text
- Host: GitHub
- URL: https://github.com/shreyasgm/gl_deep_search
- Owner: shreyasgm
- Created: 2025-03-13T14:14:11.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-11-12T19:52:15.000Z (7 months ago)
- Last Synced: 2025-11-12T21:22:09.975Z (7 months ago)
- Topics: agentic-rag, agents, ai, deep-search, ocr
- Language: Python
- Homepage: https://github.com/shreyasgm/gl_deep_search
- Size: 3.38 MB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 25
Metadata Files:
- Readme: README.md
README
# Growth Lab Deep Search
An agentic RAG system that helps users query Growth Lab-specific unstructured data.
## Project Overview
Growth Lab Deep Search is an agentic AI system designed to answer complex questions about the Growth Lab's research and publications.
**Key Features:**
- Automated ETL pipeline for harvesting Growth Lab publications and academic papers
- Advanced OCR processing of PDF documents using modern tools
- Vector embeddings with hybrid search
- Agentic RAG system based on LangGraph
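The README does not say how dense (embedding) results and keyword results are combined for hybrid search. One common approach is reciprocal rank fusion (RRF). The sketch below illustrates that technique only; the function name, the document IDs, and the `k=60` constant are assumptions, not taken from the project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Higher fused scores sort first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense (embedding) search and a keyword search.
dense_hits = ["paper_a", "paper_b", "paper_c"]
keyword_hits = ["paper_b", "paper_d", "paper_a"]
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
```

Documents that rank well in both lists rise to the top, and the fused list can then be passed to a reranker.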
## Project Architecture
### Directory structure
This is a rough outline of the intended directory structure. The actual structure might look different, but this should give an idea of the intended code organization.
```
gl_deep_search/
├── .github/
│   └── workflows/
│       ├── etl-pipeline.yml              # Scheduled ETL runs and deployment
│       ├── service-deploy.yml            # Service API deployment
│       └── frontend-deploy.yml           # Frontend deployment
├── .gitignore
├── README.md
├── pyproject.toml                        # Python project config for uv
├── docker-compose.yml                    # Local development setup
│
├── backend/
│   ├── etl/
│   │   ├── config.yaml                   # Default ETL configuration
│   │   ├── config.production.yaml        # Production ETL configuration
│   │   ├── orchestrator.py               # ETL orchestration entry point
│   │   ├── models/
│   │   │   ├── publications.py           # Publication data models
│   │   │   └── tracking.py               # ETL tracking models
│   │   ├── scrapers/
│   │   │   ├── growthlab.py              # Growth Lab website scraper
│   │   │   └── openalex.py               # OpenAlex API client
│   │   ├── scripts/                      # ETL execution scripts
│   │   │   ├── run_growthlab_scraper.py
│   │   │   ├── run_openalex_scraper.py
│   │   │   ├── run_gl_file_downloader.py
│   │   │   ├── run_openalex_file_downloader.py
│   │   │   ├── run_pdf_processor.py
│   │   │   └── run_embeddings_generator.py
│   │   └── utils/
│   │       ├── pdf_processor.py          # PDF processing and OCR
│   │       ├── gl_file_downloader.py     # Growth Lab file downloader
│   │       ├── oa_file_downloader.py     # OpenAlex file downloader
│   │       ├── text_chunker.py           # Text chunking utilities
│   │       ├── embeddings_generator.py   # Embedding generation
│   │       ├── publication_tracker.py    # Publication tracking
│   │       └── retry.py                  # Retry utilities
│   │
│   ├── service/                          # Main backend service (future)
│   │   ├── main.py                       # FastAPI entry point
│   │   ├── routes.py                     # API endpoints
│   │   ├── graph.py                      # LangGraph definition
│   │   └── utils/
│   │       └── retriever.py              # Vector retrieval
│   │
│   ├── storage/                          # Storage abstraction layer
│   │   ├── base.py                       # Storage interface
│   │   ├── local.py                      # Local filesystem adapter
│   │   ├── gcs.py                        # Google Cloud Storage adapter
│   │   ├── cloud.py                      # Cloud storage utilities
│   │   ├── database.py                   # Database utilities
│   │   └── factory.py                    # Storage factory
│   │
│   └── tests/                            # Unit and integration tests
│       ├── etl/
│       └── service/
│
├── data/                                 # Data directory (gitignored)
│   ├── raw/                              # Raw scraped data
│   │   ├── documents/                    # Raw documents by source
│   │   └── pdfs/                         # Downloaded PDF files
│   ├── intermediate/                     # Intermediate processing data
│   │   └── *.csv                         # Scraped publication metadata
│   ├── processed/                        # Processed data
│   │   ├── documents/                    # Processed documents with text
│   │   ├── chunks/                       # Chunked documents
│   │   └── embeddings/                   # Generated embeddings
│   └── reports/                          # ETL execution reports
│
├── deployment/                           # GCP deployment infrastructure
│   ├── cloud-run/                        # Cloud Run job scripts
│   ├── vm/                               # VM-based deployment scripts
│   ├── scripts/                          # Setup and utility scripts
│   └── config/                           # GCP configuration
│
└── frontend/                             # Frontend application (future)
    ├── app.py                            # Streamlit/Chainlit application
    └── utils.py                          # Frontend utility functions
```
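The `storage/` layout above (an interface in `base.py`, adapters in `local.py` and `gcs.py`, and a `factory.py`) suggests an adapter pattern. A minimal sketch of that pattern follows. The class and function names (`StorageBase`, `LocalStorage`, `get_storage`, `read_bytes`, `write_bytes`) are hypothetical and may not match the actual code.

```python
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile


class StorageBase(ABC):
    """Interface that every storage adapter implements (hypothetical)."""

    @abstractmethod
    def write_bytes(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def read_bytes(self, key: str) -> bytes: ...


class LocalStorage(StorageBase):
    """Adapter that stores objects under a root directory on disk."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def write_bytes(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def read_bytes(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


def get_storage(kind: str, **kwargs) -> StorageBase:
    """Factory: pick an adapter by name. A GCS adapter would be added here."""
    adapters = {"local": LocalStorage}
    if kind not in adapters:
        raise ValueError(f"Unknown storage kind: {kind!r}")
    return adapters[kind](**kwargs)


# Usage: callers depend only on the StorageBase interface.
with tempfile.TemporaryDirectory() as tmp:
    storage = get_storage("local", root=Path(tmp))
    storage.write_bytes("raw/documents/example.txt", b"hello")
    roundtrip = storage.read_bytes("raw/documents/example.txt")
```

With this shape, ETL code can run unchanged against the local filesystem in development and a cloud bucket in production, because only the factory argument differs.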
## Tech Stack
- **ETL Pipeline**: GitHub Actions, modern OCR tools (Docling, Marker, or Gemini 2.0 Flash)
- **Vector Storage**: Qdrant for embeddings, with Cohere for reranking
- **Agent System**: LangGraph for agentic RAG workflows
- **Backend API**: FastAPI, Python 3.12+
- **Frontend**: Streamlit or Chainlit for MVP
- **Deployment**: Google Cloud Run
- **Package Management**: uv
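To make "agentic RAG workflow" concrete, the sketch below models a retrieve, check, and rewrite loop as plain functions over a shared state, which is the general shape a graph-based agent framework organizes into nodes and conditional edges. It deliberately does not use LangGraph or Qdrant. The corpus, node names, and retry limit are all hypothetical, and the project's real graph is not described in this README.

```python
from typing import TypedDict


class AgentState(TypedDict):
    question: str
    documents: list[str]
    attempts: int


# Toy corpus standing in for a vector store (hypothetical content).
CORPUS = {
    "economic complexity": "Summary of a paper on economic complexity.",
    "growth diagnostics": "Summary of a paper on growth diagnostics.",
}


def retrieve(state: AgentState) -> AgentState:
    """Node: fetch documents whose topic key appears in the question."""
    question = state["question"].lower()
    docs = [text for key, text in CORPUS.items() if key in question]
    return {**state, "documents": docs, "attempts": state["attempts"] + 1}


def rewrite(state: AgentState) -> AgentState:
    """Node: reformulate the question (here, expanding one abbreviation)."""
    expanded = state["question"].replace("ECI", "economic complexity")
    return {**state, "question": expanded}


def route(state: AgentState) -> str:
    """Conditional edge: stop once documents are found or retries run out."""
    if state["documents"] or state["attempts"] >= 2:
        return "end"
    return "rewrite"


def run(question: str) -> AgentState:
    state: AgentState = {"question": question, "documents": [], "attempts": 0}
    while True:
        state = retrieve(state)
        if route(state) == "end":
            return state
        state = rewrite(state)


result = run("What does the ECI measure?")
```

The first retrieval finds nothing for the abbreviated question, so the loop rewrites the query and retrieves again before stopping.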
## Getting Started
### Prerequisites
- Docker and Docker Compose
- Python 3.12+
- `uv` for project management (see the documentation [here](https://docs.astral.sh/uv/))
- GCP account and credentials (for production)
- API keys for OpenAI, Anthropic, etc.
### Environment Setup
1. Clone the repository:
```bash
git clone https://github.com/shreyasgm/gl_deep_search.git
cd gl_deep_search
```
2. Run `uv` in the CLI to check that it is available, then run `uv sync` to install dependencies and create the virtual environment. This installs only the core dependencies specified in `pyproject.toml`. To install the optional dependencies of a specific component, use:
```bash
# For a single optional component
uv sync --extra etl
# For multiple optional components, repeat the --extra flag
uv sync --extra etl --extra frontend
```
3. To add new packages to the project, use the following format:
```bash
# Add a package to a specific group (etl, service, frontend, dev, prod)
uv add package_name --optional group_name
# Example: Add seaborn to the service group
uv add seaborn --optional service
```
4. Create and configure environment files:
```bash
cp backend/etl/.env.example backend/etl/.env
cp backend/service/.env.example backend/service/.env
cp frontend/.env.example frontend/.env
```
5. Add your API keys and configuration to the `.env` files
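Once the `.env` files are filled in, it helps to fail fast when a required key is missing. The sketch below is a standard-library illustration only: the project may load settings differently (for example with a dotenv library), and the key names `OPENAI_API_KEY` and `QDRANT_URL` and their values are placeholders, not confirmed settings.

```python
from pathlib import Path
import tempfile


def load_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require(values: dict[str, str], names: list[str]) -> None:
    """Raise early if any required setting is missing or empty."""
    missing = [name for name in names if not values.get(name)]
    if missing:
        raise RuntimeError(f"Missing settings: {', '.join(missing)}")


# Demo with a throwaway file; real code would point at backend/etl/.env.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        '# placeholder values\n'
        'OPENAI_API_KEY="sk-placeholder"\n'
        'QDRANT_URL=http://localhost:6333\n'
    )
    settings = load_env_file(env_path)
    require(settings, ["OPENAI_API_KEY"])
```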
### Docker Development Environment
The project uses Docker for consistent development and deployment environments:
1. Start the complete development stack:
```bash
docker-compose up
```
2. Access local services:
- Frontend UI: http://localhost:8501
- Backend API: http://localhost:8000
- API Documentation: http://localhost:8000/docs
3. Run individual components:
```bash
# Run only the ETL service
docker-compose up etl
# Run only the backend service
docker-compose up service
# Run only the frontend
docker-compose up frontend
```
### Running the ETL Pipeline
The ETL pipeline runs in both development and production environments through containerized deployment. Initial ingestion and processing of historical documents runs on High-Performance Computing (HPC) infrastructure using the SLURM workload manager. Incremental updates for new documents are handled through Google Cloud Run.
```bash
# Development: Execute ETL pipeline in local environment
docker-compose run --rm etl python main.py
# Production: Initial bulk processing via HPC/SLURM
sbatch scripts/slurm_etl_initial.sh
# Component-specific execution
docker-compose run --rm etl python main.py --component scraper
docker-compose run --rm etl python main.py --component processor
docker-compose run --rm etl python main.py --component embedder
```
After the initial processing, data is migrated to Google Cloud Storage. Subsequent ETL runs are orchestrated by automated GitHub Actions workflows and executed on Google Cloud Run.
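The `--component` flag in the commands above could be handled by a small `argparse` entry point. This is a sketch only: the README does not show `main.py`, so the stage order, the `COMPONENTS` tuple, and the function names are assumptions.

```python
import argparse

# Hypothetical stage names mirroring the --component values shown above.
COMPONENTS = ("scraper", "processor", "embedder")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the ETL pipeline.")
    parser.add_argument(
        "--component",
        choices=COMPONENTS,
        default=None,
        help="Run a single stage; omit to run every stage in order.",
    )
    return parser


def select_stages(argv: list[str]) -> list[str]:
    """Return the stages to run, in pipeline order, for the given arguments."""
    args = build_parser().parse_args(argv)
    return [args.component] if args.component else list(COMPONENTS)


all_stages = select_stages([])
one_stage = select_stages(["--component", "processor"])
```

Restricting the flag with `choices` means a mistyped component name is rejected by the parser before any stage starts.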
## Deployment
### GCP Deployment Infrastructure
The project includes a complete deployment infrastructure for Google Cloud Platform (GCP). See the [deployment guide](deployment/README.md) for detailed instructions.
**Quick Start:**
1. Configure GCP settings: `cp deployment/config/gcp-config.sh.template deployment/config/gcp-config.sh`
2. Run setup scripts: `./deployment/scripts/01-setup-gcp-project.sh` (and 02-04)
3. Deploy Cloud Run Job: `./deployment/cloud-run/deploy.sh`
4. Schedule weekly updates: `./deployment/cloud-run/schedule.sh`
**Deployment Options:**
- **VM-based**: For initial batch processing (`./deployment/vm/create-vm.sh`)
- **Cloud Run Jobs**: For scheduled weekly updates (automated via Cloud Scheduler)
- **Manual execution**: Run on-demand (`./deployment/cloud-run/execute.sh`)
**Documentation:**
- [Deployment README](deployment/README.md) - Quick start and troubleshooting
- [GCP Deployment Guide](docs/GCP_DEPLOYMENT_GUIDE.md) - Comprehensive deployment documentation
### Production Infrastructure
- **ETL Pipeline**: Cloud Run Jobs (scheduled weekly) + VM instances (initial batch)
- **Backend Service**: Cloud Run with autoscaling (future)
- **Vector Database**: Managed Qdrant instance or Qdrant Cloud
- **Document Storage**: Cloud Storage (GCS)
- **Frontend**: Streamlit or Chainlit (future)
- **Scheduling**: Cloud Scheduler or GitHub Actions workflows
## 🧪 Development Workflow
### Contributing Guidelines
1. Create a feature branch from `main`
2. Implement your changes with tests
3. Submit a pull request for review
### Testing
```bash
# Run tests
pytest
# Run with coverage
pytest --cov=backend
```
### Deployment
Development and production environments are managed through Docker and GitHub Actions:
```bash
# Deploy to development
./scripts/deploy.sh dev
# Deploy to production
./scripts/deploy.sh prod
```
## Security & Configuration
- API keys and secrets are managed via `.env` files (not committed to GitHub)
- Production secrets are stored in GCP Secret Manager
- Access control is implemented at the API level
## License
This project is licensed under CC-BY-NC-SA 4.0. See the LICENSE file for details.