{"id":24010845,"url":"https://github.com/eddieatgoogle/sql-based-genai-data-pipeline","last_synced_at":"2025-02-25T13:48:35.848Z","repository":{"id":270960821,"uuid":"911863508","full_name":"EddieAtGoogle/SQL-Based-GenAI-Data-Pipeline","owner":"EddieAtGoogle","description":"GenAI data pipeline that performs data preparation, management and performance evaluation tasks for RAG systems using SQL as the primary development language.  Please feel free to use this as a starting point for your own projects.","archived":false,"fork":false,"pushed_at":"2025-01-21T07:12:51.000Z","size":1722,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-21T08:20:42.927Z","etag":null,"topics":["bigquery","bqml","dataform","dbt","embeddings","gemini","google-cloud-platform","sql","vector-search","vertex-ai"],"latest_commit_sha":null,"homepage":"","language":"HCL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EddieAtGoogle.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-04T03:40:16.000Z","updated_at":"2025-01-21T07:12:54.000Z","dependencies_parsed_at":"2025-01-06T19:37:53.989Z","dependency_job_id":null,"html_url":"https://github.com/EddieAtGoogle/SQL-Based-GenAI-Data-Pipeline","commit_stats":null,"previous_names":["eddieatgoogle/genai_data_pipeline","eddieatgoogle/sql-based-genai-data-pipeline"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EddieAtGoogle%2FSQL-Based-GenAI-Data-Pipeline"
,"tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EddieAtGoogle%2FSQL-Based-GenAI-Data-Pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EddieAtGoogle%2FSQL-Based-GenAI-Data-Pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EddieAtGoogle%2FSQL-Based-GenAI-Data-Pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EddieAtGoogle","download_url":"https://codeload.github.com/EddieAtGoogle/SQL-Based-GenAI-Data-Pipeline/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240680827,"owners_count":19840314,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","bqml","dataform","dbt","embeddings","gemini","google-cloud-platform","sql","vector-search","vertex-ai"],"created_at":"2025-01-08T04:42:32.724Z","updated_at":"2025-02-25T13:48:35.414Z","avatar_url":"https://github.com/EddieAtGoogle.png","language":"HCL","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# Building GenAI Data Pipelines with SQL\n\n🎯 **Perfect for data teams who know SQL and want to build production-ready GenAI pipelines without Python expertise**\n\n🤖 Data analytics and engineering teams in enterprises are finding themselves responsible for GenAI data pipelines, often without extensive machine learning expertise. Many of these teams are most comfortable working in SQL, and traditional approaches would require significant upskilling in Python and ML frameworks. 
This repository bridges that gap by showing how to build sophisticated GenAI pipelines using familiar SQL-based tools.\n\n⚡ **Get Started in Minutes**: Deploy a complete GenAI pipeline with just SQL and a few commands.\n\n📚 **What You'll Learn**:\n\n✨ **Build** production-grade GenAI pipelines using SQL  \n🛡️ **Implement** best practices for data quality and security  \n📊 **Monitor** and evaluate GenAI model performance\n\n**Prerequisites**: SQL knowledge  \n**Not Required**: Python, Machine Learning expertise\n\n[![Built with Dataform](https://img.shields.io/badge/Built%20with-Dataform-blue)](https://cloud.google.com/dataform)\n[![GCP Ready](https://img.shields.io/badge/GCP-Ready-green)](https://cloud.google.com/)\n[![Uses Gemini](https://img.shields.io/badge/AI-Gemini%201.5-purple)](https://cloud.google.com/vertex-ai)\n[![SQL Only](https://img.shields.io/badge/SQL-Only-orange)](https://cloud.google.com/bigquery)\n\n\u003c/div\u003e\n\n---\n\n## 📋 Overview\n\nTransform your customer feedback data into actionable insights using state-of-the-art AI/ML techniques:\n\n- 🎯 Sentiment analysis using Gemini 1.5\n- 🔍 Vector embeddings for semantic search\n- 🔮 Intelligent question clustering\n- 📊 Automated theme identification\n- ⚡ Analytics on RAG system usage, quality and performance\n\n## 🏗️ Architecture\n\n\u003cdiv align=\"center\"\u003e\n\n![Architecture Diagram](./docs/assets/architecture.png)\n\n\u003c/div\u003e\n\nBuilt on enterprise-grade Google Cloud technologies:\n\n- **Dataform** - Orchestration \u0026 transformation\n- **BigQuery** - Serverless data warehouse\n- **Vertex AI** - Machine learning operations\n  - Gemini 1.5\n  - Text Embedding API\n  - Vector Search\n\n### Dataform Configuration\n\nThe project supports two options for Dataform repository configuration:\n\n1. 
**Default Repository (Recommended for Learning)**\n   - Uses Dataform's built-in repository\n   - Perfect for learning and experimentation\n   - No additional Git setup required\n   - Enabled by default\n\n2. **Remote Git Repository (Optional)**\n   - For production or team collaboration\n   - Requires a GitHub repository and personal access token\n   - Enable by setting `use_remote_git = true` in `terraform.tfvars`\n\nChoose the option that best suits your needs. For most learning scenarios, the default repository is recommended.\n\n## 🚀 Prerequisites Checklist\n\nBefore starting the deployment, ensure you have the following prerequisites in place:\n\n### 1. Google Cloud Environment\n- [ ] A Google Cloud project with billing enabled\n- [ ] Owner or Editor role on the project\n\n### 2. Development Tools\n- [ ] Git (version \u003e= 2.0)\n- [ ] Terraform (version \u003e= 1.0)\n- [ ] Google Cloud SDK (version \u003e= 440.0.0)\n\n### 3. Required APIs\nRun this command to enable necessary APIs:\n```bash\ngcloud services enable \\\n  secretmanager.googleapis.com \\\n  dataform.googleapis.com \\\n  bigquery.googleapis.com \\\n  artifactregistry.googleapis.com\n```\n\n## 🚀 Quick Start\n\n1. **Clone the Repository**\n   ```bash\n   git clone \u003crepository-url\u003e\n   cd genai_data_pipeline\n   ```\n\n2. **Create Required Google Groups**\n   - Create three Google Groups in your workspace:\n     - Dataform Users (e.g., `dataform-users@your-domain.com`)\n     - Data Readers (e.g., `data-readers@your-domain.com`)\n     - Data Owner (use an existing team email)\n\n3. **Configure Terraform Variables**\n   ```bash\n   cd terraform\n   cp terraform.tfvars.example terraform.tfvars\n   ```\n   Edit `terraform.tfvars` and set:\n   - Your `project_id`\n   - Your Google Groups emails\n   - Optionally customize other settings\n\n4. **Initialize and Apply Terraform**\n   ```bash\n   terraform init\n   terraform plan\n   terraform apply\n   ```\n\n5. 
**Verify Deployment**\n   - Visit the [Dataform UI](https://console.cloud.google.com/bigquery/dataform)\n   - Select your project and repository\n   - Try creating a new definition\n\n## 🚀 Terraform State Management\n\nThis project supports two options for managing Terraform state:\n\n### Local State (Default)\nBy default, Terraform will store state locally in your workspace. This is suitable for:\n- Individual learning and development\n- Quick prototyping\n- Local testing and experimentation\n\n### Remote State in Google Cloud Storage (Recommended for Teams)\nFor team environments or production deployments, we recommend using Google Cloud Storage (GCS) for state management. This provides:\n- 🤝 Team collaboration capabilities\n- 🔒 State locking to prevent concurrent modifications\n- 🔄 Version history and backup\n- 🛡️ Better security through Google Cloud IAM\n\nTo enable GCS state storage:\n\n```bash\n# 1. Create a GCS bucket for state storage\nexport PROJECT_ID=\"your-project-id\"\ngsutil mb -l us-central1 gs://${PROJECT_ID}-terraform-state\n\n# 2. Enable versioning for state history\ngsutil versioning set on gs://${PROJECT_ID}-terraform-state\n\n# 3. Update backend configuration\n# Uncomment and configure the backend block in terraform/backend.tf:\nterraform {\n  backend \"gcs\" {\n    bucket = \"YOUR_PROJECT_ID-terraform-state\"\n    prefix = \"genai-pipeline\"\n  }\n}\n```\n\n\u003e **Note**: For this educational project, local state is perfectly fine for getting started. Consider switching to GCS state storage when working in a team or moving to production.\n\n## 🚀 Setup Guide\n\nChoose your preferred setup path:\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ch3\u003e📱 Option A: Using Google Cloud Console (Recommended for Beginners)\u003c/h3\u003e\u003c/summary\u003e\n\n### Step 1: Initial Setup\n\n1. 
**Access Google Cloud Console**\n   - Navigate to [console.cloud.google.com](https://console.cloud.google.com)\n   - Create or select your project\n   - Note your `Project ID` for later use\n\n2. **Enable Required APIs**\n   - Go to [APIs \u0026 Services](https://console.cloud.google.com/apis/dashboard)\n   - Click \"Enable APIs and Services\"\n   - Enable the following:\n     - BigQuery API\n     - BigQuery Connection API\n     - Cloud Storage API\n     - Vertex AI API\n\n### Step 2: Create Storage Bucket\n\n1. Navigate to [Cloud Storage](https://console.cloud.google.com/storage)\n2. Click \"Create Bucket\"\n   - Name: `your-project-consumer-reviews`\n   - Location: `us-central1`\n   - Default storage class: `Standard`\n   - Access control: `Uniform`\n3. Click \"Create\"\n4. Upload Data:\n   - Open your new bucket\n   - Click \"Upload Files\"\n   - Select the sample data file from `genai_data_pipeline/data/consumer_review_data.parquet`\n   - Wait for completion\n\n### Step 3: Initialize BigQuery Dataset\n\n1. Open [BigQuery Console](https://console.cloud.google.com/bigquery)\n2. Create Dataset:\n   - Click your project name\n   - Click \"Create Dataset\"\n   - Dataset ID: `consumer_reviews_dataset`\n   - Data location: `US (multi-region)`\n   - Click \"Create dataset\"\n3. Load Data:\n   - Click \"Create Table\"\n   - Source: Select \"Google Cloud Storage\"\n   - File format: `Parquet`\n   - Source path: `gs://your-project-consumer-reviews/consumer_review_data.parquet`\n   - Table name: `consumer_review_data`\n   - Schema: Select \"Auto detect\"\n   - Click \"Create table\"\n\n### Step 4: Configure Remote Connection\n\n1. **Create Connection**\n   - In BigQuery, click \"More\" → \"Connections\"\n   - Click \"Create Connection\"\n   - Configure:\n     ```\n     Connection type: Cloud Resource\n     Service: Vertex AI\n     Connection ID: vertex-ai\n     Location: us-central1\n     ```\n   - Click \"Create\"\n\n2. 
**Set Up Permissions**\n   - Go to [IAM \u0026 Admin](https://console.cloud.google.com/iam-admin)\n   - Find: `bq-connection-sa@your-project-id.iam.gserviceaccount.com`\n   - Add roles:\n     - Vertex AI User\n     - BigQuery Admin\n\n### Step 5: Update Configuration Files\n\n1. Edit `dataform.json`:\n   ```json\n   {\n     \"defaultSchema\": \"consumer_reviews_dataset\",\n     \"defaultDatabase\": \"your-project-id\",\n     \"defaultLocation\": \"US\"\n   }\n   ```\n\n2. Edit `includes/constants.js`:\n   ```javascript\n   const PROJECT_ID = \"your-project-id\";\n   const SCHEMA_NAME = \"consumer_reviews_dataset\";\n   const REMOTE_CONNECTION = \"projects/your-project-id/locations/us-central1/connections/vertex-ai\";\n   ```\n\n### Step 6: Verify Setup\n\n1. In BigQuery Console:\n   - Run: `SELECT COUNT(*) FROM consumer_reviews_dataset.consumer_review_data`\n2. Check Connection:\n   - Go to \"Connections\"\n   - Verify `vertex-ai` status is \"Connected\"\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003ch3\u003e💻 Option B: Using Command Line\u003c/h3\u003e\u003c/summary\u003e\n\n### Step 1: Initial Setup\n\n```bash\n# Set environment variables\nexport PROJECT_ID=\"your-project-id\"\nexport BUCKET_NAME=\"${PROJECT_ID}-consumer-reviews\"\n\n# Configure gcloud\ngcloud config set project $PROJECT_ID\n\n# Enable APIs\ngcloud services enable bigquery.googleapis.com\ngcloud services enable bigqueryconnection.googleapis.com\ngcloud services enable storage.googleapis.com\ngcloud services enable aiplatform.googleapis.com\n```\n\n### Step 2: Create Storage Bucket\n\n```bash\n# Create bucket\ngsutil mb -l us-central1 gs://$BUCKET_NAME\n\n# Upload data to the bucket root (the load step reads it from there)\ngsutil cp genai_data_pipeline/data/consumer_review_data.parquet gs://$BUCKET_NAME/\n```\n\n### Step 3: Initialize BigQuery Dataset\n\n```bash\n# Create dataset\nbq mk --dataset \\\n  --location=US \\\n  ${PROJECT_ID}:consumer_reviews_dataset\n\n# Load data\nbq query --use_legacy_sql=false \\\n  
\"LOAD DATA INTO \\`${PROJECT_ID}.consumer_reviews_dataset.consumer_review_data\\`\n   FROM FILES (\n     format = 'PARQUET',\n     uris = ['gs://${BUCKET_NAME}/consumer_review_data.parquet']\n   );\"\n```\n\n### Step 4: Configure Remote Connection\n\n```bash\n# Create connection\nbq mk --connection \\\n  --location=us-central1 \\\n  --project_id=${PROJECT_ID} \\\n  --connection_type=CLOUD_RESOURCE \\\n  vertex-ai\n\n# Get service account\nexport CONNECTION_SA=$(bq show --connection ${PROJECT_ID}.us-central1.vertex-ai \\\n  | grep \"serviceAccountId\" | cut -d'\"' -f4)\n\n# Grant permissions\ngcloud projects add-iam-policy-binding ${PROJECT_ID} \\\n  --member=\"serviceAccount:${CONNECTION_SA}\" \\\n  --role=\"roles/aiplatform.user\"\n\ngcloud projects add-iam-policy-binding ${PROJECT_ID} \\\n  --member=\"serviceAccount:${CONNECTION_SA}\" \\\n  --role=\"roles/bigquery.admin\"\n```\n\n### Step 5: Update Configuration Files\n\n```bash\n# Get connection ID\nexport CONNECTION_ID=$(bq show --connection ${PROJECT_ID}.us-central1.vertex-ai \\\n  | grep \"name\" | cut -d'\"' -f4)\n\n# Update files (manual step)\necho \"Update dataform.json and constants.js with your project details\"\n```\n\n### Step 6: Verify Setup\n\n```bash\n# Check data\nbq query --use_legacy_sql=false \\\n  \"SELECT COUNT(*) FROM ${PROJECT_ID}.consumer_reviews_dataset.consumer_review_data\"\n\n# Verify connection\nbq show --connection ${PROJECT_ID}.us-central1.vertex-ai\n\n# Test Vertex AI access\ngcloud ai models list --region=us-central1\n```\n\n\u003c/details\u003e\n\n## 🔄 Pipeline Components\n\n### Review Processing\n1. `incoming_reviews` - Data ingestion \u0026 validation\n2. `reviews_with_sentiment` - Sentiment analysis\n3. `reviews_with_embeddings` - Vector embedding generation\n4. `create_vector_index` - Similarity search indexing\n\n### Question Analysis\n1. `questions_with_embeddings` - Semantic embedding\n2. `questions_with_clusters` - K-means clustering\n3. 
`question_themes` - Theme generation\n4. `qa_with_evaluation` - Quality assessment\n5. `qa_with_product_type` - Product classification\n6. `qa_quality_data` - Analysis aggregation\n\n## 🏷️ Tags\n\n- `process_reviews` - Review processing\n- `quality_data_prep` - Question analysis\n- `bqml_model` - Model operations\n- `vector_index_creation` - Search setup\n- `regenerate_question_themes` - Theme updates\n\n## ✅ Data Quality\n\nBuilt-in data quality checks ensure:\n- ✓ Key uniqueness\n- ✓ Required field validation\n- ✓ Row-level conditions\n- ✓ Incremental processing\n\n## 📦 Dependencies\n\n- @dataform/core: 2.8.3\n- Google Cloud Platform:\n  - BigQuery\n  - Vertex AI (Gemini)\n  - Cloud Storage\n\n## 🆘 Troubleshooting\n\n\u003cdetails\u003e\n\u003csummary\u003eCommon Issues \u0026 Solutions\u003c/summary\u003e\n\n### Permission Errors\n```bash\n# Verify IAM roles\ngcloud projects get-iam-policy $PROJECT_ID \\\n  --flatten=\"bindings[].members\" \\\n  --format='table(bindings.role)' \\\n  --filter=\"bindings.members:$(gcloud config get-value account)\"\n```\n\n### Connection Issues\n```bash\n# Check API status\ngcloud services list --enabled | grep -E \"bigquery|aiplatform\"\n\n# Verify service account\ngcloud iam service-accounts describe ${CONNECTION_SA}\n```\n\n### Data Loading Issues\n```bash\n# Check job status\nbq show -j ${PROJECT_ID}:US.recent_job_id\n```\n\n\u003c/details\u003e\n\n## 🔑 Access Management\n\n### Dataform User Access\nThis project uses Google Groups to manage Dataform access. Users need to be members of the Dataform users group to:\n- Create and edit Dataform definitions\n- Execute Dataform workflows\n- View and query data in BigQuery\n\n#### Setting up Dataform Access\n1. Create a Google Group for Dataform users:\n   ```bash\n   # Using Google Workspace Admin Console or gcloud\n   gcloud identity groups create dataform-users@your-domain.com \\\n     --organization=your-org-id \\\n     --display-name=\"Dataform Users\"\n   ```\n\n2. 
Add members to the group:\n   ```bash\n   gcloud identity groups memberships add \\\n     --group-email=dataform-users@your-domain.com \\\n     --member-email=user@your-domain.com\n   ```\n\n3. Update `terraform.tfvars` with your group:\n   ```hcl\n   dataform_users_group = \"dataform-users@your-domain.com\"\n   ```\n\n4. Apply the Terraform configuration:\n   ```bash\n   terraform apply\n   ```\n\n#### Verifying Access\nAfter setup, users can verify their access:\n1. Visit the [Dataform UI](https://console.cloud.google.com/bigquery/dataform)\n2. Select your project and repository\n3. Try creating a new definition or running a workflow\n\n#### Troubleshooting Access Issues\nIf users can't access Dataform:\n1. Verify group membership:\n   ```bash\n   gcloud identity groups memberships list \\\n     --group-email=dataform-users@your-domain.com\n   ```\n\n2. Check IAM bindings:\n   ```bash\n   gcloud projects get-iam-policy $PROJECT_ID \\\n     --flatten=\"bindings[].members\" \\\n     --filter=\"bindings.role:dataform.developer\"\n   ```\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feddieatgoogle%2Fsql-based-genai-data-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feddieatgoogle%2Fsql-based-genai-data-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feddieatgoogle%2Fsql-based-genai-data-pipeline/lists"}