https://github.com/dylanpicart/google_sheet_survey_pipeline

Lightweight Python ETL using Google Cloud to Extract, Transform, and Load Google Surveys for Exploratory Data Analysis on Streamlit Dashboard and for Excel viewing.

api ci cicd continuous-integration data-cleaning devsecops etl google google-cloud google-cloud-platform python streamlit


Host: GitHub
URL: https://github.com/dylanpicart/google_sheet_survey_pipeline
Owner: dylanpicart
License: mit
Created: 2025-08-04T21:20:38.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-08-06T04:24:12.000Z (11 months ago)
Last Synced: 2025-09-25T21:14:00.807Z (9 months ago)
Topics: api, ci, cicd, continuous-integration, data-cleaning, devsecops, etl, google, google-cloud, google-cloud-platform, python, streamlit
Language: Python
Homepage:
Size: 124 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Security: SECURITY.md

README

# Google Sheet Survey ETL Pipeline

[![CI](https://github.com/dylanpicart/google_sheet_survey_pipeline/actions/workflows/ci.yml/badge.svg)](https://github.com/dylanpicart/google_sheet_survey_pipeline/actions/workflows/ci.yml)

This project provides a reproducible, DevSecOps-ready ETL (Extract, Transform, Load) pipeline for end-to-end analysis of multi-year, multi-language Google Sheet survey data. It covers raw extraction, cleaning, canonical question mapping, summary tables, and Excel outputs for stakeholders.

---

## **Pipeline Overview**

* **Input:** Raw survey CSVs (downloaded from Google Drive or other source folders).
* **Process:** Clean, normalize, translate, audit, map, summarize, and consolidate responses and questions.
* **Output:** Master Excel files for analysis; fully harmonized survey data for reporting or modeling.

---

## Project Structure

```
root/
│
├── Makefile             # Run extract, transform, or load scripts
├── .venv/               # Python virtual environment
├── creds/               # Google API credentials
│
├── data/
│   ├── configs/         # Config files (YAML, mappings, links)
│   ├── raw/             # All raw survey data (CSV, pre-clean)
│   └── processed/       # All cleaned, processed, and summary data
│
├── notebook/            # Jupyter notebooks and experiments
├── tests/               # Unit tests and pipeline QA scripts
├── utils/               # Shared Python utilities/helpers
│
├── scripts/
│   ├── extract/         # Scripts for data download, cleaning, auditing
│   ├── transform/       # Scripts for mapping, summaries, consolidation
│   └── load/            # Scripts for Excel/output/reporting
│
├── logs/                # Logging data and issues throughout the ETL pipeline
├── .env                 # Environment variables
├── .gitignore           # Files and folders excluded from git
├── README.md            # Project overview and documentation
├── SECURITY.md          # Security and responsible disclosure policy
└── requirements.txt     # Python dependencies
```
---

## **ETL Process**

**1. Extract**

* Download all raw survey CSVs (English, Spanish, etc.) for all years/cohorts.

### **Google API Extraction**

#### **Overview**

This pipeline’s first step (`scrape_drive_links.py`) automates the downloading of raw survey CSV files from Google Drive (or optionally Google Sheets, converted to CSV).
It uses the Google Drive and/or Google Sheets API for authenticated access to your organization’s files and ensures you always work with up-to-date, original data.

---

#### **Requirements**

* **Google Cloud Platform Project** with Drive and/or Sheets API enabled
* **Service Account credentials** (or OAuth2 credentials) with access to the relevant files/folders
* The following **Python packages**:

  * `google-api-python-client`
  * `google-auth-httplib2`
  * `google-auth-oauthlib`
  * `gspread` (for direct Google Sheets to CSV, optional)
  * `pandas`
  * `requests`
  * `tqdm` (optional, for progress bars)

**Install them with:**

```bash
pip install google-api-python-client google-auth-httplib2 google-auth-oauthlib gspread pandas requests tqdm
```

---

#### **Google API Setup**

1. **Create a Google Cloud project** at [https://console.cloud.google.com/](https://console.cloud.google.com/)
2. **Enable the APIs:**

   * Google Drive API
   * (Optionally) Google Sheets API
3. **Create and download a Service Account key** (JSON) or OAuth client secret.
4. **Share your survey folder/files** with the Service Account email (if using a Service Account).
5. **Save your credentials** (usually as `service_account.json` or `credentials.json`) in a secure location.

---

#### **Script Configuration**

* Place your credentials file in a known location (e.g., `creds/service_account.json`, matching the `creds/` folder in the project structure).
* Set an environment variable or `.env` entry:

  ```
  GOOGLE_APPLICATION_CREDENTIALS=creds/service_account.json
  ```
* Update `scrape_drive_links.py` to read your folder IDs or search queries as needed.
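
The configuration step above can be checked early in a script. The following is a minimal sketch, not the repository's actual code: the function name `resolve_credentials_path` is hypothetical, and it reads only the process environment, so loading a `.env` file is assumed to happen elsewhere (the README does not say how).

```python
import os
from pathlib import Path
from typing import Mapping, Optional

ENV_VAR = "GOOGLE_APPLICATION_CREDENTIALS"


def resolve_credentials_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the credentials path named by ENV_VAR, or fail with a clear message.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR, "").strip()
    if not raw:
        raise RuntimeError(f"{ENV_VAR} is not set; export it or add it to .env.")
    path = Path(raw).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"{ENV_VAR} points to a missing file: {path}")
    return path
```

Failing at this point, before any API call, keeps the "missing credentials" case distinct from network or quota errors.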

---

#### **How the Extraction Script Works**

* **Authenticates** with Google Drive using your credentials.
* **Lists all files** in a given folder or matching a search query.
* **Downloads** each file as CSV (either as a raw file or converts a Google Sheet to CSV).
* **Saves** each file into a local directory (e.g., `raw/` or `data/raw/`).
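
The four steps above could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of `scrape_drive_links.py`: the function names, the read-only scope, and the `data/raw` default are all hypothetical choices. The Google client libraries are imported inside `download_folder` so the two helper functions can be exercised without them installed.

```python
import io
from pathlib import Path
from typing import Any, Dict, Iterator

SHEET_MIME = "application/vnd.google-apps.spreadsheet"
READONLY_SCOPE = "https://www.googleapis.com/auth/drive.readonly"


def iter_folder_files(service: Any, folder_id: str) -> Iterator[Dict[str, str]]:
    """Yield every non-trashed file in a Drive folder, following pagination."""
    page_token = None
    while True:
        response = (
            service.files()
            .list(
                q=f"'{folder_id}' in parents and trashed = false",
                fields="nextPageToken, files(id, name, mimeType)",
                pageToken=page_token,
            )
            .execute()
        )
        yield from response.get("files", [])
        page_token = response.get("nextPageToken")
        if not page_token:
            return


def csv_request(service: Any, file: Dict[str, str]) -> Any:
    """Native Google Sheets must be exported as CSV; other files download as-is."""
    if file["mimeType"] == SHEET_MIME:
        return service.files().export_media(fileId=file["id"], mimeType="text/csv")
    return service.files().get_media(fileId=file["id"])


def download_folder(credentials_path: str, folder_id: str, out_dir: str = "data/raw") -> None:
    """Authenticate, list, download, and save each file (requires the Google client)."""
    from google.oauth2 import service_account
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaIoBaseDownload

    creds = service_account.Credentials.from_service_account_file(
        credentials_path, scopes=[READONLY_SCOPE]
    )
    service = build("drive", "v3", credentials=creds)
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    for file in iter_folder_files(service, folder_id):
        buffer = io.BytesIO()
        downloader = MediaIoBaseDownload(buffer, csv_request(service, file))
        done = False
        while not done:
            _, done = downloader.next_chunk()
        name = file["name"]
        if not name.lower().endswith(".csv"):
            name += ".csv"  # exported Sheets have no extension in Drive
        (target / name).write_bytes(buffer.getvalue())
```

One caveat worth checking against the Drive documentation: exporting a spreadsheet as `text/csv` is understood to return only its first sheet, so multi-tab surveys may need the Sheets API instead.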

---

#### **Common Usage Example**

```bash
python -m scripts.extract.scrape_drive_links
```

This will:

* Authenticate using your Google credentials
* Download all target survey files (for each year/cohort) to your local `data/raw/` directory
* Ensure that your pipeline always starts with the latest official data

---

#### **Possible Extraction Script Features**

* **File type filtering** (only download `.csv` or `.gsheet`)
* **Date filtering** (only new/modified files)
* **Automatic conversion** of Google Sheets to CSV
* **Logging of all downloaded file names, IDs, and timestamps**
* **Parallel downloads** for large batches
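
File type and date filtering can both be expressed in the Drive search query itself. The helper below is a hypothetical sketch (the name `build_drive_query` and its parameters are not from the repository) that assembles a `q` string from a folder ID, optional MIME types, and an optional cutoff time.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


def build_drive_query(
    folder_id: str,
    mime_types: Iterable[str] = (),
    modified_after: Optional[datetime] = None,
) -> str:
    """Build a Drive `q` string limited to a folder, MIME types, and a cutoff time.

    `modified_after` should be timezone-aware; a naive datetime would be
    interpreted as local time by astimezone().
    """
    clauses = [f"'{folder_id}' in parents", "trashed = false"]
    alternatives = " or ".join(f"mimeType = '{m}'" for m in mime_types)
    if alternatives:
        clauses.append(f"({alternatives})")
    if modified_after is not None:
        # Drive treats timestamps without an offset as UTC.
        stamp = modified_after.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
        clauses.append(f"modifiedTime > '{stamp}'")
    return " and ".join(clauses)
```

For example, passing `["text/csv", "application/vnd.google-apps.spreadsheet"]` restricts the listing to CSV uploads and native Google Sheets.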

---

#### **Error Handling**

* **Missing credentials**: Script halts and prints a clear message
* **API quota exceeded**: Script sleeps and retries (or halts with a warning)
* **No files found**: Script exits and prints which query/folder was empty
* **Partial download**: Script can be resumed/restarted safely
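
The "sleeps and retries" behaviour for quota errors is commonly implemented as exponential backoff. The sketch below shows that pattern with a hypothetical helper, `call_with_retries`; which exception actually signals a quota problem is an assumption left to the caller (with `google-api-python-client` it would typically be `googleapiclient.errors.HttpError`).

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    *,
    attempts: int = 5,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(); on a retryable error wait base_delay * 2**n, then try again.

    The last failure is re-raised so the pipeline halts instead of
    silently continuing with missing data.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Injecting `sleep` keeps the helper testable without real waiting.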

---

#### **Security**

* **Never commit credentials** (JSON) to git or public repos
* Restrict service account to “read-only” where possible

---

### **Pro Tips**

* Use **Google Groups** to give bulk file access to your service account.
* For one-off manual downloads, use Google Drive web UI, but for reproducibility, always prefer API automation.
* Keep your service account credentials and folder IDs in `.env` or as command-line args for flexibility.

---

**2. Transform**

* Standardize headers, clean rows, handle missing or malformed data.
* Translate Spanish CSVs to English.
* Audit all unique questions/options; map variations to canonical/overarching questions.
* Generate summary tables (Likert, frequency, Yes/No, etc.) per year and group.
* Consolidate all responses and question summaries across years and groups.
* Summarize totals for cross-year and cross-group analysis.
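
To make the mapping and summary steps concrete, here is a small standard-library sketch. The mapping dictionary, its keys, and the row layout (`year`, `question`, `answer`) are all invented for illustration; the project's own mapping and its pandas-based scripts may be organized quite differently.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Mapping, Set, Tuple

# Hypothetical mapping from question wording variants to one canonical key.
CANONICAL_QUESTIONS = {
    "i feel safe at school": "feels_safe",
    "i feel safe at my school.": "feels_safe",
}


def summarize(
    rows: Iterable[Mapping[str, object]],
    question_map: Mapping[str, str],
) -> Tuple[Dict[Tuple[object, str], Counter], Set[str]]:
    """Count answers per (year, canonical question); collect unmapped wordings."""
    counts: Dict[Tuple[object, str], Counter] = defaultdict(Counter)
    unmapped: Set[str] = set()
    for row in rows:
        question = str(row["question"])
        canonical = question_map.get(question.strip().lower())
        if canonical is None:
            unmapped.add(question)  # surfaced for the audit step
            continue
        answer = str(row["answer"]).strip().title()
        counts[(row["year"], canonical)][answer] += 1
    return dict(counts), unmapped
```

Returning the unmapped wordings alongside the counts gives the audit step a list of questions that still need a canonical home.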

**3. Load**

* Export all final and intermediate outputs to Excel, with tabs for every summary and year/group.
* Provide “master summary” and “full data” workbooks for stakeholders.
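
A workbook with one tab per summary can be written with `pandas.ExcelWriter`. The sketch below assumes the summaries are already available as a dict of DataFrames; the helper names are hypothetical. Excel limits sheet names to 31 characters, forbids `[ ] : * ? / \`, and treats names case-insensitively, so the tab names are sanitized first.

```python
import re
from typing import Any, Mapping, Set


def safe_sheet_name(name: str, used: Set[str]) -> str:
    """Return a unique, Excel-legal sheet name (max 31 characters)."""
    cleaned = re.sub(r"[\[\]:*?/\\]", "_", name).strip("'")
    base = cleaned[:31].rstrip("'") or "Sheet"
    candidate, n = base, 1
    while candidate.lower() in used:
        suffix = f"_{n}"
        candidate = base[: 31 - len(suffix)] + suffix
        n += 1
    used.add(candidate.lower())
    return candidate


def write_workbook(path: str, tables: Mapping[str, Any]) -> None:
    """Write each DataFrame in `tables` to its own tab (requires pandas + xlsxwriter)."""
    import pandas as pd  # imported lazily so safe_sheet_name works without pandas

    used: Set[str] = set()
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        for name, frame in tables.items():
            frame.to_excel(writer, sheet_name=safe_sheet_name(name, used), index=False)
```

Tracking used names matters when long year/group labels truncate to the same 31 characters.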

---

## **Requirements**

* **Python 3.8+**
* **Required Python packages:**

  * `pandas`
  * `xlsxwriter`
  * (optionally: `openpyxl`, `numpy`, `regex`; `unicodedata` is part of the Python standard library and needs no install)
* **bash** (for pipeline orchestration)

Install requirements (example):

```bash
pip install pandas xlsxwriter openpyxl
```

---

## **How to Run the Pipeline**

### **Installation and Requirements**

1. **Clone this repository:**

```bash
git clone https://github.com/dylanpicart/google_sheet_survey_pipeline.git
cd google_sheet_survey_pipeline
```

2. **Set up your Python virtual environment** (recommended):

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

3. **Install all Python dependencies:**

```bash
pip install -r requirements.txt
```

4. **Run the pipeline with:**

```bash
./run_etl.sh
```

*(Ensure it’s executable: `chmod +x run_etl.sh`)*

**This will:**

* Scrape and standardize raw data
* Translate (if needed)
* Audit, map, and consolidate all survey questions/responses
* Output results as `SF_Master_Summary.xlsx` and `SF_Master_Data.xlsx` in the `data/processed/` directory

---

### **Using the Makefile for Your ETL Pipeline**

This project includes a **Makefile** for easy automation and orchestration of the ETL workflow.

#### **Common Makefile Commands**

* **Run the full ETL pipeline (skipping Google Drive scraping and confirming before the final Excel output):**

```bash
make pipeline
```

* **Run only the extract step (data loading, cleaning, translation, audit):**

```bash
make extract
```

* **Run only the transform step (mapping, summary tables, consolidation):**

```bash
make transform
```

* **Manually run the scraping step, if you need to update the survey links:**

```bash
make scrape
```

* **Run only the final Excel export step:**

```bash
make load
```

* **Clean processed data outputs:**

```bash
make clean
```

* **Run all unit tests:**

```bash
make test
```

---

#### **How it works**

* The Makefile allows you to run any ETL stage individually or in sequence.
* The default `pipeline` target skips the Google Drive scraping step and asks for confirmation before creating the final Excel output, so you won’t overwrite results by accident.
* You can always run the original `run_etl.sh` script for fully automated execution.

---

#### **Pro tip:**

**Always run `make` commands from the project root directory.**
The Makefile expects all scripts, data, and outputs to use the standard project structure.

---

## **Logging**

All steps of the ETL pipeline are **fully logged** for transparency, debugging, and auditability.
Logs are stored in the `logs/` directory, with separate subfolders for each ETL phase:

```
logs/
├── extract/
│   ├── extract.log            # All extraction progress and info
│   └── extract_error.log      # Extraction warnings and errors only
├── transform/
│   ├── transform.log          # Data cleaning, mapping, consolidation steps
│   └── transform_error.log    # Warnings and errors during transformation
└── load/
    ├── load.log               # Final Excel/report generation steps
    └── load_error.log         # Any errors or warnings during loading/output
```

* **Every script in the pipeline logs to its respective ETL phase.**
* **Info, warning, and error messages** are captured, including which files were processed, skipped, or had issues.
* **Error logs** make it easy to spot and debug failed data loads, missing columns, or API problems.
* Logs rotate automatically to prevent disk overflows.
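
One way to get this layout with the standard library is a pair of `RotatingFileHandler`s per phase. This is a sketch of the idea, not the project's logging utility: the function name, logger naming scheme, size limit, and backup count are assumptions.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def get_phase_logger(
    phase: str,
    log_root: str = "logs",
    max_bytes: int = 1_000_000,
    backups: int = 3,
) -> logging.Logger:
    """Return a logger writing <phase>.log (INFO+) and <phase>_error.log (WARNING+)."""
    logger = logging.getLogger(f"etl.{phase}")
    if logger.handlers:  # already configured; avoid duplicate lines
        return logger
    folder = Path(log_root) / phase
    folder.mkdir(parents=True, exist_ok=True)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    targets = ((f"{phase}.log", logging.INFO), (f"{phase}_error.log", logging.WARNING))
    for filename, level in targets:
        handler = RotatingFileHandler(
            folder / filename, maxBytes=max_bytes, backupCount=backups, encoding="utf-8"
        )
        handler.setLevel(level)
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A script in the extract phase would then call `get_phase_logger("extract")` once and use the returned logger throughout.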

---

## 🚦 Continuous Integration & Quality Assurance

This project uses **GitHub Actions** for full CI/CD, automated linting, and security checks:

- **On every push or pull request:**
  - All code is linted with [Ruff](https://github.com/astral-sh/ruff) and [Flake8](https://flake8.pycqa.org/).
  - Security is checked with [Bandit](https://bandit.readthedocs.io/) and [pip-audit](https://pypi.org/project/pip-audit/).
  - All tests are run with [pytest](https://pytest.org/), including code coverage reports.
  - Logs and coverage reports are uploaded as build artifacts on failure.

You can view build status, logs, and artifacts on the [Actions tab](https://github.com/dylanpicart/google_sheet_survey_pipeline/actions).

---

## 🧹 Linting & Formatting

**Code is autoformatted and style-enforced with:**
- [Black](https://black.readthedocs.io/en/stable/) (for code style, line length, and blank lines)
- [Ruff](https://github.com/astral-sh/ruff) (for code hygiene and unused code removal)
- [Flake8](https://flake8.pycqa.org/) (for PEP8 and custom style rules)

**To lint and auto-fix your code locally:**
```bash
black . --line-length 120
ruff check scripts/ tests/ utils/ --fix
flake8 scripts/ tests/ utils/
bandit -r scripts/ utils/
```
---

## **Possible Approaches to Data Cleaning**

* **Column normalization:** Consistently rename and strip all headers (lowercase, no spaces, uniform naming).
* **String cleaning:** Remove leading/trailing whitespace, standardize punctuation/quotes, handle Unicode artifacts.
* **Missing value handling:** Remove or fill empty rows/cells, standardize “N/A”, handle outliers.
* **Canonical mapping:** Use `QCON_MAP` and audit mapping to ensure all question variations are consolidated.
* **Translation:** Ensure all responses/options are in English for cross-year consistency.
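
The first three approaches can be sketched with the standard library alone. The function names and the set of "missing" tokens below are illustrative assumptions, not the project's actual rules.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical list of strings treated as "no answer".
MISSING_TOKENS = {"", "n/a", "na", "none", "null", "-"}
# Curly quotes mapped to their straight equivalents.
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize_header(header: str) -> str:
    """Lowercase a header and collapse every non-word run into one underscore."""
    text = unicodedata.normalize("NFKC", header).strip().lower()
    return re.sub(r"[^\w]+", "_", text).strip("_")


def clean_cell(value: Optional[str]) -> Optional[str]:
    """Normalize Unicode, quotes, and whitespace; map missing markers to None."""
    if value is None:
        return None
    text = unicodedata.normalize("NFKC", value).translate(QUOTES)
    text = " ".join(text.split())  # trims and collapses internal whitespace
    return None if text.lower() in MISSING_TOKENS else text
```

With pandas, the same functions could be applied through `df.rename(columns=normalize_header)` and an element-wise map over string columns.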

---

## **Next Steps (Pivot & Advanced Analysis)**

* **Multi-select/multi-pivot:** Use canonical mappings to group, count, and analyze all subquestions/options as part of overarching categories.
* **Percentages and rates:** Compute %Yes/No/Maybe per group/year/option.
* **Longitudinal tracking:** Analyze trends in responses across years and cohorts.
* **Visualization:** Use Excel or Python/Streamlit dashboards for charts, pivots, and interactive summaries.
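
As a rough illustration of the percentage step, the sketch below computes answer shares per year and group from plain dictionaries. The row layout (`year`, `group`, `answer`) and the function name are hypothetical; a pandas `groupby` would be the more likely tool inside the pipeline itself.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Mapping, Tuple


def response_rates(
    rows: Iterable[Mapping[str, object]],
) -> Dict[Tuple[object, object], Dict[str, float]]:
    """Return the percentage of each answer within every (year, group) pair."""
    counts: Dict[Tuple[object, object], Counter] = defaultdict(Counter)
    for row in rows:
        counts[(row["year"], row["group"])][str(row["answer"])] += 1
    rates: Dict[Tuple[object, object], Dict[str, float]] = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        rates[key] = {answer: round(100 * n / total, 1) for answer, n in counter.items()}
    return rates
```

Comparing the same group's rates across years then gives a first view of longitudinal trends.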

---

## **Error Handling & Quality Assurance**

* **Each ETL step is checked:** Pipeline halts if a key script or file is missing or if any step fails.
* **Inter-step file checks:** Verifies output files from each phase before moving to the next.
* **Logging:** All errors and successes are printed to the console and recorded in the per-phase log files under `logs/`.
* **Custom checks:** Can be extended to email, Slack, or alert on failure for production.
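
An inter-step file check can be as small as the sketch below, which treats a missing or empty file as a failed phase. The exception class and function name are hypothetical; the pipeline's real checks live in its shell script and Makefile and may test different conditions.

```python
from pathlib import Path
from typing import Iterable, List


class MissingOutputError(RuntimeError):
    """Raised when a phase did not leave the files the next phase needs."""


def require_outputs(phase: str, paths: Iterable[str]) -> List[Path]:
    """Verify each expected output exists and is non-empty, or halt with details."""
    checked = [Path(p) for p in paths]
    problems = [p for p in checked if not p.is_file() or p.stat().st_size == 0]
    if problems:
        listing = ", ".join(str(p) for p in problems)
        raise MissingOutputError(f"{phase} phase produced no usable output for: {listing}")
    return checked
```

Calling this between phases, for instance on the expected files in `data/processed/` before the load step, turns a silent gap into an explicit error naming the missing files.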

**How to debug:**

* Inspect the output/error message to see where the process halted.
* Review intermediate outputs (CSV/Excel) to verify data at each stage.
* Re-run just the failed script for rapid iteration.

---
## License
This project is licensed under the MIT License. See the LICENSE file for details.

---

## **Author**

Developed by **Dylan Picart at Partnership With Children.**

Open an issue or submit a pull request!
For help running or customizing the pipeline, contact dylanpicart@mail.adelphi.edu.
