https://github.com/dylanpicart/google_sheet_survey_pipeline

Lightweight Python ETL using Google Cloud to Extract, Transform, and Load Google Surveys for Exploratory Data Analysis on Streamlit Dashboard and for Excel viewing.

# Google Sheet Survey ETL Pipeline

[![CI](https://github.com/dylanpicart/google_sheet_survey_pipeline/actions/workflows/ci.yml/badge.svg)](https://github.com/dylanpicart/google_sheet_survey_pipeline/actions/workflows/ci.yml)

This project provides a reproducible, DevSecOps-ready ETL (Extract, Transform, Load) pipeline for end-to-end analysis of multi-year, multi-language Google Sheet survey data. It covers raw extraction, cleaning, canonical mapping, advanced summaries, and executive-ready Excel outputs.

---

## **Pipeline Overview**

* **Input:** Raw survey CSVs (downloaded from Google Drive or other source folders).
* **Process:** Clean, normalize, translate, audit, map, summarize, and consolidate responses and questions.
* **Output:** Master Excel files for analysis; fully harmonized survey data for reporting or modeling.

---

## Project Structure

```
root/

├── Makefile # Run extract, transform, or load scripts
├── .venv/ # Python virtual environment
├── creds/ # Google API credentials

├── data/
│ ├── configs/ # Config files (YAML, mappings, links)
│ ├── raw/ # All raw survey data (CSV, pre-clean)
│ └── processed/ # All cleaned, processed, and summary data

├── notebook/ # Jupyter notebooks and experiments
├── tests/ # Unit tests and pipeline QA scripts
├── utils/ # Shared Python utilities/helpers

├── scripts/
│ ├── extract/ # Scripts for data download, cleaning, auditing
│ ├── transform/ # Scripts for mapping, summaries, consolidation
│ └── load/ # Scripts for Excel/output/reporting

├── logs/ # Logging data and issues throughout the ETL pipeline
├── .env # Environment variables
├── .gitignore # Files and folders excluded from git
├── README.md # Project overview and documentation
├── SECURITY.md # Security and responsible disclosure policy
└── requirements.txt # Python dependencies

```
---

## **ETL Process**

**1. Extract**

* Download all raw survey CSVs (English, Spanish, etc.) for all years/cohorts.

### **Google API Extraction**

#### **Overview**

This pipeline’s first step (`scrape_drive_links.py`) automates the downloading of raw survey CSV files from Google Drive (or optionally Google Sheets, converted to CSV).
It uses the Google Drive and/or Google Sheets API for authenticated access to your organization’s files and ensures you always work with up-to-date, original data.

---

#### **Requirements**

* **Google Cloud Platform Project** with Drive and/or Sheets API enabled
* **Service Account credentials** (or OAuth2 credentials) with access to the relevant files/folders
* The following **Python packages**:

  * `google-api-python-client`
  * `google-auth-httplib2`
  * `google-auth-oauthlib`
  * `gspread` (for direct Google Sheets to CSV, optional)
  * `pandas`
  * `requests`
  * `tqdm` (optional, for progress bars)

**Install them with:**

```bash
pip install google-api-python-client google-auth-httplib2 google-auth-oauthlib gspread pandas requests tqdm
```

---

#### **Google API Setup**

1. **Create a Google Cloud project** at [https://console.cloud.google.com/](https://console.cloud.google.com/)
2. **Enable the APIs:**

   * Google Drive API
   * (Optionally) Google Sheets API
3. **Create and download a Service Account key** (JSON) or OAuth client secret.
4. **Share your survey folder/files** with the Service Account email (if using a Service Account).
5. **Save your credentials** (usually as `service_account.json` or `credentials.json`) in a secure location.

---

#### **Script Configuration**

* Place your credentials file in a known location (e.g., `creds/service_account.json`, matching the `creds/` folder in the project structure).
* Set an environment variable or `.env` entry:

```
GOOGLE_APPLICATION_CREDENTIALS=creds/service_account.json
```
* Update `scrape_drive_links.py` to read your folder IDs or search queries as needed.
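
The configuration described above could be read at startup roughly as follows. This is a minimal sketch, not the project's actual code: `load_drive_config` and the comma-separated `DRIVE_FOLDER_IDS` variable are hypothetical names, and it assumes the `.env` entries have already been loaded into the process environment (for example by the shell or a dotenv loader).

```python
import os
from pathlib import Path


def load_drive_config(env=None):
    """Return (credentials_path, folder_ids) read from environment variables."""
    env = os.environ if env is None else env

    creds = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not creds:
        raise SystemExit("GOOGLE_APPLICATION_CREDENTIALS is not set; add it to .env or export it.")

    creds_path = Path(creds)
    if not creds_path.is_file():
        raise SystemExit(f"Credentials file not found: {creds_path}")

    # DRIVE_FOLDER_IDS is a hypothetical variable name used only for this sketch.
    raw_ids = env.get("DRIVE_FOLDER_IDS", "")
    folder_ids = [item.strip() for item in raw_ids.split(",") if item.strip()]
    return creds_path, folder_ids
```

Failing early with a readable message keeps a missing or mistyped credentials path from surfacing later as an opaque API error.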

---

#### **How the Extraction Script Works**

* **Authenticates** with Google Drive using your credentials.
* **Lists all files** in a given folder or matching a search query.
* **Downloads** each file as CSV (either as a raw file or converts a Google Sheet to CSV).
* **Saves** each file into a local directory (e.g., `raw/` or `data/raw/`).
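
The four steps above can be sketched with `google-api-python-client` as shown below. This is an illustration under stated assumptions rather than the contents of `scrape_drive_links.py`: all function names are hypothetical, a Service Account key is assumed, and the read-only Drive scope is used. Google Sheets are exported as CSV, while other files are downloaded as stored.

```python
import io
from pathlib import Path

SHEET_MIME = "application/vnd.google-apps.spreadsheet"
READONLY_SCOPE = "https://www.googleapis.com/auth/drive.readonly"


def build_drive_service(credentials_path):
    """Authenticate with a service account key and return a Drive v3 client."""
    # Imported lazily so the helpers below can be exercised without these packages.
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    creds = service_account.Credentials.from_service_account_file(
        str(credentials_path), scopes=[READONLY_SCOPE]
    )
    return build("drive", "v3", credentials=creds)


def list_folder_files(service, folder_id):
    """Return id, name, and mimeType for every non-trashed file in a folder."""
    query = f"'{folder_id}' in parents and trashed = false"
    files, page_token = [], None
    while True:
        response = service.files().list(
            q=query,
            fields="nextPageToken, files(id, name, mimeType)",
            pageToken=page_token,
        ).execute()
        files.extend(response.get("files", []))
        page_token = response.get("nextPageToken")
        if not page_token:
            return files


def make_download_request(service, file_meta):
    """Export Google Sheets as CSV; fetch any other file as-is."""
    if file_meta["mimeType"] == SHEET_MIME:
        return service.files().export_media(fileId=file_meta["id"], mimeType="text/csv")
    return service.files().get_media(fileId=file_meta["id"])


def download_file(service, file_meta, out_dir):
    """Download one file into out_dir and return the path that was written."""
    from googleapiclient.http import MediaIoBaseDownload

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    name = file_meta["name"]
    target = out_dir / (name if name.lower().endswith(".csv") else f"{name}.csv")

    buffer = io.BytesIO()
    downloader = MediaIoBaseDownload(buffer, make_download_request(service, file_meta))
    done = False
    while not done:
        _status, done = downloader.next_chunk()
    target.write_bytes(buffer.getvalue())
    return target
```

A caller would build the service once, list each configured folder, and pass every returned entry to `download_file` with `data/raw/` as the output directory.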

---

#### **Common Usage Example**

```bash
python -m scripts.extract.scrape_drive_links
```

This will:

* Authenticate using your Google credentials
* Download all target survey files (for each year/cohort) to your local `data/raw/` directory
* Ensure that your pipeline always starts with the latest official data

---

#### **Possible Extraction Script Features**

* **File type filtering** (only download `.csv` or `.gsheet`)
* **Date filtering** (only new/modified files)
* **Automatic conversion** of Google Sheets to CSV
* **Logging of all downloaded file names, IDs, and timestamps**
* **Parallel downloads** for large batches

---

#### **Error Handling**

* **Missing credentials**: Script halts and prints a clear message
* **API quota exceeded**: Script sleeps and retries (or halts with a warning)
* **No files found**: Script exits and prints which query/folder was empty
* **Partial download**: Script can be resumed/restarted safely
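
The "sleeps and retries" behavior for quota errors could look like the sketch below. It is a generic exponential backoff helper, not the project's actual implementation; `call_with_retry` and its parameters are hypothetical. With `google-api-python-client`, the retryable exception would typically be `googleapiclient.errors.HttpError`, filtered on a rate-limit status code before retrying.

```python
import time


def call_with_retry(func, retryable=(Exception,), attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on retryable errors with exponentially growing delays."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retryable as exc:
            if attempt == attempts:
                # Out of attempts: let the caller halt with the original error.
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.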

---

#### **Security**

* **Never commit credentials** (JSON) to git or public repos
* Restrict service account to “read-only” where possible

---

### **Pro Tips**

* Use **Google Groups** to give bulk file access to your service account.
* For one-off manual downloads, use Google Drive web UI, but for reproducibility, always prefer API automation.
* Keep the path to your service account credentials and your folder IDs in `.env`, or pass them as command-line args for flexibility.

---

**2. Transform**

* Standardize headers, clean rows, handle missing or malformed data.
* Translate Spanish CSVs to English.
* Audit all unique questions/options; map variations to canonical/overarching questions.
* Generate summary tables (Likert, frequency, Yes/No, etc.) per year and group.
* Consolidate all responses and question summaries across years and groups.
* Summarize totals for cross-year and cross-group analysis.
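
The audit-and-map step can be illustrated with a small standard-library sketch. The mapping entries and the function names here are hypothetical examples, not the project's real question set: each question header is normalized, looked up in a variant-to-canonical dictionary, and anything unmapped is reported for review.

```python
from collections import Counter

# Hypothetical example entries: normalized variant wording -> canonical question key.
CANONICAL_QUESTIONS = {
    "do you feel safe at school?": "feels_safe_at_school",
    "i feel safe at my school": "feels_safe_at_school",
    "how often do you attend the program?": "program_attendance",
}


def normalize_question(text):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def audit_questions(headers, mapping=CANONICAL_QUESTIONS):
    """Return (counts per canonical question, sorted list of unmapped headers)."""
    mapped, unmapped = Counter(), set()
    for header in headers:
        key = normalize_question(header)
        if key in mapping:
            mapped[mapping[key]] += 1
        else:
            unmapped.add(header)
    return mapped, sorted(unmapped)
```

Reviewing the unmapped list after each new survey year is one way to keep the canonical mapping complete before summaries are generated.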

**3. Load**

* Export all final and intermediate outputs to Excel, with tabs for every summary and year/group.
* Provide “master summary” and “full data” workbooks for stakeholders.

---

## **Requirements**

* **Python 3.8+**
* **Required Python packages:**

  * `pandas`
  * `xlsxwriter`
  * (optionally: `openpyxl`, `numpy`, `regex`; `unicodedata` is part of the Python standard library and needs no install)
* **bash** (for pipeline orchestration)

Install requirements (example):

```bash
pip install pandas xlsxwriter openpyxl
```

---

## **How to Run the Pipeline**

### **Installation and Requirements**

1. **Clone this repository:**

```bash
git clone
cd
```

2. **Set up your Python virtual environment** (recommended):

```bash
python -m venv .venv
source .venv/bin/activate # (On Windows: .venv\Scripts\activate)
```

3. **Install all Python dependencies:**

```bash
pip install -r requirements.txt
```

4. **Run the pipeline with:**

```bash
./run_etl.sh
```

*(Ensure it’s executable: `chmod +x run_etl.sh`)*

**This will:**

* Scrape and standardize raw data
* Translate (if needed)
* Audit, map, and consolidate all survey questions/responses
* Output results as `SF_Master_Summary.xlsx` and `SF_Master_Data.xlsx` in the `data/processed/` directory

---

### **Using the Makefile for Your ETL Pipeline**

This project includes a **Makefile** for easy automation and orchestration of the ETL workflow.

#### **Common Makefile Commands**

* **Run the full ETL pipeline (skipping Google Drive scraping and confirming before the final Excel output):**

```bash
make pipeline
```

* **Run only the extract step (data loading, cleaning, translation, audit):**

```bash
make extract
```

* **Run only the transform step (mapping, summary tables, consolidation):**

```bash
make transform
```

* **Manually run the scraping step, if you need to update the survey links:**

```bash
make scrape
```

* **Run only the final Excel export step:**

```bash
make load
```

* **Clean processed data outputs:**

```bash
make clean
```

* **Run all unit tests:**

```bash
make test
```

---

#### **How it works**

* The Makefile allows you to run any ETL stage individually or in sequence.
* The default `pipeline` target skips the Google Drive scraping step and asks for confirmation before creating the final Excel output, so you won’t overwrite results by accident.
* You can always run the original `run_etl.sh` script for fully automated execution.

---

#### **Pro tip:**

**Always run `make` commands from the project root directory.**
The Makefile expects all scripts, data, and outputs to use the standard project structure.

---

## **Logging**

All steps of the ETL pipeline are **fully logged** for transparency, debugging, and auditability.
Logs are stored in the `logs/` directory, with separate subfolders for each ETL phase:

```
logs/
├── extract/
│ ├── extract.log # All extraction progress and info
│ └── extract_error.log # Extraction warnings and errors only
├── transform/
│ ├── transform.log # Data cleaning, mapping, consolidation steps
│ └── transform_error.log # Warnings and errors during transformation
└── load/
  ├── load.log # Final Excel/report generation steps
  └── load_error.log # Any errors or warnings during loading/output
```

* **Every script in the pipeline logs to its respective ETL phase.**
* **Info, warning, and error messages** are captured, including which files were processed, skipped, or had issues.
* **Error logs** make it easy to spot and debug failed data loads, missing columns, or API problems.
* Logs rotate automatically to prevent disk overflows.
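
A per-phase logger matching the layout above could be set up with the standard library as sketched here. The function name, size limit, and backup count are assumptions for illustration; the project's own logging helper may differ.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def get_phase_logger(phase, log_root="logs", max_bytes=1_000_000, backups=3):
    """Return a logger writing <phase>.log (INFO and up) and <phase>_error.log (WARNING and up)."""
    log_dir = Path(log_root) / phase
    log_dir.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(f"etl.{phase}")
    if logger.handlers:  # already configured in this process
        return logger

    logger.setLevel(logging.INFO)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    targets = ((f"{phase}.log", logging.INFO), (f"{phase}_error.log", logging.WARNING))
    for filename, level in targets:
        # Rotation caps each file at max_bytes and keeps a fixed number of backups.
        handler = RotatingFileHandler(
            log_dir / filename, maxBytes=max_bytes, backupCount=backups, encoding="utf-8"
        )
        handler.setLevel(level)
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

A script in the extract phase would call `get_phase_logger("extract")` once and then use `logger.info(...)` and `logger.warning(...)` as usual.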

---

## 🚦 Continuous Integration & Quality Assurance

This project uses **GitHub Actions** for full CI/CD, automated linting, and security checks:

- **On every push or pull request:**
  - All code is linted with [Ruff](https://github.com/astral-sh/ruff) and [Flake8](https://flake8.pycqa.org/).
  - Security is checked with [Bandit](https://bandit.readthedocs.io/) and [pip-audit](https://pypi.org/project/pip-audit/).
  - All tests are run with [pytest](https://pytest.org/), including code coverage reports.
  - Logs and coverage reports are uploaded as build artifacts on failure.

You can view build status, logs, and artifacts on the [Actions tab](https://github.com/dylanpicart/google_sheet_survey_pipeline/actions).

---

## 🧹 Linting & Formatting

**Code is autoformatted and style-enforced with:**
- [Black](https://black.readthedocs.io/en/stable/) (for code style, line length, and blank lines)
- [Ruff](https://github.com/astral-sh/ruff) (for code hygiene and unused code removal)
- [Flake8](https://flake8.pycqa.org/) (for PEP8 and custom style rules)

**To lint and auto-fix your code locally:**
```bash
black . --line-length 120
ruff check scripts/ tests/ utils/ --fix
flake8 scripts/ tests/ utils/
bandit -r scripts/ utils/
```
---

## **Possible Approaches to Data Cleaning**

* **Column normalization:** Consistently rename and strip all headers (lowercase, no spaces, uniform naming).
* **String cleaning:** Remove leading/trailing whitespace, standardize punctuation/quotes, handle Unicode artifacts.
* **Missing value handling:** Remove or fill empty rows/cells, standardize “N/A”, handle outliers.
* **Canonical mapping:** Use `QCON_MAP` and audit mapping to ensure all question variations are consolidated.
* **Translation:** Ensure all responses/options are in English for cross-year consistency.
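
The first three approaches (column normalization, string cleaning, and missing value handling) can be sketched with the standard library alone. The function names and the set of missing-value tokens are assumptions for illustration; the pipeline itself may implement these steps with `pandas`.

```python
import re
import unicodedata

# Assumed spellings of "no answer"; extend to match the real survey exports.
MISSING_TOKENS = {"", "n/a", "na", "none", "null", "-"}

# Curly quotes mapped to their plain ASCII equivalents.
QUOTE_TABLE = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize_header(header):
    """Lowercase a header, strip accents and punctuation, and join words with underscores."""
    text = unicodedata.normalize("NFKD", header)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", "_", text.lower())
    return text.strip("_")


def clean_cell(value):
    """Trim whitespace, straighten curly quotes, and map missing-value tokens to None."""
    if value is None:
        return None
    text = unicodedata.normalize("NFKC", str(value)).translate(QUOTE_TABLE)
    text = " ".join(text.split())
    return None if text.lower() in MISSING_TOKENS else text
```

For example, `normalize_header(" First Name ")` yields `first_name`, and `clean_cell("  N/A ")` yields `None`.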

---

## **Next Steps (Pivot & Advanced Analysis)**

* **Multi-select/multi-pivot:** Use canonical mappings to group, count, and analyze all subquestions/options as part of overarching categories.
* **Percentages and rates:** Compute %Yes/No/Maybe per group/year/option.
* **Longitudinal tracking:** Analyze trends in responses across years and cohorts.
* **Visualization:** Use Excel or Python/Streamlit dashboards for charts, pivots, and interactive summaries.
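
As one illustration of the percentages step, the sketch below computes the share of each answer per year and group. The row keys `year`, `group`, and `answer` and the function name are hypothetical; in practice the same result could come from a `pandas` group-by.

```python
from collections import Counter, defaultdict


def response_rates(rows, group_keys=("year", "group"), answer_key="answer"):
    """Return {group: {answer: percent}} with percentages rounded to one decimal."""
    counts = defaultdict(Counter)
    for row in rows:
        group = tuple(row[key] for key in group_keys)
        counts[group][row[answer_key]] += 1

    rates = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        rates[group] = {answer: round(100 * n / total, 1) for answer, n in counter.items()}
    return rates
```

Each percentage is relative to the number of responses in its own year and group, so groups of different sizes stay comparable.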

---

## **Error Handling & Quality Assurance**

* **Each ETL step is checked:** Pipeline halts if a key script or file is missing or if any step fails.
* **Inter-step file checks:** Verifies output files from each phase before moving to the next.
* **Logging:** All errors and successes are printed to the console and recorded in the per-phase log files under `logs/`.
* **Custom checks:** Can be extended to email, Slack, or alert on failure for production.
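
An inter-step file check of the kind listed above could be as simple as the following sketch. The function names are hypothetical, and the check shown (file exists and is not empty) is an assumed minimum; the real pipeline may validate more.

```python
from pathlib import Path


def missing_outputs(paths):
    """Return the expected files that are absent or empty."""
    return [str(p) for p in map(Path, paths) if not p.is_file() or p.stat().st_size == 0]


def require_outputs(phase, paths):
    """Stop the pipeline with a non-zero exit if a phase left outputs missing."""
    missing = missing_outputs(paths)
    if missing:
        raise SystemExit(f"{phase} step incomplete; missing or empty: {', '.join(missing)}")
```

Calling `require_outputs("extract", [...])` before the transform step starts makes a failed download stop the run immediately instead of producing partial summaries.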

**How to debug:**

* Inspect the output/error message to see where the process halted.
* Review intermediate outputs (CSV/Excel) to verify data at each stage.
* Re-run just the failed script for rapid iteration.

---
## License
This project is licensed under the MIT License. See the LICENSE file for details.

---

## **Author**

Developed by **Dylan Picart at Partnership With Children.**

Open an issue or submit a pull request!
For help running or customizing the pipeline, contact dylanpicart@mail.adelphi.edu.