https://github.com/windy-civi-pipelines/or-data-pipeline
Oregon legislative data pipeline
legislative-data openstates state-pipeline
- Host: GitHub
- URL: https://github.com/windy-civi-pipelines/or-data-pipeline
- Owner: windy-civi-pipelines
- Created: 2025-10-24T04:47:43.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2026-05-31T06:26:15.000Z (29 days ago)
- Last Synced: 2026-05-31T08:19:06.554Z (29 days ago)
- Topics: legislative-data, openstates, state-pipeline
- Size: 242 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Windy Civi Data Pipeline Template
A **GitHub Actions-powered pipeline** that scrapes, cleans, versions, and extracts text from state legislative data from **Open States**. This repository acts as a standardized template for all state-level pipelines within the Windy Civi ecosystem.
---
## What This Pipeline Does
Each state pipeline provides a self-contained automation workflow to:
1. **Scrape** data for a single U.S. state from the [OpenStates](https://github.com/openstates/openstates-scrapers) project
2. **Sanitize** the data by removing ephemeral fields (`_id`, `scraped_at`) for deterministic output
3. **Format** it into a blockchain-style, versioned structure with incremental processing
4. **Link** events to bills and sessions automatically
5. **Monitor** data quality by tracking orphaned bills
6. **Extract** full text from bills, amendments, and supporting documents (PDF, XML, HTML)
7. **Commit** the formatted output and extracted text nightly (or manually) with auto-save
This approach keeps every state repository consistent, auditable, and easy to maintain.
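
The sanitize step can be pictured roughly as follows. This is a minimal sketch, not the toolkit's actual code: the function names are hypothetical, and it assumes the ephemeral fields should be dropped at every nesting level and that sorted keys are an acceptable way to get byte-identical output.

```python
import json
from typing import Any

# Ephemeral keys named in this README; the real pipeline may strip more.
EPHEMERAL_KEYS = {"_id", "scraped_at"}


def sanitize(value: Any) -> Any:
    """Recursively drop ephemeral keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {k: sanitize(v) for k, v in value.items() if k not in EPHEMERAL_KEYS}
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value


def to_deterministic_json(record: dict) -> str:
    """Serialize with sorted keys so identical data yields identical text."""
    return json.dumps(sanitize(record), sort_keys=True, indent=2)


if __name__ == "__main__":
    raw = {
        "_id": "abc123",
        "scraped_at": "2025-01-15T14:30:00Z",
        "identifier": "HB 1234",
        "actions": [{"_id": "x1", "description": "Introduced in House"}],
    }
    print(to_deterministic_json(raw))
```

Because the scrape-time fields are gone, two scrapes of unchanged data serialize to the same text, so Git only records a diff when the legislative data itself changes.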
---
## Key Features
- **Incremental Processing** - Only processes new or updated bills (no duplicate work!)
- **Auto-Save Failsafe** - Commits progress every 30 minutes during text extraction
- **Data Quality Monitoring** - Tracks orphaned bills (votes/events without bill data)
- **Bill-Event Linking** - Automatically connects committee hearings and events to bills
- **Timestamp Tracking** - Two-level timestamps for logs and text extraction
- **Multi-Format Text Extraction** - XML → HTML → PDF with fallbacks
- **Concurrent Job Support** - Multiple runs can safely update the same repository
- **Detailed Error Logging** - Categorized errors for easy debugging
---
## Setup Instructions
1. **Click the green "Use this template" button** on this repository page to create a new repository from this template.
2. **Name your new repository** using the convention `<state-abbreviation>-data-pipeline` (e.g., `or-data-pipeline`, `il-data-pipeline`, `tx-data-pipeline`).
3. **Update the state abbreviation** in both workflow files:
**In `.github/workflows/scrape-and-format-data.yml`:**
```yaml
# Excerpt: only the keys relevant to the state code are shown
env:
  STATE_CODE: or # CHANGE THIS to your state abbreviation

jobs:
  scrape:
    steps:
      - name: Scrape data
        uses: windy-civi/toolkit/actions/scrape@main
        with:
          state: ${{ env.STATE_CODE }}
  format:
    steps:
      - name: Format data
        uses: windy-civi/toolkit/actions/format@main
        with:
          state: ${{ env.STATE_CODE }}
```
**In `.github/workflows/extract-text.yml`:**
```yaml
- name: Extract text
  uses: windy-civi/toolkit/actions/extract@main
  with:
    state: or # CHANGE THIS to your state abbreviation
```
Make sure the state abbreviation matches the folder name used in [Open States scrapers](https://github.com/openstates/openstates-scrapers/tree/main/scrapers).
4. **Enable GitHub Actions** in your repo (if not already enabled).
5. (Optional) Enable nightly runs by ensuring the schedule blocks are uncommented in both workflow files:
```yaml
on:
  workflow_dispatch:
  schedule:
    - cron: "0 1 * * *" # For scrape-and-format-data.yml
    # or
    - cron: "0 3 * * *" # For extract-text.yml (runs later to avoid overlap)
```
---
## Workflow Schedule
The pipeline runs in two stages:
### **Stage 1: Scrape & Format** (1am UTC)
Two separate jobs that run sequentially:
1. **Scrape Job** - Downloads legislative data using OpenStates scrapers
2. **Format Job** - Processes scraped data, links events, and monitors quality
### **Stage 2: Text Extraction** (3am UTC)
Independent workflow that extracts full bill text from documents.
This separation allows:
- Faster metadata updates
- Independent monitoring and debugging
- Text extraction can time out and restart without affecting scraping
- Better resource management (text extraction can take hours)
---
## Folder Structure
```
or-data-pipeline/
├── .github/workflows/
│   ├── scrape-and-format-data.yml   # Metadata scraping + formatting
│   └── extract-text.yml             # Text extraction (independent)
├── country:us/
│   └── state:xx/                    # state:usa for federal, state:il for Illinois, etc.
│       └── sessions/
│           └── {session_id}/
│               ├── bills/
│               │   └── {bill_id}/
│               │       ├── metadata.json          # Bill data + _processing timestamps
│               │       ├── files/                 # Extracted text & documents
│               │       │   ├── *.pdf              # Original PDFs
│               │       │   ├── *.xml              # Original XMLs
│               │       │   └── *_extracted.txt    # Extracted text
│               │       └── logs/                  # Action/event/vote logs
│               └── events/                        # Committee hearings
│                   └── {timestamp}_hearing.json
├── .windycivi/                      # Pipeline metadata (committed)
│   ├── errors/                      # Processing errors
│   │   ├── text_extraction_errors/  # Text extraction failures
│   │   │   ├── download_failures/   # Failed downloads
│   │   │   ├── parsing_errors/      # Failed text parsing
│   │   │   └── missing_files/       # Missing source files
│   │   ├── missing_session/         # Bills without session info
│   │   ├── event_archive/           # Archived event data
│   │   └── orphaned_placeholders_tracking.json   # Data quality monitoring
│   ├── bill_session_mapping.json    # Bill-to-session mappings (flattened)
│   ├── sessions.json                # Session metadata (flattened)
│   └── latest_timestamp_seen.txt    # Last processed timestamp
├── Pipfile, Pipfile.lock
└── README.md
```
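
For scripts that read the output, the layout above can be turned into a small path helper. This is an illustrative sketch only: `bill_dir` and `metadata_path` are hypothetical names, not toolkit functions, and the lower-casing of the state code is an assumption based on the examples in this README.

```python
from pathlib import PurePosixPath


def bill_dir(state: str, session_id: str, bill_id: str) -> PurePosixPath:
    """Build the repository-relative directory for one bill."""
    return (
        PurePosixPath("country:us")
        / f"state:{state.lower()}"  # assumption: state codes are stored lower-case
        / "sessions"
        / session_id
        / "bills"
        / bill_id
    )


def metadata_path(state: str, session_id: str, bill_id: str) -> PurePosixPath:
    """Location of the bill's metadata.json inside its directory."""
    return bill_dir(state, session_id, bill_id) / "metadata.json"


if __name__ == "__main__":
    print(metadata_path("IL", "103", "HB999"))
```

`PurePosixPath` is used so the helper yields forward-slash paths on any operating system, matching how the paths appear in the repository.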
---
## Output Format
### Metadata Output (`country:us/state:*/`)
Formatted metadata is saved to `country:us/state:xx/sessions/`, organized by session and bill.
Each bill directory contains:
- `metadata.json`: structured information about the bill **with `_processing` timestamps**
- `logs/`: action, event, and vote logs
- `files/`: original documents and extracted text
**Example `metadata.json` structure:**
```json
{
  "identifier": "HB 1234",
  "title": "Example Bill",
  "_processing": {
    "logs_latest_update": "2025-01-15T14:30:00Z",
    "text_extraction_latest_update": "2025-01-16T08:00:00Z"
  },
  "actions": [
    {
      "description": "Introduced in House",
      "date": "2025-01-01",
      "_processing": {
        "log_file_created": "2025-01-01T12:00:00Z"
      }
    }
  ]
}
```
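
One way a consumer might use the two bill-level timestamps is to decide whether a bill's text is stale. The rule below is an assumption made for illustration (re-extract when the logs are newer than the last extraction, or when no extraction has happened); the toolkit's actual incremental logic may differ, and `needs_text_extraction` is a hypothetical name.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def needs_text_extraction(metadata: dict) -> bool:
    """Return True when text was never extracted or the logs changed since."""
    processing = metadata.get("_processing", {})
    extracted = processing.get("text_extraction_latest_update")
    if not extracted:
        return True  # no extraction recorded yet
    logs = processing.get("logs_latest_update")
    if not logs:
        return False  # nothing to compare against
    return parse_ts(logs) > parse_ts(extracted)


if __name__ == "__main__":
    example = {
        "_processing": {
            "logs_latest_update": "2025-01-15T14:30:00Z",
            "text_extraction_latest_update": "2025-01-16T08:00:00Z",
        }
    }
    print(needs_text_extraction(example))
```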
### Text Extraction Output (`files/`)
When text extraction is enabled, each bill directory also includes:
- `files/`: original documents and extracted text
  - `*.pdf`: Original PDF documents
  - `*.xml`: Original XML bill text
  - `*.html`: Original HTML documents
  - `*_extracted.txt`: Plain text extracted from documents
### Error Output (`.windycivi/errors/`)
Failed items are logged separately:
- `.windycivi/errors/text_extraction_errors/download_failures/`: Documents that couldn't be downloaded
- `.windycivi/errors/text_extraction_errors/parsing_errors/`: Documents that couldn't be parsed
- `.windycivi/errors/text_extraction_errors/missing_files/`: Bills missing source files
- `.windycivi/errors/missing_session/`: Bills without session information
### Data Quality Monitoring (`orphaned_placeholders_tracking.json`)
The pipeline automatically tracks **orphaned bills** - bills that have vote events or hearings but no actual bill data. Check this file periodically to identify data quality issues:
```json
{
  "HB999": {
    "first_seen": "2025-01-21T12:00:00Z",
    "last_seen": "2025-01-23T14:30:00Z",
    "occurrence_count": 3,
    "session": "103",
    "vote_count": 2,
    "event_count": 0,
    "path": "country:us/state:il/sessions/103/bills/HB999"
  }
}
```
**What to look for:**
- Bills with high `occurrence_count` (3+) are **chronic orphans** - likely data quality issues
- Check for typos in bill identifiers or scraper configuration
- Orphans automatically resolve when the bill data arrives!

See the [orphan tracking documentation](https://github.com/windy-civi/toolkit/blob/main/docs/orphan_tracking.md) for more details.
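
A periodic check for chronic orphans could look like the sketch below. It assumes the tracking file has the shape shown in the example above (bill identifier mapped to an object with `occurrence_count`); the helper names and the default threshold constant are illustrative, not part of the toolkit.

```python
import json
from pathlib import Path

# Threshold taken from the "3+" guidance above.
CHRONIC_THRESHOLD = 3


def load_tracking(path: str) -> dict:
    """Read the tracking file, returning an empty dict if it does not exist."""
    file = Path(path)
    if not file.exists():
        return {}
    return json.loads(file.read_text(encoding="utf-8"))


def chronic_orphans(tracking: dict, threshold: int = CHRONIC_THRESHOLD) -> list:
    """Return (bill_id, occurrence_count) pairs at or above the threshold, worst first."""
    hits = [
        (bill_id, info.get("occurrence_count", 0))
        for bill_id, info in tracking.items()
        if info.get("occurrence_count", 0) >= threshold
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    data = load_tracking(".windycivi/errors/orphaned_placeholders_tracking.json")
    for bill_id, count in chronic_orphans(data):
        print(f"{bill_id}: seen {count} times without bill data")
```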
---
## Logging & Error Handling
Each run includes detailed logs to track progress and capture failures:
### Scraping & Formatting Logs
- Logs are saved per bill under `logs/`
- Processing summary shows total bills, events, and votes processed
- Session mapping tracks bill-to-session relationships
- **Orphan tracking** shows new, existing, and resolved orphans
### Text Extraction Logs
- Download attempts with success/failure status
- Extraction method used (XML, HTML, PDF)
- Error details saved to `text_extraction_errors/`
- **Auto-save commits** every 30 minutes prevent data loss
- Summary reports include:
  - Total documents processed
  - Successful extractions by type
  - Skipped (already extracted) documents
  - Failed downloads/extractions with reasons

Pipelines are fault-tolerant: if a bill fails, the workflow continues for all others.
---
## Supported Document Types
The text extraction workflow supports:
| Type | Format | Extraction Method | Notes |
| -------------- | -------- | ------------------- | ------------------------------ |
| **Bills** | XML | Direct XML parsing | Primary bill text |
| **Bills** | PDF | pdfplumber + PyPDF2 | With strikethrough detection |
| **Bills** | HTML | BeautifulSoup | Fallback for HTML-only sources |
| **Amendments** | PDF | pdfplumber + PyPDF2 | State amendments only |
| **Documents** | PDF/HTML | Auto-detect | CBO reports, committee reports |
**Note**: Federal `congress.gov` HTML amendments are currently skipped due to blocking issues. XML bill versions from `govinfo.gov` work perfectly.
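
The XML, then HTML, then PDF fallback order can be sketched as a chain of extractors tried in sequence. This is a simplified illustration using only the standard library, not the toolkit's implementation: every function name here is hypothetical, and the PDF extractor is deliberately left out (one built on the PDF libraries named in the table could be registered under the `"pdf"` key).

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from typing import Callable, Dict, Optional, Tuple

# Preference order described in this README.
FORMAT_ORDER = ("xml", "html", "pdf")


class _TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())


def extract_xml(payload: bytes) -> str:
    """Join all text nodes of an XML document; raises on malformed XML."""
    root = ET.fromstring(payload)
    return " ".join(t.strip() for t in root.itertext() if t.strip())


def extract_html(payload: bytes) -> str:
    """Join all text nodes of an HTML document."""
    collector = _TextCollector()
    collector.feed(payload.decode("utf-8", errors="replace"))
    collector.close()
    return " ".join(collector.parts)


def extract_with_fallback(
    sources: Dict[str, bytes],
    extractors: Dict[str, Callable[[bytes], str]],
) -> Optional[Tuple[str, str]]:
    """Try each available format in order; return (format, text) or None."""
    for fmt in FORMAT_ORDER:
        payload = sources.get(fmt)
        extractor = extractors.get(fmt)
        if payload is None or extractor is None:
            continue
        try:
            text = extractor(payload)
        except Exception:
            continue  # fall through to the next format
        if text:
            return fmt, text
    return None


if __name__ == "__main__":
    extractors = {"xml": extract_xml, "html": extract_html}
    sources = {"xml": b"<bill><broken", "html": b"<p>Section 1.</p>"}
    print(extract_with_fallback(sources, extractors))
```

In the usage example the XML payload is malformed, so the chain falls through to the HTML version instead of failing the whole bill.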
---
## Workflow Configuration Options
### Scrape Action Inputs
```yaml
uses: windy-civi/toolkit/actions/scrape@main
with:
  state: or # State abbreviation (required)
  github-token: ${{ secrets.GITHUB_TOKEN }}
  use-scrape-cache: "false" # Set to "true" to skip scraping and use cached data
```
### Format Action Inputs
```yaml
uses: windy-civi/toolkit/actions/format@main
with:
  state: or # State abbreviation (required)
  github-token: ${{ secrets.GITHUB_TOKEN }}
```
### Text Extraction Action Inputs
```yaml
uses: windy-civi/toolkit/actions/extract@main
with:
  state: or # State abbreviation (required)
  github-token: ${{ secrets.GITHUB_TOKEN }}
```
---
## Optional: Enabling Raw Scraped Data Storage
By default, raw scraped data (`_data/`) is not stored to keep the repository lightweight.
### To Enable `_data` Saving:
Uncomment the copy and commit steps in your workflow file:
```yaml
- name: Copy Scraped Data to Repo
  run: |
    mkdir -p "$GITHUB_WORKSPACE/_data/$STATE"
    cp -r "${RUNNER_TEMP}/_working/_data/$STATE"/* "$GITHUB_WORKSPACE/_data/$STATE/"
```
And include `_data` in the commit:
```bash
git add _data country:us/ .windycivi/
```
### To Disable `_data` Saving (Default):
Comment out the copy step and exclude `_data` from the commit command:
```bash
git add country:us/ .windycivi/
```
---
## Running the Pipeline
### Automatic (Scheduled)
Once enabled, workflows run automatically:
- **Scrape & Format**: 1am UTC daily
- **Text Extraction**: 3am UTC daily (runs independently)
### Manual Trigger
1. Go to **Actions** tab in GitHub
2. Select the workflow (Scrape & Format or Extract Text)
3. Click **Run workflow**
4. Choose the branch and click **Run**
### Testing Locally
```bash
# Clone the repository
git clone https://github.com/YOUR-ORG/or-data-pipeline
cd or-data-pipeline

# Install dependencies
pipenv install

# Run scraping and formatting
pipenv run python scrape_and_format/main.py \
  --state il \
  --openstates-data-folder /path/to/scraped/data \
  --git-repo-folder /path/to/output

# Run text extraction (with incremental flag)
pipenv run python text_extraction/main.py \
  --state il \
  --data-folder /path/to/output \
  --output-folder /path/to/output \
  --incremental
```
---
## Known Issues
See the [known_problems/](https://github.com/windy-civi/toolkit/tree/main/known_problems) directory in the main repository for:
- State-specific scraper issues
- Formatter validation issues
- Text extraction limitations
- Status of all 56 jurisdictions
---
## Monitoring & Debugging
### Check Workflow Status
- GitHub Actions tab shows all runs
- Green checkmark = success
- Red X = failure (click for logs)
### Check Data Quality
1. Review `.windycivi/errors/orphaned_placeholders_tracking.json` for data issues
2. Look for chronic orphans (occurrence_count >= 3)
3. Check `.windycivi/errors/` for formatting/extraction errors
4. Monitor auto-save commits during text extraction runs
### Common Issues
**Scraping fails**:
- Check if OpenStates scraper for your state is working
- Verify state abbreviation matches OpenStates format
- Check for new legislative sessions not yet configured
**Text extraction fails or times out**:
- Check `.windycivi/errors/text_extraction_errors/` for details
- Look for auto-save commits (pipeline saves progress every 30 minutes)
- Re-run the workflow - it will resume from where it left off (incremental)
- Review error logs for specific bills
**Orphaned bills appear**:
- Check `orphaned_placeholders_tracking.json` for details
- Verify bill identifiers match between scraper and vote/event data
- Bills may auto-resolve on next scrape if it's a timing issue
**Push conflicts**:
- The pipeline auto-handles conflicts with `git pull --rebase`
- If manual resolution needed, check logs for specific conflicts
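
The push-then-rebase behaviour can be approximated with a short retry loop. This is a sketch under assumptions, not the workflow's actual script: the retry count is arbitrary, `push_with_rebase` and `run_git` are hypothetical names, and the Git runner is injectable so the loop can be exercised without a real repository.

```python
import subprocess
from typing import Callable, Sequence


def run_git(args: Sequence[str]) -> int:
    """Run a git subcommand in the current directory and return its exit code."""
    return subprocess.run(["git", *args], capture_output=True, text=True).returncode


def push_with_rebase(
    run: Callable[[Sequence[str]], int] = run_git,
    max_attempts: int = 3,  # assumption: the real pipeline may retry differently
) -> bool:
    """Try to push; on rejection, rebase onto the remote and try again."""
    for _ in range(max_attempts):
        if run(["push"]) == 0:
            return True
        # Push was rejected, most likely because another job pushed first.
        if run(["pull", "--rebase"]) != 0:
            return False  # rebase hit a conflict that needs manual resolution
    return False


# Usage inside a checked-out repository:
#   if not push_with_rebase():
#       raise SystemExit("push failed; check the workflow logs for conflicts")
```

Returning `False` on a failed rebase, rather than retrying, mirrors the guidance above: conflicts that Git cannot resolve automatically are left for a person to inspect.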
---
## Contributions & Support
This template is part of the [Windy Civi](https://github.com/windy-civi) project. If you're onboarding a new state or improving the automation, feel free to open an issue or PR.
**Main Repository**: https://github.com/windy-civi/toolkit
For discussions, join our community on Slack or GitHub Discussions.
---
## Next Steps After Setup
1. Verify both workflows are enabled
2. Test with manual trigger first (start with Scrape & Format)
3. Check output in `country:us/state:xx/sessions/`
4. Review `.windycivi/errors/orphaned_placeholders_tracking.json` for data quality
5. Check any errors in `.windycivi/errors/`
6. Test text extraction workflow independently
7. Enable scheduled runs once testing is successful
8. Monitor first few automated runs for issues
---
## Additional Documentation
- **[Incremental Processing Guide](https://github.com/windy-civi/toolkit/blob/main/docs/incremental_processing/)** - How incremental updates work
- **[Orphan Tracking Guide](https://github.com/windy-civi/toolkit/blob/main/docs/orphan_tracking.md)** - Understanding data quality monitoring
- **[Main Repository README](https://github.com/windy-civi/toolkit)** - Full technical documentation
---
**Part of the [Windy Civi](https://windycivi.com) ecosystem: building a transparent, verifiable civic data archive for all 50 states.**