https://github.com/emilyriederer/nc-votes-duckdb

Demonstrating a "minimum viable approach" to large-scale OOM data processing with DuckDB and Arrow using NC voterfile data for the 2022 midterms

Host: GitHub
URL: https://github.com/emilyriederer/nc-votes-duckdb
Owner: emilyriederer
Created: 2022-09-11T14:46:18.000Z (almost 3 years ago)
Default Branch: master
Last Pushed: 2022-09-28T10:04:30.000Z (over 2 years ago)
Last Synced: 2024-12-01T00:33:48.842Z (7 months ago)
Language: Python
Homepage:
Size: 33.2 KB
Stars: 11
Watchers: 1
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md

## duckdb + NC midterm data

This repository contains a proof-of-concept for lightweight data "infrastructure" to analyze NC midterm data out-of-memory with `duckdb` and `pyarrow`.

**This repo is not intended to illustrate best engineering practices but rather the power of duckdb as part of a "good enough" workflow.**

### Data sources

The ultimate data tables accessible via the `duckdb` database are:

- NC 2022 midterm early vote data from [NCSBE](https://www.ncsbe.gov/results-data)
- NC voter registration file from [NCSBE](https://www.ncsbe.gov/results-data)
- NC 10-year voter history file from [NCSBE](https://www.ncsbe.gov/results-data)
- Current Population Survey 2022 November voting supplement from [US Census Bureau](https://www.census.gov/data/datasets/time-series/demo/cps/cps-supp_cps-repwgt/cps-voting.html)
- County-level past election results from [MIT Election Lab via Harvard Dataverse](https://dataverse.harvard.edu/file.xhtml?fileId=6104822&version=10.0)

### Database tables

The data sources mentioned above are queryable via `nc.duckdb` in the respective tables:

- `early_vote` (~6K records as of 2022-09-24 and growing fast!)
- `register` (~8.6M records)
- `hist_gen` (~22M records) and `hist_oth` (~11M records). These two tables share a schema but are split for analytical reasons, since elections other than statewide general elections do not apply equally to all voters
- `cps_suppl`
- `county_results`

### Key scripts

Core scripts are in the `etl` subdirectory:

- `extract-*.py` scripts: Download (and unzip) different data sources from NCSBE and US Census Bureau and write results to `data/raw/`
- `transform-*.py` scripts: Convert the downloaded data sources from `csv` to `parquet` using `pyarrow`, applying light transformations along the way. Results are written to `data/`
- `load-db.py`: Loads nothing! It creates an `nc.duckdb` file whose views reference the external parquet files

## Running on Codespaces

1. Launch on Codespaces

2. Set-up environment:

```
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -r requirements.txt
```

3. Pull all raw data:

```
chmod +x etl/extract-all.sh
etl/extract-all.sh
```

4. Transform all raw data:

```
chmod +x etl/transform-all.sh
etl/transform-all.sh
```

5. Create duckdb database:

```
python etl/load-db.py
```

6. (Optional) Install duckdb CLI:

```
chmod +x get-duckdb-cli.sh
./get-duckdb-cli.sh
```

7. Run sample queries

7a. Run sample queries in CLI

Launch the CLI:

```
./duckdb nc.duckdb
.timer on
```

(Note: You can exit the DuckDB CLI with `Ctrl + D`)

Try out some sample queries. For example, we might wonder how many past general elections early voters have already voted in:

```
with voter_general as (
  select
    early_vote.ncid,
    -- count matched history rows so voters with no history get n = 0
    count(hist_gen.ncid) as n
  from early_vote
  left join hist_gen
    on early_vote.ncid = hist_gen.ncid
  group by 1
)
select n, count(1) as freq
from voter_general
group by 1
order by 1
;
```

This question becomes more interesting if we also join the registration data to learn how many prior general elections each voter was eligible to vote in:

```
with voter_general as (
  select
    early_vote.ncid,
    extract('year' from register.registr_dt) as register_year,
    -- count matched history rows so voters with no history get n = 0
    count(hist_gen.ncid) as n
  from early_vote
  left join hist_gen
    on early_vote.ncid = hist_gen.ncid
  left join register
    on early_vote.ncid = register.ncid
  group by 1, 2
)
select
  n,
  case
    when register_year < 2012 then 'Pre-2012'
    -- cast so both branches of the case are strings
    else cast(register_year as varchar)
  end as register_year,
  count(1) as freq
from voter_general
group by 1, 2
order by 1, 2
;
```

(Yes, of course *date* matters more than year here, etc. etc. This is purely to demonstrate `duckdb` not rigorous analysis!)

7b. Run sample queries in python

See the sample queries in the `test-query.py` file.

8. Run `free` in the terminal to marvel at what 8GB of RAM can do!
