# bit.io simple pipeline

A simple bit.io pipeline example using scripts and the UNIX cron scheduler.

## Scope

This repo provides a simple pipeline example for getting started with programmatic data ingestion and updates in bit.io. To keep the repo simple, many best practices, such as logging, configuration files, and a more robust orchestration/scheduling framework, are omitted.

## Setup

- Add a `.env` file at the root with your own bit.io Postgres connection string as `PG_CONN_STRING` (see the sketch after this list)
- Create the environment:
  - `python3 -m venv venv`
  - `source venv/bin/activate`
  - `python3 -m pip install --upgrade pip -r requirements.txt`
- Create a repo on bit.io; we named ours `simple_pipeline` for this demo
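
The scripts can read `PG_CONN_STRING` from that `.env` file with a loader such as `python-dotenv`. A minimal sketch, assuming `python-dotenv` is installed (the repo's scripts may load the variable differently):

```python
# Sketch: load PG_CONN_STRING from .env with python-dotenv
# (an assumption; the repo may use a different loading mechanism).
import os

from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment
PG_CONN_STRING = os.environ["PG_CONN_STRING"]
```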

## Contents

- `simple_pipeline/`
  - `main.py` # Command-line script for ETL jobs
  - `extract.py` # Handles extraction of data into a pandas DataFrame
  - `transform.py` # Transforms data using pandas
  - `load.py` # Loads data from pandas to bit.io
  - `sql_executor.py` # Runs arbitrary SQL scripts on bit.io
  - `ca_covid_data.sql` # Example SQL script for bit.io
  - `acs_5yr_population_data.csv` # Population data (updated annually)
- `README.md`
- `requirements.txt`
- `scheduled_run.sh` # Batches calls to the Python scripts into a simple pipeline
- `LICENSE`

## Usage

As a demo piece, this simple pipeline contains two main data processing scripts:
1. `simple_pipeline/main.py` extracts, transforms (optionally), and loads a CSV from a URL or local file into bit.io
2. `simple_pipeline/sql_executor.py` executes SQL scripts on bit.io, e.g. for creating joined, denormalized tables

In addition, a shell script `scheduled_run.sh` is included to show how the two scripts can be composed to form a simple pipeline. Utility programs like `cron` can then be used to run the shell script on a schedule for automated updates in bit.io. Here is an example `crontab` job that I created on my local system for this pipeline:

`45 09 * * * cd ~/Documents/simple_pipeline && ./scheduled_run.sh`

The `45 09 * * *` defines a schedule of once daily, at 9:45. You can learn more about cron syntax at [crontab.guru](https://crontab.guru/).
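
If it helps to see the batching spelled out, here is a rough Python equivalent of the two steps `scheduled_run.sh` chains together (illustrative; the actual shell script may differ). The argument values mirror the example commands in the sections below:

```python
# Rough Python equivalent of scheduled_run.sh's two-step batch (illustrative).
import subprocess

# Step 1: extract/transform/load a CSV into bit.io
subprocess.run(
    ["python", "simple_pipeline/main.py", "-name", "nyt_cases_counties",
     "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv",
     "bitdotio/simple_pipeline.cases_counties"],
    check=True,  # abort the pipeline if this step fails
)
# Step 2: build derived tables with a SQL script
subprocess.run(
    ["python", "simple_pipeline/sql_executor.py", "ca_covid_data.sql",
     "bitdotio", "simple_pipeline"],
    check=True,
)
```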

## Using simple_pipeline/main.py

This is a simple extract, transform, load (ETL) script. The main script `main.py` can be run from the command line as follows:

`python simple_pipeline/main.py <source> <destination>`

The script also takes a `-local_source` option that indicates the source is a local file path (default is a URL) and a `-name` option with an argument for a transformation function to run. Here is an example command for a URL source with a transformation function called "nyt_cases_counties":

`python main.py -name nyt_cases_counties https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv bitdotio/simple_pipeline.cases_counties`

Here is an example command that uses a local file as the source, with a transformation function called "acs_population_counties" (note the `-local_source` flag):

`python main.py -local_source -name acs_population_counties acs_5yr_population_data.csv bitdotio/simple_pipeline.population_counties`

The transformation functions are defined in `transform.py`. If you want to run these examples, make sure to update the destination with your own username in place of `bitdotio` and your own repo name if it is different from `simple_pipeline`.
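
To make the moving parts concrete, here is a compressed, end-to-end sketch of this kind of ETL flow using pandas and SQLAlchemy. Function and table names are illustrative assumptions, not the repo's actual code:

```python
# Illustrative ETL sketch (not the repo's actual modules).
import os

import pandas as pd
from sqlalchemy import create_engine

SOURCE = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"

def nyt_cases_counties(df: pd.DataFrame) -> pd.DataFrame:
    """Example transformation: keep a subset of the NYT columns."""
    return df[["date", "county", "state", "cases", "deaths"]]

df = pd.read_csv(SOURCE)                     # extract (URL or local path)
df = nyt_cases_counties(df)                  # transform (optional step)
engine = create_engine(os.environ["PG_CONN_STRING"])
df.to_sql("cases_counties", engine, if_exists="replace", index=False)  # load
```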

## Using simple_pipeline/sql_executor.py

Once data has been extracted, transformed, and loaded, we sometimes want to create derived tables within the database. This script takes a path to a SQL script to run on bit.io, followed by a bit.io username and repo name. For example, to create the derived California COVID data table, the script is called as follows:

`python sql_executor.py ca_covid_data.sql bitdotio simple_pipeline`
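
Under the hood, a script like this boils down to reading the file and executing its contents against bit.io. A minimal sketch, assuming `psycopg2` is available (the repo's actual script and argument handling may differ):

```python
# Sketch of a SQL executor: read a .sql file and run it on bit.io
# (illustrative; not the repo's actual sql_executor.py).
import os
import sys

import psycopg2

def execute_sql_file(path: str) -> None:
    with open(path) as f:
        sql = f.read()
    conn = psycopg2.connect(os.environ["PG_CONN_STRING"])
    with conn, conn.cursor() as cur:  # commits on success, rolls back on error
        cur.execute(sql)
    conn.close()

if __name__ == "__main__":
    execute_sql_file(sys.argv[1])
```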