{"id":18174350,"url":"https://github.com/databricks/simple-pipeline","last_synced_at":"2025-04-01T15:31:17.175Z","repository":{"id":42417390,"uuid":"394431685","full_name":"databricks/simple-pipeline","owner":"databricks","description":"Example pipeline for bit.io","archived":true,"fork":false,"pushed_at":"2023-05-23T02:10:32.000Z","size":2300,"stargazers_count":10,"open_issues_count":1,"forks_count":10,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-25T07:13:01.852Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databricks.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-08-09T20:37:07.000Z","updated_at":"2024-10-17T18:43:49.000Z","dependencies_parsed_at":"2024-05-28T01:36:58.124Z","dependency_job_id":"b1815739-d46d-4424-b998-132006fb2232","html_url":"https://github.com/databricks/simple-pipeline","commit_stats":null,"previous_names":["databricks/simple-pipeline"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fsimple-pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fsimple-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fsimple-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fsimple-pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owne
rs/databricks","download_url":"https://codeload.github.com/databricks/simple-pipeline/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246662325,"owners_count":20813728,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-02T16:03:00.395Z","updated_at":"2025-04-01T15:31:17.169Z","avatar_url":"https://github.com/databricks.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# bit.io simple pipeline\n\nA simple bit.io pipeline example using scripts and the UNIX cron scheduler.\n\n## Scope\n\nThis repo is intended to provide a simple pipeline example for getting started with programmatic data ingestion and updates in bit.io. 
To keep the repo simple, many best practices such as logging, configuration files, and a more robust orchestration/scheduling framework are omitted.\n\n## Setup\n\n- Add a `.env` file at the root with your own bit.io Postgres connection string as `PG_CONN_STRING`\n- Create environment\n    - `python3 -m venv venv`\u003cbr\u003e\n    - `source venv/bin/activate`\u003cbr\u003e\n    - `python3 -m pip install --upgrade pip -r requirements.txt`\u003cbr\u003e\n- Create a repo on bit.io (we named ours `simple_pipeline` for this demo)\n\n## Contents\n\n- simple_pipeline\n    - main.py # command line script for ETL jobs\n    - extract.py # Handles extraction of data into a pandas DataFrame\n    - transform.py # Transforms data using pandas\n    - load.py # Loads data from pandas to bit.io\n    - sql_executor.py # Runs arbitrary SQL scripts on bit.io\n    - ca_covid_data.sql # Example SQL script for bit.io\n    - acs_5yr_population_data.csv # Population data, this changes annually\n- README.md\n- requirements.txt\n- scheduled_run.sh # This shows how to batch calls to the python scripts together for a simple pipeline\n- LICENSE\n\n## Usage\n\nAs a demo piece, this simple pipeline contains two main data processing scripts:\n1. `simple_pipeline/main.py` extracts, transforms (optional), and loads a CSV from a URL or local file into bit.io\n2. `simple_pipeline/sql_executor.py` executes SQL scripts on bit.io, such as for creating joined, de-normalized tables\n\nIn addition, a shell script `scheduled_run.sh` is included to show how the two scripts can be composed to form a simple pipeline. Utility programs like `cron` can then be used to run the shell script on a schedule for automated updates in bit.io. Here is an example `crontab` job that I created on my local system for this pipeline:\n\n`45 09 * * * cd ~/Documents/simple_pipeline \u0026\u0026 ./scheduled_run.sh`\n\nThe `45 09 * * *` defines a schedule of once daily, at 9:45 AM. 
You can learn more about cron syntax at [crontab.guru](https://crontab.guru/).\n\n## Using simple_pipeline/main.py\n\nThis is a simple extract, transform, load script. The main script `main.py` can be run from the command line as follows:\n\n`python simple_pipeline/main.py \u003cSOURCE_URL_OR_FILE_PATH\u003e \u003cDESTINATION_FULLY_QUALIFIED_TABLE\u003e`\n\nThe script also takes a `-local_source` option that indicates the source is a local file path (default is a URL) and a `-name` option with an argument for a transformation function to run. Here is an example command for a URL source with a transformation function called \"nyt_cases_counties\":\n\n`python main.py -name nyt_cases_counties https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv bitdotio/simple_pipeline.cases_counties`\n\nHere is an example command that uses a local file with a transformation function called \"acs_population_counties\" (omit the `-name` option to skip the transformation step):\n\n`python main.py -local_source -name acs_population_counties acs_5yr_population_data.csv bitdotio/simple_pipeline.population_counties`\n\nThe transformation functions are defined in `transform.py`. If you want to run these examples, make sure to update the destination with your own username in place of `bitdotio` and your own repo name if it is different from `simple_pipeline`.\n\n## Using simple_pipeline/sql_executor.py\n\nOnce data has been extracted, transformed, and loaded, we sometimes want to create derived tables within the database. This script takes a path to a SQL script to run on bit.io as its first argument. 
For example, to create the derived California COVID data table, the script is called as follows:\n\n`python sql_executor.py ca_covid_data.sql bitdotio simple_pipeline`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks%2Fsimple-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabricks%2Fsimple-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks%2Fsimple-pipeline/lists"}