https://github.com/zsvoboda/ngods

New generation opensource data stack
https://github.com/zsvoboda/ngods

analytics data data-pipeline iceberg jdbc presto prestodb prestosql python scala spark spark-sql sparksql sql trino trinodb

Last synced: about 1 month ago
JSON representation

New generation opensource data stack

Host: GitHub
URL: https://github.com/zsvoboda/ngods
Owner: zsvoboda
License: bsd-3-clause
Created: 2022-05-20T10:30:48.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2022-05-20T15:50:41.000Z (about 3 years ago)
Last Synced: 2025-03-29T06:51:34.376Z (about 2 months ago)
Topics: analytics, data, data-pipeline, iceberg, jdbc, presto, prestodb, prestosql, python, scala, spark, spark-sql, sparksql, sql, trino, trinodb
Language: Dockerfile
Homepage:
Size: 1.62 MB
Stars: 65
Watchers: 4
Forks: 9
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        
# ngods: opensource data stack

This repository contains Docker compose script that creates opensource data analytics stack on your local machine.  

![ngods architecture](https://raw.githubusercontent.com/zsvoboda/ngods/main/img/ngods.png)

Currently, the stack consists of multiple components: 

- [**Trino**](https://trino.io/) for low-latency distributed SQL queries 

- [**Apache Spark**](https://spark.apache.org/) for data pipeline

- [**Apache Iceberg**](https://iceberg.apache.org/) for atomic data pipeline with schema evolution, and partitioning for performance 

- [**Hive Metastore**](https://cwiki.apache.org/confluence/display/hive/design#Design-Metastore) for metadata management (metadata are stored in MariaDB)

- [**Minio**](https://min.io/) for mimicking S3 storage on a local computer

I plan to add more components soon: 

- [**Postgres**](https://www.postgresql.org/) for low-latency queries (if Trino on Iceberg doesn't provide satisfactory low-latency queries). I'll also add Postgre Foreign Data Wrapper technology for more convenient ELT between Trino and Postgres

- [**DBT**](https://www.getdbt.com/) for ELT on top of Spark SQL or Trino

- [**GoodData.CN**](https://www.gooddata.com/developers/cloud-native/) or [**Cube.dev**](https://cube.dev/) for analytics model and metrics 

- [**Metabase**](https://www.metabase.com/) or [**Apache Superset**](https://superset.apache.org/) for dashboards and data visualization

## How ngods works: Simple example

You can simply start the ngds by executing 

```bash

docker-compose up

```

from the top-level directory where you've cloned this repo. 

Once all images are pulled and all containers start, open Minio in your browser [http://localhost:9000](http://localhost:9000) log in with ```minio``` username and  ```minio123``` password and create a top level bucket called ```bronze``` .

Then use your favorite SQL console tool (I use [DBeaver](https://dbeaver.io/)) and connect to the Trino instance running on your local machine (jdbc url: ```jdbc:trino://localhost:8080```, username: ```trino```, empty database) and execute the following script:

```sql

create schema if not exists warehouse.bronze with (location = 's3a://bronze/');

drop table if exists warehouse.bronze.employee;

create table warehouse.bronze.employee (

   employee_id integer not null,

   employee_name varchar not null

)

with (

   format = 'parquet'

);

insert into warehouse.bronze.employee (employee_id, employee_name) values (1, 'john doe');

insert into warehouse.bronze.employee (employee_id, employee_name) values (2, 'jane doe');

insert into warehouse.bronze.employee (employee_id, employee_name) values (3, 'joe doe');

insert into warehouse.bronze.employee (employee_id, employee_name) values (4, 'james doe');

select * from warehouse.bronze.employee;

```

## How ngods works: Loading parquet file as table

Now we'll load the February 2022 NYC taxi trip data parquet file as a new table to ngods.

First, create a new ```nyc``` directory in the ```./data/stage``` directory and download the [February 2022 NYC taxi trips parquet file](https://s3.amazonaws.com/nyc-tlc/trip%20data/fhvhv_tripdata_2022-02.parquet) to it.

Then open and execute the ngods [Spark notebook script](http://localhost:8888/notebooks/notebooks/spark.nyc.taxti.example.ipynb) that loads the data as a new table to the ```bronze``` schema of a ```warehouse``` database.

```python

df=spark.read.parquet("/home/data/stage/nyc/fhvhv_tripdata_2022-02.parquet")

df.writeTo("warehouse.bronze.ny_taxis_feb").create()

```

Now open your SQL console again, connect to the the Trino instance running on your local machine (jdbc url: ```jdbc:trino://localhost:8080```, username: ```trino```, empty database) and execute this SQL queries:

```sql

select count(*) from warehouse.bronze.ny_taxis_feb;

select 

	hour(pickup_datetime),

	sum(trip_miles),

	sum(trip_time),

	sum(base_passenger_fare)

from warehouse.bronze.ny_taxis_feb group by 1 order by 1;

```

You should see your query results in no time!

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/zsvoboda/ngods

Awesome Lists containing this project

README