https://github.com/zsvoboda/ngods-stocks

New Generation Opensource Data Stack Demo
https://github.com/zsvoboda/ngods-stocks
cube dagster datahub dbt iceberg metabase python spark spark-sql trino trinodb
Last synced: 3 months ago
JSON representation
New Generation Opensource Data Stack Demo
Host: GitHub
URL: https://github.com/zsvoboda/ngods-stocks
Owner: zsvoboda
License: bsd-3-clause
Created: 2022-07-03T12:34:46.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2023-02-06T08:44:57.000Z (over 2 years ago)
Last Synced: 2025-03-29T12:07:39.883Z (3 months ago)
Topics: cube, dagster, datahub, dbt, iceberg, metabase, python, spark, spark-sql, trino, trinodb
Language: Jupyter Notebook
Homepage:
Size: 22.1 MB
Stars: 427
Watchers: 16
Forks: 100
Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

        # ngods stock market demo 

This repository contains a stock market analysis demo of the ngods data stack. The demo performs the following steps:

1. Download selected stock symbols data from [Yahoo Finance API](https://finance.yahoo.com/).

2. Store the stock data in ngods data warehouse (using [Iceberg](https://iceberg.apache.org/) format).

3. Transform the data (e.g. normalize stock prices) using [dbt](https://www.getdbt.com/).

4. Expose analytics data model using [cube.dev](https://cube.dev/).

5. Visualize data as reports and dashboards using [Metabase](https://www.metabase.com/).

6. Predicts stock prices using ARIMA in Apache Spark.

The demo is packaged as [docker-compose](https://github.com/docker/compose) script that downloads, installs, and runs all components of the data stack.

## UPDATES

- 2023-02-03: 

    - Upgrade to Apache Iceberg 1.1.0

    - Upgrade to Trino 406

    - Migrated to the new JDBC catalog (removed the heavyweigt Hive Metastore)

# ngods

ngods stands for New Generation Opensource Data Stack. It includes the following components: 

- [Apache Spark](https://spark.apache.org) for data transformation 

- [Apache Iceberg](https://iceberg.apache.org) as a data storage format 

- [Trino](https://trino.io/) for federated data query 

- [dbt](https://www.getdbt.com/) for ELT 

- [Dagster](https://dagster.io/) for data orchetsration 

- [cube.dev](https://cube.dev/) for data analysis and semantic data model 

- [Metabase](https://www.metabase.com/) for self-service data visualization (dashboards) 

- [Minio](https://min.io) for local S3 storage 

![ngods components](./img/ngods.architecture.png)

ngods is open-sourced under a [BSD license](https://github.com/zsvoboda/ngods-stocks/blob/main/LICENSE) and it is distributed as a docker-compose script that supports Intel and ARM architectures.

# Running the demo

ngods requires a machine with at least 16GB RAM and Intel or Arm 64 CPU running [Docker](https://www.docker.com/). It requires [docker-compose](https://github.com/docker/compose).

1. Clone the [ngods repo](https://github.com/zsvoboda/ngods-stocks)

```bash

git clone https://github.com/zsvoboda/ngods-stocks.git

```

2. Start the data stack with the `docker-compose up` command

```bash

cd ngods-stocks

docker-compose up -d

```

**NOTE:** This can take quite long depending on your network speed.

3. Stop the data stack via the `docker-compose down` command

```bash

docker-compose down

```

4. Execute the data pipeline from the Dagster console at http://localhost:3070/ with [this yaml config file](./projects/dagster/e2e.yaml).

![Dagster e2e](./img/demo/dagster.e2e.png)

Cut and paste the content of the [e2e.yaml file](./projects/dagster/e2e.yaml) to this [Dagster UI console page](http://localhost:3070/workspace/[email protected]/jobs/e2e/playground) and start the data pipeline by clicking the `Launch Run` button. 

**NOTE:** You can customize the list of stock symbols that will be downloaded. 

5. Review and customize the [cube.dev metrics, and dimensions](./conf/cube/schema/). Test these metrics in the [cube.dev playground](http://localhost:4000/#/build?query={%22measures%22:[%22StockMarketsMonthly.price_close_relative_avg%22],%22timeDimensions%22:[{%22dimension%22:%22StockMarketsMonthly.dt%22,%22granularity%22:%22month%22,%22dateRange%22:[%222014-09-01%22,%222022-07-03%22]}],%22dimensions%22:[%22StockMarketsMonthly.symbol%22],%22filters%22:[{%22member%22:%22StockMarketsMonthly.symbol%22,%22operator%22:%22equals%22,%22values%22:[%22AAPL%22,%22GC=F%22,%22BTC-USD%22]}],%22order%22:[[%22StockMarketsMonthly.symbol%22,%22asc%22],[%22StockMarketsMonthly.dt%22,%22desc%22]]}).

![cube.dev playground](./img/demo/cube.playground.png)

See the [cube.dev documentation](https://cube.dev/docs/) for more information.

6. Check out the Metabase [data visualizations](http://localhost:3030/question#eyJkYXRhc2V0X3F1ZXJ5Ijp7InR5cGUiOiJuYXRpdmUiLCJuYXRpdmUiOnsicXVlcnkiOiJzZWxlY3QgXG4gICAgICAgIGR0LCBcbiAgICAgICAgc3ltYm9sLCBcbiAgICAgICAgcHJpY2VfY2xvc2VfcmVsYXRpdmVfYXZnIFxuICAgIGZyb20gU3RvY2tNYXJrZXRzTW9udGhseVxuICAgIHdoZXJlIFxuICAgICAgICBzeW1ib2wgaW4gKCdBQVBMJywgJ0JUQy1VU0QnLCAnR0M9RicpIGFuZCBcbiAgICAgICAgZHQgPj0gJzIwMTQtMDktMDEnXG4gICAgb3JkZXIgYnkgZHQsIHN5bWJvbFxuICAgICIsInRlbXBsYXRlLXRhZ3MiOnt9fSwiZGF0YWJhc2UiOjN9LCJkaXNwbGF5IjoibGluZSIsImRpc3BsYXlJc0xvY2tlZCI6dHJ1ZSwidmlzdWFsaXphdGlvbl9zZXR0aW5ncyI6eyJncmFwaC5kaW1lbnNpb25zIjpbImR0Iiwic3ltYm9sIl0sImdyYXBoLm1ldHJpY3MiOlsicHJpY2VfY2xvc2VfcmVsYXRpdmVfYXZnIl0sImdyYXBoLnhfYXhpcy50aXRsZV90ZXh0IjoiRGF0ZSAobW9udGhzKSIsImdyYXBoLnlfYXhpcy50aXRsZV90ZXh0IjoiQ2xvc2UgcHJpY2UgKHJlbGF0aXZlIHRvIEphbiAxc3QgMjAwMCkifSwib3JpZ2luYWxfY2FyZF9pZCI6MzN9) that is connected to the cube.dev analytical model. You can run [SQL queries](https://cube.dev/docs/backend/sql) on top of the cube.dev schema.  

 

 Use username `[email protected]` and password `metabase1`.

![Metabase](./img/metabase.png)

You can create your own data visualizations and dashboards. See the [Metabase documentation](https://metabase.com/docs/latest) for more information.

7. Predict stock close price. Run the [ARIMA time-series prediction model](http://localhost:8888/notebooks/arima.ipynb) notebook that is trained on 29 months of the `Apple:AAPL` stock data and predicts the next month.

![Jupyter ARIMA](./img/jupyter.arima.png)

8. Download [DBeaver](https://dbeaver.io/download/) SQL tool.

9. Connect to the Postgres database that contains the `gold` stage data. Use `jdbc:postgresql://localhost:5432/ngods` JDBC URL with username `ngods` and password `ngods`.

![Postgres JDBC connection](./img/demo/postgres.jdbc.connection.png)

10. Connect to the Trino database that has access to all data stages (`bronze`, `silver`, and `gold` schemas of the `warehouse` database). Use `jdbc:trino://localhost:8060` JDBC URL with username `trino` and password `trino`. 

![Trino JDBC connection](./img/demo/trino.jdbc.connection.png)

![Trino schemas](./img/demo/trino.schemas.png)

11. Connect to the Spark database that is used for data transformations. Use `jdbc:hive2://localhost:10009` JDBC URL with no username and password.

![Spark JDBC connection](./img/demo/spark.jdbc.connection.png)

# Customizing the demo

This chapter contains useful information for customizing the demo.

## ngods directories

Here are few distribution's directories that you may need to customize:

- `conf` configuration of all data stack components

    - `cube` cube.dev schema (semantic model definition)

- `data` main data directory 

    - `minio` root data directory (contains buckets and file data)

    - `spark` Jupyter notebooks

    - `stage` file stage data. Spark can access this directory via `/var/lib/ngods/stage` path. 

- `projects` dbt, Dagster, and DataHub projects

    - `dagster` Dagster orchestration project

    - `dbt` dbt transformations (one project per each medallion stage: `bronze`, `silver`, and `gold`) 

## ngods endpoints

The data stack has the following endpoints

- Spark

    - http://localhost:8888 - Jupyter notebooks 

    - `jdbc:hive2://localhost:10009` JDBC URL (no username / password)

    - localhost:7077 - Spark API endpoint

    - http://localhost:8061 - Spark master node monitoring page 

    - http://localhost:8062 - Spark slave node monitoring page 

    - http://localhost:18080 - Spark history server page 

- Trino

    - `jdbc:trino://localhost:8060` JDBC URL (username `trino` / no password)

- Postgres

    - `jdbc:postgresql://localhost:5432/ngods` JDBC URL (username `ngods` / password `ngods`)

- Cube.dev

    - http://localhost:4000 - cube.dev development UI 

    - `jdbc:postgresql://localhost:3245/cube` JDBC URL (username `cube` / password `cube`)

- Metabase

    - http://localhost:3030 Metabase UI (username `[email protected]` / password `metabase1`)

- Dagster

    - http://localhost:3070 - Dagster orchestration UI

- Minio

    - http://localhost:9001 - Minio UI (username `minio` / password `minio123`)

## ngods databases: Spark, Trino, and Postgres

ngods stack includes three database engines: Spark, Trino, and Postgres. Both Spark and Trino have access to Iceberg tables in `warehouse.bronze` and `warehouse.silver` schemas. Trino engine can also access the `analytics.gold` schema in Postgres. Trino can federate queries between the Postgres and Iceberg tables. 

The Spark engine is configured for ELT and pyspark data transformations. 

![Spark](./img/spark.schemas.png)

The Trino engine is configured for data federation between the Iceberg and Postgres tables. Additional catalogs can be [configured](./conf/trino/catalog) as needed. 

![Trino](./img/trino.schemas.png)

The Postgres database has accesses only to the `analytics.gold` schema and it is used for executing analytical queries over the gold data.

## Demo data pipeline

The demo data pipeline is utilizes the [medallion architecture](https://databricks.com/fr/glossary/medallion-architecture) with `bronze`, `silver`, and `gold` data stages. 

![data pipeline](./img/data.pipeline.png)

and consists of the following phases:

1. Data are downloaded from Yahoo Finance REST API to the local Minio bucket ([./data/stage](./data/stage)) using this [Dagster operation](./projects/dagster/download.py).

2. The downloaded CSV file is loaded to the bronze stage Iceberg tables (warehouse.bronze Spark schema) using dbt models that are executed in Spark ([./projects/dbt/bronze](./projects/dbt/bronze/models/in_yahoo_finance.sql)).

3. Silver stage Iceberg tables (warehouse.silver Spark schema) are created using dbt models that are executed in Spark ([./projects/dbt/silver](./projects/dbt/silver/models/stock_markets_with_relative_prices.sql)). 

5. Gold stage Postgres tables (analytics.gold Trino schema) are created using dbt models that are executed in Trino ([./projects/dbt/gold](./projects/dbt/gold/models/stock_markets.sql)).

![DBT models](./img/dbt.models.png)

All data pipeline phases are orchestrated by [Dagster](https://www.dagster.io/) framework. Dagster operations, resources and jobs are defined in the [Dagster project](./projects/dagster/). 

![Dagster console](./img/dagster.console.png)

The pipeline is executed by running the e2e job from the Dagster console at http://localhost:3070/ using [this yaml config file](./projects/dagster/e2e.yaml)

## ngods analytics layer

ngods includes [cube.dev](https://cube.dev/) for [semantic data model](./conf/cube/schema) and [Metabase](https://www.metabase.com/) for self-service analytics (dashboards, reports, and visualizations).

![Analytics](./img/analytics.png)

Analytical (semantic) model is defined in [cube.dev](https://cube.dev/) and is used for executing analytical queries over the gold data.

![cube.dev](./img/cube.png)

[Metabase](https://www.metabase.com/) is connected to the [cube.dev](https://cube.dev/) via [SQL API](https://cube.dev/docs/backend/sql). End users can use it for self-service creation of dashboards, reports, and data visualizations. [Metabase](https://www.metabase.com/) is also directly connected to the gold schema in the Postgres database.

![Metabase](./img/demo/metabase.cube.connection.png)

## ngods machine learning

[Jupyter Notebooks](https://jupyter.org/) with Scala, Java and Python backends can be used for machine learning.

![Jupyter](./img/jupyter.arima.png)

# Support

Create a [github issue](https://github.com/zsvoboda/ngods-stocks/issues) if you have any questions.
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/zsvoboda/ngods-stocks

Awesome Lists containing this project

README