{"id":23647649,"url":"https://github.com/hassonor/nrt-e2e-data-warehouses","last_synced_at":"2026-05-18T03:06:37.973Z","repository":{"id":267795679,"uuid":"902366494","full_name":"hassonor/nrt-e2e-data-warehouses","owner":"hassonor","description":null,"archived":false,"fork":false,"pushed_at":"2024-12-15T15:35:50.000Z","size":1369,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-12-28T14:39:45.823Z","etag":null,"topics":["apache-airflow","apache-kafka","apache-pinot","apache-superset","celery","financial-institutions","postgresql","redpanda"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hassonor.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-12T12:31:41.000Z","updated_at":"2024-12-15T15:33:25.000Z","dependencies_parsed_at":"2024-12-15T15:22:14.246Z","dependency_job_id":null,"html_url":"https://github.com/hassonor/nrt-e2e-data-warehouses","commit_stats":null,"previous_names":["hassonor/rt-e2e-data-warehouses","hassonor/nrt-e2e-data-warehouses"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hassonor%2Fnrt-e2e-data-warehouses","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hassonor%2Fnrt-e2e-data-warehouses/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hassonor%2Fnrt-e2e-data-warehouses/releases","manifests_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/hassonor%2Fnrt-e2e-data-warehouses/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hassonor","download_url":"https://codeload.github.com/hassonor/nrt-e2e-data-warehouses/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239599040,"owners_count":19665911,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-airflow","apache-kafka","apache-pinot","apache-superset","celery","financial-institutions","postgresql","redpanda"],"created_at":"2024-12-28T14:38:40.197Z","updated_at":"2025-10-12T10:11:09.513Z","avatar_url":"https://github.com/hassonor.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Near Real-Time End-to-End Data Warehouses (`nrt-e2e-data-warehouses`)\n\nThis project demonstrates a near real-time data pipeline using Apache Pinot, Apache Superset, Apache Airflow, and\nRedpanda\nKafka for ingesting and querying dimension and fact data.\n\n---\n\n## **Getting Started**\n\n### **Prerequisites**\n\nEnsure you have the following installed:\n\n- Docker and Docker Compose\n- A web browser for accessing the services\n\n---\n\n## **Setup Instructions**\n\n### **Step 1: Start the Environment**\n\nRun the following command to build and start all services:\n\n```bash\ndocker compose up --build -d\n```\n\n---\n\n## **Access Services**\n\n- **Airflow**: [http://localhost:8180](http://localhost:8180)\n    - Login with: `airflow` / `airflow`\n- **Redpanda Console**: [http://localhost:8080](http://localhost:8080)\n- 
**Apache Pinot**: [http://localhost:9091](http://localhost:9091)\n- **Apache Superset**: [http://localhost:8088](http://localhost:8088)\n    - Login with: `admin` / `admin`\n\n---\n\n## **Steps to Run the Pipeline**\n\n### Step 1: Generate Dimension Data\n\n1. In **Airflow**, run the following DAGs **once** to generate dimension data:\n    - `account_dim_generator`\n    - `branch_dim_generator`\n    - `customer_dim_generator`\n    - `date_dim_generator`\n\n---\n\n### Step 2: Generate Transaction Facts\n\n1. Run the `transaction_facts_generator` DAG **once**.\n2. Navigate to **Redpanda Console** and confirm the `transaction_facts` topic has been created.\n\n---\n\n### Step 3: Configure Pinot\n\n1. Run the following Airflow DAGs **in sequence**:\n    - `schema_dag`\n    - `table_dag`\n    - `load_dag`\n2. Navigate to **Apache Pinot**, click on `Tables` and `Query Console`, and verify that data has been consumed from\n   Kafka.\n\n---\n\n### Step 4: Configure Superset\n\n1. Navigate to **Apache Superset** at [http://localhost:8088](http://localhost:8088).\n    - Login with: `admin` / `admin`\n2. Connect a database:\n    - Go to `+ -\u003e Data -\u003e Connect a database`.\n    - Select `Apache Pinot` as the database.\n    - Fill in the **SQLAlchemy URI**:\n      ```\n      pinot://pinot-broker:8099/query?server=http://pinot-controller:9000\n      ```\n    - Click `TEST CONNECTION` and ensure \"Looks good\".\n    - Press `Connect`.\n\n---\n\n### Step 5: Query the Data\n\n1. Go to **SQL Lab** \u003e `SQL LAB`.\n    - Choose a table and view its schema with `SEE TABLE SCHEMA`.\n2. 
Run the following query in SQL Lab:\n   ```sql\n   SELECT\n     tf.*,\n     CONCAT(cd.first_name, ' ', cd.last_name) AS full_name,\n     email,\n     phone_number,\n     registration_date,\n     branch_name,\n     branch_address,\n     city,\n     state,\n     zipcode,\n     account_type,\n     status,\n     balance\n   FROM\n     transaction_facts tf\n     LEFT JOIN account_dim ad ON tf.account_id = ad.account_id\n     LEFT JOIN customer_dim cd ON tf.customer_id = cd.customer_id\n     LEFT JOIN branch_dim bd ON tf.branch_id = bd.branch_id;\n   ```\n\n3. Save the query results as a dataset:\n    - Click `Save Dropdown` \u003e `Save as New`.\n    - Name the dataset: **`transaction_fact_combined`**.\n    - Click `SAVE \u0026 EXPLORE`.\n\n---\n\n### Step 6: Create a Superset Dashboard\n\n#### **Charts to Add:**\n\n1. **Bar Chart**:\n    - **X-Axis**: `branch_name`\n    - **Metrics**: `transaction_amount (SUM)`\n    - **Row Limit**: 10\n    - Name the chart: **`Top 10 Profitable Branches`**.\n    - Save it to the dashboard.\n\n2. **Big Number Chart**:\n    - Metric: **`Count`**\n    - Name: **`Total Records`**.\n    - Save it to the dashboard.\n\n3. **Pie Chart (Currency Distribution)**:\n    - **Dimensions**: `currency`\n    - **Metrics**: `transaction_amount`\n    - Name: **`Currency Distribution`**.\n    - Save it to the dashboard.\n\n4. **Pie Chart (Account Type Distribution)**:\n    - **Dimensions**: `account_type`\n    - **Metrics**: `transaction_amount`\n    - Name: **`Account Type Distribution`**.\n    - Under `CUSTOMIZE`, enable:\n        - `SHOW TOTAL`\n        - Currency Format: `NIS`\n    - Save it to the dashboard.\n\n#### **Dashboard Settings**:\n\n1. Go to the dashboard, click `...` \u003e `Edit`.\n2. Set the dashboard refresh rate to **10 seconds**.\n\n---\n\n### Step 7: Watch Real-Time Updates\n\n1. In Airflow, run the `transaction_facts_generator` DAG to simulate new transactions.\n2. 
Navigate to **Superset** \u003e Dashboard.\n    - Watch as the values update dynamically with the new data from Kafka!\n\n---\n\n### **Production Notes**\n\n- For production, scale Kafka with more brokers to ensure reliability and high throughput.\n- Find a better way to move the CSV files from Airflow to Pinot.\n\n---\n\n### **Services Overview**\n\n- **Airflow**: Orchestrates data pipelines ([http://localhost:8180](http://localhost:8180)).\n- **Redpanda (Kafka)**: Real-time message broker ([http://localhost:8080](http://localhost:8080)).\n- **Pinot**: OLAP datastore for real-time analytics ([http://localhost:9091](http://localhost:9091)).\n- **Superset**: Data visualization and dashboarding ([http://localhost:8088](http://localhost:8088)).\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhassonor%2Fnrt-e2e-data-warehouses","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhassonor%2Fnrt-e2e-data-warehouses","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhassonor%2Fnrt-e2e-data-warehouses/lists"}