{"id":48035864,"url":"https://github.com/shsiddhant/cricket-warehouse","last_synced_at":"2026-04-04T13:59:08.597Z","repository":{"id":343862065,"uuid":"1178767172","full_name":"shsiddhant/cricket-warehouse","owner":"shsiddhant","description":"A data warehouse for ball-by-ball cricket match data, designed for analytics and modeling.","archived":false,"fork":false,"pushed_at":"2026-03-19T17:14:26.000Z","size":527,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-19T21:29:56.595Z","etag":null,"topics":["cricket-data","dbt","elt","postgresql","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shsiddhant.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-11T10:50:36.000Z","updated_at":"2026-03-19T17:14:34.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/shsiddhant/cricket-warehouse","commit_stats":null,"previous_names":["shsiddhant/cricket-warehouse"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/shsiddhant/cricket-warehouse","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shsiddhant%2Fcricket-warehouse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shsiddhant%2Fcricket-warehouse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shsi
ddhant%2Fcricket-warehouse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shsiddhant%2Fcricket-warehouse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shsiddhant","download_url":"https://codeload.github.com/shsiddhant/cricket-warehouse/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shsiddhant%2Fcricket-warehouse/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31402276,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T10:20:44.708Z","status":"ssl_error","status_checked_at":"2026-04-04T10:20:06.846Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cricket-data","dbt","elt","postgresql","python"],"created_at":"2026-04-04T13:59:07.922Z","updated_at":"2026-04-04T13:59:08.590Z","avatar_url":"https://github.com/shsiddhant.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Cricket Warehouse\n\n![Python Version from PEP 621 TOML](https://img.shields.io/python/required-version-toml?tomlFilePath=https%3A%2F%2Fraw.githubusercontent.com%2Fshsiddhant%2Fmemory.fm%2Frefs%2Fheads%2Fmain%2Fpyproject.toml\u0026style=for-the-badge\u0026logo=python\u0026logoColor=FFE873\u0026color=4B8BBE)\n![dbt](https://img.shields.io/badge/dbt-data_pipeline-orange?style=for-the-badge)\n![Apache 
Airflow](https://img.shields.io/badge/Apache%20Airflow-017CEE?style=for-the-badge\u0026logo=Apache%20Airflow\u0026logoColor=white)\n![Postgres](https://img.shields.io/badge/postgres-%23316192.svg?style=for-the-badge\u0026logo=postgresql\u0026logoColor=white)\n![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge)\n\nAn ELT pipeline that builds a data warehouse for ball-by-ball cricket match data, designed for analytics and modeling.\n\nThe project ingests raw match JSON from [Cricsheet](https://cricsheet.org/), normalizes the data into relational tables in PostgreSQL, and builds analytical models using dbt.\n\nThe pipeline follows an ELT architecture, where raw data is first loaded into the warehouse and transformations are performed using dbt.\n\n## Overview\n\nCricsheet data is nested and complex. Each match file contains hierarchical JSON describing:\n\n- match info\n\t- match dates and format\n\t- teams and players\n\t- outcomes and events\n- innings\n\t- deliveries\n\nThis project builds a reproducible data pipeline that transforms this raw data into a structured warehouse suitable for querying and analytics.\n\n**Pipeline stages:**\n\n1. Fetch raw match data from Cricsheet\n2. Ingest JSON files into normalized source tables\n3. Track ingestion state using file hashes for incremental updates\n4. Transform and model data using dbt\n5. Expose analytical tables for queries and downstream models\n\n---\n\n\n## Installation\n\n### Prerequisites\n\n* Python 3.10+\n* PostgreSQL 14+\n* Git\n* Docker (optional)\n\nThere are two ways to run the pipeline: locally, or in a Docker container orchestrated with Airflow.\n\n---\n\n### Running Locally\n\n\n#### 1. Clone the repository\n\n```shell\ngit clone https://github.com/shsiddhant/cricket-warehouse.git\ncd cricket-warehouse\n```\n\n#### 2. 
Create and activate a virtual environment\n\n#### Using `uv`\n\n```shell\nuv venv .venv --seed\nsource .venv/bin/activate\nuv sync\n```\n\n#### Using `pip`\n\n```shell\npython -m venv .venv\nsource .venv/bin/activate\npip install .\n```\n\n#### 3. Use the CLI to run the pipeline manually\n\nSee [below](#CLI) for how to use the CLI to run the pipeline.\n\n---\n\n### Run in Airflow\n\nThe full pipeline can be run using Docker Compose.\n\n\n#### 1. Clone the repository\n\n```shell\ngit clone https://github.com/shsiddhant/cricket-warehouse.git\ncd cricket-warehouse\n```\n\n#### 2. Start services\n\n```bash\ndocker compose up --build\n```\n\nThis will start:\n\n- PostgreSQL (Data Warehouse)\n- Airflow Scheduler\n- Airflow Webserver (UI)\n\n\n#### 3. Access Airflow UI\n\nYou can open the UI at [localhost:8080](http://localhost:8080).\n\nDefault credentials:\n\n- username: admin\n- password: admin\n\nYou can set credentials in a .env file. Look at the .env.example provided in the repo for variable names.\n\n![Airflow Webserver Log in](assets/log_in.png)\n\n#### 4. Add a Connection\n\nBefore you can run the DAG, you must set up a PostgreSQL connection using the credentials\npresent in the .env file.\n\n![Add a PostgreSQL connection](assets/add_pg_conn.png)\n\n#### 5. Run the DAG\n\nNow you can run the Airflow DAG.\n\nEdit the Cricsheet URL as per your needs, and start the DAG. Note that you must use the correct connection\nid as set in the previous step.\n\n![Toggle and trigger the DAG](assets/cricket_elt_dag.png)\n\n\n#### 6. 
Access the Warehouse\n\nYou can access the warehouse for queries by running the psql in the postgres container:\n\n```\ndocker exec -it cricket-warehouse-postgres-1 psql -U admin -d cricket_warehouse \n```\n\n---\n\n## Data Source\n\n### Cricsheet\n\n**Source:** https://cricsheet.org/\n\nCricsheet provides structured cricket match data across multiple formats and leagues.\n\nRaw dataset characteristics:\n\n\n- **Format:** JSON\n- **Granularity:** Ball-by-ball\n- **File structure:**\tOne file per match\n\n---\n\n## Architecture\n\n```mermaid\nflowchart LR\n\nsubgraph Airflow DAG\n    A[Cricsheet] --\u003e|Fetch and Extract JSON| B[Local Cache]\n    B --\u003e|Batch Ingestion| C[Source Tables: PostgreSQL]\n\nC --\u003e|dbt| D[Staging Layer: Normalized JSONB]\n\nD --\u003e|dbt| E[Intermediate Layer: Relational Tables]\n\nE --\u003e|dbt| F[Marts: Analytics Tables]\n\nend\n```\n\nThe entire pipeline, including ingestion and transformation, is orchestrated within an Airflow DAG.\n\nKey design decisions:\n\n- JSON ingestion is partially flattened during loading.\n- Ingestion is incremental, tracked using file hashes.\n- Transformations are implemented as dbt models\n- venue metadata is managed via seed tables.\n\n---\n\n## Orchestration (Airflow)\n\nThe entire ELT pipeline is orchestrated using Apache Airflow running in Docker.\n\n- Airflow runs with a `LocalExecutor` for parallel task execution\n- The pipeline is implemented as a DAG (`cricket_etl`) covering both ingestion and transformation stages\n- Tasks include:\n  - Fetching and extracting JSON data from Cricsheet\n  - Batch ingestion into PostgreSQL source tables\n  - Running dbt transformations within the warehouse\n\ndbt transformations are executed using **Astronomer Cosmos**, which integrates dbt directly into Airflow DAGs.\n\nThe DAG is structured into task groups aligned with dbt layers:\n\n- **Staging**\n- **Intermediate**\n- **Marts**\n\nThis provides clear visibility into pipeline stages and dependencies 
within the Airflow UI.\n\nThe pipeline is fully automated end-to-end, requiring no manual dbt execution.\n\n---\n\n## Database Model\n\nThe warehouse stores cricket match data in normalized relational tables.\n\n### Core models\n\n```mermaid\nerDiagram\n    int_matches {\n        integer match_id PK\n        text venue_id FK\n        date start_date\n        text format\n        text event_name\n        text winner\n        text player_of_match\n    }\n\n    int_deliveries {\n        integer match_id FK\n        integer innings_number\n        text team\n        integer over_number\n        integer ball_in_over\n        integer runs\n        text batter\n        text bowler\n        text player_out\n    }\n\n    int_innings {\n        integer match_id FK\n        integer innings_number\n        text team\n        bigint runs_scored\n        bigint wickets_lost\n    }\n\n    int_match_players {\n        text match_player_id PK\n        integer match_id FK\n        text player_id\n        text player_name\n        text team\n    }\n\n    int_match_teams {\n        text match_team_id PK\n        integer match_id FK\n        text team\n        text opponent\n        boolean won_match\n    }\n\n    int_players {\n        text player_id PK\n        text team_id FK\n        text player_name\n        text team\n    }\n\n    int_teams {\n        text team_id PK\n        text team\n        text format\n    }\n\n    %% Core Relationships\n    int_matches ||--o{ int_deliveries : \"recorded in\"\n    int_matches ||--o{ int_innings : \"summarizes\"\n    int_matches ||--o{ int_match_players : \"participated by\"\n    int_matches ||--o{ int_match_teams : \"contested by\"\n    int_teams ||--o{ int_players : \"belongs to\"\n    \n```\n\n### Layers\n\nThere are three layers of dbt DAG:\n\n1. **Staging Models**\n\t - `stg_cricsheet__match_info`: Stage match info JSONB into normalized table.\n\t - `stg_cricsheet__deliveries`: Stage deliveries JSONB into normalized table.\n\n2. 
**Intermediate Models**\n    - `int_venues`: Each row represents a match venue.\n    - `int_matches`: Each row represents a match, with columns such as\n    - `int_deliveries`: Each row represents a unique match delivery.\n    - `int_innings`: Each row represents a match innings, with columns such as\n    - `int_teams`: Each row represents a unique (team, format) pair.\n    - `int_match_teams`: Junction table representing the many-to-many relationship between matches and teams.\n    - `int_players`: Each row represents a unique (player, team, format) tuple.\n    - `int_match_players`: Junction table representing the many-to-many relationship between matches and players.\n\n3. **Marts**\n    - `fct_batting_order`: Batting order of each match innings.\n    - `fct_dismissed_players`: Dismissed players in each match innings.\n    - `fct_deliveries_sequence`: Sequence of deliveries in each innings.\n    - `fct_batting_scorecard`: Batting scorecard of each innings.\n    - `fct_bowling_scorecard`: Bowling scorecard of each innings.\n\n---\n\n## Quick Test Dataset\n\nIf you want to quickly test the pipeline, you can ingest a small subset of matches from Cricsheet tournament archives.\n\nExamples:\n\n- [ICC Women's Cricket World Cup](https://cricsheet.org/downloads/icc_womens_cricket_world_cup_json.zip)\n- [Indian Premier League](https://cricsheet.org/downloads/ipl_json.zip)\n\nRun the pipeline with these URLs to run the example queries below.\n\n---\n\n## Example Analytical Queries\n\nOnce the warehouse is built, you can use the analytical marts to answer questions such as:\n\n### Top Run Scorers - ICC Women's World Cup 2025\n\n```sql\nWITH stats AS (\n\nSELECT\n\n    COUNT(*) AS innings,\n    bs.player_name AS batter,\n    SUM(bs.runs) AS runs,\n    SUM(bs.balls) AS balls,\n    SUM(CASE WHEN bs.is_dismissed THEN 1 ELSE 0 END) AS dismissals\n\nFROM fct_batting_scorecard bs\nJOIN int_matches m USING (match_id)\nWHERE\n    m.event_name = 'ICC Women''s World Cup' AND\n    
EXTRACT( YEAR FROM m.start_date) = 2025\nGROUP BY bs.player_name\n)\n\nSELECT\n\n    batter,\n    innings,\n    innings - dismissals AS not_outs,\n    runs,\n    ROUND(runs / NULLIF(dismissals, 0), 2) AS average,\n    ROUND(100.0 * runs / NULLIF(balls, 0), 2) AS strike_rate\n\nFROM stats\nORDER BY runs DESC\nLIMIT 10\n\n```\n\n**Output:**\n\n```\n     batter      | innings | not_outs | runs | average | strike_rate\n-----------------+---------+----------+------+---------+-------------\n L Wolvaardt     |       9 |        1 |  571 |   71.38 |       98.79\n S Mandhana      |       9 |        1 |  434 |   54.25 |       99.09\n A Gardner       |       5 |        1 |  328 |   82.00 |      130.16\n Pratika Rawal   |       6 |        0 |  308 |   51.33 |       77.78\n P Litchfield    |       7 |        1 |  304 |   50.67 |      112.18\n AJ Healy        |       5 |        1 |  299 |   74.75 |      125.10\n JI Rodrigues    |       7 |        2 |  292 |   58.40 |      101.04\n SFM Devine      |       5 |        0 |  289 |   57.80 |       85.25\n HC Knight       |       7 |        1 |  288 |   48.00 |       85.71\n NR Sciver-Brunt |       6 |        0 |  262 |   43.67 |       85.34\n(10 rows)\n```\n\n### Top Bowlers by Dot Balls Bowled - IPL 2025\n\n```sql\nWITH stats AS (\n\nSELECT\n\n    COUNT(*) AS innings,\n    bs.bowler,\n    SUM(bs.runs) AS runs,\n    SUM(bs.balls) AS balls,\n    SUM(bs.wickets) AS wickets,\n    SUM(bs.dots) AS dots\n\nFROM fct_bowling_scorecard bs\nJOIN int_matches m USING (match_id)\nWHERE\n    m.event_name = 'Indian Premier League' AND\n    EXTRACT( YEAR FROM m.start_date) = 2025\nGROUP BY bs.bowler\n)\n\nSELECT\n\n    bowler,\n    innings,\n    wickets,\n    DIV(balls, 6)::text || '.' 
|| MOD(balls, 6) AS overs,\n    ROUND(runs / NULLIF(wickets, 0), 2) AS average,\n    ROUND(6 * runs / NULLIF(balls, 0), 2) AS economy,\n    dots,\n    ROUND(100 * dots / NULLIF(balls, 0), 2) AS dot_ball_pct\n\nFROM stats\nORDER BY dots DESC, dot_ball_pct DESC\nLIMIT 10;\n```\n\n**Output**\n\n```\n      bowler       | innings | wickets | overs | average | economy | dots | dot_ball_pct\n-------------------+---------+---------+-------+---------+---------+------+--------------\n Mohammed Siraj    |      15 |      16 | 57.0  |   32.94 |    9.25 |  151 |        44.15\n M Prasidh Krishna |      15 |      25 | 59.0  |   19.52 |    8.27 |  146 |        41.24\n KK Ahmed          |      14 |      15 | 46.4  |   29.80 |    9.58 |  137 |        48.93\n Arshdeep Singh    |      16 |      21 | 58.2  |   24.67 |    8.88 |  137 |        39.14\n JJ Bumrah         |      12 |      18 | 47.2  |   17.56 |    6.68 |  128 |        45.07\n TA Boult          |      16 |      22 | 57.4  |   23.50 |    8.97 |  127 |        36.71\n B Kumar           |      14 |      17 | 52.0  |   28.41 |    9.29 |  123 |        39.42\n JR Hazlewood      |      12 |      22 | 44.0  |   17.55 |    8.77 |  120 |        45.45\n PJ Cummins        |      14 |      16 | 49.4  |   28.13 |    9.06 |  118 |        39.60\n CV Varun          |      13 |      17 | 50.0  |   22.53 |    7.66 |  117 |        39.00\n(10 rows)\n```\n---\n\n\n## CLI\n\nThe project includes a CLI for managing the ingestion pipeline.\n\n```shell\ncricwh --help\nUsage: cricwh [OPTIONS] COMMAND [ARGS]...                                                                                                                                    \n╭─ Options ───────────────────────────────────────────────────────────────╮\n│ --install-completion       Install completion for the current shell. │\n│ --show-completion          Show completion for the current shell, to │\n│                            copy it or customize the installation.    
│\n│ --help                     Show this message and exit.               │\n╰──────────────────────────────────────────────────────────────────────────╯\n╭─ Commands───────────────────────────────────────────────────────────────╮\n│ fetch      Fetch data from Cricsheet.                                │\n│ configure  Configure cricket-warehouse.                              │\n│ init       Initialize source tables and seeds.                       │\n│ ingest     Ingest JSON files into source tables.                     │\n│ update     Update venue city seed.                                   │\n╰──────────────────────────────────────────────────────────────────────────╯\n```\n\n### Configuration\n\nA config file is provided to manage PostgreSQL database credentials. On first run, `cricwh` initializes an example config. The config file may be found at:\n\n|Operating System |Location|\n|---|---|\n|**Linux/Unix** |`~/.config/cricketwarehouse/config.yaml`|\n|**macOS**|`~/Library/Preferences/cricketwarehouse/config.yaml`|\n|**Windows**|`C:\\Users\\\u003cusername\u003e\\AppData\\Local\\cricketwarehouse\\cricketwarehouse/config.yaml`|\n\nYou can edit the configuration using the `configure` command:\n\n```shell\ncricwh configure [--init-config-file]\n```\n\nYou can reset the config file using the `--init-config-file` flag in the `configure` command.\n\n### Logs\n\nDetailed logs are written during each command. The log file may be found at:\n\n|Operating System |Location|\n|---|---|\n|**Linux/Unix** |`~/.local/share/cricketwarehouse/cricwh.log`|\n|**macOS**|`~/Library/Application Support/cricketwarehouse/cricwh.log`|\n|**Windows**|`C:\\Users\\\u003cusername\u003e\\AppData\\Local\\cricketwarehouse\\cricketwarehouse/cricwh.log`|\n\n### CLI Workflow\n\nAssuming you've configured your database, a typical workflow goes as follows:\n\n1. Initialize source tables (only on first run)\n\n\t```shell\n\tcricwh init\n\t```\n\n2. 
Fetch and extract raw match data.\n\n    ```shell\n    cricwh fetch [URL] [ZIP FILE PATH]\n    ```\n\n3. Ingest match data into source tables.\n\n\t```shell\n\tcricwh ingest\n\t```\n\n4. Run dbt models\n\t```shell\n\tdbt deps --project-dir dbt/\n    dbt build --project-dir dbt/\n\t```\n\n---\n\n## Tools and Libraries\n\n| Tool\t            | Purpose                           |\n|-------------------|-----------------------------------|\n| Python\t        | Data ingestion and CLI tooling    |\n| Apache Airflow    | Pipeline Orchestration            |\n| dbt\t            | Data modeling and transformations |\n| PostgreSQL        | Data warehouse                    |\n| Astronomer Cosmos | ELT Pipeline Orchestration        |\n| psycopg2\t        | PostgreSQL database interface     |\n\n---\n\n## License\n\n[MIT](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshsiddhant%2Fcricket-warehouse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshsiddhant%2Fcricket-warehouse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshsiddhant%2Fcricket-warehouse/lists"}