{"id":13696391,"url":"https://github.com/droher/boxball","last_synced_at":"2025-09-07T15:34:46.504Z","repository":{"id":42043040,"uuid":"183717434","full_name":"droher/boxball","owner":"droher","description":"Prebuilt Docker images with Retrosheet's complete baseball history data for many analytical frameworks. Includes Postgres, cstore_fdw, MySQL, SQLite, Clickhouse, Drill, Parquet, and CSV.","archived":false,"fork":false,"pushed_at":"2023-12-16T15:34:00.000Z","size":286,"stargazers_count":122,"open_issues_count":8,"forks_count":18,"subscribers_count":14,"default_branch":"master","last_synced_at":"2025-04-20T11:06:50.459Z","etag":null,"topics":["apache-drill","baseball","baseballdatabank","clickhouse","column-store","containers","docker","mysql","play-by-play","postgres","postgresql","retrosheet","sabermetrics","sports","sports-data","sports-stats","sql","sqlite"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/droher.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2019-04-27T02:09:59.000Z","updated_at":"2025-03-30T17:15:04.000Z","dependencies_parsed_at":"2023-12-14T19:25:18.159Z","dependency_job_id":"458c7191-66d4-4c5d-ac10-1f4d9fd85d4d","html_url":"https://github.com/droher/boxball","commit_stats":{"total_commits":55,"total_committers":4,"mean_commits":13.75,"dds":0.07272727272727275,"last_synced_commit":"f5a0bd4908243020a3267cd86df833c97ac3f834"},"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/droher%2Fboxball","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/droher%2Fboxball/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/droher%2Fboxball/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/droher%2Fboxball/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/droher","download_url":"https://codeload.github.com/droher/boxball/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252226645,"owners_count":21714836,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-drill","baseball","baseballdatabank","clickhouse","column-store","containers","docker","mysql","play-by-play","postgres","postgresql","retrosheet","sabermetrics","sports","sports-data","sports-stats","sql","sqlite"],"created_at":"2024-08-02T18:00:39.177Z","updated_at":"2025-05-03T17:30:57.137Z","avatar_url":"https://github.com/droher.png","language":"Python","funding_links":[],"categories":["Integrations"],"sub_categories":["ETL and Data Processing"],"readme":"\u003cp align=\"center\"\u003e\n\u003cimg src=\"./assets/boxball.jpg\" width=\"50%\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg alt=\"GitHub release\" src=\"https://img.shields.io/github/release/droher/boxball.svg\"\u003e\n\u003ca href=\"https://circleci.com/gh/droher/boxball\"\u003e\n    \u003cimg 
src=\"https://circleci.com/gh/droher/boxball.svg?style=shield\u0026circle-token=2b78bfd4c600c640c479f2f2d9eaa38823ad8b96\"/\u003e\n\u003c/a\u003e\n\u003ca href=\"https://www.codacy.com?utm_source=github.com\u0026amp;utm_medium=referral\u0026amp;utm_content=droher/boxball\u0026amp;utm_campaign=Badge_Grade\"\u003e\n    \u003cimg src=\"https://api.codacy.com/project/badge/Grade/9a163160d3db4621b941b3297bfb9edf\"/\u003e\n\u003c/a\u003e\n\u003ca href=\"https://www.codacy.com?utm_source=github.com\u0026amp;utm_medium=referral\u0026amp;utm_content=droher/boxball\u0026amp;utm_campaign=Badge_Coverage\"\u003e\n    \u003cimg src=\"https://api.codacy.com/project/badge/Coverage/9a163160d3db4621b941b3297bfb9edf\"/\u003e\n\u003c/a\u003e\n\u003cimg alt=\"Docker Pulls\" src=\"https://img.shields.io/docker/pulls/doublewick/boxball.svg\"\u003e\n\u003ca href=\"https://opensource.org/licenses/Apache-2.0\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/License-Apache%202.0-blue.svg\" /\u003e\u003c/a\u003e\n\u003cbr\u003e\n\u003c/p\u003e\n\n**Update**: I have released a new project, [baseball.computer](https://baseball.computer), which is designed\nas the successor to boxball. It is much easier to use (no Docker required, runs entirely in your browser/program)\nand includes many more tables, features, and quality controls. The event schema is different, which will be the main migration pain point. _I aim to continue Boxball maintenance and updates as long as people are still using it,_ and I may try to rebase\nboxball on top of the new project to make maintaining both easier. Please let me know if there are things you can do in Boxball that you can't do yet in baseball.computer by filing an issue on the [repo](https://github.com/droher/baseball.computer) or reaching me at david.roher@baseball.computer. 
\n\n## Introduction\n**Boxball** creates prepopulated databases of the two most significant open source baseball datasets:\n[Retrosheet](http://retrosheet.org) and the [Baseball Databank](https://github.com/chadwickbureau/baseballdatabank).\nRetrosheet contains information on every major-league pitch since 2000, every play since 1928,\nevery box score since 1901, and every game since 1871.\nThe Databank (based on the [Lahman Database](http://www.seanlahman.com/baseball-archive/statistics/)) contains yearly\nsummaries for every player and team in history. In addition to the data and databases themselves, Boxball relies on the following tools:\n*   [Docker](https://docs.docker.com/engine/docker-overview/) for repeatable builds and easy distribution\n*   [SQLAlchemy](https://www.sqlalchemy.org/) for abstracting away DDL differences between databases\n*   [Chadwick](https://github.com/chadwickbureau/chadwick) for translating Retrosheet's complex event files into a relational format\n\nFollow the instructions below to install your distribution of choice. The full set of images is also available on\nDocker Hub.\n\nThe Retrosheet schema is extensively documented in the code; see the source [here](https://github.com/droher/boxball/blob/master/transform/src/schemas/retrosheet.py)\nuntil I find a prettier solution.\n\nIf you find the project useful, please consider donating to:\n*   The [Ali Forney Center](https://aliforneycenter.donordrive.com/index.cfm?fuseaction=donate.general) for homeless LGBTQ youth\n*   [350.org](https://act.350.org/donate/build/), a grassroots international climate change organization\n\nFeel free to [contact me](mailto:david@boxball.io) with questions or comments! 
\n\n## Requirements\n*   [Docker](https://docs.docker.com/install/) (v18.06 or later; earlier versions may not work)\n*   2-20GB disk space (depends on distribution choice)\n*   500MB-8GB RAM available to Docker (depends on distribution choice)\n\n## Distributions\n### Column-Oriented Databases\n#### Postgres cstore_fdw (Recommended)\nThis distribution uses the [cstore_fdw](https://github.com/citusdata/cstore_fdw) extension to turn PostgreSQL\ninto a column-oriented database. This means that you get the rich feature set of Postgres,\nbut with a huge improvement in speed and disk usage. To install and run the database server:\n\n`docker run --name postgres-cstore-fdw -d -p 5433:5432 -e POSTGRES_PASSWORD=\"postgres\" -v ~/boxball/postgres-cstore-fdw:/var/lib/postgresql/data doublewick/boxball:postgres-cstore-fdw-latest`\n\nRoughly an hour after the image is downloaded, the data will be fully loaded into the database, and you can connect to it as the user `postgres`\nwith password `postgres` on port `5433`\n(either using the `psql` command line tool or a database client of your choice). The data will be persisted on your machine in\n`~/boxball/postgres-cstore-fdw` (~1.5GB), which means you can stop/remove the container without having to reload the data\nwhen you turn it back on.\n\n#### ClickHouse\n[ClickHouse](https://clickhouse.yandex/) is a database developed by Yandex with some very impressive performance benchmarks. It uses less\ndisk space than Postgres cstore_fdw, but significantly more RAM (~5GB). 
I've yet to run any query performance comparisons.\nTo install and run the database server:\n\n`docker run --name clickhouse -d -p 8123:8123 -v ~/boxball/clickhouse:/var/lib/clickhouse doublewick/boxball:clickhouse-latest`\n\n15-30 minutes after the image is downloaded, the data will be fully loaded into the database, and you can connect to it either by attaching the\ncontainer and using the `clickhouse-client` CLI or by using a local database client on port `8123` as the user `default`. \nThe data will be persisted on your machine in\n`~/boxball/clickhouse` (~700MB), which means you can stop/remove the container without having to reload the data\nwhen you turn it back on.\n\n#### Drill\n[Drill](https://drill.apache.org/) is a framework that allows for SQL queries directly on files, without having to declare any schema.\nIt is usually used on a computing cluster with massive datasets, but we use a single-node setup. To install and run:\n\n`docker run --name drill -id -p 8047:8047 -p 31010:31010 -v ~/boxball/drill:/data doublewick/boxball:drill-latest`\n \nData will be immediately available to query after the image is downloaded. 
Use port `8047` to access the Web UI \n(which includes a SQL runner) and port `31010` to connect via a database client.\nYou may also attach the container and query from the command line.\nThe data will be persisted on your machine in `~/boxball/drill` (~700MB).\n\n### Traditional (Row-oriented) Databases\nNote: these frameworks are likely to be prohibitively slow when querying play-by-play data, and they take up significantly\nmore disk space than their columnar counterparts.\n#### Postgres\nSimilar configuration to the cstore_fdw extended version above, but stored in the conventional way.\n\n`docker run --name postgres -d -p 5432:5432 -e POSTGRES_PASSWORD=\"postgres\" -v ~/boxball/postgres:/var/lib/postgresql/data doublewick/boxball:postgres-latest`\n\nRoughly 90 minutes after the image is downloaded, the data will be fully loaded into the database,\nand you can connect to it as the user `postgres` with password `postgres` on port `5432`\n(either using the `psql` command line tool or a database client of your choice). The data will be persisted on your machine in\n`~/boxball/postgres` (~12GB), which means you can stop/remove the container without having to reload the data\nwhen you turn it back on.\n\n#### MySQL\nTo install and run:\n\n`docker run --name mysql -d -p 3306:3306 -v ~/boxball/mysql:/var/lib/mysql doublewick/boxball:mysql-latest`\n\nRoughly two hours after the image is downloaded, the data will be fully loaded into the database,\nand you can connect to it as the user `root` on port `3306`. The data will be persisted on your machine in\n`~/boxball/mysql` (~12GB), which means you can stop/remove the container without having to reload the data\nwhen you turn it back on.\n\n#### SQLite (with web UI)\nTo install and run:\n\n`docker run --name sqlite -d -p 8080:8080 -v ~/boxball/sqlite:/db doublewick/boxball:sqlite-latest`\n\nRoughly two minutes after the image is downloaded, the data will be fully loaded into the database. 
`localhost:8080`\nwill provide a [web UI](https://github.com/coleifer/sqlite-web) where you can write queries and perform schema exploration.\n\n### Flat File Downloads\n\n#### Parquet\nParquet is a columnar data format originally developed for the Hadoop ecosystem. It has solid support in Spark, Pandas,\nand many other frameworks.\n[OneDrive](https://1drv.ms/u/s!AtpEocFNRNBWhAqZMaj40Bb8__6u?e=dNJiod)\n\n#### CSV\nThe original CSVs from the extract step (each CSV file is compressed in the ZSTD format).\n[OneDrive](https://1drv.ms/u/s!AtpEocFNRNBWhDLuZqcmXYOIieKQ?e=xP4Azs)\n\n## Acknowledgements\nTed Turocy's [Chadwick Bureau](http://chadwick-bureau.com/) developed the tools and repos that made this project possible. I am also grateful to [Sean\nLahman](http://www.seanlahman.com/) for creating his database, which I have been using for over 15 years. I was able\nto develop and host this project for free thanks to the generous open-source plans of [JetBrains](https://www.jetbrains.com/?from=boxball), CircleCI, GitHub, and Docker Hub.\n\nRetrosheet represents the collective effort of thousands of baseball fans over 150 years of scorekeeping and data entry.\nI hope Boxball facilitates more historical research to continue this tradition.\n\n## Licence(s)\nAll code is released under the Apache 2.0 license. Baseball Databank data is distributed under the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)\nlicense. Retrosheet data is released under the condition that the below text appear prominently:\n\n``` \nThe information used here was obtained free of\ncharge from and is copyrighted by Retrosheet.  Interested\nparties may contact Retrosheet at \"www.retrosheet.org\".\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdroher%2Fboxball","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdroher%2Fboxball","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdroher%2Fboxball/lists"}