{"id":30886967,"url":"https://github.com/linde/retrosheet-spark","last_synced_at":"2026-05-16T09:35:33.382Z","repository":{"id":152654664,"uuid":"55888015","full_name":"linde/retrosheet-spark","owner":"linde","description":" This is a simple project to create data frames in spark for the data contained in retrosheets baseball archive. It is distributed as a python notebook.","archived":false,"fork":false,"pushed_at":"2021-12-13T09:58:53.000Z","size":2347,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-09-08T13:58:52.140Z","etag":null,"topics":["baseball","spark"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/linde.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2016-04-10T08:16:47.000Z","updated_at":"2018-06-02T20:35:43.000Z","dependencies_parsed_at":"2023-08-12T20:46:56.392Z","dependency_job_id":null,"html_url":"https://github.com/linde/retrosheet-spark","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/linde/retrosheet-spark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linde%2Fretrosheet-spark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linde%2Fretrosheet-spark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linde%2Fretrosheet-sp
ark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linde%2Fretrosheet-spark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/linde","download_url":"https://codeload.github.com/linde/retrosheet-spark/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linde%2Fretrosheet-spark/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33097025,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-16T04:41:52.686Z","status":"ssl_error","status_checked_at":"2026-05-16T04:41:52.009Z","response_time":115,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["baseball","spark"],"created_at":"2025-09-08T13:53:24.910Z","updated_at":"2026-05-16T09:35:33.377Z","avatar_url":"https://github.com/linde.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Retrosheet Spark\n\nThis is a simple project to create data frames in spark for the data\ncontained in retrosheets baseball archive. 
It is distributed as a\npython notebook.\n\nThis project owes a lot to the helpful comment here:\nhttp://stackoverflow.com/questions/31227363/creating-spark-data-structure-from-multiline-record\n\n\n## Getting the retrosheets data\n\n```bash\n$ mkdir retrosheet-data\n$ cd retrosheet-data\n$ for yyyy in `seq 1910 10 2010`; do echo getting $yyyy; wget http://www.retrosheet.org/events/${yyyy}seve.zip; done\n$ for yyyy in `seq 1910 10 2010`; do mkdir ${yyyy}seve; done\n$ for yyyy in `seq 1910 10 2010`; do unzip -d ${yyyy}seve  ${yyyy}seve.zip; done\n```\n\n## Some notes on getting up and going...\n\nThis version targets the latest python3 in brew at the moment (Python\n3.6.5) and spark v2.3.0 running pyspark within jupyter. You can do\nsomething akin to the following to get them installed.\n\n\n```bash\n$ brew install apache-spark\n$ brew install python\n$ pip3 install virtualenv\n$ virtualenv .py\n$ . .py/bin/activate\n$ pip install jupyter \n```\n\nIf you want to do the plotting example in [retrosheet-spark-queries.ipynb], then also install the following:\n\n```bash\n$ pip install plotly \n```\n\n## Then finally, to run things\n\n```bash\nPYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook' pyspark  --executor-memory 2GB \n```\n\nThen open the python notebook and execute each cell, then query and enjoy!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinde%2Fretrosheet-spark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flinde%2Fretrosheet-spark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinde%2Fretrosheet-spark/lists"}