{"id":15037975,"url":"https://github.com/jldbc/pybaseball","last_synced_at":"2025-05-14T12:10:49.878Z","repository":{"id":37344653,"uuid":"94890652","full_name":"jldbc/pybaseball","owner":"jldbc","description":"Pull current and historical baseball statistics using Python (Statcast, Baseball Reference, FanGraphs)","archived":false,"fork":false,"pushed_at":"2024-09-24T18:57:45.000Z","size":7452,"stargazers_count":1406,"open_issues_count":128,"forks_count":363,"subscribers_count":76,"default_branch":"master","last_synced_at":"2025-05-10T01:06:21.747Z","etag":null,"topics":["baseball","data","python","sabermetrics","statcast"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jldbc.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"contributing.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-06-20T12:50:19.000Z","updated_at":"2025-05-08T18:42:33.000Z","dependencies_parsed_at":"2023-02-16T20:31:46.429Z","dependency_job_id":"0a7da48b-f2c2-430d-b578-3774b971f6ac","html_url":"https://github.com/jldbc/pybaseball","commit_stats":{"total_commits":249,"total_committers":41,"mean_commits":6.073170731707317,"dds":0.642570281124498,"last_synced_commit":"a0e5c4a48be273e766c0338d5fbd21de00d09cf7"},"previous_names":[],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jldbc%2Fpybaseball","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jldbc%2Fpybaseball/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/jldbc%2Fpybaseball/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jldbc%2Fpybaseball/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jldbc","download_url":"https://codeload.github.com/jldbc/pybaseball/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254140760,"owners_count":22021219,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["baseball","data","python","sabermetrics","statcast"],"created_at":"2024-09-24T20:36:38.267Z","updated_at":"2025-05-14T12:10:49.831Z","avatar_url":"https://github.com/jldbc.png","language":"Python","readme":"# pybaseball\n\nBaseball data scraping and analysis tools in python\n\n## Overview\n\n`pybaseball` is a Python package for baseball data analysis. This package scrapes Baseball Reference, Baseball Savant, and FanGraphs so you don't have to. The package retrieves statcast data, pitching stats, batting stats, division standings/team records, awards data, and more. Data is available at the individual pitch level, as well as aggregated at the season level and over custom time periods. 
See the [docs](https://github.com/jldbc/pybaseball/tree/master/docs) for a comprehensive list of data acquisition functions.\n\n## Installation\n\nPybaseball can be installed via pip:\n\n```bash\npip install pybaseball\n```\n\nor from the repo (which may at times be more up to date):\n\n```bash\ngit clone https://github.com/jldbc/pybaseball\ncd pybaseball\npip install -e .\n```\n\nWe will try to publish periodic updates through the 'releases' and PyPI CI, but they may lag behind the repo at times.\n\n## Community\n\nDiscussion about pybaseball use and development is hosted on our group Discord; the sign-up link is [here](https://discord.gg/TnJVyUDDn8). Issues with the codebase should still be raised and addressed on GitHub.\n\n## Documentation\n\nFull documentation on available functions and their arguments, along with examples, is located in the [docs](https://github.com/jldbc/pybaseball/tree/master/docs) folder. This section contains a brief overview of the main functionality of this library.\n\n\n### Statcast: Pull advanced metrics from Major League Baseball's Statcast system\n\nStatcast data include pitch-level information, pulled from baseballsavant.com.\n\n```python\n\u003e\u003e\u003e from pybaseball import statcast\n\u003e\u003e\u003e statcast(start_dt=\"2019-06-24\", end_dt=\"2019-06-25\").columns\nIndex(['pitch_type', 'game_date', 'release_speed', 'release_pos_x',\n       'release_pos_z', 'player_name', 'batter', 'pitcher', 'events',\n       'description', 'spin_dir', 'spin_rate_deprecated',\n       'break_angle_deprecated', 'break_length_deprecated', 'zone', 'des',\n       'game_type', 'stand', 'p_throws', 'home_team', 'away_team', 'type',\n       'hit_location', 'bb_type', 'balls', 'strikes', 'game_year', 'pfx_x',\n       'pfx_z', 'plate_x', 'plate_z', 'on_3b', 'on_2b', 'on_1b',\n       'outs_when_up', 'inning', 'inning_topbot', 'hc_x', 'hc_y',\n       'tfs_deprecated', 'tfs_zulu_deprecated', 'fielder_2', 'umpire', 'sv_id',\n       'vx0', 'vy0', 'vz0', 'ax', 'ay', 'az', 'sz_top', 
'sz_bot',\n       'hit_distance_sc', 'launch_speed', 'launch_angle', 'effective_speed',\n       'release_spin_rate', 'release_extension', 'game_pk', 'pitcher.1',\n       'fielder_2.1', 'fielder_3', 'fielder_4', 'fielder_5', 'fielder_6',\n       'fielder_7', 'fielder_8', 'fielder_9', 'release_pos_y',\n       'estimated_ba_using_speedangle', 'estimated_woba_using_speedangle',\n       'woba_value', 'woba_denom', 'babip_value', 'iso_value',\n       'launch_speed_angle', 'at_bat_number', 'pitch_number', 'pitch_name',\n       'home_score', 'away_score', 'bat_score', 'fld_score', 'post_away_score',\n       'post_home_score', 'post_bat_score', 'post_fld_score',\n       'if_fielding_alignment', 'of_fielding_alignment', 'spin_axis',\n       'delta_home_win_exp', 'delta_run_exp'],\n      dtype='object')\n```\n\nFor documentation on the definitions of these columns, see the [Statcast Search CSV Documentation](https://baseballsavant.mlb.com/csv-docs).\n\nIf `start_dt` and `end_dt` are supplied, it will return all statcast data between those two dates. If not, it will return yesterday's data. The optional argument `verbose` will control whether the library updates you on its progress while it pulls the data.\n\n#### Player-Specific Queries\n\nFor a player-specific statcast query, pull pitching or batting data using the `statcast_pitcher` and `statcast_batter` functions. These take the same `start_dt` and `end_dt` arguments as the statcast function, as well as a `player_id` argument. This ID comes from MLB Advanced Media, and can be obtained using the function `playerid_lookup`. The returned columns match the set above, but filtered to rows for that specific pitcher or batter. 
A complete example:\n\n```python\n# Find Clayton Kershaw's player ID\nfrom pybaseball import playerid_lookup\nfrom pybaseball import statcast_pitcher\nplayerid_lookup('kershaw', 'clayton')\n  name_last name_first  key_mlbam key_retro  key_bbref  key_fangraphs  mlb_played_first  mlb_played_last\n0   kershaw    clayton     477132  kersc001  kershcl01           2036            2008.0           2022.0\n\n# His MLBAM ID is 477132, so we pass it as the player_id argument to statcast_pitcher\nkershaw_stats = statcast_pitcher('2017-06-01', '2017-07-01', 477132)\nkershaw_stats.groupby(\"pitch_type\").release_speed.agg(\"mean\")\npitch_type\nCH    86.725000\nCU    73.133333\nFF    92.844622\nSI    94.515385\nSL    87.962381\nName: release_speed, dtype: float64\n```\n\n#### A note on Statcast data\n\nStatcast data is subject to change (even for prior seasons):\n\n\u003cdiv\u003e\n   \u003cblockquote class=\"twitter-tweet\"\u003e\n      \u003cp lang=\"en\" dir=\"ltr\"\u003e\n         Each season has 700,000+ pitches, and is subject to update. You should code accordingly.\n      \u003c/p\u003e\u0026mdash; Tangotiger (@tangotiger)\n      \u003ca href=\"https://twitter.com/tangotiger/status/1362064972025634821?ref_src=twsrc%5Etfw\"\u003eFebruary 17, 2021\u003c/a\u003e\n   \u003c/blockquote\u003e\n\u003c/div\u003e\n\n### Aggregate Statistics\n\nFor league-wide season-level pitching data, use the function `pitching_stats(start_season, end_season)`. This returns one row per player per season and provides all metrics made available by FanGraphs.\n\nFor a fixed range, `pitching_stats_range(start_dt, end_dt)` pulls data for a specific time interval from Baseball Reference. 
Note that all dates should be in `YYYY-MM-DD` format.\n\n```python\nfrom pybaseball import pitching_stats\ndata = pitching_stats(2014, 2016)\ndata.columns\nIndex(['IDfg', 'Season', 'Name', 'Team', 'Age', 'W', 'L', 'WAR', 'ERA', 'G',\n       ...\n       'LA', 'Barrels', 'Barrel%', 'maxEV', 'HardHit', 'HardHit%', 'Events',\n       'CStr%', 'CSW%', 'xERA'],\n      dtype='object', length=334)\n```\n\nBatting stats are obtained similarly. The function for season-level stats is `batting_stats(start_season, end_season)`, and for a particular time range it is `batting_stats_range(start_dt, end_dt)`. The Baseball Reference equivalent for season-level data is `batting_stats_bref(season)`.\n\n(For season-level queries, if you prefer Baseball Reference to FanGraphs, there is a third option, `pitching_stats_bref(season)`. This works the same as `pitching_stats`, but retrieves its data from Baseball Reference instead. This is *not recommended*, however, because the Baseball Reference query currently can only retrieve one season's worth of data per request.)\n\n### Game-by-Game Results and Schedule\n\nThe `schedule_and_record` function returns a team's game-by-game results for a given season. The function's only two arguments are `season` and `team`, where `team` is the team's abbreviation (e.g. NYY for New York Yankees).\n\n```python\n# Example: Say we want to know the 1927 Yankees' record on May 16\nfrom pybaseball import schedule_and_record\ndata = schedule_and_record(1927, 'NYY')\ndata.loc[data.Date.str.contains(\"May 16\"), :]\n              Date   Tm Home_Away  Opp W/L    R   RA  Inn   W-L  Rank      GB      Win      Loss   Save  Time D/N  Attendance   cLI  Streak Orig. 
Scheduled\n28  Monday, May 16  NYY         @  DET   W  6.0  2.0  9.0  19-8   1.0  up 3.0  Ruether  Holloway  Moore  2:28   D      4000.0  5.15       5            None\n```\n\n\n### Standings: up to date or historical division standings, W/L records\n\nThe `standings(season)` function gives division standings for a given season. If the current season is chosen, it will give the most current set of standings. Otherwise, it will give the end-of-season standings for each division for the chosen season. This function returns a list of dataframes. Each dataframe is the standings for one of MLB's six divisions. \n\n```python\n\u003e\u003e\u003e from pybaseball import standings\n\u003e\u003e\u003e data = standings(2016)[4]\n\u003e\u003e\u003e print(data)\n                    Tm    W   L  W-L%    GB\n1         Chicago Cubs  103  58  .640    --\n2  St. Louis Cardinals   86  76  .531  17.5\n3   Pittsburgh Pirates   78  83  .484  25.0\n4    Milwaukee Brewers   73  89  .451  30.5\n5      Cincinnati Reds   68  94  .420  35.5\n```\n\n### Caching\n\nTo facilitate faster data retrieval for repeated calls, a local data cache may be used to save a local copy of the\nrequested data. By default the cache is disabled so as to respect a user's potential desire to not have their hard drive\nspace used without their permission. However, enabling the cache is simple.\n\nCache can be turned on by including the pybaseball.cache module and enabling the cache option like so:\n\n```python\nfrom pybaseball import cache\n\ncache.enable()\n```\n\n## FAQ\n\n### Stale Cache\n\nIf you call a statcast method for a future date, the cache will log empty datasets for those dates. 
If you're not getting the results you expect for a given date, first try clearing your cache:\n\n```python\nfrom pybaseball import cache\ncache.purge()\n```\n\n### Multiprocessing\n\nIf you get a `concurrent.futures.process.BrokenProcessPool` error, wrap your call in an `if __name__ == '__main__':` guard, e.g.\n\n```python\nfrom pybaseball import statcast\n\nif __name__ == '__main__':\n    stats = statcast()\n```\n\nThis may be necessary on systems that use spawn-based processes (often Windows and macOS).\n\nFor other problems, please submit an issue.\n\n## Contributing\n\nSee [contributing.md](https://github.com/jldbc/pybaseball/tree/master/contributing.md) for a guide to contributing to this library.\n\n\n------\n\n## Credit\n\nThis package was developed by James LeDoux and is maintained by [Moshe Schorr](https://github.com/schorrm).\n\nThis package was inspired by Bill Petti's excellent R package [baseballr](https://github.com/billpetti/baseballr), which at the time of this package's development had no Python equivalent. Our hope is to fill that void with this package.\n\nThe Lahman data comes from [Sean Lahman's baseball database](http://www.seanlahman.com/baseball-archive/statistics/).\n\nAll other data comes from FanGraphs, Baseball Reference, the Chadwick Bureau, Retrosheet, and Baseball Savant.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjldbc%2Fpybaseball","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjldbc%2Fpybaseball","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjldbc%2Fpybaseball/lists"}