https://github.com/nflverse/nflfastr

A Set of Functions to Efficiently Scrape NFL Play by Play Data

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/readme-"
)
```

# **nflfastR**

[![CRAN status](https://www.r-pkg.org/badges/version-last-release/nflfastR)](https://CRAN.R-project.org/package=nflfastR)
[![CRAN downloads](https://cranlogs.r-pkg.org/badges/grand-total/nflfastR)](https://CRAN.R-project.org/package=nflfastR)
[![Dev status](https://img.shields.io/github/r-package/v/nflverse/nflfastR/master?label=dev%20version&style=flat-square&logo=github)](https://www.nflfastr.com/)
[![R-CMD-check](https://github.com/nflverse/nflfastR/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/nflverse/nflfastR/actions/workflows/R-CMD-check.yaml)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![nflverse support](https://img.shields.io/discord/789805604076126219?color=7289da&label=nflverse%20support&logo=discord&logoColor=fff&style=flat-square)](https://discord.com/invite/5Er2FBnnQa)

`nflfastR` is a set of functions to efficiently scrape NFL play-by-play data. `nflfastR` expands upon the features of nflscrapR:

* The package contains NFL play-by-play data back to 1999
* As suggested by the package name, it obtains games **much** faster
* Includes completion probability (`cp`), completion percentage over expected (`cpoe`), and expected yards after the catch (`xyac_epa` and `xyac_mean_yardage`) in play-by-play going back to 2006
* Includes drive information, including drive starting position and drive result
* Includes series information, including series number and series success
* Hosts [a repository of play-by-play data going back to 1999](https://github.com/nflverse/nflverse-pbp) for very quick access
* Features models for Expected Points, Win Probability, Completion Probability, and Yards After the Catch (see section below)
* Includes a function `update_db()` that creates and updates a database
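
The database workflow from the last bullet could look roughly like the sketch below. It assumes the `DBI` and `RSQLite` packages are installed and that `update_db()` writes an SQLite file named by `dbname` into `dbdir`, with the play-by-play stored in the table `tblname`. `build_pbp_db()` is a hypothetical helper name used for illustration, not part of the package.

```r
# Hypothetical helper: build (or update) a local play-by-play database and
# return the number of plays stored per season.
build_pbp_db <- function(dir = tempdir()) {
  # Downloads every available season on the first run, so this can take a while.
  nflfastR::update_db(dbdir = dir, dbname = "pbp_db", tblname = "nflfastR_pbp")

  # Assumption: the SQLite file lives at <dir>/pbp_db.
  con <- DBI::dbConnect(RSQLite::SQLite(), file.path(dir, "pbp_db"))
  on.exit(DBI::dbDisconnect(con), add = TRUE)

  DBI::dbGetQuery(
    con,
    "SELECT season, COUNT(*) AS plays FROM nflfastR_pbp GROUP BY season ORDER BY season"
  )
}

# Not run here because it needs network access and disk space:
# plays_per_season <- build_pbp_db()
```

Querying through `DBI` keeps the full play-by-play on disk, so only the aggregated result is pulled into memory.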

We owe a debt of gratitude to the original [`nflscrapR`](https://github.com/maksimhorowitz/nflscrapR) team, Maksim Horowitz, Ronald Yurko, and Samuel Ventura, without whose contributions and inspiration this package would not exist.

## Installation

The easiest way to get nflfastR is to install it from [CRAN](https://cran.r-project.org/package=nflfastR) with:

```{r, eval=FALSE}
install.packages("nflfastR")
```

To get a bug fix or to use a feature from the development version, you can install the development version of nflfastR either from [GitHub](https://github.com/nflverse/nflfastR/) with:

```{r eval = FALSE}
if (!require("pak")) install.packages("pak")
pak::pak("nflverse/nflfastR")
```

or prebuilt from the [development repo](https://nflverse.r-universe.dev) with:

```{r eval = FALSE}
install.packages("nflfastR", repos = c("https://nflverse.r-universe.dev", getOption("repos")))
```

## Usage

We provide application examples in the **[Getting Started](https://www.nflfastr.com/articles/nflfastR.html)** article, which assumes a basic knowledge of R. If you are new to R, we recommend starting with the **[nflfastR beginner's guide](https://www.nflfastr.com/articles/beginners_guide.html)**, which introduces both nflfastR and the R needed to use it.

You can find column names and descriptions in the **[Field Descriptions](https://www.nflfastr.com/articles/field_descriptions.html)** article, or by accessing the `field_descriptions` dataframe from the package.
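
As a rough illustration of looking up columns in that dataframe, the sketch below filters a dictionary by field name. The helper `describe_fields()` and the toy dictionary are hypothetical, and the column names `Field` and `Description` are an assumption about how `field_descriptions` is laid out, so inspect it with `names(nflfastR::field_descriptions)` before relying on them.

```r
# Hypothetical helper: return the dictionary rows for the requested columns.
# `dictionary` defaults to the data frame shipped with nflfastR; the column
# names `Field` and `Description` are an assumption about its layout.
describe_fields <- function(fields, dictionary = nflfastR::field_descriptions) {
  dictionary[dictionary$Field %in% fields, c("Field", "Description"), drop = FALSE]
}

# Tiny made-up dictionary so the example runs without the package installed.
toy_dictionary <- data.frame(
  Field = c("cp", "cpoe", "ep"),
  Description = c("placeholder text A", "placeholder text B", "placeholder text C")
)

describe_fields(c("cp", "ep"), dictionary = toy_dictionary)

# With nflfastR installed, the default dictionary is used instead:
# describe_fields(c("cp", "cpoe"))
```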

## Data access

Even though `nflfastR` is very fast, **we recommend downloading the data from the [nflverse-data `pbp` release](https://github.com/nflverse/nflverse-data/releases/tag/pbp) or loading it with the `nflreadr` package**. These data sets contain play-by-play data for complete seasons going back to 1999 and are updated nightly during the season. The files contain both regular season and postseason data, and you can use `game_type` or `week` to determine which games occurred in the postseason.
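
Separating regular season from postseason plays might look like the sketch below. It runs on a few made-up rows rather than real data, and it filters on a `season_type` column with the values `"REG"` and `"POST"`, which is an assumption about the play-by-play layout; check the field descriptions and adapt the column name if your copy differs.

```r
# Made-up rows standing in for play-by-play data. With nflreadr installed,
# `pbp <- nflreadr::load_pbp(2023)` would load a real season instead.
pbp <- data.frame(
  game_id = c("game_a", "game_a", "game_b", "game_c"),
  week = c(1, 1, 18, 19),
  season_type = c("REG", "REG", "REG", "POST")
)

# Assumption: `season_type` is "REG" for regular season and "POST" for playoffs.
regular_season <- pbp[pbp$season_type == "REG", ]
postseason <- pbp[pbp$season_type == "POST", ]

# Count plays per group as a quick sanity check.
table(pbp$season_type)
```

Filtering on a season-type flag avoids hard-coding a week cutoff, which would shift whenever the length of the regular season changes.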

## nflfastR models

`nflfastR` uses its own models for Expected Points, Win Probability, Completion Probability, and Expected Yards After the Catch. To read about the models, please see [this post on Open Source Football](https://opensourcefootball.com/posts/2020-09-28-nflfastr-ep-wp-and-cp-models/). For a more detailed description of the motivation for Expected Points models, we highly recommend [this paper from the nflscrapR team](https://arxiv.org/pdf/1802.00998.pdf).

Here is a visualization of the Expected Points model by down and yardline.

```{r epa-model, warning = FALSE, message = FALSE, results = 'hide', fig.keep = 'all', dpi = 600, echo=FALSE, eval = FALSE}

# This code was used to create the ep model image. Since we don't want to include
# the resulting png file in the package for file size reasons it was uploaded to
# the nflfastR repo and embedded remotely with the next chunk

library(tidyverse)

df <- nflreadr::load_pbp(2014:2019) %>%
  filter(!is.na(posteam) & !is.na(ep), !is.na(down)) %>%
  select(ep, down, yardline_100, air_yards, pass_location, cp)

df %>%
  ggplot(aes(x = yardline_100, y = ep, color = as.factor(down))) +
  geom_smooth(size = 2) +
  labs(
    x = "Yards from opponent's end zone",
    y = "Expected points value",
    color = "Down",
    title = "Expected Points by Yardline and Down"
  ) +
  theme_bw() +
  scale_y_continuous(expand = c(0, 0), breaks = scales::pretty_breaks(10)) +
  scale_x_continuous(expand = c(0, 0), breaks = seq(from = 5, to = 95, by = 10)) +
  theme(
    plot.title = element_text(size = 18, hjust = 0.5),
    plot.subtitle = element_text(size = 16, hjust = 0.5),
    axis.title = element_text(size = 18),
    axis.text = element_text(size = 16),
    legend.text = element_text(size = 16),
    legend.title = element_text(size = 16),
    legend.position = c(.90, .80)
  ) +
  annotate("text", x = 14, y = -2.2, size = 3, label = "2014-2019 | Model: @nflfastR")
```

```{r echo=FALSE, fig.align='center', fig.cap='', out.width='100%'}
knitr::include_graphics('man/figures/readme-epa-model-1.png')
```

Here is a visualization of the Completion Probability model by air yards and pass direction.

```{r cp-model, warning = FALSE, message = FALSE, results = 'hide', fig.keep = 'all', dpi = 600, echo=FALSE, eval = FALSE}

# This code was used to create the cp model image. Since we don't want to include
# the resulting png file in the package for file size reasons it was uploaded to
# the nflfastR repo and embedded remotely with the next chunk

df %>%
  filter(!is.na(cp), between(air_yards, -5, 45)) %>%
  mutate(pass_middle = if_else(pass_location == "middle", "Yes", "No")) %>%
  ggplot(aes(x = air_yards, y = cp, color = as.factor(pass_middle))) +
  geom_smooth(size = 2) +
  labs(
    x = "Air yards",
    y = "Expected completion %",
    color = "Pass middle",
    title = "Expected Completion % by Air Yards and Pass Direction"
  ) +
  theme_bw() +
  scale_y_continuous(expand = c(0, 0), breaks = scales::pretty_breaks(5)) +
  scale_x_continuous(expand = c(0, 0)) +
  theme(
    plot.title = element_text(size = 18, hjust = 0.5),
    plot.subtitle = element_text(size = 16, hjust = 0.5),
    axis.title = element_text(size = 18),
    axis.text = element_text(size = 16),
    legend.text = element_text(size = 16),
    legend.title = element_text(size = 16),
    legend.position = c(.80, .80)
  ) +
  annotate("text", x = 2, y = .32, size = 3, label = "2014-2019 | Model: @nflfastR")
```

```{r echo=FALSE, fig.align='center', fig.cap='', out.width='100%'}
knitr::include_graphics('man/figures/readme-cp-model-1.png')
```
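
To make the link between the completion probability model and `cpoe` concrete, here is a minimal worked sketch on made-up pass attempts. It assumes that `cpoe` is the observed outcome minus the modeled `cp`, scaled to percentage points, and that a `complete_pass` column holds 1 for a completion and 0 otherwise; the passer labels and values are invented for illustration.

```r
# Made-up pass attempts: `cp` is the model's completion probability and
# `complete_pass` is 1 for a completion, 0 otherwise.
passes <- data.frame(
  passer = c("QB A", "QB A", "QB B", "QB B"),
  cp = c(0.80, 0.55, 0.30, 0.65),
  complete_pass = c(1, 0, 1, 1)
)

# Assumption: CPOE is expressed in percentage points, hence the factor of 100.
passes$cpoe <- 100 * (passes$complete_pass - passes$cp)

# Average CPOE per passer.
aggregate(cpoe ~ passer, data = passes, FUN = mean)
```

A completed pass with a low `cp` earns a large positive value, while an incompletion on a high-probability throw is penalized the most.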

`nflfastR` includes two win probability models: one with and one without incorporating the pre-game spread.

## Special thanks

* To Nick Shoemaker for [finding and making available JSON-formatted NFL play-by-play back to 1999](https://github.com/CroppedClamp/nfl_pbps) (`nflfastR` uses this source for 1999 and 2000 and previously also used it for 2001-2010)
* To Lau Sze Yui for developing a scraping function to access JSON-formatted NFL play-by-play beginning in 2001
* To Aaron Schatz and [FTN Fantasy](https://ftnfantasy.com/dvoa/nfl) for providing charting data to correctly mark scrambles in the 1999-2005 seasons
* To Lee Sharpe for curating a resource for game information
* To Timo Riske, Lau Sze Yui, Sean Clement, and Daniel Houston for many helpful discussions regarding the development of the new `nflfastR` models
* To Zach Feldman and Josh Hermsmeyer for many helpful discussions about CPOE models as well as Peter Owen for many helpful suggestions for the CP model
* To Florian Schmitt for the logo design
* To the many users who found and reported bugs in `nflfastR` 1.0
* And of course, the original [`nflscrapR`](https://github.com/maksimhorowitz/nflscrapR) team, Maksim Horowitz, Ronald Yurko, and Samuel Ventura, whose work represented a dramatic step forward for the state of public NFL research