https://github.com/nacnudus/nzcrash
An R package to distribute New Zealand crash data in a convenient form
dataset government new-zealand r roads
- Host: GitHub
- URL: https://github.com/nacnudus/nzcrash
- Owner: nacnudus
- License: other
- Created: 2015-07-20T11:09:23.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2017-03-05T19:59:43.000Z (about 8 years ago)
- Last Synced: 2025-04-01T18:23:28.854Z (about 1 month ago)
- Topics: dataset, government, new-zealand, r, roads
- Language: R
- Homepage: https://github.com/nacnudus/nzcrash
- Size: 19.6 MB
- Stars: 4
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
README
---
output:
  md_document:
    variant: markdown_github
---

```{r, message = FALSE}
library(nzcrash)
library(dplyr)
library(tidyr)
library(magrittr)
library(stringr)
library(ggplot2)
library(scales)
library(lubridate)
```

# nzcrash

This package redistributes [crash
statistics](http://www.nzta.govt.nz/resources/crash-analysis-system-data/)
already available from the New Zealand Transport Agency, but in a more
convenient form.

It's a large package (over 20 megabytes, compressed).

## Datasets
The `crashes` dataset describes most facts about a crash. The datasets `causes`,
`vehicles`, and `objects_struck` describe facts that are in a many-to-one
relationship with crashes. They can be joined to the `crashes` dataset by the
common `id` column. The `causes` dataset can additionally be joined to the
`vehicles` dataset by the combination of the `id` and `vehicle_id` columns.
This is most useful when the resulting table is also joined to the `crashes`
dataset.

## Up-to-date-ness

The data was last scraped from the NZTA website on `r Sys.Date()`. At
that time, the NZTA had published data up to `r max(crashes$date)`.

```{r}
dim(crashes)
dim(causes)
dim(vehicles)
dim(objects_struck)
```

## Accuracy

The [NZTA](http://www.transport.govt.nz/research/roadtoll/#5) doesn't agree with [itself](http://www.transport.govt.nz/research/roadtoll/annualroadtollhistoricalinformation/) about recent annual road tolls, and this dataset gives a third opinion.

```{r}
crashes %>%
  filter(severity == "fatal") %>%
  group_by(year = year(date)) %>%
  summarize(fatalities = sum(fatalities))
```

## Severity

Crashes are categorised as "fatal", "serious", "minor" or "non-injury", based on the
casualties. If there are any fatalities, then the crash is a "fatal" crash,
otherwise if there are any 'severe' injuries, the crash is a "serious" crash.

The definition of a 'severe' injury is not clear.
Minor and non-injury crashes are likely to be under-recorded since they often do
not involve the police, who write most of the crash reports upon which these
datasets are based.

A common mistake is to confuse the number of fatal crashes with the number of
fatalities.

```{r}
crashes %>% filter(severity == "fatal") %>% nrow
sum(crashes$fatalities)
```

## Dates and times

Three columns of the `crashes` dataset describe the date and time of the crash
in the New Zealand time zone (`Pacific/Auckland`).

* `date` gives the date without the time.
* `time` gives the time where this is available, and `NA` otherwise. Times are
  stored as date-times on the first of January, 1970.
* `datetime` gives the date and time in one value when both are available, and
  `NA` otherwise. `date` is always available, but `time` is not.

When aggregating by some function of the date, e.g. by year, always start
from the `date` column unless you also need the time. This guards against
accidentally dropping crashes where a time is not recorded.

```{r, fig.show = "hold"}
crashes %>%
  filter(is.na(time)) %>%
  count(year = year(date)) %>%
  ggplot(aes(year, n)) +
  geom_line() +
  ggtitle("Crashes missing\ntime-of-day information")

crashes %>%
  group_by(year = year(date)) %>%
  # Share of each year's crashes that lack a time (not each year's share of all such crashes)
  summarize(percent = mean(is.na(time))) %>%
  ggplot(aes(year, percent)) +
  geom_line() +
  scale_y_continuous(labels = percent) +
  ggtitle("Percent of crashes missing\ntime-of-day information")
```

## Location coordinates

`r percent(nrow(filter(crashes, !is.na(easting)))/nrow(crashes))` of
crashes have coordinates. These have been converted from the NZTM projection to
WGS84 coordinates for convenience with packages like `ggmap`.

Because New Zealand is tall and skinny, you can easily spot the main population
centres with a simple histogram of the `northing` column.

```{r}
crashes %>%
  ggplot(aes(northing)) +
  geom_histogram(binwidth = .1)
```

## Vehicles

There can be many vehicles in one crash, so vehicles are recorded in a separate
`vehicles` dataset that can be joined to `crashes` by the common `id` column.

```{r}
crashes %>%
  inner_join(vehicles, by = "id") %>%
  count(vehicle) %>%
  arrange(desc(n))
```

## Objects struck

There can be many objects struck in one crash, so these are recorded in a separate
`objects_struck` dataset that can be joined to `crashes` by the common `id` column.

Q: Which are more often fatal, trees or lamp posts?
```{r}
crashes %>%
  inner_join(objects_struck, by = "id") %>%
  filter(object %in% c("Trees, shrubbery of a substantial nature",
                       "Utility pole, includes lighting columns"),
         severity != "non-injury") %>% # non-injury crashes are poorly recorded
  count(object, severity) %>%
  group_by(object) %>%
  mutate(percent = n/sum(n)) %>%
  select(-n) %>%
  spread(severity, percent)
```

A: Trees (Don't worry, I know it's harder than that.)

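To see what the `count`, `mutate` and `spread` steps above produce, here is a minimal sketch on a made-up table. The `toy` data frame and all of its values are hypothetical and are not taken from the package; only the column names `object` and `severity` mirror the real dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per (crash, object struck); values are made up
toy <- data.frame(
  object   = c("tree", "tree", "tree", "tree", "pole", "pole", "pole", "pole"),
  severity = c("fatal", "fatal", "serious", "minor", "fatal", "minor", "minor", "minor"),
  stringsAsFactors = FALSE
)

shares <- toy %>%
  count(object, severity) %>%          # rows per object and severity
  group_by(object) %>%
  mutate(percent = n / sum(n)) %>%     # share within each object
  ungroup() %>%
  select(-n) %>%
  spread(severity, percent, fill = 0)  # one column per severity; absent combinations become 0

shares
```

Each row of `shares` sums to 1, so the severity columns can be compared across objects even when one kind of object is struck far more often than the other.
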
## Causes
Causes can be joined either to the `crashes` dataset (by the common `id`
column), or to the `vehicles` dataset (by both of the common `id` and
`vehicle_id` columns).

The main cause groups are given in the `cause_category` column.

```{r}
crashes %>%
  inner_join(causes, by = "id") %>%
  group_by(cause_category, id) %>%
  tally %>%
  group_by(cause_category) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  mutate(cause_category = factor(cause_category, levels = cause_category)) %>%
  ggplot(aes(cause_category, n)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))
```

That's odd -- where are speed, alcohol, and restraints? They're given in `cause_subcategory`.

```{r}
causes %>%
  filter(cause_subcategory == "Too fast for conditions") %>%
  count(cause) %>%
  arrange(desc(n))
```

There's nothing there about speed limit violations, because it's impossible to tell what speed a
vehicle was going at when it crashed.

More worryingly, how is "Alcohol test below limit" a cause for a crash?
Hopefully they filter those out when making policy decisions.

```{r}
levels(causes$cause) <- # Wrap long axis labels
  str_wrap(levels(causes$cause), 13)
crashes %>%
  inner_join(causes, by = "id") %>%
  filter(cause_subcategory %in% c("Alcohol or drugs")) %>%
  group_by(cause, id) %>%
  tally %>%
  group_by(cause) %>%
  summarize(n = n()) %>% # This extra step deals with many causes per crash
  arrange(desc(n)) %>%
  mutate(cause = factor(cause, levels = cause)) %>%
  ggplot(aes(cause, n)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))
rm(causes) # Because we messed up the factor levels
```

This time, join `causes` to both `vehicles` and `crashes` to assess the
drunken cyclist menace.

```{r}
crashes %>%
  filter(severity == "fatal") %>%
  select(id) %>%
  inner_join(vehicles, by = "id") %>%
  filter(vehicle == "Bicycle") %>%
  inner_join(causes, by = c("id", "vehicle_id")) %>%
  count(cause) %>%
  arrange(desc(n))
```

I think we all know what "Wandering or wobbling" means.
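
Joining by both `id` and `vehicle_id` matters because a crash can involve several vehicles, each with its own causes. The sketch below uses three small made-up tables to show the difference. The `*_toy` data frames and their values (including the shortened cause labels) are hypothetical; only the key column names follow the datasets described in this README.

```r
library(dplyr)

# Hypothetical toy tables; values are made up for illustration only
crashes_toy <- data.frame(
  id = c(1, 2),
  severity = c("fatal", "minor"),
  stringsAsFactors = FALSE
)
vehicles_toy <- data.frame(
  id = c(1, 1, 2),
  vehicle_id = c(1, 2, 1),
  vehicle = c("Car", "Bicycle", "Car"),
  stringsAsFactors = FALSE
)
causes_toy <- data.frame(
  id = c(1, 1, 2),
  vehicle_id = c(1, 2, 1),
  cause = c("Too fast", "Wandering", "Too fast"),
  stringsAsFactors = FALSE
)

# Joining on id alone pairs every cause in a crash with every vehicle in it.
# Recent dplyr versions warn about this many-to-many match, hence suppressWarnings().
by_id <- suppressWarnings(inner_join(vehicles_toy, causes_toy, by = "id"))

# Joining on both keys attributes each cause to the vehicle it was recorded against.
by_both <- crashes_toy %>%
  filter(severity == "fatal") %>%
  select(id) %>%
  inner_join(vehicles_toy, by = "id") %>%
  filter(vehicle == "Bicycle") %>%
  inner_join(causes_toy, by = c("id", "vehicle_id"))

nrow(by_id)
by_both
```

In the toy data the `id`-only join yields more rows than there are causes, because the car's cause is also paired with the bicycle and vice versa. The two-key join keeps only the cause recorded against the bicycle.
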