https://github.com/nacnudus/nzcrash

An R package to distribute New Zealand crash data in a convenient form

dataset government new-zealand r roads

Host: GitHub
URL: https://github.com/nacnudus/nzcrash
Owner: nacnudus
License: other
Created: 2015-07-20T11:09:23.000Z (almost 10 years ago)
Default Branch: master
Last Pushed: 2017-03-05T19:59:43.000Z (about 8 years ago)
Last Synced: 2025-04-01T18:23:28.854Z (about 1 month ago)
Topics: dataset, government, new-zealand, r, roads
Language: R
Homepage: https://github.com/nacnudus/nzcrash
Size: 19.6 MB
Stars: 4
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.Rmd

---
output:
  md_document:
    variant: markdown_github
---

```{r, message = FALSE}
library(nzcrash)
library(dplyr)
library(tidyr)
library(magrittr)
library(stringr)
library(ggplot2)
library(scales)
library(lubridate)
```

# nzcrash

This package redistributes [crash
statistics](http://www.nzta.govt.nz/resources/crash-analysis-system-data/)
already available from the New Zealand Transport Agency, but in a more
convenient form.

It's a large package (over 20 megabytes, compressed).

## Datasets

The `crashes` dataset describes most facts about a crash.  The datasets `causes`,
`vehicles`, and `objects_struck` describe facts that are in a many-to-one
relationship with crashes.  They can be joined to the `crashes` dataset by the
common `id` column.  The `causes` dataset can additionally be joined to the
`vehicles` dataset by the combination of the `id` and `vehicle_id` columns.
This is most useful when the resulting table is also joined to the `crashes`
dataset.
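To make the join keys concrete, here is a minimal sketch in base R. The three
toy tables and their values are made up for illustration; only the key columns
`id` and `vehicle_id` follow the description above, and the real datasets have
many more columns.

```r
# Hypothetical toy tables that mimic only the key columns described above.
crashes_toy  <- data.frame(id = c(1, 2),
                           severity = c("fatal", "minor"))
vehicles_toy <- data.frame(id = c(1, 1, 2),
                           vehicle_id = c(1, 2, 1),
                           vehicle = c("Car", "Bicycle", "Car"))
causes_toy   <- data.frame(id = c(1, 1, 2),
                           vehicle_id = c(1, 2, 1),
                           cause = c("cause A", "cause B", "cause A"))

# Join causes to vehicles on the compound key, then to crashes on id.
vehicle_causes <- merge(vehicles_toy, causes_toy, by = c("id", "vehicle_id"))
joined <- merge(crashes_toy, vehicle_causes, by = "id")
print(joined)
```

With `dplyr`, the same joins would be written with `inner_join()` and the same
`by` arguments.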

## Up-to-date-ness

The data was last scraped from the NZTA website on `r Sys.Date()`.  At
that time, the NZTA had published data up to `r max(crashes$date)`.

```{r}
dim(crashes)
dim(causes)
dim(vehicles)
dim(objects_struck)
```

## Accuracy

The [NZTA](http://www.transport.govt.nz/research/roadtoll/#5) doesn't agree with
[itself](http://www.transport.govt.nz/research/roadtoll/annualroadtollhistoricalinformation/)
about recent annual road tolls, and this dataset gives a third opinion.

```{r}
crashes %>%
  filter(severity == "fatal") %>%
  group_by(year = year(date)) %>%
  summarize(fatalities = sum(fatalities))
```

## Severity

Crashes are categorised as "fatal", "serious", "minor" or "non-injury", based on
the casualties.  If there are any fatalities, then the crash is a "fatal" crash;
otherwise, if there are any 'severe' injuries, the crash is a "serious" crash.
The definition of a 'severe' injury is not clear.

Minor and non-injury crashes are likely to be under-recorded since they often do
not involve the police, who write most of the crash reports upon which these
datasets are based.

A common mistake is to confuse the number of fatal crashes with the number of
fatalities.

```{r}
crashes %>% filter(severity == "fatal") %>% nrow()  # number of fatal crashes
sum(crashes$fatalities)                             # number of people killed
```
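The difference is easy to see on a small made-up table.  This is only a sketch:
the rows below are invented, and only the `severity` and `fatalities` column
names come from the `crashes` dataset.

```r
# Hypothetical rows: two fatal crashes, one of which killed three people.
toy <- data.frame(
  id         = 1:4,
  severity   = c("fatal", "fatal", "serious", "minor"),
  fatalities = c(1, 3, 0, 0)
)

fatal_crashes    <- sum(toy$severity == "fatal")  # counts crashes
total_fatalities <- sum(toy$fatalities)           # counts people

print(c(fatal_crashes = fatal_crashes, total_fatalities = total_fatalities))
```

Here two fatal crashes account for four fatalities, so the two numbers should
not be used interchangeably.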

## Dates and times

Three columns of the `crashes` dataset describe the date and time of the crash
in the NZST time zone (Pacific/Auckland).

* `date` gives the date without the time.
* `time` gives the time where this is available, and NA otherwise.  Times are
  stored as date-times on the first of January, 1970.
* `datetime` gives the date and time in one value when both are available, and
  NA otherwise.  `date` is always available; `time` is not.
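Because `time` is stored as a date-time on 1 January 1970, the time of day has
to be extracted from it before use.  A minimal base R sketch, using made-up
values rather than the real column (the variable names are hypothetical):

```r
# Hypothetical time-of-day values stored the way the list above describes:
# a date-time on 1970-01-01, or NA when no time was recorded.
time_toy <- as.POSIXct(c("1970-01-01 17:45:00", NA),
                       tz = "Pacific/Auckland")

# format() uses the time zone attached to the values, so the hour is local.
hour_of_day <- as.integer(format(time_toy, "%H"))
minutes_since_midnight <- hour_of_day * 60L +
  as.integer(format(time_toy, "%M"))

print(hour_of_day)
print(minutes_since_midnight)
```

Missing times stay `NA` throughout, so they need to be handled explicitly in
any later summary.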

When aggregating by some function of the date, e.g. by year, always start
from the `date` column unless you also need the time.  This guards against
accidentally dropping crashes where a time is not recorded.

```{r, fig.show = "hold"}
# Number of crashes per year with no recorded time of day
crashes %>%
  filter(is.na(time)) %>%
  count(year = year(date)) %>%
  ggplot(aes(year, n)) +
  geom_line() +
  ggtitle("Crashes missing\ntime-of-day information")

# Share of each year's crashes with no recorded time of day
crashes %>%
  group_by(year = year(date)) %>%
  summarize(pct_missing = mean(is.na(time))) %>%
  ggplot(aes(year, pct_missing)) +
  geom_line() +
  scale_y_continuous(labels = percent) +
  ggtitle("Percent of crashes missing\ntime-of-day information")
```
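The advice about starting from `date` can be checked on a made-up example.
This sketch uses base R only; the three rows are invented, and `datetime` is
rebuilt here by hand purely for illustration.

```r
# Hypothetical crashes: the second one has no recorded time.
toy <- data.frame(
  date = as.Date(c("2014-03-01", "2014-07-15", "2015-01-20")),
  time = c("08:30", NA, "22:10")
)

# A string ending in "NA" does not match the format, so it parses to NA,
# which mimics a datetime that is missing whenever the time is missing.
toy$datetime <- as.POSIXct(paste(toy$date, toy$time),
                           format = "%Y-%m-%d %H:%M",
                           tz = "Pacific/Auckland")

by_date     <- table(format(toy$date, "%Y"))      # counts every crash
by_datetime <- table(format(toy$datetime, "%Y"))  # silently drops the NA row

print(by_date)
print(by_datetime)
```

Counting by `datetime` loses one of the two 2014 crashes without any warning,
which is exactly the mistake that starting from `date` avoids.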

## Location coordinates

`r percent(nrow(filter(crashes, !is.na(easting)))/nrow(crashes))` of
crashes have coordinates.  These have been converted from the NZTM projection to
WGS84 coordinates for convenience with packages like `ggmap`.

Because New Zealand is tall and skinny, you can easily spot the main population
centres with a simple histogram of `northing`.

```{r}
crashes %>%
  ggplot(aes(northing)) +
  geom_histogram(binwidth = .1)
```

## Vehicles

There can be many vehicles in one crash, so vehicles are recorded in a separate
`vehicles` dataset that can be joined to `crashes` by the common `id` column.

```{r}
crashes %>%
  inner_join(vehicles, by = "id") %>%
  count(vehicle) %>%
  arrange(desc(n))
```

## Objects struck

There can be many objects struck in one crash, so these are recorded in a separate
`objects_struck` dataset that can be joined to `crashes` by the common `id` column.

Q: Which are more fatal, trees or lamp posts?

```{r}
crashes %>%
  inner_join(objects_struck, by = "id") %>%
  filter(object %in% c("Trees, shrubbery of a substantial nature",
                       "Utility pole, includes lighting columns"),
         severity != "non-injury") %>% # non-injury crashes are poorly recorded
  count(object, severity) %>%
  group_by(object) %>%
  mutate(percent = n/sum(n)) %>%       # share of each severity within an object
  select(-n) %>%
  spread(severity, percent)
```

A: Trees (Don't worry, I know it's harder than that.)
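The table above holds, for each object, the share of its crashes at each
severity.  As a rough cross-check of that calculation, the same kind of
row-wise share can be computed in base R with `prop.table()`.  The rows and
the short labels "tree" and "pole" below are made up for illustration.

```r
# Hypothetical object/severity pairs, one row per object struck.
toy <- data.frame(
  object   = c("tree", "tree", "tree", "pole", "pole"),
  severity = c("fatal", "serious", "minor", "serious", "minor")
)

# margin = 1 divides each count by its row total, i.e. by object.
shares <- prop.table(table(toy$object, toy$severity), margin = 1)
print(round(shares, 2))
```

Each row sums to 1, so the columns compare the severity mix of one object with
another rather than the raw number of crashes.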

## Causes

Causes can be joined either to the `crashes` dataset (by the common `id`
column), or to the `vehicles` dataset (by both of the common `id` and
`vehicle_id` columns).

The main cause groups are given in the `cause_category` column.

```{r}
crashes %>%
  inner_join(causes, by = "id") %>%
  group_by(cause_category, id) %>%
  tally() %>%                          # one row per crash per category
  group_by(cause_category) %>%
  summarize(n = n()) %>%               # number of crashes in each category
  arrange(desc(n)) %>%
  mutate(cause_category = factor(cause_category, levels = cause_category)) %>%
  ggplot(aes(cause_category, n)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))
```

That's odd -- where are speed, alcohol, and restraints?  They're given in
`cause_subcategory`.

```{r}
causes %>%
  filter(cause_subcategory == "Too fast for conditions") %>%
  count(cause) %>%
  arrange(desc(n))
```

There's nothing there about speed limit violations, because it's impossible to
tell what speed a vehicle was going at when it crashed.

More worryingly, how is "Alcohol test below limit" a cause for a crash?
Hopefully they filter those out when making policy decisions.

```{r}
levels(causes$cause) <-                # Wrap facet labels
  str_wrap(levels(causes$cause), 13)

crashes %>%
  inner_join(causes, by = "id") %>%
  filter(cause_subcategory == "Alcohol or drugs") %>%
  group_by(cause, id) %>%
  tally() %>%
  group_by(cause) %>%
  summarize(n = n()) %>%               # This extra step deals with many causes per crash
  arrange(desc(n)) %>%
  mutate(cause = factor(cause, levels = cause)) %>%
  ggplot(aes(cause, n)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))

rm(causes)                             # Because we messed up the factor levels
```
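The two-step count used in the cause charts (group by cause and `id`, tally,
then count rows) exists so that a crash is counted once per cause group even
when several of its causes fall in the same group.  A small base R sketch with
invented rows shows the difference; the category labels "A" and "B" are
placeholders, not real values from the dataset.

```r
# Hypothetical causes: crash 1 has two causes in category "A".
toy <- data.frame(
  id             = c(1, 1, 1, 2),
  cause_category = c("A", "A", "B", "A")
)

# Naive count: rows per category, so crash 1 is counted twice under "A".
rows_per_category <- table(toy$cause_category)

# Deduplicate (id, category) pairs first, then count crashes per category.
crashes_per_category <- table(unique(toy)$cause_category)

print(rows_per_category)
print(crashes_per_category)
```

With `dplyr`, `distinct(id, cause_category)` followed by
`count(cause_category)` would give the same per-crash counts.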

This time, join `causes` to both `vehicles` and `crashes` to assess the
drunken cyclist menace.

```{r}
crashes %>%
  filter(severity == "fatal") %>%
  select(id) %>%
  inner_join(vehicles, by = "id") %>%
  filter(vehicle == "Bicycle") %>%
  inner_join(causes, by = c("id", "vehicle_id")) %>%
  count(cause) %>%
  arrange(desc(n))
```

I think we all know what "Wandering or wobbling" means.
