Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/tmasjc/atl-salary-data

Simpson's paradox: a case study exercise
https://github.com/tmasjc/atl-salary-data

opendata shinyapps

Last synced: 9 days ago
JSON representation

Simpson's paradox: a case study exercise

Awesome Lists containing this project

README

        

---
title: "Atlanta City Employee Salaries 2015"
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, fig.width = 7)
```

## Overview

This [Shiny app](https://tmasjc.shinyapps.io/atl_salary_data/) explores the Atlanta city employee salaries data for the Year 2015. In particular, the relationship between employee ``salary`` and variables such as ``gender``, ``age``, and ``ethnic group``.

Dataset obtained from [data.world](https://data.world/brentbrewington/atlanta-city-employee-salaries), contributed by [Brent Brewington](https://github.com/bbrewington/atlanta-salary-data).

## Components

- **Selector Panel** which allows user to select points using cursor (click-and-drag)
- **Plot I** displays the relationship of employee ``median salary`` by ``age``, seperated by ``gender``
- **Plot II** displays the relationship of employee ``median salary`` by ``ethnic group``, seperated by ``gender``
- **Table I** which tabulates count at each combination of ``ethnic group`` and ``gender``

![Screen Shot](screen.gif)

## Motivation

This Shiny app is meant as a simple demonstration of how unbalanced dataset can be misleading at times. In statistics, this phenomenon is known as **[Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox)** or the Yule–Simpson effect.

If we look at male and female ``median salary`` at various ``age group``, we might observe that *in general, male earns higher pay than female*.

```{r, warning=FALSE, message=FALSE}
library(dplyr)
library(stringr)
library(ggplot2)

raw <- readRDS("Data/atl_2015.RDS") %>%
rename(ethnic = ethnic.origin, job = job.title, salary = annual.salary) %>%
mutate(gender = factor(sex), ethnic = factor(ethnic)) %>%
as_tibble()

which_less_than <- function(vec, n) {
x <- table(vec) < n
sort(unique(vec))[x]
}
sub_age <- with(raw, age %>% which_less_than(10))
sub_ethnic <- with(raw, ethnic %>% which_less_than(30))

dat <- raw %>%
filter(!(age %in% sub_age), !(ethnic %in% sub_ethnic)) %>%
mutate(ethnic = factor(str_extract(ethnic, "^[A-Z]?[a-z]+")))
```

```{r}
old <- theme_set(theme_light() + theme(plot.title = element_text(size = 11), legend.position = "none"))
```

```{r}
dat %>%
group_by(age, sex) %>%
summarise(salary = median(salary)) %>%
ggplot(aes(age, salary, col = sex, group = sex)) +
scale_x_continuous(breaks = seq(20, 100, 10), limits = c(20, 70)) +
geom_line() +
labs(x = "Age", y = "Median Salary", col = "Sex") +
theme(legend.position = "bottom")
```

However, if we look at male and female ``median salary`` by ``ethnic groups``, we notice that female's median salary does not seem to lag behind. As a matter of fact, *both Asian and White women median salary is actually higher than men by an obvious margin*.

```{r}
dat %>%
group_by(ethnic, sex) %>%
summarise(salary = median(salary)) %>%
ggplot(aes(ethnic, salary, col = sex)) +
geom_line(col = "gray") +
geom_point() +
labs(x = "Ethnic Group", y = "Median Salary", col = "Sex") +
theme(legend.position = "bottom")
```

Tinker around the Shiny app to explore the data and understand the logic behind.

## Deployment

This Shiny app is shipped with Docker, built on [rocker/tidyverse](https://hub.docker.com/r/rocker/tidyverse/) image.

```bash
# from this repo
git clone [email protected]:tmasjc/atl-salary-data.git

# move inside directory
cd atl-salary-data

# name your image
# it may take a while to build
docker build -t atl_salary_data .

# start your container
docker run -dp 3838:3838 atl_salary_data

## You are set. Go to localhost:3838/atl_salary_data to view application. ##
```

Or you can simply deploy it to Shiny server as per normal.