https://github.com/opencasestudies/ocs-bp-rural-and-urban-obesity
OCS (BP): Examine global patterns of obesity across rural and urban regions
bloomberg-ocs-rural bmi case-study data-science data-visualization data-wrangling histograms obesity ocs t-test
- Host: GitHub
- URL: https://github.com/opencasestudies/ocs-bp-rural-and-urban-obesity
- Owner: opencasestudies
- Created: 2019-10-29T20:36:41.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2022-04-27T20:31:44.000Z (about 3 years ago)
- Last Synced: 2025-04-13T15:16:54.955Z (about 1 month ago)
- Topics: bloomberg-ocs-rural, bmi, case-study, data-science, data-visualization, data-wrangling, histograms, obesity, ocs, t-test
- Language: HTML
- Homepage: https://opencasestudies.github.io/ocs-bp-rural-and-urban-obesity/
- Size: 207 MB
- Stars: 2
- Watchers: 3
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
README
---
output: md_document
---

# OpenCaseStudies

### Important links
- Static version: https://www.opencasestudies.org/ocs-bp-rural-and-urban-obesity
- Interactive version: https://rsconnect.biostat.jhsph.edu/ocs-bp-rural-and-urban-obesity-interactive/
- GitHub: https://github.com/opencasestudies/ocs-bp-rural-and-urban-obesity
- Bloomberg American Health Initiative: https://americanhealth.jhu.edu/open-case-studies

### Disclaimer
The purpose of the [Open Case
Studies](https://www.opencasestudies.org/) project is **to demonstrate
the use of various data science methods, tools, and software in the
context of messy, real-world data**. A given case study does not cover
all aspects of the research process, is not claiming to be the most
appropriate way to analyze a given dataset, and should not be used in
the context of making policy decisions without external consultation
from scientific experts.

### License
This case study is part of the [OpenCaseStudies](https://www.opencasestudies.org/) project. This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 ([CC BY-NC 3.0](https://creativecommons.org/licenses/by-nc/3.0/us/)) United States License.
### Citation
To cite this case study:
Wright, Carrie and Meng, Qier and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-bp-rural-and-urban-obesity. Exploring global patterns of obesity across rural and urban regions (Version v1.0.0).
### Acknowledgments
We would like to acknowledge [Jessica Fanzo](https://www.jhsph.edu/faculty/directory/profile/3380/jessica-fanzo) for assisting in framing the major direction of the case study.
We would like to acknowledge [Michael Breshock](https://mbreshock.github.io/) for his contributions to this case study and for developing the `OCSdata` package.
We would also like to acknowledge the [Bloomberg American Health Initiative](https://americanhealth.jhu.edu/) for funding this work.
### Reading Metrics
The total reading time for this case study was calculated with [koRpus](https://github.com/unDocUMeantIt/koRpus): **About 70 minutes**
The Flesch-Kincaid Readability Index was also calculated with [koRpus](https://github.com/unDocUMeantIt/koRpus): **Grade 9, Age 14**
### Title
Exploring Global Patterns of Obesity from 1985 to 2017
### Motivation
Body Mass Index (BMI) is often used as a proxy for adiposity, with BMI-based classifications defining "underweight", "normal", "overweight", and "obese". Higher BMI has been associated with increased mortality and with higher rates of type 2 diabetes, cancer, heart disease, and stroke.
A recent [paper](https://www.nature.com/articles/s41586-019-1171-x.pdf) showed that contrary to a widely reported view (that urbanization is one of the most important drivers in the global rise of obesity), in fact BMI is increasing at the same rate or faster in rural areas (compared to cities), in particular in low- and middle-income regions.
There is also a gender discrepancy (women have a higher BMI in rural communities). Here, we explore this data to understand global patterns in obesity.
This analysis is important because it may indicate the need to provide better access (financial and physical access) to healthy foods in rural communities, especially in low-income countries, to address the obesity crisis.

### Motivating questions
1. Is there a difference between rural and urban BMI estimates around the world? In particular, what does this difference look like for women?
2. How have BMI estimates changed from 1985 to 2017? In particular, what does this change over time look like for women?
3. How do different countries compare for BMI estimates? In particular, how does the United States compare to the rest of the world?

### Data
The data used in this analysis comes from a [supplementary table](https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-019-1171-x/MediaObjects/41586_2019_1171_MOESM1_ESM.pdf) for the following article:
[NCD Risk Factor Collaboration (NCD-RisC). Rising rural body-mass index is the main driver of the global obesity epidemic in adults. *Nature* **569**, 260–264 (2019).](https://www.nature.com/articles/s41586-019-1171-x)
This article is freely available online.
While [gender](https://www.genderspectrum.org/quick-links/understanding-gender/){target="_blank"} and [sex](https://www.who.int/genomics/gender/en/index1.html){target="_blank"} are not actually binary, the data used in this analysis only cover groups of individuals described as men or women.
#### Learning Objectives
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
**Data science Learning Objectives:**
1. Importing data from a PDF (`pdftools`)
2. Subsetting and filtering data (`dplyr`)
3. Working with character strings (`stringr`)
4. Reshaping data into different formats (`tidyr`)
5. Applying functions to all columns of a tibble (`purrr`)
6. Creating data visualizations (`ggplot2`) with labels (`ggrepel`)
7. Combining multiple plots (`cowplot` and `patchwork`)

**Statistical Learning Objectives:**
1. Familiarity with the use of Quantile-Quantile plots to assess normality
2. Define and understand the utility of alpha and the p-value
3. Describe the difference between nonparametric and parametric tests
4. Be able to identify paired data
5. Implementation of a paired $t$-test
6. Interpretation of a paired $t$-test
7. Implementation of a Wilcoxon signed-rank test
8. Interpretation of a Wilcoxon signed-rank test
9. Understanding of the need for multiple testing correction

### Analysis
In this case study, we will largely focus on methods for comparing two groups using parametric and nonparametric hypothesis tests. We also cover multiple testing correction and fairly advanced data visualization methods using ggplot2.
#### Data import
Data is imported from a PDF using `pdftools` to obtain data from a large table. The beginning of this table looks like this:

#### Data wrangling
This case study covers many wrangling techniques and largely involves using the package `stringr`.
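
As a rough illustration of several steps listed in this section (dividing text into lines, removing excess white-space and repeated headers, handling unusual `NA` labels, splitting on a delimiter, converting to long format, and separating a column), here is a minimal sketch. It is written in Python with only the standard library rather than with the R packages the case study uses, so it is not the case study's code. The input text, the column names, the `wrangle` function, and the `-` placeholder for missing values are all invented for this example and are not taken from the source PDF.

```python
import re

# Made-up text standing in for one page of extracted PDF output.
# These are NOT values from the paper; they only mimic the layout problem.
RAW_PAGE = """Country      Sex    Rural 1985   Urban 1985
Country A    Men    21.5         23.0
Country      Sex    Rural 1985   Urban 1985
Country B    Women  -            24.1
"""


def wrangle(page, missing_marker="-"):
    """Turn whitespace-aligned text into long-format records (a sketch)."""
    # Divide into lines, trim excess white-space, and drop empty lines.
    lines = [line.strip() for line in page.splitlines() if line.strip()]

    # The first line is the header; later copies of it are redundant.
    header = lines[0]
    body = [line for line in lines[1:] if line != header]

    # Split into columns using runs of two or more spaces as the delimiter,
    # so that single spaces inside a cell (e.g. "Rural 1985") are preserved.
    columns = re.split(r"\s{2,}", header)

    long_rows = []
    for line in body:
        record = dict(zip(columns, re.split(r"\s{2,}", line)))
        # Convert wide to long: one output row per measurement column.
        for name in columns[2:]:
            # Separate the combined column name into region and year.
            region, year = name.split()
            value = record[name]
            long_rows.append({
                "country": record["Country"],
                "sex": record["Sex"],
                "region": region,
                "year": int(year),
                # Treat the unusual missing-value label as None.
                "bmi": None if value == missing_marker else float(value),
            })
    return long_rows


LONG_ROWS = wrangle(RAW_PAGE)
for row in LONG_ROWS:
    print(row)
```

Splitting on two or more spaces is an assumption that suits this toy layout; real PDF text often needs the spacing corrected first, which is why correcting spacing issues appears as its own step in this section.
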
1. Dividing data into separate lines
2. Removing excess white-space
3. Removing redundant header information
4. Correcting spacing issues
5. Dealing with `NA` values that are labeled in an unusual manner
6. Splitting the data into columns using a delimiter
7. Changing variable names
8. Sorting the data
9. Converting to long format
10. Separating a column into multiple columns

#### Data exploration
To explore the data we use the `summarize()` function as well as plots to look at the distribution of the data. Quantile-Quantile plots are used to evaluate the distribution and compare it to the theoretical normal distribution.
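
To make the idea behind a Quantile-Quantile plot concrete, the sketch below computes the points such a plot would display: each sorted sample value paired with the matching quantile of a standard normal distribution. It is a Python illustration using only the standard library, not the R code of the case study. The function name `qq_points`, the sample values, and the plotting positions `(i + 0.5) / n` are assumptions made for this example; other plotting-position conventions exist.

```python
from statistics import NormalDist


def qq_points(sample):
    """Return (theoretical quantile, sample value) pairs for a normal Q-Q plot."""
    ordered = sorted(sample)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, value in enumerate(ordered):
        # Plotting position for the i-th smallest value (one common convention).
        probability = (i + 0.5) / n
        points.append((standard_normal.inv_cdf(probability), value))
    return points


# Made-up values, used only to show the shape of the output.
QQ = qq_points([26.1, 23.4, 24.8, 25.3, 22.9])
for theoretical, observed in QQ:
    print(f"{theoretical:6.3f}  {observed:5.1f}")
```

If the sample is roughly normal, these points fall close to a straight line when plotted; systematic curvature suggests skew or heavy tails.
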
#### Statistical concepts
This case study covers fundamental concepts in statistics such as type 1 error, the alpha threshold, p-values, hypothesis testing, parametric two-sample mean tests, and nonparametric two-sample tests, as well as the assumptions of each included statistical test and what to do when data are paired.
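
As a small worked example of two of these concepts, the sketch below computes the test statistic of a paired $t$-test from the per-pair differences and shows the Bonferroni-adjusted alpha threshold for several tests. It is a Python illustration using only the standard library, not the case study's R code. The function names and the rural and urban values are invented for this example. It stops at the statistic and degrees of freedom, because turning them into a p-value requires the $t$ distribution, which a statistics package normally provides (in R, for instance, `t.test()` with `paired = TRUE`).

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return (t statistic, degrees of freedom) for paired observations."""
    if len(first) != len(second):
        raise ValueError("paired data need the same number of observations")
    # A paired test works on the difference within each pair.
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    standard_error = stdev(differences) / sqrt(n)  # sample standard deviation
    return mean(differences) / standard_error, n - 1


def bonferroni_alpha(alpha, number_of_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / number_of_tests


# Made-up BMI values for the same four hypothetical countries.
RURAL = [24.0, 25.0, 26.0, 27.0]
URBAN = [23.0, 24.5, 25.0, 25.5]

T_STAT, DEGREES_OF_FREEDOM = paired_t_statistic(RURAL, URBAN)
print(f"t = {T_STAT:.3f} with {DEGREES_OF_FREEDOM} degrees of freedom")
print(f"per-test alpha for 4 tests: {bonferroni_alpha(0.05, 4)}")
```

The pairing matters because each difference compares rural and urban values for the same unit, so variation between units does not inflate the standard error as it would in an unpaired two-sample test.
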
### Other notes and resources
[BMI](https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html){target="_blank"}
[Long and Wide Data Formats](https://opencasestudies.github.io/ocs-healthexpenditure/ocs-healthexpenditure.html){target="_blank"}
[Distributions](http://onlinestatbook.com/2/introduction/distributions.html){target="_blank"}
[Normal Distribution](http://onlinestatbook.com/2/introduction/distributions.html){target="_blank"}
[Skewed Distributions](http://onlinestatbook.com/2/glossary/skew.html){target="_blank"}
[Bimodal Distribution](http://onlinestatbook.com/2/introduction/distributions.html){target="_blank"}
[ggplot2](https://opencasestudies.github.io/ocs-healthexpenditure/ocs-healthexpenditure.html){target="_blank"}
[Q-Q Plots](http://onlinestatbook.com/2/advanced_graphs/q-q_plots.html){target="_blank"}
[Student $t$-test](https://stattrek.com/statistics/dictionary.aspx?definition=two-sample%20t-test)
[Paired Data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5579465/){target="_blank"}
[Welch's $t$-test](https://www.statisticshowto.datasciencecentral.com/welchs-test-for-unequal-variances/){target="_blank"}
[Parametric and Nonparametric Methods](https://www.mayo.edu/research/documents/parametric-and-nonparametric-demystifying-the-terms/doc-20408960)
[Variance](https://stattrek.com/statistics/dictionary.aspx?definition=variance){target="_blank"}
[Balanced Study Design](https://www.statisticshowto.datasciencecentral.com/balanced-and-unbalanced-designs/){target="_blank"}
[Independent Observations](https://www.stat.cmu.edu/~cshalizi/36-220/lecture-5.pdf){target="_blank"}
[Transformation](https://www.statisticshowto.datasciencecentral.com/transformation-statistics/){target="_blank"}
[Permutation/Resampling Methods](https://jhu-advdatasci.github.io/2019/lectures/21-resampling-techniques.html){target="_blank"}
[Central Limit Theorem](https://www.analyticsvidhya.com/blog/2019/05/statistics-101-introduction-central-limit-theorem/){target="_blank"}
[Mood's Two-Sample Scale Test](https://files.eric.ed.gov/fulltext/ED065559.pdf){target="_blank"}
[Wilcoxon Signed Rank Test](http://www.biostathandbook.com/wilcoxonsignedrank.html)
[Wilcoxon Rank Sum Test](http://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_nonparametric/BS704_Nonparametric4.html){target="_blank"}
[Two-sample Kolmogorov-Smirnov Test](https://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/ks2samp.htm){target="_blank"}
[Type 1 Error](https://web.ma.utexas.edu/users/mks/statmistakes/errortypes.html){target="_blank"}
[p-value](https://towardsdatascience.com/p-values-explained-by-data-scientist-f40a746cfc8){target="_blank"}
[Multiple Testing](https://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture10.pdf){target="_blank"}
[Bonferroni Method of Multiple Testing Correction](http://mathworld.wolfram.com/BonferroniCorrection.html){target="_blank"}

**Packages used in this case study:**

Package | Use in this case study
---------- |-------------
[here](https://github.com/jennybc/here_here){target="_blank"} | to easily load and save data with relative paths
[pdftools](https://cran.r-project.org/web/packages/pdftools/pdftools.pdf){target="_blank"} | to read text from a PDF into R
[stringr](https://stringr.tidyverse.org/articles/stringr.html){target="_blank"} | to manipulate the text data
[readr](https://readr.tidyverse.org/){target="_blank"} | to split the text from the PDF into individual lines
[dplyr](https://dplyr.tidyverse.org/){target="_blank"} | to arrange/filter/select subsets of the data
[tibble](https://tibble.tidyverse.org/){target="_blank"} | to create data objects that we can manipulate with `dplyr`/`stringr`/`tidyr`/`purrr`
[magrittr](https://magrittr.tidyverse.org/articles/magrittr.html){target="_blank"} | to use the `%<>%` piping operator
[glue](https://www.tidyverse.org/blog/2017/10/glue-1.2.0/){target="_blank"} | to paste or combine character strings and data together
[purrr](https://purrr.tidyverse.org/){target="_blank"} | to perform functions on all columns of a tibble
[tidyr](https://tidyr.tidyverse.org/){target="_blank"} | to convert data from 'wide' to 'long' format
[ggplot2](https://ggplot2.tidyverse.org/) | to make visualizations with multiple layers
[ggrepel](https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html){target="_blank"} | to allow labels in figures not to overlap
[cowplot](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html){target="_blank"} and [patchwork](https://github.com/thomasp85/patchwork){target="_blank"} | to allow plots to be combined

#### For users
The [`Makefile`](Makefile) in this folder lets you type `make` to knit the case study in `index.Rmd` to `index.html`. It also knits [`README.Rmd`](README.Rmd) to a Markdown file (`README.md`).
#### For instructors
Our goal is for instructors to use this case study as the starting point for a set of lectures.
We provide one R Markdown file ([`index.Rmd`](index.Rmd)) for an instructor to use.
However, we anticipate the instructor may either break this file up into smaller R Markdown files for multiple lectures or extract only a portion of the material (e.g. the Data Wrangling or Data Analysis sections) to use in the classroom.
With the latter goal in mind, we save a `Wrangled_data.rda` object at the end of the Data Wrangling section, which is loaded at the start of the Data Exploration section.

#### Target audience
This case study is designed for undergraduate students who have not taken a statistics course. While we do not discuss the theoretical aspects of the statistics concepts used in this case study, the case study discusses the motivation behind them.
#### Suggested homework
Students can repeat a similar analysis, but evaluate the change in BMI over time using the global data available for each year between 2015 and 2017.
#### Estimate of RMarkdown Compilation Time:
About 31 to 41 seconds
This compilation time was measured on a PC running Windows 10. Treat this range only as an estimate, as compilation time will vary across machines and operating systems.