https://github.com/utdata/rwd-billboard-data

Billboard charts data

Host: GitHub
URL: https://github.com/utdata/rwd-billboard-data
Owner: utdata
License: mit
Created: 2019-12-25T19:03:35.000Z (over 6 years ago)
Default Branch: main
Last Pushed: 2025-05-01T16:16:30.000Z (about 1 year ago)
Last Synced: 2025-05-01T16:30:02.201Z (about 1 year ago)
Topics: actions, r
Language: Jupyter Notebook
Homepage:
Size: 454 MB
Stars: 48
Watchers: 1
Forks: 5
Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE

README

---
output:
  html_document:
    df_print: paged
knit: (function(inputFile, encoding) { rmarkdown::render(
  inputFile,
  encoding = encoding,
  output_dir = "docs",
  output_file='index.html'
  ) })
---

# rwd-billboard-data

> **June 2023**: The combine action was updated to create a modified version of Hot 100 for an assignment.

This project archives [Billboard Hot 100](https://www.billboard.com/charts/hot-100/) and [Billboard 200](https://www.billboard.com/charts/billboard-200/) charts data.

If you are here looking for a current archive, here are the files of interest:

- [data-out/hot-100-current.csv](data-out/hot-100-current.csv) has the [Billboard Hot 100](https://www.billboard.com/charts/hot-100/) back to its inception in 1958.
- [data-out/billboard-200-current.csv](data-out/billboard-200-current.csv) is the [Billboard 200](https://www.billboard.com/charts/billboard-200/) from its inception in 1967.

> There are minor data errors in both archives. See details below.

This project has been ... an adventure. Details below.

## Charts scraping and combining

There are two GitHub Actions that call scripts to scrape a list of charts each week and then combine each chart's files with processed archives from other sources. I currently collect the Hot 100 and Billboard 200 charts.

- `.github/workflows/scrap_charts.yml` is a GitHub Action that is scheduled on a cron to run `action_scrape_charts.R`. That scrapes the current chart and saves it.
- `.github/workflows/combine_charts.yml` is a GitHub Action that is scheduled on a cron to run `action_combine_charts.R`. This combines the scraped charts with the `previous_archives` files, if any exist.

The actions run Tuesday through Friday, though the charts usually update on Tuesdays (or on Wednesdays in weeks with a Monday holiday). Charts are also sometimes corrected after they are published.
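The combine step described above can be pictured with a minimal Python sketch. It is not the logic of `action_combine_charts.R`, which is not shown in this README. The column names `chart_week` and `current_week`, the function name `combine_charts`, and the sample rows are all assumptions made for illustration. The sketch keys each row on chart date and position and lets scraped rows replace archived rows for the same key, which is one way a corrected chart could overwrite an earlier copy.

```python
import csv
import io


def combine_charts(archive_rows, scraped_rows):
    """Merge archive and scraped rows, keyed on (chart_week, current_week).

    Scraped rows are applied last, so a re-scraped chart replaces the
    archived entry for the same week and position.
    Column names are hypothetical, not taken from the project files.
    """
    combined = {}
    for row in list(archive_rows) + list(scraped_rows):
        key = (row["chart_week"], int(row["current_week"]))
        combined[key] = row
    # ISO dates sort correctly as strings; positions sort as integers.
    return [combined[key] for key in sorted(combined)]


# Made-up sample data standing in for an archive file and weekly scrapes.
archive = list(csv.DictReader(io.StringIO(
    "chart_week,current_week,title,performer\n"
    "2021-12-25,1,Song A,Artist A\n"
    "2022-01-01,1,Song B,Artist B\n"
)))
scraped = list(csv.DictReader(io.StringIO(
    "chart_week,current_week,title,performer\n"
    "2022-01-01,1,Song B (corrected),Artist B\n"
    "2022-01-08,1,Song C,Artist C\n"
)))

result = combine_charts(archive, scraped)
for row in result:
    print(row["chart_week"], row["current_week"], row["title"])
```

Keying on date and position keeps the merge idempotent: running it again on the same inputs produces the same archive.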

### Exploration and maintenance

There are some R Markdown notebooks used to explore and maintain those scripts: `01-scrape-charts.Rmd`, `02-combine-charts.Rmd` and `03-check-charts.Rmd`. Details recorded there can help explain what is happening in the scrape and combine scripts.

## Hot 100

### 2022 to current

The GitHub Action script saves data into `data-scraped/hot-100` based on the chart date. These files cover 2022 and forward.

### Archive from before 2022

Where the data comes from:

- We downloaded this [Kaggle](https://www.kaggle.com/dhruvildave/billboard-the-hot-100-songs) data straight from the web page. It is saved as `data-download/hot100_kaggle_195808_20211106.csv`. It has charts into November 2021. There are some missing records (at least 13).
- Since the Kaggle data is stale, some gap data was collected with a Data Miner Chrome plugin and [saved as a Google Sheet](https://docs.google.com/spreadsheets/d/1in--HfDYfijzQha8PSP4ItaKND9_rzx8pFPVHaZi-hE/edit?usp=sharing). It's possible this will be replaced in the future.
- Another source of Billboard Hot 100 data is on [data.world](https://data.world/kcmillersean/billboard-hot-100-1958-2017), and it is used to fill in the data missing from Kaggle. It only goes through June 2021 and also has gaps, but not the same gaps as the Kaggle data.

How it comes together:

- **notebooks/02-hot100-archive**: Combines different data sources to create the complete archive, saved into the `data-out` folder.
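As a rough illustration of filling one source's gaps from another, here is a small Python sketch. It is not the code in the notebook. The function names, the `chart_week` and `current_week` column names, and the sample rows are hypothetical. Rows from the primary source are always kept, and a row from the secondary source is added only when the primary source has nothing for that week and position.

```python
def row_key(row):
    """Identify a chart entry by week and position (hypothetical columns)."""
    return (row["chart_week"], int(row["current_week"]))


def fill_gaps(primary_rows, secondary_rows):
    """Return primary rows plus any secondary rows whose key is missing."""
    seen = {row_key(row) for row in primary_rows}
    filled = list(primary_rows)
    for row in secondary_rows:
        if row_key(row) not in seen:
            seen.add(row_key(row))
            filled.append(row)
    return sorted(filled, key=row_key)


# Made-up sample data: the primary source lacks position 2 for one week.
primary = [
    {"chart_week": "2000-01-01", "current_week": "1", "title": "Song A"},
    {"chart_week": "2000-01-01", "current_week": "3", "title": "Song C"},
]
secondary = [
    {"chart_week": "2000-01-01", "current_week": "1", "title": "SONG A"},
    {"chart_week": "2000-01-01", "current_week": "2", "title": "Song B"},
    {"chart_week": "2000-01-01", "current_week": "3", "title": "Song C"},
]

filled = fill_gaps(primary, secondary)
for row in filled:
    print(row["chart_week"], row["current_week"], row["title"])
```

Preferring the primary source where both have a row avoids mixing spellings of the same title within a chart. Which source should win is a judgment call that this sketch does not settle.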

### Known Hot 100 data errors

TL;DR: My data matches what is currently online.

- There are a couple of records for "Rainy Night In Georgia/Rubberneckin'" by Brook Benton, which [some think](https://data.world/kcmillersean/billboard-hot-100-1958-2017/discuss/billboard-hot-100-1958-2017/me2tkmbx#kex5mx5n) is a mistake. Elvis' "Rubberneckin'" appears higher in these same weeks. As of 2022-07-23 the data appears this way online for [1970-01-10](https://www.billboard.com/charts/hot-100/1970-01-10/) and [1970-01-17](https://www.billboard.com/charts/hot-100/1970-01-17/) charts.
- [This comment](https://data.world/kcmillersean/billboard-hot-100-1958-2017/discuss/billboard-hot-100-1958-2017/me2tkmbx#emfy2p2n) on the data.world collection: "Just another heads up for anyone using this dataset. The charts for 1961 contain another error. The Pips "Every Beat Of My Heart" is duplicated twice in some of the weekly charts, except the duplicates are credited to Gladys Knight & The Pips. The original 1961 release was credited to only Pips or The Pips, later re-releases of the song in the 70s reflect the band's change of name." I have confirmed that the double entries exist both in this data set and in the charts currently online at Billboard, but I have not researched why.
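A check for double entries like the one quoted above could look something like the following Python sketch. It is an illustration, not something taken from the project's notebooks. The column names (`chart_week`, `title`, `performer`) and the function name are assumptions, and the week label in the sample is a placeholder because the affected chart dates are not listed here. The sketch groups rows by week and title and reports any title that appears more than once in a week.

```python
from collections import defaultdict


def repeated_titles(rows):
    """Map (chart_week, title) to its performers when a title repeats in a week."""
    performers_by_key = defaultdict(list)
    for row in rows:
        performers_by_key[(row["chart_week"], row["title"])].append(row["performer"])
    return {
        key: performers
        for key, performers in performers_by_key.items()
        if len(performers) > 1
    }


# Illustrative rows; "1961-week-a" is a placeholder, not a real chart date.
sample = [
    {"chart_week": "1961-week-a", "title": "Every Beat Of My Heart",
     "performer": "Pips"},
    {"chart_week": "1961-week-a", "title": "Every Beat Of My Heart",
     "performer": "Gladys Knight & The Pips"},
    {"chart_week": "1961-week-a", "title": "Some Other Song",
     "performer": "Some Other Artist"},
]

duplicates = repeated_titles(sample)
for (week, title), performers in duplicates.items():
    print(week, title, performers)
```

Because the match is on the exact title, this would not flag the "Rainy Night In Georgia/Rubberneckin'" case, where the combined title differs from "Rubberneckin'" alone.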

## Billboard 200

The chart scraping script also collects the [Billboard 200](https://www.billboard.com/charts/billboard-200/) each week into `data-scraped/billboard-200`, stored by date, and the chart combine script builds a current archive saved as [data-out/billboard-200-current.csv](data-out/billboard-200-current.csv).

The combine script taps a processed archive file for charts pre-2020, explained below.

> The following scripts won't work anymore. It would be nice to build the pre-2020 archive from my own R scrapes, but I haven't done that yet.

- 01-build-archive-billboard200 is a Python Jupyter Notebook that downloads the files one year at a time. The resulting files are saved in `data`. There are two significant issues to be aware of:
  - The downloaded data had errors related to quote escaping (I don't recall the details). Errors were manually fixed in a text editor as they were discovered.
  - **This process will no longer work because the Python package is broken.** It no longer understands `previousDate`. The original data has been moved to `data-download/py-billboard-200` and the fixed data is in `data-process/billboard200`.
- [02-billboard200-combine](https://utdata.github.io/rwd-billboard-data/02-billboard200-combine.html) is an R notebook used to combine the data. I used this notebook to find problems and then manually cleaned files, which are stored in `data-process/billboard200/`. Combined data is in `data-out/billboard200.csv`.

### Known Billboard 200 data errors

- The first five weeks in the history have only 175 rows, but that matches what is online. This is not really an error, but it is worth noting.
- There are **only 191 records for 1967-09-16**. The chart is also incorrect online, missing records 153, 154, 182, 184, 192, 193, 195, 196 and 197.
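The arithmetic behind the 1967-09-16 note can be checked with a short Python sketch. It assumes the listed record numbers are chart positions and that a full chart has 200 positions. The function name `missing_positions` and the `chart_size` parameter are hypothetical. The list of present positions is built here from the numbers above rather than read from the archive file.

```python
def missing_positions(positions, chart_size=200):
    """Return the positions from 1..chart_size that do not appear in the input."""
    return sorted(set(range(1, chart_size + 1)) - set(positions))


# Positions reported missing for the 1967-09-16 chart in the list above.
reported_missing = [153, 154, 182, 184, 192, 193, 195, 196, 197]

# Stand-in for the positions present in the archive for that week.
present = [p for p in range(1, 201) if p not in reported_missing]

print(len(present))
print(missing_positions(present))
```

Nine missing positions out of 200 leaves 191 rows, which agrees with the record count noted above. The same function could be run per chart date with a different `chart_size` for weeks that are known to be shorter.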
