https://github.com/abehandler/chidottsv

Scripts and instructions to make a csv file of all CHI papers by year, 1981-present
https://github.com/abehandler/chidottsv

Last synced: 2 months ago
JSON representation

Scripts and instructions to make a csv file of all CHI papers by year, 1981-present

Host: GitHub
URL: https://github.com/abehandler/chidottsv
Owner: AbeHandler
Created: 2021-10-02T13:29:47.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2022-01-20T23:30:00.000Z (over 4 years ago)
Last Synced: 2026-01-01T12:08:58.544Z (6 months ago)
Language: HTML
Size: 4.43 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# chidotcsv

This repo holds code and data to make a big csv file of all CHI paper titles going back to 1981. The main script to create the dataset is `main.sh`.

## Data collection

There are two primary data collection methods:

1. For old conferences (up to late 90s), it is possible to manually export the CHI proceedings as a bib file from the [ACM LIbrary](https://dl.acm.org/proceedings). I did that by hand and save the bib files to the `bib` directory.
2. For more recent conferences, the bib files get to big for the ACM website to export (ha!) so you have to scrape and parse html proceedings from the conference websites. There are a few different formats for the websites, which change every few years. And a couple of standalone conference years require ad-hoc scraping scripts. See `collect_data.sh` to see the scripting details.

## Data processing

Once data is collected by hand and by scripts, it needs to be processed to create the csv. The processing scripts are in `process_data.sh`. The data is processed into a column store format: each paper title is shown as one line in the file `titles.txt` and its year is shown as a corresponding line in `years.txt`. I use the paste command to join the two columns to make the csv.

## Data validation

ACM posts acceptance rates by year in the CHI proceedings section of the ACM library. I copied/pasted this information by hand into `acm_acceptance.txt` and wrote a script to check the count of papers by year collected in this repo vs. the ACM count. For now, the main validation that I am oding is checking to see that the count in this repo is either higher or very slightly than the ACM count. Help with validation would be great! The validation script is in `validate_data.sh`.

The other validation I do is make sure I get the same number of years and titles. Parsing should return the same number of each.

## Other notes

See this similar effort focused on [viz](https://sites.google.com/site/vispubdata/home) papers

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/abehandler/chidottsv

Awesome Lists containing this project

README