https://github.com/centre-for-humanities-computing/chicago_corpus
# The Chicago Corpus ([ACL Anthology](https://aclanthology.org/2024.lrec-main.71/))

As part of the efforts of the [Fabula-NET project](https://centre-for-humanities-computing.github.io/fabula-net/) at the [Center for Humanities Computing](https://chc.au.dk), Århus University, we present a dataset of quality judgments on 9,000 19th- and 20th-century English-language literary novels by 3,166 predominantly Anglophone authors.
The data includes annotations of expert opinions and crowd-based resources to allow comparative analyses between different literary quality evaluations, as well as several textual metrics chosen for their connection with literary reception. A large part of the corpus is subject to copyright (see the [available pre-1924 works here](https://artflsrv04.uchicago.edu/philologic4.7/chicago_novel_corpus_pre1923_12-20/)). **We release quality and reception measures together with stylometric and sentiment data** for each of the 9,000 novels to promote future research and comparison. Read the [Paper] presenting this resource.
## Data included
- 9,000 titles
- Author, title & year
- Various textual metrics
- Various reception metrics
For an overview of all included data, see the corpus [documentation](https://github.com/centre-for-humanities-computing/chicago_corpus/blob/66da8be26cfccb3c24c16abcf003d10695e34385/data/corpus_description.md).
Available formats: [.xlsx](https://docs.google.com/spreadsheets/d/1mIMZw1dcoVZOQX3qtPOTmzmQZ9vm7dK1Sj2eZcgArvA/edit?usp=sharing), [.json](https://github.com/centre-for-humanities-computing/chicago_corpus/raw/main/data/CHICAGO_CORPUS_DATA.json)
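
As a minimal sketch of reading a downloaded copy of the `.json` file, the helper below uses only the Python standard library. The layout of the file is not described here, so the function assumes it is either a list of per-title records or a column-oriented mapping (column name to `{row key: value}`); `load_corpus` is a hypothetical name, not part of this repository.

```python
import json
from pathlib import Path


def load_corpus(path):
    """Load the corpus JSON and return a list of per-title dicts.

    Assumption: the file is either a list of records, or a dict that maps
    each column name to a {row_key: value} dict. Adjust if the real file
    uses another layout.
    """
    with Path(path).open(encoding="utf-8") as fh:
        data = json.load(fh)

    if isinstance(data, list):
        # Already one dict per title.
        return data

    # Column-oriented layout: rebuild one dict per row key.
    columns = list(data)
    row_keys = list(data[columns[0]]) if columns else []
    return [{col: data[col].get(key) for col in columns} for key in row_keys]
```

With the file saved locally, `rows = load_corpus("CHICAGO_CORPUS_DATA.json")` would return one dictionary per title under the assumptions above.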
## Example
| BOOK_ID | TITLE | AUTH_FIRST | AUTH_LAST | PUBL_DATE | ... | AVG_RATING | SCIFI_AWARDS | PULITZER | TRANSLATIONS | ... | PERPLEXITY | MEAN_SENT | READABILITY |
| ---------------- | --------------- |------------------- |-----------|--------------|-----|------------------|----------------|----------|--------------|------|------------|-----------|-------------|
| 6913 |A Clash of Kings | George R. R. | Martin | 1999 | ... | 4.41 | 1 | 0 | 38 | ... | 79.97| -0.002 | 92.73 |
| 20636 | Dune | Frank | Herbert | 1965 | ... | 4.25 | 1 | 0 | 398 | ... | 72.74 | -0.007 | 85.18 |
| 22741 | Beloved | Toni | Morrison | 1987 | ... | 3.92 | 0 | 1 | 68 | ... | 68.78 | 0.030 | 91.71 |
| 5778 | Misery | Stephen | King | 1987 | ... | 4.20 | 0 | 0 | 74 | ... | 68.09 | -0.032 | 82.54 |
| 86 | The Portrait of a Lady | Henry | James | 1881 | ... | 3.78 | 0 | 0 | 53 | ... | 80.35 | 0.150 | 71.65 |
**Above**: _Example of titles and corresponding values for selected metrics_
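
To illustrate how rows like these can be queried, the sketch below copies a few columns of the five example titles into plain dictionaries and selects the award winners, ordered by average rating. It assumes that the field names match the table headers and that `SCIFI_AWARDS` and `PULITZER` are 0/1 flags, as the example rows suggest; `award_winners` is a hypothetical helper.

```python
# Values copied from the example table above (other columns omitted).
rows = [
    {"TITLE": "A Clash of Kings", "PUBL_DATE": 1999, "AVG_RATING": 4.41, "SCIFI_AWARDS": 1, "PULITZER": 0},
    {"TITLE": "Dune", "PUBL_DATE": 1965, "AVG_RATING": 4.25, "SCIFI_AWARDS": 1, "PULITZER": 0},
    {"TITLE": "Beloved", "PUBL_DATE": 1987, "AVG_RATING": 3.92, "SCIFI_AWARDS": 0, "PULITZER": 1},
    {"TITLE": "Misery", "PUBL_DATE": 1987, "AVG_RATING": 4.20, "SCIFI_AWARDS": 0, "PULITZER": 0},
    {"TITLE": "The Portrait of a Lady", "PUBL_DATE": 1881, "AVG_RATING": 3.78, "SCIFI_AWARDS": 0, "PULITZER": 0},
]


def award_winners(rows, award_columns=("SCIFI_AWARDS", "PULITZER")):
    """Return the rows in which at least one award flag is set."""
    return [row for row in rows if any(row.get(col) for col in award_columns)]


# Award winners, highest average rating first.
winners = sorted(award_winners(rows), key=lambda row: row["AVG_RATING"], reverse=True)
for row in winners:
    print(row["TITLE"], row["PUBL_DATE"], row["AVG_RATING"])
```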
## Corpus statistics

The corpus of texts from which we constructed our dataset was assembled by Hoyt Long and Richard Jean So at the [Textual Optics Lab]. It encompasses 9088 novels published in the United States between 1880 and 2000, selected by the number of libraries holding each title according to the [WorldCat](https://search.worldcat.org) catalogue, favoring works with more library holdings.
| Titles | Authors | Titles per author |
| -------------------------- | --------------------| -------------------------------------------------------------- |
| 9088 | 3166 | 2.88 |
**Above**: _Number of titles/authors in the corpus_
**Below**: _Mean & SD of some of the included features_
| Metric | Wordcount | Sentence Length | Wordlength | Type/Token Ratio | Compressibility | Bigram Entropy | Word Entropy | Flesch Ease | Dale Chall New | Mean Sentiment | Std Sentiment | End Sentiment | Beginning Sentiment | Hurst Exponent | Approximate Entropy |
|----------------------|-------------|-------------------|--------------|--------------------|-------------------|-------------------|-----------------|----------------|------------------|------------------|-----------------|----------------|------------------------|-------------------|-------------------------|
| Mean (µ) | 118584.71 | 86.56 | 3.67 | 0.69 | 2.92 | 14.63 | 9.69 | 82.70 | 5.10 | 0.03 | 0.35 | 0.03 | 0.04 | 0.61 | 1.75 |
| St. dev. (±) | 64746.05 | 29.44 | 0.18 | 0.02 | 0.14 | 0.55 | 0.30 | 6.48 | 0.33 | 0.04 | 0.04 | 0.07 | 0.05 | 0.04 | 0.15 |
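
Summary rows of this kind can be recomputed per feature with the standard library. The sketch below is illustrative only: it uses the five `READABILITY` values from the example table rather than the full corpus, and it assumes the sample standard deviation, since the README does not say whether the sample or population formula was used. `summarize` is a hypothetical helper.

```python
import statistics

# READABILITY values of the five example titles (not the full corpus).
readability = [92.73, 85.18, 91.71, 82.54, 71.65]


def summarize(values):
    """Return (mean, sample standard deviation), skipping missing values."""
    present = [v for v in values if v is not None]
    return statistics.fmean(present), statistics.stdev(present)


mean, sd = summarize(readability)
print(f"READABILITY: mean={mean:.2f}, sd={sd:.2f}")
```

Swapping `statistics.stdev` for `statistics.pstdev` gives the population standard deviation instead.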
## "Quality", "reader appreciation" or "popularity" metrics

Beyond textual features, we present various **"quality proxies"**, that is, ways of estimating valuation in literary culture, such as whether or not titles are included in bestseller or canon lists. We also include what we call "continuous" proxies, that is, scores per title, for example Goodreads ratings or translation numbers (see the corpus [documentation]).
Because of the library holdings selection criteria, the corpus comprises much high-quality fiction from authors who have received prestigious distinctions, such as the Nobel Prize (i.a., Toni Morrison) or the National Book Award (i.a., Don DeLillo). Yet library holdings appear to indicate **both high distinction and mass popularity**, reflecting library users' demand and preferences. The corpus therefore also comprises widely popular novels from mainstream literature (i.a., Agatha Christie) and notable works across the broad spectrum of so-called "genre literature", from Mystery to Science Fiction (i.a., Tolkien, Philip K. Dick). An examination of the relation between various proxies in this corpus is [forthcoming](https://jcls.io/site/ccls2024/).
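
One simple way to compare two continuous proxies is a rank correlation. The sketch below implements Spearman's rho with the standard library (average ranks for ties, then Pearson correlation on the ranks) and applies it to `AVG_RATING` and `TRANSLATIONS` for the five example titles. It is an illustration under those assumptions, not the analysis method of the paper, and five rows are far too few to draw any conclusion.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (spread_x * spread_y)


# Values of the five example titles, in table order.
avg_rating = [4.41, 4.25, 3.92, 4.20, 3.78]
translations = [38, 398, 68, 74, 53]
rho = spearman(avg_rating, translations)
print(f"Spearman rho (AVG_RATING vs TRANSLATIONS): {rho:.2f}")
```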
## Documentation
| | |
| --------------------------- | --------------------------------------------------------------------------------- |
| **[Paper]** | The Chicago resource paper. |
| **[Documentation]** | Detailed description of measures and proxies included in the dataset. |
| **[Previous works]** | Publications that have previously used the Chicago Corpus. |
| **[Textual Optics Lab]** | The Chicago Corpus at the Textual Optics Lab, University of Chicago. |
| **[Citation]** | Bibtex citation. |
| **[EmotionArcs]** | Emotion Arcs of the Chicago Corpus (a linked dataset). |
| **[CHC]** | Center for Humanities Computing, hosting the FabulaNET project. |
[Paper]: https://github.com/centre-for-humanities-computing/chicago_corpus/blob/3822b3f2d29775f7565c982b7bdaad160a6153ac/documentation/LREC_COLING_2024_CHICAGO.pdf
[Citation]: https://github.com/centre-for-humanities-computing/chicago_corpus/blob/8b813a9b904d7293853fdae28adb884f753bd9bd/documentation/citation.bib
[Previous works]: https://github.com/centre-for-humanities-computing/chicago_corpus/blob/e5e4762e05020f7ea1518a03d6680133c98dddf6/documentation/chicago_publications.md
[Textual Optics Lab]: https://textual-optics-lab.uchicago.edu/us_novel_corpus
[documentation]: data/corpus_description.md
[EmotionArcs]: https://github.com/yuri-bizzoni/EmoArc
[CHC]: https://chc.au.dk