{"id":19813951,"url":"https://github.com/erickpeirson/jhb-data","last_synced_at":"2026-03-04T17:32:21.139Z","repository":{"id":152185390,"uuid":"103768465","full_name":"erickpeirson/jhb-data","owner":"erickpeirson","description":"Data from the forthcoming paper: Quantitative Perspectives on Fifty Years of the Journal of the History of Biology","archived":false,"fork":false,"pushed_at":"2017-09-17T14:35:11.000Z","size":56033,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-28T19:11:44.816Z","etag":null,"topics":["data","geolocation","history-of-biology","named-entity-recognition","topic-modeling"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/erickpeirson.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-09-16T16:53:40.000Z","updated_at":"2017-09-17T14:39:02.000Z","dependencies_parsed_at":null,"dependency_job_id":"b9c29302-0a44-4e68-9a87-b1e84f047504","html_url":"https://github.com/erickpeirson/jhb-data","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/erickpeirson/jhb-data","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erickpeirson%2Fjhb-data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erickpeirson%2Fjhb-data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erickpeirson%2Fjhb-data/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/erickpeirson%2Fjhb-data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/erickpeirson","download_url":"https://codeload.github.com/erickpeirson/jhb-data/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erickpeirson%2Fjhb-data/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30087355,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-04T15:40:14.053Z","status":"ssl_error","status_checked_at":"2026-03-04T15:40:13.655Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","geolocation","history-of-biology","named-entity-recognition","topic-modeling"],"created_at":"2024-11-12T09:37:44.525Z","updated_at":"2026-03-04T17:32:21.101Z","avatar_url":"https://github.com/erickpeirson.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data: Quantitative Perspectives on Fifty Years of the Journal of the History of Biology\n\n[![DOI](https://zenodo.org/badge/103768465.svg)](https://zenodo.org/badge/latestdoi/103768465)\n\nData from the forthcoming paper:\n\n\u003e Peirson, B. R. Erick, Erin Bottino, Julia L. Damerow, and Manfred D. Laubichler. 2017. Quantitative perspectives on fifty years of the *Journal of the History of Biology* 50(4).\n\nUnless otherwise specified, data are comma-delimited. 
Missing values are left empty.\n\n### How to Cite\n\u003e Peirson, B. R. Erick, Erin Bottino, Julia L. Damerow, and Manfred D. Laubichler. 2017. Data: Quantitative Perspectives on Fifty Years of the Journal of the History of Biology (revision 1). https://github.com/erickpeirson/jhb-data/releases/tag/1.0. doi:10.5281/zenodo.893499\n\n### Questions?\n\nErick Peirson ([orcid:0000-0002-0564-9939](http://orcid.org/0000-0002-0564-9939))\n- Web: https://erickpeirson.github.io\n- Twitter: [@undercaffeinatd](https://twitter.com/undercaffeinatd)\n\n\n## License\n\n\u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-sa/4.0/\"\u003e\u003cimg alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png\" /\u003e\u003c/a\u003e\u003cbr /\u003eThis work is licensed under a \u003ca rel=\"license\" href=\"http://creativecommons.org/licenses/by-nc-sa/4.0/\"\u003eCreative Commons Attribution-NonCommercial-ShareAlike 4.0 International License\u003c/a\u003e.\n\n## Article metadata\n\n### ``article_metadata.csv``\nFrom JSTOR. DOIs can be used for joins across other tables in this dataset.\n\n- ``Title``: string\n- ``Date``: integer\n- ``Volume``: integer\n- ``Issue``: integer\n- ``StartPage``: integer\n- ``EndPage``: integer\n- ``DOI``: string\n\n### ``author_names.csv``\nMaps author indices (used in other tables) to readable names.\n\n- ``Author``: integer (author index)\n- ``Name``: string (readable name)\n\n### ``document_authors.csv``\nRelations between ``article_metadata.csv`` (by DOI) and ``author_names.csv``\n(by author index).\n\n- ``DOI``: string\n- ``Author``: integer\n\n## Geography\n\n\u003e We examined each of the articles in JHB over its entire run, and attempted to\n\u003e identify the physical location of the author at the time of publication, and to\n\u003e determine locations that were discussed in the article content. 
To locate\n\u003e references to locales, we took a single visual pass over each article and noted\n\u003e any references to municipalities, regions, or states, taking care to spend an\n\u003e equitable amount of time on each article. We assume that we found a subset of\n\u003e the total references to locations. We then found the closest match to that\n\u003e location in the GeoNames geographical database (http://www.geonames.org/), and\n\u003e recorded the corresponding Uniform Resource Identifier (URI).\n\n\u003e For the sake of consistency, we used current location identifiers and\n\u003e geopolitical boundaries, which is in some cases highly anachronistic. For\n\u003e example, the name \"Czechia\" was only officially adopted by the Czech Republic in\n\u003e 2016, and the Republic itself has only existed since 1993, yet we have used the\n\u003e current term to tag articles published as early as the 1960s that refer to\n\u003e (historical) Czechoslovakia. For the present high-level analysis this does not\n\u003e have a substantial impact on our results or conclusions. In other studies,\n\u003e however, historians incorporating digital geographic data in their research may\n\u003e find it fruitful to use regional databases of historical place names (e.g. 
The\n\u003e Historical Gazetteer of England's Place Names [http://placenames.org.uk/]).\n\u003e GeoNames itself also has increasing support for historical place names.\n\n\n### ``article_localizations.csv``\nLocation tags for article content and authors.\n\n- ``DOI``: string\n- ``Relation``: string (``content`` or ``author``)\n- ``GeoNamesID``: integer\n\n### ``location_metadata.csv``\nLocation information retrieved from [GeoNames](http://www.geonames.org/).\n\n- ``GeoNamesID``: integer\n- ``Latitude``: float (degrees north of the equator)\n- ``Longitude``: float (degrees east of the prime meridian)\n- ``Name``: string\n- ``CountryCode``: string (ISO 3166-1 alpha-2)\n- ``CountryID``: integer (GeoNames country identifier; missing values are ``-1``)\n\n## Organisms\n\n\u003e We used the LINNAEUS NER model (Gerner et al 2010) to tag references to\n\u003e organisms on each page of JHB. NER is a problem in information retrieval in\n\u003e which the goal is to identify words or phrases in a text that refer to\n\u003e instances of a particular class of entities, such as people, places,\n\u003e institutions, or dates. NER is usually achieved through supervised machine\n\u003e learning, in which a \"training set\" of human-annotated documents is used to\n\u003e train a classifier. LINNAEUS is a dictionary-based NER application, in which a\n\u003e large collection of documents from Medline and PubMed Central that had already\n\u003e been tagged with entries from the NCBI Taxonomy database were used to generate\n\u003e a lexicon of phrases that refer to specific taxa. LINNAEUS matches both formal\n\u003e taxonomic terms (e.g. species binomial names) and common names (e.g. 
\"mouse\").\n\n### ``organisms.csv``\nEntities recognized on each page.\n\n- ``Entity``: integer (NCBI Taxonomy identifier)\n- ``DOI``: string\n- ``Page``: integer (0-indexed)\n- ``Text``: string (matching word or phrase in document)\n- ``Start``: integer (character offset of phrase start)\n- ``End``: integer (character offset of phrase end)\n\n### ``taxon_labels.csv``\nHuman-readable labels for entities in ``organisms.csv``. Taken from the\n``ScientificName`` field in NCBI Taxonomy database.\n\n- ``Entity``: integer (NCBI Taxonomy identifier)\n- ``Label``: string (usually a binomial name)\n\n## Topics\n\n\u003e Prior to model fitting we applied several preparatory transformations to the\n\u003e paginated full text provided by JSTOR. Within each document, we removed running\n\u003e headers from each page. We used an MCMC simulation to locate the bibliography\n\u003e within each article (if one was present) based on the distribution of specific\n\u003e punctuation characters and key terms; if a bibliography was identified, that\n\u003e content was excluded from analysis. We tokenized each page at the level of\n\u003e individual words, removing all punctuation, whitespace, and numeric characters.\n\u003e Since many concepts of interest in JHB are represented by multi-word phrases,\n\u003e we extracted two- to six-word phrases by applying (in three sequential passes)\n\u003e the criterion $\\frac{N_{ij}-5}{N_i+N_j }\u003e0.1N$ where Nij is the number of\n\u003e occurrences of the bigram (word i\n\u003e followed by word j), Ni is the total number of occurrences of word i, and N is\n\u003e the total number of tokens in the whole corpus (Řehůřek and Sojka 2010). 
After\n\u003e conjoining word-parts into phrases as described, we removed tokens of any\n\u003e individual words that (a) occur in the Natural Language ToolKit stopwords list\n\u003e (Bird et al 2009), or (b) occur on more than 6,000 pages (about 1/4 of the\n\u003e corpus).\n\n\u003e We fit a series of topic models to the articles in JHB, treating each page as a\n\u003e separate document and varying the number of topics with . We used the\n\u003e parallelized collapsed Gibbs sampler implemented in the InPhO Vector Space\n\u003e Model package (Murdock 2015), and ran each simulation for 10,000 iterations. In\n\u003e each case the simulation converged successfully.\n\n### ``word_labels.csv``\nThis is the corpus vocabulary; maps \"word\" indices (integer) to human-readable\nlabels. Note that many of the \"words\" in the vocabulary are actually multi-word\nphrases.\n\n- ``Word``: integer\n- ``Label``: string\n\n### ``topic_page_assignments__k[N].csv``\nPosterior probability of topic assignments for each page.\n\n- ``Topic``: integer (0-N)\n- ``DOI``: string\n- ``Page``: integer\n- ``Probability``: float\n- ``Assignments``: float (number of words on page assigned to topic)\n- ``Characteristic``: float (number of times more likely topic is to occur on\n  this page than on a random page in the corpus).\n\n### ``word_topic_assignments__k[N].csv``\nPosterior probability of words given each topic.\n\n- ``Topic``: integer (0-N)\n- ``Word``: integer (word index)\n- ``Probability``: float\n\n### ``topic_article_assignments__k[N].csv``\nData from ``topic_page_assignments__k[N].csv`` aggregated and re-normalized at\nthe article level.\n\n- ``Topic``: integer (0-N)\n- ``DOI``: string\n- ``Probability``: float\n- ``Assignments``: float (number of words in article assigned to topic)\n- ``Characteristic``: float (number of times more likely topic is to occur in\n  this article than on a random page in the 
corpus).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferickpeirson%2Fjhb-data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ferickpeirson%2Fjhb-data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferickpeirson%2Fjhb-data/lists"}