https://github.com/brandonleekramer/diversity
In this project, my colleague Catherine Lee (Rutgers) and I employ computational text analysis to examine quantitative trends in the use of diversity terms, OMB/Census terms, and other population labels in a sample of 2.6+ million biomedical abstracts spanning the last 30 years.
diversity pubmed python r sql text-mining word-embeddings
- Host: GitHub
- URL: https://github.com/brandonleekramer/diversity
- Owner: brandonleekramer
- License: mit
- Created: 2019-08-15T19:26:59.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2023-01-02T19:56:58.000Z (almost 3 years ago)
- Last Synced: 2025-01-13T06:29:11.718Z (9 months ago)
- Topics: diversity, pubmed, python, r, sql, text-mining, word-embeddings
- Language: HTML
- Homepage: https://riseofdiversity.netlify.app/
- Size: 96.7 MB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
#### The Rise of Diversity and Population Terminology in Biomedical Research
As of: 05-17-2021
This repository provides the source code for Brandon Kramer and Catherine Lee's "The Rise of Diversity and Population Terminology in Biomedical Research." After uploading the PubMed/MEDLINE database with `PubMedPortable` in `Python`, we used `R`'s `tidytext` package to examine trends in the use of diversity terms in more than 2.5 million scientific abstracts from 1990-2020. Overall, our analyses demonstrate that the use of various types of "diversity" and other population terminology, including race and ethnicity, is rising over time. While we provide some preliminary results and a full appendix on our [project website](https://riseofdiversity.netlify.app/), the source code, database, and outputs are detailed below. This project is still in progress and is updated often.
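To make the idea of a term-trend analysis concrete, here is a minimal sketch in Python of counting, per year, the share of abstracts that mention at least one dictionary term. It is not the pipeline in this repository, which uses `R`'s `tidytext`; the function names (`tokenize`, `term_share_by_year`), the toy abstracts, and the three-word term list are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def term_share_by_year(abstracts, terms):
    """Return {year: share of abstracts containing at least one dictionary term}.

    abstracts: iterable of (year, text) pairs
    terms: iterable of single-word dictionary terms
    """
    term_set = {t.lower() for t in terms}
    totals = defaultdict(int)  # abstracts seen per year
    hits = defaultdict(int)    # abstracts with at least one term per year
    for year, text in abstracts:
        totals[year] += 1
        if term_set.intersection(tokenize(text)):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Hypothetical toy data, not drawn from PubMed/MEDLINE.
toy_abstracts = [
    (1990, "We measured enzyme activity in mouse liver tissue."),
    (1990, "Genetic diversity was assessed in a rural cohort."),
    (2020, "Outcomes differed by race and ethnicity in this trial."),
    (2020, "Diversity of the study population improved recruitment."),
]
# Hypothetical term list; the project's dictionaries are far larger.
toy_terms = ["diversity", "race", "ethnicity"]

shares = term_share_by_year(toy_abstracts, toy_terms)
print(shares)  # {1990: 0.5, 2020: 1.0}
```

Normalizing by the number of abstracts per year, rather than reporting raw counts, keeps growth in the literature itself from being mistaken for growth in term usage. Multi-word terms and ambiguous words would need extra handling that this sketch leaves out.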
#### Code structure
├── content (website)
    ├── overview.Rmd
    ├── methods.Rmd
    ├── analyses
        ├── hypothesis1.Rmd
        ├── hypothesis2.Rmd
        ├── hypothesis3.Rmd
├── data
    ├── dictionaries
        ├── preprocessing
            ├── compoundR.csv
            ├── polysemeR.csv
            ├── humanizeR.csv
        ├── h1_dictionary.csv
        ├── h2_dictionary.csv
        ├── h3_dictionary.csv
        ├── tree_data.csv
    ├── journal_rankings
    ├── regression_analyses
    ├── sensitivity_checks
    ├── text_results
        ├── h1_results
        ├── h2_results
        ├── h3_results
    ├── word_embeddings
├── src
    ├── 01_pubmed_db
        ├── 01_download_medline.sh
        ├── 02_pubmed_parser.ipynb
        ├── 03_clean_db.sql
        ├── 04_pubmed_abstract_db.sql
        ├── 05_filtered_publications.R
        ├── 06_articles_per_journal.sql
        ├── 07_articles_per_year.sql
        ├── 08_biomedical_abstracts.sql
        ├── 09_check_abstracts_tbl.sql
    ├── 02_text_trends
        ├── 01_hypothesis1.R
        ├── 02_hypothesis2.R
        ├── 03_hypothesis3.R
        ├── 04_all_hypotheses.slurm
        ├── 05_pub_figures.Rmd
        ├── supplementary_analyses
            ├── 06_aggregate_ids.R
            ├── 07_diversity_abstracts.sql
            ├── 08_diversity_abstracts.R
            ├── 09_soc_diversity_eda.R
            ├── 10_human_abstracts.R
    ├── 03_word_embeddings
        ├── 01_w2v_train.ipynb
        ├── 02_w2v_results.ipynb
    ├── 04_text_relations
        ├── unfinished_analyses
    ├── 05_collaborations
        ├── unfinished_analyses
#### Database structure
├── pubmed_2021
    ├── abstract_data
    ├── articles_per_journal
    ├── articles_per_year
    ├── biomedical_abstracts
    ├── filtered_publications