Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-scholarly-data-analysis
A curated collection of resources on scholarly data analysis, including datasets, papers, and code for bibliometrics, citation analysis, and other scholarly commons resources.
https://github.com/napsternxg/awesome-scholarly-data-analysis
-
Journals
- Scientometrics
- Frontiers in Research Metrics and Analytics
- Journal of Informetrics
- Quantitative Science Studies
- Science, Technology and Human Values
- Social Studies of Science
- Science and Public Policy
-
Publication and Citation
- Citation string parsing data for social sciences for English and German citations - [comparison with Grobid and Cermine](https://github.com/exciteproject/Exparser/tree/master/Evaluation/Ours)
- DBLP Discovery Dataset (D3)
- A dataset of publication records for Nobel laureates
- Microsoft Academic Graph
- OpenAlex - Replacement for MAG
- Open Academic Graph - MAG + AMiner
- OpenAIRE Research Graph - More info [here](https://graph.openaire.eu)
- Semantic Scholar Corpus
- CORA datasets for citation string parsing
- Humanities and multilingual citation string parsing Flux-CiM and ICONIP - 018-0242-1) for details
- CrossRef DOI URLs
- DOIboost (Crossref + MAG + ORCID + Unpaywall)
- DBLP Citation dataset
- DBLP XML data
- Scopus Citation Database
- Papers, patents, and grants from Indiana University
- Small Network Data - Mark Newman's Lab
- The Koblenz Network Collection
- Google Scholar citation relations
- Google Scholar Citations data set - [download](http://homes.sice.indiana.edu/filiradi/Data/gsc_data.tar.bz2)
- Open citations project
- Wikicite Project
- Economic Papers
- ArXiv data dump
- ArXiv data on Kaggle
- Complete ACL anthology as bibtex file
- ACL Anthology Reference Corpus
- Astrophysics data system (ADS) - All physics papers
- CORE 37M full text open access papers
- Inspire database for high energy physics articles
- Scholarly Data of workshops and conferences in RDF triplets
- The Collection of Computer Science Bibliographies
- OpenCitations corpus
- COCI Doi-Doi citation data
- DOAJ API (Directory of Open Access Journals)
- ROAD (Directory of Open Access Scholarly Resources)
- Sherpa/Romeo (Publisher copyright policies & self-archiving)
- OpenAPC (fees paid for open access journal articles)
- OSF API (Open Science Framework)
- Digital tools for researchers
- Fatcat - versioned, publicly-editable catalog of research publications
- arXiv CS citation in context
- arXiv fulltext + citations dataset
- Self-citation analysis data based on PubMed Central subset (2002-2005)
- Unpaywalled Corpus - PDF to 23M DOIs
- OpenAIRE Scholexplorer - 126+ Million literature-dataset and dataset-dataset links between 12+ Million objects - [About the data](http://scholexplorer.openaire.eu/index.html#/about)
- Manually annotated citation data from the ACL Anthology into uses, motivation, future, extends, compare or contrast, and background
- iCite - NIH Open Citation Collection
- MEDLINE/PubMed Baseline Repository (MBR) - All Medline abstracts and paper metadata in XML
- American Physical Society Data Sets for Research
- Co-citation networks of all Nature papers
- Structured citations in the English Wikipedia
- COVID-19 Open Research Dataset (CORD-19)
- Citations to scholarly data in various language wikipedias - utilities/python-mwcites)
- 800K publications matched from CrossRef, CORE, and Mendeley with data on publication and open access dates
- Crossref dumps - [data](https://github.com/greenelab/crossref)
- Initiative for Open Abstracts
- Dataset Search: metadata for datasets - Datasets with DOIs and compact identifiers
- Open Syllabus Project
- Sci-Hub Download Logs - [Latest](https://sci-hub.se/stats)
- Sci-Hub databases
- SAGE Rejected article tracker dataset from ArXiv - [Github](https://github.com/sagepublishing/rejected_article_tracker_pkg)
- The Open Research Knowledge Graph (ORKG)
- ACADEMIA INDUSTRY DYNAMICS
- Papers and patents are becoming less disruptive over time - [Paper](https://www.nature.com/articles/s41586-022-05543-x)
- OpenAIRE Research Graph Dump
- OpCitance: Citation contexts identified from the PubMed Central open access articles
- CiteSeer
- NBER Patent Citations
- Microsoft Academic Knowledge Graph - RDF dump
- Arnet Miner
-
Information Extraction and NLP
- The MAPLE Benchmark for Scientific Literature Tagging
- Citation Parsing
- Sentences tagged for Drug Disease pairs
- ACL Anthology human summaries for 1000 papers
- Biomedical NLP annotated datasets
- Chemical compound and drug name recognition task
- Semantic Scholar Dataset
- ScienceIE
- ACL RD TEC 2.0 - 1661)
- SEPID Corpus - Segmented ACL ARC 1.0
- PubMed Central Open Access - BioC
- BioNLP - Argo
- Biomedical NLP - Stav
- GENIA - BioNLP 2011
- Genia Treebank used for SciSpacy training - [SciSpacy link](https://allenai.github.io/scispacy/)
- Full GENIA corpus
- Anatomical Entity Mention (AnEM) corpus
- CellFinder - Entity detection
- Multi-Level Event Extraction (MLEE)
- Biomedical sentence simplification
- Biomedical NER datasets - 017-1776-8)
- Lunar and Planetary Science abstracts for NER and Relations
- ACM data affiliations
- ACM - DBLP database entry matching
- CLEF datasets for multilingual Biomedical NLP+IE
- Coleridge Initiative - Rich text competition
- SciERC - scientific entities, their relations, and coreference clusters for 500 AI conf abstracts
- NER, Parsing, Classification datasets from SciBert
- ACA Wiki - Paper summaries of more than 1600 papers
- SemEval-2018 task 7 Semantic Relation Extraction and Classification in Scientific Papers
- A Compendium of Free, Public Biomedical Text Mining Tools Available on the Web
- PharmaCoNER: Pharmacological Substances, Compounds and proteins and Named Entity Recognition track - [Train](http://temu.bsc.es/pharmaconer/wp-content/uploads/2019/06/train-set_1.1.zip) - [Dev](http://temu.bsc.es/pharmaconer/wp-content/uploads/2019/06/dev-set_1.1.zip) - [Test](http://temu.bsc.es/pharmaconer/wp-content/uploads/2019/06/test-set_1.1.zip) - [Background Test set](http://temu.bsc.es/pharmaconer/wp-content/uploads/2019/05/background-set.zip)
- Bacteria Biotope (BB) Task - NER, NEL, Relation, KB Extraction
- Entity/relation recognition and GOF/LOF mutated gene text identification task based on the Active Gene Annotation Corpus
- The Regulatory Network of Plant Seed Development (SeeDev) Task - NER, Relation
- SeminalSurveyDBLP - Classification of seminal or survey papers
- MedTag: A Collection of Biomedical Annotations - [Download](ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedTag/)
- Open Biomedical corpora
- Biomedical Abstract Meaning Representation corpus based on PubMed Fulltext - Also see [other NLM curated biomedical resources](https://www.nlm.nih.gov/databases/download/data_distrib_main.html)
- SciDTB corpus annotated for argumentation mining - [Paper](https://www.aclweb.org/anthology/W19-4505.pdf)
- ART corpus - 225 papers manually annotated with the CISP labels (i.e. "Goal", "Method", "Result"). - [Browse files](http://www.ukoln.ac.uk/projects/ART_Corpus/menu.html) - [Project details](http://www.ukoln.ac.uk/projects/ART_Corpus/index.html)
- Multi-CoreSC CRA corpus (MCCRA) - 50 papers annotated with multiple CoreSC labels per sentence. - [Project details](http://www.sapientaproject.com/links)
- BioNER corpus
- A Large Parallel Corpus of Full-Text Scientific Articles
- Annotated Corpus of Scientific Conference's Homepages for Information Extraction
- Chi QA - Health Question Answering dataset from NLM
- Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources
- Open Research Knowledge Graph project - [Website](https://www.orkg.org/orkg/)
- A Fully Coreference-annotated Corpus of Scholarly Papers from the ACL Anthology
- A manual corpus of annotated main findings of clinical case reports
- TREC Precision Medicine / Clinical Decision Support Track
- Materials Science Named Entity Recognition: train/development/test sets
- Entities in 3.27 million materials science abstracts
- Normalized entities in material science papers
- Named Entity Recognition for Bacterial Type IV Secretion Systems - [Paper](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0014780#s5)
- Annotating and detecting phenotypic information for chronic obstructive pulmonary disease
- MiRoR11 - P2 - Annotated corpus for primary and reported outcomes extraction
- Data from: PGxCorpus, a Manually Annotated Corpus for Pharmacogenomics
- Multiple PUBMED annotated corpora from iProLink project
- Annotation of phenotypes using ontologies
- The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text - [SPECIES Direct Download](https://species.jensenlab.org/files/S800-1.0.tar.gz) - [ORGANISMS Direct Download](https://organisms.jensenlab.org/Downloads)
- Entity mention in articles used for benchmark
- RAMBO 800+: A Corpus for the Development of Gene/Protein Recognition from Rare and Ambiguous Abbreviations
- GROBID Sequence Labeling data
- Citation Context Classification based on purpose
- Citation Context Classification based on influence
- PubMed knowledge graph (PKG)
- GROBID-NER data
- Multiple NER and Entity Linking data for science
- EuropePMC annotations for entities and relationships
- NLPContributionGraph - Structuring Scholarly NLP Contributions in the Open Research Knowledge Graph
- GROBID NER
- The General Index - Metadata, Ngrams, and Keyphrases in 107,233,728 journal articles
- Pubtrends Review Dataset
- PubMedCentral Author Manuscript Collection
- Paper analyzer pubmed
- SoMeSci - Software Mentions in Science
- NLMChem: a new resource for chemical entity recognition in PubMed full-text literature
- PubMed Classification
- SoftwareKG_Social and SoftwareKG_PubMed - Software mentions in articles
- Bioinformatics Named Entity Recogniser for Databases and Software
- The CodeMeta Project: preservation, discovery, reuse, and attribution of software
- SoMeSci - A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles
- SoftwareKG-PMC: a Knowledge Graph of Software mentions extracted from articles of the PMC Open Access Dataset
- DEAL: Detecting Entities in the Astrophysics Literature
- COMPUTER SCIENCE KNOWLEDGE GRAPH
- SCIERC: Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction - [Code](https://bitbucket.org/luanyi/scierc/src/master/)
- University of Washington BIO NLP datasets
- Microsoft Academic Knowledge Graph (MAKG) - [Zenodo](http://doi.org/10.5281/zenodo.3936556) - [ComplEx entity embeddings (120 GB) for all 243 million authors, 239 million publications, 49,000 journals, and 16,000 conferences](https://makg.org/entity-embeddings/)
- Wikidata:WikiProject Clinical Trials
- PubMed-OA-Extraction-dataset
- Medical Information Extraction dataset
- Dr. Inventor Multi-layer Scientific Corpus for multiple scientific discourse facets
- PubTator Central (PTC) - NLP annotated PMC datasets
-
Networks
- ACL Anthology Network
- I³ Open Innovation Dataset Index - Multiple datasets related to patent networks, inventor careers, etc.
-
Taxonomies and Ontologies of Research Concepts
- Medical Subject Headings
- Computer Science Ontology
- Physics Subject Headings (PhySH) - org/PhySH)
- Open Biological and Biomedical Ontology (OBO)
- ACM Computing Classification System
- Physics and Astronomy Classification Scheme (PACS)
- Mathematics Subject Classification (MSC) - database) and [zbMATH](https://zbmath.org)
- Journal of Economic Literature (JEL)
- Australian and New Zealand Standard Research Classification (ANZSRC) - classification schemes:
- Fields of Research (FoR)
- Research Fields, Courses and Disciplines (RFCD)
- Socio-Economic Objective (SEO)
- Library of Congress Classification (LCC)
- Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources
- GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing - [dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/LXQXAO)
- SciGraph Springer Nature
-
Affiliations
-
Altmetrics and Dimensions
- Altmetrics API
- Dimensions.ai API - [documentation](https://figshare.com/articles/Dimensions_Metrics_API_Documentation/5783694), [example](http://metrics-api.dimensions.ai/doi/10.7717/peerj-cs.119)
- Core Conference Rankings
- China Computer Federation Conference Rankings
-
User interface to publication datasets and analysis
- Google Scholar
- Semantic Scholar
- Microsoft Academic Graph
- GitXiv
- ACL Anthology
- NIPS papers
- Abel tools for PubMed data
- infolis: linking research data and publications
- Metrics toolkit
- Rscopus (R library)
- Scholar (R library)
- Bibliometrix (R library)
- CITAN (R library)
- BibeR (BibeR: A Web-based tool for bibliometric analysis in scientific literature)
- CiteSeer tools
- Publish or Perish - retrieves and analyzes academic citations from MS Academic and Scholar
- CiteSeerX
- Data Set Knowledge Graph (DSKG) - an RDF data set about data sets
- Citation Gecko - Find related papers
-
Tools for classifying research papers
-
Visualizations
-
Citation and metadata extraction
-
Publication and Publisher Info
- Interactive sheet for deciding publication strategy and open science - [Tweet](https://twitter.com/jeroenbosman/status/1492876367976968193)
-
Conferences
- Joint Conference on Digital Libraries (JCDL)
- European Semantic Web Conference (ESWC), Research of Research Track
- STI Conference series (Science and Technology indicators, e.g., 2018)
- ISSI Conference series (International Conference on Scientometrics & Informetrics, e.g., 2019)
- International Conference on Theory and Practice of Digital Libraries (TPDL)
-
Workshops
-
Summer Schools
-
Courses
-
Associations & Community
-
Research Groups
-
Blogs
-
Peer Review
- CiteTracked: A Longitudinal Dataset of Peer Reviews and Citations - Contact Author
- Elsevier's Peer Review Workbench
- ACL-18 Numerical Peer Review Dataset
- Argument Mining for Understanding Peer Reviews
- Argument Mining Driven Analysis of Peer-Reviews Dataset
- Publons review length dataset with 498K reviews - anonymized
- NLPEER: A Unified Resource for the Computational Study of Peer Review
- eLife Open Peer Review Corpus
- PLoS Open Peer Review Corpus
- MDPI Open Peer Review Corpus
-
Grants and Funding
-
Academic Genealogy
- Mathematics Genealogy Project
- Academic Tree - Cross discipline academic genealogies
- MPACT project - Library Sciences
- PhDTree
- Chemistry Genealogy - curated at UIUC
- Notre Dame Genealogy Project
- UIUC Chemistry, Chemical Engineering, and Biochemistry
- Software Engineering Academic Genealogy
- Other lists of genealogy projects
- Wikipedia - Computer Science Genealogy
- Wikipedia - Theoretical Physicists Genealogy
- Wikipedia - Chemists Genealogy
- SCIENTIFIC GENEALOGY MASTER LIST - Scientists Associated with Concepts in Chemistry & Physics
- Economic Genealogy - personal.umich.edu/~alandear/tree/Tree-In.txt)
- A dataset of mentorship in science with semantic and demographic estimations - Used in The academic Great Gatsby Curve paper
-
Author Profiles
- Temporal profiles of PubMed authors
- ORCID data dump
- National Library of Medicine Profiles
- UIUC Professors database - Publications, Affiliations
- Author Profiles of scholarly authors in Wikipedia
- Author name gender and ethnicity dataset based on PubMed
- MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide
- Conceptual novelty scores for PubMed articles
- Database of 100,000 top scientists with standardized information on citations, h-index, co-authorship-adjusted hm-index, citations to papers in different authorship positions, and a composite indicator
- Canadian PhD career survey - [Science report](https://www.sciencemag.org/careers/2018/03/trend-toward-transparency-phd-career-outcomes)
- Data from the CVs of over 150 assistant professors in psychology in top-ranked research universities and small liberal art colleges in the US - [Used in this blog](https://socialsciences.nature.com/users/325112-diego-a-reinero/posts/55118-the-path-to-professorship-by-the-numbers-and-why-mentorship-matters)
- The 4 Universities Data Set - Web pages of CS departments classified for author role (faculty, student, etc.)
- Career long various citation metrics for 100,000 top-scientists
- Open dataset of scholars on Twitter - 500K OpenAlex Author ID to Twitter User Id
- Journal editors dataset
-
Author name disambiguation
- Lee Giles dataset
- Cleaner version of Lee Giles dataset
- DBLP Korean Authors
- Arnet Miner
- Arnet Miner - Manual Name Disambiguation data 210 authors
- Deduped author names on IEEE Vis papers 1990-2018
- Author-ity dataset for PubMed 2009
- ACL Anthology dataset
- Base data for estimating precision and recall of Author-ity among NIH-funded scientists
- ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale
- BibTex Dataset for 1M authors
- Pre-processed PubMed data for a study of coauthorship
- WhoIsWho: Web-Scale Academic Name Disambiguation: the WhoIsWho Benchmark, Leaderboard, and Toolkit - [https://www.aminer.cn/whoiswho](https://www.aminer.cn/whoiswho) - [WhoIsWho Toolkit GitHub](https://github.com/THUDM/WhoIsWho)
- LAGOS-AND: A Large Gold Standard Dataset for Scholarly Author Name Disambiguation - [Github](https://github.com/carmanzhang/LAGOS-AND)
- Chain Dream : Name Disambiguation Task2
-
Thesis datasets
- Open Access Theses and Dissertations
- The Networked Digital Library of Theses and Dissertations (NDLTD)
- PhD Dissertations in the Area of Software Engineering
- ProQuest Dissertations & Theses Global
- History Dissertation Analysis
- ETDs: Virginia Tech Electronic Theses and Dissertations
- DSpace@MIT: a digital repository for MIT's research, including peer-reviewed articles, technical reports, working papers, theses, and more
- The ScanBank Dataset: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations
- ETDMiner: extract metadata from scanned ETD