https://github.com/napsternxg/awesome-scholarly-data-analysis

A curated collection of resources on scholarly data analysis ranging from datasets, papers, and code about bibliometrics, citation analysis, and other scholarly commons resources.
List: awesome-scholarly-data-analysis
Host: GitHub
URL: https://github.com/napsternxg/awesome-scholarly-data-analysis
Owner: napsternxg
Created: 2016-11-15T18:37:43.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2023-11-02T19:02:53.000Z (over 1 year ago)
Last Synced: 2024-05-22T03:00:34.331Z (about 1 year ago)
Homepage: https://shubhanshu.com/awesome-scholarly-data-analysis/
Size: 229 KB
Stars: 170
Watchers: 18
Forks: 26
Open Issues: 2
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Code of conduct: CODE_OF_CONDUCT.md

[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

[![License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://licensebuttons.net/l/zero/1.0/88x31.png)](https://creativecommons.org/publicdomain/zero/1.0/)

# Awesome Scholarly Data Analysis

A list of resources on scholarly data analysis, including datasets, papers, and code for bibliometrics, citation analysis, and other scholarly commons resources.

Available online at https://shubhanshu.com/awesome-scholarly-data-analysis/

# Table of Contents

- [Awesome Scholarly Data Analysis](#awesome-scholarly-data-analysis)

- [Table of Contents](#table-of-contents)

- [Datasets](#datasets)

  * [Publication and Citation](#publication-and-citation)

  * [Peer Review](#peer-review)

  * [Grants and Funding](#grants-and-funding)

  * [Academic Genealogy](#academic-genealogy)

  * [Author Profiles](#author-profiles)

  * [Author name disambiguation](#author-name-disambiguation)

  * [Thesis datasets](#thesis-datasets)

  * [Information Extraction and NLP](#information-extraction-and-nlp)

  * [Networks](#networks)

  * [Taxonomies and Ontologies of Research Concepts](#taxonomies-and-ontologies-of-research-concepts)

  * [Affiliations](#affiliations)

  * [Altmetrics and Dimensions](#altmetrics-and-dimensions)

- [Tools](#tools)

  * [User interface to publication datasets and analysis](#user-interface-to-publication-datasets-and-analysis)

  * [Tools for collecting open access papers](#tools-for-collecting-open-access-papers)

  * [Tools for classifying research papers](#tools-for-classifying-research-papers)

  * [Visualizations](#visualizations)

  * [Language Processing and Information Extraction](#language-processing-and-information-extraction)

  * [Citation and metadata extraction](#citation-and-metadata-extraction)

  * [Publication and Publisher Info](#publication-and-publisher-info)

- [Community](#community)

  * [Journals](#journals)

  * [Conferences](#conferences)

  * [Workshops](#workshops)

  * [Summer Schools](#summer-schools)

  * [Courses](#courses)

  * [Associations & Community](#associations---community)

  * [Research Groups](#research-groups)

  * [Blogs](#blogs)

- [Contributions](#contributions)

Table of contents generated with markdown-toc

# Datasets

## Publication and Citation

* [Arnet Miner](http://aminer.org/citation)

* [Microsoft Academic Graph](https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/)

* [OpenAlex - Replacement for MAG](https://openalex.org/)

* [Open Academic Graph](https://www.openacademic.ai/oag/) - MAG + AMiner

* [OpenAIRE Research Graph](https://zenodo.org/record/4279381) - More info [here](https://graph.openaire.eu)

* [Semantic Scholar Corpus](https://www.semanticscholar.org/product/api)

* [CiteSeer](https://csxstatic.ist.psu.edu/downloads/data)

* [PubMed](https://www.ncbi.nlm.nih.gov/pubmed)

* [CORA datasets for citation string parsing](https://people.cs.umass.edu/~mccallum/data.html)

* [Humanities and multilingual citation string parsing Flux-CiM and ICONIP](https://github.com/knmnyn/ParsCit/tree/master/doc) see [Neural ParsCit paper](https://link.springer.com/article/10.1007/s00799-018-0242-1) for details

* [Citation string parsing data for social sciences for English and German citations](https://github.com/exciteproject/EXgoldstandard) - [comparison with Grobid and Cermine](https://github.com/exciteproject/Exparser/tree/master/Evaluation/Ours)

* [CrossRef DOI URLs](https://archive.org/details/doi-urls)

* [DOIboost (Crossref + MAG + ORCID + Unpaywall)](https://zenodo.org/record/3559699)

* [DBLP Citation dataset](https://kdl.cs.umass.edu/display/public/DBLP)

* [DBLP XML data](https://dblp.org/xml/release/)

* [DBLP Discovery Dataset (D3)](https://github.com/gipplab/d3-dataset)

* [NBER Patent Citations](http://nber.org/patents/)

* [Scopus Citation Database](https://www.elsevier.com/solutions/scopus)

* [Papers, patents, and grants from Indiana University](http://iv.slis.indiana.edu/db/index.html)

* [Small Network Data - Mark Newman's Lab](http://www-personal.umich.edu/~mejn/netdata/)

* [The Koblenz Network Collection](http://konect.uni-koblenz.de/)

* [Google Scholar citation relations](http://www3.cs.stonybrook.edu/~leman/data/gscholar.db)

* [Google Scholar Citations data set](http://homes.sice.indiana.edu/filiradi/resources.html) [direct-download](http://homes.sice.indiana.edu/filiradi/Data/gsc_data.tar.bz2)

* [Open citations project](http://opencitations.net/)

* [Wikicite Project](https://meta.wikimedia.org/wiki/WikiCite)

* [Economic Papers](http://repec.org/)

* [ArXiv data dump](https://arxiv.org/help/bulk_data)

* [ArXiv data on Kaggle](https://www.kaggle.com/Cornell-University/arxiv)

* [EuropePMC](http://europepmc.org/)

* [Complete ACL anthology as bibtex file](http://aclanthology.info/anthology.bib)

* [ACL Anthology Reference Corpus](http://acl-arc.comp.nus.edu.sg/)

* [Astrophysics data system (ADS) - All physics papers](https://ui.adsabs.harvard.edu/)

* [CORE 37M full text open access papers](https://core.ac.uk/services#dataset)

* [Inspire database for high energy physics articles](http://inspirehep.net/?ln=en)

* [Scholarly Data of workshops and conferences in RDF triplets](http://www.scholarlydata.org/)

* [The Collection of Computer Science Bibliographies](http://liinwww.ira.uka.de/bibliography/)

* [OpenCitations corpus](http://opencitations.net/corpus)

* [COCI Doi-Doi citation data](https://figshare.com/articles/Crossref_Open_Citation_Index_CSV_dataset_of_all_the_citation_data/6741422/2)

* [DOAJ API (Directory of Open Access Journals)](https://doaj.org/api/v1/docs)

* [ROAD (Directory of Open Access Scholarly Resources)](https://road.issn.org/)

* [Sherpa/Romeo (Publisher copyright policies & self-archiving)](http://www.sherpa.ac.uk/romeo/index.php)

* [OpenAPC (fees paid for open access journal articles)](https://www.intact-project.org/openapc/)

* [OSF API (Open Science Framework)](https://developer.osf.io/)

* [Digital tools for researchers](http://connectedresearchers.com)

* [Fatcat - versioned, publicly-editable catalog of research publications](https://fatcat.wiki/)

* [Microsoft Academic Knowledge Graph - RDF dump](http://ma-graph.org/)

* [arXiv CS citation in context](http://citation-recommendation.org/publications/#A_High-Quality_Gold_Standard_for_Citation-based_Tasks)

* [arXiv fulltext + citations dataset](https://zenodo.org/record/2609187#.XKe86JhKh3g)

* [Self-citation analysis data based on PubMed Central subset (2002-2005)](https://doi.org/10.13012/B2IDB-9665377_V1)

* [Unpaywalled Corpus - PDF to 23M DOIs](https://unpaywall.org/products/snapshot) [Data Schema](https://unpaywall.org/data-format)

* [A dataset of publication records for Nobel laureates](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6NJ5RN) - [paper](https://www.nature.com/articles/s41597-019-0033-6#Abs1)

* [OpenAIRE Scholexplorer - 126+ Million literature-dataset and dataset-dataset links between 12+ Million objects](https://zenodo.org/record/2674330#.XOU6gshKh3g) - [About the data](http://scholexplorer.openaire.eu/index.html#/about)

* [Manually annotated citation data from the ACL Anthology into uses, motivation, future, extends, compare or contrast, and background](http://jurgens.people.si.umich.edu/citation-function/)

* [iCite - NIH Open Citation Collection](https://nih.figshare.com/collections/iCite_Database_Snapshots_NIH_Open_Citation_Collection_/4586573)

* [MEDLINE/PubMed Baseline Repository (MBR) - All Medline abstracts and paper meta-data in XML](https://mbr.nlm.nih.gov)

* [American Physical Society Data Sets for Research](https://journals.aps.org/datasets)

* [Co-citation networks of all Nature papers](https://www.nature.com/immersive/d41586-019-03165-4/index.html)

* [Semantic Scholar Graph of References in Context (GORC) dataset](https://github.com/allenai/s2-gorc/)

* [Multiple journal publication datasets](https://github.com/dmsquare/tube)

* [Structured citations in the English Wikipedia](https://zenodo.org/record/55004#.XgLol7dOmdM)

* [ICSR Lab (free for researchers) for Scopus and PlumX use](https://www.elsevier.com/icsr/icsrlab/features)

* [COVID-19 Open Research Dataset (CORD-19)](https://pages.semanticscholar.org/coronavirus-research)

* [PaperRobot - includes PubMed Paper Reading Dataset](https://github.com/EagleW/PaperRobot)

* [SciMag - Microsoft Academic Linked to SciMago Journals](https://github.com/scimag/sciMAG2015) - [WebPage](https://scimag.github.io/sciMAG2015/)

* [SciGraph Springer Nature](https://scigraph.springernature.com/explorer/downloads/)

* [Citations to scholarly data in various language wikipedias](https://figshare.com/articles/Citations_with_identifiers_in_Wikipedia/1299540) [Code](https://github.com/mediawiki-utilities/python-mwcites)

* [800K publications matched from CrossRef, CORE, and Mendeley with data on publication and open access dates](https://zenodo.org/record/2605409#.Xr2jZxNKjRY)

* [Coronavirus Open Citations Dataset](https://opencitations.github.io/coronavirus/)

* [Crossref dumps](https://archive.org/search.php?query=creator%3A%22Crossref%22) [DOI meta-data](https://github.com/greenelab/crossref)

* [S2ORC: The Semantic Scholar Open Research Corpus - 12.7M full text papers](https://github.com/allenai/s2orc/)

* [Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia](https://github.com/Harshdeep1996/cite-classifications-wiki)

* [Microsoft Academic Data for conducting covid-19 research](https://github.com/microsoft/mag-covid19-research-examples)

* [Initiative for Open Abstracts](https://i4oa.org/#openabstracts)

* [Dataset Search: metadata for datasets - Datasets with DOIs and compact identifiers](https://www.kaggle.com/googleai/dataset-search-metadata-for-datasets)

* [Open Syllabus Project](https://opensyllabus.org/)

* [Journal Causal effect in Citations](https://github.com/vtraag/journal-causal-effect-replication)

* [Sci-Hub Download Logs](https://zenodo.org/record/1158301#.YgiebbrMJ3g) - [Latest](https://sci-hub.se/stats)

* [Sci-Hub databases](https://sci-hub.se/database)

* [SAGE Rejected article tracker dataset from ArXiv](https://zenodo.org/record/5122848) - [Github](https://github.com/sagepublishing/rejected_article_tracker_pkg)

* [The Open Research Knowledge Graph (ORKG)](https://www.orkg.org/orkg/)

* [ACADEMIA INDUSTRY DYNAMICS](https://aida.kmi.open.ac.uk/)

* [Test of Time Awards](https://github.com/LCS2-IIITD/influence-dispersion)

* [ACL-Cite-Net](https://github.com/iamjanvijay/acl-cite-net)

* [The DBLP Discovery Dataset (D3): A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research](https://github.com/jpwahle/lrec22-d3-dataset) [Zenodo](https://zenodo.org/record/7069915)

* [Papers and patents are becoming less disruptive over time](https://zenodo.org/record/7258379) - [Paper](https://www.nature.com/articles/s41586-022-05543-x)

* [OpenAIRE Research Graph Dump](https://zenodo.org/record/7488618)

* [OpCitance: Citation contexts identified from the PubMed Central open access articles](https://databank.illinois.edu/datasets/IDB-7312599)

* [A large dataset of scientific text reuse in Open-Access publications](https://github.com/webis-de/scidata22-stereo-scientific-text-reuse)

* [Rich Context Text Analysis Competition](https://github.com/Coleridge-Initiative/rich-context-competition)

## Peer Review

* [PeerRead - paper drafts, reviews, and accept/reject decision](https://github.com/allenai/PeerRead)

* [CiteTracked: A Longitudinal Dataset of Peer Reviews and Citations - Contact Author](http://ceur-ws.org/Vol-2414/paper12.pdf)

* [Elsevier's Peer Review Workbench](https://lab.icsr.net/icsr_lab/workbenches.html)

* [ACL-18 Numerical Peer Review Dataset](https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2639?locale-attribute=en_US)

* [Argument Mining for Understanding Peer Reviews](https://xinyuhua.github.io/Resources/naacl19/)

* [APE: Argument Pair Extraction - Annotated ICLR 2013-2020 review-rebuttal argument pair](https://github.com/LiyingCheng95/ArgumentPairExtraction)

* [Argument Mining Driven Analysis of Peer-Reviews Dataset](https://doi.org/10.5281/zenodo.4314390)

* [Publons review length dataset with 498K reviews - anonymized](https://clarivate.com/blog/its-not-the-size-that-matters/)

* [Peer review analyze: A novel benchmark resource for computational analysis of peer reviews](https://github.com/Tirthankar-Ghosal/Peer-Review-Analyze-1.0)

* [Open Editors: data about scholarly journals' editors and editorial board members](https://openeditors.ooir.org/) - [Github](https://github.com/andreaspacher/openeditors)

* [NLPEER: A Unified Resource for the Computational Study of Peer Review](https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/3618)

* [eLife Open Peer Review Corpus](https://doi.org/10.18150/FKPEQN)

* [PLoS Open Peer Review Corpus](https://doi.org/10.18150/KZHVGE)

* [MDPI Open Peer Review Corpus](https://doi.org/10.18150/D5L2EK)

## Grants and Funding

* [GrantExplorer: a free, open-source tool for examining the phrases funded by U.S. federal agencies](https://www.grantexplorer.org/?about=1&org=dod)

* [USASpending.gov: Award Data Archive](https://www.usaspending.gov/download_center/award_data_archive)

* [NIH research funding](https://report.nih.gov/databases)

* [Authors linked to PIs in NIH Grants](http://abel.lis.illinois.edu/cgi-bin/authorlink/search.pl)

## Academic Genealogy

* [Mathematics Genealogy Project](https://genealogy.math.ndsu.nodak.edu/index.php)

* [Academic Tree - Cross discipline academic genealogies](http://academictree.org)

* [MPACT project - Library Sciences](http://www.ibiblio.org/mpact/)

* [PhDTree](http://phdtree.org/)

* [Chemistry Genealogy - curated at UIUC](http://www.scs.illinois.edu/~mainzv/Web_Genealogy/index.php)

* [Notre Dame Genealogy Project](http://library.nd.edu/physics/resources/genealogy/genealogyproject.shtml)

* [UIUC Chemistry, Chemical Engineering, and Biochemistry](http://www.scs.illinois.edu/alumnilist/)

* [Software Engineering Academic Genealogy](http://web.engr.illinois.edu/~taoxie/sefamily.htm)

* [Other lists of genealogy projects](http://libguides.caltech.edu/c.php?g=512661&p=3502496)

* [Wikipedia - Computer Science Genealogy](https://en.wikipedia.org/wiki/Academic_genealogy_of_computer_scientists)

* [Wikipedia - Theoretical Physicists Genealogy](https://en.wikipedia.org/wiki/Academic_genealogy_of_theoretical_physicists)

* [Wikipedia - Chemists Genealogy](https://en.wikipedia.org/wiki/Academic_genealogy_of_chemists)

* [SCIENTIFIC GENEALOGY MASTER LIST - Scientists Associated with Concepts in Chemistry & Physics](http://www.careerchem.com/NAMED/Genealogy-List.pdf)

* [Economic Genealogy](http://www-personal.umich.edu/~alandear/tree/INDEX.HTM) [Text Format](http://www-personal.umich.edu/~alandear/tree/Tree-In.txt)

* [S2AMP : Semantic Scholar Analysis of Mentorship Dataset](https://github.com/allenai/S2AMP-data)

* [MENTORSHIP - A dataset of mentorship in science with semantic and demographic estimations](https://doi.org/10.5281/zenodo.4917086) - [Code](https://github.com/sciosci/AFT-MAG)

* [A dataset of mentorship in science with semantic and demographic estimations - Used in The academic Great Gatsby Curve paper](https://zenodo.org/records/4917086)

## Author Profiles

* [Temporal profiles of PubMed authors](http://abel.lis.illinois.edu/legolas/)

* [ORCID data dump](https://orcid.org/content/download-file)

* [National Library of Medicine Profiles](https://profiles.nlm.nih.gov/)

* [UIUC Professors database - Publications, Affiliations](https://experts.illinois.edu/)

* [Author Profiles of scholarly authors in Wikipedia](https://scholia.toolforge.org/)

* [Career Transitions of CS students](https://github.com/tsafavi/career-transitions-data)

* [Author name gender and ethnicity dataset based on PubMed](https://doi.org/10.13012/B2IDB-9087546_V1)

* [MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide](https://doi.org/10.13012/B2IDB-4354331_V1)

* [Conceptual novelty scores for PubMed articles](https://doi.org/10.13012/B2IDB-5060298_V1)

* [Database of 100,000 top scientists with standardized information on citations, h-index, co-authorship-adjusted hm-index, citations to papers in different authorship positions, and a composite indicator](https://data.mendeley.com/datasets/btchxktzyw/1)

* [Canadian PhD career survey](https://www.sgs.utoronto.ca/about/10000-phds-project-overview/10kphds-dashboard/) - [Science report](https://www.sciencemag.org/careers/2018/03/trend-toward-transparency-phd-career-outcomes)

* [Data from the CVs of over 150 assistant professors in psychology in top-ranked research universities and small liberal arts colleges in the US](https://osf.io/z8rhg/) - [Used in this blog](https://socialsciences.nature.com/users/325112-diego-a-reinero/posts/55118-the-path-to-professorship-by-the-numbers-and-why-mentorship-matters)

* [Wikidata Author Disambiguation Dataset](https://github.com/arthurpsmith/author-disambiguator)

* [The 4 Universities Data Set - Web pages of CS departments classified for author role (faculty, student, etc.)](http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/)

* [Journal editors dataset](https://openeditors.ooir.org/)

* [Career long various citation metrics for 100,000 top-scientists](https://data.mendeley.com/datasets/btchxktzyw/2)

* [Network-Data-Career-Transitions - two anonymized network datasets of post-PhD career transitions and trajectories in computing research](https://github.com/GemsLab/Network-Data-Career-Transitions)

* [Open dataset of scholars on Twitter - 500K OpenAlex Author ID to Twitter User Id](https://zenodo.org/record/7013518#.ZAsZei9Xac4)

* [Gender Inequities in the Online Dissemination of Scholars’ Work](https://github.com/LINK-NU/PNAS-Online-Dissemination-Gender)

## Author name disambiguation

* [INSPIRE dataset](https://github.com/glouppe/paper-author-disambiguation)

* [Lee Giles dataset](http://clgiles.ist.psu.edu/data/nameset_author-disamb.tar.zip)

* [Cleaner version of Lee Giles dataset](https://figshare.com/articles/DBLPderived_labeled_data_for_author_name_disambiguation/6840281)

* [DBLP Korean Authors](http://www.lbd.dcc.ufmg.br/lbd/collections/disambiguation/DBLP.tar.gz/at_download/file)

* [Arnet Miner](http://arnetminer.org/lab-datasets/disambiguation/rich-author-disambiguation-data.zip)

* [Arnet Miner - Manual Name Disambiguation data 210 authors](https://aminer.org/na-data)

* [DBLP Name disambiguation dataset](https://github.com/yaya213/DBLP-Name-Disambiguation-Dataset) - [Error corrected version](https://figshare.com/articles/dataset/DBLP-derived_labeled_data_for_author_name_disambiguation/6840281)

* [rexa-coref-data](https://github.com/tapilab/rexa-coref-data)

* [Deduped author names on IEEE Vis papers 1990-2018](https://sites.google.com/site/vispubdata/home)

* [Author-ity dataset for PubMed 2009](https://doi.org/10.13012/B2IDB-4370459_V1)

* [ACL Anthology dataset](https://github.com/acl-org/acl-anthology/blob/master/data/yaml/name_variants.yaml)

* [Base data for estimating precision and recall of Author-ity among NIH-funded scientists](https://figshare.com/articles/dataset/PLoS_2016_csv/3407461/1)

* [ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale](https://figshare.com/articles/dataset/ORCID-Linked_Labeled_Data_for_Evaluating_Author_Name_Disambiguation_at_Scale/13404986)

* [S2AND - Semantic Scholar Author Name Disambiguation Tool and Dataset](https://github.com/allenai/S2AND)

* [BibTex Dataset for 1M authors](http://www.iesl.cs.umass.edu/data/data-bibtex)

* [Ethnicity sensitive author disambiguation from INSPIRE HEP](https://github.com/glouppe/paper-author-disambiguation)

* [Pre-processed PubMed data for a study of coauthorship](https://zenodo.org/record/345934)

* [WhoIsWho: Web-Scale Academic Name Disambiguation: the WhoIsWho Benchmark, Leaderboard, and Toolkit](http://whoiswho.biendata.xyz/) - [https://www.aminer.cn/whoiswho](https://www.aminer.cn/whoiswho) - [WhoIsWho Toolkit GitHub](https://github.com/THUDM/WhoIsWho)

* [LAGOS-AND: A Large Gold Standard Dataset for Scholarly Author Name Disambiguation](https://zenodo.org/record/7313380) - [Github](https://github.com/carmanzhang/LAGOS-AND)

* [Chain Dream : Name Disambiguation Task2](https://www.biendata.xyz/competition/chaindream_nd_task2/)

## Thesis datasets

* [Open Access Theses and Dissertations](https://oatd.org/)

* [The Networked Digital Library of Theses and Dissertations (NDLTD)](http://www.ndltd.org/)

* [PhD Dissertations in the Area of Software Engineering](http://www.sigsoft.org/dissertations.php)

* [ProQuest Dissertations & Theses Global](http://www.proquest.com/products-services/pqdtglobal.html)

* [History Dissertation Analysis](https://osf.io/v4ysh/)

* [Peer-making: the interconnections between PhD Thesis Committee membership and co-publishing](https://github.com/Marion-Mai/peer-making) - [Zenodo](https://doi.org/10.5281/zenodo.4966081)

* [DISAPERE: A Dataset for DIscourse Structure in Academic PEer REview](https://github.com/nnkennard/DISAPERE)

* [ETDs: Virginia Tech Electronic Theses and Dissertations](https://vtechworks.lib.vt.edu/handle/10919/5534)

* [DSpace@MIT: a digital repository for MIT's research, including peer-reviewed articles, technical reports, working papers, theses, and more](https://dspace.mit.edu/)

* [The ScanBank Dataset: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations](https://zenodo.org/record/4663578)

* [ETDMiner: extract metadata from scanned ETD](https://github.com/lamps-lab/ETDMiner/tree/master) [Google Drive](https://drive.google.com/drive/folders/1y6cADt2JJvNA10wnmlGBeMBJJrrBo6RV)

## Information Extraction and NLP

* [Citation Parsing](http://csxstatic.ist.psu.edu/about/scholarly-information-extraction)

* [Citation Parsing in humanities](https://github.com/dhlab-epfl/LinkedBooksDeepReferenceParsing)

* [Sentences tagged for Drug Disease pairs](https://github.com/roamanalytics/roamresearch/tree/master/BlogPosts/Features_for_healthcare)

* [Document Summarization and citation span identification](https://github.com/WING-NUS/scisumm-corpus)

* [ACL Anthology human summaries for 1000 papers](https://michiyasunaga.github.io/projects/scisumm_net/)

* [Keyphrase Extraction](https://github.com/snkim/AutomaticKeyphraseExtraction)

* [Related Work Summarization](https://github.com/WING-NUS/RelatedWorkSummarizationDataset)

* [Biomedical NLP annotated datasets](https://www.ncbi.nlm.nih.gov/research/bionlp/Data/)

* [Chemical compound and drug name recognition task](http://www.biocreative.org/tasks/biocreative-iv/chemdner/)

* [Semantic Scholar Dataset](https://allenai.org/data/data-all.html)

* [ScienceIE](https://scienceie.github.io/)

* [ACL RD TEC 2.0](http://pars.ie/lr/acl_rd-tec) also at [@CLARIN](https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-1661)

* [SEPID Corpus - Segmented ACL ARC 1.0](http://pars.ie/lr/sepid-corpus)

* [PubMed Central Open Access - BioC](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/)

* [PubMed Fulltext - protein-protein and genetic interactions](http://bioc.sourceforge.net/BioC-BioGRID.html)

* [BioNLP - Argo](http://argo.nactem.ac.uk/bioc/)

* [Biomedical NLP - Stav](http://corpora.informatik.hu-berlin.de/)

* [GENIA - BioNLP 2011](http://2011.bionlp-st.org/home)

* [Genia Treebank used for SciSpacy training](https://nlp.stanford.edu/~mcclosky/biomedical.html) - [SciSpacy link](https://allenai.github.io/scispacy/)

* [Full GENIA corpus](http://www.geniaproject.org/genia-corpus/term-corpus)

* [Anatomical Entity Mention (AnEM) corpus](http://www.nactem.ac.uk/anatomy/)

* [CellFinder - Entity detection](https://www.informatik.hu-berlin.de/de/forschung/gebiete/wbi/resources/cellfinder)

* [Multi-Level Event Extraction (MLEE)](http://nactem.ac.uk/MLEE/)

* [Biomedical sentence simplification](https://research.bioinformatics.udel.edu/isimp/corpus.html)

* [PubMed - Colorado Richly Annotated Full-Text](https://github.com/UCDenver-ccp/CRAFT)

* [Biomedical NER datasets](https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data) [related publication](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1776-8)

* [BioVerbNet](https://github.com/cambridgeltl/bio-verbnet)

* [Lunar and Planetary Science abstracts for NER and Relations](https://zenodo.org/record/1048419#.XAW0m2hKh3h)

* [ACM data affiliations](https://dbs.uni-leipzig.de/en/research/projects/bibliometrics)

* [ACM - DBLP database entry matching](https://dbs.uni-leipzig.de/en/research/projects/object_matching/fever/benchmark_datasets_for_entity_resolution)

* [Colorado Richly Annotated Full-Text](https://github.com/UCDenver-ccp/CRAFT) - PubMed abstract annotated with entities mapped to 10 biomedical ontology terms. 

* [CLEF datasets for multilingual Biomedical NLP+IE](https://sites.google.com/site/clefehealth/home)

* [MedMentions - UMLS entities in PubMed](https://github.com/chanzuckerberg/MedMentions)

* [Coleridge Initiative - Rich Context Competition](https://coleridgeinitiative.org/richcontextcompetition#phase1)

* [SciERC - scientific entities, their relations, and coreference clusters for 500 AI conf abstracts](http://nlp.cs.washington.edu/sciIE/)

* [PubMed200k_RCT - Label abstract sentences into Objective, Background, Method, Results, Conclusions](https://github.com/Franck-Dernoncourt/pubmed-rct)

* [NER, Parsing, Classification datasets from SciBert](https://github.com/allenai/scibert/tree/master/data)

* [ACA Wiki - Paper summaries of more than 1600 papers](https://acawiki.org/Home)

* [SemEval-2018 task 7 Semantic Relation Extraction and Classification in Scientific Papers](https://competitions.codalab.org/competitions/17422#learn_the_details-subtasks)

* [A Compendium of Free, Public Biomedical Text Mining Tools Available on the Web](http://arrowsmith.psych.uic.edu/arrowsmith_uic/tools.html)

* [Medical Information Extraction from PubMed abstracts](https://www.figure-eight.com/dataset/medical-sentence-summary-and-relation-extraction/)

* [Corpus of 40 scientific papers manually annotated by multiple scientific discourse facets](http://sempub.taln.upf.edu/dricorpus)

* [PharmaCoNER: Pharmacological Substances, Compounds and proteins and Named Entity Recognition track](http://temu.bsc.es/pharmaconer/index.php/data/) - [Train](http://temu.bsc.es/pharmaconer/wp-content/uploads/2019/06/train-set_1.1.zip) - [Dev](http://temu.bsc.es/pharmaconer/wp-content/uploads/2019/06/dev-set_1.1.zip) - [Test](http://temu.bsc.es/pharmaconer/wp-content/uploads/2019/06/test-set_1.1.zip) - [Background Test set](http://temu.bsc.es/pharmaconer/wp-content/uploads/2019/05/background-set.zip)

* [Bacteria Biotope (BB) Task - NER, NEL, Relation, KB Extraction](https://sites.google.com/view/bb-2019/dataset?authuser=0)

* [Entity/relation recognition and GOF/LOF mutated gene text identification task based on the Active Gene Annotation Corpus](https://sites.google.com/view/bionlp-ost19-agac-track/description?authuser=0)

* [The Regulatory Network of Plant Seed Development (SeeDev) Task - NER, Relation](https://sites.google.com/view/seedev2019/home?authuser=0)

* [TalkSumm - Summary of papers via alignment to talks](https://github.com/levguy/talksumm)

* [SeminalSurveyDBLP - Classification of seminal or survey papers](https://zenodo.org/record/3258164#.XWac_-hKh3g)

* [Supp.ai - PubMed supplement-drug interactions and supplement-supplement interactions](https://github.com/lucylw/supp-ai-extracted-sdi-data/)

* [GENETAG](https://github.com/openbiocorpora/genetag) - More recent versions [Publication](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-6-S1-S3) and [Download 2005](ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/GENETAG.tar.gz)

* [MedTag: A Collection of Biomedical Annotations](https://www.aclweb.org/anthology/W05-1305) - [Download](ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedTag/)

* [Open Biomedical corpora](https://github.com/openbiocorpora)

* [Biomedical Abstract Meaning Representation corpus based on PubMed Fulltext](https://web.archive.org/web/20170501120500/http://amr.isi.edu/download.html) - Also see [other NLM curated biomedical resources](https://www.nlm.nih.gov/databases/download/data_distrib_main.html)

* [SciDTB: Discourse Dependency TreeBank for Scientific Abstracts](https://github.com/PKU-TANGENT/SciDTB)

* [SciDTB corpus annotated for argumentation mining](http://scientmin.taln.upf.edu/argmin/) - [Paper](https://www.aclweb.org/anthology/W19-4505.pdf)

* [Dr. Inventor Multi-layer Scientific Corpus for multiple scientific discourse facets](http://sempub.taln.upf.edu/dricorpus)

* [ART corpus - 225 papers manually annotated with the CISP labels (i.e. "Goal", "Method", "Result").](https://www.aber.ac.uk/en/media/departmental/computerscience/cb/art/gz/ART_Corpus.tar.gz) - [Browse files](http://www.ukoln.ac.uk/projects/ART_Corpus/menu.html) - [Project details](http://www.ukoln.ac.uk/projects/ART_Corpus/index.html)

* [Multi-CoreSC CRA corpus (MCCRA) - 50 papers annotated with multiple CoreSC labels per sentence.](http://www.sapientaproject.com/wp-content/uploads/2016/05/consensus_annotated.zip) - [Project details](http://www.sapientaproject.com/links)

* [PubMedQA - Question answering on PubMed](https://github.com/pubmedqa/pubmedqa)

* [Corposaurus - Collection of biomedical corpus for NER](https://corposaurus.github.io/corpora/)

* [BioNER corpus](https://github.com/xhuang28/NewBioNer/tree/master/corpus)

* [NeuroQuery - 14,000 full-text publications and 400,000 peak activations](https://github.com/neuroquery/neuroquery_data) - [NeuroQuery website](https://neuroquery.org/about)

* [Medical Information Extraction dataset](https://www.figure-eight.com/dataset/medical-sentence-summary-and-relation-extraction/)

* [A Large Parallel Corpus of Full-Text Scientific Articles](https://figshare.com/s/091fcaf8ad66a3304e90)

* [Annotated Corpus of Scientific Conference's Homepages for Information Extraction](https://archive.org/details/conferences-data-0.2)

* [Chi QA - Health Question Answering dataset from NLM](https://chiqa.nlm.nih.gov/)

* [Corpus of Open Access articles from multiple fields in Science, Technology, and Medicine - Includes wikification data](https://github.com/elsevierlabs/OA-STM-Corpus)

* [Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources](https://gitlab.com/TIBHannover/orkg/orkg-nlp/tree/master/STEM-ECR-v1.0)

* [Open Research Knowledge Graph project](https://gitlab.com/TIBHannover/orkg) - [Website](https://www.orkg.org/orkg/)

* [Academic PhraseBank](http://www.phrasebank.manchester.ac.uk/)

* [SciKG - Statement extraction datasets](https://github.com/dmsquare/SciKG) 

* [A Fully Coreference-annotated Corpus of Scholarly Papers from the ACL Anthology](https://www.aclweb.org/anthology/C12-2103/)

* [A manual corpus of annotated main findings of clinical case reports](https://academic.oup.com/database/article/doi/10.1093/database/bay143/5290151#supplementary-data)

* [TREC Precision Medicine / Clinical Decision Support Track](http://www.trec-cds.org/2019.html)

* [Lots of biomedical entity linking and entity identification datasets](https://github.com/izuna385/datasets)

* [Materials Science Named Entity Recognition: train/development/test sets](https://doi.org/10.6084/m9.figshare.8184428.v1)

* [Entities in 3.27 million materials science abstracts](https://figshare.com/articles/Entities_database/8184413)

* [Normalized entities in materials science papers](https://figshare.com/articles/Entity_Normalization/8184365)

* [Named Entity Recognition for Bacterial Type IV Secretion Systems](https://doi.org/10.1371/journal.pone.0014780.s002) - [Paper](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0014780#s5)

* [Annotating and detecting phenotypic information for chronic obstructive pulmonary disease](https://datadryad.org/stash/dataset/doi:10.5061/dryad.g35948t)

* [MiRoR11 - P2 - Annotated corpus for primary and reported outcomes extraction](https://doi.org/10.5281/zenodo.3234811)

* [Data from: PGxCorpus, a Manually Annotated Corpus for Pharmacogenomics](https://doi.org/10.6084/m9.figshare.7633343.v1)

* [Multiple PUBMED annotated corpora from iProLink project](https://research.bioinformatics.udel.edu/iprolink/corpora.php)

* [Mars Target Encyclopedia - LPSC abstracts labeled data set](https://doi.org/10.5281/zenodo.1048418)

* [Annotation of phenotypes using ontologies](https://doi.org/10.5281/zenodo.1246697)

* [The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0065390) - [SPECIES Direct Download](https://species.jensenlab.org/files/S800-1.0.tar.gz) - [ORGANISMS Direct Download](https://organisms.jensenlab.org/Downloads)

* [Entity mention in articles used for benchmark](https://figshare.com/articles/Entity_mention_in_articles_used_for_benchmark/5620417)

* [RAMBO 800+: A Corpus for the Development of Gene/Protein Recognition from Rare and Ambiguous Abbreviations](https://doi.org/10.4119/unibi/2673424)

* [Medical Relation Extraction - CrowdTruth](https://github.com/CrowdTruth/Medical-Relation-Extraction)

* [KP20k - Keyphrase extraction on 20k abstracts](https://github.com/memray/seq2seq-keyphrase)

* [Named Entity Recognition: (17.3 MB), 8 datasets on biomedical named entity recognition](https://github.com/dmis-lab/biobert#datasets)

* [Relation Extraction: (2.5 MB), 2 datasets on biomedical relation extraction](https://github.com/dmis-lab/biobert#datasets)

* [Question Answering: (5.23 MB), 3 datasets on biomedical question answering task](https://github.com/dmis-lab/biobert#datasets)

* [SciREX : A Challenge Dataset for Document-Level Information Extraction](https://github.com/allenai/SciREX)

* [Papers with Code - Links between papers and repositories and extraction of SOTA results](https://github.com/paperswithcode/paperswithcode-data)

* [Citation Context Classification based on purpose](https://www.kaggle.com/c/3c-shared-task-purpose/)

* [Citation Context Classification based on influence](https://www.kaggle.com/c/3c-shared-task-influence/)

* [PubMed knowledge graph (PKG)](http://er.tacc.utexas.edu/datasets/ped) - [Figshare](https://figshare.com/s/6327a55355fc2c99f3a2)

* [Citation and Header Datasets](https://csxstatic.ist.psu.edu/downloads/data)

* [Grobid-NER data](https://github.com/kermitt2/grobid-ner/tree/master/grobid-ner/resources/dataset)

* [Multiple NER and Entity Linking data for science](https://github.com/kermitt2/entity-fishing/tree/master/data)

* [SciCite - Citation Context Classification](https://github.com/allenai/scicite)

* [S2ORC: The Semantic Scholar Open Research Corpus - 12.7M full text papers](https://github.com/allenai/s2orc/)

* [EuropePMC annotations for entities and relationships](http://europepmc.org/AnnotationsApi)

* [NLPContributionGraph - Structuring Scholarly NLP Contributions in the Open Research Knowledge Graph](https://ncg-task.github.io/)

* [GROBID NER](https://github.com/kermitt2/grobid-ner/tree/master/resources/dataset)

* [GROBID Sequence Labeling data](https://github.com/kermitt2/delft/tree/master/data/sequenceLabelling/grobid)

* [The General Index - Metadata, Ngrams, and Keyphrases in 107,233,728 journal articles](https://archive.org/details/GeneralIndex)

* [Pubtrends Review Dataset](https://github.com/JetBrains-Research/pubtrends-review/tree/master/review)

* [PubTator Central (PTC) - NLP annotated PMC datasets](https://www.ncbi.nlm.nih.gov/research/pubtator/)

* [PubMedCentral Author Manuscript Collection](https://ftp.ncbi.nlm.nih.gov/pub/pmc/manuscript/)

* [Paper analyzer pubmed](https://research.jetbrains.org/groups/paper_analyzer/projects/)

* [NER on Material Science Papers](https://github.com/olivettigroup/annotated-materials-syntheses)

* [SoMeSci - Software Mentions in Science](https://zenodo.org/record/4968738)

* [NLMChem a new resource for chemical entity recognition in PubMed full-text literature](https://zenodo.org/record/4628233#.Yd_YRL3MJ3g)

* [Scientific summarization datasets](https://github.com/Santosh-Gupta/ScientificSummarizationDataSets)

* [PubMed Classification](https://figshare.com/articles/dataset/PubMed_classification_v1_202102/16601402)

* [Annotated scientific findings with sentence-level and aspect-level certainty](https://github.com/Jiaxin-Pei/Certainty-in-Science-Communication)

* [SoftwareKG_Social and SoftwareKG_PubMed - Software mentions in articles](https://data.gesis.org/softwarekg/site/)

* [Bioinformatics Named Entity Recogniser for Databases and Software](https://sourceforge.net/projects/bionerds/files/goldstandard/)

* [The CodeMeta Project: preservation, discovery, reuse, and attribution of software](https://codemeta.github.io/)

* [Social Science Software Citation Dataset](https://github.com/f-krueger/SoSciSoCi)

* [SoMeSci - A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles](https://data.gesis.org/somesci/)

* [Softcite dataset: A gold-standard dataset of software mentions in research publications for supervised learning based named entity recognition](https://github.com/howisonlab/softcite-dataset)

* [SoftwareKG-PMC: a Knowledge Graph of Software mentions extracted from articles of the PMC Open Access Dataset](https://zenodo.org/record/5780121#.Yh3QO-jMJ3h)

* [DEAL: Detecting Entities in the Astrophysics Literature](https://ui.adsabs.harvard.edu/WIESP/2022/SharedTasks)

* [Computer Science Knowledge Graph](https://scholkg.kmi.open.ac.uk/)

* [SCIERC: Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction](https://nlp.cs.washington.edu/sciIE/) - [Code](https://bitbucket.org/luanyi/scierc/src/master/)

* [University of Washington BIO NLP datasets](http://depts.washington.edu/bionlp/index.html?corpora)

* [multimodal_summ: Multimodal summarization of research papers](https://github.com/LCS2-IIITD/multimodal_summ)

* [ACL Anthology Corpus - Full Text](https://github.com/shauryr/ACL-anthology-corpus)

* [Entity Linking of Crossref Funding Orgs in Acknowledgements](https://github.com/SEYED7037/EDFund_sample_dataset) - [paper](https://arxiv.org/abs/2209.00351)

* [Microsoft Academic Knowledge Graph (MAKG)](https://makg.org/) - [Zenodo](http://doi.org/10.5281/zenodo.3936556) - [ComplEx entity embeddings (120 GB) for all 243 million authors, 239 million publications, 49,000 journals, and 16,000 conferences](https://makg.org/entity-embeddings/)

* [Wikidata:WikiProject Clinical Trials](https://www.wikidata.org/wiki/Wikidata:WikiProject_Clinical_Trials)

* [A Dataset of Alt Texts from HCI Publications](https://github.com/allenai/hci-alt-texts)

* [PubMed-OA-Extraction-dataset](https://zenodo.org/record/6330817)

* [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://github.com/allenai/scirepeval)

* [The MAPLE Benchmark for Scientific Literature Tagging](https://zenodo.org/record/7611544)

## Networks

* [ACL Anthology Network](http://clair.eecs.umich.edu/aan/index.php)

* [I³ Open Innovation Dataset Index](https://iiindex.org/) - Multiple datasets related to patent networks, inventor careers, etc. 

* [Science4cast Competition](https://github.com/iarai/science4cast) - capture the evolution of scientific concepts and predict which research topics will emerge in the coming years

## Taxonomies and Ontologies of Research Concepts

* [SciGraph Springer Nature](https://scigraph.springernature.com/explorer/downloads/)

* [Medical Subject Headings](https://meshb.nlm.nih.gov/search) maintained by the [National Library of Medicine of the United States](https://www.nlm.nih.gov)

* [Computer Science Ontology](https://cso.kmi.open.ac.uk/home) maintained by [Scholarly Knowledge: Modeling, Mining and Sense Making](http://skm.kmi.open.ac.uk)

* [Physics Subject Headings (PhySH)](https://physh.aps.org/) maintained by American Physical Society (APS) - [GitHub](https://github.com/physh-org/PhySH)

* [Open Biological and Biomedical Ontology (OBO)](http://obofoundry.org/) maintained by the [OBO Foundry](http://obofoundry.org)

* [ACM Computing Classification System](https://www.acm.org/publications/class-2012) maintained by the [Association for Computing Machinery](https://www.acm.org)

* [Physics and Astronomy Classification Scheme (PACS)](https://web.archive.org/web/20131122200802/http://www.aip.org/pacs/pacs2010/about.html) maintained by American Institute of Physics (AIP), *discontinued* in 2010 and replaced by [Physics Subject Headings](https://physh.aps.org/)

* [Mathematics Subject Classification (MSC)](https://mathscinet.ams.org/msc/msc2010.html) maintained by [Mathematical Reviews](http://www.ams.org/mr-database) and [zbMATH](https://zbmath.org)

* [Journal of Economic Literature (JEL)](https://www.aeaweb.org/econlit/jelCodes.php) maintained by the [American Economic Association](https://www.aeaweb.org)

* [STW Thesaurus for Economics](http://zbw.eu/stw/version/latest/about) maintained by [ZBW - Leibniz Information Centre for Economics](http://www.zbw.eu/de/)

* [Australian and New Zealand Standard Research Classification (ANZSRC)](https://www.arc.gov.au/grants/grant-application/classification-codes-rfcd-seo-and-anzsic-codes) maintained by [Australian Bureau of Statistics](http://www.abs.gov.au). It consists of 3 sub-classification schemes:

  * [Fields of Research (FoR)](http://www.abs.gov.au/Ausstats/abs@.nsf/Latestproducts/6BB427AB9696C225CA2574180004463E?opendocument) classification

  * [Research Fields, Courses and Disciplines (RFCD)](http://www.abs.gov.au/ausstats/abs@.nsf/66f306f503e529a5ca25697e0017661f/955FFA4EB1B23847CA25697E0018FB14?opendocument) classification

  * [Socio-Economic Objective (SEO)](http://www.abs.gov.au/Ausstats/abs@.nsf/Latestproducts/CF7ADB06FA2DFD69CA2574180004CB82?opendocument) classification

* [Library of Congress Classification (LCC)](https://www.loc.gov/catdir/cpso/lcc.html) maintained by [Library of Congress](https://www.loc.gov)

* [Fields of Study (FoS)](https://academic.microsoft.com/#/topics/0/) maintained by [Microsoft Academic](https://academic.microsoft.com)

* [CrossRef Open Funder's Registry](https://gitlab.com/crossref/open_funder_registry)

* [Scientific Keyphrase Extraction Datasets - KP20k, NUS, MAG_KP](https://github.com/memray/OpenNMT-kpg-release)

* [Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources](https://fdm.luis.uni-hannover.de/tr/dataset/stem-ecr-v1-0)

* [XL-BEL - A benchmark for cross-lingual biomedical entity linking spanning 10 typologically diverse languages](https://github.com/cambridgeltl/sapbert)

* [IteraTeR: Understanding Iterative Revision from Human-Written Text based on ArXiv abstract edit versions](https://github.com/vipulraheja/IteraTeR)

* [CiteSum: Citation Text-guided Scientific Extreme Summarization and Low-resource Domain Adaptation](https://github.com/morningmoni/CiteSum)

* [AckExtract: Extraction of acknowledgements and their named entities from scholarly papers](https://github.com/lamps-lab/ackextract)

* [The MSVEC Dataset: Multi-Domain Scientific Claim Verification Evaluation Corpus (MSVEC)](https://github.com/lamps-lab/msvec)

* [GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing](https://isg.beel.org/blog/2019/12/10/giant-the-1-billion-annotated-synthetic-bibliographic-reference-string-dataset-for-deep-citation-parsing-pre-print/) - [dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/LXQXAO)

## Affiliations

* [Global Research Identifier Database (GRID)](https://www.grid.ac)

* [CS Rankings with people linked to institutes](http://csrankings.org/#/index?all)

## Altmetrics and Dimensions

* [Altmetrics API](https://api.altmetric.com/)

* [Dimensions.ai API](https://metrics-api.dimensions.ai) - [documentation](https://figshare.com/articles/Dimensions_Metrics_API_Documentation/5783694), [example](http://metrics-api.dimensions.ai/doi/10.7717/peerj-cs.119)

* [Core Conference Rankings](http://www.core.edu.au/conference-portal/2018-conference-rankings-1)

* [China Computer Federation Conference Rankings](https://www.ccf.org.cn/xspj/rgzn/)

# Tools

## User interface to publication datasets and analysis

* [Google Scholar](https://scholar.google.com/)

* [Semantic Scholar](https://www.semanticscholar.org/)

* [Microsoft Academic Graph](http://academic.research.microsoft.com/)

* [OpenAIRE Explore](https://explore.openaire.eu)

* [AceMap](http://acemap.sjtu.edu.cn/)

* [GitXiv](http://www.gitxiv.com/)

* [ACL Anthology](http://aclanthology.info/)

* [NIPS papers](https://papers.nips.cc/)

* [Abel tools for PubMed data](http://abel.lis.illinois.edu/resources.html)

* [infolis: linking research data and publications](http://infolis.github.io/)

* [Metrics toolkit](http://www.metrics-toolkit.org/)

* [Rcrossref (R library)](https://github.com/ropensci/rcrossref)

* [Rscopus (R library)](https://cran.r-project.org/web/packages/rscopus/index.html)

* [Scholar (R library)](https://cran.r-project.org/web/packages/scholar/index.html)

* [Bibliometrix (R library)](http://www.bibliometrix.org/)

* [CITAN (R library)](https://cran.r-project.org/web/packages/CITAN/index.html)

* [BibeR - A web-based tool for bibliometric analysis in scientific literature](https://yangliufr.shinyapps.io/BibeR/)

* [scihub.py (Python library)](https://github.com/zaytoun/scihub.py)

* [SoPaper (Python library)](https://github.com/ppwwyyxx/SoPaper)

* [CiteSeer tools](https://github.com/SeerLabs)

* [Novelty quantification in PubMed articles](https://github.com/napsternxg/Novelty)

* [TidyPMC - R based PMC XML parser](https://github.com/cstubben/tidypmc)

* [PublicationHarvester - Download PubMed publications of an author](https://github.com/andrewstellman/PublicationHarvester)

* [Publish or Perish - retrieves and analyzes academic citations from MS Academic and Scholar](https://harzing.com/resources/publish-or-perish)

* [Affiliation string parser](https://github.com/titipata/affiliation_parser)

* [CiteSeerX](https://csxstatic.ist.psu.edu/downloads/software)

* [Data Set Knowledge Graph (DSKG) - an RDF data set about data sets](http://dskg.org/)

* [Citation Gecko - Find related papers](https://www.citationgecko.com)

* [pySciSci - Python tool for working with MAG, PubMed, etc.](https://github.com/SciSciCollective/pyscisci)

* [ACM Digital Library](https://dl.acm.org/)

## Tools for collecting open access papers

* [ContentMine - getpapers](https://github.com/ContentMine/getpapers)

* [rcoreoa](https://github.com/ropensci/rcoreoa) - [CORE](https://core.ac.uk) API R client

* [metaknowledge - A Python library for doing bibliometric and network analysis in science and health policy research](https://github.com/networks-lab/metaknowledge)

* [PubMedPortable - PubMed to Postgres](https://github.com/KerstenDoering/PubMedPortable)

* [medic - Parsing MEDLINE and storing into a DB](https://github.com/fnl/medic)

## Tools for classifying research papers

* [CSO-Classifier](https://github.com/angelosalatino/cso-classifier)

* [WikiCSSH](https://uiuc-ischool-scanr.github.io/WikiCSSH/)

* [SAGE Rejected Article Tracker](https://github.com/ad48/rejected_article_tracker_pkg)

## Visualizations

* [Rexplore](https://technologies.kmi.open.ac.uk/rexplore/)

* [VOSviewer](http://www.vosviewer.com)

* [CitNetExplorer](http://www.citnetexplorer.nl/)

* [CiteSpace](http://cluster.cis.drexel.edu/~cchen/citespace/)

* [Nobel nominations and recipients](https://ria.ru/infografika/20151210/1339535142.html?lang=en)

* [WOS2Pajek](http://vladowiki.fmf.uni-lj.si/doku.php?id=pajek:wos2pajek)

## Language Processing and Information Extraction

* [Biomedical - BioSentVec Embeddings](https://github.com/ncbi-nlp/BioSentVec)

* [Biomedical embeddings - CambridgeLTL](https://github.com/cambridgeltl/BioNLP-2016)

* [NIH scientific paper pre-processing](https://github.com/NIHOPA/NLPre)

* [SciSpacy - spaCy models for Biomedical NLP from AllenAI](https://allenai.github.io/scispacy/)

* [Multitask Biomedical NER](https://github.com/yuzhimanhua/Multi-BioNER)

* [SciBERT - BERT LM for Biomedical and CS papers](https://github.com/allenai/scibert)

## Citation and metadata extraction

* [CERMINE](https://github.com/CeON/CERMINE)

* [Grobid](https://grobid.readthedocs.io/en/latest/)

* [EXCITE (Extraction of Citations from PDF Documents)](http://excite.west.uni-koblenz.de/website/)

* [Science-Parse](https://github.com/allenai/science-parse)

* [unarXive (Citations in context from arXiv)](https://github.com/IllDepence/unarXive)

* [Biblio-Glutton](https://github.com/kermitt2/biblio-glutton)

* [PDF/LaTeX to JSON](https://github.com/allenai/s2orc-doc2json)

* [CrossRef Reference Matching code and evaluation data](https://github.com/CrossRef/reference-matching-evaluation)

* [Citation style classifier and evaluation data](https://gitlab.com/crossref/citation_style_classifier)

* [refextract - extracting references used in scholarly communication](https://github.com/inspirehep/refextract)

## Publication and Publisher Info

* [Interactive sheet for deciding publication strategy and open science](https://docs.google.com/spreadsheets/d/1ALIr6i-ufawnR1_tZBbF_Io7ihWpLZKBOD0aLhRRp0U/edit#gid=165846403) - [Tweet](https://twitter.com/jeroenbosman/status/1492876367976968193)

## Author Name Disambiguation

* [Bibliographic Entity Automatic Recognition and Disambiguation](https://github.com/inspirehep/beard) - [paper](https://arxiv.org/abs/1508.07744)

# Community

## Journals

* [Frontiers in Research Metrics and Analytics](https://www.frontiersin.org/journals/research-metrics-and-analytics) 

* [Scientometrics](https://link.springer.com/journal/11192) 

* [Journal of Informetrics](https://www.journals.elsevier.com/journal-of-informetrics)

* [Quantitative Science Studies](https://www.mitpressjournals.org/loi/qss) (Open Access)

* [Science, Technology, & Human Values](https://journals.sagepub.com/home/sth)

* [Social Studies of Science](https://journals.sagepub.com/home/sss)

* [Science and Public Policy](https://academic.oup.com/spp)

## Conferences

* [Joint Conference on Digital Libraries (JCDL)](http://www.jcdl.org)

* [International Conference on Theory and Practice of Digital Libraries (TPDL)](http://www.tpdl.eu)

* [European Semantic Web Conference (ESWC), Research of Research Track](https://2019.eswc-conferences.org/call-for-papers-research-of-research-track/)

* [STI Conference series (Science and Technology Indicators, e.g., 2018)](http://sti2018.cwts.nl/)

* [ISSI Conference series (International Conference on Scientometrics & Informetrics, e.g., 2019)](https://www.issi2019.org/)

## Workshops

* [SIGMET - Metrics workshop](https://www.asist.org/SIG/SIGMET/workshop/)

* [International Workshop on Mining Scientific Publications](https://wosp.core.ac.uk/)

* [Semantics, Analytics, Visualisation: Enhancing Scholarly Dissemination (SAVE-SD)](https://save-sd.github.io/2018/)

* [Workshop on Reframing Research (RefResh)](http://refresh.kmi.open.ac.uk)

* [Enabling Open Semantic Science (SemSci)](https://semsci.github.io/SemSci2018/)

* [Workshop on Scholarly Document Processing](https://ornlcda.github.io/SDProc/index.html)

## Summer Schools

* [CWTS Scientometrics Spring School (CS3)](https://www.cwts.nl/education/cwts-scientometrics-spring-school)

* [European Summer School of Scientometrics (ESSS)](https://www.scientometrics-school.eu/)

## Courses

* [SI 710: Science of Science - University of Michigan School of Information](https://docs.google.com/document/d/1j-S5k-KHa0mNt3eqJU-bcM4s615z62Ky5c8upBaggKo/edit#heading=h.bvzc4stuveot)

## Associations & Community

* [International Society for Informetrics and Scientometrics (ISSI)](http://issi-society.org)

* [European Network of Indicator Designers (ENID)](http://www.forschungsinfo.de/ENID/)

* [4S (Society for Social Studies of Science)](http://4sonline.org/)

* [SIG/MET - Special Interest Group for the measurement of information production and use](https://www.asist.org/SIG/SIGMET/)

## Research Groups

* [Science of Science and Computational Discovery Lab - University of Colorado Boulder](https://scienceofscience.org/)

## Blogs

* [Clarivate Blog](https://clarivate.com/blog/)

* [Elsevier Connect](https://www.elsevier.com/connect)

* [The Scholarly Kitchen](https://scholarlykitchen.sspnet.org/)

# Contributions

The following people have contributed to the items on this list. 

* [Shubhanshu Mishra](https://shubhanshu.com) - Maintainer of the list. 

* [Angelo Antonio Salatino](https://github.com/angelosalatino)

* [Philipp Zumstein](https://github.com/zuphilip)

* [Ali (Aliakbar Akbaritabar)](http://akbaritabar.netlify.com)

* [Andrea Mannocci](https://github.com/andremann)