https://github.com/napsternxg/awesome-scholarly-data-analysis

[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
[![License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication](https://licensebuttons.net/l/zero/1.0/88x31.png)](https://creativecommons.org/publicdomain/zero/1.0/)

# Awesome Scholarly Data Analysis

List of resources on scholarly data analysis ranging from datasets, papers, and code about bibliometrics, citation analysis, and other scholarly commons resources.
Available online at https://shubhanshu.com/awesome-scholarly-data-analysis/

# Table of Contents
- [Awesome Scholarly Data Analysis](#awesome-scholarly-data-analysis)
- [Table of Contents](#table-of-contents)
- [Datasets](#datasets)
  * [Publication and Citation](#publication-and-citation)
  * [Peer Review](#peer-review)
  * [Grants and Funding](#grants-and-funding)
  * [Academic Genealogy](#academic-genealogy)
  * [Author Profiles](#author-profiles)
  * [Author name disambiguation](#author-name-disambiguation)
  * [Thesis datasets](#thesis-datasets)
  * [Information Extraction and NLP](#information-extraction-and-nlp)
  * [Networks](#networks)
  * [Taxonomies and Ontologies of Research Concepts](#taxonomies-and-ontologies-of-research-concepts)
  * [Affiliations](#affiliations)
  * [Altmetrics and Dimensions](#altmetrics-and-dimensions)
- [Tools](#tools)
  * [User interface to publication datasets and analysis](#user-interface-to-publication-datasets-and-analysis)
  * [Tools for collecting open access papers](#tools-for-collecting-open-access-papers)
  * [Tools for classifying research papers](#tools-for-classifying-research-papers)
  * [Visualizations](#visualizations)
  * [Language Processing and Information Extraction](#language-processing-and-information-extraction)
  * [Citation and metadata extraction](#citation-and-metadata-extraction)
  * [Publication and Publisher Info](#publication-and-publisher-info)
- [Community](#community)
  * [Journals](#journals)
  * [Conferences](#conferences)
  * [Workshops](#workshops)
  * [Summer Schools](#summer-schools)
  * [Courses](#courses)
  * [Associations & Community](#associations---community)
  * [Research Groups](#research-groups)
  * [Blogs](#blogs)
- [Contributions](#contributions)

Table of contents generated with markdown-toc

# Datasets

## Publication and Citation
* [Arnet Miner](http://aminer.org/citation)
* [Microsoft Academic Graph](https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/)
* [OpenAlex - Replacement for MAG](https://openalex.org/)
* [Open Academic Graph](https://www.openacademic.ai/oag/) - MAG + AMiner
* [OpenAIRE Research Graph](https://zenodo.org/record/4279381) - More info [here](https://graph.openaire.eu)
* [Semantic Scholar Corpus](https://www.semanticscholar.org/product/api)
* [CiteSeer](https://csxstatic.ist.psu.edu/downloads/data)
* [PubMed](https://www.ncbi.nlm.nih.gov/pubmed)
* [CORA datasets for citation string parsing](https://people.cs.umass.edu/~mccallum/data.html)
* [Humanities and multilingual citation string parsing Flux-CiM and ICONIP](https://github.com/knmnyn/ParsCit/tree/master/doc) - see [Neural ParsCit paper](https://link.springer.com/article/10.1007/s00799-018-0242-1) for details
* [Citation string parsing data for social sciences for English and German citations](https://github.com/exciteproject/EXgoldstandard) - [comparison with Grobid and Cermine](https://github.com/exciteproject/Exparser/tree/master/Evaluation/Ours)
* [CrossRef DOI URLs](https://archive.org/details/doi-urls)
* [DOIBoost (Crossref + MAG + ORCID + Unpaywall)](https://zenodo.org/record/3559699)
* [DBLP Citation dataset](https://kdl.cs.umass.edu/display/public/DBLP)
* [DBLP XML data](https://dblp.org/xml/release/)
* [DBLP Discovery Dataset (D3)](https://github.com/gipplab/d3-dataset)
* [NBER Patent Citations](http://nber.org/patents/)
* [Scopus Citation Database](https://www.elsevier.com/solutions/scopus)
* [Papers, patents, and grants from Indiana University](http://iv.slis.indiana.edu/db/index.html)
* [Small Network Data - Mark Newman's Lab](http://www-personal.umich.edu/~mejn/netdata/)
* [The Koblenz Network Collection](http://konect.uni-koblenz.de/)
* [Google Scholar citation relations](http://www3.cs.stonybrook.edu/~leman/data/gscholar.db)
* [Google Scholar Citations data set](http://homes.sice.indiana.edu/filiradi/resources.html) [direct-download](http://homes.sice.indiana.edu/filiradi/Data/gsc_data.tar.bz2)
* [Open citations project](http://opencitations.net/)
* [Wikicite Project](https://meta.wikimedia.org/wiki/WikiCite)
* [Economic Papers](http://repec.org/)
* [ArXiv data dump](https://arxiv.org/help/bulk_data)
* [ArXiv data on Kaggle](https://www.kaggle.com/Cornell-University/arxiv)
* [EuropePMC](http://europepmc.org/)
* [Complete ACL anthology as bibtex file](http://aclanthology.info/anthology.bib)
* [ACL Anthology Reference Corpus](http://acl-arc.comp.nus.edu.sg/)
* [Astrophysics data system (ADS) - All physics papers](https://ui.adsabs.harvard.edu/)
* [CORE 37M full text open access papers](https://core.ac.uk/services#dataset)
* [Inspire database for high energy physics articles](http://inspirehep.net/?ln=en)
* [Scholarly Data of workshops and conferences in RDF triplets](http://www.scholarlydata.org/)
* [The Collection of Computer Science Bibliographies](http://liinwww.ira.uka.de/bibliography/)
* [OpenCitations corpus](http://opencitations.net/corpus)
* [COCI DOI-to-DOI citation data](https://figshare.com/articles/Crossref_Open_Citation_Index_CSV_dataset_of_all_the_citation_data/6741422/2)
* [DOAJ API (Directory of Open Access Journals)](https://doaj.org/api/v1/docs)
* [ROAD (Directory of Open Access Scholarly Resources)](https://road.issn.org/)
* [Sherpa/Romeo (Publisher copyright policies & self-archiving)](http://www.sherpa.ac.uk/romeo/index.php)
* [OpenAPC (fees paid for open access journal articles)](https://www.intact-project.org/openapc/)
* [OSF API (Open Science Framework)](https://developer.osf.io/)
* [Digital tools for researchers](http://connectedresearchers.com)
* [Fatcat - versioned, publicly-editable catalog of research publications](https://fatcat.wiki/)
* [Microsoft Academic Knowledge Graph - RDF dump](http://ma-graph.org/)
* [arXiv CS citation in context](http://citation-recommendation.org/publications/#A_High-Quality_Gold_Standard_for_Citation-based_Tasks)
* [arXiv fulltext + citations dataset](https://zenodo.org/record/2609187#.XKe86JhKh3g)
* [Self-citation analysis data based on PubMed Central subset (2002-2005)](https://doi.org/10.13012/B2IDB-9665377_V1)
* [Unpaywalled Corpus - PDF to 23M DOIs](https://unpaywall.org/products/snapshot) [Data Schema](https://unpaywall.org/data-format)
* [A dataset of publication records for Nobel laureates](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6NJ5RN) - [paper](https://www.nature.com/articles/s41597-019-0033-6#Abs1)
* [OpenAIRE Scholexplorer - 126+ Million literature-dataset and dataset-dataset links between 12+ Million objects](https://zenodo.org/record/2674330#.XOU6gshKh3g) - [About the data](http://scholexplorer.openaire.eu/index.html#/about)
* [Manually annotated citation data from the ACL Anthology into uses, motivation, future, extends, compare or contrast, and background](http://jurgens.people.si.umich.edu/citation-function/)
* [iCite - NIH Open Citation Collection](https://nih.figshare.com/collections/iCite_Database_Snapshots_NIH_Open_Citation_Collection_/4586573)
* [MEDLINE/PubMed Baseline Repository (MBR) - All Medline abstracts and paper metadata in XML](https://mbr.nlm.nih.gov)
* [American Physical Society Data Sets for Research](https://journals.aps.org/datasets)
* [Co-citation networks of all Nature papers](https://www.nature.com/immersive/d41586-019-03165-4/index.html)
* [Semantic Scholar Graph of References in Context (GORC) dataset](https://github.com/allenai/s2-gorc/)
* [Multiple journal publication datasets](https://github.com/dmsquare/tube)
* [Structured citations in the English Wikipedia](https://zenodo.org/record/55004#.XgLol7dOmdM)
* [ICSR Lab (free for researchers) for Scopus and PlumX use](https://www.elsevier.com/icsr/icsrlab/features)
* [COVID-19 Open Research Dataset (CORD-19)](https://pages.semanticscholar.org/coronavirus-research)
* [PaperRobot - includes PubMed Paper Reading Dataset](https://github.com/EagleW/PaperRobot)
* [SciMag - Microsoft Academic Linked to SciMago Journals](https://github.com/scimag/sciMAG2015) - [WebPage](https://scimag.github.io/sciMAG2015/)
* [SciGraph Springer Nature](https://scigraph.springernature.com/explorer/downloads/)
* [Citations to scholarly data in various language wikipedias](https://figshare.com/articles/Citations_with_identifiers_in_Wikipedia/1299540) [Code](https://github.com/mediawiki-utilities/python-mwcites)
* [800K publications matched from CrossRef, CORE, and Mendeley with data on publication and open access dates](https://zenodo.org/record/2605409#.Xr2jZxNKjRY)
* [Coronavirus Open Citations Dataset](https://opencitations.github.io/coronavirus/)
* [Crossref dumps](https://archive.org/search.php?query=creator%3A%22Crossref%22) [DOI meta-data](https://github.com/greenelab/crossref)
* [S2ORC: The Semantic Scholar Open Research Corpus - 12.7M full text papers](https://github.com/allenai/s2orc/)
* [Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia](https://github.com/Harshdeep1996/cite-classifications-wiki)
* [Microsoft Academic Data for conducting covid-19 research](https://github.com/microsoft/mag-covid19-research-examples)
* [Initiative for Open Abstracts](https://i4oa.org/#openabstracts)
* [Dataset Search: metadata for datasets - Datasets with DOIs and compact identifiers](https://www.kaggle.com/googleai/dataset-search-metadata-for-datasets)
* [Open Syllabus Project](https://opensyllabus.org/)
* [Journal Causal effect in Citations](https://github.com/vtraag/journal-causal-effect-replication)
* [Sci-Hub Download Logs](https://zenodo.org/record/1158301#.YgiebbrMJ3g) - [Latest](https://sci-hub.se/stats)
* [Sci-Hub databases](https://sci-hub.se/database)
* [SAGE Rejected article tracker dataset from ArXiv](https://zenodo.org/record/5122848) - [Github](https://github.com/sagepublishing/rejected_article_tracker_pkg)
* [The Open Research Knowledge Graph (ORKG)](https://www.orkg.org/orkg/)
* [Academia/Industry Dynamics (AIDA)](https://aida.kmi.open.ac.uk/)
* [Test of Time Awards](https://github.com/LCS2-IIITD/influence-dispersion)
* [ACL-Cite-Net](https://github.com/iamjanvijay/acl-cite-net)
* [The DBLP Discovery Dataset (D3): A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research](https://github.com/jpwahle/lrec22-d3-dataset) [Zenodo](https://zenodo.org/record/7069915)
* [Papers and patents are becoming less disruptive over time](https://zenodo.org/record/7258379) - [Paper](https://www.nature.com/articles/s41586-022-05543-x)
* [OpenAIRE Research Graph Dump](https://zenodo.org/record/7488618)
* [OpCitance: Citation contexts identified from the PubMed Central open access articles](https://databank.illinois.edu/datasets/IDB-7312599)
* [A large dataset of scientific text reuse in Open-Access publications](https://github.com/webis-de/scidata22-stereo-scientific-text-reuse)

## Peer Review

* [PeerRead - paper drafts, reviews, and accept/reject decision](https://github.com/allenai/PeerRead)
* [CiteTracked: A Longitudinal Dataset of Peer Reviews and Citations - Contact Author](http://ceur-ws.org/Vol-2414/paper12.pdf)
* [Elsevier's Peer Review Workbench](https://lab.icsr.net/icsr_lab/workbenches.html)
* [ACL-18 Numerical Peer Review Dataset](https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2639?locale-attribute=en_US)
* [Argument Mining for Understanding Peer Reviews](https://xinyuhua.github.io/Resources/naacl19/)
* [APE: Argument Pair Extraction - Annotated ICLR 2013-2020 review-rebuttal argument pair](https://github.com/LiyingCheng95/ArgumentPairExtraction)
* [Argument Mining Driven Analysis of Peer-Reviews Dataset](https://doi.org/10.5281/zenodo.4314390)
* [Publons review length dataset with 498K reviews - anonymized](https://clarivate.com/blog/its-not-the-size-that-matters/)
* [Peer review analyze: A novel benchmark resource for computational analysis of peer reviews](https://github.com/Tirthankar-Ghosal/Peer-Review-Analyze-1.0)
* [Open Editors: data about scholarly journals' editors and editorial board members](https://openeditors.ooir.org/) - [Github](https://github.com/andreaspacher/openeditors)
* [NLPEER: A Unified Resource for the Computational Study of Peer Review](https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/3618)
* [eLife Open Peer Review Corpus](https://doi.org/10.18150/FKPEQN)
* [PLoS Open Peer Review Corpus](https://doi.org/10.18150/KZHVGE)
* [MDPI Open Peer Review Corpus](https://doi.org/10.18150/D5L2EK)

## Grants and Funding
* [GrantExplorer: a free, open-source tool for examining the phrases funded by U.S. federal agencies](https://www.grantexplorer.org/?about=1&org=dod)
* [USASpending.gov: Award Data Archive](https://www.usaspending.gov/download_center/award_data_archive)
* [NIH research funding](https://report.nih.gov/databases)
* [Authors linked to PIs in NIH Grants](http://abel.lis.illinois.edu/cgi-bin/authorlink/search.pl)

## Academic Genealogy
* [Mathematics Genealogy Project](https://genealogy.math.ndsu.nodak.edu/index.php)
* [Academic Tree - Cross discipline academic genealogies](http://academictree.org)
* [MPACT project - Library Sciences](http://www.ibiblio.org/mpact/)
* [PhDTree](http://phdtree.org/)
* [Chemistry Genealogy - curated at UIUC](http://www.scs.illinois.edu/~mainzv/Web_Genealogy/index.php)
* [Notre Dame Genealogy Project](http://library.nd.edu/physics/resources/genealogy/genealogyproject.shtml)
* [UIUC Chemistry, Chemical Engineering, and Biochemistry](http://www.scs.illinois.edu/alumnilist/)
* [Software Engineering Academic Genealogy](http://web.engr.illinois.edu/~taoxie/sefamily.htm)
* [Other lists of genealogy projects](http://libguides.caltech.edu/c.php?g=512661&p=3502496)
* [Wikipedia - Computer Science Genealogy](https://en.wikipedia.org/wiki/Academic_genealogy_of_computer_scientists)
* [Wikipedia - Theoretical Physicists Genealogy](https://en.wikipedia.org/wiki/Academic_genealogy_of_theoretical_physicists)
* [Wikipedia - Chemists Genealogy](https://en.wikipedia.org/wiki/Academic_genealogy_of_chemists)
* [SCIENTIFIC GENEALOGY MASTER LIST - Scientists Associated with Concepts in Chemistry & Physics](http://www.careerchem.com/NAMED/Genealogy-List.pdf)
* [Economic Genealogy](http://www-personal.umich.edu/~alandear/tree/INDEX.HTM) [Text Format](http://www-personal.umich.edu/~alandear/tree/Tree-In.txt)
* [S2AMP: Semantic Scholar Analysis of Mentorship Dataset](https://github.com/allenai/S2AMP-data)
* [MENTORSHIP - A dataset of mentorship in science with semantic and demographic estimations](https://doi.org/10.5281/zenodo.4917086) - [Code](https://github.com/sciosci/AFT-MAG)
* [A dataset of mentorship in science with semantic and demographic estimations - Used in The academic Great Gatsby Curve paper](https://zenodo.org/records/4917086)

## Author Profiles
* [Temporal profiles of PubMed authors](http://abel.lis.illinois.edu/legolas/)
* [ORCID data dump](https://orcid.org/content/download-file)
* [National Library of Medicine Profiles](https://profiles.nlm.nih.gov/)
* [UIUC Professors database - Publications, Affiliations](https://experts.illinois.edu/)
* [Author Profiles of scholarly authors in Wikipedia](https://scholia.toolforge.org/)
* [Career Transitions of CS students](https://github.com/tsafavi/career-transitions-data)
* [Author name gender and ethnicity dataset based on PubMed](https://doi.org/10.13012/B2IDB-9087546_V1)
* [MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide](https://doi.org/10.13012/B2IDB-4354331_V1)
* [Conceptual novelty scores for PubMed articles](https://doi.org/10.13012/B2IDB-5060298_V1)
* [Dataset of 100,000 top scientists with standardized information on citations, h-index, co-authorship-adjusted hm-index, citations to papers in different authorship positions, and a composite indicator](https://data.mendeley.com/datasets/btchxktzyw/1)
* [Canadian PhD career survey](https://www.sgs.utoronto.ca/about/10000-phds-project-overview/10kphds-dashboard/) - [Science report](https://www.sciencemag.org/careers/2018/03/trend-toward-transparency-phd-career-outcomes)
* [Data from the CVs of over 150 assistant professors in psychology in top-ranked research universities and small liberal art colleges in the US](https://osf.io/z8rhg/) - [Used in this blog](https://socialsciences.nature.com/users/325112-diego-a-reinero/posts/55118-the-path-to-professorship-by-the-numbers-and-why-mentorship-matters)
* [Wikidata Author Disambiguation Dataset](https://github.com/arthurpsmith/author-disambiguator)
* [The 4 Universities Data Set - Web pages of CS departments classified for author role (faculty, student, etc.)](http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/)
* [Journal editors dataset](https://openeditors.ooir.org/)
* [Career long various citation metrics for 100,000 top-scientists](https://data.mendeley.com/datasets/btchxktzyw/2)
* [Network-Data-Career-Transitions - two anonymized network datasets of post-PhD career transitions and trajectories in computing research](https://github.com/GemsLab/Network-Data-Career-Transitions)
* [Open dataset of scholars on Twitter - 500K OpenAlex Author ID to Twitter User Id](https://zenodo.org/record/7013518#.ZAsZei9Xac4)
* [Gender Inequities in the Online Dissemination of Scholars’ Work](https://github.com/LINK-NU/PNAS-Online-Dissemination-Gender)

## Author name disambiguation
* [INSPIRE dataset](https://github.com/glouppe/paper-author-disambiguation)
* [Lee Giles dataset](http://clgiles.ist.psu.edu/data/nameset_author-disamb.tar.zip)
* [Cleaner version of Lee Giles dataset](https://figshare.com/articles/DBLPderived_labeled_data_for_author_name_disambiguation/6840281)
* [DBLP Korean Authors](http://www.lbd.dcc.ufmg.br/lbd/collections/disambiguation/DBLP.tar.gz/at_download/file)
* [Arnet Miner](http://arnetminer.org/lab-datasets/disambiguation/rich-author-disambiguation-data.zip)
* [Arnet Miner - Manual Name Disambiguation data 210 authors](https://aminer.org/na-data)
* [DBLP Name disambiguation dataset](https://github.com/yaya213/DBLP-Name-Disambiguation-Dataset) - [Error corrected version](https://figshare.com/articles/dataset/DBLP-derived_labeled_data_for_author_name_disambiguation/6840281)
* [rexa-coref-data](https://github.com/tapilab/rexa-coref-data)
* [Deduped author names on IEEE Vis papers 1990-2018](https://sites.google.com/site/vispubdata/home)
* [Author-ity dataset for PubMed 2009](https://doi.org/10.13012/B2IDB-4370459_V1)
* [ACL Anthology dataset](https://github.com/acl-org/acl-anthology/blob/master/data/yaml/name_variants.yaml)
* [Base data for estimating precision and recall of Author-ity among NIH-funded scientists](https://figshare.com/articles/dataset/PLoS_2016_csv/3407461/1)
* [ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale](https://figshare.com/articles/dataset/ORCID-Linked_Labeled_Data_for_Evaluating_Author_Name_Disambiguation_at_Scale/13404986)
* [S2AND - Semantic Scholar Author Name Disambiguation Tool and Dataset](https://github.com/allenai/S2AND)
* [BibTex Dataset for 1M authors](http://www.iesl.cs.umass.edu/data/data-bibtex)
* [Ethnicity sensitive author disambiguation from INSPIRE HEP](https://github.com/glouppe/paper-author-disambiguation)
* [Pre-processed PubMed data for a study of coauthorship](https://zenodo.org/record/345934)
* [WhoIsWho: Web-Scale Academic Name Disambiguation: the WhoIsWho Benchmark, Leaderboard, and Toolkit](http://whoiswho.biendata.xyz/) - [https://www.aminer.cn/whoiswho](https://www.aminer.cn/whoiswho) - [WhoIsWho Toolkit GitHub](https://github.com/THUDM/WhoIsWho)
* [LAGOS-AND: A Large Gold Standard Dataset for Scholarly Author Name Disambiguation](https://zenodo.org/record/7313380) - [Github](https://github.com/carmanzhang/LAGOS-AND)
* [Chain Dream: Name Disambiguation Task 2](https://www.biendata.xyz/competition/chaindream_nd_task2/)

## Thesis datasets
* [Open Access Theses and Dissertations](https://oatd.org/)
* [The Networked Digital Library of Theses and Dissertations (NDLTD)](http://www.ndltd.org/)
* [PhD Dissertations in the Area of Software Engineering](http://www.sigsoft.org/dissertations.php)
* [ProQuest Dissertations & Theses Global](http://www.proquest.com/products-services/pqdtglobal.html)
* [History Dissertation Analysis](https://osf.io/v4ysh/)
* [Peer-making: the interconnections between PhD Thesis Committee membership and co-publishing](https://github.com/Marion-Mai/peer-making) - [Zenodo](https://doi.org/10.5281/zenodo.4966081)
* [DISAPERE: A Dataset for DIscourse Structure in Academic PEer REview](https://github.com/nnkennard/DISAPERE)
* [ETDs: Virginia Tech Electronic Theses and Dissertations](https://vtechworks.lib.vt.edu/handle/10919/5534)
* [DSpace@MIT: a digital repository for MIT's research, including peer-reviewed articles, technical reports, working papers, theses, and more](https://dspace.mit.edu/)
* [The ScanBank Dataset: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations](https://zenodo.org/record/4663578)
* [ETDMiner: extract metadata from scanned ETD](https://github.com/lamps-lab/ETDMiner/tree/master) [Google Drive](https://drive.google.com/drive/folders/1y6cADt2JJvNA10wnmlGBeMBJJrrBo6RV)

## Information Extraction and NLP
* [Citation Parsing](http://csxstatic.ist.psu.edu/about/scholarly-information-extraction)
* [Citation Parsing in humanities](https://github.com/dhlab-epfl/LinkedBooksDeepReferenceParsing)
* [Sentences tagged for Drug Disease pairs](https://github.com/roamanalytics/roamresearch/tree/master/BlogPosts/Features_for_healthcare)
* [Document Summarization and citation span identification](https://github.com/WING-NUS/scisumm-corpus)
* [ACL Anthology human summaries for 1000 papers](https://michiyasunaga.github.io/projects/scisumm_net/)
* [Keyphrase Extraction](https://github.com/snkim/AutomaticKeyphraseExtraction)
* [Related Work Summarization](https://github.com/WING-NUS/RelatedWorkSummarizationDataset)
* [Biomedical NLP annotated datasets](https://www.ncbi.nlm.nih.gov/research/bionlp/Data/)
* [Chemical compound and drug name recognition task](http://www.biocreative.org/tasks/biocreative-iv/chemdner/)
* [Semantic Scholar Dataset](https://allenai.org/data/data-all.html)
* [ScienceIE](https://scienceie.github.io/)
* [ACL RD TEC 2.0](http://pars.ie/lr/acl_rd-tec) also at [@CLARIN](https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-1661)
* [SEPID Corpus - Segmented ACL ARC 1.0](http://pars.ie/lr/sepid-corpus)
* [PubMed Central Open Access - BioC](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/)
* [PubMed Fulltext - protein-protein and genetic interactions](http://bioc.sourceforge.net/BioC-BioGRID.html)
* [BioNLP - Argo](http://argo.nactem.ac.uk/bioc/)
* [Biomedical NLP - Stav](http://corpora.informatik.hu-berlin.de/)
* [GENIA - BioNLP 2011](http://2011.bionlp-st.org/home)
* [Genia Treebank used for SciSpacy training](https://nlp.stanford.edu/~mcclosky/biomedical.html) - [SciSpacy link](https://allenai.github.io/scispacy/)
* [Full GENIA corpus](http://www.geniaproject.org/genia-corpus/term-corpus)
* [Anatomical Entity Mention (AnEM) corpus](http://www.nactem.ac.uk/anatomy/)
* [CellFinder - Entity detection](https://www.informatik.hu-berlin.de/de/forschung/gebiete/wbi/resources/cellfinder)
* [Multi-Level Event Extraction (MLEE)](http://nactem.ac.uk/MLEE/)
* [Biomedical sentence simplification](https://research.bioinformatics.udel.edu/isimp/corpus.html)
* [PubMed - Colorado Richly Annotated Full-Text](https://github.com/UCDenver-ccp/CRAFT)
* [Biomedical NER datasets](https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data) [related publication](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1776-8)
* [BioVerbNet](https://github.com/cambridgeltl/bio-verbnet)
* [Lunar and Planetary Science abstracts for NER and Relations](https://zenodo.org/record/1048419#.XAW0m2hKh3h)
* [ACM data affiliations](https://dbs.uni-leipzig.de/en/research/projects/bibliometrics)
* [ACM - DBLP database entry matching](https://dbs.uni-leipzig.de/en/research/projects/object_matching/fever/benchmark_datasets_for_entity_resolution)
* [Colorado Richly Annotated Full-Text](https://github.com/UCDenver-ccp/CRAFT) - PubMed abstract annotated with entities mapped to 10 biomedical ontology terms.
* [CLEF datasets for multilingual Biomedical NLP+IE](https://sites.google.com/site/clefehealth/home)
* [MedMentions - UMLS entities in PubMed](https://github.com/chanzuckerberg/MedMentions)
* [Coleridge Initiative - Rich Context Competition](https://coleridgeinitiative.org/richcontextcompetition#phase1)
* [SciERC - scientific entities, their relations, and coreference clusters for 500 AI conf abstracts](http://nlp.cs.washington.edu/sciIE/)
* [PubMed200k_RCT - Label abstract sentences into Objective, Background, Method, Results, Conclusions](https://github.com/Franck-Dernoncourt/pubmed-rct)
* [NER, Parsing, Classification datasets from SciBert](https://github.com/allenai/scibert/tree/master/data)
* [ACA Wiki - Paper summaries of more than 1600 papers](https://acawiki.org/Home)
* [SemEval-2018 task 7 Semantic Relation Extraction and Classification in Scientific Papers](https://competitions.codalab.org/competitions/17422#learn_the_details-subtasks)
* [A Compendium of Free, Public Biomedical Text Mining Tools Available on the Web](http://arrowsmith.psych.uic.edu/arrowsmith_uic/tools.html)
* [Medical Information Extraction from PubMed abstracts](https://www.figure-eight.com/dataset/medical-sentence-summary-and-relation-extraction/)
* [Corpus of 40 scientific papers manually annotated by multiple scientific discourse facets](http://sempub.taln.upf.edu/dricorpus)
* [PharmaCoNER: Pharmacological Substances, Compounds and proteins and Named Entity Recognition track](http://temu.bsc.es/pharmaconer/index.php/data/) - [Train](http://temu.bsc.es/pharmaconer/wp-content/uploads/2019/06/train-set_1.1.zip) - [Dev](http://temu.bsc.es/pharmaconer/wp-content/uploads/2019/06/dev-set_1.1.zip) - [Test](http://temu.bsc.es/pharmaconer/wp-content/uploads/2019/06/test-set_1.1.zip) - [Background Test set](http://temu.bsc.es/pharmaconer/wp-content/uploads/2019/05/background-set.zip)
* [Bacteria Biotope (BB) Task - NER, NEL, Relation, KB Extraction](https://sites.google.com/view/bb-2019/dataset?authuser=0)
* [Entity/relation recognition and GOF/LOF mutated gene text identification task based on the Active Gene Annotation Corpus](https://sites.google.com/view/bionlp-ost19-agac-track/description?authuser=0)
* [The Regulatory Network of Plant Seed Development (SeeDev) Task - NER, Relation](https://sites.google.com/view/seedev2019/home?authuser=0)
* [TalkSumm - Summary of papers via alignment to talks](https://github.com/levguy/talksumm)
* [SeminalSurveyDBLP - Classification of seminal or survey papers](https://zenodo.org/record/3258164#.XWac_-hKh3g)
* [Supp.ai - PubMed supplement-drug interactions and supplement-supplement interactions](https://github.com/lucylw/supp-ai-extracted-sdi-data/)
* [GENETAG](https://github.com/openbiocorpora/genetag) - More recent versions [Publication](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-6-S1-S3) and [Download 2005](ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/GENETAG.tar.gz)
* [MedTag: A Collection of Biomedical Annotations](https://www.aclweb.org/anthology/W05-1305) - [Download](ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedTag/)
* [Open Biomedical corpora](https://github.com/openbiocorpora)
* [Biomedical Abstract Meaning Representation corpus based on PubMed Fulltext](https://web.archive.org/web/20170501120500/http://amr.isi.edu/download.html) - Also see [other NLM curated biomedical resources](https://www.nlm.nih.gov/databases/download/data_distrib_main.html)
* [SciDTB: Discourse Dependency TreeBank for Scientific Abstracts](https://github.com/PKU-TANGENT/SciDTB)
* [SciDTB corpus annotated for argumentation mining](http://scientmin.taln.upf.edu/argmin/) - [Paper](https://www.aclweb.org/anthology/W19-4505.pdf)
* [Dr. Inventor Multi-layer Scientific Corpus for multiple scientific discourse facets](http://sempub.taln.upf.edu/dricorpus)
* [ART corpus - 225 papers manually annotated with the CISP labels (i.e. "Goal", "Method", "Result").](https://www.aber.ac.uk/en/media/departmental/computerscience/cb/art/gz/ART_Corpus.tar.gz) - [Browse files](http://www.ukoln.ac.uk/projects/ART_Corpus/menu.html) - [Project details](http://www.ukoln.ac.uk/projects/ART_Corpus/index.html)
* [Multi-CoreSC CRA corpus (MCCRA) - 50 papers annotated with multiple CoreSC labels per sentence.](http://www.sapientaproject.com/wp-content/uploads/2016/05/consensus_annotated.zip) - [Project details](http://www.sapientaproject.com/links)
* [PubMedQA - Question answering on PubMed](https://github.com/pubmedqa/pubmedqa)
* [Corposaurus - Collection of biomedical corpora for NER](https://corposaurus.github.io/corpora/)
* [BioNER corpus](https://github.com/xhuang28/NewBioNer/tree/master/corpus)
* [NeuroQuery - 14,000 full-text publications and 400,000 peak activations](https://github.com/neuroquery/neuroquery_data) - [NeuroQuery website](https://neuroquery.org/about)
* [Medical Information Extraction dataset](https://www.figure-eight.com/dataset/medical-sentence-summary-and-relation-extraction/)
* [A Large Parallel Corpus of Full-Text Scientific Articles](https://figshare.com/s/091fcaf8ad66a3304e90)
* [Annotated Corpus of Scientific Conference's Homepages for Information Extraction](https://archive.org/details/conferences-data-0.2)
* [Chi QA - Health Question Answering dataset from NLM](https://chiqa.nlm.nih.gov/)
* [Corpus of Open Access articles from multiple fields in Science, Technology, and Medicine - Includes wikification data](https://github.com/elsevierlabs/OA-STM-Corpus)
* [Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources](https://gitlab.com/TIBHannover/orkg/orkg-nlp/tree/master/STEM-ECR-v1.0)
* [Open Research Knowledge Graph project](https://gitlab.com/TIBHannover/orkg) - [Website](https://www.orkg.org/orkg/)
* [Academic PhraseBank](http://www.phrasebank.manchester.ac.uk/)
* [SciKG - Statement extraction datasets](https://github.com/dmsquare/SciKG)
* [A Fully Coreference-annotated Corpus of Scholarly Papers from the ACL Anthology](https://www.aclweb.org/anthology/C12-2103/)
* [A manual corpus of annotated main findings of clinical case reports](https://academic.oup.com/database/article/doi/10.1093/database/bay143/5290151#supplementary-data)
* [TREC Precision Medicine / Clinical Decision Support Track](http://www.trec-cds.org/2019.html)
* [Lots of biomedical entity linking and entity identification datasets](https://github.com/izuna385/datasets)
* [Materials Science Named Entity Recognition: train/development/test sets](https://doi.org/10.6084/m9.figshare.8184428.v1)
* [Entities in 3.27 million materials science abstracts](https://figshare.com/articles/Entities_database/8184413)
* [Normalized entities in material science papers](https://figshare.com/articles/Entity_Normalization/8184365)
* [Named Entity Recognition for Bacterial Type IV Secretion Systems](https://doi.org/10.1371/journal.pone.0014780.s002) - [Paper](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0014780#s5)
* [Annotating and detecting phenotypic information for chronic obstructive pulmonary disease](https://datadryad.org/stash/dataset/doi:10.5061/dryad.g35948t)
* [MiRoR11 - P2 - Annotated corpus for primary and reported outcomes extraction](https://doi.org/10.5281/zenodo.3234811)
* [Data from: PGxCorpus, a Manually Annotated Corpus for Pharmacogenomics](https://doi.org/10.6084/m9.figshare.7633343.v1)
* [Multiple PUBMED annotated corpora from iProLink project](https://research.bioinformatics.udel.edu/iprolink/corpora.php)
* [Mars Target Encyclopedia - LPSC abstracts labeled data set](https://doi.org/10.5281/zenodo.1048418)
* [Annotation of phenotypes using ontologies](https://doi.org/10.5281/zenodo.1246697)
* [The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0065390) - [SPECIES Direct Download](https://species.jensenlab.org/files/S800-1.0.tar.gz) - [ORGANISMS Direct Download](https://organisms.jensenlab.org/Downloads)
* [Entity mention in articles used for benchmark](https://figshare.com/articles/Entity_mention_in_articles_used_for_benchmark/5620417)
* [RAMBO 800+: A Corpus for the Development of Gene/Protein Recognition from Rare and Ambiguous Abbreviations](https://doi.org/10.4119/unibi/2673424)
* [Medical Relation Extraction - CrowdTruth](https://github.com/CrowdTruth/Medical-Relation-Extraction)
* [KP20k - Keyphrase extraction on 20k abstracts](https://github.com/memray/seq2seq-keyphrase)
* [Named Entity Recognition: (17.3 MB), 8 datasets on biomedical named entity recognition](https://github.com/dmis-lab/biobert#datasets)
* [Relation Extraction: (2.5 MB), 2 datasets on biomedical relation extraction](https://github.com/dmis-lab/biobert#datasets)
* [Question Answering: (5.23 MB), 3 datasets on biomedical question answering task](https://github.com/dmis-lab/biobert#datasets)
* [SciREX : A Challenge Dataset for Document-Level Information Extraction](https://github.com/allenai/SciREX)
* [Papers with Code - Links between papers and repositories and extraction of SOTA results](https://github.com/paperswithcode/paperswithcode-data)
* [Citation Context Classification based on purpose](https://www.kaggle.com/c/3c-shared-task-purpose/)
* [Citation Context Classification based on influence](https://www.kaggle.com/c/3c-shared-task-influence/)
* [PubMed knowledge graph (PKG)](http://er.tacc.utexas.edu/datasets/ped) - [Figshare](https://figshare.com/s/6327a55355fc2c99f3a2)
* [Citation and Header Datasets](https://csxstatic.ist.psu.edu/downloads/data)
* [Grobid-NER data](https://github.com/kermitt2/grobid-ner/tree/master/grobid-ner/resources/dataset)
* [Multiple NER and Entity Linking data for science](https://github.com/kermitt2/entity-fishing/tree/master/data)
* [SciCite - Citation Context Classification](https://github.com/allenai/scicite)
* [S2ORC: The Semantic Scholar Open Research Corpus - 12.7M full text papers](https://github.com/allenai/s2orc/)
* [EuropePMC annotations for entities and relationships](http://europepmc.org/AnnotationsApi)
* [NLPContributionGraph - Structuring Scholarly NLP Contributions in the Open Research Knowledge Graph](https://ncg-task.github.io/)
* [GROBID NER](https://github.com/kermitt2/grobid-ner/tree/master/resources/dataset)
* [GROBID Sequence Labeling data](https://github.com/kermitt2/delft/tree/master/data/sequenceLabelling/grobid)
* [The General Index - Metadata, Ngrams, and Keyphrases in 107,233,728 journal articles](https://archive.org/details/GeneralIndex)
* [Pubtrends Review Dataset](https://github.com/JetBrains-Research/pubtrends-review/tree/master/review)
* [PubTator Central (PTC) - NLP annotated PMC datasets](https://www.ncbi.nlm.nih.gov/research/pubtator/)
* [PubMedCentral Author Manuscript Collection](https://ftp.ncbi.nlm.nih.gov/pub/pmc/manuscript/)
* [Paper analyzer pubmed](https://research.jetbrains.org/groups/paper_analyzer/projects/)
* [NER on Material Science Papers](https://github.com/olivettigroup/annotated-materials-syntheses)
* [SoMeSci - Software Mentions in Science](https://zenodo.org/record/4968738)
* [NLMChem a new resource for chemical entity recognition in PubMed full-text literature](https://zenodo.org/record/4628233#.Yd_YRL3MJ3g)
* [Scientific summarization datasets](https://github.com/Santosh-Gupta/ScientificSummarizationDataSets)
* [PubMed Classification](https://figshare.com/articles/dataset/PubMed_classification_v1_202102/16601402)
* [Annotated scientific findings with sentence-level and aspect-level certainty](https://github.com/Jiaxin-Pei/Certainty-in-Science-Communication)
* [SoftwareKG_Social and SoftwareKG_PubMed - Software mentions in articles](https://data.gesis.org/softwarekg/site/)
* [Bioinformatics Named Entity Recogniser for Databases and Software](https://sourceforge.net/projects/bionerds/files/goldstandard/)
* [The CodeMeta Project: preservation, discovery, reuse, and attribution of software](https://codemeta.github.io/)
* [Social Science Software Citation Dataset](https://github.com/f-krueger/SoSciSoCi)
* [SoMeSci - A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles](https://data.gesis.org/somesci/)
* [Softcite dataset: A gold-standard dataset of software mentions in research publications for supervised learning based named entity recognition](https://github.com/howisonlab/softcite-dataset)
* [SoftwareKG-PMC: a Knowledge Graph of Software mentions extracted from articles of the PMC Open Access Dataset](https://zenodo.org/record/5780121#.Yh3QO-jMJ3h)
* [DEAL: Detecting Entities in the Astrophysics Literature](https://ui.adsabs.harvard.edu/WIESP/2022/SharedTasks)
* [Computer Science Knowledge Graph](https://scholkg.kmi.open.ac.uk/)
* [SCIERC: Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction](https://nlp.cs.washington.edu/sciIE/) - [Code](https://bitbucket.org/luanyi/scierc/src/master/)
* [University of Washington BIO NLP datasets](http://depts.washington.edu/bionlp/index.html?corpora)
* [multimodal_summ: Multimodal summarization of research papers](https://github.com/LCS2-IIITD/multimodal_summ)
* [ACL Anthology Corpus - Full Text](https://github.com/shauryr/ACL-anthology-corpus)
* [Entity Linking of Crossref Funding Orgs in Acknowledgements](https://github.com/SEYED7037/EDFund_sample_dataset) - [paper](https://arxiv.org/abs/2209.00351)
* [Microsoft Academic Knowledge Graph (MAKG)](https://makg.org/) - [Zenodo](http://doi.org/10.5281/zenodo.3936556) - [ComplEx entity embeddings (120 GB) for all 243 million authors, 239 million publications, 49,000 journals, and 16,000 conferences](https://makg.org/entity-embeddings/)
* [Wikidata:WikiProject Clinical Trials](https://www.wikidata.org/wiki/Wikidata:WikiProject_Clinical_Trials)
* [A Dataset of Alt Texts from HCI Publications](https://github.com/allenai/hci-alt-texts)
* [PubMed-OA-Extraction-dataset](https://zenodo.org/record/6330817)
* [SciRepEval: A Multi-Format Benchmark for Scientific Document Representations](https://github.com/allenai/scirepeval)
* [The MAPLE Benchmark for Scientific Literature Tagging](https://zenodo.org/record/7611544)

## Networks
* [ACL Anthology Network](http://clair.eecs.umich.edu/aan/index.php)
* [I³ Open Innovation Dataset Index](https://iiindex.org/) - Multiple datasets related to patent networks, inventor careers, etc.
* [Science4cast Competition](https://github.com/iarai/science4cast) - capture the evolution of scientific concepts and predict which research topics will emerge in the coming years

## Taxonomies and Ontologies of Research Concepts
* [SciGraph Springer Nature](https://scigraph.springernature.com/explorer/downloads/)
* [Medical Subject Headings](https://meshb.nlm.nih.gov/search) maintained by the [National Library of Medicine of the United States](https://www.nlm.nih.gov)
* [Computer Science Ontology](https://cso.kmi.open.ac.uk/home) maintained by [Scholarly Knowledge: Modeling, Mining and Sense Making](http://skm.kmi.open.ac.uk)
* [Physics Subject Headings (PhySH)](https://physh.aps.org/) maintained by the American Physical Society (APS) - [GitHub](https://github.com/physh-org/PhySH)
* [Open Biological and Biomedical Ontology (OBO)](http://obofoundry.org/) maintained by the [OBO Foundry](http://obofoundry.org)
* [ACM Computing Classification System](https://www.acm.org/publications/class-2012) maintained by the [Association for Computing Machinery](https://www.acm.org)
* [Physics and Astronomy Classification Scheme (PACS)](https://web.archive.org/web/20131122200802/http://www.aip.org/pacs/pacs2010/about.html) maintained by the American Institute of Physics (AIP), *discontinued* in 2010 and replaced by [Physics Subject Headings](https://physh.aps.org/)
* [Mathematics Subject Classification (MSC)](https://mathscinet.ams.org/msc/msc2010.html) maintained by [Mathematical Reviews](http://www.ams.org/mr-database) and [zbMATH](https://zbmath.org)
* [Journal of Economic Literature (JEL)](https://www.aeaweb.org/econlit/jelCodes.php) maintained by the [American Economic Association](https://www.aeaweb.org)
* [STW Thesaurus for Economics](http://zbw.eu/stw/version/latest/about) maintained by [ZBW - Leibniz Information Centre for Economics](http://www.zbw.eu/de/)
* [Australian and New Zealand Standard Research Classification (ANZSRC)](https://www.arc.gov.au/grants/grant-application/classification-codes-rfcd-seo-and-anzsic-codes) maintained by the [Australian Bureau of Statistics](http://www.abs.gov.au). It consists of 3 sub-classification schemes:
    * [Fields of Research (FoR)](http://www.abs.gov.au/Ausstats/abs@.nsf/Latestproducts/6BB427AB9696C225CA2574180004463E?opendocument) classification
    * [Research Fields, Courses and Disciplines (RFCD)](http://www.abs.gov.au/ausstats/abs@.nsf/66f306f503e529a5ca25697e0017661f/955FFA4EB1B23847CA25697E0018FB14?opendocument) classification
    * [Socio-Economic Objective (SEO)](http://www.abs.gov.au/Ausstats/abs@.nsf/Latestproducts/CF7ADB06FA2DFD69CA2574180004CB82?opendocument) classification
* [Library of Congress Classification (LCC)](https://www.loc.gov/catdir/cpso/lcc.html) maintained by [Library of Congress](https://www.loc.gov)
* [Fields of Study (FoS)](https://academic.microsoft.com/#/topics/0/) maintained by [Microsoft Academic](https://academic.microsoft.com)
* [Crossref Open Funder Registry](https://gitlab.com/crossref/open_funder_registry)
* [Scientific Keyphrase Extraction Datasets - KP20k, NUS, MAG_KP](https://github.com/memray/OpenNMT-kpg-release)
* [Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources](https://fdm.luis.uni-hannover.de/tr/dataset/stem-ecr-v1-0)
* [XL-BEL - A benchmark for cross-lingual biomedical entity linking spanning 10 typologically diverse languages](https://github.com/cambridgeltl/sapbert)
* [IteraTeR: Understanding Iterative Revision from Human-Written Text based on ArXiv abstract edit versions](https://github.com/vipulraheja/IteraTeR)
* [CiteSum: Citation Text-guided Scientific Extreme Summarization and Low-resource Domain Adaptation](https://github.com/morningmoni/CiteSum)
* [AckExtract: Extraction of acknowledgements and their named entities from scholarly papers](https://github.com/lamps-lab/ackextract)
* [MSVEC: Multi-Domain Scientific Claim Verification Evaluation Corpus](https://github.com/lamps-lab/msvec)
* [GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing](https://isg.beel.org/blog/2019/12/10/giant-the-1-billion-annotated-synthetic-bibliographic-reference-string-dataset-for-deep-citation-parsing-pre-print/) - [dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/LXQXAO)

## Affiliations

* [Global Research Identifier Database (GRID)](https://www.grid.ac)
* [CS Rankings with people linked to institutes](http://csrankings.org/#/index?all)

## Altmetrics and Dimensions
* [Altmetrics API](https://api.altmetric.com/)
* [Dimensions.ai API](https://metrics-api.dimensions.ai) - [documentation](https://figshare.com/articles/Dimensions_Metrics_API_Documentation/5783694), [example](http://metrics-api.dimensions.ai/doi/10.7717/peerj-cs.119)
* [Core Conference Rankings](http://www.core.edu.au/conference-portal/2018-conference-rankings-1)
* [China Computer Federation Conference Rankings](https://www.ccf.org.cn/xspj/rgzn/)

# Tools

## User interface to publication datasets and analysis

* [Google Scholar](https://scholar.google.com/)
* [Semantic Scholar](https://www.semanticscholar.org/)
* [Microsoft Academic Graph](http://academic.research.microsoft.com/)
* [OpenAIRE Explore](https://explore.openaire.eu)
* [AceMap](http://acemap.sjtu.edu.cn/)
* [GitXiv](http://www.gitxiv.com/)
* [ACL Anthology](http://aclanthology.info/)
* [NIPS papers](https://papers.nips.cc/)
* [Abel tools for PubMed data](http://abel.lis.illinois.edu/resources.html)
* [infolis: linking research data and publications](http://infolis.github.io/)
* [Metrics toolkit](http://www.metrics-toolkit.org/)
* [Rcrossref (R library)](https://github.com/ropensci/rcrossref)
* [Rscopus (R library)](https://cran.r-project.org/web/packages/rscopus/index.html)
* [Scholar (R library)](https://cran.r-project.org/web/packages/scholar/index.html)
* [Bibliometrix (R library)](http://www.bibliometrix.org/)
* [CITAN (R library)](https://cran.r-project.org/web/packages/CITAN/index.html)
* [BibeR (BibeR: A Web-based tool for bibliometric analysis in scientific literature)](https://yangliufr.shinyapps.io/BibeR/)
* [scihub.py (Python library)](https://github.com/zaytoun/scihub.py)
* [SoPaper (Python library)](https://github.com/ppwwyyxx/SoPaper)
* [CiteSeer tools](https://github.com/SeerLabs)
* [Novelty quantification in PubMed articles](https://github.com/napsternxg/Novelty)
* [TidyPMC - R based PMC XML parser](https://github.com/cstubben/tidypmc)
* [PublicationHarvester - Download PubMed publications of an author](https://github.com/andrewstellman/PublicationHarvester)
* [Publish or Perish - retrieves and analyzes academic citations from MS Academic and Scholar](https://harzing.com/resources/publish-or-perish)
* [Affiliation string parser](https://github.com/titipata/affiliation_parser)
* [CiteSeerX](https://csxstatic.ist.psu.edu/downloads/software)
* [Data Set Knowledge Graph (DSKG) - an RDF data set about data sets](http://dskg.org/)
* [Citation Gecko - Find related papers](https://www.citationgecko.com)
* [pySciSci - Python tool for working with MAG, PubMed, etc.](https://github.com/SciSciCollective/pyscisci)
* [ACM Digital Library](https://dl.acm.org/)

## Tools for collecting open access papers

* [ContentMine - getpapers](https://github.com/ContentMine/getpapers)
* [rcoreoa](https://github.com/ropensci/rcoreoa) - [CORE](https://core.ac.uk) API R client
* [metaknowledge - A Python library for doing bibliometric and network analysis in science and health policy research](https://github.com/networks-lab/metaknowledge)
* [PubMedPortable - PubMed to Postgres](https://github.com/KerstenDoering/PubMedPortable)
* [medic - Parsing MEDLINE and storing into a DB](https://github.com/fnl/medic)

## Tools for classifying research papers

* [CSO-Classifier](https://github.com/angelosalatino/cso-classifier)
* [WikiCSSH](https://uiuc-ischool-scanr.github.io/WikiCSSH/)
* [SAGE Rejected Article Tracker](https://github.com/ad48/rejected_article_tracker_pkg)

## Visualizations
* [Rexplore](https://technologies.kmi.open.ac.uk/rexplore/)
* [VOSviewer](http://www.vosviewer.com)
* [CitNetExplorer](http://www.citnetexplorer.nl/)
* [CiteSpace](http://cluster.cis.drexel.edu/~cchen/citespace/)
* [Nobel nominations and recipients](https://ria.ru/infografika/20151210/1339535142.html?lang=en)
* [WOS2Pajek](http://vladowiki.fmf.uni-lj.si/doku.php?id=pajek:wos2pajek)

## Language Processing and Information Extraction

* [Biomedical - BioSentVec Embeddings](https://github.com/ncbi-nlp/BioSentVec)
* [Biomedical embeddings - CambridgeLTL](https://github.com/cambridgeltl/BioNLP-2016)
* [NIH scientific paper pre-processing](https://github.com/NIHOPA/NLPre)
* [SciSpacy - spaCy models for Biomedical NLP from AllenAI](https://allenai.github.io/scispacy/)
* [Multitask Biomedical NER](https://github.com/yuzhimanhua/Multi-BioNER)
* [SciBERT - BERT LM for Biomedical and CS papers](https://github.com/allenai/scibert)

## Citation and metadata extraction
* [CERMINE](https://github.com/CeON/CERMINE)
* [Grobid](https://grobid.readthedocs.io/en/latest/)
* [EXCITE (Extraction of Citations from PDF Documents)](http://excite.west.uni-koblenz.de/website/)
* [Science-Parse](https://github.com/allenai/science-parse)
* [unarXive (Citation in context from arXiv)](https://github.com/IllDepence/unarXive)
* [Biblio-Glutton](https://github.com/kermitt2/biblio-glutton)
* [PDF/LaTeX to JSON](https://github.com/allenai/s2orc-doc2json)
* [CrossRef Reference Matching code and evaluation data](https://github.com/CrossRef/reference-matching-evaluation)
* [Citation style classifier and evaluation data](https://gitlab.com/crossref/citation_style_classifier)
* [refextract - extracting references used in scholarly communication](https://github.com/inspirehep/refextract)

## Publication and Publisher Info

* [Interactive sheet for deciding publication strategy and open science](https://docs.google.com/spreadsheets/d/1ALIr6i-ufawnR1_tZBbF_Io7ihWpLZKBOD0aLhRRp0U/edit#gid=165846403) - [Tweet](https://twitter.com/jeroenbosman/status/1492876367976968193)

## Author Name Disambiguation
* [Bibliographic Entity Automatic Recognition and Disambiguation](https://github.com/inspirehep/beard) - [paper](https://arxiv.org/abs/1508.07744)

# Community

## Journals
* [Frontiers in Research Metrics and Analytics](https://www.frontiersin.org/journals/research-metrics-and-analytics)
* [Scientometrics](https://link.springer.com/journal/11192)
* [Journal of Informetrics](https://www.journals.elsevier.com/journal-of-informetrics)
* [Quantitative Science Studies](https://www.mitpressjournals.org/loi/qss) (Open Access)
* [Science, Technology, & Human Values](https://journals.sagepub.com/home/sth)
* [Social Studies of Science](https://journals.sagepub.com/home/sss)
* [Science and Public Policy](https://academic.oup.com/spp)

## Conferences
* [Joint Conference on Digital Libraries (JCDL)](http://www.jcdl.org)
* [International Conference on Theory and Practice of Digital Libraries (TPDL)](http://www.tpdl.eu)
* [European Semantic Web Conference (ESWC), Research of Research Track](https://2019.eswc-conferences.org/call-for-papers-research-of-research-track/)
* [STI Conference series (Science and Technology indicators, e.g., 2018)](http://sti2018.cwts.nl/)
* [ISSI Conference series (International Conference on Scientometrics & Informetrics, e.g., 2019)](https://www.issi2019.org/)

## Workshops
* [SIGMET - Metrics workshop](https://www.asist.org/SIG/SIGMET/workshop/)
* [International Workshop on Mining Scientific Publications](https://wosp.core.ac.uk/)
* [Semantics, Analytics, Visualisation: Enhancing Scholarly Dissemination (SAVE-SD)](https://save-sd.github.io/2018/)
* [Workshop on Reframing Research (RefResh)](http://refresh.kmi.open.ac.uk)
* [Enabling Open Semantic Science (SemSci)](https://semsci.github.io/SemSci2018/)
* [Workshop on Scholarly Document Processing](https://ornlcda.github.io/SDProc/index.html)

## Summer Schools
* [CWTS Scientometrics Spring School (CS3)](https://www.cwts.nl/education/cwts-scientometrics-spring-school)
* [European Summer School of Scientometrics (ESSS)](https://www.scientometrics-school.eu/)

## Courses
* [SI 710: Science of Science - University of Michigan School of Information](https://docs.google.com/document/d/1j-S5k-KHa0mNt3eqJU-bcM4s615z62Ky5c8upBaggKo/edit#heading=h.bvzc4stuveot)

## Associations & Community
* [International Society for Scientometrics and Informetrics (ISSI)](http://issi-society.org)
* [European Network of Indicator Designers (ENID)](http://www.forschungsinfo.de/ENID/)
* [4S (Society for Social Studies of Science)](http://4sonline.org/)
* [SIG/MET - Special Interest Group for the measurement of information production and use](https://www.asist.org/SIG/SIGMET/)

## Research Groups
* [Science of Science and Computational Discovery Lab - University of Colorado Boulder](https://scienceofscience.org/)

## Blogs

* [Clarivate Blog](https://clarivate.com/blog/)
* [Elsevier Connect](https://www.elsevier.com/connect)
* [The Scholarly Kitchen](https://scholarlykitchen.sspnet.org/)

# Contributions
The following people have contributed to the items on this list.
* [Shubhanshu Mishra](https://shubhanshu.com) - Maintainer of the list.
* [Angelo Antonio Salatino](https://github.com/angelosalatino)
* [Philipp Zumstein](https://github.com/zuphilip)
* [Ali (Aliakbar Akbaritabar)](http://akbaritabar.netlify.com)
* [Andrea Mannocci](https://github.com/andremann)