https://github.com/commoncrawl/cc-citations
Scientific articles using or citing Common Crawl data
bibliography bibtex opendata
- Host: GitHub
- URL: https://github.com/commoncrawl/cc-citations
- Owner: commoncrawl
- Created: 2018-09-24T08:16:45.000Z (about 7 years ago)
- Default Branch: main
- Last Pushed: 2025-05-16T08:29:17.000Z (5 months ago)
- Last Synced: 2025-05-16T09:34:29.102Z (5 months ago)
- Topics: bibliography, bibtex, opendata
- Language: Jupyter Notebook
- Homepage:
- Size: 19.9 MB
- Stars: 20
- Watchers: 12
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Citation: citations_2025.csv
README
# Common Crawl Citations – BibTeX Database
BibTeX files are in [bib/](./bib/).
Note: this is a work in progress and still contains only a fraction of recent articles.
## Fields Specific for Common Crawl
The following non-standard fields record how a publication relates to Common Crawl:
- cc-author-affiliation
  - affiliation of the authors
- cc-class
  - classification of the publication: domain of research, topics, keywords
- cc-snippet
  - snippet citing Common Crawl
- cc-dataset-used
  - subset of Common Crawl used, e.g., CC-MAIN-2016-07
- cc-derived-dataset-about
  - the publication describes a dataset which has been derived from Common Crawl, e.g., GloVe-word-embeddings
- cc-derived-dataset-used
  - a dataset has been used which is derived from Common Crawl, e.g., GloVe-word-embeddings
- cc-derived-dataset-cited
  - a derived dataset is cited but not used
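As a rough illustration of how the fields listed above might appear in an entry and be read back, here is a minimal Python sketch. The sample entry is entirely hypothetical (key, author, title, and all field values are placeholders, not a record from bib/), and the `cc_fields` helper is an assumption for illustration: it expects one brace-delimited field per line and is not a general BibTeX parser.

```python
import re

# Hypothetical entry: the key, author, title, and all field values are
# placeholders for illustration, not a real record from bib/.
SAMPLE_ENTRY = """
@inproceedings{example2020placeholder,
  author = {Doe, Jane},
  title = {A Placeholder Title},
  year = {2020},
  cc-author-affiliation = {Example University},
  cc-class = {nlp/word-embeddings},
  cc-dataset-used = {CC-MAIN-2016-07},
  cc-derived-dataset-used = {GloVe-word-embeddings},
}
"""

# Matches lines of the form `cc-name = {value},` (one field per line).
FIELD_RE = re.compile(
    r"^[ \t]*(cc-[a-z-]+)[ \t]*=[ \t]*\{(.*)\}[ \t]*,?[ \t]*$",
    re.MULTILINE,
)


def cc_fields(entry: str) -> dict[str, str]:
    """Return the cc-* fields of one BibTeX entry as a name -> value mapping."""
    return {name: value for name, value in FIELD_RE.findall(entry)}


if __name__ == "__main__":
    for name, value in cc_fields(SAMPLE_ENTRY).items():
        print(f"{name}: {value}")
```

Standard fields such as `author` and `title` are skipped because the pattern only accepts names starting with `cc-`. Multi-line or quote-delimited values would need a real BibTeX parser.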
## Formatting and Export of Citations
The [Makefile](./Makefile) contains targets that apply consistent formatting to the citations and export them. The following BibTeX tools are required: [bibtex2html](https://www.lri.fr/~filliatr/bibtex2html/), [bibclean](https://ctan.org/tex-archive/biblio/bibtex/utils/bibclean), [bibtool](http://www.gerd-neugebauer.de/software/TeX/BibTool/en/).
(Do not confuse the required bibclean with the PyPI package of the same name, which is an entirely different tool. bibclean, bibtool, and bibtex2html are available as OS packages, at least in apt-based distros.)
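Before running the Makefile, it can help to confirm that the three tools named above are installed. The following is a minimal sketch, assuming each executable is named after its tool (`bibtex2html`, `bibclean`, `bibtool`); the helper names `find_tools` and `missing_tools` are hypothetical and not part of this repository.

```python
import shutil

# Assumption: each executable carries the same name as the tool itself.
REQUIRED_TOOLS = ("bibtex2html", "bibclean", "bibtool")


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its location on PATH, or None if not found."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that are not available on PATH."""
    return [tool for tool, path in find_tools(tools).items() if path is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing BibTeX tools:", ", ".join(missing))
    else:
        print("All required BibTeX tools were found.")
```

The check only confirms that a program with the given name is on `PATH`. It cannot tell the required bibclean apart from an unrelated program of the same name.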
## Citations from Google Scholar Alerts
As an initial step, and to achieve higher coverage, citations are extracted from Google Scholar Alert e-mails received from April 2016 to date. See [gscholar_alerts](./gscholar_alerts/).