{"id":28626153,"url":"https://github.com/commoncrawl/cc-citations","last_synced_at":"2025-06-12T08:41:05.412Z","repository":{"id":142217930,"uuid":"150072694","full_name":"commoncrawl/cc-citations","owner":"commoncrawl","description":"Scientific articles using or citing Common Crawl data","archived":false,"fork":false,"pushed_at":"2025-05-16T08:29:17.000Z","size":20868,"stargazers_count":20,"open_issues_count":0,"forks_count":3,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-05-16T09:34:29.102Z","etag":null,"topics":["bibliography","bibtex","opendata"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/commoncrawl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":"citations_2025.csv","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-09-24T08:16:45.000Z","updated_at":"2025-05-16T08:29:20.000Z","dependencies_parsed_at":"2023-04-24T22:53:30.765Z","dependency_job_id":"7311c6ae-caf5-4585-8610-d6d462da4d97","html_url":"https://github.com/commoncrawl/cc-citations","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/commoncrawl/cc-citations","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-citations","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-citations/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-citations/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/c
ommoncrawl%2Fcc-citations/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/commoncrawl","download_url":"https://codeload.github.com/commoncrawl/cc-citations/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-citations/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259432290,"owners_count":22856718,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bibliography","bibtex","opendata"],"created_at":"2025-06-12T08:41:04.759Z","updated_at":"2025-06-12T08:41:05.400Z","avatar_url":"https://github.com/commoncrawl.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# Common Crawl Citations – BibTeX Database\n\nBibTex files are in [bib/](./bib/)\n\nNote: work in progress, still contains only a fraction of recent articles\n\n\n## Fields Specific for Common Crawl\n\nThe following non-standard fields are used to add information how the publications relate to Common Crawl:\n\n\u003cdl\u003e\n\u003cdt\u003ecc-author-affiliation\u003c/dt\u003e\n\u003cdd\u003eaffiliation of the authors\u003c/dd\u003e\n\u003cdt\u003ecc-class\u003c/dt\u003e\n\u003cdd\u003eclassification of the publication: domain of research, topics, keywords\u003c/dd\u003e\n\u003cdt\u003ecc-snippet\u003c/dt\u003e\n\u003cdd\u003esnippet citing Common Crawl\u003c/dd\u003e\n\u003cdt\u003ecc-dataset-used\u003c/dt\u003e\n\u003cdd\u003esubset of Common Crawl used, e.g., 
CC-MAIN-2016-07\u003c/dd\u003e\n\u003cdt\u003ecc-derived-dataset-about\u003c/dt\u003e\n\u003cdd\u003ethe publication describes a dataset which has been derived from Common Crawl, e.g., GloVe-word-embeddings\u003c/dd\u003e\n\u003cdt\u003ecc-derived-dataset-used\u003c/dt\u003e\n\u003cdd\u003ea dataset has been used which is derived from Common Crawl, e.g., GloVe-word-embeddings\u003c/dd\u003e\n\u003cdt\u003ecc-derived-dataset-cited\u003c/dt\u003e\n\u003cdd\u003ea derived dataset is cited but not used\u003c/dd\u003e\n\u003c/dl\u003e\n\n\n## Formatting and Export of Citations\n\nThe [Makefile](./Makefile) contains targets that apply consistent formatting to the citations and export them. The following BibTeX tools are required: [bibtex2html](https://www.lri.fr/~filliatr/bibtex2html/), [bibclean](https://ctan.org/tex-archive/biblio/bibtex/utils/bibclean), [bibtool](http://www.gerd-neugebauer.de/software/TeX/BibTool/en/).\n\n(Do not confuse bibclean with the PyPI package of the same name, which is entirely different. bibclean, bibtool, and bibtex2html are available as OS packages, at least in apt-based distros.)\n\n## Citations from Google Scholar Alerts\n\nAs an initial step, and to get higher coverage, citations are extracted from Google Scholar Alert e-mails received from April 2016 to date. See [gscholar_alerts](./gscholar_alerts/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcc-citations","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcommoncrawl%2Fcc-citations","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcc-citations/lists"}