{"id":45736532,"url":"https://github.com/aloth/cred-1","last_synced_at":"2026-05-05T22:01:05.237Z","repository":{"id":340539613,"uuid":"1166468607","full_name":"aloth/cred-1","owner":"aloth","description":"CRED-1: An Open Multi-Signal Domain Credibility Dataset (2,672 domains)","archived":false,"fork":false,"pushed_at":"2026-04-21T16:09:43.000Z","size":951,"stargazers_count":7,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-21T18:19:41.791Z","etag":null,"topics":["ai-safety","content-moderation","credibility","dataset","digital-literacy","disinformation","domain-credibility","fact-checking","fake-news","information-integrity","machine-learning","media-bias","misinformation","news-credibility","news-verification","nlp","open-dataset","python","research","web-trust"],"latest_commit_sha":null,"homepage":"https://alexloth.com/cred-1-open-domain-credibility-dataset-preprint/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aloth.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-25T08:58:40.000Z","updated_at":"2026-04-21T16:09:46.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/aloth/cred-1","commit_stats":null,"previous_names":["aloth/cred-1"],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/aloth/cred-1","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/aloth%2Fcred-1","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aloth%2Fcred-1/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aloth%2Fcred-1/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aloth%2Fcred-1/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aloth","download_url":"https://codeload.github.com/aloth/cred-1/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aloth%2Fcred-1/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32669433,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-05T11:29:49.557Z","status":"ssl_error","status_checked_at":"2026-05-05T11:29:48.587Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-safety","content-moderation","credibility","dataset","digital-literacy","disinformation","domain-credibility","fact-checking","fake-news","information-integrity","machine-learning","media-bias","misinformation","news-credibility","news-verification","nlp","open-dataset","python","research","web-trust"],"created_at":"2026-02-25T11:23:34.467Z","updated_at":"2026-05-05T22:01:05.189Z","avatar_url":"https://github.com/aloth.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CRED-1: Open Domain Credibility Dataset\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/cred1-domain-credibility-dataset-banner.jpg\" alt=\"CRED-1 Domain Credibility Dataset Banner\" width=\"100%\"\u003e\n\u003c/p\u003e\n\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18769460.svg)](https://doi.org/10.5281/zenodo.18769460)\n[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)\n\n**CRED-1** is an open, reproducible domain-level credibility dataset combining multiple openly-licensed source lists with computed enrichment signals. It provides credibility scores for **2,672 domains** known to publish mis/disinformation, conspiracy theories, or other unreliable content.\n\n\u003e **Paper:** A. Loth, M. Kappes, and M.-O. Pahl, \"CRED-1: An Open Multi-Signal Domain Credibility Dataset for Automated Pre-Bunking of Online Misinformation,\" *Preprint*, 2026. 
[doi:10.2139/ssrn.6448466](https://doi.org/10.2139/ssrn.6448466)\n\n## Key Features\n\n- **2,672 domains** with credibility scores (0.0–1.0)\n- **Fully reproducible** — Python pipeline rebuilds the dataset from scratch\n- **Multi-signal scoring** combining source labels, domain age, web popularity, fact-check frequency, and threat intelligence\n- **Privacy-preserving** — designed for on-device client-side deployment (no server calls needed)\n- **Two openly-licensed sources** — no proprietary data dependencies\n\n## Quick Start\n\n```python\nimport json\n\nwith open(\"data/cred1_current.json\") as f:\n    cred = json.load(f)\n\ndomain = \"infowars.com\"\nif domain in cred:\n    score = cred[domain][\"credibility_score\"]  # 0.0 (least credible) to 1.0 (most credible)\n    print(f\"{domain}: credibility = {score}\")\nelse:\n    print(f\"{domain}: not in dataset (neutral)\")\n```\n\n## Dataset Schema\n\n### JSON Format (`cred1_current.json`)\n\n```json\n{\n  \"infowars.com\": {\n    \"category\": \"fake\",\n    \"credibility_score\": 0.14,\n    \"domain_age_years\": 26.4,\n    \"domain_registered\": \"1999-10-04T04:00:00Z\",\n    \"iffy_factual\": \"VL\",\n    \"iffy_bias\": \"FN\",\n    \"iffy_score\": 0.1,\n    \"factcheck_claims\": 52,\n    \"safe_browsing_flagged\": false,\n    \"score_age\": 0.2,\n    \"score_cat\": 0.05,\n    \"score_factcheck\": 0.0,\n    \"score_iffy\": 0.1,\n    \"score_safebrowsing\": 0.05,\n    \"score_tranco\": 0.1,\n    \"sources\": 2,\n    \"tranco_rank\": 4382\n  }\n}\n```\n\n| Field | Description |\n|---|---|\n| `category` | Full category name: `fake`, `unreliable`, `mixed`, `conspiracy`, `satire`, `reliable` |\n| `credibility_score` | Credibility score (0.0-1.0, lower = less credible) |\n| `sources` | Number of independent source lists flagging this domain |\n| `tranco_rank` | Tranco rank (optional, absent if not ranked) |\n| `domain_registered` | Domain registration date from RDAP, ISO 8601 (optional) |\n| `domain_age_years` | Domain 
age in years, computed from `domain_registered` (optional) |\n| `iffy_factual` | MBFC factual reporting rating (optional) |\n| `iffy_bias` | MBFC political bias rating (optional) |\n| `iffy_score` | Iffy.news credibility score, 0.0-1.0 (optional) |\n| `factcheck_claims` | Number of fact-check claims from Google Fact Check Tools API (optional) |\n| `safe_browsing_flagged` | Google Safe Browsing threat flag (optional) |\n| `score_cat` | Category score component |\n| `score_iffy` | Iffy.news score component |\n| `score_tranco` | Tranco rank score component |\n| `score_age` | Domain age score component |\n| `score_factcheck` | Fact-check frequency score component |\n| `score_safebrowsing` | Safe Browsing score component |\n\n### CSV Format (`cred1_current.csv`)\n\nSame fields as JSON, in tabular format with 18 columns. Sorted by `credibility_score` ascending (least credible first).\n\n### Compact Format (`cred1_compact.json`)\n\nMinimal format for on-device embedding (e.g., browser extensions). 
Short keys, no whitespace, ~168 KB.\n\n| Key | Field |\n|---|---|\n| `s` | credibility_score |\n| `c` | category code (`f`, `u`, `m`, `c`, `s`, `r`) |\n| `n` | sources |\n| `r` | tranco_rank (optional) |\n| `d` | domain registration date as YYYY-MM-DD (optional) |\n\n## Scoring Model\n\nCRED-1 computes credibility scores as a weighted blend of five signals, with Google Safe Browsing applied as an override:\n\n| Signal | Weight | Source |\n|---|---|---|\n| **Source category** | 50% | OpenSources.co + Iffy.news consensus label |\n| **Iffy.news score** | 15% | Iffy.news credibility rating (when available) |\n| **Fact-check frequency** | 15% | Google Fact Check Tools API — number of claims |\n| **Web popularity** | 5% | Tranco Top-1M rank (log-normalized) |\n| **Domain age** | 5% | WHOIS/RDAP registration date |\n| **Google Safe Browsing** | Override | Hard cap at 0.05 if flagged as malware/social engineering |\n\nRemaining weight (when signals are unavailable) defaults to the source category score.\n\n## Data Sources\n\n| Source | Domains | License | Type |\n|---|---|---|---|\n| [OpenSources.co](https://github.com/BigMcLargeHuge/opensources) | 825 | CC BY 4.0 | Curated mis/disinformation domain list |\n| [Iffy.news Index](https://iffy.news/index/) | 2,040 | MIT | MBFC-derived unreliable source index |\n| [Tranco Top-1M](https://tranco-list.eu/) | 1,000,000 | Free to use | Aggregated web popularity ranking |\n| [RDAP](https://rdap.org/) | N/A | Public protocol | Domain registration data |\n| [Google Fact Check Tools API](https://developers.google.com/fact-check/tools/api) | N/A | Free (attribution) | Fact-check claim database |\n| [Google Safe Browsing API](https://developers.google.com/safe-browsing) | N/A | Free (attribution) | Threat intelligence |\n\n## Reproduce the Dataset\n\n```bash\n# 1. Build base dataset (fetch + merge sources)\npython3 pipeline/build_dataset.py\n\n# 2. 
Enrich with signals (requires Google Cloud API key)\nexport GOOGLE_API_KEY=\"your-key-here\"  # or macOS Keychain\npython3 pipeline/enrich_dataset.py\n\n# Individual enrichment steps:\npython3 pipeline/enrich_dataset.py --step tranco\npython3 pipeline/enrich_dataset.py --step rdap\npython3 pipeline/enrich_dataset.py --step factcheck\npython3 pipeline/enrich_dataset.py --step safebrowsing\npython3 pipeline/enrich_dataset.py --step score\n```\n\n**Requirements:** Python 3.10+, no external dependencies (stdlib only).\n\n## Category Distribution\n\n| Category | Count | % |\n|---|---|---|\n| Mixed | 1,335 | 50.0% |\n| Unreliable | 589 | 22.0% |\n| Fake | 493 | 18.5% |\n| Conspiracy | 153 | 5.7% |\n| Satire | 94 | 3.5% |\n| Reliable | 8 | 0.3% |\n\n## Applications\n\nCRED-1 is designed for:\n\n- **Browser extensions** — on-device pre-bunking at the content delivery stage\n- **Misinformation research** — ground truth for domain-level credibility studies\n- **Content moderation** — automated flagging of low-credibility sources\n- **Education** — media literacy tools and curricula\n\n## Citation\n\nIf you use CRED-1 in your research, please cite the paper:\n\n```bibtex\n@article{loth2026cred1,\n  title     = {{CRED-1}: An Open Multi-Signal Domain Credibility Dataset for Automated Pre-Bunking of Online Misinformation},\n  author    = {Loth, Alexander and Kappes, Martin and Pahl, Marc-Oliver},\n  year      = {2026},\n  doi       = {10.2139/ssrn.6448466},\n  url       = {https://ssrn.com/abstract=6448466},\n  note      = {Preprint available at SSRN}\n}\n```\n\nTo cite the dataset archive directly:\n\n```bibtex\n@dataset{loth2026cred1data,\n  title     = {{CRED-1}: An Open Multi-Signal Domain Credibility Dataset},\n  author    = {Loth, Alexander},\n  year      = {2026},\n  publisher = {Zenodo},\n  doi       = {10.5281/zenodo.18769460}\n}\n```\n\n## Contributing\n\nFound a misclassified domain? Want to propose a new credibility signal? 
We welcome community input.\n\n* 🌐 **[Report a Domain](https://github.com/aloth/cred-1/issues/new?template=domain-report.yml)** — flag a domain that seems mis-scored or missing\n* 💡 **[Propose a Signal](https://github.com/aloth/cred-1/issues/new?template=signal-proposal.yml)** — suggest a new credibility signal for the pipeline\n* ❓ **[Ask a Question](https://github.com/aloth/cred-1/issues/new?template=question.yml)** — methodology, usage, or reproduction questions\n\n## License\n\nThis repository (code and data) is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).\n\n## Acknowledgments\n\nThis dataset builds on the work of:\n- Melissa Zimdars and the OpenSources.co project\n- The Iffy.news team at the Reynolds Journalism Institute\n- Google Fact Check Tools and Safe Browsing APIs\n\nPowered by [Google Fact Check Tools](https://toolbox.google.com/factcheck/) and [Google Safe Browsing](https://safebrowsing.google.com/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faloth%2Fcred-1","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faloth%2Fcred-1","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faloth%2Fcred-1/lists"}