{"id":21621641,"url":"https://github.com/commoncrawl/ccf-eot-seeds-2024","last_synced_at":"2026-01-31T18:02:08.710Z","repository":{"id":257345212,"uuid":"857984323","full_name":"commoncrawl/ccf-eot-seeds-2024","owner":"commoncrawl","description":"Common Crawl's contribution of seeds to the End of Term Archive 2024","archived":false,"fork":false,"pushed_at":"2024-10-07T06:25:23.000Z","size":7,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-11-25T00:02:27.327Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Makefile","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/commoncrawl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-16T04:55:39.000Z","updated_at":"2024-11-20T04:39:28.000Z","dependencies_parsed_at":"2024-09-16T06:27:36.518Z","dependency_job_id":"5ec7c4f7-964b-4b63-9ca5-f7f58e206884","html_url":"https://github.com/commoncrawl/ccf-eot-seeds-2024","commit_stats":null,"previous_names":["commoncrawl/ccf-eot-seeds-2024"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/commoncrawl/ccf-eot-seeds-2024","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fccf-eot-seeds-2024","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fccf-eot-seeds-2024/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fccf-eot-seeds-2024/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/commoncrawl%2Fccf-eot-seeds-2024/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/commoncrawl","download_url":"https://codeload.github.com/commoncrawl/ccf-eot-seeds-2024/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fccf-eot-seeds-2024/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259432319,"owners_count":22856726,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-25T00:00:45.486Z","updated_at":"2026-01-31T18:02:03.660Z","avatar_url":"https://github.com/commoncrawl.png","language":"Makefile","funding_links":[],"categories":["Makefile"],"sub_categories":[],"readme":"# ccf-eot-seeds-2024\n\nCode used to generate some of the \"seed lists\" used for the [End of\nTerm Web Archive 2024 crawl](https://github.com/end-of-term/eot2024/).\n\n## Install\n\n`pip install -r requirements.txt`\n\n## 2024 recipes\n\n### make get-csvs\n\nDownloads two CSV files from get.gov, listing all of the federal and\nnon-federal domains registered in the .gov TLD.\n\n### make get-webgraph\n\nDownloads [web graph](https://commoncrawl.org/web-graphs) summaries from CCF, as tab-separated values (TSV).\n\n### make make-subsets\n\nGiven web graph domain and host ranks, use grep to extract\nthe .mil and .gov domains therein. The output is\nstill in the web graph table TSV format.\n\n### make hosts-to-seed\n\nTake current-federal.csv plus the hosts web graph, and output all .gov\nhosts whose domains are in current-federal.csv. 
For .mil hosts, output\nall hosts. These two output files are what is checked into eot2024/seed-lists:\n\n- ccf-gov-federal-web-graph-2024-jun-jul-aug.txt\n- ccf-mil-web-graph-2024-jun-jul-aug.txt\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fccf-eot-seeds-2024","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcommoncrawl%2Fccf-eot-seeds-2024","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fccf-eot-seeds-2024/lists"}