https://github.com/commoncrawl/ccf-eot-seeds-2024
Common Crawl's contribution of seeds to the End of Term Archive 2024
- Host: GitHub
- URL: https://github.com/commoncrawl/ccf-eot-seeds-2024
- Owner: commoncrawl
- License: apache-2.0
- Created: 2024-09-16T04:55:39.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-10-07T06:25:23.000Z (over 1 year ago)
- Last Synced: 2024-11-25T00:02:27.327Z (over 1 year ago)
- Language: Makefile
- Size: 6.84 KB
- Stars: 1
- Watchers: 5
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# ccf-eot-seeds-2024
Code used to generate some of the "seed lists" used for the [End of
Term Web Archive 2024 crawl](https://github.com/end-of-term/eot2024/).
## Install
`pip install -r requirements.txt`
## 2024 recipes
### make get-csvs
Downloads two CSV files from get.gov that list all federal and
non-federal domains registered in the .gov TLD.
### make get-webgraph
Downloads [web graph](https://commoncrawl.org/web-graphs) summaries from the Common Crawl Foundation (CCF) as tab-separated values (TSV).
### make make-subsets
Given the web graph domain and host rank files, greps out
the .gov and .mil entries. The output keeps the
web graph's TSV table format.
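The filtering step can also be sketched in Python instead of grep. This is a minimal illustration, not the repository's Makefile recipe. It assumes each data row holds the host or domain name in reversed notation (for example `gov.nasa.www`) in the fifth tab-separated column and that header lines start with `#`. The function name `filter_tlds` and the sample rows are hypothetical.

```python
import io


def filter_tlds(lines, tlds=("gov", "mil"), name_col=4):
    """Yield header lines and rows whose reversed name starts with a wanted TLD.

    Assumptions (not confirmed by the README): names are stored reversed,
    e.g. "gov.nasa.www", in column `name_col`; header lines start with "#".
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep the header so the output stays a valid table
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= name_col:
            continue  # skip malformed rows
        # In reversed notation the TLD is the first dot-separated label.
        first_label = fields[name_col].split(".", 1)[0]
        if first_label in tlds:
            yield line


# Made-up sample rows for illustration only.
sample = io.StringIO(
    "#rank_a\t#value_a\t#rank_b\t#value_b\t#host_rev\n"
    "1\t0.9\t2\t0.01\tcom.example.www\n"
    "2\t0.8\t5\t0.005\tgov.nasa.www\n"
    "3\t0.7\t9\t0.002\tmil.army\n"
)
subset = list(filter_tlds(sample))
```

Matching on the first label rather than on a substring avoids false positives such as a `.com` host that merely contains `gov` somewhere in its name.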
### make hosts-to-seed
Takes current-federal.csv and the host-level web graph, and outputs every
.gov host whose domain appears in current-federal.csv. For .mil, all hosts
are output. The results are the files checked into eot2024/seed-lists:
- ccf-gov-federal-web-graph-2024-jun-jul-aug.txt
- ccf-mil-web-graph-2024-jun-jul-aug.txt
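The selection rule for this recipe can be sketched as follows. It is an illustration under stated assumptions, not the repository's code: the CSV column name `Domain name`, the reversed host notation in the web graph, and the helper names `unreverse` and `select_seed_hosts` are all assumptions or hypothetical, and the sample data is made up.

```python
import csv
import io


def unreverse(host_rev):
    """Turn a reversed name such as "gov.nasa.www" into "www.nasa.gov"."""
    return ".".join(reversed(host_rev.split(".")))


def select_seed_hosts(federal_csv, hosts_rev, domain_col="Domain name"):
    """Return (.gov hosts on federal domains, all .mil hosts).

    `domain_col` is an assumed column name; adjust it to the real CSV header.
    """
    federal = {
        row[domain_col].strip().lower() for row in csv.DictReader(federal_csv)
    }
    gov, mil = [], []
    for host_rev in hosts_rev:
        host = unreverse(host_rev.strip().lower())
        labels = host.split(".")
        if labels[-1] == "mil":
            mil.append(host)  # every .mil host is kept
        elif labels[-1] == "gov" and len(labels) >= 2:
            # Assumes registered .gov domains are second-level, e.g. nasa.gov.
            if ".".join(labels[-2:]) in federal:
                gov.append(host)
    return gov, mil


# Made-up inputs for illustration only.
federal_csv = io.StringIO("Domain name,Domain type\nnasa.gov,Federal\n")
hosts = ["gov.nasa.www", "gov.nasa", "gov.examplecity.www", "mil.army.www"]
gov_hosts, mil_hosts = select_seed_hosts(federal_csv, hosts)
```

Here `www.examplecity.gov` is dropped because its domain is not in the federal list, while the `.mil` host passes through unfiltered.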