https://github.com/commoncrawl/cc-warcinfo-index-builder
Code to build an index that maps warcinfo-id to crawl / warc
- Host: GitHub
- URL: https://github.com/commoncrawl/cc-warcinfo-index-builder
- Owner: commoncrawl
- Created: 2025-05-27T05:25:13.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-27T05:48:29.000Z (about 1 year ago)
- Last Synced: 2025-05-27T06:35:44.919Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 6.84 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# generate-warcinfo-index
For each crawl, generate Parquet output with the following fields:
- warcinfo_id
- warc_filename
The `make all-warcinfo` step runs one extractor per crawl. On the
first run, the first crawl extraction finished in 1h 35m and the last
in 6h 56m.
A copy of the actual index can be found at `rf:/home/cc-pds/warcinfo-id.parquet`.
## How to query
See the test code in `test_pandas.py` and `test_duck.py`.
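As a rough illustration of what a lookup against the two index fields might look like, here is a minimal pandas sketch. It is not taken from the test files: the helper names `load_index` and `lookup_warc` are hypothetical, the sample values are made up, and the assumption is that each `warcinfo_id` is stored as a plain string.

```python
import pandas as pd


def load_index(path):
    """Read only the two index columns from a Parquet file.

    Needs a Parquet engine such as pyarrow to be installed.
    """
    return pd.read_parquet(path, columns=["warcinfo_id", "warc_filename"])


def lookup_warc(index, warcinfo_id):
    """Return the WARC filenames recorded for one warcinfo_id."""
    matches = index.loc[index["warcinfo_id"] == warcinfo_id, "warc_filename"]
    return matches.tolist()


# Tiny in-memory stand-in for the real index; the values are invented.
index = pd.DataFrame(
    {
        "warcinfo_id": ["<urn:uuid:aaaa>", "<urn:uuid:bbbb>"],
        "warc_filename": ["crawl-a/file-1.warc.gz", "crawl-b/file-2.warc.gz"],
    }
)
print(lookup_warc(index, "<urn:uuid:bbbb>"))
```

With the real index, `load_index` would be pointed at a local copy of the Parquet file instead of the in-memory frame.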
## Updating the index
The code uses `smart_open()` to read the initial part of every WARC file and extract
the first record, which should be the warcinfo record.
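The idea of reading only the first record can be sketched with the standard library alone. This is an illustration under stated assumptions, not the repository's extractor: the function names `read_first_record_headers` and `warcinfo_row` are hypothetical, the sample record is synthetic, and it is assumed that the index stores the `WARC-Record-ID` header value unchanged.

```python
import gzip
import io


def read_first_record_headers(stream):
    """Parse the header block of the first WARC record.

    `stream` must be a binary file-like object yielding decompressed bytes.
    """
    version = stream.readline().strip()
    if not version.startswith(b"WARC/"):
        raise ValueError(f"not a WARC record: {version!r}")
    headers = {}
    for raw in iter(stream.readline, b""):
        line = raw.strip()
        if not line:  # a blank line ends the header block
            break
        name, _, value = line.partition(b":")
        headers[name.decode("utf-8").strip()] = value.decode("utf-8").strip()
    return headers


def warcinfo_row(stream, warc_filename):
    """Build one index row from the first record of a WARC stream."""
    headers = read_first_record_headers(stream)
    if headers.get("WARC-Type") != "warcinfo":
        raise ValueError("first record is not a warcinfo record")
    return {"warcinfo_id": headers["WARC-Record-ID"], "warc_filename": warc_filename}


# Synthetic gzip-compressed record standing in for the start of a real WARC.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: warcinfo\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
    b"\r\n\r\n"
)
with gzip.open(io.BytesIO(gzip.compress(sample)), "rb") as fh:
    row = warcinfo_row(fh, "example.warc.gz")
print(row)
```

Because only the header lines are consumed, very little of the file has to be transferred. A remote stream could be passed in the same way, assuming it yields decompressed bytes.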
The code does not re-download anything it already has, and runs one
extractor per crawl in parallel. Each extractor needs only about 3% of
a core, but network latency can stretch a single crawl to as long as
7 hours. When many crawls run in parallel, the slowest one can take
much longer than the fastest.
```
make collinfo
make all-crawls
make all-warcinfo
make parquet
make test
```
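The skip-what-is-done and one-extractor-per-crawl behaviour described above is driven by the Makefile. The following Python sketch only illustrates the general pattern and is not how the repository implements it. The names `extract_if_missing` and `run_all`, the `.jsonl` output naming, and skipping at whole-crawl granularity are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_if_missing(crawl, out_dir, extract):
    """Run `extract(crawl, path)` unless the crawl's output already exists."""
    out_path = Path(out_dir) / f"{crawl}.jsonl"
    if out_path.exists():
        return crawl, "skipped"
    tmp_path = out_path.with_suffix(".tmp")
    extract(crawl, tmp_path)
    # Rename only after a complete run, so a crash never leaves a
    # half-written file that would be mistaken for finished output.
    tmp_path.replace(out_path)
    return crawl, "extracted"


def run_all(crawls, out_dir, extract, max_workers=8):
    """Process crawls concurrently and report what happened to each one."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda c: extract_if_missing(c, out_dir, extract), crawls)
        return dict(results)
```

Threads are a reasonable fit here because each extractor spends most of its time waiting on the network rather than using CPU.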
To add a single new crawl, edit the Makefile to change the CRAWL
variable, then run:
```
make one-paths
make one-warcinfo
make parquet
make test
```
## Install
If you are happy with the result, copy the index into place:
```
make install
```