https://github.com/commoncrawl/cc-warcinfo-index-builder

Code to build an index that maps warcinfo-id to crawl / warc
Host: GitHub
URL: https://github.com/commoncrawl/cc-warcinfo-index-builder
Owner: commoncrawl
Created: 2025-05-27T05:25:13.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-05-27T05:48:29.000Z (about 1 year ago)
Last Synced: 2025-05-27T06:35:44.919Z (about 1 year ago)
Language: Python
Homepage:
Size: 6.84 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# generate-warcinfo-index

For each crawl, generate a Parquet file with the following fields:

- warcinfo_id
- warc_filename

The `make all-warcinfo` step runs one extractor per crawl. On the
first run, the first crawl extraction finished in 1h 35m and the last
in 6h 56m.

A copy of the actual index can be found at `rf:/home/cc-pds/warcinfo-id.parquet`.

## How to query

See the test code in `test_pandas.py` and `test_duck.py`.

## Updating the index

The code uses smart_open() to read the initial part of every WARC file,
extracting the first record, which should be the warcinfo record.
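As a rough illustration of that step, here is a minimal standard-library sketch that reads only the header of the first record in a gzipped WARC stream and returns its `WARC-Record-ID`. The function name `read_warcinfo_id` is hypothetical, and the sketch takes an in-memory stream of raw gzip bytes rather than a remote one opened with smart_open(); it is not the repository's actual code.

```python
import gzip
import io


def read_warcinfo_id(stream):
    """Return the WARC-Record-ID of the first record in a gzipped WARC stream.

    `stream` is any binary file-like object that yields the raw gzip bytes.
    Returns None if the first record is not a warcinfo record.
    """
    headers = {}
    with gzip.GzipFile(fileobj=stream) as gz:
        version = gz.readline().strip()
        if not version.startswith(b"WARC/"):
            raise ValueError("stream does not start with a WARC record")
        for raw in gz:
            line = raw.strip()
            if not line:  # a blank line ends the record's header block
                break
            name, _, value = line.partition(b":")
            headers[name.strip().lower()] = value.strip()
    if headers.get(b"warc-type") != b"warcinfo":
        return None
    record_id = headers.get(b"warc-record-id")
    return record_id.decode("ascii") if record_id else None


# A tiny synthetic record (made-up ID) standing in for the start of a real WARC.
sample = gzip.compress(
    b"WARC/1.0\r\n"
    b"WARC-Type: warcinfo\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)
print(read_warcinfo_id(io.BytesIO(sample)))
```

Because only the header lines of the first record are consumed, a reader built this way needs just the first few kilobytes of each file, which is consistent with reading "the initial part" of every WARC file.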

The code avoids re-downloading anything, and it runs in parallel with
one extractor per crawl. Each extractor needs only about 3% of a core,
but network latency can stretch a single crawl to as long as 7 hours.
When many crawls run in parallel, the slowest one can take much longer
than the fastest.

```
make collinfo
make all-crawls
make all-warcinfo
make parquet
make test
```
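The two ideas described in this section, skipping work that is already done and running one extractor per crawl in parallel, can be sketched as follows. This is an assumption-laden illustration, not the repository's implementation: `pending_paths`, `run_per_crawl`, and `extract_one` are hypothetical names, and threads are used here only because the work is described as network-bound rather than CPU-bound.

```python
from concurrent.futures import ThreadPoolExecutor


def pending_paths(all_paths, done_paths):
    """Return the WARC paths that have not been processed yet, in order."""
    done = set(done_paths)
    return [p for p in all_paths if p not in done]


def run_per_crawl(work, extract_one, max_workers=8):
    """Run one extractor per crawl in parallel threads.

    `work` maps crawl name -> list of pending WARC paths.
    `extract_one(path)` is a hypothetical callable returning a warcinfo id.
    Returns crawl name -> list of (warcinfo_id, warc_filename) rows.
    """
    def extract_crawl(crawl):
        return [(extract_one(path), path) for path in work[crawl]]

    crawls = list(work)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `crawls`.
        return dict(zip(crawls, pool.map(extract_crawl, crawls)))
```

Since each crawl finishes at its own pace, the total wall-clock time of such a run is set by the slowest crawl.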

To add a single new crawl, edit the Makefile to change the CRAWL
variable, then run:

```
make one-paths
make one-warcinfo
make parquet
make test
```

## Install

If you are happy with the result, copy the index into place:

```
make install
```
