{"id":28626113,"url":"https://github.com/commoncrawl/cc-warcinfo-index-builder","last_synced_at":"2025-06-12T08:40:56.638Z","repository":{"id":295741553,"uuid":"991083478","full_name":"commoncrawl/cc-warcinfo-index-builder","owner":"commoncrawl","description":"Code to build an index that maps warcinfo-id to crawl / warc","archived":false,"fork":false,"pushed_at":"2025-05-27T05:48:29.000Z","size":7,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-05-27T06:35:44.919Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/commoncrawl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-27T05:25:13.000Z","updated_at":"2025-05-27T05:48:33.000Z","dependencies_parsed_at":"2025-05-27T06:35:46.302Z","dependency_job_id":"58553918-b1a4-4911-97a9-4c265460a4ae","html_url":"https://github.com/commoncrawl/cc-warcinfo-index-builder","commit_stats":null,"previous_names":["commoncrawl/cc-warcinfo-index-builder"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/commoncrawl/cc-warcinfo-index-builder","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-warcinfo-index-builder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-warcinfo-index-builder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-warcinfo-index-builder/releases","manif
ests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-warcinfo-index-builder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/commoncrawl","download_url":"https://codeload.github.com/commoncrawl/cc-warcinfo-index-builder/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-warcinfo-index-builder/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259432227,"owners_count":22856704,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-12T08:40:55.970Z","updated_at":"2025-06-12T08:40:56.629Z","avatar_url":"https://github.com/commoncrawl.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# generate-warcinfo-index\n\nFor each crawl, generate parquet which has the following fields:\n\n- warcinfo_id\n- warc_filename\n\nThe `make all-warcinfo` step runs one extractor per crawl. On the\nfirst run, the first crawl extraction finished in 1h 35m and the last\nin 6h 56m.\n\nA copy of the actual index can be found on rf:/home/cc-pds/warcinfo-id.parquet\n\n## How to query\n\nLook at the test code, test_pandas.py and test_duck.py\n\n## Updating the index\n\nThe code uses smart_open() to read the initial part of every warc, extracting\nthe first record, which should be the warcinfo record.\n\nThe code is smart enough to not re-download anything, and runs in\nparallel for every crawl. 
Each extractor needs only about 3% of a core, but\nnetwork latency can stretch a single crawl to as long as 7 hours.\nWhen many crawls run in parallel, the slowest one can take much longer\nthan the fastest.\n\n```\nmake collinfo\nmake all-crawls\nmake all-warcinfo\nmake parquet\nmake test\n```\n\nTo add a single new crawl, set the CRAWL variable in the Makefile,\nthen run:\n\n```\nmake one-paths\nmake one-warcinfo\nmake parquet\nmake test\n```\n\n## Install\n\nOnce you are satisfied with the result, copy it into place:\n\n```\nmake install\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcc-warcinfo-index-builder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcommoncrawl%2Fcc-warcinfo-index-builder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcc-warcinfo-index-builder/lists"}