{"id":50961931,"url":"https://github.com/commoncrawl/crawl-openathena","last_synced_at":"2026-06-18T15:01:37.319Z","repository":{"id":360645181,"uuid":"1223944143","full_name":"commoncrawl/crawl-openathena","owner":"commoncrawl","description":null,"archived":false,"fork":false,"pushed_at":"2026-06-15T09:58:26.000Z","size":3220,"stargazers_count":4,"open_issues_count":10,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-06-15T11:37:09.276Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/commoncrawl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-28T20:07:41.000Z","updated_at":"2026-06-15T09:58:31.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/commoncrawl/crawl-openathena","commit_stats":null,"previous_names":["commoncrawl/crawl-openathena"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/commoncrawl/crawl-openathena","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcrawl-openathena","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcrawl-openathena/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcrawl-openathena/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcra
wl-openathena/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/commoncrawl","download_url":"https://codeload.github.com/commoncrawl/crawl-openathena/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcrawl-openathena/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34495380,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-18T02:00:06.871Z","response_time":128,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-18T15:01:35.114Z","updated_at":"2026-06-18T15:01:36.249Z","avatar_url":"https://github.com/commoncrawl.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# crawl-openathena\n\nThis repo contains the project plan and issues related to steering\nCommon Crawl's crawl using quality signals. 
The crawl is currently\nmostly steered using search-engine-style ranking.\n\nThis project is currently in a pilot phase.\n\n## Evaluations\n\nThis crawl experiment was evaluated using the Jupyter notebooks in the `notebooks` folder to answer research questions including the following:\n\n- [Are the science classifier scores statistically significantly different in the focus crawl compared to the baseline?](notebooks/compare_multi_label_scores.ipynb)\n- [What is the impact of the focus crawl if we consider both the science and the quality classifier?](notebooks/compare_multi_label_scores.ipynb)\n- [Do classifier scores drift over the lifetime of the focus crawl?](notebooks/compare_classifier_scores_segment_drift.ipynb)\n- [Does the MIME-type composition of fetched records drift across the lifetime of the focus crawl?](notebooks/compare_mime_type_drift.ipynb)\n- [What is the impact of the increased content limit on truncation? (main crawl 5 MB vs focus 25 MB)](notebooks/analyze_truncated_content.ipynb)\n- [i18n report for SUPPLEMENTAL-2026-22](notebooks/CC-SUPPLEMENTAL-2026-22_cc-i18n_report.html) (21 MB HTML file)\n- [What fraction of all known URLs did we crawl (fetched vs un-fetched ratio)?](notebooks/overlap-of-focus-vs-main-archive.md)\n- [Bot-blocking in the Open Athena pilot crawl](notebooks/OPEN-ATHENA-PILOT.md)\n\nOther analyses (e.g., [URL overlap](https://docs.google.com/spreadsheets/d/1Cx_H8cXh9M_TMBR4rWY7QUSx9jAn1xXcGgce3wmz2Rs/edit?usp=sharing) or [token count estimation](https://docs.google.com/spreadsheets/d/12wORJBUZLFQZKeKq6ymqrJcb3bjPgZxl-jkfdq1llUQ/edit?usp=sharing)) can be found in the Google Drive.\n\n## Install\n\n```bash\nuv sync\n```\n\nThe base install supports `ccoa classify-warc`. 
The Jupyter notebooks in the `notebooks` folder\nneed an extra:\n\n```bash\nuv sync --extra notebooks\n```\n\n`ccoa tokenize` needs the HuggingFace `transformers` stack:\n\n```bash\nuv sync --extra tokenize\n```\n\n## CLI\n\n```bash\nuv run ccoa --help\n```\n\n### Classify WARC\n\n`ccoa classify-warc` streams WARC files from S3 (or any fsspec URL),\nextracts plain text from each response record with trafilatura, and\napplies one or more HuggingFace-hosted fasttext classifiers in a single\npass. Per-record output is a CSV with one `score_\u003clabel\u003e` column per\nrequested label, between `URL` and the `warc_filename`/`warc_record_index`\ntail:\n\n```\nURL,score_\u003clabel_1\u003e,...,score_\u003clabel_N\u003e,warc_filename,warc_record_index\n```\n\nA per-column score-distribution summary is logged at the end and written\nto a `\u003coutput\u003e.summary.csv` file.\n\n```bash\nuv run ccoa classify-warc \\\n  --warc-paths 's3://commoncrawl/crawl-data/CC-MAIN-2025-51/segments/1764871645602.73/warc/*.warc.gz' \\\n  --shuffle-files --seed 42 --files-limit 8 \\\n  --records-per-file-limit 50 \\\n  --skip-homepages \\\n  --workers 4 \\\n  --output data/classified.csv\n```\n\nWhen a `--warc-paths` value contains glob characters (`*`, `?`, `[`) it\nis expanded via fsspec; matches are de-duplicated and sorted, then\noptionally shuffled (`--shuffle-files`, seeded by `--seed`) and truncated\nto `--files-limit`.\nQuote the glob pattern in the shell to prevent local expansion.\n`--records-limit` caps the total response records across all selected\nfiles; `--records-per-file-limit` caps the records taken from each\nindividual file. Both default to `0` (unlimited).\n`--skip-homepages` drops site-root URLs (empty/root path, no query, no\nfragment) before extraction — useful when the classifier is meant to\nscore actual content pages, not link hubs.\n`--workers N` (default `1`) processes that many WARC files concurrently;\nCSV row order stays deterministic regardless of worker count. 
Combining\n`--workers \u003e 1` with `--records-limit` is rejected (the global cap\ncan't be enforced deterministically across parallel files); use\n`--records-per-file-limit` instead.\n\n`--workers-mode` picks the parallelism strategy when `--workers \u003e 1`:\n`thread` (default) shares one loaded model behind a lock — cheap, but\ncalls trafilatura/lxml concurrently and has been observed to hit glibc\nheap-corruption aborts (`corrupted size vs. prev_size`) on adversarial\nHTML. `process` loads a separate model per worker process (~4 GB extra\nRAM each) and fully isolates lxml + fasttext C state — pick this if\nthread mode crashes mid-run.\n\nWhen `--output` is a file path (not `-`) the command also writes a\nsidecar **summary** to `\u003coutput\u003e.summary.\u003cext\u003e` (e.g. `foo.csv` →\n`foo.summary.csv`). It is a two-column `key,value` CSV containing the\nexact CLI args, resolved input count, record counters, score stats\n(min/max/mean/median/percentiles), wall-clock + per-step timings, and\nstart/finish timestamps — enough to reproduce the run. To avoid\nclobbering past results, the command fails fast with a non-zero exit\nif either the output **or** the summary file already exists.\n\nThe Common Crawl bucket no longer permits anonymous reads. The command\nuses the default AWS credential chain (env / `~/.aws/credentials` /\ninstance profile) — any valid IAM identity works; the bucket owner pays\nfor requests (it is *not* Requester Pays). Alternatively, use the public\nHTTPS gateway URL\n(`https://data.commoncrawl.org/...`) with no credentials.\n\nThe default classifier is\n[`ibm-granite/GneissWeb.Sci_classifier`](https://huggingface.co/ibm-granite/GneissWeb.Sci_classifier).\nWithout `--labels` it emits both of the model's labels —\n`score___label__science` and `score___label__cc`, which sum to 1.0 per\nrecord. The first run downloads the ~4 GB model into the HuggingFace\ncache. 
Override with `--model-repo`, `--model-file`, and `--labels`.\n\n`--model-repo` and `--model-file` are list-valued and zipped positionally,\nso you can score against multiple classifiers in one pass:\n\n```bash\nuv run ccoa classify-warc \\\n  --warc-paths 's3://commoncrawl/.../*.warc.gz' \\\n  --model-repo ibm-granite/GneissWeb.Sci_classifier ibm-granite/GneissWeb.Quality_annotator \\\n  --model-file fasttext_science.bin \u003cquality_model_filename.bin\u003e \\\n  --output data/classified.csv\n```\n\n`--labels` is also list-valued (one entry per model). Each entry is a\ncomma-separated list of labels (`\"__label__science,__label__cc\"`) or the\nliteral `*` to use all of that model's labels (the default when `--labels`\nis omitted). Output columns are emitted in the order: models in CLI order,\nlabels in the order given (or model-internal order for `*`).\n\nColumn naming depends on whether the run has one model or many:\n\n- **Single model**: `score_\u003clabel\u003e` (e.g. `score___label__science`).\n- **Multiple models**: `score_m\u003cidx\u003e_\u003clabel\u003e`, where `\u003cidx\u003e` is the 0-based\n  CLI position of the model (e.g. `score_m0___label__science`,\n  `score_m1___label__hq`). This namespacing means two models can share a\n  label name — Sci_classifier and Quality_annotator both emit\n  `__label__cc` — without colliding.\n\n`--output` accepts `-` for stdout, any local path, or any fsspec URL —\nincluding `s3://bucket/key.csv`. S3 outputs use the same `--anonymous-s3`\n/ `--s3-requester-pays` options as inputs.\n\n### Text extraction cache\n\nTrafilatura is by far the most expensive step in the pipeline. 
When the\nsame WARCs are reprocessed (different model, different label, parameter\nsweeps, retries), pass `--cache-dir` to skip re-extraction:\n\n```bash\nuv run ccoa classify-warc \\\n  --warc-paths s3://commoncrawl/crawl-data/CC-MAIN-2025-51/segments/.../foo.warc.gz \\\n  --limit 100 \\\n  --cache-dir s3://my-bucket/ccoa-cache/ \\\n  --output data/classified.csv\n```\n\nOne gzipped JSONL file is written per input WARC, keyed by the 0-based\nordinal of the response record. Empty extractions are cached too\n(negative caching — avoids re-running trafilatura on junk HTML).\n`--cache-dir` may be a local path or any fsspec URI; S3 cache dirs honor\nthe same `--anonymous-s3` / `--s3-requester-pays` flags as inputs and\noutputs. Input URIs are mirrored under the cache dir by scheme — e.g.\n`s3://commoncrawl/.../foo.warc.gz` becomes\n`\u003ccache-dir\u003e/s3/commoncrawl/.../foo.warc.gz.jsonl.gz`, so a single cache\ndir can safely hold caches for many sources.\n\n### Resuming an interrupted run\n\nIf a run crashes or is killed partway through, pass the partial output to\n`--resume-from-output` on the next invocation to skip records already\nclassified:\n\n```bash\nuv run ccoa classify-warc \\\n  --warc-paths 's3://commoncrawl/crawl-data/CC-MAIN-2025-51/segments/.../*.warc.gz' \\\n  --records-per-file-limit 1000 \\\n  --resume-from-output data/classified.csv \\\n  --output data/classified__resume-2.csv\n```\n\nThe resume CSV's header must match the new run's output schema\n**exactly** — same `score_\u003clabel\u003e` columns in the same order, between\nthe leading `URL` and the trailing `warc_filename`/`warc_record_index`.\nAny drift (reorder, missing, extra) is rejected fast with a structured\ndiff so a concatenation (drop the second header) yields a well-formed\nCSV. 
Records whose `(warc_filename, record_index)` pair already appears in\nthe resume CSV are skipped on the new run; the new `--output` contains\nonly the missing rows.\n\nWith `--records-per-file-limit N` the limit is interpreted as the\n**target total** per file (resumed + new). Files already at the target\nare skipped without opening the input stream; for files below the\ntarget, only `N − resumed` more records are processed. To process an\nadditional `M` records on top of a prior run, set the limit to\n`prior_limit + M`.\n\n`--resume-from-output` is also useful with `--workers-mode process`:\nwhen a worker dies on adversarial HTML the pool drops the suspect file\nand continues; a follow-up resume run will retry the dropped files.\n\n### Tokenize\n\n`ccoa tokenize` reads the per-WARC text-extraction cache produced by\n`ccoa classify-warc --cache-dir \u003curi\u003e`, tokenizes each record with a\nfast HuggingFace tokenizer, and writes a per-record parquet:\n\n```\ncache_path: string, record_index: int32, n_tokens: int32, token_ids: list\u003cint32\u003e\n```\n\nIt also writes a sidecar `\u003coutput\u003e.summary.csv` with run metadata and a\ntoken-count distribution (count/min/max/mean/median/p10..p99/total)\nmirroring the `classify-warc` summary shape.\n\n```bash\nuv sync --extra tokenize\nexport HF_TOKEN=\u003cyour token with the model's license accepted\u003e\nuv run ccoa tokenize \\\n  --cache-paths 's3://commoncrawl-dev/cc-focus-tools/warc-text-extract-cache/s3/commoncrawl/crawl-data/CC-MAIN-2025-51/segments/*/warc/*.warc.gz.jsonl.gz' \\\n  --files-limit 1 --records-per-file-limit 100 \\\n  --workers 4 --progress-every 25 \\\n  --output /tmp/tokens.parquet\n```\n\n`--cache-paths` accepts one or more URIs or globs; matches must be\ngzipped-JSONL cache files (`{\"index\": N, \"text\": \"...\"}` per line) as\nproduced by `classify-warc --cache-dir`. 
Each cache file maps 1:1 to a\nsource WARC and is the unit of work for `--workers` parallelism.\n\n`--tokenizer` defaults to `meta-llama/Llama-2-7b`, which is gated —\naccept the license on HuggingFace, then set `HF_TOKEN` (or run\n`huggingface-cli login`). Override with any HuggingFace repo id; the\ntokenizer must resolve to a fast (Rust) variant for thread-mode safety.\n\n`--workers-mode thread` (default) shares one tokenizer instance across\nworker threads — HF fast tokenizers release the GIL and are\nthread-safe. `--workers-mode process` loads a separate tokenizer per\nworker process; pick it if you must use a slow tokenizer.\n\n`--batch-size N` (default 64) controls how many texts are handed to the\ntokenizer per call (fast tokenizers vectorize internally — bigger is\nfaster up to a point). `--progress-every N` logs a per-file heartbeat\nevery N tokenized records; per-file completion lines always log\n`progress — files=K/M elapsed=... eta=~...` like `classify-warc`.\n\n`--output` accepts a local path or any fsspec URI (e.g.\n`s3://bucket/key.parquet`). To overwrite an existing output, pass\n`--overwrite`.\n\nThe cache JSONL stores `index` + `text` only — no URL. The parquet's\n`cache_path` is the source JSONL URI; downstream code can reverse it to\na WARC URI if the `--cache-dir` prefix is known.\n\n## Development\n\n```bash\nmake test       # pytest\nmake lint       # ruff check\nmake format     # ruff format\nmake check      # lint + format-check + test\n```\n\n## License\n\nApache 2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcrawl-openathena","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcommoncrawl%2Fcrawl-openathena","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcrawl-openathena/lists"}