{"id":28626107,"url":"https://github.com/commoncrawl/cc-host-index","last_synced_at":"2025-06-12T08:40:56.337Z","repository":{"id":288135489,"uuid":"966199980","full_name":"commoncrawl/cc-host-index","owner":"commoncrawl","description":"Tools for working with the host index","archived":false,"fork":false,"pushed_at":"2025-06-04T23:29:26.000Z","size":62,"stargazers_count":4,"open_issues_count":1,"forks_count":2,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-06-05T04:56:20.516Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/commoncrawl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-14T14:55:51.000Z","updated_at":"2025-06-04T23:29:29.000Z","dependencies_parsed_at":"2025-04-15T18:48:26.365Z","dependency_job_id":"5c7e4ab7-a63a-46e4-ad0d-fab4d3e6d538","html_url":"https://github.com/commoncrawl/cc-host-index","commit_stats":null,"previous_names":["commoncrawl/cc-host-index"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/commoncrawl/cc-host-index","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-host-index","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-host-index/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-host-index/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-host-index/manifests","owner_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/commoncrawl","download_url":"https://codeload.github.com/commoncrawl/cc-host-index/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-host-index/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259432224,"owners_count":22856703,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-12T08:40:55.515Z","updated_at":"2025-06-12T08:40:56.331Z","avatar_url":"https://github.com/commoncrawl.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# cc-host-index\n\nThis repo contains examples using Common Crawl's host index. 
The host\nindex for a crawl is a table which contains a single row for each\nweb host, combining aggregated information from the columnar index,\nthe web graph, and our raw crawler logs.\n\n*This document discusses the testing v2 version of this dataset -- it\nwill change before its final release.*\n\n## Example questions this index can answer\n\n- What's our history of crawling a particular website, or group of websites?\n- What popular websites have a lot of non-English content?\n- What popular websites seem to have so little content that we might need to execute javascript to crawl them?\n\n## Example questions that we'll use to improve our crawl\n\n- What's the full list of websites where more than half of the webpages are primarily not English?\n- What popular websites end our crawls with most of their crawl budget left uncrawled?\n\n## Example questions that future versions of this host index can answer\n\n- What websites have a lot of content in particular languages?\n- What websites have a lot of content with particular Unicode scripts?\n\n## Highlights of the schema\n\n- there is a Hive-style partition on `crawl`, which is a crawl name like `CC-MAIN-2025-13`.\n- there is one row for every webhost in the web graph, even if we didn't crawl that website in that particular crawl\n- the primary key is `surt_host_name`, which is quirky (commoncrawl.org -\u003e org,commoncrawl)\n- there is also `url_host_tld`, which we recommend that you use whenever possible (\"org\" for commoncrawl.org)\n- there are counts of what we stored in our archive (warc, crawldiagnostics, robots)\n - `fetch_200, fetch_3xx, fetch_4xx, fetch_5xx, fetch_gone, fetch_notModified, fetch_other, fetch_redirPerm, fetch_redirTemp`\n - `robots_200, robots_3xx, robots_4xx, robots_5xx, robots_gone, robots_notModified, robots_other, robots_redirPerm, robots_redirTemp`\n- there is ranking information from the web graph: harmonic centrality, page rank, and both normalized to a 0-10 scale\n  - `hcrank`, 
`prank`, `hcrank10`, `prank10`\n- there is a language summary (for now, just the count of languages other than English (LOTE))\n  - `fetch_200_lote, fetch_200_lote_pct`\n- for a subset of the numbers, there is a `foo_pct`, which can help you avoid doing math in SQL. It is an integer 0-100.\n- there are raw numbers from our crawler logs, which also reveal the crawl budget and if we exhausted it\n  - `nutch_fetched, nutch_gone, nutch_notModified, nutch_numRecords, nutch_redirPerm, nutch_redirTemp, nutch_unfetched`\n  - `nutch_fetched_pct, nutch_gone_pct, nutch_notModified_pct, nutch_redirPerm_pct, nutch_redirTemp_pct, nutch_unfetched_pct`\n- there is a size summary (average and median size (compressed))\n  - `warc_record_length_median, warc_record_length_av` (will be renamed to _avg in v3)\n- the full schema is at `athena_schema.v2.sql`\n\n## Examples\n\nUS Federal government websites in the *.gov domain (about 1,400 domains, y-axis scale is millions):\n\n![current-federal.txt_sum.png](https://commoncrawl.github.io/cc-host-index-media/current-federal.txt_sum.png)\n\n[See all graphs from this dataset](https://commoncrawl.github.io/cc-host-index-media/current-federal.txt.html)\n\ncommoncrawl.org fetch. You can see that we revamped our website in CC-2023-14, which caused a lot of\npermanent redirects to be crawled in the next few crawls:\n\n![commoncrawl.org_fetch.png](https://commoncrawl.github.io/cc-host-index-media/commoncrawl.org_fetch.png)\n\n[See all graphs from this dataset](https://commoncrawl.github.io/cc-host-index-media/commoncrawl.org.html)\n\n## Setup\n\nThe host index can either be used in place at AWS, or you can download\nit and use it from local disk. 
The size is about 7 gigabytes per\ncrawl, and the most recent 24 crawls are currently indexed\n(testing-v2).\n\n### Setup -- local development environment\n\n```\npip install -r requirements.txt\n```\n\n### Setup -- duckdb from outside AWS\n\n```\nwget https://data.commoncrawl.org/projects/host-index-testing/v2.paths.gz\nexport HOST_INDEX=v2.paths.gz\n```\n\n### Setup -- duckdb from inside AWS -- us-east-1\n\n```\naws s3 cp s3://commoncrawl/projects/host-index-testing/v2.paths.gz .\nexport HOST_INDEX=v2.paths.gz\nexport HOST_INDEX_BUCKET=s3://commoncrawl/\n```\n\n### Setup -- duckdb with local files\n\nIf you have enough local disk space (180 gigabytes), install\n[cc-downloader](https://github.com/commoncrawl/cc-downloader/)\nand then:\n\n```\nwget https://data.commoncrawl.org/projects/host-index-testing/v2.paths.gz\ncc-downloader download v2.paths.gz .\n```\n\nWherever you move the downloaded files, point at the top directory:\n\n```\nexport HOST_INDEX=/home/cc-pds/commoncrawl/projects/host-index-testing/v2/\n```\n\n### Setup -- AWS Athena -- us-east-1\n\n```\nCREATE DATABASE cchost_index_testing_v2\n```\n\nSelect the new database in the GUI.\n\nPaste the contents of `athena_schema.v2.sql` into a query and run it.\n\nThen,\n\n```\nMSCK REPAIR TABLE host_index_testing_v2\n```\n\nNow check that it's working:\n\n```\nSELECT COUNT(*) FROM cchost_index_testing_v2\n```\n\n## Python code examples\n\nThe included script `graph.py` knows how to make csvs, png images, and webpages containing\nthese images. It runs in 3 styles:\n\n- one web host: `python ./graph.py example.com`\n- wildcard subdomains: `python ./graph.py *.example.com`\n- a list of hosts: `python ./graph.py -f list_of_hosts.txt`\n\nYes, these commands take a while to run, because Parquet is a bad\nchoice to look up a single row. 
On my machine the example.com graph\ntakes 5.5 minutes of CPU time.\n\nThis repo also has `duckdb-to-csv.py`, which you can use to run a\nsingle SQL command and get csv output.\n\n## Example SQL queries\n\nHost fetches, non-English count, and ranking. Note the use of `url_host_tld` ... that is recommended\nto make the SQL query optimizer's life easier.\n\n```\nSELECT\n  crawl, fetch_200, fetch_200_lote, prank10, hcrank10\nFROM cchost_index_testing_v2\nWHERE surt_host_name = 'org,commoncrawl'\n  AND url_host_tld = 'org'\nORDER BY crawl ASC\n```\n\nCounts of web captures. This includes captures that you\nwill find in the crawldiagnostics warcs.\n\n```\nSELECT\n  crawl, fetch_200, fetch_gone, fetch_redirPerm, fetch_redirTemp, fetch_notModified, fetch_3xx, fetch_4xx, fetch_5xx, fetch_other\nFROM cchost_index_testing_v2\nWHERE surt_host_name = 'org,commoncrawl'\n  AND url_host_tld = 'org'\nORDER BY crawl ASC\n```\n\nPer-host page size stats, average and median.\n\n```\nSELECT\n  crawl, warc_record_length_av, warc_record_length_median\nFROM cchost_index_testing_v2\nWHERE surt_host_name = 'org,commoncrawl'\n  AND url_host_tld = 'org'\nORDER BY crawl ASC\n```\n\nRaw crawler logs counts. 
`nutch_numrecords` gives an idea of what the crawl budget was,\nand if it was exhausted.\n\n```\nSELECT\n  crawl, nutch_numRecords, nutch_fetched, nutch_unfetched, nutch_gone, nutch_redirTemp, nutch_redirPerm, nutch_notModified\nFROM cchost_index_testing_v2\nWHERE surt_host_name = 'org,commoncrawl'\n  AND url_host_tld = 'org'\nORDER BY crawl ASC\n```\n\nRobots.txt fetch details.\n\n```\nSELECT\n  crawl, robots_200, robots_gone, robots_redirPerm, robots_redirTemp, robots_notModified, robots_3xx, robots_4xx, robots_5xx, robots_other\nFROM cchost_index_testing_v2\nWHERE surt_host_name = 'org,commoncrawl'\n  AND url_host_tld = 'org'\nORDER BY crawl ASC\n```\n\nTop 10 Vatican websites from crawl CC-MAIN-2025-13 that are \u003e 50% languages other than English (LOTE).\n\n```\nSELECT\n  crawl, surt_host_name, hcrank10, fetch_200_lote_pct, fetch_200_lote\nFROM cchost_index_testing_v2\nWHERE crawl = 'CC-MAIN-2025-13'\n  AND url_host_tld = 'va'\n  AND fetch_200_lote_pct \u003e 50\nORDER BY hcrank10 DESC\nLIMIT 10\n```\n\n| # | crawl | surt\\_host\\_name | hcrank10 | fetch\\_200\\_lote\\_pct | fetch\\_200\\_lote |\n| --- |  --- |  --- |  --- |  --- |  --- |\n| 1 | CC-MAIN-2025-13 | va,vaticannews | 5.472 | 89 | 18872 |\n| 2 | CC-MAIN-2025-13 | va,vatican | 5.164 | 73 | 14549 |\n| 3 | CC-MAIN-2025-13 | va,museivaticani | 4.826 | 77 | 568 |\n| 4 | CC-MAIN-2025-13 | va,vatican,press | 4.821 | 67 | 3804 |\n| 5 | CC-MAIN-2025-13 | va,clerus | 4.813 | 79 | 68 |\n| 6 | CC-MAIN-2025-13 | va,osservatoreromano | 4.783 | 98 | 3305 |\n| 7 | CC-MAIN-2025-13 | va,vaticanstate | 4.738 | 73 | 509 |\n| 8 | CC-MAIN-2025-13 | va,migrants-refugees | 4.732 | 67 | 2055 |\n| 9 | CC-MAIN-2025-13 | va,iubilaeum2025 | 4.73 | 85 | 672 |\n| 10 | CC-MAIN-2025-13 | va,cultura | 4.724 | 67 | 80 |\n\nTop 10 websites from crawl CC-MAIN-2025-13 that are \u003e 90% languages other than English (LOTE).\n\n```\nSELECT\n  crawl, surt_host_name, hcrank10, fetch_200_lote_pct, fetch_200_lote\nFROM 
cchost_index_testing_v2\nWHERE crawl = 'CC-MAIN-2025-13'\n  AND fetch_200_lote_pct \u003e 90\nORDER BY hcrank10 DESC\nLIMIT 10\n```\n\n| # | crawl | surt\\_host\\_name | hcrank10 | fetch\\_200\\_lote\\_pct | fetch\\_200\\_lote |\n| --- |  --- |  --- |  --- |  --- |  --- |\n| 1 | CC-MAIN-2025-13 | org,wikipedia,fr | 5.885 | 99 | 55334 |\n| 2 | CC-MAIN-2025-13 | org,wikipedia,es | 5.631 | 100 | 48527 |\n| 3 | CC-MAIN-2025-13 | com,chrome,developer | 5.62 | 92 | 29298 |\n| 4 | CC-MAIN-2025-13 | ar,gob,argentina | 5.613 | 95 | 16580 |\n| 5 | CC-MAIN-2025-13 | fr,ebay | 5.579 | 100 | 24633 |\n| 6 | CC-MAIN-2025-13 | org,wikipedia,ja | 5.55 | 100 | 49008 |\n| 7 | CC-MAIN-2025-13 | ru,gosuslugi | 5.535 | 100 | 1560 |\n| 8 | CC-MAIN-2025-13 | org,wikipedia,de | 5.508 | 100 | 48223 |\n| 9 | CC-MAIN-2025-13 | com,acidholic | 5.477 | 100 | 356 |\n| 10 | CC-MAIN-2025-13 | ph,telegra | 5.455 | 92 | 57153 |\n\n## Known bugs\n\n- Some of the partitions have a different schema from others, so you will get errors for some of the columns in some of\nthe crawls. We recommend that you avoid using those crawls, and only use the columns you need.\n- When the S3 bucket is under heavy use, AWS Athena will sometimes throw 503 errors. 
We have yet to figure out how to increase the retry limit.\n- duckdb's https retry behavior got much better in version 1.3.0, so update\n- Hint: https://status.commoncrawl.org/ has graphs of S3 performance for the last day, week, and month.\n- The sort order is a bit messed up, so database queries take more time than they should.\n\n## Expected changes in test v3\n\n- `warc_record_length_av` will be renamed to `_avg` (that was a typo)\n- more `_pct` columns\n- count truncations: length, time, disconnect, unspecified\n- addition of indegree and outdegree from the web graph\n- improve language details to be more than only LOTE and LOTE\\_pct\n  - `content_language_top`, `content_language_top_pct`\n- add unicode block information, similar to languages\n- `prank10` needs its power law touched up (`hcrank10` might change too)\n- there's a sort problem that .com shards have a smattering of not-.com hosts. This hurts performance.\n- add domain prank/hcrank\n- CI running against S3\n- `robots_digest_count_distinct`\n- `robots_digest` (if there is exactly 1 robots digest)\n- Summarize `fetch_redirect`: same surt, same surt host, other.\n\n## Contributing\n\nWe'd love to get testing and code contributions! Here are some clues:\n\n- We'd love to hear if you tried it out, and what your comments are\n- We'd love to have python examples using Athena, similar to duckdb\n- We'd love to have more python examples\n- Please use pyarrow whenever possible\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcc-host-index","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcommoncrawl%2Fcc-host-index","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcc-host-index/lists"}