{"id":47817628,"url":"https://github.com/commoncrawl/eot2020-host-index","last_synced_at":"2026-04-03T18:49:39.181Z","repository":{"id":339834797,"uuid":"1162906987","full_name":"commoncrawl/eot2020-host-index","owner":"commoncrawl","description":"Tools to work with the preliminary End of Term Archive host index","archived":false,"fork":false,"pushed_at":"2026-03-02T10:18:02.000Z","size":27,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-02T14:11:41.606Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/commoncrawl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-20T21:01:01.000Z","updated_at":"2026-03-02T10:18:07.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/commoncrawl/eot2020-host-index","commit_stats":null,"previous_names":["commoncrawl/eot2020-host-index"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/commoncrawl/eot2020-host-index","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Feot2020-host-index","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Feot2020-host-index/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Feot2020-host-index/releases","manifests_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Feot2020-host-index/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/commoncrawl","download_url":"https://codeload.github.com/commoncrawl/eot2020-host-index/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Feot2020-host-index/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31370218,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-03T17:53:18.093Z","status":"ssl_error","status_checked_at":"2026-04-03T17:53:17.617Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-03T18:49:38.589Z","updated_at":"2026-04-03T18:49:39.168Z","avatar_url":"https://github.com/commoncrawl.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# eot2020-host-index\n\nPRELIMINARY VERSION\n\nThis readme describes a host index database that aggregates\ninformation about the contents in the End of Term Archive.\n\nhttps://eotarchive.org/\n\nIt only has the year 2020 for now. 
The parquet file is stored on S3 but also accessible via HTTP.\n\n## Install the duckdb cli\n\nhttps://duckdb.org/install/\n\nand the python client library\n\n```bash\npip install duckdb\n```\n\n## Schema\n\n```bash\nduckdb -c \"DESCRIBE FROM 'https://data.commoncrawl.org/projects/eot2020-host-testing/EOT-2020-with-ranks-v5.parquet'\"\n```\n\n\u003cdetails\u003e\u003csummary\u003eclick to see output\u003c/summary\u003e\n\n```\n┌────────────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐\n│        column_name         │ column_type │  null   │   key   │ default │  extra  │\n│          varchar           │   varchar   │ varchar │ varchar │ varchar │ varchar │\n├────────────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤\n│ surt_host_name             │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │\n│ url_host_name_reversed     │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │\n│ fetch_200                  │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ url_host_name              │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │\n│ url_host_tld               │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │\n│ url_host_registered_domain │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │\n│ warc_record_length_av      │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ warc_record_length_median  │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ fetch_200_lote             │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ fetch_200_lote_pct         │ TINYINT     │ YES     │ NULL    │ NULL    │ NULL    │\n│ fetch_3xx                  │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ fetch_4xx                  │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ fetch_5xx                  │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ fetch_gone                 │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ fetch_notModified  
        │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ fetch_other                │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ fetch_redirPerm            │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ fetch_redirTemp            │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ robots_200                 │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ robots_3xx                 │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ robots_4xx                 │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ robots_5xx                 │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ robots_gone                │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ robots_notModified         │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ robots_other               │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ robots_redirPerm           │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ robots_redirTemp           │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ is_us_federal              │ BOOLEAN     │ YES     │ NULL    │ NULL    │ NULL    │\n│ hcrank_pos                 │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ hcrank_raw                 │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │\n│ hcrank100s                 │ INTEGER     │ YES     │ NULL    │ NULL    │ NULL    │\n│ hcrank100p                 │ INTEGER     │ YES     │ NULL    │ NULL    │ NULL    │\n│ prank_pos                  │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │\n│ prank_raw                  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │\n│ prank100s                  │ INTEGER     │ YES     │ NULL    │ NULL    │ NULL    │\n│ prank100p                  │ INTEGER     │ YES     │ NULL    │ NULL    │ NULL    │\n├────────────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤\n│ 36 rows                                  
                              6 columns │\n└──────────────────────────────────────────────────────────────────────────────────┘\n```\n\u003c/details\u003e\n\nThe schema has multiple parts:\n\n### Hostnames\n\n- `url_host_name`, `surt_host_name` and `url_host_name_reversed` are what they say they are\n- `url_host_tld` and `url_host_registered_domain` are useful for wider queries\n- `is_us_federal` is true for hosts that are actual US federal government websites\n\n\u003e [!NOTE]\n\u003e BUG: `is_us_federal` is too broad in the v2 testing database. This bug was resolved in V5.\n\n\n### Crawl Summary\n\n- `fetch_*` shows the count of status codes for this host. `fetch_200`, for example, is the number of successful fetches.\n- `robots_*` does the same for robots.txt.\n- `_lote` is \"Languages Other Than English.\" `fetch_200_lote_pct` is the percentage of `fetch_200` that has a primary language other than English.\n\n\u003e [!NOTE]\n\u003e BUG: In the v2 testing database, all of the `fetch_` and `robots_` should be integers.\n\n### Ranking information\n\nWe use a web graph to compute search engine-style ranks. We have 2\ndifferent algorithms (harmonic centrality and pagerank) and\n(currently) 2 different ways of normalizing these ranks to the range\n0-100. (Eventually we'll choose one of the two.)\n\n- `hcrank_raw`, `prank_raw`, `hcrank_pos`, `prank_pos` are unnormalized, so you should probably ignore them\n- `hcrank100s` and `hcrank100p` are two different 0-100 normalizations of the harmonic centrality rank\n- ditto for `prank100s` and `prank100p`\n\n### Other\n\n- `warc_record_length_av` and `warc_record_length_median` are the average and median size of all of the warc records for this host\n\n## Examples\n\nLet's look at an entire row for **congress.gov**. We'll do it in Python\nusing a helper script `select.py`. This script takes 2 arguments, the\nSELECT and WHERE clauses. 
We'll use some shell variables to reduce\ntyping.\n\nSince the parquet file is only 80 megabytes, we'll download it\n\n```bash\nwget https://data.commoncrawl.org/projects/eot2020-host-testing/EOT-2020-with-ranks-v5.parquet\n```\n\nAnd to save typing:\n```bash\nWHERE=\"surt_host_name = 'gov,congress'\"\n```\n\n### Names\n\n```bash\npython select.py \"url_host_name, surt_host_name, url_host_name_reversed, url_host_tld, url_host_registered_domain, is_us_federal\" \"$WHERE\"\n```\n\n```\n┌──────────────────┬────────────────┬────────────────────────┬──────────────┬────────────────────────────┬───────────────┐\n│  url_host_name   │ surt_host_name │ url_host_name_reversed │ url_host_tld │ url_host_registered_domain │ is_us_federal │\n│     varchar      │    varchar     │        varchar         │   varchar    │          varchar           │    boolean    │\n├──────────────────┼────────────────┼────────────────────────┼──────────────┼────────────────────────────┼───────────────┤\n│ www.congress.gov │ gov,congress   │ gov.congress.www       │ gov          │ congress.gov               │ true          │\n└──────────────────┴────────────────┴────────────────────────┴──────────────┴────────────────────────────┴───────────────┘\n```\n\n### Crawl\n\n```bash\npython ./select.py \"fetch_200, fetch_200_lote, fetch_200_lote_pct, fetch_gone, fetch_notModified\" \"$WHERE\"\n```\n\n```\n┌───────────┬────────────────┬────────────────────┬────────────┬───────────────────┐\n│ fetch_200 │ fetch_200_lote │ fetch_200_lote_pct │ fetch_gone │ fetch_notModified │\n│   int64   │     int64      │        int8        │   int64    │       int64       │\n├───────────┼────────────────┼────────────────────┼────────────┼───────────────────┤\n│   2819681 │            812 │                  0 │      46765 │                 0 │\n└───────────┴────────────────┴────────────────────┴────────────┴───────────────────┘\n```\n\n```bash\npython ./select.py \"fetch_3xx, fetch_4xx, fetch_5xx\" 
\"$WHERE\"\n```\n\n```\n┌───────────┬───────────┬───────────┐\n│ fetch_3xx │ fetch_4xx │ fetch_5xx │\n│   int64   │   int64   │   int64   │\n├───────────┼───────────┼───────────┤\n│         0 │   1933097 │      2414 │\n└───────────┴───────────┴───────────┘\n```\n\n\u003e [!NOTE]\n\u003e That's an alarming 4xx result -- 404 and 410 are gone, these 4xxs might be bot defenses? Spoiler: they're all 400s.\n\n### Robots\n\n```bash\npython ./select.py \"robots_200, robots_gone, robots_notModified\" \"$WHERE\"\n```\n\n```\n┌────────────┬─────────────┬────────────────────┐\n│ robots_200 │ robots_gone │ robots_notModified │\n│   int64    │    int64    │       int64        │\n├────────────┼─────────────┼────────────────────┤\n│     771803 │       46765 │                  0 │\n└────────────┴─────────────┴────────────────────┘\n```\n\n```bash\npython ./select.py \"robots_3xx, robots_4xx, robots_5xx\" \"$WHERE\"\n```\n\n```\n┌────────────┬────────────┬────────────┐\n│ robots_3xx │ robots_4xx │ robots_5xx │\n│   int64    │   int64    │   int64    │\n├────────────┼────────────┼────────────┤\n│          0 │    1933097 │       2414 │\n└────────────┴────────────┴────────────┘\n```\n\n### Ranks\n\n```bash\npython ./select.py \"hcrank100s, hcrank100p, prank100s, prank100p\" \"$WHERE\"\n```\n\n```\n┌────────────┬────────────┬───────────┬───────────┐\n│ hcrank100s │ hcrank100p │ prank100s │ prank100p │\n│   int32    │   int32    │   int32   │   int32   │\n├────────────┼────────────┼───────────┼───────────┤\n│        100 │        100 │       100 │       100 │\n└────────────┴────────────┴───────────┴───────────┘\n```\n\n\u003e [!WARNING]\n\u003e In V2, these rank values were nulls due to a www/not-www issue that was fixed in V3.\n\n\n```bash\npython ./select.py \"hcrank_raw, hcrank_pos, prank_raw, prank_pos\" \"$WHERE\"\n```\n\n```\n┌────────────┬────────────┬───────────────────────┬───────────┐\n│ hcrank_raw │ hcrank_pos │       prank_raw       │ prank_pos │\n│   double   │   int64    │   
     double         │   int64   │\n├────────────┼────────────┼───────────────────────┼───────────┤\n│ 21142028.0 │       1172 │ 5.100846383868066e-06 │      1793 │\n└────────────┴────────────┴───────────────────────┴───────────┘\n```\n\n\n### Subdomains\n\nThis needs a different WHERE clause:\n\n```bash\npython ./select.py \"url_host_name, url_host_name_reversed, is_us_federal, hcrank100s, hcrank100p, prank100s, prank100p\" \"url_host_registered_domain = 'congress.gov'\"\n```\n\n```\nSELECT url_host_name, url_host_name_reversed, is_us_federal, hcrank100s, hcrank100p, prank100s, prank100p FROM eot2020_host WHERE url_host_registered_domain = 'congress.gov'\n┌────────────────────────────┬────────────────────────────┬───────────────┬────────────┬────────────┬───────────┬───────────┐\n│       url_host_name        │   url_host_name_reversed   │ is_us_federal │ hcrank100s │ hcrank100p │ prank100s │ prank100p │\n│          varchar           │          varchar           │    boolean    │   int32    │   int32    │   int32   │   int32   │\n├────────────────────────────┼────────────────────────────┼───────────────┼────────────┼────────────┼───────────┼───────────┤\n│ smon.congress.gov          │ gov.congress.smon          │ true          │         33 │         28 │         0 │         0 │\n│ lda.congress.gov           │ gov.congress.lda           │ true          │         73 │         83 │        96 │       100 │\n│ test.congress.gov          │ gov.congress.test          │ true          │         62 │         69 │         0 │         0 │\n│ www.congress.gov           │ gov.congress.www           │ true          │        100 │        100 │       100 │       100 │\n│ beta.congress.gov          │ gov.congress.beta          │ true          │         98 │        100 │        98 │       100 │\n│ bioguide.congress.gov      │ gov.congress.bioguide      │ true          │         98 │        100 │        98 │       100 │\n│ crsreports.congress.gov    │ gov.congress.crsreports    │ true  
        │         98 │        100 │        98 │       100 │\n│ constitution.congress.gov  │ gov.congress.constitution  │ true          │         97 │        100 │        97 │       100 │\n│ bioguideretro.congress.gov │ gov.congress.bioguideretro │ true          │         97 │        100 │        97 │       100 │\n└────────────────────────────┴────────────────────────────┴───────────────┴────────────┴────────────┴───────────┴───────────┘\n```\n\n\n## Let's ask some questions\n\n### What are the highest ranked federal .gov hosts that we have nothing for?\n\n```bash\npython ./select.py \"url_host_name_reversed, hcrank100s\" \"url_host_tld = 'gov' AND is_us_federal AND fetch_200 = 0 ORDER BY hcrank100s DESC LIMIT 10\"\n```\n\n```\nSELECT url_host_name_reversed, hcrank100s FROM eot2020_host WHERE url_host_tld = 'gov' AND is_us_federal AND fetch_200 = 0 ORDER BY hcrank100s DESC LIMIT 10\n┌────────────────────────┬────────────┐\n│ url_host_name_reversed │ hcrank100s │\n│        varchar         │   int32    │\n├────────────────────────┴────────────┤\n│               0 rows                │\n└─────────────────────────────────────┘\n```\nWell that was boring.\n\n### What hosts have a large fraction of LOTE (languages other than english) pages?\n\n```bash\npython ./select.py \"hcrank100s, url_host_name_reversed, fetch_200, fetch_200_lote_pct\" \"fetch_200_lote_pct \u003e 10 AND url_host_tld = 'gov' AND is_us_federal ORDER BY hcrank100s DESC LIMIT 20\"\n```\n\n```\nSELECT hcrank100s, url_host_name_reversed, fetch_200, fetch_200_lote_pct FROM eot2020_host WHERE fetch_200_lote_pct \u003e 10 AND url_host_tld = 'gov' AND is_us_federal ORDER BY hcrank100s DESC LIMIT 20\n┌────────────┬─────────────────────────┬───────────┬────────────────────┐\n│ hcrank100s │ url_host_name_reversed  │ fetch_200 │ fetch_200_lote_pct │\n│   int32    │         varchar         │   int64   │        int8        │\n├────────────┼─────────────────────────┼───────────┼────────────────────┤\n│        100 │ 
gov.irs                 │    285880 │                 33 │\n│        100 │ gov.usa                 │     10153 │                 12 │\n│        100 │ gov.fema                │     90320 │                 21 │\n│        100 │ gov.medlineplus.www     │     80914 │                 22 │\n│         99 │ gov.uscis               │     30177 │                 13 │\n│         99 │ gov.womenshealth.www    │     10399 │                 14 │\n│         99 │ gov.atf.www             │     46592 │                 11 │\n│         98 │ gov.uscg.navcen         │     58883 │                 15 │\n│         98 │ gov.loc.cdn             │     67138 │                 17 │\n│         98 │ gov.nasa.nascom.sohowww │    258690 │                 15 │\n│         98 │ gov.fec.transition      │     21291 │                 13 │\n│         98 │ gov.usembassy.mx        │     12447 │                 26 │\n│         98 │ gov.hhs.acf.ohs.eclkc   │     93908 │                 21 │\n│         98 │ gov.vaccines            │      1401 │                 12 │\n│         98 │ gov.nasa.gsfc.lambda    │     42493 │                 25 │\n│         98 │ gov.econsumer.www       │      2104 │                 21 │\n│         98 │ gov.nasa.nascom.soho    │    228403 │                 14 │\n│         97 │ gov.usembassy.kr        │      8177 │                 11 │\n│         97 │ gov.america.share.www   │    158923 │                 32 │\n│         97 │ gov.nasa.gsfc.asd       │     37567 │                 13 │\n├────────────┴─────────────────────────┴───────────┴────────────────────┤\n│ 20 rows                                                     4 columns │\n└───────────────────────────────────────────────────────────────────────┘\n```\n\n### What are the top US federal government websites according to harmonic centrality?\n\n```bash\npython ./select.py \"url_host_name, is_us_federal, fetch_200, hcrank_pos, hcrank_raw, hcrank100s\" \"is_us_federal is TRUE ORDER BY hcrank_pos ASC LIMIT 10\"\n```\n\n```\nSELECT 
url_host_name, is_us_federal, fetch_200, hcrank_pos, hcrank_raw, hcrank100s FROM eot2020_host WHERE is_us_federal is TRUE ORDER BY hcrank_pos ASC LIMIT 10\n┌───────────────────────┬───────────────┬───────────┬────────────┬────────────┬────────────┐\n│     url_host_name     │ is_us_federal │ fetch_200 │ hcrank_pos │ hcrank_raw │ hcrank100s │\n│        varchar        │    boolean    │   int64   │   int64    │   double   │   int32    │\n├───────────────────────┼───────────────┼───────────┼────────────┼────────────┼────────────┤\n│ www.nasa.gov          │ true          │     26809 │        128 │ 23268830.0 │        100 │\n│ cdc.gov               │ true          │    777329 │        140 │ 23232876.0 │        100 │\n│ www.ncbi.nlm.nih.gov  │ true          │   3641163 │        178 │ 23025570.0 │        100 │\n│ www.loc.gov           │ true          │   1746500 │        275 │ 22444530.0 │        100 │\n│ www.whitehouse.gov    │ true          │     82638 │        318 │ 22309318.0 │        100 │\n│ www.privacyshield.gov │ true          │      2712 │        368 │ 22142958.0 │        100 │\n│ www.fda.gov           │ true          │     15471 │        383 │ 22108202.0 │        100 │\n│ ftc.gov               │ true          │    281639 │        526 │ 21787366.0 │        100 │\n│ justice.gov           │ true          │   2324332 │        550 │ 21741378.0 │        100 │\n│ www.nps.gov           │ true          │    318360 │        572 │ 21716190.0 │        100 │\n├───────────────────────┴───────────────┴───────────┴────────────┴────────────┴────────────┤\n│ 10 rows                                                                        6 columns │\n└──────────────────────────────────────────────────────────────────────────────────────────┘\n```\n\n\n## Let's also look at the url index\n\n[The url index schema is described elsewhere.](https://commoncrawl.org/columnar-index)\nWe won't download the entire index like we did before -- the helper\nprogram `url-select.py` tells duckdb to 
directly access the parquet\nfiles from s3.\n\n### What are those 4xxs for congress.gov?\n\nFirst let's look at all non-200s:\n\n```bash\npython ./url-select.py \"url, fetch_status\" \"url_host_tld = 'gov' AND url_host_registered_domain = 'congress.gov' AND fetch_status \u003c\u003e 200 LIMIT 10\"\n```\n\n```\nSELECT url, fetch_status FROM eot2020_url WHERE url_host_tld = 'gov' AND url_host_registered_domain = 'congress.gov' AND fetch_status \u003c\u003e 200 LIMIT 10\n┌───────────────────────────┬──────────────┐\n│            url            │ fetch_status │\n│          varchar          │    int16     │\n├───────────────────────────┼──────────────┤\n│ http://www.congress.gov// │          301 │\n│ http://www.congress.gov/  │          301 │\n│ https://www.congress.gov/ │          400 │\n│ http://congress.gov/      │          301 │\n│ http://www.congress.gov/  │          301 │\n│ https://congress.gov/     │          302 │\n│ http://congress.gov/      │          301 │\n│ http://www.congress.gov/  │          301 │\n│ https://congress.gov/     │          302 │\n│ http://congress.gov/      │          301 │\n├───────────────────────────┴──────────────┤\n│ 10 rows                        2 columns │\n└──────────────────────────────────────────┘\n```\n\nOK but what about 4xx/5xx?\n\n```bash\npython ./url-select.py \"url, fetch_status\" \"url_host_tld = 'gov' AND url_host_registered_domain = 'congress.gov' AND fetch_status \u003e= 400 LIMIT 10\"\n```\n\n```\nSELECT url, fetch_status FROM eot2020_url WHERE url_host_tld = 'gov' AND url_host_registered_domain = 'congress.gov' AND fetch_status \u003e= 400 LIMIT 10\n┌──────────────────────────────────────────────────────────────────────┬──────────────┐\n│                                 url                                  │ fetch_status │\n│                               varchar                                │    int16     │\n├──────────────────────────────────────────────────────────────────────┼──────────────┤\n│ 
https://www.congress.gov/                                            │          400 │\n│ https://www.congress.gov/%20-%20legislation-text                     │          404 │\n│ https://www.congress.gov/'/'                                         │          404 │\n│ https://www.congress.gov/103/bills/hjres281/BILLS-103hjres281cph.pdf │          400 │\n│ https://www.congress.gov/103/bills/hr1804/BILLS-103hr1804pcs.pdf     │          400 │\n│ https://www.congress.gov/103/bills/hr1834/BILLS-103hr1834ih.pdf      │          400 │\n│ https://www.congress.gov/103/bills/hr20/BILLS-103hr20cds.pdf         │          400 │\n│ https://www.congress.gov/103/bills/hr2876/BILLS-103hr2876eh.pdf      │          400 │\n│ https://www.congress.gov/103/bills/hr3508/BILLS-103hr3508eh.pdf      │          503 │\n│ https://www.congress.gov/103/bills/hr4165/BILLS-103hr4165ih.pdf      │          400 │\n├──────────────────────────────────────────────────────────────────────┴──────────────┤\n│ 10 rows                                                                   2 columns │\n└─────────────────────────────────────────────────────────────────────────────────────┘\n```\n\n404s are fetch_gone, so the 400s and 503 are concerning.\n\nHow about for robots? (Note the trick of `url_path = '/robots.txt'` ... 
in Common Crawl's normal url index\nthere's `subset = 'robotstxt'`, but that hive partition does not exist in the EOT2020 url index.)\n\n```bash\npython ./url-select.py \"url, fetch_status\" \"url_host_tld = 'gov' AND url_host_registered_domain = 'congress.gov' AND fetch_status \u003e= 400 AND url_path = '/robots.txt' LIMIT 10\"\n```\n\n```\nSELECT url, fetch_status FROM eot2020_url WHERE url_host_tld = 'gov' AND url_host_registered_domain = 'congress.gov' AND fetch_status \u003e= 400 AND url_path = '/robots.txt' LIMIT 10\n┌─────────────────────────────────────┬──────────────┐\n│                 url                 │ fetch_status │\n│               varchar               │    int16     │\n├─────────────────────────────────────┼──────────────┤\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n├─────────────────────────────────────┴──────────────┤\n│ 10 rows                                  2 columns │\n└────────────────────────────────────────────────────┘\n```\n\nHm, and are there non-400s?\n\n```bash\npython ./url-select.py \"url, fetch_status\" \"url_host_tld = 'gov' AND url_host_registered_domain = 'congress.gov' AND fetch_status \u003e 400 AND url_path = '/robots.txt' LIMIT 10\"\n```\n\n```\nSELECT url, fetch_status FROM eot2020_url WHERE url_host_tld = 'gov' AND url_host_registered_domain = 'congress.gov' AND fetch_status \u003e 400 AND url_path = '/robots.txt' LIMIT 10\n┌─────────────────────────────────────────┬──────────────┐\n│                   url        
           │ fetch_status │\n│                 varchar                 │    int16     │\n├─────────────────────────────────────────┼──────────────┤\n│ http://bioguide.congress.gov/robots.txt │          404 │\n│ http://bioguide.congress.gov/robots.txt │          404 │\n│ http://bioguide.congress.gov/robots.txt │          404 │\n│ http://bioguide.congress.gov/robots.txt │          404 │\n│ http://bioguide.congress.gov/robots.txt │          404 │\n│ http://bioguide.congress.gov/robots.txt │          404 │\n│ http://bioguide.congress.gov/robots.txt │          404 │\n│ http://bioguide.congress.gov/robots.txt │          404 │\n│ http://bioguide.congress.gov/robots.txt │          404 │\n│ http://bioguide.congress.gov/robots.txt │          404 │\n├─────────────────────────────────────────┴──────────────┤\n│ 10 rows                                      2 columns │\n└────────────────────────────────────────────────────────┘\n```\n\nWhoops, I meant to only look at the host congress.gov! Which has 2 host names, congress.gov and www.congress.gov. 
Having already\nnoticed that congress.gov is a redirect, let's just look at www.congress.gov:\n\n```\npython ./url-select.py \"url, fetch_status\" \"url_host_tld = 'gov' AND url_host_name = 'www.congress.gov' AND fetch_status \u003e= 400 AND url_path = '/robots.txt' LIMIT 10\"\nSELECT url, fetch_status FROM eot2020_url WHERE url_host_tld = 'gov' AND url_host_name = 'www.congress.gov' AND fetch_status \u003e= 400 AND url_path = '/robots.txt' LIMIT 10\n┌─────────────────────────────────────┬──────────────┐\n│                 url                 │ fetch_status │\n│               varchar               │    int16     │\n├─────────────────────────────────────┼──────────────┤\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n│ https://www.congress.gov/robots.txt │          400 │\n├─────────────────────────────────────┴──────────────┤\n│ 10 rows                                  2 columns │\n└────────────────────────────────────────────────────┘\n```\n\nAre they all 400s? 
Let's try a GROUP BY:\n\n```\npython ./url-select.py \"fetch_status, COUNT(*)\" \"url_host_tld = 'gov' AND url_host_name = 'www.congress.gov' AND fetch_status \u003e= 400 AND url_path = '/robots.txt' GROUP BY fetch_status\"\nSELECT fetch_status, COUNT(*) FROM eot2020_url WHERE url_host_tld = 'gov' AND url_host_name = 'www.congress.gov' AND fetch_status \u003e= 400 AND url_path = '/robots.txt' GROUP BY fetch_status\n┌──────────────┬──────────────┐\n│ fetch_status │ count_star() │\n│    int16     │    int64     │\n├──────────────┼──────────────┤\n│          400 │          300 │\n└──────────────┴──────────────┘\n```\n\n### What are some of the LOTE urls, for example on irs.gov?\n\n```\npython ./url-select.py \"url, content_languages\" \"url_host_registered_domain = 'irs.gov' AND content_languages NOT LIKE 'eng%' LIMIT 10\"\nSELECT url, content_languages FROM eot2020_url WHERE url_host_registered_domain = 'irs.gov' AND content_languages NOT LIKE 'eng%' LIMIT 10\n┌────────────────────────┬───────────────────┐\n│          url           │ content_languages │\n│        varchar         │      varchar      │\n├────────────────────────┼───────────────────┤\n│ https://www.irs.gov/es │ spa,eng,kor       │\n│ https://www.irs.gov/es │ spa,eng,kor       │\n│ https://www.irs.gov/es │ spa,eng,kor       │\n│ https://www.irs.gov/es │ spa,eng,kor       │\n│ https://www.irs.gov/es │ spa,eng,kor       │\n│ https://www.irs.gov/es │ spa,eng,kor       │\n│ https://www.irs.gov/es │ spa,eng,kor       │\n│ https://www.irs.gov/es │ spa,eng,kor       │\n│ https://www.irs.gov/es │ spa,eng,kor       │\n│ https://www.irs.gov/es │ spa,eng,kor       │\n├────────────────────────┴───────────────────┤\n│ 10 rows                          2 columns │\n└────────────────────────────────────────────┘\n```\nBoring. 
Let's look at non-'/es' paths:\n\n```\npython ./url-select.py \"url, content_languages\" \"url_host_registered_domain = 'irs.gov' AND url_path \u003c\u003e '/es' AND content_languages NOT LIKE 'eng%' LIMIT 10\"\nSELECT url, content_languages FROM eot2020_url WHERE url_host_registered_domain = 'irs.gov' AND url_path \u003c\u003e '/es' AND content_languages NOT LIKE 'eng%' LIMIT 10\n┌─────────────────────────────────────────────────────────────────────────────────────────────────────┬───────────────────┐\n│                                                 url                                                 │ content_languages │\n│                                               varchar                                               │      varchar      │\n├─────────────────────────────────────────────────────────────────────────────────────────────────────┼───────────────────┤\n│ https://www.irs.gov/es/'https://www.irs.gov/es'                                                     │ spa,eng,kor       │\n│ https://www.irs.gov/es/'https://www.irs.gov/es'                                                     │ spa,eng,kor       │\n│ https://www.irs.gov/es/'https://www.irs.gov/es'                                                     │ spa,eng,kor       │\n│ https://www.irs.gov/es/'https://www.irs.gov/es'                                                     │ spa,eng,kor       │\n│ https://www.irs.gov/es/'https://www.irs.gov/es'                                                     │ spa,eng,kor       │\n│ https://www.irs.gov/es/'https://www.irs.gov/es'                                                     │ spa,eng,kor       │\n│ https://www.irs.gov/es/'https://www.irs.gov/es/charities-and-nonprofits'                            │ spa,eng,kor       │\n│ https://www.irs.gov/es/'https://www.irs.gov/es/coronavirus-tax-relief-and-economic-impact-payments' │ spa,eng,kor       │\n│ https://www.irs.gov/es/'https://www.irs.gov/es/coronavirus-tax-relief-and-economic-impact-payments' │ spa,eng,kor 
      │\n│ https://www.irs.gov/es/'https://www.irs.gov/es/coronavirus-tax-relief-and-economic-impact-payments' │ spa,eng,kor       │\n├─────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────────────┤\n│ 10 rows                                                                                                       2 columns │\n└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘\n```\nThose URLs are all mangled: each has a quoted copy of a URL appended after '/es/', so the path is not exactly '/es' and the filter lets it through. Let's try excluding everything matching '/es%' instead:\n\n```\npython ./url-select.py \"url, content_languages\" \"url_host_registered_domain = 'irs.gov' AND url_path NOT LIKE '/es%' AND content_languages NOT LIKE 'eng%' LIMIT 10\"\nSELECT url, content_languages FROM eot2020_url WHERE url_host_registered_domain = 'irs.gov' AND url_path NOT LIKE '/es%' AND content_languages NOT LIKE 'eng%' LIMIT 10\n┌──────────────────────────────────────────────────────────────────────────────┬───────────────────┐\n│                                     url                                      │ content_languages │\n│                                   varchar                                    │      varchar      │\n├──────────────────────────────────────────────────────────────────────────────┼───────────────────┤\n│ https://www.irs.gov/help/information-about-federal-taxes-arabic              │ ara,eng,xho       │\n│ https://www.irs.gov/help/information-about-federal-taxes-arabic              │ ara,eng,xho       │\n│ https://www.irs.gov/help/information-about-federal-taxes-bengali             │ ben,eng,xho       │\n│ https://www.irs.gov/help/information-about-federal-taxes-bengali             │ ben,eng,xho       │\n│ https://www.irs.gov/help/information-about-federal-taxes-chinese-traditional │ zho,eng,ind       │\n│ https://www.irs.gov/help/information-about-federal-taxes-chinese-traditional │ zho,eng,ind       │\n│ https://www.irs.gov/help/information-about-federal-taxes-farsi   
            │ fas,eng,urd       │\n│ https://www.irs.gov/help/information-about-federal-taxes-farsi               │ fas,eng,urd       │\n│ https://www.irs.gov/help/information-about-federal-taxes-french              │ fra,eng,kor       │\n│ https://www.irs.gov/help/information-about-federal-taxes-french              │ fra,eng,kor       │\n├──────────────────────────────────────────────────────────────────────────────┴───────────────────┤\n│ 10 rows                                                                                2 columns │\n└──────────────────────────────────────────────────────────────────────────────────────────────────┘\n```\nJackpot!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Feot2020-host-index","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcommoncrawl%2Feot2020-host-index","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Feot2020-host-index/lists"}