{"id":28626134,"url":"https://github.com/commoncrawl/cc-index-annotations","last_synced_at":"2025-06-12T08:40:58.552Z","repository":{"id":295309480,"uuid":"989724989","full_name":"commoncrawl/cc-index-annotations","owner":"commoncrawl","description":"Example code to join an annotation to a host or url index","archived":false,"fork":false,"pushed_at":"2025-05-24T20:03:03.000Z","size":5,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-05-24T20:33:29.774Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/commoncrawl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-24T17:37:56.000Z","updated_at":"2025-05-24T20:03:06.000Z","dependencies_parsed_at":"2025-05-24T20:33:33.852Z","dependency_job_id":"035fdc63-c175-44c4-ae30-511e2347f496","html_url":"https://github.com/commoncrawl/cc-index-annotations","commit_stats":null,"previous_names":["commoncrawl/cc-index-annotations"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/commoncrawl/cc-index-annotations","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-index-annotations","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-index-annotations/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-index-annotations/releases","manifests_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/commoncrawl%2Fcc-index-annotations/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/commoncrawl","download_url":"https://codeload.github.com/commoncrawl/cc-index-annotations/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-index-annotations/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259432237,"owners_count":22856706,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-12T08:40:57.223Z","updated_at":"2025-06-12T08:40:58.541Z","avatar_url":"https://github.com/commoncrawl.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# cc-index-annotations\n\nCommon Crawl's datasets are all random access -- you can use an index\nto efficiently process a subset of the data, if you don't want all of\nit.\n\nIndex annotations are a mechanism for one organization to create a\ndatabase table that can be joined to Common Crawl's columnar url index\nor host index. Then they or another organization can use the\nresulting joined index.\n\nAs an example, perhaps you're interested in content on the web that\nhas Creative Commons licenses -- that's currently less than 0.1% of\nall of the content on the web. In this scenario, you'd like to use a\nlist of webpages -- labeled by someone else -- to extract just the CC\nlicensed content. 
That is an annotation table, joined against our\nurl index, followed by an extraction of content.\n\nAlternatively, you might be interested in investigating which web hosts\nhave a lot of Creative Commons-licensed data. That is an aggregation\nof url-level information to a host level, and then a join against our\nhost index. Then you could use additional information from the host\nindex, such as host ranks or the percentage of languages, to examine\nwhich web hosts have a lot of Creative Commons-licensed content in a\nparticular language.\n\n## More examples of annotations\n\n- URL quality indications from the ClueWeb22 dataset, Nemotron-CC, FineWeb\n- robots.txt information, such as how often robots.txt files are changed on web hosts\n- Alternative language identifications (the standard index only has CLD2)\n\n(None of these are currently available in a form that can easily be joined.)\n\n## Understanding the \"join\" column(s)\n\nA database join needs to use one or more keys. For better or worse,\nthe usual practice in web archiving is a little unusual. For\napplications like the Wayback Machine, the important fields in the url\nindex are the URL and the time it was crawled. There is a unique ID\nnamed WARC-Record-ID, but it is not traditionally included in indexes.\nAlso, the URL is usually indexed in the SURT form, which drops the\nleading www and reverses the order of the parts of the hostname:\n\n- example.com/README -\u003e com,example)/README\n- www.example.com/README -\u003e com,example)/README\n- www.com -\u003e com,www)/\n\nFor the host index, the primary key is the hostname part of the SURT:\n\n- example.com -\u003e com,example\n- www.example.com -\u003e com,example\n- www.com -\u003e com,www\n\nIn both cases (url index, host index) the data tables are Hive sharded\nby crawl, e.g. CC-MAIN-2025-18.\n\nIf you only have a list of urls (e.g. 
FineWeb), then you'll have to\ndecide for yourself what time range you'd like the annotation to apply\nto.\n\n## 'left join' jargon\n\nThese tools consider the url index or host index the \"left\" database,\nand the annotation database is joined to the left database using a\nLEFT OUTER JOIN. This means that any row in the annotation that does\nnot match the \"left\" database will not appear in the result. Rows in\nthe \"left\" database that have no matching annotation are kept, with\nempty annotation columns. The choice of \"left\" instead of \"right\" is\narbitrary.\n\n## Annotation tool\n\nThis repo contains tools that join an index with an annotation,\nrun a query, and save the output to a CSV file. The configuration\nof the index, annotation, and query is contained in YAML\nfiles. The index and annotation can be on local disk or on AWS.\n\nIn the following example, the index is our host index, and the\nannotation is taken from our web graph and contains the columns\n`surt_host_name`, `webgraph_outdegree`, and `webgraph_indegree`.\n\nThe YAML configuration files are:\n\n- `left_local_host_index.yaml`\n- `left_web_host_index.yaml`\n- `join_local_outin.yaml`\n- `join_web_outin.yaml`\n- `action_surt_host_name.yaml`\n- `action_like_surt_host_name.yaml`\n\nTo run the Python code, you'll need to install a few things in your\nvirtual environment:\n\n```\npip install -r requirements.txt\n```\n\nIf you want to use \"web\", you'll need to download some necessary\nfiles:\n\n```\nmake host-index-paths.gz webgraph-outin-paths.gz\n```\n\nHere are example command lines:\n\n- `python annotate.py left_local_host_index.yaml join_local_outin.yaml action_surt_host_name.yaml commoncrawl.org`\n- `python annotate.py left_web_host_index.yaml join_web_outin.yaml action_surt_host_name.yaml commoncrawl.org`\n- `python annotate.py left_local_host_index.yaml join_local_outin.yaml action_like_surt_host_name.yaml .commoncrawl.org`\n- `python annotate.py left_web_host_index.yaml join_web_outin.yaml action_like_surt_host_name.yaml .commoncrawl.org`\n\nAnd example CSV 
output:\n\n```\n\"surt_host_name\",\"crawl\",\"hcrank10\",\"webgraph_outdegree\",\"webgraph_indegree\"\n\"org,commoncrawl\",\"CC-MAIN-2021-49\",4.718,,\n\"org,commoncrawl\",\"CC-MAIN-2022-05\",4.718,,\n\"org,commoncrawl\",\"CC-MAIN-2022-21\",4.86,,\n\"org,commoncrawl\",\"CC-MAIN-2022-27\",4.86,,\n\"org,commoncrawl\",\"CC-MAIN-2022-33\",4.86,,\n\"org,commoncrawl\",\"CC-MAIN-2022-40\",4.847,,\n\"org,commoncrawl\",\"CC-MAIN-2022-49\",4.847,,\n\"org,commoncrawl\",\"CC-MAIN-2023-06\",4.847,,\n\"org,commoncrawl\",\"CC-MAIN-2023-14\",5.003,,\n\"org,commoncrawl\",\"CC-MAIN-2023-23\",5.003,,\n\"org,commoncrawl\",\"CC-MAIN-2023-40\",5.003,,\n\"org,commoncrawl\",\"CC-MAIN-2023-50\",4.773,,\n\"org,commoncrawl\",\"CC-MAIN-2024-10\",4.954,,\n\"org,commoncrawl\",\"CC-MAIN-2024-18\",4.879,,\n\"org,commoncrawl\",\"CC-MAIN-2024-22\",4.872,,\n\"org,commoncrawl\",\"CC-MAIN-2024-26\",4.982,,\n\"org,commoncrawl\",\"CC-MAIN-2024-30\",5.085,291,1746\n\"org,commoncrawl\",\"CC-MAIN-2024-33\",4.928,274,1654\n\"org,commoncrawl\",\"CC-MAIN-2024-38\",5.101,288,1608\n\"org,commoncrawl\",\"CC-MAIN-2024-42\",5.067,294,1624\n\"org,commoncrawl\",\"CC-MAIN-2024-46\",4.974,307,1710\n\"org,commoncrawl\",\"CC-MAIN-2024-51\",4.83,329,1588\n\"org,commoncrawl\",\"CC-MAIN-2025-05\",4.967,,\n\"org,commoncrawl\",\"CC-MAIN-2025-08\",4.973,330,1682\n\"org,commoncrawl\",\"CC-MAIN-2025-13\",4.962,,\n\"org,commoncrawl\",\"CC-MAIN-2025-18\",4.845,310,1721\n```\n## TODOS\n\n- copy script that joins an index and annotation and outputs the result to local disk\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcc-index-annotations","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcommoncrawl%2Fcc-index-annotations","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcc-index-annotations/lists"}