{"id":21658656,"url":"https://github.com/commoncrawl/cc-index-table","last_synced_at":"2025-06-12T08:41:08.095Z","repository":{"id":31348292,"uuid":"110119350","full_name":"commoncrawl/cc-index-table","owner":"commoncrawl","description":"Index Common Crawl archives in tabular format","archived":false,"fork":false,"pushed_at":"2025-05-08T20:34:30.000Z","size":197,"stargazers_count":119,"open_issues_count":8,"forks_count":10,"subscribers_count":14,"default_branch":"main","last_synced_at":"2025-05-08T20:35:54.003Z","etag":null,"topics":["apache-parquet","aws-athena","columnar-storage","commoncrawl","spark","sql"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/commoncrawl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2017-11-09T13:33:36.000Z","updated_at":"2025-05-08T19:44:28.000Z","dependencies_parsed_at":"2025-03-10T19:30:16.977Z","dependency_job_id":"9274494c-31a6-4fd0-aac8-d1e0a3d66be2","html_url":"https://github.com/commoncrawl/cc-index-table","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/commoncrawl/cc-index-table","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-index-table","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-index-table/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-index-table/releases","manifests_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-index-table/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/commoncrawl","download_url":"https://codeload.github.com/commoncrawl/cc-index-table/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fcc-index-table/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259432309,"owners_count":22856724,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-parquet","aws-athena","columnar-storage","commoncrawl","spark","sql"],"created_at":"2024-11-25T09:29:37.413Z","updated_at":"2025-06-12T08:41:08.086Z","avatar_url":"https://github.com/commoncrawl.png","language":"Java","readme":"# Common Crawl Index Table\n\nBuild and process the [Common Crawl index table](https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/) – an index to WARC files in a columnar data format ([Apache Parquet](https://parquet.apache.org/)).\n\nThe index table is built from the Common Crawl URL index files by [Apache Spark](https://spark.apache.org/). 
It can be queried by [SparkSQL](https://spark.apache.org/sql/), [Amazon Athena](https://aws.amazon.com/athena/) (built on [Presto](https://prestosql.io/) or [Trino](https://trino.io/)), [Apache Hive](https://hive.apache.org/) and many other big data frameworks and applications.\n\nThis project provides a comprehensive set of example queries (SQL) and also Java code to fetch and process the WARC records matched by a SQL query.\n\n\n## Build Java tools\n\n`mvn package`\n\n\n## Spark installation\n\n[Spark](https://spark.apache.org/) needs to be installed in order to [build the table](#conversion-of-the-url-index) and can also be used [for processing](#process-the-table-with-spark). Please refer to the [Spark documentation](https://spark.apache.org/docs/latest/) for instructions on how to install Spark and set up a Spark cluster.\n\n\n## Building and running using Docker\n\nA [Dockerfile](./Dockerfile) is provided to compile the project and run the Spark job in a Docker container.\n\n1. build the Docker image:\n   ```sh\n   docker build . -t cc-index-table\n   ```\n2. run the table converter tool, here showing the command-line help (`--help`):\n   ```sh\n   docker run --rm -ti cc-index-table --help\n   ```\n   More details on running the converter are given below.\n\nNote that the Dockerfile defines the conversion tool as the entry point.\nOverriding the entrypoint allows inspecting the container using an interactive shell:\n\n```\n$\u003e docker run --rm --entrypoint=/bin/bash -ti cc-index-table\n\nspark@9eb71e5f09a6:/app$ java -version\nopenjdk version \"17.0.15\" 2025-04-15\nOpenJDK Runtime Environment Temurin-17.0.15+6 (build 17.0.15+6)\nOpenJDK 64-Bit Server VM Temurin-17.0.15+6 (build 17.0.15+6, mixed mode, sharing)\n```\n\nOr you could directly call the command `spark-submit`:\n\n```sh\ndocker run --rm --entrypoint=/opt/spark/bin/spark-submit cc-index-table\n```\n\n\n## Python, PySpark, Jupyter Notebooks\n\nNot part of this project. 
Please have a look at [cc-pyspark](//github.com/commoncrawl/cc-pyspark) for examples of how to query and process the tabular URL index with Python and PySpark. The project [cc-notebooks](//github.com/commoncrawl/cc-notebooks) includes examples of how to gain insights into the Common Crawl data sets using the columnar index.\n\n\n## Conversion of the URL index\n\nA Spark job converts the Common Crawl URL index files (a [sharded gzipped index](https://pywb.readthedocs.io/en/latest/manual/indexing.html#zipnum-sharded-index) in [CDXJ format](https://iipc.github.io/warc-specifications/specifications/cdx-format/openwayback-cdxj/)) into a table in [Parquet](https://parquet.apache.org/) or [ORC](https://orc.apache.org/) format.\n\n```\n\u003e APPJAR=target/cc-index-table-0.3-SNAPSHOT-jar-with-dependencies.jar\n\u003e $SPARK_HOME/bin/spark-submit --class org.commoncrawl.spark.CCIndex2Table $APPJAR\n\nCCIndex2Table [options] \u003cinputPathSpec\u003e \u003coutputPath\u003e\n\nArguments:\n  \u003cinputPaths\u003e\n        pattern describing paths of input CDX files, e.g.\n        s3a://commoncrawl/cc-index/collections/CC-MAIN-2017-43/indexes/cdx-*.gz\n  \u003coutputPath\u003e\n        output directory\n\nOptions:\n  -h,--help                     Show this message\n     --outputCompression \u003carg\u003e  data output compression codec: gzip/zlib\n                                (default), snappy, lzo, none\n     --outputFormat \u003carg\u003e       data output format: parquet (default), orc\n     --partitionBy \u003carg\u003e        partition data by columns (comma-separated,\n                                default: crawl,subset)\n     --useNestedSchema          use the schema with nested columns (default:\n                                false, use flat schema)\n```\n\nThe script [convert_url_index.sh](src/script/convert_url_index.sh) runs `CCIndex2Table` using Spark on Yarn.\n\nColumns are defined and described in the table schema 
([flat](src/main/resources/schema/cc-index-schema-flat.json) or [nested](src/main/resources/schema/cc-index-schema-nested.json)).\n\n\n### Running the converter in a Docker container\n\nThe converter can be run from the Docker container built from the Dockerfile; see the instructions above.\n\nThe steps given below are just an example – the way data is passed in and out of the container may vary.\n\n```sh\n# create a test folder\nmkdir -p /tmp/data/in\n\n# copy CDX files into /tmp/data/in/\ncp .../*.cdx.gz /tmp/data/in/\n\ntree /tmp/data/\n# outputs:\n# /tmp/data/\n# └── in\n#     └── CC-MAIN-20241208172518-20241208202518-00000.cdx.gz\n\n# ensure that the user \"spark\" inside the container also has write permissions\nchmod a+w /tmp/data\n\n# note: the output will be written to /tmp/data/out/, but Spark\n#       will complain if the output folder already exists\n\n# launch the Docker container, running the Spark job\ndocker run --mount=type=bind,source=/tmp/data,destination=/data --rm cc-index-table /data/in /data/out\n\ntree /tmp/data/\n# /tmp/data/\n# ├── in\n# │   └── CC-MAIN-20241208172518-20241208202518-00000.cdx.gz\n# └── out\n#     ├── crawl=CC-MAIN-2024-51\n#     │   └── subset=warc\n#     │       └── part-00000-4b2c091d-24db-4248-8c3c-817fd04b7a85.c000.gz.parquet\n#     └── _SUCCESS\n```\n\n\n## Query the table in Amazon Athena\n\nFirst, the table needs to be imported into [Amazon Athena](https://aws.amazon.com/athena/). In the Athena Query Editor:\n\n1. create a database `ccindex`: `CREATE DATABASE ccindex` and make sure that it's selected as \"DATABASE\"\n2. edit the \"create table\" statement ([flat](src/sql/athena/cc-index-create-table-flat.sql) or [nested](src/sql/athena/cc-index-create-table-nested.sql)) and add the correct table name and path to the Parquet/ORC data on `s3://`. Execute the \"create table\" query.\n3. make Athena recognize the data partitions on `s3://`: `MSCK REPAIR TABLE ccindex` (do not forget to adapt the table name). 
This step needs to be repeated every time new data partitions have been added.\n\nA couple of sample queries are also provided (for the flat schema):\n- count captures over partitions (crawls and subsets) to get a quick overview of how many pages are contained in the monthly crawl archives (and are also indexed in the table): [count-by-partition.sql](src/sql/examples/cc-index/count-by-partition.sql)\n- page/host/domain counts per top-level domain: [count-by-tld-page-host-domain.sql](src/sql/examples/cc-index/count-by-tld-page-host-domain.sql)\n- \"word\" count of\n  - host name elements (split host name at `.` into words): [count-hostname-elements.sql](src/sql/examples/cc-index/count-hostname-elements.sql)\n  - URL path elements (separated by `/`): [count-url-path-elements.sql](src/sql/examples/cc-index/count-url-path-elements.sql)\n- count\n  - HTTP status codes: [count-fetch-status.sql](src/sql/examples/cc-index/count-fetch-status.sql)\n  - the domains of a specific top-level domain: [count-domains-of-tld.sql](src/sql/examples/cc-index/count-domains-of-tld.sql)\n  - page captures of Internationalized Domain Names (IDNA): [count-idna.sql](src/sql/examples/cc-index/count-idna.sql)\n  - URL paths pointing to robots.txt files: [count-robotstxt-url-paths.sql](src/sql/examples/cc-index/count-robotstxt-url-paths.sql) (note: `/robots.txt` may be a redirect)\n  - pages of the Alexa top 1 million sites by joining two tables (ccindex and a CSV file): [count-domains-alexa-top-1m.sql](src/sql/examples/cc-index/count-domains-alexa-top-1m.sql)\n- compare document MIME types (Content-Type in HTTP response header vs. 
MIME type detected by [Tika](https://tika.apache.org/)): [compare-mime-type-http-vs-detected.sql](src/sql/examples/cc-index/compare-mime-type-http-vs-detected.sql)\n- distribution/histogram of host name lengths: [host-length-distrib.sql](src/sql/examples/cc-index/host-length-distrib.sql)\n- export WARC record specs (file, offset, length) for\n  - a single domain: [get-records-of-domain.sql](src/sql/examples/cc-index/get-records-of-domain.sql)\n  - a specific MIME type: [get-records-of-mime-type.sql](src/sql/examples/cc-index/get-records-of-mime-type.sql)\n  - a specific language (e.g., Icelandic): [get-records-for-language.sql](src/sql/examples/cc-index/get-records-for-language.sql)\n  - home pages of a given list of domains: [get-records-home-pages.sql](src/sql/examples/cc-index/get-records-home-pages.sql)\n- find homepages for low-resource languages: [get-home-pages-languages.sql](src/sql/examples/cc-index/get-home-pages-languages.sql)\n- obtain a random sample of URLs: [random-sample-urls.sql](src/sql/examples/cc-index/random-sample-urls.sql)\n- find similar domain names by Levenshtein distance (few characters changed): [similar-domains.sql](src/sql/examples/cc-index/similar-domains.sql)\n- average length, occupied storage and payload truncation of WARC records by MIME type: [average-warc-record-length-by-mime-type.sql](src/sql/examples/cc-index/average-warc-record-length-by-mime-type.sql)\n- count pairs of top-level domain and content language: [count-language-tld.sql](src/sql/examples/cc-index/count-language-tld.sql)\n- find correlations between TLD and content language using the log-likelihood ratio: [loglikelihood-language-tld.sql](src/sql/examples/cc-index/loglikelihood-language-tld.sql)\n- ... 
and similar for correlations between content language and character encoding: [correlation-language-charset.sql](src/sql/examples/cc-index/correlation-language-charset.sql)\n- discover sites hosting content of specific language(s): [site-discovery-by-language.sql](src/sql/examples/cc-index/site-discovery-by-language.sql)\n- find multi-lingual domains by analyzing URL paths: [get-language-translations-url-path.sql](src/sql/examples/cc-index/get-language-translations-url-path.sql)\n- extract robots.txt records for a list of sites: [get-records-robotstxt.sql](src/sql/examples/cc-index/get-records-robotstxt.sql)\n\nAthena creates results in CSV format. E.g., for the last example, the mining of multi-lingual domains we get:\n\ndomain                    |n_lang | n_pages  | lang_counts\n--------------------------|-------|----------|------------------\nvatican.va                |    40 |    42795 | {de=3147, ru=20, be=1, fi=3, pt=4036, bg=11, lt=1, hr=395, fr=5677, hu=79, uc=2, uk=17, sk=20, sl=4, sp=202, sq=5, mk=1, ge=204, sr=2, sv=3, or=2243, sw=5, el=5, mt=2, en=7650, it=10776, es=5360, zh=5, iw=2, cs=12, ar=184, vi=1, th=4, la=1844, pl=658, ro=9, da=2, tr=5, nl=57, po=141}\niubilaeummisericordiae.va |     7 |     2916 | {de=445, pt=273, en=454, it=542, fr=422, pl=168, es=612}\nosservatoreromano.va      |     7 |     1848 | {de=284, pt=42, en=738, it=518, pl=62, fr=28, es=176}\ncultura.va                |     3 |     1646 | {en=373, it=1228, es=45}\nannusfidei.va             |     6 |      833 | {de=51, pt=92, en=171, it=273, fr=87, es=159}\npas.va                    |     2 |      689 | {en=468, it=221}\nphotogallery.va           |     6 |      616 | {de=90, pt=86, en=107, it=130, fr=83, es=120}\nim.va                     |     6 |      325 | {pt=2, en=211, it=106, pl=1, fr=3, es=2}\nmuseivaticani.va          |     5 |      266 | {de=63, en=54, it=47, fr=37, es=65}\nlaici.va                  |     4 |      243 | {en=134, it=5, fr=51, es=53}\nradiovaticana.va          
|     3 |      220 | {en=5, it=214, fr=1}\ncasinapioiv.va            |     2 |      213 | {en=125, it=88}\nvaticanstate.va           |     5 |      193 | {de=25, en=76, it=24, fr=25, es=43}\nlaityfamilylife.va        |     5 |      163 | {pt=21, en=60, it=3, fr=78, es=1}\ncamposanto.va             |     1 |      156 | {de=156}\nsynod2018.va              |     3 |      113 | {en=24, it=67, fr=22}\n\n\n\n## Process the Table with Spark\n\n### Export Views\n\nAs a first use case, let's export parts of the table and save them in one of the formats supported by Spark. The tool [CCIndexExport](src/main/java/org/commoncrawl/spark/examples/CCIndexExport.java) runs a Spark job to extract parts of the index table and save them as a table in Parquet, ORC, JSON or CSV. It may even transform the data into an entirely different table. Please refer to the [Spark SQL programming guide](https://spark.apache.org/docs/latest/sql-programming-guide.html) and the [overview of built-in SQL functions](https://spark.apache.org/docs/latest/api/sql/) for more information.\n\nThe tool requires an input and an output path as arguments, but you will also want to pass a useful SQL query instead of the default `SELECT * FROM ccindex LIMIT 10`. 
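For example, a minimal query sketch (assuming the flat schema and the default partition columns `crawl` and `subset` described above) that counts captures per partition:\n\n```sql\nSELECT crawl, subset, COUNT(*) AS n_captures\nFROM ccindex\nGROUP BY crawl, subset\nORDER BY crawl, subset\n```\n\n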
All available command-line options are shown when called with `--help`:\n\n```\n\u003e $SPARK_HOME/bin/spark-submit --class org.commoncrawl.spark.examples.CCIndexExport $APPJAR --help\n\nCCIndexExport [options] \u003ctablePath\u003e \u003coutputPath\u003e\n\nArguments:\n  \u003ctablePath\u003e\n        path to cc-index table\n        s3://commoncrawl/cc-index/table/cc-main/warc/\n  \u003coutputPath\u003e\n        output directory\n\nOptions:\n  -h,--help                       Show this message\n  -q,--query \u003carg\u003e                SQL query to select rows\n  -t,--table \u003carg\u003e                name of the table data is loaded into\n                                  (default: ccindex)\n     --numOutputPartitions \u003carg\u003e  repartition data to have \u003cn\u003e output partitions\n     --outputCompression \u003carg\u003e    data output compression codec: none, gzip/zlib\n                                  (default), snappy, lzo, etc.\n                                  Note: the availability of compression options\n                                  depends on the chosen output format.\n     --outputFormat \u003carg\u003e         data output format: parquet (default), orc,\n                                  json, csv\n     --outputPartitionBy \u003carg\u003e    partition data by columns (comma-separated,\n                                  default: crawl,subset)\n```\n\nThe following Spark SQL options are recommended to achieve optimal query performance:\n```\nspark.hadoop.parquet.enable.dictionary=true\nspark.hadoop.parquet.enable.summary-metadata=false\nspark.sql.hive.metastorePartitionPruning=true\nspark.sql.parquet.filterPushdown=true\n```\n\nBecause the schema of the index table has changed slightly over time as new columns were added, the following option is required if any of the new columns (e.g., `content_languages`) is used in the query:\n```\nspark.sql.parquet.mergeSchema=true\n```\n\n\n### Export Subsets of the Common Crawl Archives\n\nThe 
[URL index](https://index.commoncrawl.org/) was initially created to easily fetch web page captures from the Common Crawl archives. The columnar index also contains the necessary information for this task - the fields `warc_filename`, `warc_record_offset` and `warc_record_length`. This allows us to define a subset of the Common Crawl archives by a SQL query, fetch all records of the subset and export them to WARC files for further processing. The tool [CCIndexWarcExport](src/main/java/org/commoncrawl/spark/examples/CCIndexWarcExport.java) addresses this use case:\n\n```\n\u003e $SPARK_HOME/bin/spark-submit --class org.commoncrawl.spark.examples.CCIndexWarcExport $APPJAR --help\n\nCCIndexWarcExport [options] \u003ctablePath\u003e \u003coutputPath\u003e\n\nArguments:\n  \u003ctablePath\u003e\n        path to cc-index table\n        s3://commoncrawl/cc-index/table/cc-main/warc/\n  \u003coutputPath\u003e\n        output directory\n\nOptions:\n  -q,--query \u003carg\u003e                  SQL query to select rows. 
Note: the result\n                                    is required to contain the columns `url',\n                                    `warc_filename', `warc_record_offset' and\n                                    `warc_record_length', make sure they're\n                                    SELECTed.\n  -t,--table \u003carg\u003e                  name of the table data is loaded into\n                                    (default: ccindex)\n     --csv \u003carg\u003e                    CSV file to load WARC records by filename,\n                                    offset and length.The CSV file must have\n                                    column headers and the input columns `url',\n                                    `warc_filename', `warc_record_offset' and\n                                    `warc_record_length' are mandatory, see also\n                                    option --query.\n  -h,--help                         Show this message\n     --numOutputPartitions \u003carg\u003e    repartition data to have \u003cn\u003e output\n                                    partitions\n     --numRecordsPerWarcFile \u003carg\u003e  allow max. \u003cn\u003e records per WARC file. This\n                                    will repartition the data so that in average\n                                    one partition contains not more than \u003cn\u003e\n                                    rows. 
Default is 10000, set to -1 to disable\n                                    this option.\n                                    Note: if both --numOutputPartitions and\n                                    --numRecordsPerWarcFile are used, the former\n                                    defines the minimum number of partitions,\n                                    the latter the maximum partition size.\n     --warcCreator \u003carg\u003e            (WARC info record) creator of WARC export\n     --warcOperator \u003carg\u003e           (WARC info record) operator of WARC export\n     --warcPrefix \u003carg\u003e             WARC filename prefix\n```\n\nLet's try to put together a couple of WARC files containing only web pages written in Icelandic (ISO-639-3 language code [isl](https://en.wikipedia.org/wiki/ISO_639:isl)). We choose Icelandic because it's not so common and the number of pages in the Common Crawl archives is manageable, cf. the [language statistics](https://commoncrawl.github.io/cc-crawl-statistics/plots/languages). 
We take the query [get-records-for-language.sql](src/sql/examples/cc-index/get-records-for-language.sql) and run it as a Spark job:\n\n```\n\u003e $SPARK_HOME/bin/spark-submit \\\n   --conf spark.hadoop.parquet.enable.dictionary=true \\\n   --conf spark.hadoop.parquet.enable.summary-metadata=false \\\n   --conf spark.sql.hive.metastorePartitionPruning=true \\\n   --conf spark.sql.parquet.filterPushdown=true \\\n   --conf spark.sql.parquet.mergeSchema=true \\\n   --class org.commoncrawl.spark.examples.CCIndexWarcExport $APPJAR \\\n   --query \"SELECT url, warc_filename, warc_record_offset, warc_record_length\n            FROM ccindex\n            WHERE crawl = 'CC-MAIN-2018-43' AND subset = 'warc' AND content_languages = 'isl'\" \\\n   --numOutputPartitions 12 \\\n   --numRecordsPerWarcFile 20000 \\\n   --warcPrefix ICELANDIC-CC-2018-43 \\\n   s3://commoncrawl/cc-index/table/cc-main/warc/ \\\n   .../my_output_path/\n```\n\nIt's also possible to pass the result of a SQL query as a CSV file, e.g., an Athena result file. If you've already run the [get-records-for-language.sql](src/sql/examples/cc-index/get-records-for-language.sql) query and the output file is available on S3, just replace the `--query` argument with `--csv` pointing to the result file:\n\n```\n\u003e $SPARK_HOME/bin/spark-submit --class org.commoncrawl.spark.examples.CCIndexWarcExport $APPJAR \\\n   --csv s3://aws-athena-query-results-123456789012-us-east-1/Unsaved/2018/10/26/a1a82705-047c-4902-981d-b7a93338d5ac.csv \\\n   ...\n```\n\n","funding_links":[],"categories":["Java"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcc-index-table","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcommoncrawl%2Fcc-index-table","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fcc-index-table/lists"}