{"id":49181710,"url":"https://github.com/commoncrawl/whirlwind-java","last_synced_at":"2026-06-18T15:01:35.236Z","repository":{"id":364055270,"uuid":"1080649917","full_name":"commoncrawl/whirlwind-java","owner":"commoncrawl","description":"A whirlwind tour of Common Crawl's data using Java","archived":false,"fork":false,"pushed_at":"2026-04-20T16:23:18.000Z","size":493,"stargazers_count":3,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-11T14:04:25.619Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/commoncrawl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE-APACHE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-21T17:06:00.000Z","updated_at":"2026-04-20T16:23:39.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/commoncrawl/whirlwind-java","commit_stats":null,"previous_names":["commoncrawl/whirlwind-java"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/commoncrawl/whirlwind-java","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fwhirlwind-java","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fwhirlwind-java/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fwhirlwind-java/releases","manifests_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fwhirlwind-java/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/commoncrawl","download_url":"https://codeload.github.com/commoncrawl/whirlwind-java/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Fwhirlwind-java/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34495380,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-18T02:00:06.871Z","response_time":128,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-23T02:00:21.366Z","updated_at":"2026-06-18T15:01:35.216Z","avatar_url":"https://github.com/commoncrawl.png","language":"Java","funding_links":[],"categories":["Training/Documentation"],"sub_categories":["Training Materials"],"readme":"# Whirlwind Tour of Common Crawl's Datasets using Java\n\nThe Common Crawl corpus contains petabytes of crawl data, including raw web page data, metadata, and parsed text. Common Crawl's data storage is a little complicated, as you might expect for such a large and rich dataset. 
We make our crawl data available in a variety of formats (WARC, WET, WAT) and we also have two index files of the crawled webpages: CDXJ and columnar.\n```mermaid\nflowchart TD\n    WEB[\"WEB\"] -- crawler --\u003e cc[\"Common Crawl\"]\n    cc --\u003e WARC[\"WARC\"] \u0026 WAT[\"WAT\"] \u0026 WET[\"WET\"] \u0026 CDXJ[\"CDXJ\"] \u0026 Columnar[\"Columnar\"] \u0026 etc[\"...\"]\n    WEB@{ shape: cyl}\n    WARC@{ shape: stored-data}\n    WAT@{ shape: stored-data}\n    WET@{ shape: stored-data}\n    CDXJ@{ shape: stored-data}\n    Columnar@{ shape: stored-data}\n    etc@{ shape: stored-data}\n```\n\nThe goal of this whirlwind tour is to show you how a single webpage appears in all of these different places. That webpage is [https://an.wikipedia.org/wiki/Escopete](https://an.wikipedia.org/wiki/Escopete), which we crawled at 2024-05-18T01:58:10Z. On the way, we'll also explore the file formats we use and learn about some useful tools for interacting with our data!\n\nIn the Whirlwind Tour, we will:\n1) explore the WARC, WET and WAT file formats used to store Common Crawl's data.\n2) play with some useful tools for interacting with the data: [jwarc](https://github.com/iipc/jwarc) and [duckdb](https://duckdb.org/).\n3) learn about how the data is compressed in an unusual way to allow random access.\n4) use the CDXJ index and the columnar index to access the data we want.\n\n**Prerequisites:** To get the most out of this tour, you should be comfortable with Maven, running commands on the command line, and basic SQL. Some knowledge of HTTP requests and HTML is also helpful but not essential. We assume you have [make](https://www.gnu.org/software/make/) and [Maven](https://maven.apache.org/) installed.\n\nWe use a [Makefile](https://makefiletutorial.com) to provide many of the commands needed to run this tutorial. To see what commands are being run, open the `Makefile` and find the relevant target: e.g. 
`make build` runs `mvn clean package`.\n\nLet's get started!\n\n## Task 0: Set-up\n\nThis tutorial was written on Linux and macOS and it should also work on Windows. If you encounter any problems, please raise an issue.\n\n### Clone the repository\n\nFirst, [clone this repository](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository) to create a local copy, then navigate to the `whirlwind-java` directory on your computer.\n\nMaven downloads the JARs of all required libraries when they are needed, so you don't need to run anything beforehand. \n\n### Install and configure AWS-CLI\n\nWe will use the AWS Command Line Interface (CLI) later in the tour to access the data stored in Common Crawl's S3 bucket. Instructions on how to install the AWS-CLI and configure your account are available on the [AWS website](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html). \n\n## Task 1: Look at the crawl data\n\nCommon Crawl's website includes a [Get Started](https://commoncrawl.org/get-started) guide which summarises different ways to access the data and the file formats. We can use the dropdown menu to access the links for downloading crawls over HTTP(S):\n\n![crawl_dropdown.png](img/crawl_dropdown.png)\n\nIf we click on `CC-MAIN-2024-22` in the dropdown, we are taken to a page listing the files contained in this crawl: \n\n![crawl_file_listing.png](img/crawl_file_listing.png)\n\nIn this whirlwind tour, we're going to look at the WARC, WET, and WAT files: the data types which store the crawl data. Later, we will look at the two index files and how these help us access the crawl data we want. 
At the [end of the Tour](#other-datasets), we'll mention some of Common Crawl's other datasets and where you can find more information about them.\n\n### WARC\n\n[WARC files](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/) are containers that hold files, similar to zip and tar files. WARC is the standard data format used by the web archiving community, and we use it to store raw crawl data. As you can see in the file listing above, our WARC files are very large even when compressed! Luckily, we have a much smaller example to look at. \n\nOpen `data/whirlwind.warc` in your favorite text editor. Note that this is an uncompressed version of the file; normally we work with these files while they are compressed. This is the WARC corresponding to the single webpage we mentioned in the introduction.\n\nYou'll see four records in total, with the start of each record marked with the header `WARC/1.0` followed by metadata related to that particular record. The `WARC-Type` field tells you the type of each record. In our WARC file, we have:\n1) a `warcinfo` record. Every WARC has one at the start. \n2) the `request` to the webserver, with its HTTP headers.\n3) the `response` from the webserver, with its HTTP headers followed by the HTML.\n4) a `metadata` record related to the HTTP response.\n\n### WET\n\nWET (WARC Encapsulated Text) files only contain the body text of web pages parsed from the HTML and exclude any HTML code, images, or other media. This makes them useful for text analysis and natural language processing (NLP) tasks.\n\nOpen `data/whirlwind.warc.wet`: this is the WET derived from our original WARC. We can see that it's still in WARC format with two records: \n1) a `warcinfo` record.\n2) a `conversion` record: the parsed text with HTTP headers removed.\n\n### WAT\n\nWAT (Web Archive Transformation) files contain metadata associated with the crawled web pages (e.g. 
parsed data from the HTTP response headers, links recovered from HTML pages, server response codes, etc.). They are useful for analysis that requires understanding the structure of the web.\n\nOpen `data/whirlwind.warc.wat`: this is the WAT derived from our original WARC. Like the WET file, it's also in WARC format. It contains two records:\n1) a `warcinfo` record.\n2) a `metadata` record: there should be one for each response in the WARC. The metadata is stored as JSON. \n\nYou might want to feed the JSON into a pretty-printer to read it more easily. For example, you can save just the JSON into a file and use `python -m json.tool FILENAME` to pretty-print it.\n\nNow that we've looked at the uncompressed versions of these files to understand their structure, we'll be interacting with compressed WARC, WET, and WAT files for the rest of this tour. This is the usual way we manipulate this data with software tools due to the size of the files.\n\n## Task 2: Iterate over WARC, WET, and WAT files\n\nThe [JWarc](https://github.com/iipc/jwarc) Java library lets us read and write WARC files both programmatically and via a CLI.\n\nDownload the [JWarc](https://github.com/iipc/jwarc) JAR by running `make jwarc.jar`, which should place the JAR in the root directory. 
\nIf you download it yourself, we recommend renaming it to remove the version from the JAR filename, so that you can copy and paste the commands directly.\nYou can now explore the CLI commands available by running:\n\n```shell\njava -jar jwarc.jar --help\n```\n\n\u003cdetails\u003e\n  \u003csummary\u003eClick to view the result\u003c/summary\u003e\n\n```\nusage: jwarc \u003ccommand\u003e [args]...\n\nCommands:\n\n  cdx         List records in CDX format\n  cdxj        List records in CDXJ format\n  dedupe      Deduplicate records by looking up a CDX server\n  extract     Extract record by offset\n  fetch       Download a URL recording the request and response\n  filter      Copy records that match a given filter expression\n  ls          List records in WARC file(s)\n  record      Fetch a page and subresources using headless Chrome\n  recorder    Run a recording proxy\n  saveback    Saves wayback-style replayed pages as WARC records\n  screenshot  Take a screenshot of each page in the given WARCs\n  serve       Serve WARC files with a basic replay server/proxy\n  stats       Print statistics about WARC and CDX files\n  validate    Validate WARC or ARC files\n  version     Print version information\n```\n\n\u003c/details\u003e\n\nLet's iterate over our WARC, WET, and WAT files and print out the record types we looked at before. 
We will use `ls` to list records and their offsets, and `extract` to pull out record information (payload, headers) using the offsets as a reference:\n\n```shell\njava -jar jwarc.jar ls data/whirlwind.warc.gz\n         0 warcinfo   -    -\n       516 request    GET  https://an.wikipedia.org/wiki/Escopete\n      1023 response   200  https://an.wikipedia.org/wiki/Escopete\n     18374 metadata   -    https://an.wikipedia.org/wiki/Escopete\n```\n\nThe `java -jar jwarc.jar ls` command lists the records in a WARC file, showing the offset, type, HTTP method or status code (if applicable), and target URI for each record.\n\nYou can then extract the response record by its offset:\n```shell\njava -jar jwarc.jar extract data/whirlwind.warc.gz 1023\n```\n\nThis command will return the full record: headers and payload. You can select just one of them by passing the `--payload` or `--headers` option before the filename. \n\n```shell\njava -jar jwarc.jar extract --headers data/whirlwind.warc.gz 1023\n```\n\u003cdetails\u003e\n  \u003csummary\u003eClick to view the result\u003c/summary\u003e\n\n```\nWARC/1.0\nContent-Length: 74581\nContent-Type: application/http; msgtype=response\nWARC-Block-Digest: sha1:35FTUGFVNWRVTZQGCWIX2MQA3LMYC7X7\nWARC-Concurrent-To: \u003curn:uuid:292f457d-203c-42f2-a1b5-69a4dabefd4f\u003e\nWARC-Date: 2024-05-18T01:58:10Z\nWARC-Identified-Payload-Type: text/html\nWARC-IP-Address: 208.80.154.224\nWARC-Payload-Digest: sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU\nWARC-Record-ID: \u003curn:uuid:2aabeff2-67f5-4608-8466-e87c6296e2b6\u003e\nWARC-Target-URI: https://an.wikipedia.org/wiki/Escopete\nWARC-Type: response\nWARC-Warcinfo-ID: \u003curn:uuid:668d88fc-4208-41fc-b327-1aa6cb783331\u003e\n\nHTTP/1.1 200 OK\ndate: Sat, 18 May 2024 01:58:10 GMT\nserver: mw-web.eqiad.canary-bb67b76b8-jtwdb\nx-content-type-options: nosniff\ncontent-language: an\norigin-trial: 
AonOP4SwCrqpb0nhZbg554z9iJimP3DxUDB8V4yu9fyyepauGKD0NXqTknWi4gnuDfMG6hNb7TDUDTsl0mDw9gIAAABmeyJvcmlnaW4iOiJodHRwczovL3dpa2lwZWRpYS5vcmc6NDQzIiwiZmVhdHVyZSI6IlRvcExldmVsVHBjZCIsImV4cGlyeSI6MTczNTM0Mzk5OSwiaXNTdWJkb21haW4iOnRydWV9\naccept-ch: \nvary: Accept-Encoding,Cookie,Authorization\nlast-modified: Sat, 04 May 2024 01:58:10 GMT\ncontent-type: text/html; charset=UTF-8\nX-Crawler-content-encoding: gzip\nage: 0\nx-cache: cp1106 miss, cp1106 miss\nx-cache-status: miss\nserver-timing: cache;desc=\"miss\", host;desc=\"cp1106\"\nstrict-transport-security: max-age=106384710; includeSubDomains; preload\nreport-to: { \"group\": \"wm_nel\", \"max_age\": 604800, \"endpoints\": [{ \"url\": \"https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error\u0026schema_uri=/w3c/reportingapi/network_error/1.0.0\" }] }\nnel: { \"report_to\": \"wm_nel\", \"max_age\": 604800, \"failure_fraction\": 0.05, \"success_fraction\": 0.0}\nset-cookie: WMF-Last-Access=18-May-2024;Path=/;HttpOnly;secure;Expires=Wed, 19 Jun 2024 00:00:00 GMT\nset-cookie: WMF-Last-Access-Global=18-May-2024;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Wed, 19 Jun 2024 00:00:00 GMT\nset-cookie: WMF-DP=1a6;Path=/;HttpOnly;secure;Expires=Sat, 18 May 2024 00:00:00 GMT\nx-client-ip: 34.239.158.223\ncache-control: private, s-maxage=0, max-age=0, must-revalidate\nset-cookie: GeoIP=US:VA:Ashburn:39.05:-77.49:v4; Path=/; secure; Domain=.wikipedia.org\nset-cookie: NetworkProbeLimit=0.001;Path=/;Secure;Max-Age=3600\naccept-ranges: bytes\nX-Crawler-transfer-encoding: chunked\nContent-Length: 72848\n```\n\u003c/details\u003e\n\nNow, let's have a look at the WET and WAT compressed files.  
\nYou can obtain similar information by running `ls` on those files: \n```shell\njava -jar jwarc.jar ls data/whirlwind.warc.wet.gz \n         0 warcinfo   -    -\n       466 conversion -    https://an.wikipedia.org/wiki/Escopete\n```\n\nand \n\n```shell\njava -jar jwarc.jar ls data/whirlwind.warc.wat.gz \n         0 warcinfo   -    -\n       443 metadata   -    https://an.wikipedia.org/wiki/Escopete\n```\n\nFollowing the same principle, you can obtain the converted text payload by passing the offset of the `conversion` record: \n\n```shell\njava -jar jwarc.jar extract --payload data/whirlwind.warc.wet.gz 466\n```\n\n\u003cdetails\u003e\n  \u003csummary\u003eClick to view the result\u003c/summary\u003e\n\n```\nEscopete - Biquipedia, a enciclopedia libre\nIr al contenido\nMenú principal\nMenú principal\nmover a la barra lateral\nocultar\nNavego\nPortalada\nA tabierna\nActualidat\nZaguers cambeos\nUna pachina a l'azar\nAduya\nDonativos\nMirar\nMirar-lo\nCreyar cuenta\nDentrar-ie\nFerramientas personals\nCreyar cuenta\nDentrar-ie\nPáginas para editores desconectados más información\nContribucions\nPachina de descusión d'ista IP\nContenidos\nmover a la barra lateral\nocultar\nInicio\n1Cheografía\n2Historia\n3Administración\nAlternar subsección Administración\n3.1Alcaldes\n4Molimentos\n5Fiestas\n6Referencias\n7Vinclos externos\nCambiar a la tabla de contenidos\nEscopete\n32 idiomas\nAsturianu\nBrezhoneg\nCatalà\nНохчийн\nCebuano\nDeutsch\nEnglish\nEsperanto\nEspañol\nEuskara\nFrançais\nMagyar\nInterlingua\nInterlingue\nItaliano\nҚазақша\nLadin\nLombard\nBahasa Melayu\nNederlands\nOccitan\nPolski\nPortuguês\nРусский\nSvenska\nТатарча / tatarça\nУкраїнська\nVèneto\nTiếng Việt\nWinaray\n中文\n閩南語 / Bân-lâm-gú\nModificar os enlaces\nPachina\nDiscusión\naragonés\nLeyer\nEditar\nModificar codigo\nAmostrar l'historial\nFerramientas\nHerramientas\nmover a la barra lateral\nocultar\nAcciones\nLeyer\nEditar\nModificar codigo\nAmostrar l'historial\nGeneral\nPachinas que enlazan con ista\nCambios relacionatos\nCargar 
fichero\nPachinas especials\nVinclo permanent\nInformación d'a pachina\nCitar ista pachina\nObtener URL acortado\nDescargar código QR\nElemento de Wikidata\nImprentar/exportar\nCreyar un libro\nDescargar como PDF\nVersión ta imprentar\nEn otros proyectos\nWikimedia Commons\nDe Biquipedia\nIste articlo ye en proceso de cambio enta la ortografía oficial de Biquipedia (la Ortografía de l'aragonés de l'Academia Aragonesa d'a Luenga). Puez aduyar a completar este proceso revisando l'articlo, fendo-ie los cambios ortograficos necesarios y sacando dimpués ista plantilla.\nEscopete\nMunicipio de Castiella-La Mancha\nEntidat\n• Estau\n• Comunidat\n• Provincia\n• Comarca Municipio\nEspanya\nCastiella-La Mancha\nGuadalachara\nLa Alcarria\nSuperficie 19,01 km²\nPoblación\n• Total\n68 hab. (2013)\nAltaria\n• Meyana\n860 m.\nDistancia\n• 47 km\nenta Guadalachara\nAlcalde Hilario Lopez Ferrer\nCodigo postal 19119\nChentilicio escopetero / escopetera\n(en castellano)\nCoordenadas\n40°24’59’’N 3° 0’23’’U\nEscopete\nEscopete en Castiella-La Mancha\nEscopete ye un municipio d'a provincia de Guadalachara, en a comunidat autonoma de Castiella-La Mancha, Espanya, comarca de La Alcarria y partiu chudicial de Guadalachara.\nA suya población ye de 84 habitants (2007), en una superficie de 19,01 km² y una densidat de población de 4,42 hab/km².\nCheografía[editar | modificar o codigo]\nYe situato a 860 metros d'altaria sobre o ran d'a mar, a una distancia de 47 km de Guadalachara, a capital d'a suya provincia, y d'o suyo termin municipal fa parti o lugar de Monteumbría.\nHistoria[editar | modificar o codigo]\nEscopete ye citato en as Relaciones Topográficas de los pueblos de Espanya, feitas por Felipe II de Castiella en 1578.\nAdministración[editar | modificar o codigo]\nAlcaldes[editar | modificar o codigo]\nLista d'alcaldes\nLechislatura\nNombre\nPartiu politico\n1979–1983\n1983–1987\n1987–1991\n1991–1995\n1995–1999\n1999–2003\n2003–2007\n2007–2011 Hilario López Herrer Partido Socialista 
Obrero Español\nMolimentos[editar | modificar o codigo]\nIlesia parroquial de l'Asunción, d'estilo romanico, d'o sieglo XIII.[1] Fue parcialment destruita en a Guerra Civil espanyola.\nFiestas[editar | modificar o codigo]\n11 d'agosto.[1]\nReferencias[editar | modificar o codigo]\n↑ 1,0 1,1 Deputación Provincial de Guadalachara.\nVinclos externos[editar | modificar o codigo]\n(es) Escopete en a pachina web d'a Deputación Provincial de Guadalachara.\nObteniu de \"https://an.wikipedia.org/w/index.php?title=Escopete\u0026oldid=2049929\"\nCategoría:\nLocalidaz d'a provincia de Guadalachara\nCategorías amagadas:\nBiquiprochecto:Grafía/Articlos con grafía EFA\nWikipedia:Articlos con datos por tresladar ta Wikidata\nZaguera edición d'ista pachina o 17 ago 2023 a las 21:26.\nO texto ye disponible baixo a Licencia Creative Commons Atribución/Compartir-Igual; talment sigan d'aplicación clausulas adicionals. Mire-se os termins d'uso ta conoixer más detalles.\nPolitica de privacidat\nSobre Biquipedia\nAlvertencias chenerals\nCódigo de conducta\nDesembolicadors\nEstatisticas\nDeclaración de cookies\nVersión ta mobils\nActivar o desactivar el límite de anchura del contenido\n```\n\n\u003c/details\u003e\n\nFeel free to experiment more by looking at other parts of the records, or by extracting different records. \n\n## Task 3: Index the WARC, WET, and WAT\n\nThe example WARC files we've been using are tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them, we could iterate, but what if we wanted random access so we could read just one particular record? We do that with an index. \n```mermaid\nflowchart LR\n    warc --\u003e indexer --\u003e cdxj \u0026 columnar\n    warc@{shape: cyl}\n    cdxj@{ shape: stored-data}\n    columnar@{ shape: stored-data}\n```\n\n\nWe have two versions of the index: the CDX index and the columnar index. 
The CDX index is useful for looking up single pages, whereas the columnar index is better suited to analytical and bulk queries. We'll look at both in this tour, starting with the CDX index.\n\n### CDX(J) index\n\nThe CDX index files are sorted plain-text files, with each line containing information about a single capture in the WARC. Technically, Common Crawl uses CDXJ index files since the information about each capture is formatted as JSON. We'll use CDX and CDXJ interchangeably in this tour for legacy reasons 💅\n\nWe can create our own CDXJ index from the local WARCs by running:\n\n```make cdxj```\n\nThis uses the JWARC CLI to generate CDXJ index files for our WARC files by running the commands below: \n\n\u003cdetails\u003e\n  \u003csummary\u003eClick to view code\u003c/summary\u003e\n\n```\ncreating *.cdxj index files from the local warcs\njava -jar jwarc.jar cdxj data/whirlwind.warc.gz \u003e whirlwind.warc.cdxj\njava -jar jwarc.jar cdxj data/whirlwind.warc.wet.gz --record-type conversion \u003e whirlwind.warc.wet.cdxj\njava -jar jwarc.jar cdxj data/whirlwind.warc.wat.gz --record-type metadata \u003e whirlwind.warc.wat.cdxj\n```\n\n\u003c/details\u003e\n\nNow look at the `.cdxj` files with `cat whirlwind*.cdxj`. You'll see that each file has one entry in the index. The WARC only has the response record indexed, since by default the indexer assumes that you won't need random access to the request or metadata records. WET and WAT have the conversion and metadata records indexed (Common Crawl doesn't publish a WET or WAT index, just WARC).\n\nFor each of these records, there's one text line in the index - yes, it's a flat file! It starts with a string like `org,wikipedia,an)/wiki/escopete 20240518015810`, followed by a JSON blob. The starting string is the primary key of the index. The first part is a [SURT](http://crawler.archive.org/articles/user_manual/glossary.html#surt) (Sort-friendly URI Reordering Transform). 
The big integer is a date, in ISO-8601 format with the delimiters removed.\n\nWhat is the purpose of this funky format? It's done this way because these flat files (300 gigabytes total per crawl) can be sorted on the primary key using any out-of-core sort utility e.g. the standard Linux `sort`, or one of the Hadoop-based out-of-core sort functions.\n\nThe JSON blob has enough information to cleanly isolate the raw data of a single record: it defines which WARC file the record is in, and the byte offset and length of the record within this file. We'll use that in the next section.\n\n## Task 4: Use the CDXJ index to extract a subset of raw content from the local WARC, WET, and WAT \n\nNormally, compressed files aren't random access. However, the WARC files use a trick to make this possible, which is that every record needs to be separately compressed. The `gzip` compression utility supports this, but it's rarely used.\n\nTo extract one record from a warc file, all you need to know is the filename and the offset into the file. If you're reading over the web, then it really helps to know the exact length of the record.\n\nRun:\n\n```make extract```\n\nto run a set of extractions from your local\n`whirlwind.*.gz` files with `JWARC` using the commands below:\n\n\u003cdetails\u003e\n  \u003csummary\u003eClick to view code\u003c/summary\u003e\n\n```\ncreating extraction.* from local warcs, the offset numbers are from the cdxj index\njava -jar jwarc.jar extract --payload data/whirlwind.warc.gz 1023 \u003e extraction.html\njava -jar jwarc.jar extract --payload data/whirlwind.warc.wet.gz 466 \u003e extraction.txt\njava -jar jwarc.jar extract --payload data/whirlwind.warc.wat.gz 443 \u003e extraction.json\nhint: python -m json.tool extraction.json\n```\n\n\u003c/details\u003e\n\nThe offset numbers in the Makefile are the same\nones as in the index. 
Look at the three output files: `extraction.html`, `extraction.txt`, and `extraction.json` (pretty-print the JSON with `python -m json.tool extraction.json`). \n\nNotice that we extracted HTML from the WARC, text from the WET, and JSON from the WAT (as reflected in the file extensions). This is because the payload in each file type is formatted differently!\n\n## Task 5: Wreck the WARC by compressing it wrong\n\nAs mentioned earlier, WARC/WET/WAT files look like they're normal gzipped files, but they're actually gzipped in a particular way that allows random access. This means that you can't `gunzip` and then `gzip` a WARC without wrecking random access. This example:\n\n* creates a copy of one of the WARC files in the repo\n* lists the records and their offsets using JWARC\n* accesses a record in the middle of the archive to show that it works\n* uncompresses the copy\n* recompresses it the wrong way\n* tries to access a record in the middle of the recompressed archive, showing that it fails\n* recompresses it the right way using `org.commoncrawl.whirlwind.RecompressWARC`\n* accesses a record in the middle of the archive again to show that it now works\n\nRun\n\n```make wreck_the_warc```\n\nand read through the output. 
You should get something like the output below:\n\n\u003cdetails\u003e\n  \u003csummary\u003eClick to view output\u003c/summary\u003e\n\n```\nwe will break and then fix this warc\ncp data/whirlwind.warc.gz data/testing.warc.gz\nrm -f data/testing.warc\ngzip -d data/testing.warc.gz  # windows gunzip no work-a\n\ncompress it the wrong way\ngzip data/testing.warc\n\nshowing the records in the compressed warc - note the offsets of request and response are identical \njava -jar jwarc.jar ls data/testing.warc.gz\n         0 warcinfo   -    -\n      3734 request    GET  https://an.wikipedia.org/wiki/Escopete\n      3734 response   200  https://an.wikipedia.org/wiki/Escopete\n     18386 metadata   -    https://an.wikipedia.org/wiki/Escopete\n\naccess the request record - failing\njava -jar jwarc.jar extract data/testing.warc.gz 3734 || /usr/bin/true\nException in thread \"main\" org.netpreserve.jwarc.ParsingException: invalid WARC record at position 0: \u003c-- HERE --\u003e\\xffffff87@\\r\\xffffffa1\\xffffffca\\xffffff84\\x1d\\xffffffca\\x0f0\\xffffffb4\\xffffff93\\xfffffff9\\xffffffc5\\xfffffff3\\xffffff89\\xffffffeb?\\x1b\\xffffff87,q\\xffffffed\\xffffffb3!s\\xffffffc1\\x08\\xffffff83\\\\xffffffe0T\\xffffffadG\\xffffffdcd5\\x02\\xffffffbaQ... 
(offset 3734)\n        at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:356)\n        at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:181)\n        at org.netpreserve.jwarc.tools.ExtractTool.main(ExtractTool.java:141)\n        at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:26)\n\naccess the response record - failing\njava -jar jwarc.jar extract data/testing.warc.gz 3734 || /usr/bin/true\nException in thread \"main\" org.netpreserve.jwarc.ParsingException: invalid WARC record at position 0: \u003c-- HERE --\u003e\\xffffff87@\\r\\xffffffa1\\xffffffca\\xffffff84\\x1d\\xffffffca\\x0f0\\xffffffb4\\xffffff93\\xfffffff9\\xffffffc5\\xfffffff3\\xffffff89\\xffffffeb?\\x1b\\xffffff87,q\\xffffffed\\xffffffb3!s\\xffffffc1\\x08\\xffffff83\\\\xffffffe0T\\xffffffadG\\xffffffdcd5\\x02\\xffffffbaQ... (offset 3734)\n        at org.netpreserve.jwarc.WarcParser.parse(WarcParser.java:356)\n        at org.netpreserve.jwarc.WarcReader.next(WarcReader.java:181)\n        at org.netpreserve.jwarc.tools.ExtractTool.main(ExtractTool.java:141)\n        at org.netpreserve.jwarc.tools.WarcTool.main(WarcTool.java:26)\n\nnow let's do it the right way\ngzip -d data/testing.warc.gz\nmvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.RecompressWARC -Dexec.args=\"data/testing.warc data/testing.warc.gz\"\n\nshowing the records in the compressed warc\njava -jar jwarc.jar ls data/testing.warc.gz\n         0 warcinfo   -    -\n       518 request    GET  https://an.wikipedia.org/wiki/Escopete\n      1027 response   200  https://an.wikipedia.org/wiki/Escopete\n     18383 metadata   -    https://an.wikipedia.org/wiki/Escopete\n\naccess the request record - works\njava -jar jwarc.jar extract data/testing.warc.gz 518 | head\nWARC/1.0\nContent-Length: 265\nContent-Type: application/http; msgtype=request\nWARC-Block-Digest: sha1:IE7NEN3QEJHUCYRRGVMHDDW3BEHFRQ6V\nWARC-Date: 2024-05-18T01:58:10Z\nWARC-IP-Address: 208.80.154.224\nWARC-Payload-Digest: 
sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ\nWARC-Record-ID: \u003curn:uuid:292f457d-203c-42f2-a1b5-69a4dabefd4f\u003e\nWARC-Target-URI: https://an.wikipedia.org/wiki/Escopete\nWARC-Type: request\n\naccess the response record - works\njava -jar jwarc.jar extract data/testing.warc.gz 1027 | head -n 20\nWARC/1.0\nContent-Length: 74581\nContent-Type: application/http; msgtype=response\nWARC-Block-Digest: sha1:35FTUGFVNWRVTZQGCWIX2MQA3LMYC7X7\nWARC-Concurrent-To: \u003curn:uuid:292f457d-203c-42f2-a1b5-69a4dabefd4f\u003e\nWARC-Date: 2024-05-18T01:58:10Z\nWARC-Identified-Payload-Type: text/html\nWARC-IP-Address: 208.80.154.224\nWARC-Payload-Digest: sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU\nWARC-Record-ID: \u003curn:uuid:2aabeff2-67f5-4608-8466-e87c6296e2b6\u003e\nWARC-Target-URI: https://an.wikipedia.org/wiki/Escopete\nWARC-Type: response\nWARC-Warcinfo-ID: \u003curn:uuid:668d88fc-4208-41fc-b327-1aa6cb783331\u003e\n\nHTTP/1.1 200 OK\ndate: Sat, 18 May 2024 01:58:10 GMT\nserver: mw-web.eqiad.canary-bb67b76b8-jtwdb\nx-content-type-options: nosniff\ncontent-language: an\norigin-trial: AonOP4SwCrqpb0nhZbg554z9iJimP3DxUDB8V4yu9fyyepauGKD0NXqTknWi4gnuDfMG6hNb7TDUDTsl0mDw9gIAAABmeyJvcmlnaW4iOiJodHRwczovL3dpa2lwZWRpYS5vcmc6NDQzIiwiZmVhdHVyZSI6IlRvcExldmVsVHBjZCIsImV4cGlyeSI6MTczNTM0Mzk5OSwiaXNTdWJkb21haW4iOnRydWV9\n```\n\n\u003c/details\u003e\n\nMake sure you compress WARCs the right way!\n\n## Task 6: Query the full CDX index and download those captures from AWS S3\n\nSome of our users only want to download a small subset of the crawl. They want to run queries against an index, either the CDX index we just talked about, or the columnar index, which we'll talk about later.\n\nThe CDX server API is documented [here](https://github.com/webrecorder/pywb/wiki/CDX-Server-API#api-reference) and can be accessed through an HTTP API. 
\n\nRight now there is no dedicated Java tool for querying the CDX index. However, we do have a very useful Python tool for working with the CDX index: [cdx_toolkit](https://github.com/cocrawler/cdx_toolkit). Please refer to the [Python Whirlwind Tour](https://github.com/commoncrawl/whirlwind-python) for more details. \n\nIn this task we will achieve the same results using direct HTTP API calls and JWARC. \n\nRun\n\n```make query_cdx```\n\nThe output looks like this:\n\n\u003cdetails\u003e\n  \u003csummary\u003eClick to view output\u003c/summary\u003e\n\n```\ndemonstrate that we have this entry in the index\ncurl https://index.commoncrawl.org/CC-MAIN-2024-22-index?url=an.wikipedia.org/wiki/Escopete\u0026output=json\u0026from=20240518015810\u0026to=20240518015810\n\n{\"urlkey\": \"org,wikipedia,an)/wiki/escopete\", \"timestamp\": \"20240518015810\", \"url\": \"https://an.wikipedia.org/wiki/Escopete\", \"mime\": \"text/html\", \"mime-detected\": \"text/html\", \"status\": \"200\", \"digest\": \"RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU\", \"length\": \"17423\", \"offset\": \"80610731\", \"filename\": \"crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz\", \"languages\": \"spa\", \"encoding\": \"UTF-8\"}\n\ncleanup previous work\nrm -f TEST-000000.extracted.warc.gz\nretrieve the content from the commoncrawl s3 bucket (offset: 80628153 = 80610731 + 17423 - 1)\ncurl --request GET \\\n  --url https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz \\\n  --header 'Range: bytes=80610731-80628153' \u003e TEST-000000.extracted.warc.gz\n\nindex this new warc\njava -jar jwarc.jar cdxj TEST-000000.extracted.warc.gz  \u003e TEST-000000.extracted.warc.cdxj\ncat TEST-000000.extracted.warc.cdxj\norg,wikipedia,an)/wiki/escopete 20240518015810 {\"url\": \"https://an.wikipedia.org/wiki/Escopete\", \"mime\": \"text/html\", \"status\": \"200\", 
\"digest\": \"sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU\", \"length\": \"17455\", \"offset\": \"406\", \"filename\": \"TEST-000000.extracted.warc.gz\"}\n\niterate this new warc\njava -jar jwarc.jar ls TEST-000000.extracted.warc.gz\n 0 response   200  https://an.wikipedia.org/wiki/Escopete\n```\n\n\u003c/details\u003e\n\nThere's a lot going on here so let's unpack it a little.\n\n#### Check that the crawl has a record for the page we are interested in\n\nWe check for captures by querying `index.commoncrawl.org` with GET parameters, specifying the crawl (`CC-MAIN-2024-22-index`), the exact URL `an.wikipedia.org/wiki/Escopete`, and the timestamp range `from=20240518015810` and `to=20240518015810`. \nThe result tells us that the crawl successfully fetched this page at timestamp `20240518015810`. \n* Captures are identified by the SURT key (`urlkey`) and the timestamp.\n\n[//]: # (* If you need to search across all crawls,  of `--crawl CC-MAIN-2024-22`, you could pass `--cc` to search across all crawls.)\n[//]: # (Here I'm tempted to mention that you should use the columnar index for this kind of operations, however cdx_toolkit iterate over all crawls when called with -cc, if I'm not wrong)\n* You can use the parameter `limit=\u003cN\u003e` to limit the number of results returned. In this case, because we have restricted the timestamp range to a single value, we expect only one result.\n* URLs may be specified with wildcards to return even more results: `\"an.wikipedia.org/wiki/Escop*\"` matches `an.wikipedia.org/wiki/Escopulión` and `an.wikipedia.org/wiki/Escopete`.\n\n#### Retrieve the fetched content as WARC\n\nNext, we make another HTTP call, this time to `data.commoncrawl.org`, to retrieve the content and save it locally as a new WARC file. 
\nThis creates the WARC file `TEST-000000.extracted.warc.gz`.\n\n[//]: # (Here there is no warcinfo when getting from data.commoncrawl.org, right?)\n[//]: # (which contains a `warcinfo` record explaining what the WARC is, followed by the `response` record we requested. )\n* If you check the cURL command, you'll find that it uses the offset and length of the WARC record (as returned by the CDX index query) to make an HTTP byte range request to `data.commoncrawl.org` that isolates and returns just the single record we want from the full file. It only downloads the response WARC record because our CDX index only has the response records indexed.\n* Limit, timestamp, and crawl index parameters, as well as URL wildcards, apply only to the index query; the byte range request needs just the filename, offset, and length.\n\n#### Indexing the WARC and viewing its contents\n\nFinally, we run `jwarc cdxj`, which processes the WARC to make a CDXJ index of it as in Task 3, and then list the records using `jwarc ls` as in Task 2.\n\n## Task 7: Find the right part of the columnar index \n\nNow let's look at the columnar index, the other kind of index that Common Crawl makes available. This index is stored in Parquet files so you can access it using SQL-based tools like AWS Athena and DuckDB as well as through tables in your favorite table packages such as pandas, pyarrow, and polars.\n\nWe could read the data directly from our index in our S3 bucket and analyse it in the cloud through AWS Athena. However, this is a managed service that costs money to use (though usually a small amount). [You can read about using it here.](https://commoncrawl.org/blog/index-to-warc-files-and-urls-in-columnar-format) This whirlwind tour uses only the free options: fetching data from outside of AWS (which is kind of slow), or making a local copy of a single columnar index (300 gigabytes per monthly crawl) and using that.\n\nThe columnar index is divided up into a separate index per crawl, which Athena or DuckDB can stitch together. 
The CDX index is similarly divided up, but cdx_toolkit hides that detail from you.\n\nFor the purposes of this whirlwind tour, we don't want to configure all the crawl indices because it would be slow. So let's start by figuring out which crawl was ongoing on the date 20240518015810, and then we'll work with just that one crawl.\n\n### Downloading collinfo.json\n\nWe're going to use the `collinfo.json` file to find out which crawl we want. This file includes the dates for the start and end of every crawl and is available through the Common Crawl website at [index.commoncrawl.org](https://index.commoncrawl.org). To download it, run:\n\n```make download_collinfo```\n\nThe output should look like:\n\n\u003cdetails\u003e\n  \u003csummary\u003eClick to view output\u003c/summary\u003e\n\n```\ndownloading collinfo.json so we can find out the crawl name\ncurl -O https://index.commoncrawl.org/collinfo.json\n  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n100 30950  100 30950    0     0  75467      0 --:--:-- --:--:-- --:--:-- 75487\n```\n\n\u003c/details\u003e\n\nThe date of our test record is 20240518015810, which is 2024-05-18T01:58:10 if you add the delimiters back in. We can scroll through the records in `collinfo.json` and look at the from/to values to find the right crawl: CC-MAIN-2024-22. Now that we know the crawl name, we can access the correct fraction of the index without having to read the metadata of all the rest.\n\n## Task 8: Query using the columnar index + DuckDB from outside AWS\n\nThe columnar index for a single crawl is around 300 gigabytes. If you don't have a lot of disk space, but you do have a lot of time, you can directly access the index stored on AWS S3. We're going to do just that, and then use [DuckDB](https://duckdb.org) to make an SQL query against the index to find our webpage. 
We'll be running the following query:\n\n```sql\n    SELECT\n      *\n    FROM ccindex\n    WHERE subset = 'warc'\n      AND crawl = 'CC-MAIN-2024-22'\n      AND url_host_tld = 'org' -- help the query optimizer\n      AND url_host_registered_domain = 'wikipedia.org' -- ditto\n      AND url = 'https://an.wikipedia.org/wiki/Escopete'\n    ;\n```\n\nRun\n\n```make duck_cloudfront```\n\nOn a machine with a 1 gigabit network connection and many cores, this should take about one minute total, and uses 8 cores. The output should look like:\n\n\u003cdetails\u003e\n  \u003csummary\u003eClick to view output\u003c/summary\u003e\n\n```\nUsing algorithm: cloudfront\nTotal records for crawl: CC-MAIN-2024-22\n100% ▕████████████████████████████████████████████████████████████▏ \n2709877975\n\nOur one row:\n100% ▕████████████████████████████████████████████████████████████▏ \nurl_surtkey | url | url_host_name | url_host_tld | url_host_2nd_last_part | url_host_3rd_last_part | url_host_4th_last_part | url_host_5th_last_part | url_host_registry_suffix | url_host_registered_domain | url_host_private_suffix | url_host_private_domain | url_host_name_reversed | url_protocol | url_port | url_path | url_query | fetch_time | fetch_status | fetch_redirect | content_digest | content_mime_type | content_mime_detected | content_charset | content_languages | content_truncated | warc_filename | warc_record_offset | warc_record_length | warc_segment | crawl | 
subset\n--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\norg,wikipedia,an)/wiki/escopete | https://an.wikipedia.org/wiki/Escopete | an.wikipedia.org | org | wikipedia | an | NULL | NULL | org | wikipedia.org | org | wikipedia.org | org.wikipedia.an | https | NULL | /wiki/Escopete | NULL | 2024-05-18T01:58:10Z | 200 | NULL | RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU | text/html | text/html | UTF-8 | spa | NULL | crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz | 80610731 | 17423 | 1715971057216.39 | CC-MAIN-2024-22 | warc\n\nWriting our one row to a local parquet file, whirlwind.parquet\n100% ▕████████████████████████████████████████████████████████████▏ \nTotal records for local whirlwind.parquet should be 1:\n1\n\nOur one row, locally:\nurl_surtkey | url | url_host_name | url_host_tld | url_host_2nd_last_part | url_host_3rd_last_part | url_host_4th_last_part | url_host_5th_last_part | url_host_registry_suffix | url_host_registered_domain | url_host_private_suffix | url_host_private_domain | url_host_name_reversed | url_protocol | url_port | url_path | url_query | fetch_time | fetch_status | fetch_redirect | content_digest | content_mime_type | content_mime_detected | content_charset | content_languages | content_truncated | warc_filename | warc_record_offset | warc_record_length | warc_segment | crawl | 
subset\n--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\norg,wikipedia,an)/wiki/escopete | https://an.wikipedia.org/wiki/Escopete | an.wikipedia.org | org | wikipedia | an | NULL | NULL | org | wikipedia.org | org | wikipedia.org | org.wikipedia.an | https | NULL | /wiki/Escopete | NULL | 2024-05-18T01:58:10Z | 200 | NULL | RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU | text/html | text/html | UTF-8 | spa | NULL | crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz | 80610731 | 17423 | 1715971057216.39 | CC-MAIN-2024-22 | warc\n\nComplete row:\n  url_surtkey org,wikipedia,an)/wiki/escopete\n  url https://an.wikipedia.org/wiki/Escopete\n  url_host_name an.wikipedia.org\n  url_host_tld org\n  url_host_2nd_last_part wikipedia\n  url_host_3rd_last_part an\n  url_host_4th_last_part null\n  url_host_5th_last_part null\n  url_host_registry_suffix org\n  url_host_registered_domain wikipedia.org\n  url_host_private_suffix org\n  url_host_private_domain wikipedia.org\n  url_host_name_reversed org.wikipedia.an\n  url_protocol https\n  url_port null\n  url_path /wiki/Escopete\n  url_query null\n  fetch_time 2024-05-18T01:58:10Z\n  fetch_status 200\n  fetch_redirect null\n  content_digest RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU\n  content_mime_type text/html\n  content_mime_detected text/html\n  content_charset UTF-8\n  content_languages spa\n  content_truncated null\n  warc_filename 
crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz\n  warc_record_offset 80610731\n  warc_record_length 17423\n  warc_segment 1715971057216.39\n  crawl CC-MAIN-2024-22\n  subset warc\n\nEquivalent to CDXJ:\norg,wikipedia,an)/wiki/escopete 20240518015810 {\"url\":\"https://an.wikipedia.org/wiki/Escopete\",\"mime\":\"text/html\",\"status\":\"200\",\"digest\":\"sha1:RY7PLBUFQNI2FFV5FTUQK72W6SNPXLQU\",\"length\":\"17423\",\"offset\":\"80610731\",\"filename\":\"crawl-data/CC-MAIN-2024-22/segments/1715971057216.39/warc/CC-MAIN-20240517233122-20240518023122-00000.warc.gz\"}\n```\n\u003c/details\u003e\n\nThe above command runs code in `Duck.java`, which accesses the relevant part of the index for our crawl (CC-MAIN-2024-22) and then counts the number of records in that crawl (2709877975!). The code runs the SQL query we saw before which should match the single response record we want. \n\nThe program then writes that one record into a local Parquet file, does a second query that returns that one record, and shows the full contents of the record. We can see that the complete row contains many columns containing different information associated with our record. Finally, it converts the row to the CDXJ format we saw before. \n\n### Bonus: download a full crawl index and query with DuckDB\n\nIn case you want to run many of these queries, and you have a lot of disk space, you'll want to download the 300 gigabyte index and query it repeatedly. 
\n\n\u003e [!IMPORTANT]\n\u003e If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```\n\nTo download the crawl index, please use [cc-downloader](https://github.com/commoncrawl/cc-downloader), which is a polite downloader for Common Crawl data:  \n\n```shell\ncargo install cc-downloader\n```\n\n`cc-downloader` will not be set up on your path by default, but you can run it by prepending the right path.\nIf `cargo` is not available or the installation fails, please check [the official cc-downloader repository](https://github.com/commoncrawl/cc-downloader). \n\n```shell\nmkdir crawl\n~/.cargo/bin/cc-downloader download-paths CC-MAIN-2024-22 cc-index-table crawl\n~/.cargo/bin/cc-downloader download  crawl/cc-index-table.paths.gz --progress crawl\n```\n\nEither way, the file structure should be something like this: \n```shell\ntree crawl/\ncrawl/\n├── cc-index\n│   └── table\n│       └── cc-main\n│           └── warc\n│               └── crawl=CC-MAIN-2024-22\n│                   └── subset=warc\n│                       ├── part-00000-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet\n│                       ├── part-00000-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c001.gz.parquet\n```\n\nThen, you can run `make duck_local_files LOCAL_DIR=crawl` to run the same query as above, but this time using your local copy of the index files.\n\nBoth `make duck_ccf_local_files` and `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` run the same SQL query and should return the same record (written as a Parquet file).\n\n\n## Bonus 2: combine some steps\n\n1. Use the DuckDB techniques from [Task 8](#task-8-query-using-the-columnar-index--duckdb-from-outside-aws) and the [Index Server](https://index.commoncrawl.org) to find a new webpage in the archives. \n2. Note its URL, WARC filename, and timestamp. \n3. 
Now open up the Makefile from [Task 6](#task-6-query-the-full-cdx-index-and-download-those-captures-from-aws-s3) and look at the actions in the `query_cdx` target.\n4. Repeat the `query_cdx` steps, but for the page and date range you found above.\n\n## Congratulations!\n\nYou have completed the Whirlwind Tour of Common Crawl's Datasets using Java! You should now understand the different file types we have in our corpus and how to interact with Common Crawl's datasets using Java. To see what other people have done with our data, see the [Examples page](https://commoncrawl.org/examples) on our website. Why not join our Discord through the Community tab?\n\n\n## Other datasets\n\nWe make more datasets available than just the ones discussed in this Whirlwind Tour. Below is a short introduction to some of these other datasets, along with links to where you can find out more.\n\n### Web Graphs\n\nCommon Crawl regularly releases Web Graphs, which are graphs describing the structure and connectivity of the web as captured in the crawl releases. We provide two levels of graph: host-level and domain-level. Both are available to download [from our website](https://commoncrawl.org/web-graphs). \n\nThe host-level graph describes links between pages on the web at the level of hostnames (e.g. `en.wikipedia.org`). The domain-level graph aggregates the information in the host-level graph, describing links at the pay-level domain (PLD) level (based on the public suffix list maintained on [publicsuffix.org](https://publicsuffix.org)). The PLD is the subdomain directly under the top-level domain (TLD): e.g. for `en.wikipedia.org`, the TLD would be `.org` and the PLD would be `wikipedia.org`.\n\nAs an example, let's look at the [Web Graph release for March, April and May 2025](https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2025-mar-apr-may/index.html). This page provides links to download data associated with the host- and domain-level graphs for those months. 
The key files needed to construct the graphs are the files containing the vertices or nodes (the hosts or domains), and the files containing the edges (the links between the hosts/domains). These are currently the top two links in each of the tables. \n\n![web-graph.png](img/web-graph.png)\n\nThe `.txt` files for nodes and edges are actually tab-separated files. The \"Description\" column in the table explains what data is in the columns. If we download the domain-level graph vertices, \n[cc-main-2025-mar-apr-may-domain-vertices.txt](https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2025-mar-apr-may/domain/cc-main-2025-mar-apr-may-domain-vertices.txt.gz), we find that the top of the file looks like this:\n\n```tsv\n0\taaa.1111\t1\n1\taaa.11111\t1\n2\taaa.2\t1\n3\taaa.a\t1\n4\taaa.aa\t1\n5\taaa.aaa\t3\n6\taaa.aaaa\t1\n7\taaa.aaaaaa\t1\n8\taaa.aaaaaaa\t1\n9\taaa.aaaaaaaaa\t1\n```\nThe first column gives the node ID, the second gives the (pay-level) domain name (in reverse domain name notation), and the third column gives the number of hosts in the domain.\n\nWe can also look at the top of the domain-level edges file [cc-main-2025-mar-apr-may-domain-edges.txt](https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2025-mar-apr-may/domain/cc-main-2025-mar-apr-may-domain-edges.txt.gz):\n\n```tsv\n39\t126790965\n41\t53700629\n41\t126790965\n42\t126790965\n48\t22113090\n48\t91547783\n48\t110426784\n48\t119774627\n48\t121059062\n49\t22113090\n```\nHere, each row defines a link between two domains, with the first column giving the ID of the originating node, and the second column giving the ID of the destination node. 
The files of nodes and edges for the host-level graph are similar to those for the domain graph, with the only difference being that there is no column for number of hosts in a domain.\n\nIf you're interested in working more with the Web Graphs, we provide a [repository](https://github.com/commoncrawl/cc-webgraph) with tools to construct, process, and explore the Web Graphs. We also have a [notebook](https://github.com/commoncrawl/cc-notebooks/tree/main/cc-webgraph-statistics) which shows users how to view statistics about the Common Crawl Web Graph data sets and interactively explore the graphs.\n\n### Host index\n\nThe host index is a database which has one row for every web host we know about in each individual crawl. It contains summary information from the crawl, indices, the web graph, and our raw crawler logs. More information is available [here](https://commoncrawl.org/blog/introducing-the-host-index). We also provide a [repository](https://github.com/commoncrawl/cc-host-index) containing examples on how to use the host index. \n\n### Index annotations\n\nIndex annotations allow users to create a database table that can be joined to Common Crawl's columnar url index or host index. This is useful because we can enrich our datasets with extra information and then use it for analysis. We have a [repository](https://github.com/commoncrawl/cc-index-annotations) with example code for joining annotations to the columnar url index or host index.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fwhirlwind-java","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcommoncrawl%2Fwhirlwind-java","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Fwhirlwind-java/lists"}