{"id":28626173,"url":"https://github.com/commoncrawl/arc2warc-conversion","last_synced_at":"2026-02-16T04:04:58.301Z","repository":{"id":285932161,"uuid":"940750954","full_name":"commoncrawl/arc2warc-conversion","owner":"commoncrawl","description":"Experiences converting Common Crawl's ARC files from the crawls 2008 - 2012 to the WARC format","archived":false,"fork":false,"pushed_at":"2025-04-03T14:18:26.000Z","size":41,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":5,"default_branch":"main","last_synced_at":"2026-01-28T00:51:48.823Z","etag":null,"topics":["arc","arc-files","warc","warc-files","warc-format","webarchive","webarchiving"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/commoncrawl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-02-28T18:13:21.000Z","updated_at":"2025-04-03T14:18:30.000Z","dependencies_parsed_at":"2025-09-10T03:58:39.558Z","dependency_job_id":"176dbd28-d904-4143-bb62-3b9e1c886655","html_url":"https://github.com/commoncrawl/arc2warc-conversion","commit_stats":null,"previous_names":["commoncrawl/arc2warc-conversion"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/commoncrawl/arc2warc-conversion","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Farc2warc-conversion","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Farc
2warc-conversion/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Farc2warc-conversion/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Farc2warc-conversion/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/commoncrawl","download_url":"https://codeload.github.com/commoncrawl/arc2warc-conversion/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/commoncrawl%2Farc2warc-conversion/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29499815,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-16T03:57:51.541Z","status":"ssl_error","status_checked_at":"2026-02-16T03:55:59.854Z","response_time":115,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arc","arc-files","warc","warc-files","warc-format","webarchive","webarchiving"],"created_at":"2025-06-12T08:41:07.729Z","updated_at":"2026-02-16T04:04:58.259Z","avatar_url":"https://github.com/commoncrawl.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"Conversion of Common Crawl ARC Files to WARC\n============================================\n\nIn this project we want to share our experiences with converting\nCommon Crawl's older ARC archives to 
WARC:\n- issues observed during the conversion\n- tools used for the conversion\n- tests to verify the resulting WARC files\n\n\n## Background\n\nThe first three crawls run by the Common Crawl Foundation (CCF) used\nthe ARC file format as the primary archive format for web captures. The\nARC files were written with different crawler software. In total,\nthere are 130 TiB of ARC data:\n\n- May 2008 - Jan 2009 (`crawl-001`), Nutch\n  - 12 TiB ARC files\n- July 2009 - Sept 2010 (`crawl-002`), Nutch\n  - 29 TiB ARC files\n- Jan 2012 - June 2012,\n  [commoncrawl-crawler](https://github.com/commoncrawl/commoncrawl-crawler),\n  - 89 TiB ARC files\n  - text and metadata extracts (Hadoop sequence files)\n\nThe conversion of ARC data to the WARC format is motivated by\n- low (and further dropping) support for the ARC file format by data\n  and text processing tools\n- bugs and glitches in the ARC files written by CCF's crawler software\n\nSeveral format issues in the ARC files were detected in 2019 when the\nARC data was indexed using [warcio](https://github.com/webrecorder/warcio)\nand [PyWB](https://pywb.readthedocs.io/en/latest/). 
The indexing succeeded\nafter some modifications and work-arounds were made.\n\nThe main open questions are\n- the conversion and transfer of metadata from ARC to WARC, both on the record and file level (\"warcinfo\" record)\n- crawl/fetch metadata stored in HTTP headers\n- the required rewriting of HTTP headers\n- whether to convert file by file or to group 10 ARC files into one WARC to meet the [1 GB WARC file size recommendation](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#annex-c-informative-warc-file-size-and-name-recommendations) (100 MB for ARC)\n\n\n## References and Documentation\n\n### The 2008 - 2012 Archives\n\n- 2008 - 2012 crawls\n  - location and data formats: \u003chttps://commoncrawl.atlassian.net/wiki/spaces/CRWL/pages/2850886/About+the+Data+Set\u003e\n- 2012 crawl - numbers and statistics:\n  - \u003chttps://commoncrawl.atlassian.net/wiki/spaces/CRWL/pages/4292610/Data+Set+Size+Statistics+-+2012\u003e\n  - \u003chttps://commoncrawl.org/blog/startup-profile-swiftkeys-head-data-scientist-on-the-value-of-common-crawls-open-data\u003e\n  - \u003chttps://commoncrawl.org/blog/a-look-inside-common-crawls-210tb-2012-web-corpus\u003e\n  - \u003chttps://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-iZNm1TvVGuCW7245-WGvZq47teNpb_uL5N9/edit\u003e\n\n\n### The ARC File Format\n\n- \u003chttps://archive.org/web/researcher/ArcFileFormat.php\u003e\n- \u003chttps://en.wikipedia.org/wiki/Heritrix#Arc_files\u003e\n\n\n### The WARC File Format\n\n- \u003chttps://en.wikipedia.org/wiki/Web_ARChive\u003e\n- \u003chttps://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/\u003e\n\n\n### ARC to WARC Conversion Software\n\n- warcio \u003chttps://warcio.readthedocs.io/en/latest/#arc-files\u003e\n- jwarc \u003chttps://github.com/iipc/jwarc\u003e\n\nIn principle, any software able to read ARC and write WARC can be\nused. 
We focus on warcio and jwarc.\n\n\n### WARC Validation Software\n\n(if not listed as ARC-to-WARC conversion software)\n\n- FastWARC \u003chttps://resiliparse.chatnoir.eu/en/latest/api/fastwarc.html\u003e\n- `warc-tiny` \u003chttps://github.com/JustAnotherArchivist/little-things\u003e\n\nFeature matrix (March 2025):\n\n|                         | warcio | jwarc | FastWARC | warc-tiny |\n| ----------------------- | :----: | :---: | :------: | :-------: |\n| Software version        |  1.7.5 | 0.31.1|   0.15.1 |  579d589  |\n|                         |        |       |          |           |\n| Validates also ARC      |    ✔   |    ✔  |     ✘    |    ✘      |\n|                         |        |       |          |           |\n|                         |        |       |          |           |\n| WARC-Block-Digest       |    ✔   |    ✔  |     ✔    |     ✔     |\n| WARC-Payload-Digest     |    ✔   |    ✔  |     ✔    |     ✔     |\n| WARC-Target-URI         |    ✘   |    ✘  |    ✘     |     ✘     |\n|                         |        |       |          |           |\n|                         |        |       |          |           |\n\n\n\n## Format Issues in Common Crawl ARC Files\n\n### Issues Discovered 2019 While Indexing ARC Data\n\n1. invalid URIs causing SURT canonicalization to fail\n   - seen in `crawl-002` (2009/2010)\n   - examples\n     ```\n     http://www.babelicious.com%3Fnats=stiff7788:partner:BBLCS,0,0,0,0\n     http://www.insuranceforpets.net]www.insuranceforpets.net/\n     ```\n   - fixed in `pywb/utils/canonicalize.py`\n     1. try to fix the URL ([2103a7d](https://github.com/commoncrawl/pywb/commit/2103a7da02fd8e90e21a796095ad972ed9f14af4))\n     2. 
do not fail but skip ([6dc9c39](https://github.com/commoncrawl/pywb/commit/6dc9c395201be219c89862d72f62e4261cb497fb))\n   - test resources\n     - `test/resources/arc/crawl-002_2009_09_17_12_1253241189984_12-4827319.arc.gz`\n       (`s3://commoncrawl/crawl-002/2009/09/17/12/1253241189984_12.arc.gz`, offset 4827319, length 4260)\n     - `test/resources/arc/crawl-002_2010_02_16_114_1266352769711_14-7060652.arc.gz`\n       (`s3://commoncrawl/crawl-002/2010/02/16/14/1266352769711_14.arc.gz`, offset 7060652, length 1959)\n\n2. white space in URLs breaks ARC header\n   - seen in 2012 crawl\n   - fixed in `warcio/recordloader.py`\n     ([a8a0014](https://github.com/commoncrawl/pywb/commit/a8a0014408aeda258eba8143f7ae18a279b515a3))\n   - notes:\n     - the URL spec ([RFC 1738](https://datatracker.ietf.org/doc/html/rfc1738))\n       and the living [WHATWG URL standard](https://url.spec.whatwg.org/) do not allow white\n       space in URLs. However, the [Java URL class](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/net/URL.html)\n       does not complain, while the [Java URI class](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/net/URI.html)\n       does. Thus, white space in URLs is a common issue in Java-based crawlers.\n     - while the ARC descriptions use \"URLs\", the WARC spec requires a URI ([WARC-Target-URI](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-target-uri))\n     - test resources\n       - `test/resources/arc/crawl-2012_1341690165832_1341699469441_1478-7224105.arc.gz`\n         (`s3://commoncrawl/parse-output/segment/1341690165832/1341699469441_1478.arc.gz`, offset 7224105, length 10280)\n   \n3. 
wrong content-length in ARC header\n   - seen in 2012 crawl\n   - reported and discussed: \u003chttps://groups.google.com/forum/#!topic/common-crawl/P40niQBb8GY\u003e\n   - fixed in\n     - `pywb/indexer/archiveindexer.py` ([fd4ace1](https://github.com/commoncrawl/pywb/commit/fd4ace13e4c61bbd10034e7b2233c53b7d69fa4b))\n     - `warcio/archiveiterator.py` ([b431f7d](https://github.com/commoncrawl/pywb/commit/b431f7d23a5514186b116a25cafb943f2d57b83c))\n   - test resources\n     - `test/resources/arc/crawl-2012_1341690165636_1341785606830_6-0-4421.arc.gz`\n       (`s3://commoncrawl/parse-output/segment/1341690165636/1341785606830_6.arc.gz`, offset 0, length 4421)\n\nNote: issues 2 and 3 had already been fixed in 2012 after they were\nreported on CCF's discussion group.  However, the policy was to keep\nthe erroneous ARC files in place but to exclude them from a list of\n\"valid ARC files\".  In 2019, the first indexing pass erroneously\nincluded these outdated ARC files as well. That way the issues were\n\"rediscovered\" in 2019.\n\n\n\n## Commands to Convert an ARC into a WARC File\n\nIf not specified otherwise, all commands in this and the following section use these shell variables:\n```bash\nARC_FILE=test/resources/arc/crawl-002_2010_02_16_114_1266352769711_14-7060652.arc.gz\nWARC_FILE=test/output/warc/$(basename $ARC_FILE .arc.gz).warc.gz\n```\n\n- warcio\n  ```\n  warcio recompress --verbose $ARC_FILE $WARC_FILE\n  ```\n\n\n## Commands to Validate WARC Files\n\n- `warcio check --verbose $WARC_FILE`\n- `fastwarc check --verify-payloads $WARC_FILE`\n- `java -jar jwarc-0.31.1.jar validate --verbose $WARC_FILE`\n- `warc-tiny verify $WARC_FILE`\n\n\n## ARC and WARC Metadata\n\n- file-level metadata\n  - [warcinfo record](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warcinfo)\n  - cf. 
\u003chttps://groups.google.com/g/warc-tools/c/YKInCZg2BGw\u003e\n  - field mappings from ARC `filedesc` to `warcinfo`\n  - ARC-to-WARC conversion metadata\n    - see [warc-specifications#52](https://github.com/iipc/warc-specifications/issues/52) \"WARC-Conversion-Software and WARC-Conversion-Command fields\"\n\nMetadata stored in HTTP headers of ARC records\n- crawler content limit / truncated payload\n  - [WARC-Truncated header](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-truncated)\n  - cf. \u003chttps://commoncrawl.org/errata/content-is-truncated\u003e\n  - poorly documented for CCF's ARC crawls\n    - \u003chttps://groups.google.com/d/topic/common-crawl/hQTRnWahcHA/discussion\u003e\n    - 2 MiB or 500 kiB?\n    - marked by HTTP header `x-commoncrawl-ContentTruncated` in ARC files\n      - values: `TruncatedInDownload` and `TruncatedInInflate` (can be combined)\n- identified page encoding\n  - ARC: in HTTP header `x-commoncrawl-DetectedCharset`\n  - WARC: in [WARC metadata record](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#metadata)\n- further crawler-specific artificial HTTP headers:\n  - `x_commoncrawl_FetchTimestamp`\n  - `x_commoncrawl_HostFP`\n  - `x_commoncrawl_OriginalURL`\n  - `x_commoncrawl_URLFP`\n  - `x_commoncrawl_CrawlNo`\n  - `x_commoncrawl_ParseSegmentId`\n  - `x_commoncrawl_Signature`\n  - ...?\n\n\n\n## Required Rewriting of HTTP Headers\n\nCCF's ARC and WARC files store the payload with content and transfer encodings removed.\nHowever, in the ARC files the HTTP headers `Content-Encoding` and `Transfer-Encoding` are preserved, e.g.\n```\nContent-Encoding:gzip\nTransfer-Encoding:chunked\n```\n\nThis was also the case for CCF WARC files written in 2013 – 2016/2018 and caused trouble for WARC parsers trying to decode the content or transfer encoding:\n- [Common Crawl saves gzipped body in extracted 
form](https://groups.google.com/g/common-crawl/c/XiLLXX1KSUs/m/anuZq8FCCgAJ), fixed in [commoncrawl/nutch@3551eb6](https://github.com/commoncrawl/nutch/commit/3551eb6dbb7f7152a13d2e4eb0f8eb6014dc8252)\n- finally addressed in 2018 ([August 2018 crawl](https://commoncrawl.org/blog/august-2018-crawl-archive-now-available)):\n  - the original fields `Content-Encoding`, `Transfer-Encoding` and `Content-Length` are preserved using the prefix `X-Crawler-`\n  - the length of the payload after decoding is saved in a new `Content-Length` header\n- see also: [warc-specifications#22](https://github.com/iipc/warc-specifications/issues/22) \"Clarify whether Transfer-Encoding can or should be preserved\"\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Farc2warc-conversion","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcommoncrawl%2Farc2warc-conversion","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcommoncrawl%2Farc2warc-conversion/lists"}