{"id":15724431,"url":"https://github.com/cldellow/warc-compression","last_synced_at":"2025-03-31T01:14:36.634Z","repository":{"id":140791632,"uuid":"194524770","full_name":"cldellow/warc-compression","owner":"cldellow","description":"Scripts to experiment with different compression choices for WARCs.","archived":false,"fork":false,"pushed_at":"2019-07-02T21:04:28.000Z","size":21,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-03-29T12:13:21.032Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cldellow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-06-30T14:23:31.000Z","updated_at":"2019-07-02T21:04:30.000Z","dependencies_parsed_at":null,"dependency_job_id":"0387e7bd-478e-4273-911d-1b58751bdeb1","html_url":"https://github.com/cldellow/warc-compression","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cldellow%2Fwarc-compression","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cldellow%2Fwarc-compression/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cldellow%2Fwarc-compression/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cldellow%2Fwarc-compression/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cldellow","d
ownload_url":"https://codeload.github.com/cldellow/warc-compression/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246399798,"owners_count":20770908,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-03T22:16:38.191Z","updated_at":"2025-03-31T01:14:36.614Z","avatar_url":"https://github.com/cldellow.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# warc-compression\nScripts to experiment with different compression choices for WARCs. In particular:\n\n- gzip, from level 1 to 9\n- lz4, from level 1 to 9\n- zstd, from level 1 to 19\n- zstd with dictionary, from level 1 to 19\n\nThe variables of interest are:\n\n- how long does it take to compress?\n- how well does it compress?\n- how long does it take to uncompress?\n\nFor our purposes, we'd like to only measure the CPU - not network I/O\nnor disk I/O.\n\n## Benchmarking\n\nBenchmarking on an r5a.xlarge works well. 
It has 32GB of RAM, of which we can\nallocate 24GB to a RAM drive so that we're not benchmarking fs performance.\n\n```\n# Make a ramdisk so we're not accidentally benchmarking EBS\nsudo mkdir -p /mnt/ramdisk\nsudo mount -t tmpfs -o size=24G tmpfs /mnt/ramdisk\n\n# Install supporting tools\nsudo apt-get update\nsudo apt-get install liblz4-tool build-essential\n\ngit clone https://github.com/cldellow/warc-compression.git\ngit clone https://github.com/facebook/zstd.git\ncd zstd\nmake -j4\nsudo make install\n\n# Fetch sample files\nwget http://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-22/segments/1558232255092.55/warc/CC-MAIN-20190519181530-20190519203530-00051.warc.gz -O train.warc.gz\nwget http://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-22/segments/1558232255092.55/wet/CC-MAIN-20190519181530-20190519203530-00051.warc.wet.gz -O train.wet.gz\nwget http://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-22/segments/1558232255092.55/wat/CC-MAIN-20190519181530-20190519203530-00051.warc.wat.gz -O train.wat.gz\n\nwget http://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-51/segments/1544376823785.24/warc/CC-MAIN-20181212065445-20181212090945-00451.warc.gz -O test.warc.gz\nwget http://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-51/segments/1544376823785.24/wet/CC-MAIN-20181212065445-20181212090945-00451.warc.wet.gz -O test.wet.gz\nwget http://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-51/segments/1544376823785.24/wat/CC-MAIN-20181212065445-20181212090945-00451.warc.wat.gz -O test.wat.gz\n```\n\n# Benchmarking generic dict vs custom dict\n```\n~/warc-compression/extract warc \ncp -ar e e.bak\ncp -ar s s.bak\n\n# Create a dictionary trained on everything\nzstd -9 -o dict.generic --train s/*\n\n# Create train/test folders for metadata and requests\nmkdir sm em sr er\ncp $(grep -l WARC-Type:.metadata s/*) sm\ncp $(grep -l WARC-Type:.metadata e/*) em\ncp $(grep -l WARC-Type:.request s/*) sr\ncp $(grep -l WARC-Type:.request e/*) 
er\n\n# Measure raw sizes\ndu --bytes er em\n#34893842        er\n#32600931        em\n\n# Compress with zstd -9 as a baseline\nzstd -9 er/* em/*\nfor x in em er; do echo -n \"$x \"; ls -l $x/*.zst | awk '{ N += $5 } END { print N}'; done\n#em 22478177\n#er 22555102\n\n# Compress with generic dictionary\nrm er/*.zst em/*.zst\nzstd -9 -D dict.generic er/* em/*\nfor x in em er; do echo -n \"$x \"; ls -l $x/*.zst | awk '{ N += $5 } END { print N}'; done\n#em 9789227\n#er 9130685\nrm er/*.zst em/*.zst\n\n# Create specific dictionaries, compress with them\nzstd -9 -o dict.metadata --train sm/*\nzstd -9 -o dict.request --train sr/*\nzstd -9 -D dict.request er/*\nzstd -9 -D dict.metadata em/*\nfor x in em er; do echo -n \"$x \"; ls -l $x/*.zst | awk '{ N += $5 } END { print N}'; done\n#em 8488818\n#er 8183580\n\n# Filter responses that are a specific language. Conceptually:\n# 1) Copy all responses to own folder\n# 2) Copy all requests that have cld2 metadata for language X to own folder\n# 3) Extract WARC-Concurrent-To fields from (2) to file\n# 4) Use IDs in (3) to do fgrep in files in (1) for responsive records\n#\n# Or do this insanity:\nmkdir ei si eb sb\nfor lang in deu; do\n  zstd --train -o dict.$lang -9 $(grep Concurrent-To $(grep cld2.*$lang s/* -l) | sed -e 's#.*\u003c##' -e 's#\u003e##' -e 's#\\r##' -e 's#^#WARC-Record-ID: \u003c#' | grep -l --fixed-strings -f - s/*)\n  zstd -9 -D dict.$lang $(grep Concurrent-To $(grep cld2.*$lang e/* -l) | sed -e 's#.*\u003c##' -e 's#\u003e##' -e 's#\\r##' -e 's#^#WARC-Record-ID: \u003c#' | grep -l --fixed-strings -f - e/*)\n  ls -l e/*.zst | awk '{ N += $5 } END { print \"dict lang \" N }'\n\n  rm e/*.zst\n  zstd --train --maxdict 1048576 -o dict.$lang -9 $(grep Concurrent-To $(grep cld2.*$lang s/* -l) | sed -e 's#.*\u003c##' -e 's#\u003e##' -e 's#\\r##' -e 's#^#WARC-Record-ID: \u003c#' | grep -l --fixed-strings -f - s/*)\n  zstd -9 -D dict.$lang $(grep Concurrent-To $(grep cld2.*$lang e/* -l) | sed -e 's#.*\u003c##' -e 
's#\u003e##' -e 's#\\r##' -e 's#^#WARC-Record-ID: \u003c#' | grep -l --fixed-strings -f - e/*)\n  ls -l e/*.zst | awk '{ N += $5 } END { print \"dict lang1M \" N }'\n\n\n  rm e/*.zst\n  zstd -9 -D dict.generic $(grep Concurrent-To $(grep cld2.*$lang e/* -l) | sed -e 's#.*\u003c##' -e 's#\u003e##' -e 's#\\r##' -e 's#^#WARC-Record-ID: \u003c#' | grep -l --fixed-strings -f - e/*)\n  ls -l e/*.zst | awk '{ N += $5 } END { print \"dict generic \" N }'\n\n  rm e/*.zst\n  zstd -9 $(grep Concurrent-To $(grep cld2.*$lang e/* -l) | sed -e 's#.*\u003c##' -e 's#\u003e##' -e 's#\\r##' -e 's#^#WARC-Record-ID: \u003c#' | grep -l --fixed-strings -f - e/*)\n  ls -l e/*.zst | awk '{ N += $5 } END { print \"no dict \" N }'\n  zstdcat e/*.zst | wc -c\ndone\n```\n\nA German-specific dictionary gives only a marginal benefit over the generic dictionary: about 1.1% smaller at the default dictionary size, and about 1.6% smaller with a 1 MB dictionary. Sizes in bytes (\"zstd, dict\" is the generic dictionary):\n\n- raw:              225,203,004\n- zstd, no dict:     40,032,770\n- zstd, dict:        35,659,064\n- zstd, dict deu:    35,249,785\n- zstd, dict deu 1M: 35,079,202\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcldellow%2Fwarc-compression","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcldellow%2Fwarc-compression","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcldellow%2Fwarc-compression/lists"}