{"id":15461120,"url":"https://github.com/machawk1/optiwarc","last_synced_at":"2025-09-08T07:37:11.412Z","repository":{"id":68435935,"uuid":"176576534","full_name":"machawk1/optiwarc","owner":"machawk1","description":"GitHub import of https://bitbucket.org/tari/optiwarc/","archived":false,"fork":false,"pushed_at":"2019-03-19T18:40:26.000Z","size":432,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-24T09:45:19.681Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/machawk1.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-03-19T18:38:06.000Z","updated_at":"2019-05-20T07:37:40.000Z","dependencies_parsed_at":"2023-02-21T01:15:29.001Z","dependency_job_id":null,"html_url":"https://github.com/machawk1/optiwarc","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machawk1%2Foptiwarc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machawk1%2Foptiwarc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machawk1%2Foptiwarc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/machawk1%2Foptiwarc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/machawk1","download_url":"https://codeload.github.com/machawk1/optiwarc/tar.gz/refs/heads/master","ho
st":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246634388,"owners_count":20809232,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-01T23:40:48.425Z","updated_at":"2025-04-01T11:34:42.310Z","avatar_url":"https://github.com/machawk1.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"Tools for deduplicating WARC files. Largely based on the tools\nbuilt by the [Bibliotheca Alexandrina](https://github.com/arcalex) for this\npurpose, but modified for easier (or correct) use. See\n[Youssef\nEldakar's](http://netpreserve.org/sites/default/files/attachments/2015_IIPC-GA_Slides_16b_Eldakar.pdf)\ndescription of the tools for a little more information.\n\n## Fixes/Changes\n\n`warcsum` mostly worked correctly, but used `int`s in a number of places where\nthey are inappropriate, mainly when dealing with file offsets. If you never use\na WARC larger than 2GB, there's no problem. My changes are just hacks to make it\nwork; it needs some cleanup.\n\n`warccollres` has been reimplemented completely, since the arcalex\nimplementation seems to assume existing archive infrastructure so it can simply\nmake HTTP requests for WARC members, finding the URLs from a mysql database. I\ndon't have any of that, so my version (`warccollres.py`) uses a sqlite database\nand reads WARC files directly.\n\n`warcrefs` suffered from the same problem as `warcsum` where it used 32-bit file\noffsets. 
I've probably fixed that, but haven't tested it, and it might be horrible.\n\nThe `warc` package for Python was designed with Python 2 in mind, so I had to\nhack it to work on Python 3. It might be better to replace it with `warcat`, but that\nhas its own issues (mostly extremely slow reading).\n\n## Use\n\nFirst build everything:\n\n```\n$ cd gzmulti\n$ ./configure \u0026\u0026 make\n$ sudo make install\n$ cd ../warcsum\n$ ./configure \u0026\u0026 make\n$ sudo make install\n$ cd ../warcrefs\n$ mvn package\n```\n\nThe following libraries will likely be required, in addition to a working C\ncompiler (install `build-essential` on Debian-like Linux distributions):\n\n * zlib (zlib1g-dev package on Debian, Ubuntu or similar Linux distributions)\n * openssl (libssl-dev)\n * mysqlclient (libmysqlclient-dev)\n * curl (libcurl4-openssl-dev)\n * libconfig (libconfig-dev)\n\nYou may need to add the library install directory to your load path so warcsum\nand friends can find gzmulti:\n\n    export LD_LIBRARY_PATH=/usr/local/lib\n\nNow you can dedup. If you want to combine smaller WARCs into a big one, try\n`megawarc`.\n\nFirst generate hash digests of your input file's contents. We'll assume the\ninput is `mega.warc.gz`:\n\n    warcsum -i mega.warc.gz -o mega.warcsum\n\nThen run `warccollres` to determine which hash collisions are identical records,\nand which are not. This can take a while.\n\n    python3 warccollres.py mega.warcsum\n\nIt will import the digest file into a database, then go to work finding\nduplicates. If interrupted while finding duplicates, you can resume where it\nleft off by running `warccollres.py` without any arguments.\n\nWhen finished, it will write the manifest back out with duplicate pointers to\n`warccollres.txt`.  
Then we run `warcrefs` to rewrite the WARC.\n\n    java -jar warcrefs/target/warcrefs-1.0-SNAPSHOT-jar-with-dependencies.jar \\\n        8129 warccollres.txt `pwd`\n\n...and hopefully that's it.\n\n---\n\nFor performance reasons `warccollres` uses\n[`mysqlclient`](https://github.com/PyMySQL/mysqlclient-python), which implies\nyou need a database server for it. Configure connection information in\n`MYSQL_PARAMS`. You'll want to ensure the database server is configured for\nreasonable performance. In particular, ensure `innodb_buffer_pool_size` is\nlarge, ideally at least as large as your dataset so it can all live in RAM.\n\nThe code will automatically create indexes after importing data, but somebody\nmore skilled with SQL than I am might be able to improve that.\n\nYou can use sqlite too, but the queries in the code will need rewriting since\nthe SQL dialects are incompatible, and sqlite tends to be painfully slow when\nwriting.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmachawk1%2Foptiwarc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmachawk1%2Foptiwarc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmachawk1%2Foptiwarc/lists"}