{"id":16358454,"url":"https://github.com/rcarmo/python-webarchive","last_synced_at":"2025-07-21T12:03:21.557Z","repository":{"id":138321749,"uuid":"90536971","full_name":"rcarmo/python-webarchive","owner":"rcarmo","description":"Create WebKit/Safari .webarchive files on any platform","archived":false,"fork":false,"pushed_at":"2020-02-04T14:59:51.000Z","size":8,"stargazers_count":44,"open_issues_count":0,"forks_count":4,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-03T15:47:55.415Z","etag":null,"topics":["asyncio","python3","webarchive"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rcarmo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-05-07T14:06:43.000Z","updated_at":"2024-01-12T03:39:57.000Z","dependencies_parsed_at":null,"dependency_job_id":"af6842e9-09d8-4911-b619-641c318b783e","html_url":"https://github.com/rcarmo/python-webarchive","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/rcarmo/python-webarchive","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rcarmo%2Fpython-webarchive","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rcarmo%2Fpython-webarchive/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rcarmo%2Fpython-webarchive/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rcarmo%2Fpython-webarchive/manifests","owner_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/owners/rcarmo","download_url":"https://codeload.github.com/rcarmo/python-webarchive/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rcarmo%2Fpython-webarchive/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266296760,"owners_count":23907012,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-21T11:47:31.412Z","response_time":64,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asyncio","python3","webarchive"],"created_at":"2024-10-11T02:05:51.378Z","updated_at":"2025-07-21T12:03:21.534Z","avatar_url":"https://github.com/rcarmo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# python-webarchive\n\nThis is a quick hack demonstrating how to create WebKit/Safari `.webarchive` files, inspired by [pocket-archive-stream][pas].\n\n## Usage\n\n```bash\nTARGET_URL=http://foo.com python3 main.py\n```\n\n## Why `.webarchive`?\n\n`.webarchive` is the native web page archive format on the Mac, and is essentially a serialized snapshot of Safari/WebKit state. 
On a Mac, these files are Spotlight-indexable and can be opened by just about anything that takes a \"webpage\" as input.\n\nDespite the rising prominence of [WARC][warc] as the standard web archiving format (which to this day requires plug-ins to be viewable in a browser) I quite like `.webarchive`, and built this in order to both demonstrate how to use it and have a minimally viable archive creator I can deploy as a service.\n\n## Anatomy of a `.webarchive` file\n\nThe file format is a nested binary `.plist`, with roughly the following structure:\n\n```json\n{\n    \"WebMainResource\": {\n        \"WebResourceURL\": String(),\n        \"WebResourceMIMEType\": String(),\n        \"WebResourceResponse\": NSKeyedArchiver(NSObject),\n        \"WebResourceData\": Bytes(),\n        \"WebResourceTextEncodingName\": String(optional=True)\n    },\n    \"WebSubresources\": [\n        {item, item, item...}\n    ]\n\n}\n```\n\nSo creating a `.webarchive` turns out to be fairly straightforward if you simply build a `dict` with the right structure and then serialize it using [`biplist`][biplist] (which works on any platform).\n\nThe only hitch would be `WebResourceResponse` (which uses a [rather more complex way][nska] to encode the HTTP response headers), but fortunately that appears not to be necessary at all.\n\n## Next Steps\n\n* [ ] Tie this into [pocket-archive-stream][pas]\n* [ ] Convert to/from [WARC][warc]\n* [ ] Look into integrating with [warcprox][warcprox]\n\n[biplist]: https://bitbucket.org/wooster/biplist\n[pas]: https://github.com/pirate/pocket-archive-stream\n[warc]: https://en.wikipedia.org/wiki/Web_ARChive\n[warcprox]: https://github.com/internetarchive/warcprox\n[nska]: 
https://www.mac4n6.com/blog/2016/1/1/manual-analysis-of-nskeyedarchiver-formatted-plist-files-a-review-of-the-new-os-x-1011-recent-items\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frcarmo%2Fpython-webarchive","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frcarmo%2Fpython-webarchive","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frcarmo%2Fpython-webarchive/lists"}