https://github.com/rcarmo/python-webarchive
Create WebKit/Safari .webarchive files on any platform
- Host: GitHub
- URL: https://github.com/rcarmo/python-webarchive
- Owner: rcarmo
- License: mit
- Created: 2017-05-07T14:06:43.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2020-02-04T14:59:51.000Z (over 6 years ago)
- Last Synced: 2025-04-03T15:47:55.415Z (about 1 year ago)
- Topics: asyncio, python3, webarchive
- Language: Python
- Homepage:
- Size: 7.81 KB
- Stars: 44
- Watchers: 5
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# python-webarchive
This is a quick hack demonstrating how to create WebKit/Safari `.webarchive` files, inspired by [pocket-archive-stream][pas].
## Usage
```bash
TARGET_URL=http://foo.com python3 main.py
```
## Why `.webarchive`?
`.webarchive` is the native web page archive format on the Mac, and is essentially a serialized snapshot of Safari/WebKit state. On a Mac, these files are Spotlight-indexable and can be opened by just about anything that takes a "webpage" as input.
Despite the rising prominence of [WARC][warc] as the standard web archiving format (which to this day requires plug-ins to be viewable in a browser), I quite like `.webarchive`. I built this both to demonstrate how to use the format and to have a minimally viable archive creator I can deploy as a service.
## Anatomy of a `.webarchive` file
The file format is a nested binary `.plist`, with roughly the following structure:
```text
{
    "WebMainResource": {
        "WebResourceURL": String(),
        "WebResourceMIMEType": String(),
        "WebResourceResponse": NSKeyedArchiver(NSObject),
        "WebResourceData": Bytes(),
        "WebResourceTextEncodingName": String(optional=True)
    },
    "WebSubresources": [
        {item, item, item...}
    ]
}
```
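To make the layout above concrete, here is a minimal sketch that reads an archive back and summarizes it. It uses the standard library's `plistlib` rather than `biplist`, and `summarize_webarchive` is a hypothetical helper name, not something from this repository. It also assumes each entry in `WebSubresources` uses the same `WebResourceURL` key as the main resource.

```python
import plistlib


def summarize_webarchive(raw: bytes) -> dict:
    """Summarize a .webarchive given its raw bytes (hypothetical helper)."""
    # plistlib.loads auto-detects binary vs. XML property lists.
    archive = plistlib.loads(raw)
    main = archive["WebMainResource"]
    # Assumption: subresources are dicts keyed like the main resource.
    subresources = archive.get("WebSubresources", [])
    return {
        "url": main["WebResourceURL"],
        "mime_type": main["WebResourceMIMEType"],
        # Optional key, so fall back to None when it is absent.
        "encoding": main.get("WebResourceTextEncodingName"),
        "main_size": len(main["WebResourceData"]),
        "subresource_urls": [item.get("WebResourceURL") for item in subresources],
    }
```

To try it on a file saved from Safari, read the file in binary mode (`open(path, "rb").read()`) and pass the bytes to the function.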
So creating a `.webarchive` turns out to be fairly straightforward if you simply build a `dict` with the right structure and then serialize it using [`biplist`][biplist] (which works on any platform).
The only hitch would be `WebResourceResponse` (which uses a [rather more complex way][nska] to encode the HTTP response headers), but fortunately that key appears not to be necessary at all.
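The approach described above can be sketched in a few lines. This is not the repository's `main.py`: it swaps in the standard library's `plistlib` (which can write binary plists) for `biplist` so the example has no dependencies, and `make_resource` and `build_webarchive` are hypothetical helper names. `WebResourceResponse` is left out, on the assumption stated above that it is not required.

```python
import plistlib


def make_resource(url, mime_type, data, encoding=None):
    """Build one resource dict (hypothetical helper)."""
    resource = {
        "WebResourceURL": url,
        "WebResourceMIMEType": mime_type,
        "WebResourceData": data,  # bytes are stored as plist <data>
    }
    if encoding:
        # Optional key, only meaningful for text resources.
        resource["WebResourceTextEncodingName"] = encoding
    return resource


def build_webarchive(main, subresources=()):
    """Serialize the nested dict as a binary plist and return the bytes."""
    archive = {
        "WebMainResource": main,
        "WebSubresources": list(subresources),
    }
    return plistlib.dumps(archive, fmt=plistlib.FMT_BINARY)
```

Writing the returned bytes to a file with a `.webarchive` extension (opened in `"wb"` mode) should then give something Safari can open, although this sketch has not been checked against every WebKit version.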
## Next Steps
* [ ] Tie this into [pocket-archive-stream][pas]
* [ ] Convert to/from [WARC][warc]
* [ ] Look into integrating with [warcprox][warcprox]
[biplist]: https://bitbucket.org/wooster/biplist
[pas]: https://github.com/pirate/pocket-archive-stream
[warc]: https://en.wikipedia.org/wiki/Web_ARChive
[warcprox]: https://github.com/internetarchive/warcprox
[nska]: https://www.mac4n6.com/blog/2016/1/1/manual-analysis-of-nskeyedarchiver-formatted-plist-files-a-review-of-the-new-os-x-1011-recent-items