{"id":15818444,"url":"https://github.com/edsu/warcdb-me","last_synced_at":"2025-04-01T05:13:59.917Z","repository":{"id":193604060,"uuid":"689148732","full_name":"edsu/warcdb-me","owner":"edsu","description":"Convert WARC files to SQLite ","archived":false,"fork":false,"pushed_at":"2023-10-16T15:57:28.000Z","size":50979,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-10-12T06:34:51.171Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/edsu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-09-08T23:45:21.000Z","updated_at":"2023-10-16T16:34:12.000Z","dependencies_parsed_at":"2023-09-09T01:54:55.542Z","dependency_job_id":"984bffc3-35c3-44f3-83a5-b91bddc584f0","html_url":"https://github.com/edsu/warcdb-me","commit_stats":null,"previous_names":["edsu/warcdb"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edsu%2Fwarcdb-me","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edsu%2Fwarcdb-me/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edsu%2Fwarcdb-me/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edsu%2Fwarcdb-me/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/edsu","download_url":"https://codeload.github.com/edsu/warcdb-me/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246586036,"
owners_count":20801028,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-05T06:02:00.686Z","updated_at":"2025-04-01T05:13:59.900Z","avatar_url":"https://github.com/edsu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# warcdb\n\n[![Build Status](https://github.com/edsu/warcdb/actions/workflows/test.yml/badge.svg)](https://github.com/edsu/warcdb/actions/workflows/test.yml)\n\nThe [WARC] file format is used extensively in web archiving software, but most people aren't familiar with it because crawling and replay tools tend to hide the nitty gritty details of how to use it. This can make it somewhat difficult to use WARC data directly in analysis and research where you want to query and interact with the collected web archive data. *warcdb* tries to make analyzing WARC data easier by allowing you to import it into an [SQLite] database, and then letting you use SQL and other SQLite tools like [Datasette] to analyze the WARC data.\n\n## Install\n\n```bash\n$ pip install warcdb\n```\n\n## Command Line Usage\n\nOnce installed you should have a *warcdb* utility available on the command line. 
*warcdb* takes a few subcommands that let you interact with the database.\n\n### add\n\nCreate a warcdb database with the default name `warc.db` in the current working directory using two WARC files:\n\n```bash\n$ warcdb add warc1.warc.gz warc2.warc.gz\n```\n\nCreate a warcdb database with a specific name and location:\n\n```bash\n$ warcdb --db /path/to/my/warcdb/archive.sqlite3 add warc1.warc.gz\n```\n\nAdd another WARC file to an existing database with the same `add` command:\n\n```bash\n$ warcdb add warc3.warc.gz\n```\n\n### list\n\nList the WARC files that have been added to a database (looking in the current working directory for `warc.db`):\n\n```bash\n$ warcdb list\n```\n\nOr list the WARC files in a specific database:\n\n```bash\n$ warcdb --db /path/to/my/warcdb/archive.sqlite3 list\n```\n\n## The Database\n\nYou can use any tool that can interact with SQLite to query the data. For\nexample, you can use the `sqlite3` command line tool to print out the database\nschema:\n\n```bash\n$ sqlite3 warc.db\n```\n\n### Schema\n\n```sqlite\nsqlite\u003e .schema\nCREATE TABLE [files] (\n   [id] INTEGER PRIMARY KEY,\n   [filename] TEXT,\n   [created] TEXT\n);\nCREATE TABLE [records] (\n   [id] INTEGER PRIMARY KEY,\n   [file_id] INTEGER,\n   [warc_record_id] TEXT,\n   [warc_type] TEXT,\n   [warc_content_length] INTEGER,\n   [warc_date] TEXT,\n   [warc_concurrent_to] TEXT,\n   [warc_block_digest] TEXT,\n   [warc_payload_digest] TEXT,\n   [warc_ip_address] TEXT,\n   [warc_refers_to] TEXT,\n   [warc_refers_to_target_uri] TEXT,\n   [warc_refers_to_date] TEXT,\n   [warc_target_uri] TEXT,\n   [warc_truncated] TEXT,\n   [warc_warcinfo_id] TEXT,\n   [warc_filename] TEXT,\n   [warc_profile] TEXT,\n   [warc_identified_payload_type] TEXT,\n   [warc_segment_number] TEXT,\n   [warc_segment_origin_id] TEXT,\n   [warc_segment_total_length] INTEGER,\n   [http_status] TEXT,\n   [http_content_type] TEXT,\n   [http_server] TEXT,\n   [http_date] TEXT,\n   
[http_content_length] TEXT,\n   [http_headers] TEXT,\n   [http_payload] BLOB,\n   [warcio_offset] INTEGER,\n   [warcio_length] INTEGER\n);\n```\n\nAll WARC records are stored in the same table, but which columns are populated depends on the `warc_type`, which corresponds to the record's `WARC-Type`: request, response, warcinfo, etc. See the [WARC Specification] for details.\n\nMost of the columns in the `records` table come directly from the [WARC Specification]. Some HTTP headers have also been extracted to make the data easier to use; these columns are prefixed with `http_`. The `http_payload` column holds the payload of the HTTP response. The `warcio_offset` and `warcio_length` columns contain integers that reference where in the original WARC data the record can be found.\n\n### Queries\n\nFor example, you could run a SQL query to find the most common `Content-Type` values among the server responses in the database:\n\n```\nsqlite\u003e SELECT http_content_type, COUNT(*) AS total\n   ...\u003e FROM records\n   ...\u003e WHERE warc_type = 'response'\n   ...\u003e GROUP BY http_content_type\n   ...\u003e ORDER BY total DESC;\n   \nhttp_content_type                         total\n----------------------------------------  -----\ntext/html; charset=UTF-8                  116\nimage/jpeg                                99\nimage/gif                                 92\napplication/json; charset=UTF-8           16\ntext/javascript                           14\napplication/json; charset=utf-8           11\napplication/json+protobuf; charset=UTF-8  10\napplication/octet-stream                  7\nvideo/mp4                                 5\nimage/webp                                5\napplication/javascript                    5\n                                          5\ntext/html; charset=iso-8859-1             4\ntext/plain                                3\ntext/html; charset=utf-8                  3\napplication/x-chrome-extension            3\ntext/css                
                  1\nimage/x-icon                              1\nfont/woff2                                1\napplication/rss+xml                       1\n```\n\n### HTTP Headers\n\nThe `http_headers` column contains a JSON object that has all the headers sent or received in requests and responses respectively. You can use [SQLite's JSON functions] to select out headers of interest. For example if you wanted to inspect the `Last-Modified` headers that were sent in responses from a server:\n\n```sqlite\nsqlite\u003e SELECT http_headers -\u003e '$.last-modified' AS last_modified\n   ...\u003e FROM records\n   ...\u003e WHERE warc_type = 'response' AND last_modified IS NOT NULL; \n   \nlast_modified\n-------------------------------\n\"Wed, 17 Jul 2019 00:41:02 GMT\"\n\"Wed, 10 Oct 2018 17:49:21 GMT\"\n\"Mon, 11 Apr 2022 19:43:09 GMT\"\n\"Thu, 07 Apr 2022 20:36:58 GMT\"\n\"Thu, 05 May 2022 18:50:23 GMT\"\n\"Tue, 28 Jul 2020 19:50:19 GMT\"\n\"Mon, 28 Mar 2022 17:23:50 GMT\"\n\"Mon, 29 Mar 2021 22:42:38 GMT\"\n\"Wed, 20 Apr 2022 22:39:05 GMT\"\n\"Wed, 23 Mar 2022 16:40:40 GMT\"\n\"Tue, 14 May 2019 19:41:29 GMT\"\n\"Fri, 24 Jun 2022 20:44:18 GMT\"\n\"Tue, 14 May 2019 19:41:29 GMT\"\n\"Fri, 24 Jun 2022 20:44:18 GMT\"\n\"Wed, 13 Apr 2022 21:02:38 GMT\"\n\"Sun, 17 May 1998 03:00:00 GMT\"\n\"Tue, 14 May 2019 19:41:29 GMT\"\n\"Fri, 24 Jun 2022 20:44:18 GMT\"\n```\n\n## Datasette\n\nOne useful way of exploring the database is to view it with [datasette] with some additional plugins enabled for viewing JSON and images.\n\n```\n$ pip install datasette datasette-render-images datasette-pretty-json\n$ datasette warc.db\n```\n\n[WARC Specification]: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/\n[SQLite's JSON functions]: https://www.sqlite.org/json1.html\n[Datasette]: https://datasette.io/\n[SQLite]: https://www.sqlite.org/index.html\n[WARC]: 
https://en.wikipedia.org/wiki/WARC_(file_format)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedsu%2Fwarcdb-me","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fedsu%2Fwarcdb-me","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedsu%2Fwarcdb-me/lists"}