{"id":13589731,"url":"https://github.com/Florents-Tselai/WarcDB","last_synced_at":"2025-04-08T09:34:01.594Z","repository":{"id":37239792,"uuid":"497566500","full_name":"Florents-Tselai/WarcDB","owner":"Florents-Tselai","description":"WarcDB: Web crawl data as SQLite databases.","archived":false,"fork":false,"pushed_at":"2024-07-13T11:24:29.000Z","size":54223,"stargazers_count":393,"open_issues_count":8,"forks_count":11,"subscribers_count":10,"default_branch":"main","last_synced_at":"2024-11-05T05:50:24.354Z","etag":null,"topics":["cli","crawling","database","sqlite","warc","web-archiving","web-data"],"latest_commit_sha":null,"homepage":"https://WarcDB.tselai.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Florents-Tselai.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-05-29T11:09:36.000Z","updated_at":"2024-10-19T21:50:48.000Z","dependencies_parsed_at":"2023-10-30T16:34:03.147Z","dependency_job_id":"ef28c653-ff8b-42af-9f34-0bcff7b655d6","html_url":"https://github.com/Florents-Tselai/WarcDB","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Florents-Tselai%2FWarcDB","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Florents-Tselai%2FWarcDB/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Florents-Tselai%2FWarcDB/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Florents-Tselai
%2FWarcDB/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Florents-Tselai","download_url":"https://codeload.github.com/Florents-Tselai/WarcDB/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223124287,"owners_count":17091184,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","crawling","database","sqlite","warc","web-archiving","web-data"],"created_at":"2024-08-01T16:00:33.605Z","updated_at":"2024-11-06T09:31:26.381Z","avatar_url":"https://github.com/Florents-Tselai.png","language":"Python","funding_links":["https://img.shields.io/static/v1?label=Sponsor\u0026message=%E2%9D%A4\u0026logo=GitHub\u0026link=https://github.com/sponsors/Florents-Tselai/","https://github.com/sponsors/Florents-Tselai/"],"categories":["Python","cli","Web Archiving"],"sub_categories":["Analysis \u0026 Data Processing"],"readme":"# WarcDB: Web crawl data as SQLite databases.\n\n[![PyPI](https://img.shields.io/pypi/v/warcdb.svg)](https://pypi.org/project/warcdb/)\n[![Tests](https://github.com/Florents-Tselai/WarcDB/actions/workflows/run-tests.yaml/badge.svg?branch=main)](https://github.com/Florents-Tselai/WarcDB/actions/workflows/run-tests.yaml)\n[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/Florents-Tselai/WarcDB/blob/main/LICENSE)\n![GitHub Stars](https://img.shields.io/github/stars/Florents-Tselai/WarcDB)\n[![Linkedin](https://img.shields.io/badge/LinkedIn-0077B5?logo=linkedin\u0026logoColor=white)](https://www.linkedin.com/in/florentstselai/)\n[![Github 
Sponsors](https://img.shields.io/static/v1?label=Sponsor\u0026message=%E2%9D%A4\u0026logo=GitHub\u0026link=https://github.com/sponsors/Florents-Tselai/)](https://github.com/sponsors/Florents-Tselai/)\n\n\n`WarcDB` is an `SQLite`-based file format that makes web crawl data easier to share and query.\n\nIt is based on the standardized [Web ARChive format](https://en.wikipedia.org/wiki/Web_ARChive),\nused by web archives, and defined in [ISO 28500:2017](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/).\n\n## Usage\n\n```shell\npip install warcdb\n```\n\n```shell\n# Load the `archive.warcdb` file with data.\nwarcdb import archive.warcdb ./tests/google.warc ./tests/frontpages.warc.gz \"https://tselai.com/data/google.warc\"\n\nwarcdb enable-fts ./archive.warcdb response payload\n\n# Search for records that mention \"stocks\" in their response body\nwarcdb search ./archive.warcdb response \"stocks\" -c \"WARC-Record-ID\"\n```\nAs you can see, you can use any mix of local/remote and raw/compressed archives.\n\nFor example, to get part of the [Common Crawl January 2022 Crawl Archive](https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-05/index.html) in a streaming fashion:\n\n```shell\nwarcdb import archive.warcdb \"https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-05/segments/1642320306346.64/warc/CC-MAIN-20220128212503-20220129002503-00719.warc.gz\"\n```\n\nYou can also import WARC files contained in [WACZ](https://specs.webrecorder.net/wacz/latest) files, which are created by tools like [ArchiveWeb.Page](https://archiveweb.page), [Browsertrix-Crawler](https://github.com/webrecorder/browsertrix-crawler), and [Scoop](https://github.com/harvard-lil/scoop).\n\n```shell\nwarcdb import archive.warcdb tests/scoop.wacz\n```\n\n## How It Works\n\nIndividual `.warc` files are read and parsed and their data is inserted into an SQLite database with the relational schema seen below.\n\n## Schema\n\nIf there is a new major or minor version of warcdb 
you may need to migrate existing databases to use the new database schema (if there have been any changes). To do this you first upgrade warcdb, and then import into the database, which will make sure all migrations have been run. If you want to migrate the database explicitly you can:\n\n```shell\nwarcdb migrate archive.warcdb\n```\n\nIf there are no migrations to run the `migrate` command will do nothing.\n\nHere's the relational schema of the `.warcdb` file.\n\n![WarcDB Schema](schema.png)\n\n### Views\n\nIn addition to the core tables that map to the WARC record types there are also helper *views* that make it a bit easier to query data:\n\n#### v_request_http_header\n\nA view of HTTP headers in WARC request records:\n\n| Column Name    | Column Type | Description                                                              |\n| -------------- | ----------- | ----------------------------------------------------------------------   |\n| warc_record_id | text        | The WARC-Record-Id for the *request* record that it was extracted from.  |\n| name           | text        | The lowercased HTTP header name (e.g. content-type)                      |\n| value          | text        | The HTTP header value (e.g. text/html)                                   |\n\n#### v_response_http_header\n\nA view of HTTP headers in WARC response records:\n\n| Column Name    | Column Type | Description                                                              |\n| -------------- | ----------- | ----------------------------------------------------------------------   |\n| warc_record_id | text        | The WARC-Record-Id for the *response* record that it was extracted from. |\n| name           | text        | The lowercased HTTP header name (e.g. content-type)                      |\n| value          | text        | The HTTP header value (e.g. 
text/html)                                   |\n\n## Motivation\n\nFrom the `WARC` [formal specification](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/):\n\n\u003e The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects),\n\u003e each consisting of a set of simple text headers and an arbitrary data block into one long file.\n\nMany organizations, such as Common Crawl, Webrecorder, and Archive.org, as well as libraries around the world, use the `warc` format\nto archive and store web data.\n\nThe full datasets of these services are on the order of a few pebibytes (PiB),\nmaking them impractical to query using non-distributed systems.\n\nThis project aims to make **subsets** of such data easier to access and query using SQL.\n\nCurrently, this is implemented on top of SQLite and is a wrapper around the\nexcellent [SQLite-Utils](https://sqlite-utils.datasette.io/en/stable/) utility.\n\nHere, \"wrapper\" means that all\nexisting `sqlite-utils` [CLI commands](https://sqlite-utils.datasette.io/en/stable/cli-reference.html)\ncan be called as expected, like\n\n```shell\nsqlite-utils \u003ccommand\u003e archive.warcdb\n```\nor\n```shell\nwarcdb \u003ccommand\u003e example.warcdb\n```\n\n## Examples\n\n### Populate with `wget`\n\n```shell\nwget --warc-file tselai \"https://tselai.com\"\n\nwarcdb import archive.warcdb tselai.warc.gz\n```\n\n### Get all response headers\n\n```shell\nsqlite3 archive.warcdb \u003c\u003cSQL\nselect  json_extract(h.value, '$.header') as header, \n        json_extract(h.value, '$.value') as value\nfrom response,\n     json_each(http_headers) h\nSQL\n```\n\n### Get Cookie Headers for requests and responses\n```shell\nsqlite3 archive.warcdb \u003c\u003cSQL\nselect json_extract(h.value, '$.header') as header, json_extract(h.value, '$.value') as value\nfrom response,\n     json_each(http_headers) h\nwhere json_extract(h.value, '$.header') like '%Cookie%'\nunion\nselect json_extract(h.value, 
'$.header') as header, json_extract(h.value, '$.value') as value\nfrom request,\n     json_each(http_headers) h\nwhere json_extract(h.value, '$.header') like '%Cookie%'\nSQL\n```\n\n## Develop\n\nYou can use poetry to install dependencies and run the tests:\n\n```\n$ git clone https://github.com/Florents-Tselai/WarcDB.git\n$ cd WarcDB\n$ poetry install\n$ poetry run pytest\n```\n\nThen when you are ready to publish to PyPI:\n\n```\n$ poetry publish --build\n```\n\n## Resources on WARC\n\n* [The stack: An introduction to the WARC file](https://archive-it.org/blog/post/the-stack-warc-file/)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFlorents-Tselai%2FWarcDB","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FFlorents-Tselai%2FWarcDB","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFlorents-Tselai%2FWarcDB/lists"}