{"id":14155655,"url":"https://github.com/internetarchive/cdx-summary","last_synced_at":"2025-05-07T16:25:08.241Z","repository":{"id":37036731,"uuid":"423597450","full_name":"internetarchive/cdx-summary","owner":"internetarchive","description":"Summarize web archive capture index (CDX) files.","archived":false,"fork":false,"pushed_at":"2022-07-29T19:17:23.000Z","size":232,"stargazers_count":67,"open_issues_count":1,"forks_count":14,"subscribers_count":20,"default_branch":"main","last_synced_at":"2025-05-03T05:08:17.308Z","etag":null,"topics":["archive","cdx","collection","nodejs","python","report","statistics","summary","warc","web-archive","webcomponents"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/internetarchive.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-11-01T19:52:28.000Z","updated_at":"2025-04-11T00:58:02.000Z","dependencies_parsed_at":"2022-06-25T11:03:31.048Z","dependency_job_id":null,"html_url":"https://github.com/internetarchive/cdx-summary","commit_stats":null,"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/internetarchive%2Fcdx-summary","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/internetarchive%2Fcdx-summary/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/internetarchive%2Fcdx-summary/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/internetarchive%2Fcdx-summary/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/o
wners/internetarchive","download_url":"https://codeload.github.com/internetarchive/cdx-summary/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252913961,"owners_count":21824286,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archive","cdx","collection","nodejs","python","report","statistics","summary","warc","web-archive","webcomponents"],"created_at":"2024-08-17T08:04:45.690Z","updated_at":"2025-05-07T16:25:08.219Z","avatar_url":"https://github.com/internetarchive.png","language":"Python","funding_links":[],"categories":["statistics"],"sub_categories":[],"readme":"# CDX Summary\n\nSummarize web archive capture index (CDX) files.\n\n## Installation\n\n```\n$ pip install cdxsummary\n```\n\nAlternatively, install from the source.\n\n```\n$ python3 setup.py install\n```\n\nTo run the tool as a one-off Docker container, build the image as follows, which will place the `cdxsummary` executable as the entrypoint script of the container.\n\n```\n$ docker image build -t cdxsummary .\n$ docker container run -it --rm cdxsummary\n```\n\n## Features\n\n* Summarize local CDX files or remote ones over HTTP\n* Handle `gz` and `bz2` compression seamlessly\n* Handle CDX data input to `STDIN` from a pipe\n* Support [Internet Archive Petabox web item](https://archive.org/services/docs/api/items.html) summarization\n* Support [Wayback Machine CDX Server API](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server) summarization\n* Seamless authorization to Internet Archive via the [`ia` CLI 
tool](https://archive.org/services/docs/api/internetarchive/quickstart.html#configuring)\n* Human-friendly summary by default, but support summarized or detailed JSON reports\n* Self-aware, as the input can be a previously generated JSON report in place of CDX data\n* Summary includes:\n  * An overview of numbers of captures, consecutive unique URIs, unique hosts, accumulated WARC records size, and the first and last datetimes\n  * A grid of media types and status codes and their respective capture counts\n  * A grid of path and query segment length and their respective capture counts\n  * A grid of year and month and their respective capture counts\n  * Top-N (configurable) hosts and their capture counts\n  * A random sample of N (configurable) memento URIs for `200 OK` HTML pages\n\n## Usage\n\n```\n$ cdxsummary --help\nusage: cdxsummary [-h] [-a [QUERY]] [-i] [-j] [-l] [-o [FILE]] [-r] [-s [N]] [-t [N]] [-v] [input]\n\nSummarize web archive capture index (CDX) files.\n\npositional arguments:\n  input                 CDX file path/URL (plain/gz/bz2) or an IA item ID to process (reads from the STDIN, if empty or '-')\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -a [QUERY], --api [QUERY]\n                        CDX API query parameters (default: 'matchType=exact'), treats the last argument as the lookup URL\n  -i, --item            Treat the input argument as a Petabox item identifier instead of a file path\n  -j, --json            Generate summary in JSON format\n  -l, --load            Load JSON report instead of CDX\n  -o [FILE], --out [FILE]\n                        Write output to the given file (default: STDOUT)\n  -r, --report          Generate non-summarized JSON report\n  -s [N], --samples [N]\n                        Number of sample memento URLs in summary (default: 10)\n  -t [N], --tophosts [N]\n                        Number of hosts with maximum captures in summary (default: 10)\n  -v, --version         Show 
version number\n```\n\n## Sample Output\n\n### Plain Text Summary\n\n\u003cdetails\u003e\n  \u003csummary\u003e$ cdxsummary sample.cdx.gz\u003c/summary\u003e\n\n```\n             CDX Overview             \n ──────────────────────────────────── \n Total Captures in CDX         74,460 \n Consecutive Unique URLs       71,599 \n Consecutive Unique Hosts      12,133 \n Total WARC Records Size      10.2 GB \n First Memento Date       Mar 18 2021 \n Last Memento Date        Mar 18 2021 \n ──────────────────────────────────── \n\n     MIME Type and Status Code Distribution      \n ─────────────────────────────────────────────── \n MIME          2XX    3XX   4XX 5XX Other  TOTAL \n ─────────────────────────────────────────────── \n HTML       25,853  8,419 6,138 177     1 40,588 \n Image       9,337      8    39   0     0  9,384 \n CSS         4,027      0     0   0     0  4,027 \n JavaScript  4,219      0     0   0     0  4,219 \n JSON          192      1    24   1     0    218 \n XML           463      9    80  13     0    565 \n Text        5,729    185   128   5     0  6,047 \n PDF         3,282     12     1   0     0  3,295 \n Font           83      0     0   0     0     83 \n Audio           7      0     0   0     0      7 \n Video          36      0     0   0     0     36 \n Other       1,250  4,443   270  28     0  5,991 \n ─────────────────────────────────────────────── \n TOTAL      54,478 13,077 6,680 224     1 74,460 \n ─────────────────────────────────────────────── \n\n            Path and Query Segments            \n ───────────────────────────────────────────── \n Path      Q0    Q1    Q2  Q3  Q4 Other  TOTAL \n ───────────────────────────────────────────── \n P0     3,625   296    52  38  19    13  4,043 \n P1    22,874 1,309   625 151  48   110 25,117 \n P2    12,790 1,357   624 173 190    84 15,218 \n P3     9,558   809   231 110  61   113 10,882 \n P4     5,770   694   150  30  16   126  6,786 \n Other  8,515 3,375   252  36  94   142 12,414 \n 
───────────────────────────────────────────── \n TOTAL 63,132 7,840 1,934 538 428   588 74,460 \n ───────────────────────────────────────────── \n\n             Year and Month Distribution             \n ─────────────────────────────────────────────────── \n Year 01 02     03 04 05 06 07 08 09 10 11 12  TOTAL \n ─────────────────────────────────────────────────── \n 2021  0  0 74,460  0  0  0  0  0  0  0  0  0 74,460 \n ─────────────────────────────────────────────────── \n\n   Top 10 Out of 12,133 Hosts    \n ─────────────────────────────── \n Host                   Captures \n ─────────────────────────────── \n cdc.gov                     550 \n facebook.com                508 \n sec.gov                     476 \n youtube.com                 382 \n fws.gov                     374 \n twitter.com                 370 \n census.gov                  317 \n online.star.bnl.gov         298 \n biomarkers.nlm.nih.gov      289 \n cancer.gov                  248 \n ─────────────────────────────── \n OTHERS (12,123 Hosts)    70,648 \n ─────────────────────────────── \n\n       Random Sample of 10 OK HTML Mementos       \n ────────────────────────────────────────────────\n * https://web.archive.org/web/20210318000647/https://www.anl.gov/argonne-impacts\n * https://web.archive.org/web/20210318000929/http://www.usarmyjrotc.com/instructor/automation/jcims.php\n * https://web.archive.org/web/20210318000243/https://loc.gov/help/\n * https://web.archive.org/web/20210318000148/http://gp2.pawg.cap.gov/group-2-squadrons/reading-composite-sqdn-811\n * https://web.archive.org/web/20210318001600/https://era.nih.gov/help-tutorials/iedison\n * https://web.archive.org/web/20210318000451/https://www.ftc.gov/policy/hearings-competition-consumer-protection\n * https://web.archive.org/web/20210318000124/https://asap.gov/\n * https://web.archive.org/web/20210318001530/https://espfl.epa.gov/secondary/dataMap\n * 
https://web.archive.org/web/20210318000510/https://roundme.com/embed/ro6VYzBNE5vePdZ3xyph\n * https://web.archive.org/web/20210318000510/https://prevention.cancer.gov/news-and-events/videos-and-webinars\n```\n\u003c/details\u003e\n\n### JSON Summary\n\n\u003cdetails\u003e\n  \u003csummary\u003e$ cdxsummary --json sample.cdx.gz\u003c/summary\u003e\n\n```\n$ cdxsummary --json sample.cdx.gz\n{\n  \"captures\": 74460,\n  \"urls\": 71599,\n  \"hosts\": 12133,\n  \"bytes\": 10237687828,\n  \"first\": \"20210318000104\",\n  \"last\": \"20210318003748\",\n  \"tophosts\": {\n    \"cdc.gov\": 550,\n    \"facebook.com\": 508,\n    \"sec.gov\": 476,\n    \"youtube.com\": 382,\n    \"fws.gov\": 374,\n    \"twitter.com\": 370,\n    \"census.gov\": 317,\n    \"online.star.bnl.gov\": 298,\n    \"biomarkers.nlm.nih.gov\": 289,\n    \"cancer.gov\": 248\n  },\n  \"mimestatus\": {\n    \"HTML\": {\n      \"2XX\": 25853,\n      \"3XX\": 8419,\n      \"4XX\": 6138,\n      \"5XX\": 177,\n      \"Other\": 1\n    },\n    \"Image\": {\n      \"2XX\": 9337,\n      \"3XX\": 8,\n      \"4XX\": 39,\n      \"5XX\": 0,\n      \"Other\": 0\n    },\n    \"CSS\": {\n      \"2XX\": 4027,\n      \"3XX\": 0,\n      \"4XX\": 0,\n      \"5XX\": 0,\n      \"Other\": 0\n    },\n    \"JavaScript\": {\n      \"2XX\": 4219,\n      \"3XX\": 0,\n      \"4XX\": 0,\n      \"5XX\": 0,\n      \"Other\": 0\n    },\n    \"JSON\": {\n      \"2XX\": 192,\n      \"3XX\": 1,\n      \"4XX\": 24,\n      \"5XX\": 1,\n      \"Other\": 0\n    },\n    \"XML\": {\n      \"2XX\": 463,\n      \"3XX\": 9,\n      \"4XX\": 80,\n      \"5XX\": 13,\n      \"Other\": 0\n    },\n    \"Text\": {\n      \"2XX\": 5729,\n      \"3XX\": 185,\n      \"4XX\": 128,\n      \"5XX\": 5,\n      \"Other\": 0\n    },\n    \"PDF\": {\n      \"2XX\": 3282,\n      \"3XX\": 12,\n      \"4XX\": 1,\n      \"5XX\": 0,\n      \"Other\": 0\n    },\n    \"Font\": {\n      \"2XX\": 83,\n      \"3XX\": 0,\n      \"4XX\": 0,\n      \"5XX\": 0,\n      \"Other\": 
0\n    },\n    \"Audio\": {\n      \"2XX\": 7,\n      \"3XX\": 0,\n      \"4XX\": 0,\n      \"5XX\": 0,\n      \"Other\": 0\n    },\n    \"Video\": {\n      \"2XX\": 36,\n      \"3XX\": 0,\n      \"4XX\": 0,\n      \"5XX\": 0,\n      \"Other\": 0\n    },\n    \"Revisit\": {\n      \"2XX\": 0,\n      \"3XX\": 0,\n      \"4XX\": 0,\n      \"5XX\": 0,\n      \"Other\": 0\n    },\n    \"Other\": {\n      \"2XX\": 1250,\n      \"3XX\": 4443,\n      \"4XX\": 270,\n      \"5XX\": 28,\n      \"Other\": 0\n    }\n  },\n  \"pathquery\": {\n    \"P0\": {\n      \"Q0\": 3625,\n      \"Q1\": 296,\n      \"Q2\": 52,\n      \"Q3\": 38,\n      \"Q4\": 19,\n      \"Other\": 13\n    },\n    \"P1\": {\n      \"Q0\": 22874,\n      \"Q1\": 1309,\n      \"Q2\": 625,\n      \"Q3\": 151,\n      \"Q4\": 48,\n      \"Other\": 110\n    },\n    \"P2\": {\n      \"Q0\": 12790,\n      \"Q1\": 1357,\n      \"Q2\": 624,\n      \"Q3\": 173,\n      \"Q4\": 190,\n      \"Other\": 84\n    },\n    \"P3\": {\n      \"Q0\": 9558,\n      \"Q1\": 809,\n      \"Q2\": 231,\n      \"Q3\": 110,\n      \"Q4\": 61,\n      \"Other\": 113\n    },\n    \"P4\": {\n      \"Q0\": 5770,\n      \"Q1\": 694,\n      \"Q2\": 150,\n      \"Q3\": 30,\n      \"Q4\": 16,\n      \"Other\": 126\n    },\n    \"Other\": {\n      \"Q0\": 8515,\n      \"Q1\": 3375,\n      \"Q2\": 252,\n      \"Q3\": 36,\n      \"Q4\": 94,\n      \"Other\": 142\n    }\n  },\n  \"yearmonth\": {\n    \"2021\": {\n      \"01\": 0,\n      \"02\": 0,\n      \"03\": 74460,\n      \"04\": 0,\n      \"05\": 0,\n      \"06\": 0,\n      \"07\": 0,\n      \"08\": 0,\n      \"09\": 0,\n      \"10\": 0,\n      \"11\": 0,\n      \"12\": 0\n    }\n  },\n  \"samples\": [\n    [\n      \"20210318000647\",\n      \"https://www.anl.gov/argonne-impacts\"\n    ],\n    [\n      \"20210318000929\",\n      \"http://www.usarmyjrotc.com/instructor/automation/jcims.php\"\n    ],\n    [\n      \"20210318000243\",\n      \"https://loc.gov/help/\"\n    ],\n    [\n      
\"20210318000148\",\n      \"http://gp2.pawg.cap.gov/group-2-squadrons/reading-composite-sqdn-811\"\n    ],\n    [\n      \"20210318001600\",\n      \"https://era.nih.gov/help-tutorials/iedison\"\n    ],\n    [\n      \"20210318000451\",\n      \"https://www.ftc.gov/policy/hearings-competition-consumer-protection\"\n    ],\n    [\n      \"20210318000124\",\n      \"https://asap.gov/\"\n    ],\n    [\n      \"20210318001530\",\n      \"https://espfl.epa.gov/secondary/dataMap\"\n    ],\n    [\n      \"20210318000510\",\n      \"https://roundme.com/embed/ro6VYzBNE5vePdZ3xyph\"\n    ],\n    [\n      \"20210318000510\",\n      \"https://prevention.cancer.gov/news-and-events/videos-and-webinars\"\n    ]\n  ]\n}\n```\n\u003c/details\u003e\n\n## Testing\n\nAn [interactive test interface](https://internetarchive.github.io/cdx-summary/webcomponent/) is available for the Web Component that renders the JSON summary.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finternetarchive%2Fcdx-summary","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Finternetarchive%2Fcdx-summary","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finternetarchive%2Fcdx-summary/lists"}