{"id":17720209,"url":"https://github.com/miku/xmlcutty","last_synced_at":"2025-04-11T09:14:31.186Z","repository":{"id":57562738,"uuid":"46054972","full_name":"miku/xmlcutty","owner":"miku","description":"Select elements from large XML files, fast.","archived":false,"fork":false,"pushed_at":"2023-09-11T11:51:25.000Z","size":93,"stargazers_count":53,"open_issues_count":2,"forks_count":5,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-06-20T02:05:44.884Z","etag":null,"topics":["xml"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/miku.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-11-12T13:34:59.000Z","updated_at":"2024-02-26T09:56:19.000Z","dependencies_parsed_at":"2024-06-20T01:38:44.431Z","dependency_job_id":"fc7e1bd4-2f5b-4f30-850f-9255d3a32922","html_url":"https://github.com/miku/xmlcutty","commit_stats":null,"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fxmlcutty","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fxmlcutty/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fxmlcutty/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fxmlcutty/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/miku","download_url":"https://codeload.github.com/miku/xmlcutty/tar.gz/refs/heads/master","host":{"name":"GitH
ub","url":"https://github.com","kind":"github","repositories_count":248365663,"owners_count":21091823,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["xml"],"created_at":"2024-10-25T15:26:31.553Z","updated_at":"2025-04-11T09:14:31.153Z","avatar_url":"https://github.com/miku.png","language":"Go","funding_links":[],"categories":["Go"],"sub_categories":[],"readme":"README\n======\n\n\u003e The game ain't in me no more. [None of it](https://www.youtube.com/watch?v=h7yf8Vp2KAI\u0026feature=youtu.be\u0026t=1m46s).\n\nxmlcutty is a simple tool for carving out elements from *large* XML files,\n*fast*. Since it works in a streaming fashion, it uses almost no memory and\ncan process around 1G of XML per minute.\n\nWhy? [Background](http://stackoverflow.com/q/33653844/89391).\n\nInstall\n-------\n\nUse a deb or rpm [release](https://github.com/miku/xmlcutty/releases). 
It's in\n[AUR](https://aur.archlinux.org/packages/?K=xmlcutty), too.\n\nOr install with the go tool:\n\n    $ go install github.com/miku/xmlcutty/cmd/xmlcutty@latest\n\nUsage\n-----\n\n```sh\n$ cat fixtures/sample.xml\n\u003ca\u003e\n    \u003cb\u003e\n        \u003cc\u003e\u003c/c\u003e\n    \u003c/b\u003e\n    \u003cb\u003e\n        \u003cc\u003e\u003c/c\u003e\n    \u003c/b\u003e\n\u003c/a\u003e\n```\n\nOptions:\n\n```sh\n$ xmlcutty -h\nUsage of xmlcutty:\n  -path string\n        select path (default \"/\")\n  -rename string\n        rename wrapper element to this name\n  -root string\n        synthetic root element\n  -v    show version\n```\n\nIt *looks* a bit like [XPath](https://en.wikipedia.org/wiki/XPath), but it really\nis only a simple matcher.\n\n```sh\n$ xmlcutty -path /a fixtures/sample.xml\n\u003ca\u003e\n    \u003cb\u003e\n        \u003cc\u003e\u003c/c\u003e\n    \u003c/b\u003e\n    \u003cb\u003e\n        \u003cc\u003e\u003c/c\u003e\n    \u003c/b\u003e\n\u003c/a\u003e\n```\n\nYou specify a path, e.g. `/a/b` and all elements matching this path are printed:\n\n```sh\n$ xmlcutty -path /a/b fixtures/sample.xml\n\u003cb\u003e\n    \u003cc\u003e\u003c/c\u003e\n\u003c/b\u003e\n\u003cb\u003e\n    \u003cc\u003e\u003c/c\u003e\n\u003c/b\u003e\n```\n\nYou can end up with an XML document without a root. 
To make tools like\n[xmllint](http://xmlsoft.org/xmllint.html) happy, you can add a\nsynthetic root element on the fly:\n\n```sh\n$ xmlcutty -root hello -path /a/b fixtures/sample.xml | xmllint --format -\n\u003c?xml version=\"1.0\"?\u003e\n\u003chello\u003e\n    \u003cb\u003e\n        \u003cc\u003e\u003c/c\u003e\n    \u003c/b\u003e\n    \u003cb\u003e\n        \u003cc\u003e\u003c/c\u003e\n    \u003c/b\u003e\n\u003c/hello\u003e\n```\n\nRename wrapper element - that is the last element of the matching path:\n\n```sh\n$ xmlcutty -rename beee -path /a/b fixtures/sample.xml\n\u003cbeee\u003e\n    \u003cc\u003e\u003c/c\u003e\n\u003c/beee\u003e\n\u003cbeee\u003e\n    \u003cc\u003e\u003c/c\u003e\n\u003c/beee\u003e\n```\n\nAll options, synthetic root element and a renamed path element:\n\n```sh\n$ xmlcutty -root hi -rename ceee -path /a/b/c fixtures/sample.xml | xmllint --format -\n\u003c?xml version=\"1.0\"?\u003e\n\u003chi\u003e\n    \u003cceee/\u003e\n    \u003cceee/\u003e\n\u003c/hi\u003e\n```\n\nIt will parse XML files without a root element just fine.\n\n```sh\n$ head fixtures/oai.xml\n\u003crecord\u003e\n    \u003cheader\u003e\n        \u003cidentifier\u003eoai:arXiv.org:0704.0004\u003c/identifier\u003e\n        \u003cdatestamp\u003e2007-05-23\u003c/datestamp\u003e\n        \u003csetSpec\u003emath\u003c/setSpec\u003e\n    \u003c/header\u003e\n    \u003cmetadata\u003e\n        \u003coai_dc:dc xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\"... \u003e\n            \u003cdc:title\u003eA determinant of Stirling cycle numbers counts ...\n            \u003cdc:type\u003etext\u003c/dc:type\u003e\n            \u003cdc:identifier\u003ehttp://arxiv.org/abs/0704.0004\u003c/dc:identifier\u003e\n...\n```\n\nThis is an example XML response from a web service. We can slice out the\nidentifier elements. 
Note that any namespace - here `oai_dc` - is completely\nignored for the sake of simplicity:\n\n```sh\n$ cat fixtures/oai.xml | xmlcutty -root x -path /record/metadata/dc/identifier \\\n                       | xmllint --format -\n\u003c?xml version=\"1.0\"?\u003e\n\u003cx\u003e\n    \u003cidentifier\u003ehttp://arxiv.org/abs/0704.0004\u003c/identifier\u003e\n    \u003cidentifier\u003ehttp://arxiv.org/abs/0704.0010\u003c/identifier\u003e\n    \u003cidentifier\u003ehttp://arxiv.org/abs/0704.0012\u003c/identifier\u003e\n\u003c/x\u003e\n```\n\nWe can go a bit further and extract the element's text, which is like a poor man's\n`text()` in XPath terms. By using a newline as the argument to `-rename`, we\neffectively get rid of the enclosing XML tag:\n\n```sh\n$ cat fixtures/oai.xml | xmlcutty -rename '\\n' -path /record/metadata/dc/identifier \\\n                       | grep -v \"^$\"\nhttp://arxiv.org/abs/0704.0004\nhttp://arxiv.org/abs/0704.0010\nhttp://arxiv.org/abs/0704.0012\n```\n\nThis last feature is handy for quickly extracting text from large XML files.\n\n## Misc/Citations\n\n* [Enabling Massive XML-Based Biological Data Management in HBase](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8712548)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiku%2Fxmlcutty","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmiku%2Fxmlcutty","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiku%2Fxmlcutty/lists"}