Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/miku/xmlcutty
Select elements from large XML files, fast.
- Host: GitHub
- URL: https://github.com/miku/xmlcutty
- Owner: miku
- License: gpl-3.0
- Created: 2015-11-12T13:34:59.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2023-09-11T11:51:25.000Z (over 1 year ago)
- Last Synced: 2024-06-20T02:05:44.884Z (7 months ago)
- Topics: xml
- Language: Go
- Homepage:
- Size: 90.8 KB
- Stars: 53
- Watchers: 5
- Forks: 5
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
======

> The game ain't in me no more. [None of it](https://www.youtube.com/watch?v=h7yf8Vp2KAI&feature=youtu.be&t=1m46s).

xmlcutty is a simple tool for carving out elements from *large* XML files,
*fast*. Since it works in a streaming fashion, it uses almost no memory and
can process around 1G of XML per minute.

Why? [Background](http://stackoverflow.com/q/33653844/89391).
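
The streaming idea can be illustrated with a minimal Go sketch built on the standard `encoding/xml` tokenizer. It is not xmlcutty's actual source; the names `countMatches` and `countInString` are made up for this example. The sketch only keeps the stack of currently open element names in memory, which is why memory use stays flat regardless of input size.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// countMatches reads r token by token and counts the elements whose path of
// local names (for example "/a/b") equals target. Only the stack of open
// element names is held in memory, never the whole document.
// Hypothetical helper for illustration, not part of xmlcutty.
func countMatches(r io.Reader, target string) (int, error) {
	dec := xml.NewDecoder(r)
	var stack []string
	n := 0
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			// Push the local name; any namespace prefix is ignored.
			stack = append(stack, t.Name.Local)
			if "/"+strings.Join(stack, "/") == target {
				n++
			}
		case xml.EndElement:
			if len(stack) > 0 {
				stack = stack[:len(stack)-1]
			}
		}
	}
}

// countInString is a small convenience wrapper for in-memory input.
func countInString(doc, target string) (int, error) {
	return countMatches(strings.NewReader(doc), target)
}

func main() {
	n, err := countInString("<a><b>1</b><b>2</b><c><b>3</b></c></a>", "/a/b")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(n)
}
```

In real use the reader would be a file or standard input rather than a string.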
Install
-------

Use a deb or rpm [release](https://github.com/miku/xmlcutty/releases). It's in
[AUR](https://aur.archlinux.org/packages/?K=xmlcutty), too.

Or install with the go tool:

    $ go install github.com/miku/xmlcutty/cmd/xmlcutty@latest

Usage
-----

```sh
$ cat fixtures/sample.xml
```

Options:

```sh
$ xmlcutty -h
Usage of xmlcutty:
  -path string
        select path (default "/")
  -rename string
        rename wrapper element to this name
  -root string
        synthetic root element
  -v    show version
```

It *looks* a bit like [XPath](https://en.wikipedia.org/wiki/XPath), but it really
is only a simple matcher.

```sh
$ xmlcutty -path /a fixtures/sample.xml
```

You specify a path, e.g. `/a/b` and all elements matching this path are printed:

```sh
$ xmlcutty -path /a/b fixtures/sample.xml
```

You can end up with an XML document without a root. To make tools like
[xmllint](http://xmlsoft.org/xmllint.html) happy, you can add a
synthetic root element on the fly:

```sh
$ xmlcutty -root hello -path /a/b fixtures/sample.xml | xmllint --format -
```
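
To show what selecting a path and adding a synthetic root amounts to, here is a rough Go sketch. It is an illustration under assumptions, not the tool's real implementation: `cut`, `cutString` and the `element` type are hypothetical names, and for brevity the sketch drops attributes on the matched element.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"os"
	"strings"
)

// element captures the raw XML nested inside a matched element.
type element struct {
	Inner string `xml:",innerxml"`
}

// cut writes every element found at path to w. If root is non-empty, the
// output is wrapped in a synthetic <root>...</root> pair so that the result
// is a single-rooted document. Attributes of the matched element are dropped
// in this sketch. Hypothetical helper, not part of xmlcutty.
func cut(r io.Reader, w io.Writer, path, root string) error {
	dec := xml.NewDecoder(r)
	var stack []string
	if root != "" {
		fmt.Fprintf(w, "<%s>", root)
	}
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			stack = append(stack, t.Name.Local)
			if "/"+strings.Join(stack, "/") != path {
				continue
			}
			var el element
			// DecodeElement consumes everything up to and including the
			// matching end tag, so the stack is popped by hand afterwards.
			if err := dec.DecodeElement(&el, &t); err != nil {
				return err
			}
			fmt.Fprintf(w, "<%s>%s</%s>", t.Name.Local, el.Inner, t.Name.Local)
			stack = stack[:len(stack)-1]
		case xml.EndElement:
			if len(stack) > 0 {
				stack = stack[:len(stack)-1]
			}
		}
	}
	if root != "" {
		fmt.Fprintf(w, "</%s>", root)
	}
	return nil
}

// cutString runs cut on an in-memory document and returns the output.
func cutString(doc, path, root string) (string, error) {
	var sb strings.Builder
	err := cut(strings.NewReader(doc), &sb, path, root)
	return sb.String(), err
}

func main() {
	doc := "<a><b>x</b><b><c>y</c></b></a>"
	if err := cut(strings.NewReader(doc), os.Stdout, "/a/b", "hello"); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
	}
}
```

Without the root argument the two `b` elements would be written back to back, which is exactly the rootless output that tools like xmllint reject.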
Rename the wrapper element, that is, the last element of the matching path:
```sh
$ xmlcutty -rename beee -path /a/b fixtures/sample.xml
```
All options combined, a synthetic root element and a renamed path element:
```sh
$ xmlcutty -root hi -rename ceee -path /a/b/c fixtures/sample.xml | xmllint --format -
```
It will parse XML files without a root element just fine.
```sh
$ head fixtures/oai.xml
oai:arXiv.org:0704.0004
2007-05-23
math
A determinant of Stirling cycle numbers counts ...
text
http://arxiv.org/abs/0704.0004
...
```

This is an example XML response from a web service. We can slice out the
identifier elements. Note that any namespace - here `oai_dc` - is completely
ignored for the sake of simplicity:

```sh
$ cat fixtures/oai.xml | xmlcutty -root x -path /record/metadata/dc/identifier \
    | xmllint --format -
http://arxiv.org/abs/0704.0004
http://arxiv.org/abs/0704.0010
http://arxiv.org/abs/0704.0012
```

We can go a bit further and extract the element's text, which is like a poor man's
`text()` in XPath terms. By using a newline as the argument to rename, we
effectively get rid of the enclosing XML tag:

```sh
$ cat fixtures/oai.xml | xmlcutty -rename '\n' -path /record/metadata/dc/identifier \
| grep -v "^$"
http://arxiv.org/abs/0704.0004
http://arxiv.org/abs/0704.0010
http://arxiv.org/abs/0704.0012
```

This last feature is nice for quickly extracting text from large XML files.

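
The same kind of text extraction, with namespaces ignored and no root element required, can be approximated in Go as follows. This is only a sketch: `texts` and `textString` are hypothetical names, and the sample document is an assumed reconstruction of an OAI record, not the actual contents of `fixtures/oai.xml`.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// texts returns the character data directly inside every element at path.
// Paths are compared on local names only, so namespace prefixes such as
// "oai_dc:" or "dc:" play no role. Hypothetical helper, not part of xmlcutty.
func texts(r io.Reader, path string) ([]string, error) {
	dec := xml.NewDecoder(r)
	var stack, out []string
	var buf strings.Builder
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		current := "/" + strings.Join(stack, "/")
		switch t := tok.(type) {
		case xml.StartElement:
			stack = append(stack, t.Name.Local)
		case xml.CharData:
			// Copy right away: the bytes are only valid until the next Token call.
			if current == path {
				buf.Write(t)
			}
		case xml.EndElement:
			if current == path {
				out = append(out, strings.TrimSpace(buf.String()))
				buf.Reset()
			}
			if len(stack) > 0 {
				stack = stack[:len(stack)-1]
			}
		}
	}
}

// textString joins the extracted values with newlines, one per match.
func textString(doc, path string) (string, error) {
	vals, err := texts(strings.NewReader(doc), path)
	return strings.Join(vals, "\n"), err
}

func main() {
	// Assumed sample: two records with no enclosing root element.
	doc := `<record><metadata><oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/"><dc:identifier>http://arxiv.org/abs/0704.0004</dc:identifier></oai_dc:dc></metadata></record>` +
		`<record><metadata><oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/"><dc:identifier>http://arxiv.org/abs/0704.0010</dc:identifier></oai_dc:dc></metadata></record>`
	out, err := textString(doc, "/record/metadata/dc/identifier")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Only text directly inside the matched element is collected here; text in nested child elements is skipped.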
## Misc/Citations
* [Enabling Massive XML-Based Biological Data Management in HBase](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8712548)