https://github.com/miku/xmlcutty

Select elements from large XML files, fast.

README
======

> The game ain't in me no more. [None of it](https://www.youtube.com/watch?v=h7yf8Vp2KAI&feature=youtu.be&t=1m46s).

xmlcutty is a simple tool for carving out elements from *large* XML files,
*fast*. Since it works in a streaming fashion, it uses almost no memory and
can process around 1 GB of XML per minute.

Why? [Background](http://stackoverflow.com/q/33653844/89391).

Install
-------

Use a deb or rpm [release](https://github.com/miku/xmlcutty/releases). It's in
[AUR](https://aur.archlinux.org/packages/?K=xmlcutty), too.

Or install with the go tool:

    $ go install github.com/miku/xmlcutty/cmd/xmlcutty@latest

Usage
-----

```sh
$ cat fixtures/sample.xml
<a>
    <b>
        <c></c>
    </b>
    <b>
        <c></c>
    </b>
</a>
```

Options:

```sh
$ xmlcutty -h
Usage of xmlcutty:
-path string
select path (default "/")
-rename string
rename wrapper element to this name
-root string
synthetic root element
-v show version
```

It *looks* a bit like [XPath](https://en.wikipedia.org/wiki/XPath), but it really
is only a simple matcher.

```sh
$ xmlcutty -path /a fixtures/sample.xml
<a>
    <b>
        <c></c>
    </b>
    <b>
        <c></c>
    </b>
</a>
```

You specify a path, e.g. `/a/b`, and all elements matching this path are printed:

```sh
$ xmlcutty -path /a/b fixtures/sample.xml
<b>
        <c></c>
    </b>
<b>
        <c></c>
    </b>
```

You can end up with an XML document without a root. To make tools like
[xmllint](http://xmlsoft.org/xmllint.html) happy, you can add a
synthetic root element on the fly:

```sh
$ xmlcutty -root hello -path /a/b fixtures/sample.xml | xmllint --format -






```

Rename the wrapper element, that is, the last element of the matching path:

```sh
$ xmlcutty -rename beee -path /a/b fixtures/sample.xml

```

All options combined, with a synthetic root element and a renamed path element:

```sh
$ xmlcutty -root hi -rename ceee -path /a/b/c fixtures/sample.xml | xmllint --format -


```

It will parse XML files without a root element just fine.

```sh
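$ head fixtures/oai.xml
<record>
<header>
<identifier>oai:arXiv.org:0704.0004</identifier>
<datestamp>2007-05-23</datestamp>
<setSpec>math</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>A determinant of Stirling cycle numbers counts ...</dc:title>
<dc:type>text</dc:type>
<dc:identifier>http://arxiv.org/abs/0704.0004</dc:identifier>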
...
```

This is an example XML response from a web service. We can slice out the
identifier elements. Note that any namespace - here `oai_dc` - is completely
ignored for the sake of simplicity:

```sh
$ cat fixtures/oai.xml | xmlcutty -root x -path /record/metadata/dc/identifier \
| xmllint --format -

http://arxiv.org/abs/0704.0004
http://arxiv.org/abs/0704.0010
http://arxiv.org/abs/0704.0012

```

We can go a bit further and extract the text of an element, which is like a poor
man's `text()` in XPath terms. By using a newline as the argument to rename, we
effectively get rid of the enclosing XML tag:

```sh
$ cat fixtures/oai.xml | xmlcutty -rename '\n' -path /record/metadata/dc/identifier \
| grep -v "^$"
http://arxiv.org/abs/0704.0004
http://arxiv.org/abs/0704.0010
http://arxiv.org/abs/0704.0012
```

This last feature is handy for quickly extracting text from large XML files.

## Misc/Citations

* [Enabling Massive XML-Based Biological Data Management in HBase](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8712548)