https://github.com/raynardj/xdwarf
Mining XML to tabular data, FAAAAAAST
- Host: GitHub
- URL: https://github.com/raynardj/xdwarf
- Owner: raynardj
- License: agpl-3.0
- Created: 2022-03-01T08:29:08.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2022-03-02T02:23:14.000Z (almost 4 years ago)
- Last Synced: 2025-03-02T08:42:50.093Z (10 months ago)
- Topics: html, mining, pmc, pyo3, rust, xml, xpath
- Language: Jupyter Notebook
- Homepage:
- Size: 6.75 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# xdwarf
Mining XML to tabular data, FAAAAAAST
[PyPI](https://pypi.org/project/xdwarf)
[Tests](https://github.com/raynardj/xdwarf/actions/workflows/test.yml)

## Install
```shell
pip install xdwarf
```
The library is a wrapper around `rust_dwarf`, a Rust-based mining tool.
## Mining
```python
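from xdwarf import Dwarf  # assumed import path; not shown in the original README

# Find files by glob pattern, name the project "PMC", and use all but 2 CPUs
dwarf = Dwarf.from_glob("../test/data/*.xml", "PMC", -2)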
```
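The comment above reads `-2` as "all CPUs minus 2". As a rough illustration of that convention only, here is a minimal standard-library sketch. `resolve_workers` is a hypothetical helper, not part of xdwarf, and the rule it applies (counting back from the total, never going below 1) is an assumption rather than documented library behavior.

```python
import glob
import os
from typing import Optional


def resolve_workers(n_jobs: int, total: Optional[int] = None) -> int:
    """Hypothetical helper: map a possibly negative count to a worker count."""
    total = total or os.cpu_count() or 1
    if n_jobs > 0:
        # A positive value is taken literally, capped at the CPU count.
        return min(n_jobs, total)
    # Zero or negative: count back from all CPUs, keeping at least one worker.
    return max(1, total + n_jobs)


# Expand the same glob pattern as above into a concrete, sorted file list.
files = sorted(glob.glob("../test/data/*.xml"))
print(len(files), "files,", resolve_workers(-2), "workers")
```

On an 8-core machine this sketch would resolve `-2` to 6 workers.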
Define the mining details as xpath query patterns. Multistage mining is supported by chaining stages.
```python
dwarf.find_one('article-meta > article-id[pub-id-type=pmid]' , "pmid")
dwarf.find_one("abstract", "abstract").find_many("p", "paragraph")
# mining stages can be chained to reach deeper details
reference = dwarf.find_one("ref-list", "ref_list").find_many("ref","reference")
reference.find_one("pub-id[pub-id-type=pmid]", "ref_id")
reference.find_one("pub-id[pub-id-type=doi]", "doi")
ref_name = reference.find_many("name", "ref_name")
ref_name.find_one("surname", "ref_surname")
```
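To make the parent and child stages concrete, the sketch below does a comparable extraction on a tiny inline document using only the standard library's `xml.etree.ElementTree`. It is not xdwarf's implementation and does not call its API. The sample XML, the `text_of` and `mine` helpers, and the row layout are all assumptions made for illustration. The one-match and many-match stages are mimicked with `find` and `findall`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, loosely shaped like a PMC article.
SAMPLE = """
<article>
  <front><article-meta>
    <article-id pub-id-type="pmid">111</article-id>
  </article-meta></front>
  <back><ref-list>
    <ref>
      <pub-id pub-id-type="pmid">222</pub-id>
      <name><surname>Lee</surname></name>
    </ref>
    <ref>
      <pub-id pub-id-type="doi">10.1/x</pub-id>
      <name><surname>Kim</surname></name>
      <name><surname>Roy</surname></name>
    </ref>
  </ref-list></back>
</article>
"""


def text_of(node, path):
    """Return the text of the first match under node, or None if absent."""
    found = node.find(path)
    return found.text if found is not None else None


def mine(xml_text):
    """Build one row per <ref>, carrying the article-level pmid along."""
    root = ET.fromstring(xml_text.strip())
    pmid = text_of(root, ".//article-meta/article-id[@pub-id-type='pmid']")
    rows = []
    for ref in root.findall(".//ref-list/ref"):  # one-to-many stage
        rows.append({
            "pmid": pmid,
            "ref_id": text_of(ref, ".//pub-id[@pub-id-type='pmid']"),
            "doi": text_of(ref, ".//pub-id[@pub-id-type='doi']"),
            "ref_surnames": [s.text for s in ref.findall(".//name/surname")],
        })
    return rows


rows = mine(SAMPLE)
for row in rows:
    print(row)
```

Each one-to-many stage produces its own set of rows that link back to the parent record, which is why the reference data ends up in a separate child table.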
```python
dwarf.set_necessary("pmid")
dwarf.create_children()
```
Start mining
```python
result = dwarf()
```
View the result
```python
result.child_df().head(2)
```
View a child result
```python
result['ref_list'].child_df().head()
```