https://github.com/raynardj/xdwarf

Mining XML to tabular data, FAAAAAAST

html mining pmc pyo3 rust xml xpath



Host: GitHub
URL: https://github.com/raynardj/xdwarf
Owner: raynardj
License: agpl-3.0
Created: 2022-03-01T08:29:08.000Z (almost 4 years ago)
Default Branch: main
Last Pushed: 2022-03-02T02:23:14.000Z (almost 4 years ago)
Last Synced: 2025-03-02T08:42:50.093Z (10 months ago)
Topics: html, mining, pmc, pyo3, rust, xml, xpath
Language: Jupyter Notebook
Homepage:
Size: 6.75 MB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


README

# xdwarf

Mining XML to tabular data, FAAAAAAST

[![PyPI version](https://img.shields.io/pypi/v/xdwarf)](https://pypi.org/project/xdwarf)

![Python version](https://img.shields.io/pypi/pyversions/xdwarf)

![License](https://img.shields.io/github/license/raynardj/xdwarf)

[![Test xdwarf](https://github.com/raynardj/xdwarf/actions/workflows/test.yml/badge.svg)](https://github.com/raynardj/xdwarf/actions/workflows/test.yml)

![PyPI Downloads](https://img.shields.io/pypi/dm/xdwarf)

## Install

```shell

pip install xdwarf

```

The library is a wrapper around `rust_dwarf`, a Rust-based mining tool.

## Mining

```python

from xdwarf import Dwarf  # assumed import path, not shown in the original

# find files by glob pattern, name the project "PMC", use all but 2 CPUs

dwarf = Dwarf.from_glob("../test/data/*.xml", "PMC", -2)

```
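The `-2` above is described as "all CPUs minus 2". As a rough illustration of that convention (a sketch, not xdwarf's actual code), the snippet below resolves such a value to a worker count and lists input files by glob pattern using only the standard library. `resolve_workers` and `list_inputs` are hypothetical helper names, and treating `0` as "all CPUs" is an assumption.

```python
import glob
import os
from typing import List, Optional


def resolve_workers(n_jobs: int, total: Optional[int] = None) -> int:
    """Map an n_jobs-style value to a worker count (hypothetical helper).

    Positive values are used as-is, capped at the number of CPUs.
    Zero or negative values mean "all CPUs plus n_jobs", so -2 means
    all but two. At least one worker is always returned.
    """
    if total is None:
        total = os.cpu_count() or 1
    if n_jobs > 0:
        return min(n_jobs, total)
    return max(1, total + n_jobs)


def list_inputs(pattern: str) -> List[str]:
    """Return the files matching a glob pattern, in sorted order."""
    return sorted(glob.glob(pattern))
```

For example, `resolve_workers(-2, total=8)` gives 6 workers, and a machine with only two CPUs still gets one worker.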

Define the mining details as xpath-style query patterns. Multi-stage mining is supported by chaining calls.

```python

dwarf.find_one("article-meta > article-id[pub-id-type=pmid]", "pmid")

dwarf.find_one("abstract", "abstract").find_many("p", "paragraph")

# mining stages can be chained to reach deeper details

reference = dwarf.find_one("ref-list", "ref_list").find_many("ref","reference")

reference.find_one("pub-id[pub-id-type=pmid]", "ref_id")

reference.find_one("pub-id[pub-id-type=doi]", "doi")

ref_name = reference.find_many("name", "ref_name")

ref_name.find_one("surname", "ref_surname")

```
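To make the idea of multi-stage mining concrete, here is a minimal sketch of the same article, reference, and name stages written with Python's standard `xml.etree.ElementTree`. It is not how xdwarf works internally and does not use its API. The sample document, the `mine` and `text_of` helpers, and the `ref_idx` key are all made up for illustration, and the queries use ElementTree's XPath subset rather than the query syntax shown above.

```python
import xml.etree.ElementTree as ET

# Tiny made-up document shaped loosely like a PMC article.
SAMPLE = """
<article>
  <front><article-meta>
    <article-id pub-id-type="pmid">111</article-id>
  </article-meta></front>
  <back><ref-list>
    <ref><pub-id pub-id-type="pmid">222</pub-id>
      <name><surname>Lee</surname></name></ref>
    <ref><pub-id pub-id-type="doi">10.1/x</pub-id>
      <name><surname>Kim</surname></name>
      <name><surname>Wu</surname></name></ref>
  </ref-list></back>
</article>
"""


def text_of(node, path):
    """Return the stripped text of the first match, or None if absent."""
    found = node.find(path)
    return found.text.strip() if found is not None and found.text else None


def mine(xml_text):
    """Return (article_rows, reference_rows, name_rows) as lists of dicts."""
    root = ET.fromstring(xml_text)
    # Stage 1: one row per document.
    pmid = text_of(root, ".//article-meta/article-id[@pub-id-type='pmid']")
    articles = [{"pmid": pmid}]
    references, names = [], []
    # Stage 2: one row per <ref> under the first <ref-list>.
    ref_list = root.find(".//ref-list")
    refs = ref_list.findall("ref") if ref_list is not None else []
    for ref_idx, ref in enumerate(refs):
        references.append({
            "pmid": pmid,
            "ref_idx": ref_idx,
            "ref_id": text_of(ref, ".//pub-id[@pub-id-type='pmid']"),
            "doi": text_of(ref, ".//pub-id[@pub-id-type='doi']"),
        })
        # Stage 3: one row per <name> inside this reference.
        for name in ref.findall(".//name"):
            names.append({
                "pmid": pmid,
                "ref_idx": ref_idx,
                "ref_surname": text_of(name, "surname"),
            })
    return articles, references, names
```

Each stage yields its own table, and child rows carry the parent keys (`pmid`, `ref_idx`) so the tables can be joined later.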

```python

dwarf.set_necessary("pmid")

dwarf.create_children()

```

Start mining

```python

result = dwarf()

```

View the result

```python

result.child_df().head(2)

```

View a child result

```python

result['ref_list'].child_df().head()

```
