https://github.com/mohammadraziei/pygixml
a python wrapper over pugixml
- Host: GitHub
- URL: https://github.com/mohammadraziei/pygixml
- Owner: MohammadRaziei
- License: other
- Created: 2024-01-28T15:43:10.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2026-04-03T21:18:29.000Z (about 2 months ago)
- Last Synced: 2026-04-03T23:19:22.325Z (about 2 months ago)
- Topics: cython, pugixml, xml, xml-parser, xpath
- Language: Python
- Homepage: https://mohammadraziei.github.io/pygixml/
- Size: 176 KB
- Stars: 20
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# pygixml

A high-performance XML parser for Python based on Cython and [pugixml](https://pugixml.org/), providing fast XML parsing, manipulation, XPath queries, text extraction, and advanced XML processing capabilities.
📚 **[View Full Documentation](https://mohammadraziei.github.io/pygixml/)**
## 🚀 Performance
In the project's parsing benchmark, pygixml is several times faster than both lxml and Python's built-in ElementTree:
### Performance Comparison (5000 XML elements)
| Library | Parsing Time | Speedup vs ElementTree |
|-----------------|--------------|------------------------|
| **pygixml** | 0.00077s | **15.9x faster** |
| **lxml** | 0.00407s | 3.0x faster |
| **ElementTree** | 0.01220s | 1.0x (baseline) |

### Key Performance Highlights
- **15.9x faster** than Python's ElementTree for XML parsing
- **5.3x faster** than lxml for XML parsing
- **Memory efficient** - uses pugixml's optimized C++ memory management
- **Scalable performance** - maintains speed advantage across different XML sizes
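
The speedup figures follow from the parsing times in the table. As a quick sanity check, the sketch below recomputes them from the rounded times shown above; `speedup` is a helper name chosen for this illustration, not part of pygixml.

```python
# Parsing times in seconds, copied from the comparison table above.
times = {
    "pygixml": 0.00077,
    "lxml": 0.00407,
    "ElementTree": 0.01220,
}


def speedup(baseline: str, candidate: str) -> float:
    """Return how many times faster `candidate` is than `baseline`."""
    return times[baseline] / times[candidate]


for name in times:
    print(f"{name}: {speedup('ElementTree', name):.1f}x vs ElementTree")

print(f"pygixml vs lxml: {speedup('lxml', 'pygixml'):.1f}x")
```

From the rounded times this gives about 15.8x over ElementTree and 5.3x over lxml. The table's 15.9x presumably comes from the unrounded measurements.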
## Installation
### From PyPI
```bash
pip install pygixml
```
### From GitHub
```bash
pip install git+https://github.com/MohammadRaziei/pygixml.git
```
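
To confirm that the installation succeeded, one option is to ask Python's packaging metadata for the installed version. This is a minimal sketch using only the standard library; `installed_version` is a helper name made up for this example.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("pygixml")
print(f"pygixml {version}" if version else "pygixml is not installed")
```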
### Supported XPath Features
- **Node selection**: `//book`, `/library/book`, `book[1]`
- **Attribute selection**: `book[@id]`, `book[@category='fiction']`
- **Boolean operations**: `and`, `or`, `not()`
- **Comparison operators**: `=`, `!=`, `<`, `>`, `<=`, `>=`
- **Mathematical operations**: `+`, `-`, `*`, `div`, `mod`
- **Functions**: `position()`, `last()`, `count()`, `sum()`, `string()`, `number()`
- **Axes**: `child::`, `attribute::`, `descendant::`, `ancestor::`
- **Wildcards**: `*`, `@*`, `node()`
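
As a rough illustration of the feature groups above, the sketch below runs one expression per group against a small document and reports how many nodes each one selects. The sample markup and the `count_matches` helper are made up for this example; only `parse_string`, `first_child`, and `select_nodes` come from the API described in this README. The demo is skipped when pygixml is not installed.

```python
import importlib.util

# One expression per feature group from the list above.
QUERIES = [
    "//book",                            # node selection at any depth
    "book[1]",                           # positional predicate
    "book[@category='fiction']",         # attribute comparison
    "book[year > 1930 and price < 12]",  # boolean and comparison operators
    "book[position() = last()]",         # functions
    "descendant::title",                 # explicit axis
    "*",                                 # wildcard
]


def count_matches(context, queries):
    """Map each XPath expression to the number of nodes it selects."""
    return {query: len(context.select_nodes(query)) for query in queries}


# Sample document; the markup is invented for this sketch.
SAMPLE = """
<library>
  <book id="1" category="fiction"><title>A</title><year>1925</year><price>12.99</price></book>
  <book id="2" category="fiction"><title>B</title><year>1949</year><price>10.99</price></book>
</library>
"""

if importlib.util.find_spec("pygixml") is not None:
    import pygixml

    doc = pygixml.parse_string(SAMPLE)  # keep the document referenced
    root = doc.first_child()
    for query, n in count_matches(root, QUERIES).items():
        print(f"{query!r}: {n} match(es)")
else:
    print("pygixml is not installed; skipping the demo")
```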
## API Overview
### Core Classes
- **XMLDocument**: Create, parse, save XML documents
- **XMLNode**: Navigate and manipulate XML nodes
- **XMLAttribute**: Handle XML attributes
- **XPathQuery**: Compile and execute XPath queries
- **XPathNode**: Result of XPath queries (wraps nodes and attributes)
- **XPathNodeSet**: Collection of XPath results
### Key Methods
#### XMLDocument Methods
- `parse_string(xml_string)` - Parse XML from string
- `parse_file(file_path)` - Parse XML from file
- `save_file(file_path)` - Save XML to file
- `append_child(name)` - Add child node
- `first_child()` - Get first child node
- `child(name)` - Get child by name
- `reset()` - Clear document
#### XMLNode Methods
- `name` - Get/set node name
- `value` - Get/set node value (for text nodes only)
- `child_value(name)` - Get text content of child node
- `append_child(name)` - Add child node
- `first_child()` - Get first child
- `child(name)` - Get child by name
- `next_sibling` - Get next sibling
- `previous_sibling` - Get previous sibling
- `parent` - Get parent node
- `text(recursive, join)` - Get text content
- `to_string(indent)` - Serialize to XML string
- `xml` - XML representation property
- `xpath` - Absolute XPath of node
- `is_null()` - Check if node is null
- `mem_id` - Memory identifier for debugging
#### XPath Methods
- `select_nodes(query)` - Select multiple nodes using XPath
- `select_node(query)` - Select single node using XPath
- `XPathQuery(query)` - Create reusable XPath query object
- `evaluate_node_set(context)` - Evaluate query and return node set
- `evaluate_node(context)` - Evaluate query and return first node
- `evaluate_boolean(context)` - Evaluate query and return boolean
- `evaluate_number(context)` - Evaluate query and return number
- `evaluate_string(context)` - Evaluate query and return string
## Quick Start
```python
import pygixml
# Parse XML from string
xml_string = """
<library>
    <book>
        <title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author>
        <year>1925</year>
    </book>
</library>
"""
doc = pygixml.parse_string(xml_string)
root = doc.first_child()
# Access elements
book = root.first_child()
title = book.child("title")
print(f"Title: {title.child_value()}") # Output: Title: The Great Gatsby
# Create new XML
doc = pygixml.XMLDocument()
root = doc.append_child("catalog")
product = root.append_child("product")
product.name = "product"
# To add text content to an element, append a text node
text_node = product.append_child("") # Empty name creates text node
text_node.value = "content"
```
## Advanced Features
### Text Content Extraction
```python
import pygixml
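# Tag names inside <nested> and <mixed> are illustrative
xml_string = """
<root>
    <simple>Hello World</simple>
    <nested>
        <child>Child Text</child>
        <child>More text</child>
    </nested>
    <mixed>Text <b>with</b> mixed content</mixed>
</root>
"""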
doc = pygixml.parse_string(xml_string)
root = doc.first_child()
# Get direct text content
simple = root.child("simple")
print(simple.child_value()) # "Hello World"
# Get recursive text content
nested = root.child("nested")
print(nested.text(recursive=True)) # "Child Text\nMore text"
# Get direct text only (non-recursive)
mixed = root.child("mixed")
print(mixed.text(recursive=False)) # "Text "
# Custom join character
print(nested.text(recursive=True, join=" | ")) # "Child Text | More text"
```
### XML Serialization
```python
import pygixml
doc = pygixml.XMLDocument()
root = doc.append_child("root")
child = root.append_child("item")
child.name = "product"
# Serialize to string
print(root.to_string()) # Serialized <root> element with its <product> child
print(root.to_string(" ")) # Custom indentation
# Convenience property
print(root.xml) # Same as to_string() with default indent
```
### Node Iteration
```python
import pygixml
# Element names are illustrative
xml_string = """
<root>
    <item>First</item>
    <item>Second</item>
    <item>Third</item>
</root>
"""
doc = pygixml.parse_string(xml_string)
# Iterate over document (depth-first)
for node in doc:
    print(f"Node: {node.name}, XPath: {node.xpath}")
# Iterate over children
root = doc.first_child()
for child in root:
    print(f"Child: {child.name}, Value: {child.child_value()}")
```
### Node Comparison and Identity
```python
import pygixml
doc = pygixml.parse_string("<root><a/><b/></root>")
root = doc.first_child()
a = root.child("a")
b = root.child("b")
a2 = root.child("a")
print(a == a2) # True - same node
print(a == b) # False - different nodes
print(a.mem_id) # Memory address for debugging
```
## XPath Support
pygixml provides full XPath 1.0 support through pugixml's powerful XPath engine:
```python
import pygixml
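# The category of the second book is illustrative
xml_string = """
<library>
    <book id="1" category="fiction">
        <title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author>
        <year>1925</year>
        <price>12.99</price>
    </book>
    <book id="2" category="dystopian">
        <title>1984</title>
        <author>George Orwell</author>
        <year>1949</year>
        <price>10.99</price>
    </book>
</library>
"""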
doc = pygixml.parse_string(xml_string)
root = doc.first_child()
# Select all books
books = root.select_nodes("book")
print(f"Found {len(books)} books")
# Select fiction books
fiction_books = root.select_nodes("book[@category='fiction']")
print(f"Found {len(fiction_books)} fiction books")
# Select specific book by ID
book_2 = root.select_node("book[@id='2']")
if book_2:
    title = book_2.node.child("title").child_value()
    print(f"Book ID 2: {title}")
# Use XPathQuery for repeated queries
query = pygixml.XPathQuery("book[year > 1930]")
recent_books = query.evaluate_node_set(root)
print(f"Found {len(recent_books)} books published after 1930")
# XPath boolean evaluation
has_orwell = pygixml.XPathQuery("book[author='George Orwell']").evaluate_boolean(root)
print(f"Has George Orwell books: {has_orwell}")
# XPath number evaluation
avg_price = pygixml.XPathQuery("sum(book/price) div count(book)").evaluate_number(root)
print(f"Average price: ${avg_price:.2f}")
```
## Important Note: Element Nodes vs Text Nodes
In pugixml (and therefore pygixml), **element nodes do not have values directly**. Instead, they contain child text nodes that hold the text content.
```python
# ❌ This will NOT work (element nodes don't have values):
element_node.value = "some text"
# ✅ Correct approach - use child_value() to get text content:
text_content = element_node.child_value()
# ✅ To set text content, you need to append a text node:
text_node = element_node.append_child("") # Empty name creates text node
text_node.value = "some text"
```
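
The text-node pattern above can be wrapped in a small helper so that calling code does not repeat it. This is a sketch only: `append_text` is a hypothetical name, not a pygixml function, and it relies solely on the `append_child("")` and `value` behaviour described in this section. The demo is skipped when pygixml is not installed.

```python
import importlib.util


def append_text(element, text):
    """Append a text node holding `text` to `element` and return it.

    Follows the pattern described above: a child created with an empty
    name is a text node, and the text is stored in its `value`.
    """
    text_node = element.append_child("")
    text_node.value = text
    return text_node


if importlib.util.find_spec("pygixml") is not None:
    import pygixml

    doc = pygixml.XMLDocument()
    title = doc.append_child("title")
    append_text(title, "The Great Gatsby")
    print(title.child_value())
else:
    print("pygixml is not installed; skipping the demo")
```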
## Benchmarks
Run performance comparisons:
```bash
# Run complete benchmark suite
python benchmarks/clean_visualization.py
# View results
cat benchmarks/results/benchmark_results.csv
```
The benchmark suite compares pygixml against:
- **lxml** - Industry-standard C-based parser
- **xml.etree.ElementTree** - Python standard library
**Benchmark Files:**
- `benchmarks/clean_visualization.py` - Main benchmark runner
- `benchmarks/benchmark_parsing.py` - Core benchmark logic
- `benchmarks/results/` - Generated CSV data and SVG charts
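
For a quick comparison without the full suite, a timing loop along the following lines may be enough. It is an independent sketch, not the project's benchmark script: the helper names (`make_xml`, `available_parsers`, `time_parsers`) and the generated document are assumptions, and parsers that are not installed are simply skipped.

```python
import importlib.util
import timeit
import xml.etree.ElementTree as ET


def make_xml(n: int) -> str:
    """Build a flat document with `n` <item> elements."""
    items = "".join(f'<item id="{i}">value {i}</item>' for i in range(n))
    return f"<root>{items}</root>"


def available_parsers() -> dict:
    """Return {name: parse_function} for the parsers that are installed."""
    parsers = {"ElementTree": ET.fromstring}
    if importlib.util.find_spec("lxml") is not None:
        from lxml import etree
        parsers["lxml"] = etree.fromstring
    if importlib.util.find_spec("pygixml") is not None:
        import pygixml
        parsers["pygixml"] = pygixml.parse_string
    return parsers


def time_parsers(xml_text: str, repeat: int = 20) -> dict:
    """Best-of-`repeat` wall-clock time in seconds for a single parse."""
    return {
        name: min(timeit.repeat(lambda: parse(xml_text), number=1, repeat=repeat))
        for name, parse in available_parsers().items()
    }


results = time_parsers(make_xml(5000))
baseline = results["ElementTree"]
for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} {seconds:.5f}s  {baseline / seconds:.1f}x vs ElementTree")
```

Taking the minimum over several runs reduces noise from other processes. Absolute numbers will vary with hardware and document shape, so they should not be expected to match the table above.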
## Documentation
📖 **Full documentation** is available at: [https://mohammadraziei.github.io/pygixml/](https://mohammadraziei.github.io/pygixml/)
The documentation includes:
- Complete API reference with examples
- Installation guides for all platforms
- Performance benchmarks and optimization tips
- XPath 1.0 usage guide with comprehensive examples
- Real-world usage scenarios
## License
MIT License - see [LICENSE](LICENSE) file for details.
**To use this library, you must star the project on GitHub!**
This helps support the development and shows appreciation for the work. Please star the repository before using the library:
👉 **[Star pygixml on GitHub](https://github.com/MohammadRaziei/pygixml)**
## Acknowledgments
- [pugixml](https://pugixml.org/) - Fast and lightweight C++ XML processing library
- [Cython](https://cython.org/) - C extensions for Python
- [scikit-build](https://scikit-build.readthedocs.io/) - Modern Python build system