https://github.com/amogorkon/cidstore
CID. B+Tree. For Triplestore.
- Host: GitHub
- URL: https://github.com/amogorkon/cidstore
- Owner: amogorkon
- License: mit
- Created: 2025-05-11T18:58:47.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-06-06T09:36:03.000Z (4 months ago)
- Last Synced: 2025-06-06T10:36:11.621Z (4 months ago)
- Language: Python
- Homepage:
- Size: 440 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# CIDStore
CIDStore is a high-performance, disk-backed B+Tree implementation designed as a backend for triplestores. It efficiently maps 128-bit composite keys to variable-sized sets of 128-bit values, supporting massive-scale many-to-many relationships with high throughput and low latency.
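
As a rough illustration of the mapping described above, the sketch below models 128-bit composite keys pointing to sets of 128-bit values, using a plain in-memory dictionary instead of the disk-backed B+Tree. The helper names (`cid`, `composite_key`, `insert`, `lookup`), the truncation of SHA3-256 to 16 bytes, and the way subject and predicate are combined are all assumptions made for this example, not CIDStore's actual API or key scheme.

```python
import hashlib
from collections import defaultdict


def cid(term: str) -> int:
    """Hypothetical helper: 128-bit ID from the first 16 bytes of SHA3-256."""
    digest = hashlib.sha3_256(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:16], "big")


def composite_key(subject: str, predicate: str) -> int:
    """Hypothetical helper: hash (subject, predicate) into one 128-bit key."""
    # A NUL separator keeps ("ab", "c") distinct from ("a", "bc").
    return cid(subject + "\x00" + predicate)


# In-memory stand-in for the disk-backed index: key -> set of 128-bit values.
index: defaultdict[int, set[int]] = defaultdict(set)


def insert(subject: str, predicate: str, obj: str) -> None:
    index[composite_key(subject, predicate)].add(cid(obj))


def lookup(subject: str, predicate: str) -> set[int]:
    # Return a copy so callers cannot mutate the stored set.
    return set(index.get(composite_key(subject, predicate), ()))


insert("A", "loves", "B")
insert("A", "loves", "C")
print(len(lookup("A", "loves")))  # 2
```

Both triples share the composite key for `(A, loves)`, so a single lookup returns the IDs of `B` and `C`.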
## Features
- **Optimized for Triplestores**: Handles relationships like `(A, loves, B)` and `(A, loves, C)` using composite keys.
- **Multi-Value Key Support**: Efficiently stores and queries keys with multiple associated values.
- **HDF5 Integration**: Uses HDF5 for scalable, compressed, and chunked storage.
- **Crash Consistency**: Write-Ahead Logging (WAL) ensures atomicity and recoverability.
- **Concurrency**: Supports single-writer, multiple-reader (SWMR) mode for concurrent access.
- **Dynamic Scalability**: Handles billions of keys and values with efficient sharding and chunking.
## Key Characteristics
- **Keys**: Immutable 128-bit identifiers derived from SHA3 hashes or composite triplestore logic.
- **Values**: Stored in contiguous, compressed datasets for efficient retrieval.
- **Performance** (targets):
  - Insert throughput: >1M ops/sec (batched).
  - Lookup latency: <50µs (avg), <100µs (P99).
- **Hybrid Multi-Value Handling**: Combines duplicate key storage with external value-list datasets for high-cardinality keys.
## Use Cases
CIDStore is ideal for triplestore backends requiring:
- Massive-scale many-to-many relationships.
- High insert throughput and low-latency lookups.
- Crash consistency with minimal recovery time.
## Architecture Overview
CIDStore uses a B+Tree structure with the following components:
- **Leaf Nodes**: Store semantic keys and values, linked for in-order traversal.
- **Internal Nodes**: Guide traversal using synthetic routing keys.
- **Write-Ahead Log (WAL)**: Ensures atomic updates and crash recovery.
- **HDF5 Storage**: Provides scalable, chunked, and compressed storage for nodes and value datasets.
## Multi-Value Key Handling
CIDStore supports two approaches for multi-value keys:
1. **Duplicate Keys**: Stores multiple entries with the same key directly in the B+Tree.
2. **Value Lists**: Promotes high-cardinality keys to external HDF5 datasets for efficient bulk operations.

A hybrid approach dynamically switches between these strategies based on key cardinality.
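
The hybrid strategy can be sketched in a few lines. This is a minimal in-memory model, not the project's implementation: the class `HybridIndex`, its method names, and the cutoff `PROMOTE_THRESHOLD` are hypothetical, and two dictionaries stand in for the B+Tree entries and the external HDF5 value-list datasets.

```python
from dataclasses import dataclass, field

# Assumed cutoff for this example; the source does not state a threshold.
PROMOTE_THRESHOLD = 4


@dataclass
class HybridIndex:
    # Stand-in for duplicate-key entries stored directly in the B+Tree.
    inline: dict[int, list[int]] = field(default_factory=dict)
    # Stand-in for per-key external value-list datasets.
    external: dict[int, list[int]] = field(default_factory=dict)

    def insert(self, key: int, value: int) -> None:
        if key in self.external:
            # Already promoted: append to the external value list.
            self.external[key].append(value)
            return
        values = self.inline.setdefault(key, [])
        values.append(value)
        if len(values) > PROMOTE_THRESHOLD:
            # Cardinality crossed the cutoff: move all values out of the tree.
            self.external[key] = self.inline.pop(key)

    def lookup(self, key: int) -> list[int]:
        if key in self.external:
            return list(self.external[key])
        return list(self.inline.get(key, []))

    def is_promoted(self, key: int) -> bool:
        return key in self.external


demo = HybridIndex()
for value in range(6):
    demo.insert(42, value)
print(demo.is_promoted(42), demo.lookup(42))
```

A real store would presumably also need a demotion path when deletions shrink a value list; that is omitted here for brevity.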
## Concurrency and Crash Recovery
- **Concurrency**: Single-writer, multiple-reader (SWMR) mode ensures safe concurrent access.
- **Crash Recovery**: WAL and shadow paging provide atomicity and consistency.
## Performance Targets
| Metric            | Target                      |
|-------------------|-----------------------------|
| Insert Throughput | >1M ops/sec (batched)       |
| Lookup Latency    | <50µs (avg), <100µs (P99)   |
| Recovery Time     | <30s after crash            |
| Scalability       | 1B keys, 100B values        |
## Getting Started
### Prerequisites
- Python 3.12 or later.
- HDF5 library for storage.
### Installation
Clone the repository and install the required dependencies:
```bash
# Clone the repository
git clone https://github.com/amogorkon/cidstore.git
cd cidstore

# Install dependencies
pip install -r requirements.txt
```
### Usage
Refer to the documentation in the `docs/` folder for detailed usage instructions and examples.
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.