https://github.com/fairdataihub/dmpbridge
Convert DMPs (PDF) to RDA Common Standard structured JSON metadata with DMPTool extensions using Large Language Models.
- Host: GitHub
- URL: https://github.com/fairdataihub/dmpbridge
- Owner: fairdataihub
- Created: 2026-05-04T16:11:57.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-06-01T15:59:13.000Z (about 1 month ago)
- Last Synced: 2026-06-01T17:15:08.452Z (about 1 month ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 2.45 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# DMP Bridge
An open-source Python pipeline for extracting Data Management Plan (DMP) fields from PDF documents and converting them into **RDA Common Standard JSON** with DMPTool extensions.
## Features
- **PDF Extraction**: Extract structured content from DMP PDFs using pdfplumber
- **LLM-Powered Processing**: Leverage Llama models for intelligent narrative block labeling
- **Text Cleaning**: Automated text normalization and preprocessing
- **RDA Compliance**: Convert extracted data to RDA Common Standard JSON format
- **DMPTool Extensions**: Support for DMPTool-specific extensions and custom fields
- **Evaluation Framework**: Built-in tools for validating extraction accuracy
- **Modular Architecture**: Clean separation of concerns with dedicated modules for each processing stage
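To give a sense of the target format, here is a minimal sketch of an RDA Common Standard style record built with the standard library. The field names follow the RDA DMP Common Standard as commonly published (a root `dmp` object with `title`, `contact`, `dataset`, and so on), but this is an illustration, not the output of this package: the helper `build_minimal_dmp`, the placeholder identifiers, and the `example.org` values are all assumptions, and DMPTool extension fields are not shown. Check `schemas/rda_dmp_dmptool_extension_skeleton.json` for the structure the project actually uses.

```python
import json
from datetime import datetime, timezone


def build_minimal_dmp(title: str, contact_name: str, contact_email: str) -> dict:
    """Return a minimal RDA-style DMP record (hypothetical helper, placeholder IDs)."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return {
        "dmp": {
            "title": title,
            "created": now,
            "modified": now,
            "language": "eng",  # ISO 639-3 language code
            "ethical_issues_exist": "unknown",
            # Placeholder identifier; a real record would carry a DOI or similar.
            "dmp_id": {"identifier": "https://example.org/dmp/1", "type": "url"},
            "contact": {
                "name": contact_name,
                "mbox": contact_email,
                "contact_id": {"identifier": contact_email, "type": "other"},
            },
            "dataset": [
                {
                    "title": "Placeholder dataset",
                    "dataset_id": {"identifier": "dataset-1", "type": "other"},
                    "personal_data": "unknown",
                    "sensitive_data": "unknown",
                }
            ],
        }
    }


record = build_minimal_dmp("Example DMP", "Jane Doe", "jane@example.org")
print(json.dumps(record, indent=2))
```

In the actual pipeline, these values would come from the extracted and labeled PDF content rather than from function arguments.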
## Repository Structure
```
dmpbridge/
├── data/ # Sample data and extraction outputs
│ ├── reference_pdfs/ # Original PDF documents
│ │ ├── sample1.pdf
│ │ └── sample10.pdf
│ │
│ ├── reference_text/ # Reference text for validation
│ │ ├── sample1_reference.txt
│ │ └── sample10_reference.txt
│ │
│ ├── reference_structure_blocks/ # Reference structured blocks for comparison
│ │ ├── sample1_reference.json
│ │ └── sample10_reference.json
│ │
│ ├── pdfplumber_extracted_blocks/ # Structured block extraction (JSON)
│ │ ├── sample1.json
│ │ └── sample10.json
│ │
│ ├── pdfplumber_extracted_blocks_debug/ # Debug output from block extraction
│ │ ├── sample1_debug.json
│ │ └── sample10_debug.json
│ │
│ ├── pdfplumber_extracted_text/ # Raw text extraction
│ │ ├── sample1.txt
│ │ └── sample10.txt
│ │
│ ├── pdfplumber_extracted_markdown/ # Markdown-formatted extraction
│ │ ├── sample1.md
│ │ └── sample10.md
│ │
│ └── llama_structured_blocks/ # LLM-labeled structured data
│ ├── sample1_llama_blocks.json
│ └── sample10_llama_blocks.json
│
├── src/dmpbridge/ # Main package source code
│ ├── __init__.py
│ │
│ ├── pdf/ # PDF extraction module
│ │ ├── __init__.py
│ │ └── pdfplumber_extractor.py # pdfplumber-based PDF parser
│ │
│ ├── llm/ # LLM integration module
│ │ ├── __init__.py
│ │ ├── llama_client.py # Llama model client
│ │ └── llm_narrative_blocks.py # Narrative block labeling
│ │
│ ├── vision/ # Vision-based processing (future)
│ │ └── __init__.py
│ │
│ ├── processing/ # Data processing module
│ │ ├── __init__.py
│ │ ├── text_cleaner.py # Text normalization and cleanup
│ │ └── structure_json_builder.py # JSON structure conversion
│ │
│ ├── evaluation/ # Evaluation framework
│ │ ├── __init__.py
│ │ ├── pdfplumber_text_evaluator.py # Text extraction validation
│ │ └── narrative_json_evaluator.py # LLM output validation
│ │
│ └── utils/ # Utility functions
│ ├── __init__.py
│ ├── logger.py # Logging configuration
│ └── file_io.py # File I/O operations
│
├── notebooks/ # Jupyter notebooks for testing
│ ├── 01_pdfplumber_batch_test.ipynb # PDF extraction batch processing
│ ├── 02_evaluation_pdfplumber_test.ipynb # Text extraction evaluation
│ ├── 03_llama_dmp_narrative_labeling_batch_test.ipynb
│ └── 04_evaluation_llama_dmp_narrative_batch_test.ipynb
│
├── outputs/ # Generated outputs
│ ├── debug/ # Debug information
│ ├── logs/ # Application logs
│ └── reports/ # Evaluation reports
│
├── schemas/ # JSON schemas
│ └── rda_dmp_dmptool_extension_skeleton.json
│
├── tests/ # Unit and integration tests
│
├── requirements.txt # Python dependencies
├── pyproject.toml # Package configuration
└── README.md
```
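The `processing/text_cleaner.py` module listed above handles text normalization, but its implementation is not shown in this README. As a rough illustration of what such a step typically involves for PDF-extracted text, here is a minimal standard-library sketch. The function name `clean_text` and the specific rules are assumptions, not the package's API.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize PDF-extracted text (illustrative only, not the package's cleaner)."""
    # Fold compatibility characters: ligatures become letters, NBSP becomes a space.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that sit directly before or after a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "Data  manage-\nment plan\u00a0for the \ufb01eld study.\n\n\n\nSection 2"
print(clean_text(sample))
```

One caveat of the hyphen rule: it also merges genuine compounds that happen to break at the hyphen (for example `well-` / `known`), so a production cleaner may need a dictionary check or a more conservative pattern.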
## Quick Start
### Prerequisites
- Python 3.8 or higher
- pip package manager
- Git
### Setup (Local Development)
#### Step 1: Clone the Repository
```bash
git clone https://github.com/fairdataihub/dmpbridge.git
cd dmpbridge
```
#### Step 2: Create and Activate Virtual Environment
**Windows (cmd):**
```bat
python -m venv venv
venv\Scripts\activate.bat
```
**Windows (PowerShell):**
```powershell
python -m venv venv
.\venv\Scripts\Activate.ps1
```
**macOS/Linux:**
```bash
python -m venv venv
source venv/bin/activate
```
#### Step 3: Install Dependencies
```bash
# Standard installation
pip install -r requirements.txt
# Recommended for local development (editable mode)
pip install -e .
```
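After installing, a quick way to confirm that the environment is set up is to check that the packages can be located by Python. The sketch below uses only the standard library; the helper name `is_importable` is an assumption for illustration, and `pdfplumber` is checked because the Features section names it as the extraction library.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


for name in ("dmpbridge", "pdfplumber"):
    status = "ok" if is_importable(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports `missing`, make sure the virtual environment is activated and rerun the install commands above.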
## Usage
### Basic PDF Extraction
```python
from dmpbridge.pdf import pdfplumber_extractor
# Extract text from a PDF
extractor = pdfplumber_extractor.PDFExtractor()
text = extractor.extract_text("path/to/dmp.pdf")
```
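For readers who want to see what the underlying library call looks like, here is a minimal sketch that uses pdfplumber directly instead of the package wrapper. It is an assumption about the general approach, not a copy of `pdfplumber_extractor.py`; the function names `join_pages` and `extract_text_with_pdfplumber` are hypothetical. `pdfplumber.open` and `page.extract_text()` are part of pdfplumber's public API.

```python
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text with blank lines, skipping empty or missing pages."""
    return "\n\n".join(t.strip() for t in page_texts if t and t.strip())


def extract_text_with_pdfplumber(pdf_path: str) -> str:
    """Extract plain text from every page of a PDF (hypothetical helper)."""
    import pdfplumber  # third-party dependency, imported lazily

    with pdfplumber.open(pdf_path) as pdf:
        # The generator is fully consumed inside the `with` block.
        return join_pages(page.extract_text() for page in pdf.pages)


# Usage, assuming a PDF exists at this path:
# text = extract_text_with_pdfplumber("data/reference_pdfs/sample1.pdf")
```

Keeping the page-joining logic in its own function makes it testable without a PDF on disk.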
### Running Jupyter Notebooks
Start Jupyter and navigate to the `notebooks/` directory:
```bash
jupyter notebook
```
Then open any of the provided notebooks to explore:
- **01_pdfplumber_batch_test.ipynb** — Batch PDF extraction
- **02_evaluation_pdfplumber_test.ipynb** — Evaluate extraction quality
- **03_llama_dmp_narrative_labeling_batch_test.ipynb** — LLM-based labeling
- **04_evaluation_llama_dmp_narrative_batch_test.ipynb** — Evaluate LLM output
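The evaluation notebooks compare pipeline output against the reference files in `data/`. The metrics they use are not described in this README; as one plausible approach, the sketch below scores extracted text against a reference with `difflib.SequenceMatcher` from the standard library. The function names and the choice of metric are assumptions.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences are not penalized."""
    return " ".join(text.lower().split())


def text_similarity(extracted: str, reference: str) -> float:
    """Return a similarity ratio in [0, 1] between extracted and reference text."""
    # autojunk=False avoids the heuristic that discounts frequent characters
    # in long inputs, at the cost of slower comparisons.
    matcher = SequenceMatcher(None, normalize(extracted), normalize(reference), autojunk=False)
    return matcher.ratio()


score = text_similarity(
    "Data  will be shared\nvia Zenodo.",
    "Data will be shared via Zenodo.",
)
print(f"similarity: {score:.3f}")
```

A character-level ratio is simple to compute but slow on long documents; word-level comparison or a dedicated edit-distance library may be a better fit for full DMPs.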
This project is still under active development.