{"id":50536729,"url":"https://github.com/fairdataihub/dmpbridge","last_synced_at":"2026-06-03T17:01:02.472Z","repository":{"id":355668847,"uuid":"1229024894","full_name":"fairdataihub/dmpbridge","owner":"fairdataihub","description":"Convert DMPs (PDF) to RDA Common Standard structured JSON metadata with DMPTool extensions using Large Language Models.","archived":false,"fork":false,"pushed_at":"2026-06-01T15:59:13.000Z","size":2569,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-01T17:15:08.452Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fairdataihub.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-04T16:11:57.000Z","updated_at":"2026-06-01T16:12:27.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/fairdataihub/dmpbridge","commit_stats":null,"previous_names":["fairdataihub/dmpbridge"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/fairdataihub/dmpbridge","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fairdataihub%2Fdmpbridge","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fairdataihub%2Fdmpbridge/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fairdataihub%2Fdmpbridge/releases","manifests_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fairdataihub%2Fdmpbridge/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fairdataihub","download_url":"https://codeload.github.com/fairdataihub/dmpbridge/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fairdataihub%2Fdmpbridge/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33874679,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-03T02:00:06.370Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-03T17:01:01.523Z","updated_at":"2026-06-03T17:01:02.461Z","avatar_url":"https://github.com/fairdataihub.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DMP Bridge\n \nAn open-source Python pipeline for extracting Data Management Plan (DMP) fields from PDF documents and converting them into **RDA Common Standard JSON** with DMPTool extensions.\n \n## Features\n \n- **PDF Extraction**: Extract structured content from DMP PDFs using pdfplumber\n- **LLM-Powered Processing**: Leverage Llama models for intelligent narrative block labeling\n- **Text Cleaning**: Automated text normalization and preprocessing\n- **RDA Compliance**: Convert extracted data to RDA Common Standard JSON format\n- **DMPTool 
Extensions**: Support for DMPTool-specific extensions and custom fields\n- **Evaluation Framework**: Built-in tools for validating extraction accuracy\n- **Modular Architecture**: Clean separation of concerns with dedicated modules for each processing stage\n## Repository Structure\n \n```\ndmpbridge/\n├── data/                                    # Sample data and extraction outputs\n│   ├── reference_pdfs/                      # Original PDF documents\n│   │   ├── sample1.pdf\n│   │   └── sample10.pdf\n│   │\n│   ├── reference_text/                      # Reference text for validation\n│   │   ├── sample1_reference.txt\n│   │   └── sample10_reference.txt\n│   │\n│   ├── reference_structure_blocks/          # Reference structured blocks for comparison\n│   │   ├── sample1_reference.json\n│   │   └── sample10_reference.json\n│   │\n│   ├── pdfplumber_extracted_blocks/         # Structured block extraction (JSON)\n│   │   ├── sample1.json\n│   │   └── sample10.json\n│   │\n│   ├── pdfplumber_extracted_blocks_debug/   # Debug output from block extraction\n│   │   ├── sample1_debug.json\n│   │   └── sample10_debug.json\n│   │\n│   ├── pdfplumber_extracted_text/           # Raw text extraction\n│   │   ├── sample1.txt\n│   │   └── sample10.txt\n│   │\n│   ├── pdfplumber_extracted_markdown/       # Markdown-formatted extraction\n│   │   ├── sample1.md\n│   │   └── sample10.md\n│   │\n│   └── llama_structured_blocks/             # LLM-labeled structured data\n│       ├── sample1_llama_blocks.json\n│       └── sample10_llama_blocks.json\n│\n├── src/dmpbridge/                           # Main package source code\n│   ├── __init__.py\n│   │\n│   ├── pdf/                                 # PDF extraction module\n│   │   ├── __init__.py\n│   │   └── pdfplumber_extractor.py          # pdfplumber-based PDF parser\n│   │\n│   ├── llm/                                 # LLM integration module\n│   │   ├── __init__.py\n│   │   ├── llama_client.py                  # Llama model 
client\n│   │   └── llm_narrative_blocks.py          # Narrative block labeling\n│   │\n│   ├── vision/                              # Vision-based processing (future)\n│   │   └── __init__.py\n│   │\n│   ├── processing/                          # Data processing module\n│   │   ├── __init__.py\n│   │   ├── text_cleaner.py                  # Text normalization and cleanup\n│   │   └── structure_json_builder.py        # JSON structure conversion\n│   │\n│   ├── evaluation/                          # Evaluation framework\n│   │   ├── __init__.py\n│   │   ├── pdfplumber_text_evaluator.py     # Text extraction validation\n│   │   └── narrative_json_evaluator.py      # LLM output validation\n│   │\n│   └── utils/                               # Utility functions\n│       ├── __init__.py\n│       ├── logger.py                        # Logging configuration\n│       └── file_io.py                       # File I/O operations\n│\n├── notebooks/                               # Jupyter notebooks for testing\n│   ├── 01_pdfplumber_batch_test.ipynb       # PDF extraction batch processing\n│   ├── 02_evaluation_pdfplumber_test.ipynb  # Text extraction evaluation\n│   ├── 03_llama_dmp_narrative_labeling_batch_test.ipynb\n│   └── 04_evaluation_llama_dmp_narrative_batch_test.ipynb\n│\n├── outputs/                                 # Generated outputs\n│   ├── debug/                               # Debug information\n│   ├── logs/                                # Application logs\n│   └── reports/                             # Evaluation reports\n│\n├── schemas/                                 # JSON schemas\n│   └── rda_dmp_dmptool_extension_skeleton.json\n│\n├── tests/                                   # Unit and integration tests\n│\n├── requirements.txt                         # Python dependencies\n├── pyproject.toml                           # Package configuration\n└── README.md\n```\n \n## Quick Start\n \n### Prerequisites\n \n- Python 3.8 or higher\n- pip package manager\n- 
Git\n \n### Setup (Local Development)\n \n#### Step 1: Clone the Repository\n \n```bash\ngit clone https://github.com/fairdataihub/dmpbridge.git\ncd dmpbridge\n```\n \n#### Step 2: Create and Activate Virtual Environment\n \n**Windows (cmd):**\n```bat\npython -m venv venv\nvenv\\Scripts\\activate.bat\n```\n \n**Windows (PowerShell):**\n```powershell\npython -m venv venv\n.\\venv\\Scripts\\Activate.ps1\n```\n \n**macOS/Linux:**\n```bash\npython -m venv venv\nsource venv/bin/activate\n```\n \n#### Step 3: Install Dependencies\n \n```bash\n# Standard installation\npip install -r requirements.txt\n \n# Recommended for local development (editable mode)\npip install -e .\n```\n \n## Usage\n \n### Basic PDF Extraction\n \n```python\nfrom dmpbridge.pdf import pdfplumber_extractor\n \n# Extract text from a PDF\nextractor = pdfplumber_extractor.PDFExtractor()\ntext = extractor.extract_text(\"path/to/dmp.pdf\")\n```\n \n### Running Jupyter Notebooks\n \nStart Jupyter and navigate to the `notebooks/` directory:\n \n```bash\njupyter notebook\n```\n \nThen open any of the provided notebooks to explore:\n- **01_pdfplumber_batch_test.ipynb** — Batch PDF extraction\n- **02_evaluation_pdfplumber_test.ipynb** — Evaluate extraction quality\n- **03_llama_dmp_narrative_labeling_batch_test.ipynb** — LLM-based labeling\n- **04_evaluation_llama_dmp_narrative_batch_test.ipynb** — Evaluate LLM output\n \nThis project is still a work in progress.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffairdataihub%2Fdmpbridge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffairdataihub%2Fdmpbridge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffairdataihub%2Fdmpbridge/lists"}