https://github.com/benjaminr/statutory-duty-extractor
https://github.com/benjaminr/statutory-duty-extractor
Last synced: 9 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/benjaminr/statutory-duty-extractor
- Owner: benjaminr
- License: other
- Created: 2025-07-30T07:58:10.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-08-04T15:30:24.000Z (11 months ago)
- Last Synced: 2025-08-25T22:38:25.514Z (10 months ago)
- Language: Python
- Size: 1.01 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Statutory Duty Extractor 📜🤖
**Note: This repository contains proprietary materials for authorised use only. Do not share further. See [LICENSE](./LICENSE) for important usage restrictions.**
## The Mission 🎯
The Prime Minister's Office has issued an urgent request: Local Authorities across the UK are drowning in statutory instruments, struggling to identify their legal obligations buried within thousands of pages of legislation. Your mission is to build an AI-powered solution that can automatically extract and index statutory duties from these documents, helping councils understand exactly what they're legally required to do.
This prototype tool uses cutting-edge LLMs to transform impenetrable legal PDFs into clear, structured data about who must do what under UK law.
## Overview
This tool processes UK statutory instruments (in PDF format) and extracts:
- **Duty descriptions**: The specific legal obligations
- **Duty holders**: Who must fulfil each duty (e.g., "local authority", "Secretary of State")
- **Legislative references**: Where the duty appears (e.g., "regulation 3")
## Quick Start
1. **Set up environment**
```bash
# Clone the repository
git clone
cd statutory-duty-extractor
# Install dependencies using UV
uv sync
# Copy and configure environment variables
cp .env.example .env
# Edit .env with your Azure OpenAI credentials
```
2. **Run extraction**
```bash
# Extract from a single PDF
uv run statutory-duty-extractor data/statutory_instruments_pdf/2089.pdf
```
## Project Structure
```
statutory-duty-extractor/
├── src/statutory_duty_extractor/
│ ├── models.py # Pydantic models for duties
│ ├── extractor.py # Core extraction logic
│ └── cli.py # Command-line interface
├── data/
│ ├── statutory_instruments_pdf/ # Original PDF documents to process
│ ├── ground_truth_json/ # Simplified examples (matches data model)
├── prompts/ # Extraction prompts
│ ├── system_prompt.txt
│ └── user_prompt.txt
└── tests/ # Unit tests
```
### Data Directory Structure
- **`statutory_instruments_pdf/`**: The original UK statutory instrument PDFs that need processing
- **`ground_truth_json/`**: Simplified ground truth examples in JSON format that match our minimal data model (3 fields per duty)
## Current Implementation
The current approach:
1. Extracts text from PDFs using PyMuPDF
2. Sends the full text to Azure OpenAI with a prompt
3. Uses structured outputs to get back `StatutoryInstrument` objects
4. Displays results in a formatted table
## Development
```bash
# Run tests
uv run pytest
# Format code
uv run ruff format .
# Type check
uv run mypy src/
```