https://github.com/lean-dojo/leandojo-v2

LeanDojo-v2 is an end-to-end framework for training, evaluating, and deploying AI-assisted theorem provers for Lean 4.
Host: GitHub
URL: https://github.com/lean-dojo/leandojo-v2
Owner: lean-dojo
License: apache-2.0
Created: 2025-10-16T17:44:32.000Z (9 months ago)
Default Branch: main
Last Pushed: 2025-12-31T08:15:34.000Z (7 months ago)
Last Synced: 2026-01-04T11:22:56.764Z (7 months ago)
Topics: lean4, library, machine-learning, theorem-proving
Language: Python
Homepage: https://leandojo.org/
Size: 113 KB
Stars: 5
Watchers: 0
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# LeanDojo-v2

LeanDojo-v2 is an end-to-end framework for training, evaluating, and deploying AI-assisted theorem provers for Lean 4. It combines repository tracing, lifelong dataset management, retrieval-augmented agents, Hugging Face fine-tuning, and external inference APIs into one toolkit.

## Table of Contents

1. [Overview](#overview)

2. [Key Features](#key-features)

3. [Repository Layout](#repository-layout)

4. [Requirements](#requirements)

5. [Installation](#installation)

6. [Environment Setup](#environment-setup)

7. [Quick Start](#quick-start)

8. [Tracing and Dataset Generation](#tracing-and-dataset-generation)

9. [Working with Agents and Trainers](#working-with-agents-and-trainers)

10. [LeanProgress Step-Prediction](#leanprogress-step-prediction)

11. [Proving Theorems](#proving-theorems)

12. [Testing](#testing)

13. [Troubleshooting & Tips](#troubleshooting--tips)

14. [Contributing](#contributing)

15. [License](#license)

## Overview

LeanDojo-v2 extends the original LeanDojo stack with the LeanAgent lifelong learning pipeline. It automates the entire loop of:

1. Cloning Lean repositories (GitHub or local) and tracing them with Lean instrumentation.

2. Storing structured theorem information in a dynamic database.

3. Training agent policies with supervised fine-tuning (SFT), GRPO-style RL, or retrieval objectives.

4. Driving Pantograph-based provers to fill in `sorry`s or verify solutions.

5. Using the Hugging Face Inference API for large-model inference.

The codebase is modular: you can reuse the tracing pipeline without the agents, swap in custom trainers, or stand up your own inference service via the external API layer.

## Key Features

- **Unified Agent Abstractions**: `BaseAgent` orchestrates repository setup, training, and proving. Concrete implementations (`HFAgent`, `LeanAgent`, and `ExternalAgent`) tailor the workflow to Hugging Face models, retrieval-based provers, or REST-backed models.

- **Powerful Trainers**: `SFTTrainer`, `GRPOTrainer`, and `RetrievalTrainer` cover LoRA-enabled supervised fine-tuning, group-relative policy optimization, and retriever-only curriculum learning.

- **Multi-Modal Provers**: `HFProver`, `RetrievalProver`, and `ExternalProver` run on top of Pantograph’s Lean RPC server to search for tactics, generate whole proofs, or delegate to custom models.

- **Lean Tracing Pipeline**: `lean_dojo` includes the Lean 4 instrumentation (`ExtractData.lean`) and Python utilities to trace commits, normalize ASTs, and cache proof states.

- **Dynamic Repository Database**: `database` tracks repositories, theorems, curriculum difficulty, and sorry status, enabling lifelong training schedules.

- **External API**: The `external_api` folder exposes HTTP endpoints (FastAPI + uvicorn) and Lean frontend snippets so you can query LLMs from Lean editors.

## Repository Layout

| Path | Description |
|------|-------------|
| `lean_dojo_v2/agent/` | Base class plus `HFAgent`, `LeanAgent`, and helpers to manage repositories and provers. |
| `lean_dojo_v2/trainer/` | SFT, GRPO, and retrieval trainers with Hugging Face + DeepSpeed integration. |
| `lean_dojo_v2/prover/` | Pantograph-based prover implementations (HF, retrieval, external). |
| `lean_dojo_v2/lean_dojo/` | Lean tracing, dataset generation, caching, and AST utilities. |
| `lean_dojo_v2/lean_agent/` | Lifelong learning pipeline (configs, database, retrieval stack, generator). |
| `lean_dojo_v2/external_api/` | LeanCopilot code (Lean + Python server) to query external models. |
| `lean_dojo_v2/utils/` | Shared helpers for Git, filesystem operations, and constants. |
| `lean_dojo_v2/tests/` | Pytest regression suite. |

For deeper documentation on the lifelong learning component, see `lean_dojo_v2/lean_agent/README.md`.

## Requirements

- Python ≥ 3.11.

- CUDA-capable GPU for training and inference (tested with CUDA 12.6).

- Git ≥ 2.25 and `wget`.

- [elan](https://github.com/leanprover/elan) Lean toolchain to trace repositories locally.

- Adequate disk space for the `raid/` working directory (datasets, checkpoints, traces).

Python dependencies are declared in `pyproject.toml` and include PyTorch, PyTorch Lightning, Transformers, DeepSpeed, TRL, PEFT, and more.

## Installation

### Option 1: From PyPI

```sh
# Install the core package
pip install lean-dojo-v2

# Pantograph is required for Lean RPC
pip install git+https://github.com/stanford-centaur/PyPantograph

# Install a CUDA-enabled torch build (adjust the index URL for your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
```

### Option 2: From Source (development)

```sh
git clone https://github.com/lean-dojo/LeanDojo-v2.git
cd LeanDojo-v2

python -m venv .venv
source .venv/bin/activate

pip install --upgrade pip
pip install -e ".[dev]"
pip install git+https://github.com/stanford-centaur/PyPantograph
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
```

> Tip: You can use [uv](https://github.com/astral-sh/uv) (`uv pip install lean-dojo-v2`) as an alternative Python package manager.

## Environment Setup

1. **GitHub Access Token (required)**  

   The tracing pipeline calls the GitHub API extensively. Create a personal access token and export it before running any agent:

   ```sh
   export GITHUB_ACCESS_TOKEN=<your_github_token>
   ```

2. **Hugging Face Token (optional but needed for gated models)**  

   ```sh
   export HF_TOKEN=<your_hf_token>
   ```

3. **Working directories**  

   By default all datasets, caches, and checkpoints live under `/raid`. Change the layout by editing `lean_dojo_v2/utils/constants.py` or by pointing `RAID_DIR` to faster storage.

4. **Lean toolchains**  

   Ensure `elan` is configured and Lean 4 (e.g., `leanprover/lean4:nightly`) is available on your `$PATH`. The tracing scripts look under `~/.elan/toolchains/`.
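
Before launching a long tracing or training run, it can help to confirm that these settings are in place. The following is a minimal preflight sketch using only the standard library; `check_environment` is a hypothetical helper written for this README, not part of LeanDojo-v2, and it only checks the two environment variables and the toolchain directory mentioned above.

```python
import os
from pathlib import Path


def check_environment(env, toolchain_dir):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if not env.get("GITHUB_ACCESS_TOKEN"):
        problems.append("GITHUB_ACCESS_TOKEN is not set (required for tracing)")
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set (only needed for gated models)")
    if not Path(toolchain_dir).is_dir():
        problems.append(f"no elan toolchains found under {toolchain_dir}")
    return problems


if __name__ == "__main__":
    # Check the real environment and the default elan location.
    toolchains = Path.home() / ".elan" / "toolchains"
    for problem in check_environment(os.environ, toolchains):
        print("warning:", problem)
```

Passing the environment and directory as arguments keeps the check easy to test without touching the real shell configuration.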

## Quick Start

```python
from lean_dojo_v2.agent.hf_agent import HFAgent
from lean_dojo_v2.trainer.sft_trainer import SFTTrainer

url = "https://github.com/durant42040/lean4-example"
commit = "3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2"

trainer = SFTTrainer(
    model_name="deepseek-ai/DeepSeek-Prover-V2-7B",
    output_dir="outputs-deepseek",
    epochs_per_repo=1,
    batch_size=2,
    lr=2e-5,
)

agent = HFAgent(trainer=trainer)
agent.setup_github_repository(url=url, commit=commit)
agent.train()
agent.prove()
```

This example:

1. Downloads and traces the target Lean repository + commit.

2. Builds a supervised dataset from sorry theorems.

3. Fine-tunes the specified Hugging Face model (optionally with LoRA).

4. Launches an `HFProver` backed by Pantograph to search for proofs.

## Tracing and Dataset Generation

The `lean_dojo_v2/lean_dojo/data_extraction` package powers repository tracing:

- `lean.py` clones repositories (GitHub, remote, or local), validates Lean versions, and normalizes URLs.

- `trace.py` drives Lean with the custom `ExtractData.lean` instrumented module to capture theorem states.

- `dataset.py` converts traced files to JSONL datasets ready for trainers.

- `cache.py` memoizes repository metadata to avoid redundant downloads.

- `traced_data.py` exposes typed wrappers for traced AST nodes and sorrys.

Typical usage:

```python
from lean_dojo_v2.database import DynamicDatabase

url = "https://github.com/durant42040/lean4-example"
commit = "3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2"

database = DynamicDatabase()

database.trace_repository(
    url=url,
    commit=commit,
    build_deps=False,
)
```

The `build_deps` option controls whether LeanDojo extracts premises from the repository's external dependencies. It defaults to `False`, but it must be set to `True` if you use the traced data to train LeanAgent. The generated artifacts flow into the `DynamicDatabase`, which keeps repositories sorted by difficulty and appends new `sorry`s without retracing everything.
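
To make the "sorted by difficulty, append without retracing" idea concrete, here is a toy sketch. It is not the `DynamicDatabase` implementation: `RepoEntry`, `ToyRepoIndex`, and the numeric difficulty score are hypothetical names chosen for illustration, and the real schema may differ.

```python
import bisect
from dataclasses import dataclass, field


@dataclass(order=True)
class RepoEntry:
    difficulty: float                                   # hypothetical score
    url: str = field(compare=False)
    sorries: set = field(default_factory=set, compare=False)


class ToyRepoIndex:
    def __init__(self):
        self._entries = []  # always kept sorted by difficulty

    def add_repo(self, url, difficulty):
        entry = RepoEntry(difficulty, url)
        bisect.insort(self._entries, entry)  # sorted insert, no full re-sort
        return entry

    def add_sorries(self, url, theorem_names):
        """Record only theorems not seen before and return the new ones."""
        entry = next(e for e in self._entries if e.url == url)
        new = set(theorem_names) - entry.sorries
        entry.sorries |= new
        return sorted(new)

    def curriculum(self):
        """Repository URLs ordered from easiest to hardest."""
        return [e.url for e in self._entries]
```

Because entries compare on difficulty alone, inserting a new repository never disturbs the ones already recorded, and repeated calls to `add_sorries` only report theorems that were not already known.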

## Working with Agents and Trainers

### Agents

Agents orchestrate the full workflow of repository setup, training, and theorem proving. Each agent pairs a trainer with a compatible prover.

#### `HFAgent`

Uses Hugging Face models fine-tuned with `SFTTrainer` or `GRPOTrainer` for theorem proving. Loads checkpoints locally and uses `HFProver` for proof search. Ideal for training custom models on your traced repositories. Does not build Lean dependencies by default.

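```python
from lean_dojo_v2.agent.hf_agent import HFAgent
from lean_dojo_v2.trainer.sft_trainer import SFTTrainer

# Same example repository and commit as in the Quick Start
url = "https://github.com/durant42040/lean4-example"
commit = "3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2"

# See the Quick Start for the remaining trainer arguments
trainer = SFTTrainer(model_name="deepseek-ai/DeepSeek-Prover-V2-7B", output_dir="outputs-deepseek")
agent = HFAgent(trainer=trainer)
agent.setup_github_repository(url=url, commit=commit)
agent.train()
agent.prove()
```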

#### `ExternalAgent`

Uses the Hugging Face Inference API to access large models like DeepSeek-Prover-V2-671B without local model loading. Pairs with `ExternalProver` for whole-proof generation or proof search. Best for quick experiments or when you don't have GPU resources for local inference.

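```python
from lean_dojo_v2.agent.external_agent import ExternalAgent

# Same example repository and commit as in the Quick Start
url = "https://github.com/durant42040/lean4-example"
commit = "3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2"

agent = ExternalAgent()
agent.setup_github_repository(url=url, commit=commit)
agent.prove()
```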

#### `LeanAgent`

Implements the lifelong learning pipeline with retrieval-augmented generation. Uses `RetrievalTrainer` to train premise retrievers, then pairs with `RetrievalProver` for retrieval-augmented tactic generation. Maintains repository curricula and builds Lean dependencies by default.

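```python
from lean_dojo_v2.agent.lean_agent import LeanAgent

# Same example repository and commit as in the Quick Start
url = "https://github.com/durant42040/lean4-example"
commit = "3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2"

agent = LeanAgent()
agent.setup_github_repository(url=url, commit=commit)
agent.train()
agent.prove()
```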

### Trainers

#### Supervised Fine-Tuning (`SFTTrainer`)

- Accepts any Hugging Face causal LM identifier.

- Supports LoRA by passing a `peft.LoraConfig`.

- Key arguments: `epochs_per_repo`, `batch_size`, `max_seq_len`, `lr`, `warmup_steps`, `gradient_checkpointing`.

- Produces checkpoints under `output_dir` that the `HFProver` consumes.

#### GRPO Trainer (`GRPOTrainer`)

- Implements Group Relative Policy Optimization for reinforcement-style refinement.

- Accepts `reference_model`, `reward_weights`, and `kl_beta` settings.

- Useful for improving search policies on curated theorem batches.
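
As background on the "group relative" part of GRPO: several completions are sampled for the same prompt, and each completion's reward is normalized against the others in its group. The sketch below shows that normalization in isolation. It is a generic illustration, not the code `GRPOTrainer` runs, and it uses the population standard deviation; implementations differ on that detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled proof attempts for one theorem: 1.0 = verified, 0.0 = failed.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

For the four rewards above, the successful attempts receive an advantage close to +1 and the failed ones close to -1. A group in which every attempt gets the same reward yields all-zero advantages, so it contributes no learning signal.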

#### Retrieval Trainer (`RetrievalTrainer`)

- Trains the dense retriever that scores prior proofs from the corpus.

- Used by `LeanAgent` to build retrieval-augmented generation models.

- Requires indexed corpus and generator checkpoints.

Each agent inherits from `BaseAgent`, so you can implement your own agent by overriding `_get_build_deps()` and `_setup_prover()` to register a new trainer/prover pair.
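
The shape of such a subclass can be sketched as follows. To stay runnable without the package installed, the example defines a local stand-in, `BaseAgentSketch`, instead of importing the real `BaseAgent`. The hook signatures, the return value of `_setup_prover()`, and the `prove(goal)` driver are assumptions made for illustration; check `lean_dojo_v2/agent/` for the actual interface.

```python
from abc import ABC, abstractmethod


class BaseAgentSketch(ABC):
    """Stand-in for BaseAgent; only the two override hooks are modeled."""

    def __init__(self):
        self.prover = None

    @abstractmethod
    def _get_build_deps(self) -> bool:
        """Whether tracing should also build the repo's Lean dependencies."""

    @abstractmethod
    def _setup_prover(self):
        """Create the prover that prove() will drive (assumed to return it)."""

    def prove(self, goal):
        # Lazily build the prover the first time it is needed.
        if self.prover is None:
            self.prover = self._setup_prover()
        return self.prover(goal)


class EchoAgent(BaseAgentSketch):
    """Hypothetical agent whose prover just reports the goal it was given."""

    def _get_build_deps(self) -> bool:
        return False  # skip dependency builds, as HFAgent does by default

    def _setup_prover(self):
        return lambda goal: f"-- no proof found for: {goal}"
```

A real subclass would return one of the provers described under Proving Theorems from `_setup_prover()` rather than a lambda.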

## LeanProgress Step-Prediction

- Generate a JSONL dataset with remaining-step targets (or replace it with your own LeanProgress export):

  ```sh
  python -m lean_dojo_v2.lean_progress.create_sample_dataset --output raid/data/sample_leanprogress_dataset.jsonl
  ```

- Fine-tune a regression head that predicts `steps_remaining`:

  ```python
  from pathlib import Path

  from lean_dojo_v2.trainer.progress_trainer import ProgressTrainer

  sample_dataset_path = Path("raid/data/sample_leanprogress_dataset.jsonl")

  trainer = ProgressTrainer(
      model_name="bert-base-uncased",
      data_path=str(sample_dataset_path),
      output_dir="outputs-progress",
  )

  trainer.train()
  ```
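
To illustrate what a remaining-step target is, the sketch below labels each proof state of a finished proof with the number of tactics still needed to close it and serializes the result as JSONL. The record layout (`state` and `steps_remaining` keys) and the example states are assumptions for illustration; the files written by `create_sample_dataset` may use different fields.

```python
import json


def remaining_step_targets(states):
    """Label the state before each tactic with the number of tactics left."""
    total = len(states)
    return [
        {"state": state, "steps_remaining": total - i}
        for i, state in enumerate(states)
    ]


def to_jsonl(records):
    """Serialize records one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical proof states observed before each of three tactics.
states = ["⊢ p ∧ q → q ∧ p", "h : p ∧ q ⊢ q ∧ p", "h : p ∧ q ⊢ q"]
records = remaining_step_targets(states)
jsonl_text = to_jsonl(records)
```

Here the first state is three tactics away from a finished proof and the last is one away, which is the quantity the regression head is trained to predict.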

## Proving Theorems

LeanDojo-v2 provides three prover implementations, each for different use cases:

### `HFProver`

Loads a fine-tuned Hugging Face model from a local checkpoint (full models and LoRA adapters are both supported) and generates tactics directly. Use it with locally trained Hugging Face models, e.g. those produced by `SFTTrainer` or `GRPOTrainer`.

### `ExternalProver`

Performs inference with the Hugging Face Inference API to access large models without local GPU resources. Defaults to DeepSeek-Prover-V2-671B. Supports both proof search and whole-proof generation.

### `RetrievalProver`

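Generates tactics with retrieval augmentation, using the retriever trained by `RetrievalTrainer`. It is used directly with `LeanAgent`.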

### Proof Methods

LeanDojo-v2 supports two methods for theorem proving:

- **Whole-proof generation**: generates a complete proof in one forward pass of the prover.

  ```python
  from lean_dojo_v2.prover import ExternalProver

  theorem = "theorem my_and_comm : ∀ {p q : Prop}, And p q → And q p := by"

  prover = ExternalProver()
  proof = prover.generate_whole_proof(theorem)
  ```

- **Proof search**: generate tactics sequentially and update the goal state through interaction with Pantograph until the proof is complete.

  ```python
  from pantograph.server import Server

  from lean_dojo_v2.prover import HFProver

  server = Server()
  prover = HFProver(ckpt_path="outputs-deepseek")

  result, used_tactics = prover.search(
      server=server, goal="∀ {p q : Prop}, p ∧ q → q ∧ p", verbose=False
  )
  ```
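
The control flow of proof search can be sketched without Lean at all. The toy below runs a depth-first search over string goals, with a hand-written `apply_tactic` standing in for Pantograph and a fixed `propose` standing in for the model. All three functions are hypothetical and say nothing about how `HFProver.search` is implemented; they only show the generate, apply, update loop.

```python
def search(goal, propose, apply_tactic, max_steps=50):
    """Depth-first search; a state is a tuple of open goals, empty = proved."""
    stack = [((goal,), [])]
    steps = 0
    while stack and steps < max_steps:
        state, used = stack.pop()
        if not state:
            return True, used
        steps += 1
        for tactic in propose(state[0]):
            next_state = apply_tactic(state, tactic)
            if next_state is not None:  # None means the tactic failed
                stack.append((next_state, used + [tactic]))
    return False, []


def propose(goal):
    """Stand-in for the model: always suggest the same two tactics."""
    return ["trivial", "constructor"]


def apply_tactic(state, tactic):
    """Stand-in for the Lean server: rewrite the first open goal."""
    goal, rest = state[0], state[1:]
    if tactic == "constructor" and " ∧ " in goal:
        left, right = goal.split(" ∧ ", 1)
        return (left, right) + rest
    if tactic == "trivial" and goal == "True":
        return rest
    return None


result, used_tactics = search("True ∧ True", propose, apply_tactic)
```

In a real prover the candidate tactics come from the language model, each application is checked by Lean, and the search is bounded by a budget such as `max_steps` here.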

## Testing

We use `pytest` for regression coverage.

```sh
pip install -e ".[dev]"          # make sure dev extras like pytest/trl are present
export GITHUB_ACCESS_TOKEN=<your_github_token>
export HF_TOKEN=<your_hf_token>     # only required for tests touching HF APIs
pytest -v
```

## Troubleshooting & Tips

- **401 Bad Credentials / rate limits**: Ensure `GITHUB_ACCESS_TOKEN` is exported and has `repo` + `read:org` scopes.

- **Lean tracing failures**: Confirm that the repo’s Lean version exists locally (`elan toolchain install <toolchain>`).

- **Missing CUDA libraries**: Install the PyTorch wheel that matches your driver and CUDA version.

- **Dataset location**: The default `raid/` directory can grow large. Point it to high-throughput storage or use symlinks.

- **Pantograph errors**: Reinstall Pantograph from source (`pip install git+https://github.com/stanford-centaur/PyPantograph`) whenever Lean upstream changes.

## Contributing

Issues and pull requests are welcome! Please:

1. Open an issue describing the bug or feature.

2. Run formatters (`black`, `isort`) and `pytest` before submitting.

3. Mention if your change touches Lean tracing files so reviewers can re-generate artifacts.

## License

LeanDojo-v2 is released under the Apache 2.0 License. See `LICENSE` for details.