{"id":37598295,"url":"https://github.com/lean-dojo/leandojo-v2","last_synced_at":"2026-04-26T09:04:38.592Z","repository":{"id":331207545,"uuid":"1077761409","full_name":"lean-dojo/LeanDojo-v2","owner":"lean-dojo","description":"LeanDojo-v2 is an end-to-end framework for training, evaluating, and deploying AI-assisted theorem provers for Lean 4. ","archived":false,"fork":false,"pushed_at":"2025-12-31T08:15:34.000Z","size":116,"stargazers_count":5,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-04T11:22:56.764Z","etag":null,"topics":["lean4","library","machine-learning","theorem-proving"],"latest_commit_sha":null,"homepage":"https://leandojo.org/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lean-dojo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-16T17:44:32.000Z","updated_at":"2026-01-04T07:08:21.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/lean-dojo/LeanDojo-v2","commit_stats":null,"previous_names":["lean-dojo/leandojo-v2"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/lean-dojo/LeanDojo-v2","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lean-dojo%2FLeanDojo-v2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lean-dojo%2FLeanDojo-v2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/lean-dojo%2FLeanDojo-v2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lean-dojo%2FLeanDojo-v2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lean-dojo","download_url":"https://codeload.github.com/lean-dojo/LeanDojo-v2/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lean-dojo%2FLeanDojo-v2/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28478049,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T06:30:42.265Z","status":"ssl_error","status_checked_at":"2026-01-16T06:30:16.248Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["lean4","library","machine-learning","theorem-proving"],"created_at":"2026-01-16T10:00:18.218Z","updated_at":"2026-04-26T09:04:38.584Z","avatar_url":"https://github.com/lean-dojo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LeanDojo-v2\n\nLeanDojo-v2 is an end-to-end framework for training, evaluating, and deploying AI-assisted theorem provers for Lean 4. It combines repository tracing, lifelong dataset management, retrieval-augmented agents, Hugging Face fine-tuning, and external inference APIs into one toolkit.\n\n\n## Table of Contents\n\n1. [Overview](#overview)\n2. [Key Features](#key-features)\n3. 
[Repository Layout](#repository-layout)\n4. [Requirements](#requirements)\n5. [Installation](#installation)\n6. [Environment Setup](#environment-setup)\n7. [Quick Start](#quick-start)\n8. [Working with Agents and Trainers](#working-with-agents-and-trainers)\n9. [Tracing and Dataset Generation](#tracing-and-dataset-generation)\n10. [LeanProgress Step-Prediction](#leanprogress-step-prediction)\n11. [Proving Theorems](#proving-theorems)\n12. [Testing](#testing)\n13. [Troubleshooting \u0026 Tips](#troubleshooting--tips)\n14. [Contributing](#contributing)\n15. [License](#license)\n\n\n## Overview\n\nLeanDojo-v2 extends the original LeanDojo stack with the LeanAgent lifelong learning pipeline. It automates the entire loop of:\n\n1. Cloning Lean repositories (GitHub or local) and tracing them with Lean instrumentation.\n2. Storing structured theorem information in a dynamic database.\n3. Training agent policies with supervised fine-tuning (SFT), GRPO-style RL, or retrieval objectives.\n4. Driving Pantograph-based provers to fill in sorrys or verify solutions.\n5. Using the Hugging Face Inference API for large-model inference.\n\nThe codebase is modular: you can reuse the tracing pipeline without the agents, swap in custom trainers, or stand up your own inference service via the external API layer.\n\n\n## Key Features\n\n- **Unified Agent Abstractions**: `BaseAgent` orchestrates repository setup, training, and proving. 
Concrete implementations (`HFAgent`, `LeanAgent`, and `ExternalAgent`) tailor the workflow to Hugging Face models, retrieval-based provers, or REST-backed models.\n- **Powerful Trainers**: `SFTTrainer`, `GRPOTrainer`, and `RetrievalTrainer` cover LoRA-enabled supervised fine-tuning, group-relative policy optimization, and retriever-only curriculum learning.\n- **Multi-Modal Provers**: `HFProver`, `RetrievalProver`, and `ExternalProver` run on top of Pantograph’s Lean RPC server to search for tactics, generate whole proofs, or delegate to custom models.\n- **Lean Tracing Pipeline**: `lean_dojo` includes the Lean 4 instrumentation (`ExtractData.lean`) and Python utilities to trace commits, normalize ASTs, and cache proof states.\n- **Dynamic Repository Database**: `database` tracks repositories, theorems, curriculum difficulty, and sorry status, enabling lifelong training schedules.\n- **External API**: The `external_api` folder exposes HTTP endpoints (FastAPI + uvicorn) and Lean frontend snippets so you can query LLMs from Lean editors.\n\n\n## Repository Layout\n\n| Path | Description |\n|------|-------------|\n| `lean_dojo_v2/agent/` | Base class plus `HFAgent`, `LeanAgent`, and helpers to manage repositories and provers. |\n| `lean_dojo_v2/trainer/` | SFT, GRPO, and retrieval trainers with Hugging Face + DeepSpeed integration. |\n| `lean_dojo_v2/prover/` | Pantograph-based prover implementations (HF, retrieval, external). |\n| `lean_dojo_v2/lean_dojo/` | Lean tracing, dataset generation, caching, and AST utilities. |\n| `lean_dojo_v2/lean_agent/` | Lifelong learning pipeline (configs, database, retrieval stack, generator). |\n| `lean_dojo_v2/external_api/` | LeanCopilot code (Lean + Python server) to query external models. |\n| `lean_dojo_v2/utils/` | Shared helpers for Git, filesystem operations, and constants. |\n| `lean_dojo_v2/tests/` | Pytest regression suite. 
|\n\nFor deeper documentation on the lifelong learning component, see `lean_dojo_v2/lean_agent/README.md`.\n\n\n## Requirements\n\n- Python ≥ 3.11.\n- CUDA-capable GPU for training and inference (tested with CUDA 12.6).\n- Git ≥ 2.25 and `wget`.\n- [elan](https://github.com/leanprover/elan) Lean toolchain to trace repositories locally.\n- Adequate disk space for the `raid/` working directory (datasets, checkpoints, traces).\n\nPython dependencies are declared in `pyproject.toml` and include PyTorch, PyTorch Lightning, Transformers, DeepSpeed, TRL, PEFT, and more.\n\n\n## Installation\n\n### Option 1: From PyPI\n\n```sh\n# Install the core package\npip install lean-dojo-v2\n\n# Pantograph is required for Lean RPC\npip install git+https://github.com/stanford-centaur/PyPantograph\n\n# Install a CUDA-enabled torch build (adjust the index URL for your CUDA version)\npip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126\n```\n\n### Option 2: From Source (development)\n\n```sh\ngit clone https://github.com/lean-dojo/LeanDojo-v2.git\ncd LeanDojo-v2\npython -m venv .venv\nsource .venv/bin/activate\npip install --upgrade pip\npip install -e \".[dev]\"\npip install git+https://github.com/stanford-centaur/PyPantograph\npip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126\n```\n\n\u003e Tip: You can use [uv](https://github.com/astral-sh/uv) (`uv pip install lean-dojo-v2`) as an alternative Python package manager.\n\n\n## Environment Setup\n\n1. **GitHub Access Token (required)**  \n   The tracing pipeline calls the GitHub API extensively. Create a personal access token and export it before running any agent:\n\n   ```sh\n   export GITHUB_ACCESS_TOKEN=\u003ctoken\u003e\n   ```\n\n2. **Hugging Face Token (optional but needed for gated models)**  \n\n   ```sh\n   export HF_TOKEN=\u003chf-token\u003e\n   ```\n\n3. 
**Working directories**  \n   By default all datasets, caches, and checkpoints live under `\u003crepo\u003e/raid`. Change the layout by editing `lean_dojo_v2/utils/constants.py` or by pointing `RAID_DIR` to faster storage.\n\n4. **Lean toolchains**  \n   Ensure `elan` is configured and Lean 4 (e.g., `leanprover/lean4:nightly`) is available on your `$PATH`. The tracing scripts look under `~/.elan/toolchains/`.\n\n\n## Quick Start\n\n```python\nfrom lean_dojo_v2.agent.hf_agent import HFAgent\nfrom lean_dojo_v2.trainer.sft_trainer import SFTTrainer\n\nurl = \"https://github.com/durant42040/lean4-example\"\ncommit = \"3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2\"\n\ntrainer = SFTTrainer(\n    model_name=\"deepseek-ai/DeepSeek-Prover-V2-7B\",\n    output_dir=\"outputs-deepseek\",\n    epochs_per_repo=1,\n    batch_size=2,\n    lr=2e-5,\n)\n\nagent = HFAgent(trainer=trainer)\nagent.setup_github_repository(url=url, commit=commit)\nagent.train()\nagent.prove()\n```\n\nThis example:\n\n1. Downloads and traces the target Lean repository + commit.\n2. Builds a supervised dataset from sorry theorems.\n3. Fine-tunes the specified Hugging Face model (optionally with LoRA).\n4. 
Launches an `HFProver` backed by Pantograph to search for proofs.\n\n## Tracing and Dataset Generation\n\nThe `lean_dojo_v2/lean_dojo/data_extraction` package powers repository tracing:\n\n- `lean.py` clones repositories (GitHub, remote, or local), validates Lean versions, and normalizes URLs.\n- `trace.py` drives Lean with the custom `ExtractData.lean` instrumented module to capture theorem states.\n- `dataset.py` converts traced files to JSONL datasets ready for trainers.\n- `cache.py` memoizes repository metadata to avoid redundant downloads.\n- `traced_data.py` exposes typed wrappers for traced AST nodes and sorrys.\n\nTypical usage:\n\n```python\nfrom lean_dojo_v2.database import DynamicDatabase\n\nurl = \"https://github.com/durant42040/lean4-example\"\ncommit = \"3e23ab0bfdcfdbd5b11ab53c2cd8b5d16492e9c2\"\n\ndatabase = DynamicDatabase()\n\ndatabase.trace_repository(\n    url=url,\n    commit=commit,\n    build_deps=False,\n)\n```\n\nThe `build_deps` option decides whether LeanDojo extracts premises from the repository's external dependencies; it is set to `False` by default. However, if you are using the traced data to train LeanAgent, it must be set to `True`. The generated artifacts flow into the `DynamicDatabase`, which keeps repositories sorted by difficulty and appends new sorrys without retracing everything.\n\n## Working with Agents and Trainers\n\n### Agents\n\nAgents orchestrate the full workflow of repository setup, training, and theorem proving. Each agent pairs a trainer with a compatible prover.\n\n#### `HFAgent`\n\nUses Hugging Face models fine-tuned with `SFTTrainer` or `GRPOTrainer` for theorem proving. Loads checkpoints locally and uses `HFProver` for proof search. Ideal for training custom models on your traced repositories. 
Does not build Lean dependencies by default.\n\n```python\nfrom lean_dojo_v2.agent.hf_agent import HFAgent\nfrom lean_dojo_v2.trainer.sft_trainer import SFTTrainer\n\ntrainer = SFTTrainer(model_name=\"deepseek-ai/DeepSeek-Prover-V2-7B\", ...)\nagent = HFAgent(trainer=trainer)\nagent.setup_github_repository(url, commit)\nagent.train()  \nagent.prove()   \n```\n\n#### `ExternalAgent`\n\nUses the Hugging Face Inference API to access large models like DeepSeek-Prover-V2-671B without local model loading. Pairs with `ExternalProver` for whole-proof generation or proof search. Best for quick experiments or when you don't have GPU resources for local inference.\n\n```python\nfrom lean_dojo_v2.agent.external_agent import ExternalAgent\n\nagent = ExternalAgent()\nagent.setup_github_repository(url, commit)\nagent.prove()  \n```\n\n#### `LeanAgent`\n\nImplements the lifelong learning pipeline with retrieval-augmented generation. Uses `RetrievalTrainer` to train premise retrievers, then pairs with `RetrievalProver` for retrieval-augmented tactic generation. 
Maintains repository curricula and builds Lean dependencies by default.\n\n```python\nfrom lean_dojo_v2.agent.lean_agent import LeanAgent\n\nagent = LeanAgent()\nagent.setup_github_repository(url, commit)\nagent.train()  \nagent.prove()   \n```\n\n### Trainers\n\n#### Supervised Fine-Tuning (`SFTTrainer`)\n\n- Accepts any Hugging Face causal LM identifier.\n- Supports LoRA by passing a `peft.LoraConfig`.\n- Key arguments: `epochs_per_repo`, `batch_size`, `max_seq_len`, `lr`, `warmup_steps`, `gradient_checkpointing`.\n- Produces checkpoints under `output_dir` that the `HFProver` consumes.\n\n#### GRPO Trainer (`GRPOTrainer`)\n\n- Implements Group Relative Policy Optimization for reinforcement-style refinement.\n- Accepts `reference_model`, `reward_weights`, and `kl_beta` settings.\n- Useful for improving search policies on curated theorem batches.\n\n#### Retrieval Trainer (`RetrievalTrainer`)\n\n- Trains the dense retriever that scores prior proofs from the corpus.\n- Used by `LeanAgent` to build retrieval-augmented generation models.\n- Requires indexed corpus and generator checkpoints.\n\nEach agent inherits `BaseAgent`, so you can implement your own by overriding `_get_build_deps()` and `_setup_prover()` to register new trainer/prover pairs.\n\n## LeanProgress Step-Prediction\n\n- Generate a JSONL dataset with remaining-step targets (or replace it with your own LeanProgress export):\n\n  ```sh\n  python -m lean_dojo_v2.lean_progress.create_sample_dataset --output raid/data/sample_leanprogress_dataset.jsonl\n  ```\n\n- Fine-tune a regression head that predicts `steps_remaining`:\n\n  ```python\n  from pathlib import Path\n\n  from lean_dojo_v2.trainer.progress_trainer import ProgressTrainer\n\n  sample_dataset_path = Path(\"raid/data/sample_leanprogress_dataset.jsonl\")\n\n  trainer = ProgressTrainer(\n      model_name=\"bert-base-uncased\",\n      data_path=str(sample_dataset_path),\n      output_dir=\"outputs-progress\",\n  )\n\n  trainer.train()\n  ```\n\n## 
Proving Theorems\n\nLeanDojo-v2 provides three prover implementations, each suited to a different use case:\n\n### `HFProver`\n\nLoads a fine-tuned Hugging Face model from a local checkpoint (full models and LoRA adapters are both supported) and generates tactics directly. Use it with locally trained Hugging Face models (e.g., those produced by `SFTTrainer` or `GRPOTrainer`).\n\n### `ExternalProver`\n\nPerforms inference with the Hugging Face Inference API to access large models without local GPU resources. Defaults to DeepSeek-Prover-V2-671B. Supports both proof search and whole-proof generation.\n\n### `RetrievalProver`\n\nUsed directly with `LeanAgent` for retrieval-augmented tactic generation.\n\n### Proof Methods\n\nLeanDojo-v2 supports two methods for theorem proving:\n\n- **Whole-proof generation**: generate a complete proof in a single pass of the prover.\n\n  ```python\n  from lean_dojo_v2.prover import ExternalProver\n\n  theorem = \"theorem my_and_comm : ∀ {p q : Prop}, And p q → And q p := by\"\n  prover = ExternalProver()\n  proof = prover.generate_whole_proof(theorem)\n  ```\n\n- **Proof search**: generate tactics sequentially and update the goal state through interaction with Pantograph until the proof is complete.\n\n  ```python\n  from pantograph.server import Server\n  from lean_dojo_v2.prover import HFProver\n\n  server = Server()\n  prover = HFProver(ckpt_path=\"outputs-deepseek\")\n\n  result, used_tactics = prover.search(\n      server=server, goal=\"∀ {p q : Prop}, p ∧ q → q ∧ p\", verbose=False\n  )\n  ```\n\n## Testing\n\nWe use `pytest` for regression coverage.\n\n```sh\npip install -e .[dev]          # make sure dev extras like pytest/trl are present\nexport GITHUB_ACCESS_TOKEN=\u003ctoken\u003e\nexport HF_TOKEN=\u003chf-token\u003e     # only required for tests touching HF APIs\npytest -v\n```\n\n## Troubleshooting \u0026 Tips\n\n- **401 Bad Credentials / rate limits**: Ensure `GITHUB_ACCESS_TOKEN` is exported and has `repo` + `read:org` scopes.\n- **Lean tracing failures**: Confirm that the repo’s Lean version 
exists locally (`elan toolchain install \u003cversion\u003e`).\n- **Missing CUDA libraries**: Install the PyTorch wheel that matches your driver and CUDA version.\n- **Dataset location**: The default `raid/` directory can grow large. Point it to high-throughput storage or use symlinks.\n- **Pantograph errors**: Reinstall Pantograph from source (`pip install git+https://github.com/stanford-centaur/PyPantograph`) whenever Lean upstream changes.\n\n\n## Contributing\n\nIssues and pull requests are welcome! Please:\n\n1. Open an issue describing the bug or feature.\n2. Run formatters (`black`, `isort`) and `pytest` before submitting.\n3. Mention if your change touches Lean tracing files so reviewers can re-generate artifacts.\n\n## License\n\nLeanDojo-v2 is released under the MIT License. See `LICENSE` for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flean-dojo%2Fleandojo-v2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flean-dojo%2Fleandojo-v2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flean-dojo%2Fleandojo-v2/lists"}