{"id":49456189,"url":"https://github.com/gafnts/agentic-kie-evals","last_synced_at":"2026-04-30T06:04:40.605Z","repository":{"id":349261502,"uuid":"1200661183","full_name":"gafnts/agentic-kie-evals","owner":"gafnts","description":"Benchmarking agentic and single-pass extraction strategies across LLM providers on the Kleister NDA dataset","archived":false,"fork":false,"pushed_at":"2026-04-12T20:12:58.000Z","size":47089,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-12T21:21:48.863Z","etag":null,"topics":["agentic-ai","agentic-kie","document-ai","evals","key-information-extraction","kie","langsmith"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gafnts.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-03T17:18:19.000Z","updated_at":"2026-04-12T20:13:44.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/gafnts/agentic-kie-evals","commit_stats":null,"previous_names":["gafnts/agentic-kie-evals"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/gafnts/agentic-kie-evals","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gafnts%2Fagentic-kie-evals","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gafnts%2Fagentic-kie-evals/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gafnts%2Fagentic-kie-evals/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gafnts%2Fagentic-kie-evals/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gafnts","download_url":"https://codeload.github.com/gafnts/agentic-kie-evals/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gafnts%2Fagentic-kie-evals/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32456168,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-29T22:27:22.272Z","status":"online","status_checked_at":"2026-04-30T02:00:05.929Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic-ai","agentic-kie","document-ai","evals","key-information-extraction","kie","langsmith"],"created_at":"2026-04-30T06:04:34.978Z","updated_at":"2026-04-30T06:04:39.416Z","avatar_url":"https://github.com/gafnts.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003eAgentic KIE Evals\u003c/h1\u003e\n\u003cp align=\"center\"\u003e\n  Benchmarking single-pass and agentic extraction strategies across LLM providers on the Kleister NDA dataset.\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://github.com/gafnts/agentic-kie-evals/actions/workflows/ci.yml\"\u003e\u003cimg src=\"https://github.com/gafnts/agentic-kie-evals/actions/workflows/ci.yml/badge.svg\" alt=\"CI\"\u003e\u003c/a\u003e\n\u003ca href=\"LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-MIT-blue.svg\" alt=\"License\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n\u003cp align=\"center\"\u003e\nExtracting structured fields from legal documents is deceptively hard. This project measures how well modern LLMs handle that on real NDA documents from the SEC Edgar database. The benchmark covers three model families (Claude, Gemini, and GPT) and scores each run in LangSmith using exact and fuzzy F1 evaluators.\n\u003c/p\u003e\n\n## Contents\n\n- [Dataset](#dataset)\n- [Running the benchmark](#running-the-benchmark)\n- [Evaluators](#evaluators)\n- [Contributing](#contributing)\n\n---\n\n## Dataset\n\nThis project uses the [Kleister NDA](https://github.com/applicaai/kleister-nda) dataset from Applica AI, which consists of NDA documents sourced from SEC Edgar, annotated with four entity types: `effective_date`, `jurisdiction`, `party`, and `term`.\n\nDataset preprocessing and delivery is handled by the Python package [kleister-nda-preparation](https://github.com/gafnts/kleister-nda-preparation). The preparation pipeline reads the original TSV partitions, transforms raw labels into structured records validated against a Pydantic schema, relocates the corresponding PDF documents, and writes the results as partitioned Parquet files.\n\n\u003e [!NOTE]\n\u003e This step runs automatically as part of `make install`.\n\n### Uploading the dataset to LangSmith\n\nBefore running the benchmark, the preprocessed Parquet files and their PDF attachments need to be uploaded to [LangSmith](https://smith.langchain.com/). The [upload_dataset.py](src/agentic_kie_evals/upload_dataset.py) module supports several behaviors:\n\n1. Dry run (validates parquet files and PDF paths, no API calls)\n```bash\nuv run python -m agentic_kie_evals.upload_dataset --dry-run\n```\n\n2. Upload all partitions\n```bash\nuv run python -m agentic_kie_evals.upload_dataset\n```\n\n3. Upload specific partitions\n```bash\nuv run python -m agentic_kie_evals.upload_dataset --partitions train dev-0\n```\n\n4. Delete and recreate the dataset from scratch\n```bash\nuv run python -m agentic_kie_evals.upload_dataset --recreate\n```\n\n\u003e [!TIP]\n\u003e The upload script is idempotent: re-running it is safe. It reuses an existing dataset and deterministic example IDs prevent duplicates.\n\n---\n\n## Running the benchmark\n\nThe benchmark runner evaluates the full experiment matrix (`model × strategy × modality`) against the LangSmith dataset. Each run is scored by the evaluators and logged back to LangSmith.\n\n1. Dry run (print the experiment matrix without making any API calls)\n```bash\nuv run python -m agentic_kie_evals.run_benchmark --dry-run\n```\n\n2. Single quick test (one model, one strategy, 10 examples)\n```bash\nuv run python -m agentic_kie_evals.run_benchmark \\\n    --tier lite --model gemini --strategy single_pass --limit 10\n```\n\n3. Full matrix, lite tier (cost-optimised models) on the dev split\n```bash\nuv run python -m agentic_kie_evals.run_benchmark\n```\n\n4. Full matrix, standard tier (full-capability models) on the dev split\n```bash\nuv run python -m agentic_kie_evals.run_benchmark --tier standard\n```\n\n### CLI reference\n\n| Flag | Choices | Default | Description |\n|---|---|---|---|\n| `--tier` | `lite`, `standard`, `flagship` | `lite` | Model tier: cost-optimised, full-capability, or top-capability |\n| `--model` | `claude`, `gemini`, `gpt` | all | Restrict to a single model |\n| `--strategy` | `single_pass`, `agentic` | both | Restrict to a single extraction strategy |\n| `--split` | `train`, `dev`, `test` | `dev` | Dataset split to evaluate against |\n| `--limit` | int | none | Cap the number of examples evaluated |\n| `--max-concurrency` | int | `3` | Max concurrent evaluations |\n| `--max-retries` | int | `6` | Max retries per extractor call |\n| `--dry-run` | — | false | Print the experiment matrix and exit |\n\n\u003e [!NOTE]\n\u003e Modalities are configured via `SINGLE_PASS_MODALITIES` and `AGENTIC_MODALITIES` in [run_benchmark.py](src/agentic_kie_evals/run_benchmark.py).\n\n---\n\n## Evaluators\n\nEvaluators live in [evaluators.py](src/agentic_kie_evals/evaluators.py) and follow the LangSmith custom evaluator signature `(outputs, reference_outputs) -\u003e {\"key\": str, \"score\": float}`.\n\n| Evaluator | Field | Method | Score |\n|---|---|---|---|\n| `exact_effective_date_f1` | `effective_date` | Exact match | 0 or 1 |\n| `exact_jurisdiction_f1` | `jurisdiction` | Exact match | 0 or 1 |\n| `fuzzy_jurisdiction_f1` | `jurisdiction` | SequenceMatcher ≥ 0.85 | 0 or 1 |\n| `exact_term_f1` | `term` | Exact match | 0 or 1 |\n| `fuzzy_term_f1` | `term` | SequenceMatcher ≥ 0.85 | 0 or 1 |\n| `exact_party_f1` | `party` | Set F1, exact string | 0–1 continuous |\n| `fuzzy_party_f1` | `party` | Set F1, SequenceMatcher ≥ 0.85 | 0–1 continuous |\n| `exact_f1` | all fields | Macro-average of exact F1 scores | 0–1 continuous |\n| `fuzzy_f1` | all fields | Macro-average of fuzzy F1 scores | 0–1 continuous |\n\nNormalization (lowercasing, whitespace trimming, trailing-period stripping) is applied to both sides before comparison.\n\n---\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for the development workflow, available `make` targets, and the CI pipeline.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgafnts%2Fagentic-kie-evals","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgafnts%2Fagentic-kie-evals","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgafnts%2Fagentic-kie-evals/lists"}