https://github.com/igerber/causal-llm-eval

Black-box evaluation framework for LLM agent behavior in causal inference tasks

Host: GitHub
URL: https://github.com/igerber/causal-llm-eval
Owner: igerber
License: MIT
Created: 2026-05-10T16:28:58.000Z (about 1 month ago)
Default Branch: main
Last Pushed: 2026-05-10T19:05:10.000Z (about 1 month ago)
Last Synced: 2026-05-10T19:15:07.710Z (about 1 month ago)
Language: Python
Size: 103 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# causal-llm-eval

A black-box evaluation framework for measuring how LLM agents make methodology choices in causal inference tasks. It also tests whether library design measurably affects those choices, specifically LLM-targeted guidance surfaces such as `llms.txt`, fit-time warnings, native diagnostics, and pedagogical docstrings.
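To make "fit-time warning" concrete, here is a minimal sketch of what such a guidance surface could look like. Everything in it is hypothetical and for illustration only: `MethodologyWarning` and `fit_twfe` are invented names, not the API of this repo or of any library it evaluates.

```python
import warnings


class MethodologyWarning(UserWarning):
    """Hypothetical warning category for risky estimator/data combinations."""


def fit_twfe(first_treated_by_unit):
    """Hypothetical fit entry point; only the warning logic is sketched.

    first_treated_by_unit maps a unit id to its first treated period,
    or to None for never-treated units.
    """
    adoption_periods = {
        period for period in first_treated_by_unit.values() if period is not None
    }
    if len(adoption_periods) > 1:
        # Surface the methodology concern at the moment the agent fits the model.
        warnings.warn(
            "Staggered adoption detected: units adopt treatment in "
            f"{len(adoption_periods)} different periods. A two-way fixed effects "
            "estimate can be biased when effects vary across cohorts or over time.",
            MethodologyWarning,
            stacklevel=2,
        )
    return {"n_cohorts": len(adoption_periods)}
```

An agent calling `fit_twfe({"a": 3, "b": 5, "c": None})` would see the warning in its tool output. Whether that message changes its next methodology choice is the kind of effect the framework is meant to measure.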

## Status

Early development. The Phase 1 case study is in progress: it compares `diff-diff` with `statsmodels` on a synthetic staggered-adoption data-generating process (DGP), with N=15 cold-start agents per arm.
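For readers unfamiliar with the setup, the sketch below shows one way a staggered-adoption synthetic DGP could be generated. It is an illustrative assumption, not the dataset used in the case study: the function name, cohort timing, effect sizes, and noise levels are all invented here. Treatment effects grow with time since adoption, the setting in which a plain two-way fixed effects regression can be biased.

```python
import random


def simulate_staggered_panel(n_units=60, n_periods=8, cohorts=(3, 5, None),
                             base_effect=1.0, effect_growth=0.5, seed=0):
    """Return a balanced panel as a list of row dicts.

    cohorts holds the first treated period of each cohort; None marks a
    never-treated cohort. All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for unit in range(n_units):
        # Assign units to cohorts round-robin so cohort sizes are equal.
        first_treated = cohorts[unit % len(cohorts)]
        unit_effect = rng.gauss(0.0, 1.0)
        for period in range(n_periods):
            treated = first_treated is not None and period >= first_treated
            # The true effect grows with the number of periods since adoption.
            if treated:
                true_effect = base_effect + effect_growth * (period - first_treated)
            else:
                true_effect = 0.0
            outcome = unit_effect + 0.3 * period + true_effect + rng.gauss(0.0, 0.5)
            rows.append({
                "unit": unit,
                "period": period,
                "first_treated": first_treated,
                "treated": int(treated),
                "true_effect": true_effect,
                "y": outcome,
            })
    return rows


panel = simulate_staggered_panel()
```

Storing `true_effect` alongside the outcome means an estimate recovered from an agent's transcript can be compared against ground truth.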

## Repo layout (planned)

```
harness/ # cold-start agent runner, telemetry capture, venv management
graders/ # AI judge applying the rubric to transcripts
prompts/ # versioned task prompts
rubrics/ # versioned grading rubrics
datasets/ # synthetic DGPs and metadata sidecars
runs/ # per-run records (mostly gitignored)
analysis/ # cell summaries, variability reports, reproducibility checks
writeups/ # case-study writeup drafts
```

## Why a separate repo?

Eval lives independently of the libraries it evaluates. Independence supports the framework's generalizability, isolates dependency footprints, and keeps reproducibility kits self-contained.

## License

MIT
