https://github.com/aerosta/rewardhackwatch

Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).

Host: GitHub
URL: https://github.com/aerosta/rewardhackwatch
Owner: aerosta
License: MIT
Created: 2025-12-09T23:01:49.000Z
Default Branch: main
Last Pushed: 2025-12-11T22:25:32.000Z
Last Synced: 2025-12-13T03:13:31.165Z
Topics: agent-safety, ai-safety, alignment, deep-learning, distilbert, fastapi, huggingface, llm, llm-agents, machine-learning, misalignment, monitoring, nlp, pytorch, reward-hacking, rlhf, streamlit, transformers
Language: Python
Homepage: https://huggingface.co/aerosta/rewardhackwatch
Size: 3.62 MB
Stars: 2
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Citation: CITATION.cff
- Security: SECURITY.md
