https://github.com/aerosta/rewardhackwatch
Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).
- Host: GitHub
- URL: https://github.com/aerosta/rewardhackwatch
- Owner: aerosta
- License: MIT
- Created: 2025-12-09T23:01:49.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-12-11T22:25:32.000Z (6 months ago)
- Last Synced: 2025-12-13T03:13:31.165Z (6 months ago)
- Topics: agent-safety, ai-safety, alignment, deep-learning, distilbert, fastapi, huggingface, llm, llm-agents, machine-learning, misalignment, monitoring, nlp, pytorch, reward-hacking, rlhf, streamlit, transformers
- Language: Python
- Homepage: https://huggingface.co/aerosta/rewardhackwatch
- Size: 3.62 MB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Citation: CITATION.cff
- Security: SECURITY.md