An open API service indexing awesome lists of open source software.

https://github.com/aerosta/rewardhackwatch

Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).
agent-safety ai-safety alignment deep-learning distilbert fastapi huggingface llm llm-agents machine-learning misalignment monitoring nlp pytorch reward-hacking rlhf streamlit transformers

Last synced: 4 months ago

Awesome Lists containing this project