AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Recent research on jailbreak attacks has focused almost exclusively on settings where LLMs act as simple chatbots. However, LLMs are increasingly deployed in agentic workflows, i.e., equipped with external tools and potentially taking many steps to fulfill a user's request. To address the safety and alignment concerns arising from LLM agents, we introduce AgentHarm, a new benchmark for measuring the harmfulness of LLM agents.

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, Xander Davies

This research is currently only available at its source.

You can find the research at the link below.
Feel free to contact Gray Swan with any questions or comments.