Recent research on jailbreak attacks has focused almost exclusively on settings where LLMs act as simple chatbots. Increasingly, however, LLMs are used in agentic workflows, i.e., equipped with external tools and potentially taking many steps to fulfill a user's request. To address the safety and alignment concerns raised by LLM agents, we introduce AgentHarm, a new benchmark for measuring the harmfulness of LLM agents.
You can find the research at the link below.
Feel free to contact Gray Swan with any questions or comments.