D-REX: A Benchmark For Detecting Deceptive Reasoning In Large Language Models

The safety and alignment of Large Language Models (LLMs) are critical for their respon- sible deployment. Current evaluation methods predominantly focus on identifying and preventing overtly harmful outputs. However, they often fail to address a more insidious failure mode: models that produce benign-appearing outputs while operating on malicious or deceptive internal reasoning.

Satyapriya Krishna, Andy Zou, Rahul Gupta, Eliot Krzysztof Jones, Nick Winter, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson, Spyros Matsoukas

This research is currently only available at its source.

You can find the research at the below link.
Feel free to contact Gray Swan with any questions or comments.

View research source

Contact