The safety and alignment of Large Language Models (LLMs) are critical for their responsible deployment. Current evaluation methods predominantly focus on identifying and preventing overtly harmful outputs. However, they often fail to address a more insidious failure mode: models that produce benign-appearing outputs while operating on malicious or deceptive internal reasoning.
You can find the research at the link below.
Feel free to contact Gray Swan with any questions or comments.