Introspection Adapters: Training LLMs to Report Their Learned Behaviors — LessWrong
Summary
Authors: Keshav Shenoy, Li Yang, Abhay Sheshadri, Soren Mindermann, Jack Lindsey, Sam Marks, and Rowan Wang • 📄Paper, 💻 Code, 🤖Models …