What Sentences Cause Alignment Faking? — LessWrong

Summary

TL;DR The decision to fake alignment is concentrated in a small number of sentences per reasoning trace, and those sentences share common features. T…

Original reporting

AFBytes is a read-only aggregator. Use the original source for full context and complete reporting.
