What Sentences Cause Alignment Faking? — LessWrong
Summary
TL;DR: The decision to fake alignment is concentrated in a small number of sentences per reasoning trace, and those sentences share common features. T…
Original reporting
AFBytes is a read-only aggregator. Use the original source for full context and complete reporting.