Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training — LessWrong
Summary
Introduction: Research by Frank Xiao (SPAR mentee) and Santiago Aranguri (Goodfire). …