Probe-Based Data Attribution: Surfacing and Mitigating Undesirable Behaviors in LLM Post-Training — LessWrong

Summary

Introduction

Research by Frank Xiao (SPAR mentee) and Santiago Aranguri (Goodfire). …
