Pando: A Controlled Benchmark for Interpretability Methods — LessWrong
Summary
> TL;DR: Pando is a new interpretability benchmark with 720+ fine-tuned LLMs carrying known decision rules and varying rationale faithfulness. We fin…