Pando: A Controlled Benchmark for Interpretability Methods — LessWrong

Summary

> TL;DR: Pando is a new interpretability benchmark with 720+ fine-tuned LLMs carrying known decision rules and varying rationale faithfulness. We fin…
