SAE features, universal rotation (LessWrong)
A LessWrong post examines whether sparse autoencoder (SAE) features are consistent across models up to an unknown rotation.
Topic cluster
3 sources grouped by AFBytes in AI
AFBytes briefing
Advances in AI interpretability can influence how developers and regulators assess model reliability and safety.
What to watch next
A LessWrong post examines whether sparse autoencoder (SAE) features are consistent across models up to an unknown rotation.
An independent project tests how removing specific attention heads alters repetition behavior in language models.
Researchers introduced a weight-based method to quantify functional similarity between neural networks across inputs.