Training a Transformer to Compose One Step Per Layer (and Proving It) — LessWrong

Summary

I'm working on an experiment comparing the internal representations of two architectures when solving a sequential algorithm, but training models to…
