Fail safe(r) at alignment by channeling reward-hacking into a "spillway" motivation — LessWrong

Summary

It's plausible that flawed RL processes will select for misaligned AI motivations.[1] Some misaligned motivations are much more dangerous than others…
