[2606.03131] HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models