LURE: Alignment Evaluations to Reduce Evaluation Awareness — LessWrong
TLDR: * Frontier models can detect when they're being evaluated and change their behavior, which risks compromising safety benchmarks. * We introdu…
America Forever Bytes
Abstract page for arXiv paper 2511.16992: FIRM: Federated In-client Regularized Multi-objective Alignment for Large Language Models
Abstract page for arXiv paper 2605.30808: Differentially Private Preference Data Synthesis for Large Language Model Alignment
Defenders have structural advantages but there’s work to be done
Abstract page for arXiv paper 2605.29365: Casual as an Anchor: Resolving Supervision Misalignment in Formality Transfer Dataset
Abstract page for arXiv paper 2605.29442: How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessio...
Abstract page for arXiv paper 2605.29458: Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment
Abstract page for arXiv paper 2605.28188: Framing Matters: Addressing Framing Sensitivity in Decision-Making through Behaviorally-Grounded Value Alignment
Abstract page for arXiv paper 2605.28597: Position: Retire the "Positive Backdoor" Label -- Secret Alignment Requires Strict and Systematic Evaluation
Abstract page for arXiv paper 2605.27969: Boundary Suppression Asymmetry in Post-trained Assistants: Over-expansion as a Controllability Cost