[2606.03812] Enhancing Operational Safety via Agentic Dialogue Hazard Identification Analysis
Abstract page for arXiv paper 2606.03812: Enhancing Operational Safety via Agentic Dialogue Hazard Identification Analysis
America Forever Bytes
Abstract page for arXiv paper 2606.03376: P²-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimizat...
Abstract page for arXiv paper 2606.03437: Large Language Models Are Overconfident in Their Own Responses
TLDR: * Frontier models can detect when they're being evaluated and change their behavior, which risks compromising safety benchmarks. * We introdu…
Article / 1st Jun 2026: Constraining LLMs Just Like Users. This post accompanies my recent video on this topic. Large Language Models (LLMs) - often called "AI...
Suppose you are a technical AI safety researcher who has done some research, but has not yet landed a full-time job. In this post, I argue that takin…
Summary: Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on th…
At the risk of embarrassing myself, I’ll share a confession. …
Abstract page for arXiv paper 2605.30085: Conformal Certification of Reasoning Trace Prefixes
Abstract page for arXiv paper 2605.29427: FinGuard: Detecting Financial Regulatory Non-Compliance in LLM Interactions
Abstract page for arXiv paper 2605.28467: Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training
Abstract page for arXiv paper 2605.28588: Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem
Abstract page for arXiv paper 2605.28591: Models That Know How Evaluations Are Designed Score Safer
Abstract page for arXiv paper 2605.28597: Position: Retire the "Positive Backdoor" Label -- Secret Alignment Requires Strict and Systematic Evaluation