Other

benchmarks

Tracked in 8 AFBytes stories. First seen May 28, 2026. Last seen Jun 02, 2026.

Recent coverage

lesswrong.com · Jun 2, 2026 18:20 UTC

LURE: Alignment Evaluations to Reduce Evaluation Awareness — LessWrong

TLDR: Frontier models can detect when they're being evaluated and change their behavior, which risks compromising safety benchmarks. We introdu…

science

arxiv.org · Jun 2, 2026 04:00 UTC

[2606.01804] SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

science tech

arxiv.org · Jun 1, 2026 04:00 UTC

[2605.31086] Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory

science

arxiv.org · Jun 1, 2026 04:00 UTC

[2605.30788] XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks

science tech

arxiv.org · May 29, 2026 04:00 UTC

[2605.29872] Claim against Measurement: Statistical Artefacts in Quantum Error Mitigation Benchmarks

science

arxiv.org · May 29, 2026 04:00 UTC

[2605.29170] UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

science tech

arxiv.org · May 29, 2026 04:00 UTC

[2605.29230] Toward Ethical Facial Age Estimation: A Generalized Zero-Shot Benchmark Without Training on Children's Data

science tech

arxiv.org · May 28, 2026 04:00 UTC

[2605.27566] DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

science tech

Related entities

ai-safety · other
alignment · other
arxiv · other
speech processing · other
multilingual · other
mathematics · other
dialogue systems · other
research paper · other
LLM · technology
ai · other
quantum · other
physics · other
