[2606.02584] IdiomX A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation
Abstract page for arXiv paper 2606.02584: IdiomX A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation
TLDR: * Frontier models can detect when they're being evaluated and change their behavior, which risks compromising safety benchmarks. * We introdu…
Abstract page for arXiv paper 2606.01804: SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing
Abstract page for arXiv paper 2605.31086: Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory
Abstract page for arXiv paper 2605.30788: XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks
Abstract page for arXiv paper 2605.29872: Claim against Measurement: Statistical Artefacts in Quantum Error Mitigation Benchmarks
Abstract page for arXiv paper 2605.29170: UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning
Abstract page for arXiv paper 2605.29230: Toward Ethical Facial Age Estimation: A Generalized Zero-Shot Benchmark Without Training on Children's Data
Abstract page for arXiv paper 2605.27566: DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents