5 Fun Papers That Explain LLMs Clearly
Want to understand LLMs better? Start with these five foundational papers that explain how they work.
America Forever Bytes
Technology
Abstract page for arXiv paper 2606.03618: Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing
Abstract page for arXiv paper 2606.03641: Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency
Abstract page for arXiv paper 2606.03645: The Shape of Addition: Geometric Structures of Arithmetic in Large Language Models
Abstract page for arXiv paper 2606.03867: A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs
Abstract page for arXiv paper 2606.03876: From 'What' to 'How' and 'Why': Sharing LLM-Generated Retrospective Summaries of Older Adults' Passive Tracking Data w...
Abstract page for arXiv paper 2606.03535: Can LLM Rerankers Predict Their Own Ranking Performance?
Abstract page for arXiv paper 2606.03565: Skill Is Not Document: A Query-Conditional Benchmark and Two-Stage Retriever for LLM Agent Skill Routing
Abstract page for arXiv paper 2606.03030: Do Matching Mechanisms Work with LLM Agents?
Abstract page for arXiv paper 2606.03965: Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning
Abstract page for arXiv paper 2606.03647: Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs
Abstract page for arXiv paper 2606.03648: Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability
Abstract page for arXiv paper 2606.03021: Hint-Guided Diversified Policy Optimization for LLM Reasoning
Abstract page for arXiv paper 2606.03022: Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization
Abstract page for arXiv paper 2606.03029: Conditional Hypothesis Generation for LLM-Based Text Analysis with Researcher-Specified Covariates
Abstract page for arXiv paper 2606.03032: The Deliberative Illusion: Diagnosing Factual Attrition and Stance Homogenization in Multi-Agent LLM Deliberation
Abstract page for arXiv paper 2606.03043: The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment
Abstract page for arXiv paper 2606.03291: Multilingual Unlearning in LLMs: Transfer, Dynamics, and Reversibility
Abstract page for arXiv paper 2606.03330: FLIPS: Instance-Fingerprinting for LLMs via Pseudo-random Sequences
Abstract page for arXiv paper 2606.03331: Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions
Abstract page for arXiv paper 2606.03344: RogueMerge: Robust and Unified Attacks against LLM Model Merging
Abstract page for arXiv paper 2606.03437: Large Language Models Are Overconfident in Their Own Responses
Abstract page for arXiv paper 2606.03503: ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning
Abstract page for arXiv paper 2606.02606: ReLoRA: Knowledge-Reusing Adaptation for Fast Rollout of Evolving LLM Services
Abstract page for arXiv paper 2606.02776: Topics as Proxies for Sociodemographics: How Conversational Context Affects LLM Answers
Abstract page for arXiv paper 2606.02822: Which Defense Closes Which Threat? Attributing OWASP-LLM-Top-10 Coverage and Its Brittleness Under Paraphrasing
Abstract page for arXiv paper 2606.02835: Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models
Background: Large language models (LLMs) require specialized methodologies to quantify model confidence for safe deployment in health care systems; however, there is a lack of established methods for confidence assessment.
Objective: This study aimed to evaluate confidence metrics for multimodal LLMs interpreting ultrasound-based radiology cases and to compare self-reported, consistency-based, and hybrid methods.
Methods: From a total of 330 quizzes on the Korean Society of Ultrasound in Medicine digital platform, we selected 94 multiple-choice cases. Four multimodal LLMs were evaluated: 3 reasoning models (GPT-5, Claude-4.5-Sonnet, and Gemini-3-Pro) and 1 general model (GPT-4o). Temperature was fixed at 1.0. Multiple confidence metrics were assessed: (1) self-reported metrics generated by LLMs using prompts that elicited direct confidence percentages with answers, including first self-reported confidence and mean self-reported confidence; (2) consistency-based metrics derived from 20 repeated outputs per case, including relative entropy calculated as 1 − H/log k (H=Shannon entropy, k=number of answer choices) and majority-vote percentage; and (3) a Top Weighted Score combining response frequency with self-reported confidence. Receiver operating characteristic analysis for discrimination and Spearman correlation between accuracy and each confidence metric were conducted. Additionally, model calibration was assessed using expected calibration error and Brier score. Processing time and token consumption (input, output, and total) were recorded for each application programming interface call to evaluate resource use across models.
Results: Diagnostic accuracy varied across models, with Gemini-3-Pro achieving the highest accuracy (70/94, 74.47%), surpassing the median human accuracy (59%, IQR 40.3%-75%). Top Weighted Score, a hybrid metric combining response frequency and self-reported confidence, was the only metric achieving statistically significant correlations across all 4 models: Gemini-3-Pro (ρ=0.52), GPT-5 (ρ=0.43), Claude-4.5-Sonnet (ρ=0.30), and GPT-4o (ρ=0.22). Receiver operating characteristic analysis revealed that Top Weighted Score demonstrated the highest discriminative ability, with area under the curve values of 0.826 (95% CI 0.731-0.920) for Gemini-3-Pro and 0.767 (95% CI 0.668-0.866) for GPT-5. Top Weighted Score was the only metric achieving statistical significance in GPT-4o. Calibration analysis showed that Top Weighted Score achieved the lowest expected calibration error in GPT-5 (0.098) and Claude-4.5-Sonnet (0.192), while Gemini-3-Pro showed comparable calibration between relative entropy (0.119) and Top Weighted Score (0.122). Resource use analysis demonstrated that reasoning models required substantially longer processing times and higher token consumption compared to general models.
Conclusions: In multimodal LLMs applied to ultrasound-based radiology cases, hybrid methods (Top Weighted Score) demonstrated significant associations across all evaluated models and appear to serve as more reliable indicators of diagnostic confidence compared to self-reported or consistency-based metrics alone, although the strength of these associations varied across models, and external validation is warranted before broader clinical application. These findings support integrative confidence estimation approaches that incorporate response consistency while highlighting the need for resource-efficient sampling strategies to enable practical clinical deployment.
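The consistency-based metrics named in the abstract above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's code: the function names, the sample answers, and the choice of k=5 answer options are all hypothetical. The relative entropy follows the formula given in the abstract (1 − H/log k), and the Brier score uses its standard definition (mean squared difference between stated confidence and 0/1 correctness). The Top Weighted Score is deliberately left out because the abstract does not give its formula.

```python
import math
from collections import Counter


def relative_entropy_confidence(answers, k):
    """Return 1 - H / log(k), where H is the Shannon entropy of the
    empirical distribution of sampled answers and k is the number of
    answer choices. 1.0 means every sample agreed; 0.0 means the
    samples were spread uniformly over all k choices."""
    if k < 2:
        raise ValueError("k must be at least 2")
    if not answers:
        raise ValueError("answers must not be empty")
    n = len(answers)
    counts = Counter(answers)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - entropy / math.log(k)


def majority_vote_percentage(answers):
    """Return (most frequent answer, its share of all samples in percent)."""
    if not answers:
        raise ValueError("answers must not be empty")
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, 100.0 * top_count / len(answers)


def brier_score(confidences, correct):
    """Standard Brier score: mean squared error between a confidence in
    [0, 1] and the 0/1 outcome (1 = the answer was correct)."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(confidences, correct)) / len(confidences)


# Hypothetical case: 20 repeated outputs for one question with 5 choices.
samples = ["B"] * 14 + ["A"] * 4 + ["D"] * 2
print(relative_entropy_confidence(samples, k=5))
print(majority_vote_percentage(samples))
```

Both consistency metrics need only the repeated answers, not the model's self-reported percentages, which is why the abstract treats them as a separate family from the self-reported metrics.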
Want ChatGPT, Gemini, and Perplexity to understand your local business better? Use localized SEO for LLMs to strengthen entity signals, reviews, content, and ci...
Abstract page for arXiv paper 2606.01859: Improving LLM-Based Go Code Review through Issue-List Generation and Context Augmentation