Reinforcement learning (RL) shows promise for enhancing LLM agentic reasoning, yet sparse terminal rewards hinder fine-grained optimization. Process reward modeling offers an alternative but incurs high computational costs, reward hacking risks, and annotation bottlenecks. We introduce RewardFlow, a lightweight method for estimating state-level rewards in agentic reasoning. By constructing state g
The known stylistic biases in LLM judges, such as a preference for verbosity or specific sentence structures, present an underexplored security vulnerability. In this work, we introduce BITE (BIas exploraTion and Exploitation), a black-box adversarial framework that learns semantics-preserving edits to mislead an LLM judge and artificially inflate the scores it assigns. We cast the selection of st
Evaluating open-ended outputs from large language models (LLMs) remains challenging due to the absence of ground truth. Existing metrics rely on final-answer accuracy or surface-level statistics, leaving the reasoning process itself unexamined. We introduce TRACE (Toulmin-based Reasoning Assessment through Constructive Elements), a metric that analyzes Chain-of-Thought (CoT) reasoning processes. R
Multi-step reasoning remains a central challenge for large language models: single-pass generation is efficient but lacks accuracy; tree-search methods explore multiple paths but are computation-heavy. We address this gap by distilling reasoning progress into a hyperbolic geometric signal that guides step-by-step generation. Our approach is motivated by a structural observation: in combinatorial r
Vision-Language Models (VLMs) have enabled interpretable medical diagnosis by integrating visual perception with linguistic reasoning. Yet, existing medical chain-of-thought (CoT) models lack explicit mechanisms to represent and enforce causal reasoning, leaving them vulnerable to spurious correlations and limiting their clinical reliability. We pinpoint three core challenges in medical CoT reason
Named moves are sparse, but the desk still signals team formation and expansion rather than churn. The items point to new structures taking shape around founders and a fresh lead hire.
Benchmark updates stay concentrated on a single model, but the spread across math, knowledge, and arithmetic matters. The cycle adds a compact snapshot of where one system sits across core evaluation surfaces.