The integration of large language models (LLMs) into automated algorithm design has shown promising potential. A prevalent approach embeds LLMs within search routines to iteratively generate and refine candidate algorithms. However, most existing methods rely on off-the-shelf LLMs trained for general coding tasks, leaving a key question open: Do we need LLMs specifically tailored for algorithm des
Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR depends heavily on well-crafted task queries and corresponding ground-truth answers to provide accurate rewards, which requires significant human effort and hinders the scaling of RL processes, especially in agentic scenarios. Although a few recent works explore task synt
Reinforcement Learning (RL) post-training has emerged as the dominant paradigm for eliciting mathematical reasoning in Large Language Models (LLMs), yet prevailing techniques such as GRPO and DAPO distribute rollout and gradient budgets nearly uniformly across prompts, squandering compute on samples that are already mastered or remain far beyond the model's current capability. To address this fund
Autonomous AI coding agents are becoming a core tool for ML practitioners in industry and research alike. Despite this growing adoption, no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains. We introduce **1GC-7RC** (*Single Graphic Card: Seven Research Challenges*), a benchmark comprising seven ML tasks spanning lang
This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of identifying reclamatory versus non-reclamatory usage of LGBTQ+-related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges: data scarcity, class imbalance, and cross-linguistic variation
Capital is concentrated in large, nonstandard checks rather than a broad spread of venture rounds. The mix points to institutional-scale deployment and public or quasi-public backing, not a typical startup funding pattern.
Named moves are sparse, but the desk still signals where teams are adding operational and infrastructure depth. The visible hires cluster around large platform organizations rather than frontier lab reshuffles.
A single model now posts results across multiple evaluation tiers, giving the desk a compact read on agent performance spread. The main signal is not one score, but the breadth of surfaces now being tracked together.