Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, but still exhibit substantial multilingual reasoning gaps, largely due to language-understanding failures in non-English inputs. English translation can mitigate these failures by expressing non-English inputs in a form that RLMs can more reliably interpret, yet translating every input is unnecessary when the m
Quantizing LLM weights and activations is a standard approach for efficient deployment, but a few extreme outliers can stretch the dynamic range and amplify low-bit quantization errors. Prior transform-based mitigations (e.g., Hadamard rotations) are fixed and data-agnostic, and their optimality for quantization has remained unclear. We derive closed-form optimal linear blockwise transforms for jo
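A minimal sketch of the outlier problem described in this item, not the paper's method: it assumes symmetric per-tensor 4-bit quantization with a single shared scale, and the input values are hypothetical. With one large outlier in the tensor, the scale grows so much that every small value rounds to zero.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric quantization with one scale for the whole list (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax      # assumes at least one nonzero value
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # round to the grid, then clip
        out.append(q * scale)                        # map back to real values
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical data: seven small values, then the same values plus one outlier.
small = [0.10, -0.20, 0.15, 0.05, -0.10, 0.20, -0.05]
with_outlier = small + [8.0]

# Error on the small values when they are quantized on their own.
err_small = mean_squared_error(small, quantize_dequantize(small))

# Error on the same small values when the outlier sets the scale.
recon = quantize_dequantize(with_outlier)
err_small_with_outlier = mean_squared_error(small, recon[:-1])

print(err_small, err_small_with_outlier)
```

Transform-based mitigations such as the Hadamard rotations mentioned above aim to spread an outlier's magnitude across many coordinates before quantizing, so that no single entry dictates the scale.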
The NVFP4 lower-precision format, supported in hardware by NVIDIA Blackwell GPUs, promises to allow, for the first time, end-to-end fully-quantized pre-training of massive models such as LLMs. Yet, existing quantized training methods still sacrifice some of the representation capacity of this format in favor of more accurate unbiased quantized gradient estimation by stochastic rounding (SR), losin
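As a rough illustration of why stochastic rounding (SR) yields unbiased estimates, the sketch below rounds to the integer grid, which stands in for a low-precision format here; that grid, the sample count, and the function name are assumptions for illustration and not NVFP4 itself. SR rounds up with probability equal to the fractional part, so its average recovers the input, while round-to-nearest always returns the same biased value.

```python
import math
import random


def stochastic_round(x, rng):
    """Round x to a neighbouring integer; round up with probability frac(x)."""
    lower = math.floor(x)
    frac = x - lower
    return lower + (1 if rng.random() < frac else 0)


rng = random.Random(0)   # fixed seed so the run is repeatable
x = 0.3
n = 20000

# The average of many stochastic roundings approaches x itself.
sr_mean = sum(stochastic_round(x, rng) for _ in range(n)) / n

# Round-to-nearest is deterministic and always returns 0 for x = 0.3.
rtn_value = round(x)

print(sr_mean, rtn_value)
```

The price of unbiasedness is variance: each individual SR result is either 0 or 1, further from 0.3 than the nearest-rounding result on some draws.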
Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them with fitted modules. Existing replacement-based methods share two design constraints: full-layer granularity and contiguous selection. We argue that this is overly restrictive: in fact, redundancy in pretrained transformers is not confined to contiguous regions,
Hyperparameter Optimization (HPO) can lift the burden of tuning hyperparameters (HPs) of neural networks. HPO algorithms from the Population Based Training (PBT) family are efficient thanks to dynamically adjusting HPs every few steps of the weight optimization. Recent results indicate that the number of steps between HP updates is an important meta-HP of all PBT variants that can substantially af
Named moves remain concentrated around senior AI leadership and academic-to-industry shifts. The desk matters when a single departure or hire changes who controls research direction, not when teams merely reshuffle.
Three benchmark entries for the same model point to a narrow but useful snapshot of performance across core reasoning and knowledge tests. The signal here is surface coverage: which evaluations are being tracked, and where a single system sits across them.