Learnable Multipliers for Adaptive Scale in LLM Matrix Layers

Authors: Maksim Velikanov
Organization: Hugging Face
Published: 2026-01-09 • Added: 2026-01-09

Key Insights

  • Attaching a learnable scalar multiplier to each weight matrix lets the model escape the suboptimal weight‑norm equilibrium imposed by fixed weight decay.
  • Extending this idea to per‑row and per‑column multipliers further frees individual dimension scales, yielding a more expressive variant of μP‑style scaling.
  • The learned multipliers automatically adapt to data and model width, improving downstream performance and matching gains obtained by switching from Adam to the newer Muon optimizer.
  • Compared to hand‑tuned μP multipliers, the learnable approach reduces hyperparameter search overhead while achieving higher accuracy.
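The first bullet can be sketched as a toy forward pass: the weight matrix `W` stays under weight decay, while a separate scalar `s` (exempt from decay) absorbs the overall scale. This is a minimal NumPy illustration of the idea, not the paper's implementation; the function name and initialization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 8, 4
# Weight matrix, subject to weight decay during training (not simulated here).
W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
# Learnable scalar multiplier, initialized at 1 and assumed exempt from
# weight decay, so it is free to learn the optimal output scale while
# ||W|| settles at the WD-noise equilibrium.
s = 1.0

def forward(x, W, s):
    # Effective weight is s * W; gradients w.r.t. s would adapt the scale.
    return (s * W) @ x

x = rng.normal(size=d_in)
y = forward(x, W, s)
assert y.shape == (d_out,)
```

At initialization (`s = 1`) the layer behaves exactly like a plain linear layer; training then moves `s` away from 1 if the equilibrium norm of `W` is suboptimal.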

Abstract

Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure, and address it by introducing learnable multipliers to learn the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to data and improves performance. We then argue that individual row and column norms are similarly constrained, and free their scale by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers. Finally, we validate learnable multipliers with both Adam and Muon optimizers, where they show improvements in downstream evaluations matching the improvement gained by switching from Adam to Muon.
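The per-row and per-column extension described above can be sketched as scaling each output row and input column of W independently, i.e. an effective weight diag(r) W diag(c). The snippet below is an illustrative NumPy sketch under that assumption (not the paper's code); it also demonstrates the forward-pass symmetry the abstract mentions: rescaling a multiplier up while rescaling the corresponding slice of W down leaves the function unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

d_in, d_out = 8, 4
W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
r = np.ones(d_out)  # learnable per-row multipliers (output dimensions)
c = np.ones(d_in)   # learnable per-column multipliers (input dimensions)

def forward(x, W, r, c):
    # Effective weight diag(r) @ W @ diag(c), computed as elementwise scaling.
    W_eff = r[:, None] * W * c[None, :]
    return W_eff @ x

x = rng.normal(size=d_in)
y = forward(x, W, r, c)
# With r = c = 1 this reduces to the plain layer W @ x.
assert np.allclose(y, W @ x)

# Forward-pass symmetry: scaling a row multiplier up while scaling the
# matching rows of W down by the same factor leaves the output unchanged,
# so (r, W) is only identified up to this rescaling.
alpha = 2.0
assert np.allclose(forward(x, W / alpha, r * alpha, c), y)
```

The symmetry at the end is one reason training these multipliers raises practical questions: the optimizer and weight decay see (r, W) pairs that compute identical functions but sit at different norms.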

Links

[HuggingFace](https://huggingface.co/papers/2601.04890) | [arXiv](https://arxiv.org/abs/2601.04890)

Topics: nlp, efficiency • Difficulty: advanced • Upvotes: 27