Research Paper

Token‑Level Collaborative Decoding for Efficient LLM Reasoning

Authors: Chengsong Huang,
Organization: Hugging Face
Published: 2026-01-09 • Added: 2026-01-09

Key Insights

  • RelayLLM lets a small language model act as a controller, emitting a special command token to summon the large model only for critical tokens, reducing LLM usage to ~1% of generated tokens.
  • A two‑stage training regimen (warm‑up plus Group Relative Policy Optimization) teaches the SLM when to generate autonomously and when to request help, balancing independence with strategic assistance.
  • Across six reasoning benchmarks, RelayLLM attains 49.52% accuracy—close to the full LLM performance—while achieving a >98% reduction in computational cost versus random token‑level routing.
  • Token‑level collaboration outperforms coarse‑grained routing/cascading by avoiding unnecessary offloading of entire queries when the SLM can handle most reasoning steps.
  • The framework is model‑agnostic: any SLM‑LLM pair sharing a tokenizer can be combined, making it broadly applicable.
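The relay mechanism described above can be sketched as a simple decoding loop. This is an illustrative reconstruction, not the paper's code: the names `RELAY`, `slm_step`, and `llm_step`, and the toy models below, are hypothetical stand-ins for an SLM and LLM sharing a tokenizer.

```python
RELAY = "<CALL_LLM>"  # hypothetical command token the SLM is trained to emit

def relay_decode(prompt, slm_step, llm_step, max_tokens=50, eos="<EOS>"):
    """Decode with the SLM, deferring to llm_step only when the SLM
    emits the RELAY command token; count how often the LLM is invoked."""
    tokens, llm_calls = [], 0
    context = list(prompt)
    for _ in range(max_tokens):
        tok = slm_step(context)
        if tok == RELAY:
            tok = llm_step(context)  # one expensive LLM step per request
            llm_calls += 1
        if tok == eos:
            break
        tokens.append(tok)
        context.append(tok)
    return tokens, llm_calls

# Toy stand-ins: the "SLM" asks for help only at one "critical" position.
def toy_slm(ctx):
    if len(ctx) == 3:
        return RELAY
    return "easy" if len(ctx) < 5 else "<EOS>"

def toy_llm(ctx):
    return "hard"

out, calls = relay_decode(["q"], toy_slm, toy_llm)
print(out, calls)  # → ['easy', 'easy', 'hard', 'easy'] 1
```

Because control returns to the SLM immediately after each relayed token, the expensive model runs for isolated tokens rather than whole queries, which is what drives the reported ~1% LLM-token usage.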

Abstract

Deploying Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
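The GRPO stage mentioned in the abstract scores a group of rollouts for the same prompt and standardizes each rollout's reward against the group's statistics, avoiding a learned value critic. The sketch below shows only that group-relative advantage computation; the example reward trading off correctness against LLM-call fraction is a hypothetical form, not the paper's actual reward design.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: reward standardized within the rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical reward: correctness minus a penalty on LLM usage,
# r = correct - cost_weight * llm_call_fraction.
rewards = [1.0 - 0.5 * 0.02,   # correct, 2% of tokens relayed
           0.0,                # incorrect
           1.0 - 0.5 * 0.30,   # correct, but 30% of tokens relayed
           0.0]                # incorrect
print([round(a, 3) for a in group_relative_advantages(rewards)])
```

Under such a reward, a rollout that is correct while relaying few tokens receives the largest advantage, which is the pressure that teaches the SLM to request help sparingly.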

Full Analysis

**Source:** [HuggingFace](https://huggingface.co/papers/2601.05167) | [arXiv](https://arxiv.org/abs/2601.05167)

*Topics: nlp, efficiency* *Difficulty: advanced* *Upvotes: 16*