Simpler is Better: Finding the Best Reward Function in Long Chain-of-Thought Reinforcement Learning for Small Language Models

Luning Wang; Zichen Zhang; Junkuan Liu

Abstract

Large Language Models (LLMs) have recently achieved strong reasoning performance through Reinforcement Learning (RL) and long chain-of-thought (CoT) reasoning. This paper studies how different reward functions affect the reasoning behavior of Small Language Models (SLMs), particularly models under 7B parameters. We propose a dynamic reward function extending cosine rewards and compare multiple reward strategies on reasoning benchmarks.
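
For readers unfamiliar with cosine rewards, the following is a minimal sketch of one common form of a length-scaled cosine reward, not the dynamic reward function proposed in this paper. It assumes the reward is interpolated along a half cosine between a value at length 0 and a value at the maximum length, with separate endpoints for correct and incorrect answers and a fixed penalty when the length limit is reached. The function names, parameter names, and default values are hypothetical and chosen only for illustration.

```python
import math


def cosine_interpolate(t, T, start, end):
    """Move smoothly from `start` (at t = 0) to `end` (at t = T) along a half cosine."""
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * t / T))


def cosine_reward(
    is_correct,
    length,
    max_length,
    correct_start=2.0,    # hypothetical: reward for a correct answer at length 0
    correct_end=1.0,      # hypothetical: reward for a correct answer near max_length
    wrong_start=-10.0,    # hypothetical: reward for a wrong answer at length 0
    wrong_end=0.0,        # hypothetical: reward for a wrong answer near max_length
    exceed_penalty=-10.0, # hypothetical: reward when the length limit is reached
):
    """Illustrative cosine length-scaled reward for a single generated response."""
    if length >= max_length:
        return exceed_penalty
    if is_correct:
        # Shorter correct answers score higher than longer correct answers.
        return cosine_interpolate(length, max_length, correct_start, correct_end)
    # Shorter wrong answers are penalized more than longer wrong answers.
    return cosine_interpolate(length, max_length, wrong_start, wrong_end)
```

Under these assumed defaults, a correct answer earns more when it is short, while a wrong answer is penalized less when the model has used more of its reasoning budget. A dynamic variant would presumably adjust such endpoints during training, but the schedule is not specified in this abstract.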
