Luning Wang, Zichen Zhang, Junkuan Liu
Large Language Models (LLMs) have recently achieved strong reasoning performance through Reinforcement Learning (RL) and long chain-of-thought (CoT) reasoning. This paper studies how different reward functions affect the reasoning behavior of Small Language Models (SLMs), particularly models under 7B parameters. We propose a dynamic reward function that extends cosine rewards, and we compare multiple reward strategies on reasoning benchmarks.