Autonomous Reward Shaping via Self-Generated Trajectories for Sparse-Reward Reinforcement Learning

Published in IEEE Xplore: 09 February 2026
Authors: Kota Minoshima and Sachiyo Arai

A central challenge in reinforcement learning is enabling agents to efficiently learn in environments where rewards are sparse or significantly delayed. Many reward shaping approaches rely on handcrafted signals or expert demonstrations, limiting scalability and increasing task-specific engineering cost. We propose an autonomous reward shaping framework that leverages self-generated trajectories. The agent generates trajectories, ranks them using sparse environmental feedback, and learns a dense reward model that best explains these rankings. This learned shaping signal is then added to the environmental reward to accelerate policy optimization, creating a closed loop in which improved policies yield higher-quality trajectories that further refine the reward model. Experiments on MuJoCo continuous-control benchmarks under two delayed-reward settings (20-step delayed and completely delayed) show that our method improves learning efficiency over strong baselines, including PPO, intrinsic-motivation methods (ICM, RND), Self-Imitation Learning (SIL), and Randomized Return Decomposition (RRD). In the 20-step delayed setting, our method matches or exceeds the final performance of these baselines on most tasks while learning faster. In the completely delayed setting, where only a single terminal reward is available at the end of each episode, our approach reliably learns high-return policies, whereas the baselines tend to converge to low-return regimes. By autonomously converting sparse and delayed feedback into dense learning signals without expert input, this framework reduces the burden of manual reward engineering and advances scalable reinforcement learning in challenging sparse-reward domains.
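The loop described in the abstract (generate trajectories, rank them by sparse feedback, fit a dense reward model to the rankings, add the learned signal to the environmental reward) can be illustrated with a minimal sketch. The abstract does not specify the reward model, the ranking loss, or how rewards are delayed, so everything below is an assumption made for illustration: a linear reward over per-step feature vectors, a pairwise Bradley-Terry (logistic) ranking loss, a delay scheme that accumulates rewards and releases them every `interval` steps, and the hypothetical names `delay_rewards`, `fit_reward_model`, `shaped_reward`, and `beta`. It is not the authors' implementation.

```python
import math


def delay_rewards(rewards, interval=None):
    """Assumed delay scheme: accumulate per-step rewards and release the sum
    every `interval` steps; with interval=None, release only at the last step."""
    out, acc = [], 0.0
    for t, r in enumerate(rewards, start=1):
        acc += r
        is_last = t == len(rewards)
        if is_last or (interval is not None and t % interval == 0):
            out.append(acc)
            acc = 0.0
        else:
            out.append(0.0)
    return out


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sum_features(traj):
    """Sum the per-step feature vectors of one trajectory."""
    dim = len(traj[0])
    return [sum(step[k] for step in traj) for k in range(dim)]


def predicted_return(w, traj):
    """Return of a trajectory under the linear reward model r_w(s) = w . phi(s)."""
    return sum(wk * fk for wk, fk in zip(w, sum_features(traj)))


def fit_reward_model(trajs, returns, epochs=200, lr=0.05):
    """Fit w so that higher-ranked trajectories get higher predicted return.

    Assumed objective: for every pair (i, j) with returns[i] > returns[j],
    minimise -log sigmoid(R_w(i) - R_w(j)) by stochastic gradient descent.
    """
    feats = [sum_features(t) for t in trajs]
    w = [0.0] * len(feats[0])
    n = len(trajs)
    pairs = [(i, j) for i in range(n) for j in range(n) if returns[i] > returns[j]]
    for _ in range(epochs):
        for i, j in pairs:
            diff = [a - b for a, b in zip(feats[i], feats[j])]
            margin = sum(wk * dk for wk, dk in zip(w, diff))
            # Gradient of the pairwise loss w.r.t. w is -(1 - sigmoid(margin)) * diff.
            g = 1.0 - sigmoid(margin)
            w = [wk + lr * g * dk for wk, dk in zip(w, diff)]
    return w


def shaped_reward(env_reward, features, w, beta=1.0):
    """Environmental reward plus the learned dense signal, weighted by beta."""
    return env_reward + beta * sum(wk * fk for wk, fk in zip(w, features))


# Toy data: three self-generated trajectories, each ranked by one terminal return.
trajs = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [1.0, 1.0]],
    [[0.0, 1.0], [0.0, 0.0]],
]
returns = [2.0, 1.0, 0.0]
w = fit_reward_model(trajs, returns)
```

Under these assumptions, `delay_rewards(rewards, interval=20)` would correspond to the 20-step delayed setting and `delay_rewards(rewards)` to the completely delayed one. In a full training loop, the policy optimizer would consume `shaped_reward` at each step, and the reward model would be refit periodically on newly collected trajectories.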