Reward function noise reduction for live trading agents
Reward functions in trading systems are rarely clean. Raw PnL mixes market beta, spread capture, slippage, position sizing, and risk exposure into one unstable scalar. In production research we treat that scalar as a measurement problem before treating it as an optimization target.
The approach starts by decomposing the realized outcome into components: price movement during the holding period, execution cost versus decision price, adverse excursion, and capital-at-risk. Each component is normalized independently, then recombined with guardrails so a lucky high-volatility move does not teach the agent the wrong habit.
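The decomposition described above might be sketched as follows. This is a minimal illustration, not the production method. The names (TradeOutcome, decompose, denoised_reward), the weights, the volatility scaling, the clip limit, and the risk budget are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TradeOutcome:
    """One closed trade. All field names are hypothetical."""
    side: int               # +1 for long, -1 for short
    decision_price: float   # price when the agent decided to trade
    entry_price: float      # actual entry fill
    exit_price: float       # actual exit fill
    worst_price: float      # most adverse price seen while holding
    capital_at_risk: float  # fraction of equity at risk, e.g. 0.02
    volatility: float       # assumed return std over the holding period


# Illustrative weights only; real values would need tuning.
DEFAULT_WEIGHTS = {
    "move": 1.0,
    "execution_cost": 1.0,
    "adverse_excursion": 0.5,
    "capital_at_risk": 1.0,
}


def clip(x: float, limit: float) -> float:
    """Guardrail: bound a value to [-limit, +limit]."""
    return max(-limit, min(limit, x))


def decompose(t: TradeOutcome) -> dict:
    """Split the realized outcome into separately measurable components."""
    return {
        # Signed return between entry and exit fills.
        "move": t.side * (t.exit_price - t.entry_price) / t.entry_price,
        # Positive when the fill was worse than the decision price.
        "execution_cost": t.side * (t.entry_price - t.decision_price) / t.decision_price,
        # Largest move against the position, floored at zero.
        "adverse_excursion": max(0.0, t.side * (t.entry_price - t.worst_price) / t.entry_price),
        "capital_at_risk": t.capital_at_risk,
    }


def denoised_reward(t: TradeOutcome, weights=None,
                    clip_limit: float = 3.0, risk_budget: float = 0.02) -> float:
    """Normalize each component on its own scale, clip, then recombine."""
    w = DEFAULT_WEIGHTS if weights is None else weights
    c = decompose(t)
    vol = max(t.volatility, 1e-8)  # avoid division by zero
    # Price-based components are expressed in volatility units, then clipped.
    z_move = clip(c["move"] / vol, clip_limit)
    z_cost = clip(c["execution_cost"] / vol, clip_limit)
    z_adverse = clip(c["adverse_excursion"] / vol, clip_limit)
    # Capital-at-risk is penalized only above the assumed budget.
    risk_excess = min(max(0.0, c["capital_at_risk"] / risk_budget - 1.0), clip_limit)
    return (w["move"] * z_move
            - w["execution_cost"] * z_cost
            - w["adverse_excursion"] * z_adverse
            - w["capital_at_risk"] * risk_excess)
```

Under these assumptions, a long trade entered at 100 with a 1% volatility estimate earns the same clipped reward whether it exits at 110 or at 120, so the size of an outsized move adds nothing beyond the clip limit.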
Noise reduction also needs temporal discipline. We avoid rewarding a policy on every tick when the outcome horizon is naturally trade-level or session-level. Short-horizon shaping rewards can help exploration, but they must not overpower the terminal risk-adjusted outcome.
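One way to keep shaping rewards subordinate to the terminal outcome is to give them a fixed total budget per trade and pay the risk-adjusted reward only at the close. The sketch below assumes that scheme. The function name trade_level_rewards and the shaping_budget parameter are hypothetical, and the budget value is illustrative.

```python
from typing import List


def trade_level_rewards(shaping: List[float], terminal: float,
                        shaping_budget: float = 0.25) -> List[float]:
    """Return one reward per step of a single trade.

    shaping:        raw per-step shaping rewards (one entry per step).
    terminal:       risk-adjusted outcome of the trade, paid at the last step.
    shaping_budget: assumed cap on the summed absolute shaping reward,
                    chosen to be small relative to typical terminal rewards.
    """
    if not shaping:
        # No intermediate steps: the trade yields only its terminal reward.
        return [terminal]

    total_abs = sum(abs(s) for s in shaping)
    # Scale shaping down only when it would exceed the budget.
    scale = 1.0 if total_abs <= shaping_budget else shaping_budget / total_abs

    rewards = [s * scale for s in shaping]
    rewards[-1] += terminal  # terminal outcome lands on the closing step
    return rewards
```

For example, with raw shaping values [0.5, 0.5, -1.0, 0.0], a terminal reward of 2.0, and the default budget, the shaping terms are scaled by 0.125 so that their absolute sum is 0.25, and the closing step carries the full 2.0.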
The result is a reward stream that is slower, less theatrical, and more useful. Policies trained against it tend to make fewer overfit entries, produce more stable protective-order placement, and degrade more gracefully when volatility regimes change.