Reinforcement Learning: Agents, Rewards, and Policies

LDS Team
Let's Data Science

A robot arm knocks a cup off a table 47 times before finally learning to place it gently. A game-playing agent loses 10,000 matches of Go, then beats the world champion. Reinforcement learning is the science behind both stories: an agent learns not from labeled examples, but from the consequences of its own actions. As of March 2026, RL has become the secret ingredient behind how large language models actually work, powering the alignment techniques (RLHF, GRPO, DPO) that turn raw language models into helpful assistants.

Unlike supervised learning (correct answers provided) or unsupervised learning (find structure in data), reinforcement learning drops an agent into an environment and says: "figure it out." The agent tries actions, receives rewards or penalties, and gradually discovers a strategy (a policy) that maximizes long-term payoff. This trial-and-error approach is how DeepMind's AlphaGo defeated Lee Sedol, how recommendation engines optimize engagement, and how OpenAI and DeepSeek train reasoning models to think step by step.

We will build intuition through one consistent example: a 4x4 gridworld where an agent moves from a start cell to a goal, avoiding traps. Every formula, every algorithm, every comparison will reference this grid.

The RL Framework: Agent, Environment, State, Action, Reward

Reinforcement learning is a feedback loop between two entities: an agent (the decision-maker) and an environment (everything the agent interacts with). At each timestep $t$, the agent observes a state $s_t$, selects an action $a_t$, receives a reward $r_t$, and transitions to a new state $s_{t+1}$.

[Figure: The agent-environment interaction loop in reinforcement learning]

In our gridworld, the state is the robot's cell position (row, column). Actions are {up, down, left, right}. The reward is -1 per step, -10 for a trap, and +10 for the goal. If the robot moves right from cell (1,2), the environment places it at (1,3) and returns that cell's reward.

This loop repeats until the agent reaches a terminal state (goal or trap). One complete sequence is called an episode. The agent's objective across episodes is to maximize total accumulated reward.

Key Insight: The reward signal is the only feedback the agent gets. It never sees "the correct action." This is what makes RL fundamentally different from supervised learning, and fundamentally harder.
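To make the loop concrete, here is a minimal sketch of the gridworld as an environment class. The `GridWorld` name, the trap positions, and the wall-bumping rule are our own assumptions for illustration; the rewards follow the -1/-10/+10 scheme above.

```python
class GridWorld:
    """A 4x4 grid: start (0,0), goal (3,3); trap cells are an assumed layout."""

    def __init__(self):
        self.goal = (3, 3)
        self.traps = {(1, 1), (2, 3)}  # illustrative positions
        self.state = (0, 0)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        """Apply one action; moves off the grid leave the agent in place."""
        moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
        dr, dc = moves[action]
        r, c = self.state
        self.state = (min(max(r + dr, 0), 3), min(max(c + dc, 0), 3))
        if self.state == self.goal:
            return self.state, 10, True    # goal: +10, episode ends
        if self.state in self.traps:
            return self.state, -10, True   # trap: -10, episode ends
        return self.state, -1, False       # ordinary step: -1

env = GridWorld()
state = env.reset()
state, reward, done = env.step("right")  # (0,0) -> (0,1), reward -1
```

The `(state, reward, done)` return shape mirrors the loop in the figure: observe, act, collect reward, check for a terminal state.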

Markov Decision Processes: The Mathematical Foundation

A Markov Decision Process (MDP) formalizes the RL problem as a tuple $(S, A, P, R, \gamma)$. The "Markov" part means the future depends only on the current state, not on the history of how the agent got there.

| Component | Symbol | Gridworld Example |
| --- | --- | --- |
| State space | $S$ | All 16 cells on the 4x4 grid |
| Action space | $A$ | {up, down, left, right} |
| Transition function | $P(s' \mid s, a)$ | Probability of reaching cell $s'$ from $s$ via action $a$ |
| Reward function | $R(s, a)$ | -1 per step, -10 trap, +10 goal |
| Discount factor | $\gamma$ | 0.9 (values future rewards at 90% per step) |

The discount factor $\gamma \in [0, 1]$ controls how much the agent cares about future versus immediate rewards. A $\gamma$ of 0.9 means a reward of +10 received two steps from now is worth $10 \times 0.9^2 = 8.1$ today.

In Plain English: An MDP is the complete rulebook for the gridworld: all possible cells, all possible moves, what happens when you move, and the score you get. The Markov property means the robot only needs to know where it is right now, not its entire path history.
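The discount arithmetic is worth checking by hand; a few lines of plain Python reproduce the 8.1 figure and the full discounted return of a short trajectory:

```python
gamma = 0.9

# The +10 goal reward, two steps in the future, discounted to the present:
present_value = 10 * gamma ** 2  # 10 * 0.81 = 8.1

# Discounted return of a 3-step trajectory: two -1 steps, then the +10 goal.
rewards = [-1, -1, 10]
G = sum(gamma ** k * r for k, r in enumerate(rewards))  # -1 - 0.9 + 8.1 = 6.2
```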

Policies: The Agent's Strategy

A policy $\pi$ maps states to actions. It is the agent's strategy, the complete rule that dictates behavior.

Deterministic policy: $\pi(s) = a$ assigns one specific action to each state. In our gridworld, a deterministic policy might say "in cell (0,0), always go right."

Stochastic policy: $\pi(a \mid s)$ gives a probability distribution over actions for each state. The agent in cell (0,0) might go right with 70% probability and down with 30%. Stochastic policies are essential for exploration and for mixed strategies in game theory.

The goal of RL is to find the optimal policy $\pi^*$ that maximizes expected cumulative reward from every state.
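A stochastic policy for a single cell is just a probability table; `random.choices` handles the weighted draw. A small sketch (the 70/30 split mirrors the cell-(0,0) example above; the function name is ours):

```python
import random

def sample_action(policy_probs):
    """Draw one action from a stochastic policy pi(a|s) given as {action: prob}."""
    actions = list(policy_probs)
    weights = list(policy_probs.values())
    return random.choices(actions, weights=weights, k=1)[0]

random.seed(0)
pi_00 = {"right": 0.7, "down": 0.3}   # pi(a | s=(0,0)) from the text
counts = {"right": 0, "down": 0}
for _ in range(10_000):
    counts[sample_action(pi_00)] += 1
# Over many draws, the empirical frequencies approach 70% / 30%.
```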

Value Functions: Measuring How Good a State Is

Value functions estimate the expected total reward the agent will collect, starting from a given state (or state-action pair) and following a particular policy.

State-Value Function V(s)

$$V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s \right]$$

Where:

  • $V^\pi(s)$ is the expected return starting from state $s$ and following policy $\pi$
  • $\gamma$ is the discount factor (0.9 in our gridworld)
  • $r_{t+k}$ is the reward received $k$ steps into the future
  • $\mathbb{E}_\pi$ denotes the expectation under policy $\pi$

In Plain English: $V^\pi(s)$ answers the question: "if the robot is at cell $s$ and follows policy $\pi$ from now on, how much total (discounted) reward will it earn on average?" Cells near the goal have high value; cells near traps have low value.

Action-Value Function Q(s, a)

$$Q^\pi(s, a) = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s, a_t = a \right]$$

Where:

  • $Q^\pi(s, a)$ is the expected return after taking action $a$ in state $s$, then following $\pi$

The Q-function is more informative than V because knowing $Q^*(s, a)$ for all state-action pairs lets you extract the optimal policy directly: $\pi^*(s) = \arg\max_a Q^*(s, a)$.
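Extracting the greedy policy from a Q-table is a one-line argmax. A sketch, assuming the Q-table is a dict keyed by (state, action) pairs; the Q-values below are illustrative numbers, not learned ones:

```python
def greedy_action(Q, state, actions=("up", "down", "left", "right")):
    """pi*(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])

# Illustrative Q-values for cell (3,2), one step left of the goal:
Q = {((3, 2), "up"): -2.0, ((3, 2), "down"): -1.5,
     ((3, 2), "left"): 1.0, ((3, 2), "right"): 8.0}
best = greedy_action(Q, (3, 2))  # "right", toward the goal
```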

The Bellman Equations: Recursive Value Decomposition

The Bellman equation is the backbone of nearly every RL algorithm. It expresses the value of a state as the immediate reward plus the discounted value of the next state.

Bellman Expectation Equation

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a) + \gamma V^\pi(s') \right]$$

Where:

  • $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$ under policy $\pi$
  • $P(s' \mid s, a)$ is the transition probability from state $s$ to $s'$ given action $a$
  • $R(s, a)$ is the immediate reward for taking action $a$ in state $s$
  • $\gamma V^\pi(s')$ is the discounted future value from the next state

In Plain English: The value of a cell equals the reward for leaving it, plus 0.9 times the value of wherever the robot ends up. If a cell is one step from the goal, its value is roughly $(-1) + 0.9 \times 10 = 8.0$. This recursion creates a system of equations you can solve iteratively.

Bellman Optimality Equation

$$V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a) + \gamma V^*(s') \right]$$

The optimal value function picks the best action rather than averaging over a policy. Dynamic Programming algorithms (Value Iteration, Policy Iteration) solve this directly when $P$ is known.
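Value Iteration sweeps this max-update over all states until values stop changing. A minimal sketch for a deterministic version of our grid; the trap layout is our own assumption, and in this sketch the ±10 arrives as the reward for entering the terminal cell, so exact numbers depend on that convention:

```python
GAMMA = 0.9
GOAL, TRAPS = (3, 3), {(1, 1), (2, 3)}   # illustrative layout
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(s, a):
    """Deterministic transition: bumping a wall leaves the agent in place."""
    dr, dc = MOVES[a]
    return (min(max(s[0] + dr, 0), 3), min(max(s[1] + dc, 0), 3))

def reward(s_next):
    if s_next == GOAL:
        return 10
    if s_next in TRAPS:
        return -10
    return -1

def value_iteration(tol=1e-6):
    V = {(r, c): 0.0 for r in range(4) for c in range(4)}
    while True:
        delta = 0.0
        for s in V:
            if s == GOAL or s in TRAPS:
                continue  # terminal states keep value 0
            best = max(reward(step(s, a)) + GAMMA * V[step(s, a)] for a in MOVES)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

V = value_iteration()
```

Under these assumptions the cell adjacent to the goal converges to 10, and the cell above it to $-1 + 0.9 \times 10 = 8.0$, the same arithmetic as the Bellman example.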

Exploration vs. Exploitation: The Central Tension

Exploration vs. exploitation is the defining dilemma in RL. The agent must exploit actions it already knows are good to collect reward, but also explore new actions that might be even better.

[Figure: Exploration strategies comparison for reinforcement learning agents]

| Strategy | Mechanism | Strengths | Weaknesses |
| --- | --- | --- | --- |
| $\epsilon$-greedy | Random action with probability $\epsilon$, greedy otherwise | Simple, widely used | Explores uniformly, wastes tries on clearly bad actions |
| UCB (Upper Confidence Bound) | Favors actions with high uncertainty | Principled, theoretically optimal | Harder to implement in large state spaces |
| Thompson Sampling | Samples from posterior distribution of rewards | Naturally balances explore/exploit | Requires Bayesian framework |
| Boltzmann (Softmax) | Actions weighted by exponentiated Q-values | Smooth exploration scaling | Sensitive to temperature parameter |

In our gridworld, pure exploitation means always following the highest Q-value. But early on, those estimates are wrong. With $\epsilon = 0.1$, the robot takes a random action 10% of the time, occasionally discovering shorter paths. Over time, $\epsilon$ decays toward zero.

Common Pitfall: Setting $\epsilon$ too low too early locks the agent into a suboptimal policy. A common schedule is $\epsilon_t = \max(0.01, 1.0 - t/N)$ where $N$ is the total number of episodes.
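Both the action rule and the decay schedule fit in a few lines (function names are ours; the floor of 0.01 matches the schedule above):

```python
import random

def epsilon_greedy(Q, state, actions, eps):
    """With probability eps explore uniformly; otherwise exploit the argmax."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def epsilon_schedule(t, n_episodes, floor=0.01):
    """Linear decay from 1.0 down to `floor` over the training run."""
    return max(floor, 1.0 - t / n_episodes)

eps_start = epsilon_schedule(0, 5000)     # 1.0: fully exploratory
eps_end = epsilon_schedule(5000, 5000)    # 0.01: almost fully greedy
```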

Model-Based vs. Model-Free RL

RL algorithms split into two broad families depending on whether the agent tries to learn a model of the environment.

[Figure: RL taxonomy showing model-based vs model-free and value-based vs policy-based approaches]

Model-based methods learn or are given the transition function $P(s' \mid s, a)$ and reward function $R$, then plan ahead by simulating future trajectories. Dynamic Programming and Monte Carlo Tree Search (used in AlphaGo) are model-based. The upside is sample efficiency; the downside is that learning an accurate model can be as hard as solving the original problem.

Model-free methods learn values or policies directly from experience. Q-learning, SARSA, and policy gradient methods all fall here. They need more data but make fewer assumptions. Most practical RL today (including RLHF for LLMs) is model-free.

Pro Tip: If your environment is cheap to simulate (board games, simple physics), go model-based for faster convergence. If it's complex or only accessible through real interaction (robotics, live recommendation systems), model-free is your only option.

Monte Carlo vs. Temporal Difference Learning

Both are model-free approaches to learning value functions, but they differ in when they update estimates.

Monte Carlo (MC) methods wait until an episode ends, then update values based on the actual total return:

$$V(s) \leftarrow V(s) + \alpha \left[ G_t - V(s) \right]$$

Where $G_t$ is the total discounted return from timestep $t$ to the end of the episode, and $\alpha$ is the learning rate.

Temporal Difference (TD) methods update after every step, using an estimate of the return:

$$V(s) \leftarrow V(s) + \alpha \left[ r + \gamma V(s') - V(s) \right]$$

The first part of the bracket is the TD target: the immediate reward plus the discounted value of the next state. The full bracket is the TD error: how far off the current estimate was. TD bootstraps by using its own estimate of the next state's value to improve the current state's value.

| Property | Monte Carlo | Temporal Difference |
| --- | --- | --- |
| Update timing | End of episode | Every step |
| Bias | Unbiased (uses actual returns) | Biased (bootstraps from estimates) |
| Variance | High (returns vary a lot) | Lower (single-step updates) |
| Requires episodes? | Yes (must reach terminal state) | No (works in continuing tasks) |
| Convergence | Slower but guaranteed | Faster in practice |

In Plain English: Monte Carlo is like grading a student only after a final exam. TD learning is like giving pop quizzes every class. The pop quizzes give noisier individual scores, but the student (agent) improves faster because feedback is immediate.
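The two update rules differ in exactly one place: the target. A sketch with a dict-based V-table (function names and the toy states are ours):

```python
def mc_update(V, s, G, alpha=0.1):
    """Monte Carlo: nudge V(s) toward the actual episode return G."""
    V[s] += alpha * (G - V[s])

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0): nudge V(s) toward the bootstrapped target r + gamma * V(s')."""
    td_target = r + gamma * V[s_next]
    td_error = td_target - V[s]
    V[s] += alpha * td_error
    return td_error

V = {(3, 2): 0.0, (3, 3): 0.0}
err = td0_update(V, (3, 2), 10, (3, 3))  # step into the goal, reward +10
# td_target = 10 + 0.9 * 0 = 10, so V[(3,2)] moves from 0.0 to 0.1 * 10 = 1.0
```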

Q-Learning: The Model-Free Workhorse

Q-learning (Watkins, 1989) is the most influential model-free RL algorithm. It learns the optimal action-value function $Q^*$ directly, without needing to know the environment's transition probabilities.

The Q-Learning Update Rule

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Where:

  • $Q(s, a)$ is the current estimate of the value of taking action $a$ in state $s$
  • $\alpha$ is the learning rate (typically 0.1 to 0.5)
  • $r$ is the immediate reward received
  • $\gamma$ is the discount factor
  • $\max_{a'} Q(s', a')$ is the best Q-value achievable from the next state $s'$
  • $r + \gamma \max_{a'} Q(s', a')$ is the TD target

In Plain English: After the robot steps from cell $s$ to cell $s'$ and collects reward $r$, it asks: "was this move better or worse than I expected?" The difference between what happened ($r$ plus the best future value) and what was predicted ($Q(s, a)$) is the error. The robot nudges its estimate by $\alpha$ of that error, and over thousands of episodes, these nudges converge to the true optimal Q-values.

Q-learning is off-policy: it updates using the greedy $\max$ action, regardless of what the agent actually did. This separation of behavior policy from target policy is what makes Q-learning so powerful.

Q-Learning on Our Gridworld

The greedy action learned for each cell (TRAP and GOAL are terminal):

```
 right   right    down    left
  down    TRAP    down    left
 right    down    down    TRAP
 right   right   right    GOAL
```
After 5,000 episodes, the agent learns to avoid traps and reach the goal via the shortest safe path. Cells near the goal carry high Q-values; cells near traps carry negative values.
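A complete tabular implementation fits in about forty lines. This is a sketch under stated assumptions: trap cells at the positions shown in the policy grid, deterministic moves that bump off walls, the ±10 delivered on entering a terminal cell, and the linear $\epsilon$ decay from earlier.

```python
import random

random.seed(0)
ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GOAL, TRAPS = (3, 3), {(1, 1), (2, 3)}  # assumed layout

def step(s, a):
    """Deterministic transition plus reward; off-grid moves stay in place."""
    dr, dc = MOVES[a]
    s2 = (min(max(s[0] + dr, 0), 3), min(max(s[1] + dc, 0), 3))
    if s2 == GOAL:
        return s2, 10, True
    if s2 in TRAPS:
        return s2, -10, True
    return s2, -1, False

def train(episodes=5000, alpha=0.1, gamma=0.9):
    Q = {((r, c), a): 0.0 for r in range(4) for c in range(4) for a in ACTIONS}
    for ep in range(episodes):
        eps = max(0.01, 1.0 - ep / episodes)  # decaying exploration
        s, done = (0, 0), False
        while not done:
            if random.random() < eps:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: Q[(s, x)])
            s2, r, done = step(s, a)
            # Off-policy TD target: greedy max over next actions,
            # with zero future value when s2 is terminal.
            target = r if done else r + gamma * max(Q[(s2, x)] for x in ACTIONS)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

Q = train()

def greedy(s):
    return max(ACTIONS, key=lambda a: Q[(s, a)])
```

After training, `greedy` read off each non-terminal cell reproduces a policy like the grid above, with the cell beside the goal pointing straight into it.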

SARSA: The On-Policy Alternative

SARSA (State-Action-Reward-State-Action) is the on-policy cousin of Q-learning. Instead of using $\max_{a'} Q(s', a')$, it uses the Q-value of the action the agent actually takes next:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$$

Where $a'$ is the action chosen by the current policy in state $s'$ (not the greedy max).

Because SARSA evaluates the policy it's actually following (including exploratory actions), it learns a safer policy. In our gridworld, Q-learning finds the path that skirts the edge of a trap, while SARSA gives traps a wider berth because $\epsilon$-greedy exploration sometimes stumbles into them.

Key Insight: Use Q-learning when you want the theoretically optimal policy and can tolerate risky training. Use SARSA when safety during training matters (robotics, finance), since it accounts for its own imperfect exploration behavior.

When to Use RL vs. Supervised vs. Unsupervised Learning

[Figure: Comparison of RL, supervised, and unsupervised learning approaches]

Not every problem needs reinforcement learning. RL introduces significant complexity, and choosing the right learning approach saves months of wasted effort.

Use RL when:

  1. You need sequential decision-making (actions affect future states)
  2. No labeled dataset exists, but you can define a reward signal
  3. The optimal strategy requires long-term planning, not just single-step prediction
  4. You can afford extensive trial-and-error (simulated or real)

Do NOT use RL when:

  1. You have a labeled dataset. Supervised learning will be faster and more stable.
  2. You want to find patterns or clusters in data (use unsupervised learning)
  3. The environment is too expensive to simulate (each trial costs real money or physical risk)
  4. The reward signal is hard to define precisely (a misspecified reward leads to unexpected loophole exploitation).

Why Reinforcement Learning Is Hard

RL sounds elegant in theory. In practice, it's the most unstable branch of machine learning.

Sample efficiency: Q-learning on our tiny 4x4 grid needs 5,000 episodes. Atari games need tens of millions of frames. Real-world robotics can require billions of simulated interactions.

Reward engineering: Specifying what you want is surprisingly hard. OpenAI famously trained a boat-racing agent that earned more reward spinning in circles to collect bonus items than actually finishing the race. This reward hacking problem remains an active area of research.

Instability: Value estimates can diverge, policies can collapse, and small hyperparameter changes ($\alpha$, $\epsilon$ decay, $\gamma$) can completely change the outcome.

Partial observability: Real environments rarely give full state information. When the Markov property is violated, convergence guarantees weaken.

RL in Production: Where It Matters in 2026

Reinforcement learning has quietly become one of the most impactful branches of ML in production.

RLHF and LLM alignment. Every major language model in 2026 uses RL for post-training alignment: pre-train on text, fine-tune with supervised examples, then apply RL to maximize a learned reward model capturing human preferences. OpenAI's InstructGPT paper (Ouyang et al., 2022) established this approach, and it remains the backbone of ChatGPT, Claude, and Gemini.

GRPO and the DeepSeek breakthrough. DeepSeek-R1 (January 2025) introduced Group Relative Policy Optimization, eliminating the critic (value model) from PPO by computing advantages as relative rankings within a group of sampled responses. GRPO reduces memory by 40-60% versus PPO and trains reasoning capabilities directly from verifiable rewards.

DPO as an RL-free alternative. Direct Preference Optimization (Rafailov et al., 2023) reformulates the RLHF objective as a supervised loss on preference pairs. Simpler to implement, but it falls short of RL-based methods on complex reasoning tasks.

Robotics and recommendations. Google DeepMind's RT-2 uses RL to transfer manipulation skills from simulation to physical robots. Meta, Spotify, YouTube, and TikTok all use RL-driven recommendations optimizing long-term engagement.

Pro Tip: If you're building AI agents in 2026, you're using RL whether you realize it or not. ReAct and planning loops are RL formulations: the agent takes actions (tool calls), observes results (environment feedback), and adjusts its strategy (policy).

Conclusion

Reinforcement learning is the third pillar of machine learning, and arguably the most ambitious. Where supervised learning asks "what's the right answer?" and unsupervised learning asks "what structure exists?", RL asks "what should I do next?" The MDP framework, Bellman equations, and Q-learning give you the mathematical tools to answer that question rigorously.

RLHF remains the dominant approach for aligning language models with human values, and newer methods like GRPO have slashed the computational cost. If you're interested in how these language models work under the hood, start with How Large Language Models Actually Work, then read about Reasoning Models to see RL's role in teaching models to think step by step.

The field's hardest problems remain unsolved: sample efficiency, reward specification, and stable training in high-dimensional spaces. But every major AI breakthrough of the past decade has had reinforcement learning at its core. Start with the gridworld example in this article, implement Q-learning yourself, and build from there. For the mathematical foundations underlying these optimization methods, explore Deep Learning Optimizers: SGD to AdamW and Backpropagation: The Engine of Deep Learning.

Interview Questions

Q: What makes reinforcement learning different from supervised and unsupervised learning?

Supervised learning requires labeled input-output pairs. Unsupervised learning discovers hidden structure in unlabeled data. RL has neither labels nor a fixed dataset; an agent interacts with an environment, receives reward signals, and learns a policy that maximizes cumulative reward through trial and error.

Q: Explain the Bellman equation and why it's important.

The Bellman equation decomposes the value of a state into the immediate reward plus the discounted value of the next state: $V(s) = R + \gamma V(s')$. This recursive structure lets you compute the value of any state from its neighbors' values, making iterative solutions tractable instead of requiring exhaustive simulation of all possible future trajectories.

Q: What is the difference between Q-learning and SARSA?

Q-learning is off-policy, updating with $\max_{a'} Q(s', a')$ (the best possible next action). SARSA is on-policy, updating with $Q(s', a')$ (the action actually taken). Q-learning converges to the optimal policy regardless of exploration strategy; SARSA converges to a policy that accounts for the agent's actual behavior, making it safer when exploratory mistakes are costly.

Q: Why is the discount factor important in RL?

The discount factor $\gamma$ ensures the sum of future rewards converges to a finite value and encodes how much the agent values future versus immediate rewards. A $\gamma$ close to 1 produces far-sighted agents; a $\gamma$ close to 0 produces myopic ones. Chess requires $\gamma \approx 0.99$; a simple control task may work with $\gamma = 0.9$.

Q: What is the exploration-exploitation tradeoff, and how do you handle it?

Exploitation means choosing the action the agent believes is best; exploration means trying other actions to discover potentially better strategies. The most common solution is $\epsilon$-greedy (random action with probability $\epsilon$, greedy otherwise). More principled approaches include UCB, which favors actions with high uncertainty, and Thompson Sampling, which samples from the posterior distribution of expected rewards.

Q: How does RLHF work for training large language models?

RLHF has three stages: (1) pre-train a language model on text data, (2) train a reward model on human preference rankings of model outputs, (3) fine-tune the language model using RL (PPO or GRPO) to maximize the reward model's score. The reward model acts as a proxy for human judgment, enabling millions of RL updates without human feedback on every response.

Q: What is GRPO and why did it matter for DeepSeek-R1?

GRPO eliminates the separate value network (critic) that PPO requires. It samples a group of responses per prompt and computes advantages as relative rankings within that group, reducing memory by 40-60%. This enabled DeepSeek to train reasoning capabilities using verifiable rewards (checking math answers) without a learned reward model, making billion-parameter reasoning training economically feasible.

Q: Your Q-learning agent is not converging. What would you check?

Check the learning rate $\alpha$ (too large causes oscillation, too small means slow convergence). Verify the exploration schedule (if $\epsilon$ decays too fast, the agent exploits a bad policy prematurely). Confirm the reward function and environment dynamics are correct. For large state spaces, consider whether tabular Q-learning is feasible or if you need function approximation (DQN). Finally, verify $\gamma$ is appropriate for the task horizon.

Explore all career paths