A robot arm knocks a cup off a table 47 times before finally learning to place it gently. A game-playing agent loses 10,000 matches of Go, then beats the world champion. Reinforcement learning is the science behind both stories: an agent learns not from labeled examples, but from the consequences of its own actions. As of March 2026, RL has become the secret ingredient behind how large language models actually work, powering the alignment techniques (RLHF, GRPO, DPO) that turn raw language models into helpful assistants.
Unlike supervised learning (correct answers provided) or unsupervised learning (find structure in data), reinforcement learning drops an agent into an environment and says: "figure it out." The agent tries actions, receives rewards or penalties, and gradually discovers a strategy (a policy) that maximizes long-term payoff. This trial-and-error approach is how DeepMind's AlphaGo defeated Lee Sedol, how recommendation engines optimize engagement, and how OpenAI and DeepSeek train reasoning models to think step by step.
We will build intuition through one consistent example: a 4x4 gridworld where an agent moves from a start cell to a goal, avoiding traps. Every formula, every algorithm, every comparison will reference this grid.
The RL Framework: Agent, Environment, State, Action, Reward
Reinforcement learning is a feedback loop between two entities: an agent (the decision-maker) and an environment (everything the agent interacts with). At each timestep $t$, the agent observes a state $s_t$, selects an action $a_t$, receives a reward $r_{t+1}$, and transitions to a new state $s_{t+1}$.
*The agent-environment interaction loop in reinforcement learning*
In our gridworld, the state is the robot's cell position (row, column). Actions are {up, down, left, right}. The reward is -1 per step, -10 for a trap, and +10 for the goal. If the robot moves right from cell (1,2), the environment places it at (1,3) and returns that cell's reward.
This loop repeats until the agent reaches a terminal state (goal or trap). One complete sequence is called an episode. The agent's objective across episodes is to maximize total accumulated reward.
Key Insight: The reward signal is the only feedback the agent gets. It never sees "the correct action." This is what makes RL fundamentally different from supervised learning, and fundamentally harder.
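The loop is easy to make concrete in code. Below is a minimal Python sketch of the gridworld environment, assuming the reward scheme above (-1 per step, -10 trap, +10 goal); the trap and goal positions match the policy grid shown later in the article, and the class and method names are illustrative rather than any standard API.

```python
# A minimal sketch of the article's 4x4 gridworld. Trap and goal positions
# match the policy grid shown later; names are illustrative, not a standard API.
GOAL = (3, 3)
TRAPS = {(1, 1), (2, 3)}

class GridWorld:
    def __init__(self):
        self.state = (0, 0)  # start in the top-left cell

    def step(self, action):
        """Apply 'up'/'down'/'left'/'right'; return (state, reward, done)."""
        dr, dc = {"up": (-1, 0), "down": (1, 0),
                  "left": (0, -1), "right": (0, 1)}[action]
        row = min(max(self.state[0] + dr, 0), 3)  # clip to the 4x4 grid
        col = min(max(self.state[1] + dc, 0), 3)
        self.state = (row, col)
        if self.state == GOAL:
            return self.state, +10, True
        if self.state in TRAPS:
            return self.state, -10, True
        return self.state, -1, False

env = GridWorld()
state, reward, done = env.step("right")
print(state, reward, done)  # (0, 1) -1 False
```

Note that the environment returns only the new state and a scalar reward; nothing in the interface tells the agent which action would have been correct.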
Markov Decision Processes: The Mathematical Foundation
A Markov Decision Process (MDP) formalizes the RL problem as a tuple $(S, A, P, R, \gamma)$. The "Markov" part means the future depends only on the current state, not on the history of how the agent got there.
| Component | Symbol | Gridworld Example |
|---|---|---|
| State space | $S$ | All 16 cells on the 4x4 grid |
| Action space | $A$ | {up, down, left, right} |
| Transition function | $P(s' \mid s, a)$ | Probability of reaching cell $s'$ from $s$ via action $a$ |
| Reward function | $R(s, a)$ | -1 per step, -10 trap, +10 goal |
| Discount factor | $\gamma$ | 0.9 (values future rewards at 90% per step) |
The discount factor $\gamma$ controls how much the agent cares about future versus immediate rewards. A $\gamma$ of 0.9 means a reward of +10 received two steps from now is worth $10 \times 0.9^2 = 8.1$ today.
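The discounting arithmetic is a one-liner worth checking for yourself:

```python
# Verifying the discounting arithmetic from the text: a +10 reward that
# arrives two steps from now, discounted at gamma = 0.9 per step.
gamma = 0.9
rewards = [0, 0, 10]  # nothing now, nothing next step, +10 two steps out
present_value = sum(gamma**k * r for k, r in enumerate(rewards))
print(present_value)  # ~8.1
```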
In Plain English: An MDP is the complete rulebook for the gridworld: all possible cells, all possible moves, what happens when you move, and the score you get. The Markov property means the robot only needs to know where it is right now, not its entire path history.
Policies: The Agent's Strategy
A policy $\pi$ maps states to actions. It is the agent's strategy, the complete rule that dictates behavior.
Deterministic policy: $\pi(s) = a$ assigns one specific action to each state. In our gridworld, a deterministic policy might say "in cell (0,0), always go right."
Stochastic policy: $\pi(a \mid s)$ gives a probability distribution over actions for each state. The agent in cell (0,0) might go right with 70% probability and down with 30%. Stochastic policies are essential for exploration and for mixed strategies in game theory.
The goal of RL is to find the optimal policy $\pi^*$ that maximizes expected cumulative reward from every state.
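Both policy types can be sketched as plain Python mappings; the particular state-action assignments below are illustrative.

```python
import random

# Deterministic vs. stochastic policies as plain Python mappings.
# The specific state-action assignments are illustrative.
random.seed(0)

det_policy = {(0, 0): "right", (0, 1): "right", (1, 0): "down"}
stoch_policy = {(0, 0): {"right": 0.7, "down": 0.3}}

def act(policy, state):
    rule = policy[state]
    if isinstance(rule, str):  # deterministic: always the same action
        return rule
    actions, probs = zip(*rule.items())  # stochastic: sample from distribution
    return random.choices(actions, weights=probs, k=1)[0]

print(act(det_policy, (0, 0)))    # right
print(act(stoch_policy, (0, 0)))  # right ~70% of the time, down ~30%
```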
Value Functions: Measuring How Good a State Is
Value functions estimate the expected total reward the agent will collect, starting from a given state (or state-action pair) and following a particular policy.
State-Value Function V(s)
$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s\right]$$

Where:
- $V^\pi(s)$ is the expected return starting from state $s$ and following policy $\pi$
- $\gamma$ is the discount factor (0.9 in our gridworld)
- $r_{t+k+1}$ is the reward received $k+1$ steps into the future
- $\mathbb{E}_\pi$ denotes the expectation under policy $\pi$
In Plain English: $V^\pi(s)$ answers the question: "if the robot is at cell $s$ and follows policy $\pi$ from now on, how much total (discounted) reward will it earn on average?" Cells near the goal have high value; cells near traps have low value.
Action-Value Function Q(s, a)
$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\middle|\; s_t = s,\, a_t = a\right]$$

Where:
- $Q^\pi(s, a)$ is the expected return after taking action $a$ in state $s$, then following $\pi$
The Q-function is more informative than V because knowing $Q^*(s, a)$ for all state-action pairs lets you extract the optimal policy directly: $\pi^*(s) = \arg\max_a Q^*(s, a)$.
The Bellman Equations: Recursive Value Decomposition
The Bellman equation is the backbone of nearly every RL algorithm. It expresses the value of a state as the immediate reward plus the discounted value of the next state.
Bellman Expectation Equation
$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a) + \gamma V^\pi(s')\right]$$

Where:
- $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$ under policy $\pi$
- $P(s' \mid s, a)$ is the transition probability from state $s$ to $s'$ given action $a$
- $R(s, a)$ is the immediate reward for taking action $a$ in state $s$
- $\gamma V^\pi(s')$ is the discounted future value from the next state
In Plain English: The value of a cell equals the reward for leaving it, plus 0.9 times the value of wherever the robot ends up. Cell (3,2) is one step from the goal: moving right earns the +10 goal reward, so its value is $10 + 0.9 \times 0 = 10$, and the cell just before it is worth roughly $-1 + 0.9 \times 10 = 8$. This recursion creates a system of equations you can solve iteratively.
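That iterative solution can be sketched directly. The code below runs repeated Bellman-expectation sweeps to evaluate an illustrative fixed policy ("go right until the last column, then down") on the gridworld; because the moves are deterministic, the expectation over next states collapses to a single term.

```python
# Iterative policy evaluation via repeated Bellman-expectation sweeps.
# The fixed policy here is illustrative; moves are deterministic, so the
# expectation over next states collapses to a single term.
GAMMA = 0.9
GOAL, TRAPS = (3, 3), {(1, 1), (2, 3)}

def policy(s):
    return "right" if s[1] < 3 else "down"

def step(s, a):
    dr, dc = {"up": (-1, 0), "down": (1, 0),
              "left": (0, -1), "right": (0, 1)}[a]
    s2 = (min(max(s[0] + dr, 0), 3), min(max(s[1] + dc, 0), 3))
    reward = 10 if s2 == GOAL else (-10 if s2 in TRAPS else -1)
    return s2, reward

V = {(i, j): 0.0 for i in range(4) for j in range(4)}
for _ in range(100):  # repeated sweeps converge toward V^pi
    for s in V:
        if s == GOAL or s in TRAPS:
            continue  # terminal states keep value 0
        s2, r = step(s, policy(s))
        V[s] = r + GAMMA * V[s2]  # Bellman expectation, deterministic case

print(round(V[(3, 2)], 2), round(V[(3, 1)], 2))  # 10.0 8.0
```

The printed values match the worked example above: 10 for the cell adjacent to the goal, 8 for the one behind it.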
Bellman Optimality Equation
$$V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s, a) + \gamma V^*(s')\right]$$

The optimal value function $V^*$ picks the best action rather than averaging over a policy. Dynamic Programming algorithms (Value Iteration, Policy Iteration) solve this directly when the transition function $P$ is known.
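When the transition function is known, the optimality backup turns into a short loop. This value-iteration sketch assumes the same deterministic gridworld dynamics used throughout the article, then reads the greedy policy off the converged values.

```python
# Value iteration: apply the Bellman optimality backup until values stop
# changing, then extract the greedy policy. Deterministic gridworld assumed.
GAMMA = 0.9
GOAL, TRAPS = (3, 3), {(1, 1), (2, 3)}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(s, a):
    dr, dc = MOVES[a]
    s2 = (min(max(s[0] + dr, 0), 3), min(max(s[1] + dc, 0), 3))
    reward = 10 if s2 == GOAL else (-10 if s2 in TRAPS else -1)
    return s2, reward

def backup(V, s, a):
    s2, r = step(s, a)
    return r + GAMMA * V[s2]

V = {(i, j): 0.0 for i in range(4) for j in range(4)}
delta = 1.0
while delta > 1e-6:
    delta = 0.0
    for s in V:
        if s == GOAL or s in TRAPS:
            continue
        best = max(backup(V, s, a) for a in MOVES)  # max over actions
        delta = max(delta, abs(best - V[s]))
        V[s] = best

greedy = {s: max(MOVES, key=lambda a: backup(V, s, a))
          for s in V if s != GOAL and s not in TRAPS}
print(greedy[(3, 2)], round(V[(3, 2)], 2))  # right 10.0
```

The only change from policy evaluation is replacing the policy's action with a max over all actions; that single max is what turns evaluation into optimization.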
Exploration vs. Exploitation: The Central Tension
Exploration vs. exploitation is the defining dilemma in RL. The agent must exploit actions it already knows are good to collect reward, but also explore new actions that might be even better.
*Exploration strategies comparison for reinforcement learning agents*
| Strategy | Mechanism | Strengths | Weaknesses |
|---|---|---|---|
| $\epsilon$-greedy | Random action with probability $\epsilon$, greedy otherwise | Simple, widely used | Explores uniformly, wastes tries on clearly bad actions |
| UCB (Upper Confidence Bound) | Favors actions with high uncertainty | Principled, theoretically optimal | Harder to implement in large state spaces |
| Thompson Sampling | Samples from posterior distribution of rewards | Naturally balances explore/exploit | Requires Bayesian framework |
| Boltzmann (Softmax) | Actions weighted by exponentiated Q-values | Smooth exploration scaling | Sensitive to temperature parameter |
In our gridworld, pure exploitation means always following the highest Q-value. But early on, those estimates are wrong. With $\epsilon = 0.1$, the robot takes a random action 10% of the time, occasionally discovering shorter paths. Over time, $\epsilon$ decays toward zero.
Common Pitfall: Setting $\epsilon$ too low too early locks the agent into a suboptimal policy. A common schedule is the linear decay $\epsilon_k = \max(\epsilon_{\min}, 1 - k/N)$, where $N$ is the total number of episodes.
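An $\epsilon$-greedy selector with a linear decay schedule can be sketched as follows; the 0.01 floor and the schedule shape are illustrative choices, not the only options.

```python
import random

# Epsilon-greedy action selection with a linear decay schedule.
# The 0.01 floor and schedule shape are illustrative choices.
def make_schedule(n_episodes, eps_min=0.01):
    return lambda k: max(eps_min, 1.0 - k / n_episodes)

def epsilon_greedy(q_values, actions, eps):
    if random.random() < eps:
        return random.choice(actions)  # explore: uniform random action
    return max(actions, key=lambda a: q_values[a])  # exploit: greedy action

eps_at = make_schedule(n_episodes=5000)
print(eps_at(0), eps_at(2500), eps_at(5000))  # 1.0 0.5 0.01
print(epsilon_greedy({"left": 0.2, "right": 1.5}, ["left", "right"], eps=0.0))  # right
```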
Model-Based vs. Model-Free RL
RL algorithms split into two broad families depending on whether the agent tries to learn a model of the environment.
*RL taxonomy showing model-based vs model-free and value-based vs policy-based approaches*
Model-based methods learn or are given the transition function $P(s' \mid s, a)$ and reward function $R(s, a)$, then plan ahead by simulating future trajectories. Dynamic Programming and Monte Carlo Tree Search (used in AlphaGo) are model-based. The upside is sample efficiency; the downside is that learning an accurate model can be as hard as solving the original problem.
Model-free methods learn values or policies directly from experience. Q-learning, SARSA, and policy gradient methods all fall here. They need more data but make fewer assumptions. Most practical RL today (including RLHF for LLMs) is model-free.
Pro Tip: If your environment is cheap to simulate (board games, simple physics), go model-based for faster convergence. If it's complex or only accessible through real interaction (robotics, live recommendation systems), model-free is your only option.
Monte Carlo vs. Temporal Difference Learning
Both are model-free approaches to learning value functions, but they differ in when they update estimates.
Monte Carlo (MC) methods wait until an episode ends, then update values based on the actual total return:

$$V(s_t) \leftarrow V(s_t) + \alpha\left[G_t - V(s_t)\right]$$

Where $G_t$ is the total discounted return from timestep $t$ to the end of the episode, and $\alpha$ is the learning rate.
Temporal Difference (TD) methods update after every step, using an estimate of the return:

$$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$$

The first part of the bracket, $r_{t+1} + \gamma V(s_{t+1})$, is the TD target: the immediate reward plus the discounted value of the next state. The full bracket is the TD error: how far off the current estimate was. TD bootstraps by using its own estimate of the next state's value to improve the current state's value.
| Property | Monte Carlo | Temporal Difference |
|---|---|---|
| Update timing | End of episode | Every step |
| Bias | Unbiased (uses actual returns) | Biased (bootstraps from estimates) |
| Variance | High (returns vary a lot) | Lower (single-step updates) |
| Requires episodes? | Yes (must reach terminal state) | No (works in continuing tasks) |
| Convergence | Slower but guaranteed | Faster in practice |
In Plain English: Monte Carlo is like grading a student only after a final exam. TD learning is like giving pop quizzes every class. The pop quizzes give noisier individual scores, but the student (agent) improves faster because feedback is immediate.
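The two update rules differ by only a few lines of code. This sketch runs both on a single made-up three-state episode; the learning rate, rewards, and episode trace are illustrative.

```python
# Contrasting MC and TD(0) updates on one recorded episode.
# The episode trace and hyperparameters are made up for illustration.
ALPHA, GAMMA = 0.5, 0.9

# Monte Carlo: wait for the episode to end, then use the actual return G_t.
episode = [("A", -1), ("B", -1), ("C", 10)]  # (state, reward after leaving it)
V_mc = {"A": 0.0, "B": 0.0, "C": 0.0}
G = 0.0
for state, reward in reversed(episode):
    G = reward + GAMMA * G  # actual discounted return from this state onward
    V_mc[state] += ALPHA * (G - V_mc[state])

# TD(0): update after every step from the one-step bootstrapped target.
transitions = [("A", -1, "B"), ("B", -1, "C"), ("C", 10, "end")]
V_td = {"A": 0.0, "B": 0.0, "C": 0.0, "end": 0.0}
for s, r, s2 in transitions:
    V_td[s] += ALPHA * (r + GAMMA * V_td[s2] - V_td[s])  # TD error nudge

print(round(V_mc["A"], 2), V_td["A"])  # 3.1 -0.5
```

After a single episode, MC has already propagated the +10 back to state A, while TD has only seen A's immediate -1; over many episodes, TD's step-by-step bootstrapping catches up and typically converges faster.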
Q-Learning: The Model-Free Workhorse
Q-learning (Watkins, 1989) is the most influential model-free RL algorithm. It learns the optimal action-value function $Q^*(s, a)$ directly, without needing to know the environment's transition probabilities.
The Q-Learning Update Rule
$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$

Where:
- $Q(s, a)$ is the current estimate of the value of taking action $a$ in state $s$
- $\alpha$ is the learning rate (typically 0.1 to 0.5)
- $r$ is the immediate reward received
- $\gamma$ is the discount factor
- $\max_{a'} Q(s', a')$ is the best Q-value achievable from the next state $s'$
- $r + \gamma \max_{a'} Q(s', a')$ is the TD target
In Plain English: After the robot steps from cell $s$ to cell $s'$ and collects reward $r$, it asks: "was this move better or worse than I expected?" The difference between what happened ($r$ plus the best future value) and what was predicted ($Q(s, a)$) is the error. The robot nudges its estimate by $\alpha$ of that error, and over thousands of episodes, these nudges converge to the true optimal Q-values.
Q-learning is off-policy: it updates using the greedy action, regardless of what the agent actually did. This separation of behavior policy from target policy is what makes Q-learning so powerful.
Q-Learning on Our Gridworld
```
right  right  down   left
down   TRAP   down   left
right  down   down   TRAP
right  right  right  GOAL
```
After 5,000 episodes, the agent learns to avoid traps and reach the goal via the shortest safe path. Cells near the goal carry high Q-values; cells near traps carry negative values.
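A complete tabular Q-learning loop for this gridworld fits in a few dozen lines. The hyperparameters and the $\epsilon$ decay schedule below are illustrative; trap positions match the policy grid above.

```python
import random

# Tabular Q-learning on the gridworld. Hyperparameters and the epsilon
# decay schedule are illustrative; trap positions match the policy grid.
random.seed(0)
GAMMA, ALPHA, EPISODES = 0.9, 0.1, 5000
GOAL, TRAPS = (3, 3), {(1, 1), (2, 3)}
ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(s, a):
    dr, dc = MOVES[a]
    s2 = (min(max(s[0] + dr, 0), 3), min(max(s[1] + dc, 0), 3))
    reward = 10 if s2 == GOAL else (-10 if s2 in TRAPS else -1)
    return s2, reward, s2 == GOAL or s2 in TRAPS

Q = {(s, a): 0.0 for s in [(i, j) for i in range(4) for j in range(4)]
     for a in ACTIONS}
for ep in range(EPISODES):
    eps = max(0.01, 1.0 - ep / EPISODES)  # decaying exploration
    s, done = (0, 0), False
    while not done:
        if random.random() < eps:
            a = random.choice(ACTIONS)  # explore
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])  # exploit
        s2, r, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s2, x)] for x in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])  # TD update
        s = s2

best = max(ACTIONS, key=lambda a: Q[((3, 2), a)])
print(best, round(Q[((3, 2), "right")], 1))  # 'right', with a value close to 10
```

Notice that the update uses the max over next actions even when the behavior was exploratory; that is the off-policy property in action.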
SARSA: The On-Policy Alternative
SARSA (State-Action-Reward-State-Action) is the on-policy cousin of Q-learning. Instead of using $\max_{a'} Q(s', a')$, it uses the Q-value of the action the agent actually takes next:

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma Q(s', a') - Q(s, a)\right]$$

Where $a'$ is the action chosen by the current policy in state $s'$ (not the greedy max).
Because SARSA evaluates the policy it's actually following (including exploratory actions), it learns a safer policy. In our gridworld, Q-learning finds the path that skirts the edge of a trap, while SARSA gives traps a wider berth because $\epsilon$-greedy exploration sometimes stumbles into them.
Key Insight: Use Q-learning when you want the theoretically optimal policy and can tolerate risky training. Use SARSA when safety during training matters (robotics, finance), since it accounts for its own imperfect exploration behavior.
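The entire difference between the two algorithms is the update target, which a short sketch makes plain; the Q-values below are made up for illustration.

```python
# The only change from Q-learning is the target: SARSA plugs in the Q-value
# of the action a2 it will actually take next (possibly exploratory), not
# the max. The Q-values here are made up for illustration.
GAMMA = 0.9

def q_learning_target(Q, s2, r, actions):
    return r + GAMMA * max(Q[(s2, a)] for a in actions)  # greedy bootstrap

def sarsa_target(Q, s2, a2, r):
    return r + GAMMA * Q[(s2, a2)]  # bootstrap from the action actually taken

Q = {((0, 1), "right"): 5.0, ((0, 1), "down"): -2.0}
print(q_learning_target(Q, (0, 1), -1, ["right", "down"]))  # 3.5
print(sarsa_target(Q, (0, 1), "down", -1))  # about -2.8
```

When the exploratory action `a2` happens to be a bad one, SARSA's target reflects that cost, which is exactly why it learns the more cautious policy described above.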
When to Use RL vs. Supervised vs. Unsupervised Learning
*Comparison of RL, supervised, and unsupervised learning approaches*
Not every problem needs reinforcement learning. RL introduces significant complexity, and choosing the right learning approach saves months of wasted effort.
Use RL when:
- You need sequential decision-making (actions affect future states)
- No labeled dataset exists, but you can define a reward signal
- The optimal strategy requires long-term planning, not just single-step prediction
- You can afford extensive trial-and-error (simulated or real)
Do NOT use RL when:
- You have a labeled dataset. Supervised learning will be faster and more stable.
- You want to find patterns or clusters in data (use unsupervised learning)
- The environment is too expensive to simulate (each trial costs real money or physical risk)
- The reward signal is hard to define precisely (a misspecified reward leads to unexpected loophole exploitation).
Why Reinforcement Learning Is Hard
RL sounds elegant in theory. In practice, it's the most unstable branch of machine learning.
Sample efficiency: Q-learning on our tiny 4x4 grid needs 5,000 episodes. Atari games need tens of millions of frames. Real-world robotics can require billions of simulated interactions.
Reward engineering: Specifying what you want is surprisingly hard. OpenAI famously trained a boat-racing agent that earned more reward spinning in circles to collect bonus items than actually finishing the race. This reward hacking problem remains an active area of research.
Instability: Value estimates can diverge, policies can collapse, and small hyperparameter changes (the learning rate $\alpha$, the $\epsilon$ decay schedule, the discount factor $\gamma$) can completely change the outcome.
Partial observability: Real environments rarely give full state information. When the Markov property is violated, convergence guarantees weaken.
RL in Production: Where It Matters in 2026
Reinforcement learning has quietly become one of the most impactful branches of ML in production.
RLHF and LLM alignment. Every major language model in 2026 uses RL for post-training alignment: pre-train on text, fine-tune with supervised examples, then apply RL to maximize a learned reward model capturing human preferences. OpenAI's InstructGPT paper (Ouyang et al., 2022) established this approach, and it remains the backbone of ChatGPT, Claude, and Gemini.
GRPO and the DeepSeek breakthrough. DeepSeek-R1 (January 2025) introduced Group Relative Policy Optimization, eliminating the critic (value model) from PPO by computing advantages as relative rankings within a group of sampled responses. GRPO reduces memory by 40-60% versus PPO and trains reasoning capabilities directly from verifiable rewards.
DPO as an RL-free alternative. Direct Preference Optimization (Rafailov et al., 2023) reformulates the RLHF objective as a supervised loss on preference pairs. Simpler to implement, but it falls short of RL-based methods on complex reasoning tasks.
Robotics and recommendations. Google DeepMind's RT-2 uses RL to transfer manipulation skills from simulation to physical robots. Meta, Spotify, YouTube, and TikTok all use RL-driven recommendations optimizing long-term engagement.
Pro Tip: If you're building AI agents in 2026, you're using RL whether you realize it or not. ReAct and planning loops are RL formulations: the agent takes actions (tool calls), observes results (environment feedback), and adjusts its strategy (policy).
Conclusion
Reinforcement learning is the third pillar of machine learning, and arguably the most ambitious. Where supervised learning asks "what's the right answer?" and unsupervised learning asks "what structure exists?", RL asks "what should I do next?" The MDP framework, Bellman equations, and Q-learning give you the mathematical tools to answer that question rigorously.
RLHF remains the dominant approach for aligning language models with human values, and newer methods like GRPO have slashed the computational cost. If you're interested in how these language models work under the hood, start with How Large Language Models Actually Work, then read about Reasoning Models to see RL's role in teaching models to think step by step.
The field's hardest problems remain unsolved: sample efficiency, reward specification, and stable training in high-dimensional spaces. But every major AI breakthrough of the past decade has had reinforcement learning at its core. Start with the gridworld example in this article, implement Q-learning yourself, and build from there. For the mathematical foundations underlying these optimization methods, explore Deep Learning Optimizers: SGD to AdamW and Backpropagation: The Engine of Deep Learning.
Interview Questions
Q: What makes reinforcement learning different from supervised and unsupervised learning?
Supervised learning requires labeled input-output pairs. Unsupervised learning discovers hidden structure in unlabeled data. RL has neither labels nor a fixed dataset; an agent interacts with an environment, receives reward signals, and learns a policy that maximizes cumulative reward through trial and error.
Q: Explain the Bellman equation and why it's important.
The Bellman equation decomposes the value of a state into the immediate reward plus the discounted value of the next state: $V^\pi(s) = \mathbb{E}\left[r + \gamma V^\pi(s')\right]$. This recursive structure lets you compute the value of any state from its neighbors' values, making iterative solutions tractable instead of requiring exhaustive simulation of all possible future trajectories.
Q: What is the difference between Q-learning and SARSA?
Q-learning is off-policy, updating with $\max_{a'} Q(s', a')$ (the best possible next action). SARSA is on-policy, updating with $Q(s', a')$ (the action actually taken). Q-learning converges to the optimal policy regardless of exploration strategy; SARSA converges to a policy that accounts for the agent's actual behavior, making it safer when exploratory mistakes are costly.
Q: Why is the discount factor important in RL?
The discount factor $\gamma$ ensures the sum of future rewards converges to a finite value and encodes how much the agent values future versus immediate rewards. A $\gamma$ close to 1 produces far-sighted agents; a $\gamma$ close to 0 produces myopic ones. Chess requires $\gamma$ near 1 (e.g., 0.99) because the payoff arrives many moves later; a simple control task may work with a much smaller $\gamma$.
Q: What is the exploration-exploitation tradeoff, and how do you handle it?
Exploitation means choosing the action the agent believes is best; exploration means trying other actions to discover potentially better strategies. The most common solution is $\epsilon$-greedy (random action with probability $\epsilon$, greedy otherwise). More principled approaches include UCB, which favors actions with high uncertainty, and Thompson Sampling, which samples from the posterior distribution of expected rewards.
Q: How does RLHF work for training large language models?
RLHF has three stages: (1) pre-train a language model on text data, (2) train a reward model on human preference rankings of model outputs, (3) fine-tune the language model using RL (PPO or GRPO) to maximize the reward model's score. The reward model acts as a proxy for human judgment, enabling millions of RL updates without human feedback on every response.
Q: What is GRPO and why did it matter for DeepSeek-R1?
GRPO eliminates the separate value network (critic) that PPO requires. It samples a group of responses per prompt and computes advantages as relative rankings within that group, reducing memory by 40-60%. This enabled DeepSeek to train reasoning capabilities using verifiable rewards (checking math answers) without a learned reward model, making billion-parameter reasoning training economically feasible.
Q: Your Q-learning agent is not converging. What would you check?
Check the learning rate $\alpha$ (too large causes oscillation, too small means slow convergence). Verify the exploration schedule (if $\epsilon$ decays too fast, the agent exploits a bad policy prematurely). Confirm the reward function and environment dynamics are correct. For large state spaces, consider whether tabular Q-learning is feasible or if you need function approximation (DQN). Finally, verify $\gamma$ is appropriate for the task horizon.