Hi! I’m @naklecha & I love learning through examples and jumping right into things. It works well for me, it’s fun and imo it’s the best way to learn anything. So, that’s what I’m going to do. :) Fuck it! Let’s start by trying to solve chess.
note: please read this blog in dark theme (the colors are nicer that way)
To start with let’s call the initial “state” of the chessboard - s
In the above board state, white is winning because white has an extra rook! But let's look at it from a computer's pov. This is what a computer would know without calculations or lookahead:
It doesn’t matter if the player is 1 move away from the win, 30 moves away or losing in 3 moves. Without additional calculations, the computer has no idea if the position is good and by how much. Let’s try and calculate how good the position is.
Now, for any given chess board let’s say we have a magical function that can give us a True or False boolean output if the position is winning or losing. Let’s call this magical function (in reinforcement learning it’s called the value function) - V(s) , this function takes in a state of the board and returns True or False if the position is winning or not.
Okay, but realistically can this function ever exist for the game of chess? The branching factor in chess (the number of possible moves at each turn) is around 30-35 on average. After just 5 moves, you're already looking at millions of positions. Now think of a 20-30 move game. We are looking at roughly 2^50 possible moves. That’s fucked up.
You are going to realize that picking chess as an example was a bad idea because its branching factor is too high. But, picking bad examples helps me learn better. Later, we can move on to more complex (actually useful) algorithms that solve some of those problems.
For now let’s ignore the part where I said this is impossible, and look at a simple chessboard state (a checkmate in 3 for example). In this example, it’s white to move let’s determine the value of the position, V(s) for white.
From the starting state (board position), white can make a bunch of possible moves (shown below) — 11 to exact.
It should be pretty intuitive that the “value” of the current position is somehow dependent on the value of the next position. The most naive version of this thought process would be something like:
V(s) = max( V(s’) ) for all actions a (s → after an action a → s’)
The above equation essentially says — “the value of the chess position is the value of the chess position after making the best move (because if we are playing optimally we would always play the best chess move, aka take the best action)” which obviously makes sense.
Math people love to write equations, but to me, it’s a lot easier to look at the above equation as just a tree search (depth-first search). There is a problem here though — even if we had infinite compute and time and traversing a chess game’s tree the equation — V(s) = max( V(s’) ) doesn’t care if the position is checkmate in 3 moves or 20 moves. Sometimes, you would prefer a position that is checkmate in 3 moves and should have a higher “value” than a position in 20.
This is where I’m going to introduce you to something (in the rl world) called the discount factor (γ) which will enable us to penalize “states” that are further away from the “optimal/winning state”. Now, the equation looks something like:
V(s) = max( γ * V(s’) ) for all actions a (s → after an action a → s’) and γ is the discount factor with a value range of [0,1].
Let’s say the “value” of a checkmate position is 100 when you win and -100 when you lose. The values at positions when considering the discount factor (let’s say the discount factor is 0.9).
V(checkmate position) = 100
V(checkmate in 1 position) = 0.9 * 100 = 90
V(checkmate in 2 position) = 0.9 * 90 = 81
V(checkmate in 3 position) = 0.9 * 81 = 72.9
and so on… the “discount factor” ensures that when a position is “more” winning or closer to a win, the state has a higher value.
Okay, I’ll be honest in my chess example the “discount factor” isn’t that useful. Like who cares if a position is checkmate in 20 or 2 moves? It’s a guaranteed win either way. I get it! Hear me out though… Not every situation/game is deterministic.
Think of the situation in which you are trying to solve a more complex game like Dota2 and you have to destroy the enemy’s base. If you haven’t played the game yet, here is a quick explanation — each team of 5 players has a base that they have to protect. The way you win a game of Dota is by destroying the enemy’s base, if the enemy destroys your base you lose.
If you play Dota2 or have played the game before, don’t get mad at me I’m offensively oversimplifying here.
Dota is a lot more complex than chess. Not only is its action space MASSIVE (almost infinite in practice) but anything can happen at any time because you don’t know what items (items are something you can buy during a Dota game duration and they help you win games) the enemy possesses and sometimes you don’t even know the position of your enemies.
Okay, so coming back to why the discount factor is needed — in a nondeterministic game like Dota2 anything is possible. When anything is possible and the game is undeterministic, you would want to end the game ASAP!!!! The discount factor with this equation — V(s) = max( γ * V(s’) ) is great because it will help prioritize the shortest sequence of actions that will likely lead you to a win.
So far in both the chess and Dota2 examples, we have some major problems:
Both games have so many possible states that our current algorithm V(s) = max( γ * V(s’) ) is impractical because the recursive depth of V(s) is essentially unthinkably massive for both games.
They are both 2 player games, it’s not clear what the value function calculation looks like.
In the Dota2 example, I talked about undeterministic states and actions. The current equation doesn’t account for randomness.
Dota2 has both continuous actions and discrete — you can click anywhere on the map and your hero will move there. This creates a constant 2D action space for movement, similar to how a robot might move in a 2D plane. The x and y coordinates where you can click are continuous values within the bounds of the map.
However, many other actions are discrete: selecting which ability to use (typically 4 unique abilities per hero), choosing which item to use (6 item slots), targeting abilities (either on a unit or a location), attack commands, and shop purchases.
^^ This is tricky as fuck, our algorithm falls apart because we expect actions to be discrete and the resulting states to be discrete as well. This combination makes Dota2's action space what we call a "hybrid" or "mixed" action space.
A non-hybrid but continuous action space example would be self-driving cars. When you're driving, you don't press the gas pedal in discrete steps - it's a continuous range from 0% to 100%. Similarly, the steering wheel can be turned to any angle.
Before I address the problems that I mentioned above I want to talk about “rewards” in reinforcement learning. Think about learning to play chess. If we only had V(s) = max(γ * V(s')) with no rewards, and we only knew that checkmate states were good or bad, we'd face a serious problem: most chess positions would appear equally valuable until we could see all the way to checkmate. This is because the value would only "flow backwards" from the final states.
This problem can be easily addressed by having intermediate rewards. Consider the following equation:
V(s) = max( R(s, a) + γ * V(s’) ) where R(s,a) is a reward function that outputs a reward when provided an action for the current state s.
Btw, this equation is known as the “Bellman Equation” and is one of the most foundational building block of reinforcement learning.
Modelling your world, states, actions and finding meaningful rewards for your problem in reinforcement learning is an art form. Chess is an extremely studied game and rewards for different states are generally standard and explored. Here is a pretty standard reward table for the game of chess:
chess_reward_table = {
# Material values (traditional piece values)
'piece_values': {
'pawn': 1.0,
'knight': 3.0,
'bishop': 3.0,
'rook': 5.0,
'queen': 9.0
},
# Positional rewards
'position_rewards': {
'control_center': 0.2, # Bonus for controlling center squares
'connected_rooks': 0.3, # Bonus for connected rooks
'bishop_pair': 0.3, # Small bonus for having both bishops
'doubled_pawns': -0.2, # Penalty for doubled pawns
'isolated_pawns': -0.2, # Penalty for isolated pawns
},
# Development rewards (early game)
'development': {
'piece_developed': 0.1, # Piece moved from starting square
'king_castled': 0.5, # Successfully castled king
'controlling_open_file': 0.2 # Rook on open file
},
# King safety
'king_safety': {
'pawn_shield': 0.3, # Pawns protecting king
'king_exposed': -0.4, # Penalty for exposed king
},
# Game ending states
'game_end': {
'checkmate_win': 100.0, # Winning the game
'checkmate_loss': -100.0, # Losing the game
'stalemate': 0.0, # Draw
}
}
I want you to really understand that these rewards are extremely important to future algorithms and optimizations. They really help guide and speed up search exponentially. In the above example the reward table can understood like:
R(s, a) = +100 for checkmates, +9 for capturing the enemy's queen, and +9.3 (9+0.3) for capturing a queen and connecting 2 rooks together.
We have an updated equation with rewards — V(s) = max( R(s, a) + γ * V(s’) ) does this make our search more efficient? No! Not yet. Why not? Because to calculate the state’s value we still have to traverse the practically infinitely massive tree of states (like chess board states).
This problem of too many (sometimes infinite) states and actions is the heart of all reinforcement learning.
Reinforcement learning problems can’t be “solved” but can be “estimated”. From now on, we will learn about the most efficient and accurate ways researchers have found to estimate problems (never really solve).
Okay, so the tree can’t be fully traversed but what if we partially traversed a state tree by pruning or **picking only “sensible” actions. In reinforcement learning this is called “exploration vs exploitation”.
Exploration: When learning in an uncertain environment, an agent needs to try new, potentially suboptimal actions to discover better strategies it might have missed.
Exploitation: Once an agent has discovered reliable strategies, it can exploit this knowledge by repeatedly choosing actions that lead to rewards.
But why exploration? The reason why we need exploration (and try potentially suboptimal actions) is because sometimes taking suboptimal short-term actions leads to better long-term outcomes. Examples:
The exploration-exploitation dilemma is fundamental to learning - like a pathfinding algorithm for road networks balancing between routing through established highways (exploitation) and investigating side streets and alternate routes (exploration). Too much exploration wastes computational effort testing every possible road combination, while pure exploitation might miss that optimal route through local roads that bypass congested highways entirely.
Something I want to talk about before we move on to more complex algorithms — Q(s,a) function also known as the action-value function!
Q(s,a) represents state-action value - value at starting from state s, taking action a:
Q(s,a) = R(s,a) + γ * V(s')
reminder that V(s) = max( R(s, a) + γ * V(s’) )
which means: V(s) = max( Q(s,a) ) for all actions a that can be taken at
(remember these 3 equations, they are pretty fundamental to rl)
Why do we need Q(s, a)? Why can’t we just use V(s) everywhere? — The reason is obvious but it’s never addressed in papers, books and research. Reinforcement learning primarily comprisies of algorithms and code-style logic expressed as mathematical equations. In translating algorithmic logic to mathematical notation, we need intermediate variables (like Q and V) to represent different computational steps. Just as code uses multiple variables to break down complex logic, RL math uses Q(s, a) and V(s) to express the process of evaluating actions and selecting maximums. These functions are derivatives of each other similar to how variables in code build on each other to express program logic.
This is where I have a fundamental problem with the way reinforcement learning is taught and studied & is also partially why I wanted to write this blog in my learning style. Anyway, my rant is over let’s move on.
Now that you know about exploration and exploitation, let’s talk about an algorithm called Real-time dynamic programming (RTDP):
In RTDP, exploration is typically done using an epsilon-greedy algorithm where:
The “exploration rate ε” usually starts high (like 0.9) and decays over time, allowing the algorithm to:
Initially explore broadly (because of the high starting exploration rate) to avoid local optima
Eventually, exploit known good paths. As the exploration rate decays towards 0, we essentially take action greedily. We do this because over time our confidence in value estimates increases.
notice how initially the tree is wide but then as the exploration rate decays the tree exploits only the good nodes
Let’s quickly read some pseudocode & find issues with this algorithm. Look at the code below:
def rtdp(state_space, actions, initial_state):
V = {s: 0 for s in state_space}
epsilon = epsilon_start
gamma=0.99 # discount factor from bellman equation
epsilon=0.9 # exploration rate, *ε* as described above
max_episodes=1000 # number of iterations
epsilon_decay = 0.995 # epsilon decay amount after each episode
for episode in range(max_episodes):
state = initial_state
while not terminal(state):
# next action decided using epsilon-greedy method
best_action = argmax(actions, lambda a: V[get_next_state(state, a)])
next_action = random_action() if random() < epsilon else best_action
next_state = transition(state, next_action)
# Bellman update
max_future_value = max(V[get_next_state(state, a)] for a in actions)
V[state] = reward(state,next_action) + gamma * max_future_value
state = next_state
epsilon *= epsilon_decay
return V
The code explained:
We run the for loop max_episodes times. I realize now that I haven’t spoken about the term “episodes” yet in the context of reinforcement learning. In RL an iteration (in our case one pass of our tree) is called an episode. So next time you see the episode, think of one iteration of the algorithm.
For each episode, we start from the initial state and explore our tree until we hit a terminal state. Initially, our tree only comprises of the root node (the initial state) but after each iteration,n the tree grows as we compute values of new states.
We now compute the next_action
in an epsilon-greedy manner. If the random()
returns a value < 0.9 (our epsilon value) then we select a random action and if it’s above 0.9 we select the best action. This means that there is a 90% chance we explore a new path and a 10% chance we exploit the nodes in our tree already. After we have the next action we then compute the next_state
by performing that action.
We then update the value of the state using the bellman equation we learnt about (but, with a small modification, i.e., the next_action
is already chosen.)
After this the state
is updated to the next_state
. we move on to the next state and then repeat steps a-d.
Once we reach the terminal state of our explored tree, we can update the exploration rate using the decay. Notice that the epsilon after 1000 episodes would be 0.9 * 0.995^1000 = 0.006. This means that towards the final episodes we are barely exploring (0.06% of the time) and are exploiting the states with the highest ROI.
A chess game equivalent of the above statement is something like — after enough episodes the algorithm has largely crystallized its strategy, almost always selecting moves it's learned lead to advantageous positions rather than experimenting with novel approaches.
The problems faced by RTDP:
Solving RTDP problems using Monte Carlo Tree Search (MCTS):
The core idea behind MCTS is that instead of relying on a decaying exploration rate like RTDP does, we use the actual value scores of the nodes to make better decisions.
MCTS has 4 stages:
Selection Stage:
During the first stage of the MCTS algorithm, we select a path based on a formula called the Upper Confidence Bound (UCB) scores. For each child node, we compute the UCB scores as follows:
UCB = Q(s,a) + c * √(ln(N) / n)
During tree traversal, we compute the scores for each child node. Sounds fancy, but it’s intuitive and simple. The first term of the formula, Q(s,a), says "go where we've found success before" (exploitation). The second term, c * √(ln(N) / n), says "explore paths we haven't tried much" (exploration). What makes this formula clever is how it automatically balances two competing needs: exploiting moves we know are good (the Q(s,a) term) and exploring moves we haven't tried much (the square root term). The more we visit a node, the smaller the exploration bonus becomes (because the denominator n increases), naturally shifting our focus to the most promising moves.
When a child node hasn’t been visited yet, the ln(N)/n
term is infinity because n for that node is 0. In simpler words, when a node has a child node that is also a leaf node (unvisited) we select that node to exploit in future stages.
The exploration constant (c) is generally set to sqrt(2) why?
The choice of c = √2 in the UCB formula is rooted in a probability theory concept called the Hoeffding inequality (I think so). The Hoeffding inequality helps us understand how far a random variable might deviate from its expected value. When we're exploring in MCTS, we're essentially trying to estimate the true value of each move based on limited samples. We want to be confident that our estimate is reasonably close to the true value. The value √2 comes from setting our confidence level for these estimates. I’m going to skip a bunch of math here but, √2 gives us what's called a "95% confidence interval" - meaning we can be 95% confident that our estimate is close to the true value. Think of it like this: Using √2 means we're making an assumption that exploration decisions that will be good choices 95% of the time.
Let’s look at the selection stage’s pseudocode:
class Node:
def __init__(self):
self.visits = 0
self.value = 0
self.children = {}
def ucb_score(parent, child, c=1.414):
if child.visits == 0:
return float('inf')
return (child.value/child.visits) + c * sqrt(log(parent.visits)/child.visits)
def select(node):
while node.children:
child_scores = softmax([ucb_score(node, child) for child in node.children])
node = select_random_child_based_on_softmax_ucb_scores(child_scores)
return node
Imo, the code is so simple & I’m not going to go through with it. But here are some important things to remember from this stage of MCTS —
float('inf')
is selected when a child hasn’t been visited.Expansion Stage:
After we've selected a leaf node during the selection stage, we need to grow our tree. During expansion, we create a new child node based on available actions. Here's what makes this stage interesting - we don't usually expand all possible children at once. Instead, we add just one child node per expansion. This helps us manage our memory efficiently while still exploring the space of possibilities.
Keeping it simple like the selection stage:
def expand(node, possible_actions):
# find unexplored actions
unexplored_actions = [
action for action in possible_actions
if action not in node.children
]
# if we have unexplored actions, randomly pick one
if unexplored_actions:
chosen_action = random.choice(unexplored_actions)
node.children[chosen_action] = Node()
return node.children[chosen_action]
return node
I don’t think this code needs any explaination, we select an action that hasn’t been performed and returns a new node (the new state) after performing that action.
Playout Stage:
This is where MCTS gets really clever. After expansion, we perform what's called a "playout" or "simulation" from our newly created node. Imagine you're playing chess - instead of carefully analyzing every possible move, you just play random moves until the game ends. That's essentially what we're doing here!
During the playout, we follow a simple simulation of moves (often random) until we reach a terminal state or hit a depth limit. This gives us a rough estimate of how good our position is. While it might seem too simple to just play randomly, this approach has a fascinating property: over many iterations, these random playouts actually give us meaningful information about which moves are generally better.
Did AlphaGo simulate a game using random moves? Does that make sense? Okay, I lied. The simulation is not often random! AlphaGo didn't use random moves in the simulation stage. Instead of random playouts, it used its value network to directly evaluate board positions. This value network was trained on millions of positions and could estimate the winning probability without needing to play out random moves. This was more efficient and accurate than traditional MCTS's random simulations since Go's massive branching factor made random playouts less reliable in evaluating positions. The image below is kinda what they did:
def playout(node, max_depth=50):
current_node = node
depth = 0
total_reward = 0
while not is_terminal(current_node) and depth < max_depth:
action = random.choice(get_possible_actions(current_node))
reward = simulate_action(action)
total_reward += reward
depth += 1
return total_reward
Backpropagation Stage:
Advantage 1 of MCTS over RTDP with an example of chess:
In RTDP, imagine each position in the chess tree has a score written on it that stays there between moves. When a position initially looks bad (like sacrificing your queen), it gets a low score. The only way to change this score is if RTDP randomly decides to explore this "bad" position due to its epsilon-greedy exploration. Since RTDP usually picks the highest-scoring moves (that's the "greedy" part), it might rarely visit this position again. The low score might stick around for a long time, preventing RTDP from discovering that this position actually leads to a win. In MCTS, there are no permanent scores on the tree positions. Every time MCTS runs, it builds its understanding of the position from scratch through simulations. If sacrificing the queen leads to checkmate, MCTS can discover this anytime it explores that branch, unaffected by any previous evaluations. It doesn't need to get "lucky" with random exploration to overcome a bad initial score because it runs a full simulation every time!
Advantage 2 of MCTS over RTDP again in an example of chess: Consider a chess position where most moves look bad but one subtle move wins. RTDP might waste a lot of time randomly exploring obviously bad moves, while MCTS would quickly focus on the promising line once it discovers it, while still occasionally checking other options to make sure it hasn't missed anything.
Advantage 3 of MCTS over RTDP while solving a maze:
RTDP would have to actually try walking down each path to learn about it, while MCTS can "imagine" walking down paths through simulation to get a rough idea of which ones might lead to the exit. This ability to look ahead through simulation helps MCTS make better decisions with limited computation time.
Problems with MCTS using examples that I like:
I want to pause and reflect on everything we understand so far. Idk if you noticed but all of the algorithms, formulas and examples we spoke about so far only talk about learning by experience (and computing a value table) or search and simulations (like in MCTS). But, imagine you're Netflix building a movie recommendation system. You have millions of users watching different movies, but each user only interacts with a tiny fraction of your catalogue. The challenge: how do you learn what movies they might like without forcing them to watch everything?
Okay, let’s try and build a Netflix recommendation algorithm (can be extended to Youtube, Tiktok, Instagram etc). My assumptions here — we have a massive dataset of people’s preferences and profile.