a reinforcement learning guide

Hi! I’m @naklecha & I love learning through examples and jumping right into things. It works well for me, it’s fun and imo it’s the best way to learn anything. So, that’s what I’m going to do. :) Fuck it! Let’s start by trying to solve chess.

note: please read this blog in dark theme (the colors are nicer that way)
To start with let’s call the initial “state” of the chessboard - s
In the above board state, white is winning because white has an extra rook! But let's look at it from a computer's pov. This is what a computer would know without calculations or lookahead:
- the position of all the pieces on the board, aka the board state
- which player is up next (white in this case)
- valid moves that can be made from the current board state, aka the action space
It doesn’t matter if the player is 1 move away from the win, 30 moves away or losing in 3 moves. Without additional calculations, the computer has no idea if the position is good and by how much. Let’s try and calculate how good the position is.
Now, for any given chess board let’s say we have a magical function that can give us a True or False boolean output if the position is winning or losing. Let’s call this magical function (in reinforcement learning it’s called the value function) - V(s) , this function takes in a state of the board and returns True or False if the position is winning or not.
Okay, but realistically can this function ever exist for the game of chess? The branching factor in chess (the number of possible moves at each turn) is around 30-35 on average. After just 5 moves, you're already looking at millions of positions. Now think of a 20-30 move game. We are looking at roughly 2^50 possible moves. That’s fucked up.

You are going to realize that picking chess as an example was a bad idea because its branching factor is too high. But, picking bad examples helps me learn better. Later, we can move on to more complex (actually useful) algorithms that solve some of those problems.
For now let’s ignore the part where I said this is impossible, and look at a simple chessboard state (a checkmate in 3 for example). In this example, it’s white to move let’s determine the value of the position, V(s) for white.
From the starting state (board position), white can make a bunch of possible moves (shown below) — 15 to exact.
It should be pretty intuitive that the “value” of the current position is somehow dependent on the value of the next position. The most naive version of this thought process would be something like:

V(s) = max( V(s’) ) for all actions a (s → after an action a → s’)

The above equation essentially says — “the value of the chess position is the value of the chess position after making the best move (because if we are playing optimally we would always play the best chess move, aka take the best action)” which obviously makes sense.
Math people love to write equations, but to me, it’s a lot easier to look at the above equation as just a tree search (depth-first search). There is a problem here though — even if we had infinite compute and time and traversing a chess game’s tree the equation — V(s) = max( V(s’) ) doesn’t care if the position is checkmate in 3 moves or 20 moves. Sometimes, you would prefer a position that is checkmate in 3 moves and should have a higher “value” than a position in 20.

This is where I’m going to introduce you to something (in the rl world) called the discount factor (γ) which will enable us to penalize “states” that are further away from the “optimal/winning state”. Now, the equation looks something like:

V(s) = max( γ * V(s’) ) for all actions a (s → after an action a → s’) and γ is the discount factor with a value range of [0,1].

Let’s say the “value” of a checkmate position is 100 when you win and -100 when you lose. The values at positions when considering the discount factor (let’s say the discount factor is 0.9).

V(checkmate position) = 100

V(checkmate in 1 position) = 0.9 * 100 = 90

V(checkmate in 2 position) = 0.9 * 90 = 81

V(checkmate in 3 position) = 0.9 * 81 = 72.9

and so on… the “discount factor” ensures that when a position is “more” winning or closer to a win, the state has a higher value.
Okay, I’ll be honest in my chess example the “discount factor” isn’t that useful. Like who cares if a position is checkmate in 20 or 2 moves? It’s a guaranteed win either way. I get it! Hear me out though… Not every situation/game is deterministic.

Think of the situation in which you are trying to solve a more complex game like Dota2 and you have to destroy the enemy’s base. If you haven’t played the game yet, here is a quick explanation — each team of 5 players has a base that they have to protect. The way you win a game of Dota is by destroying the enemy’s base, if the enemy destroys your base you lose.

If you play Dota2 or have played the game before, don’t get mad at me I’m offensively oversimplifying here.

Dota is a lot more complex than chess. Not only is its action space MASSIVE (almost infinite in practice) but anything can happen at any time because you don’t know what items (items are something you can buy during a Dota game duration and they help you win games) the enemy possesses and sometimes you don’t even know the position of your enemies.

Okay, so coming back to why the discount factor is needed — in a nondeterministic game like Dota2 anything is possible. When anything is possible and the game is undeterministic, you would want to end the game ASAP!!!! The discount factor with this equation — V(s) = max( γ * V(s’) ) is great because it will help prioritize the shortest sequence of actions that will likely lead you to a win.
So far in both the chess and Dota2 examples, we have some major problems:
1. Both games have so many possible states that our current algorithm V(s) = max( γ * V(s’) ) is impractical because the recursive depth of V(s) is essentially unthinkably massive for both games.
2. They are both 2 player games, it’s not clear what the value function calculation looks like.
3. In the Dota2 example, I talked about undeterministic states and actions. The current equation doesn’t account for randomness.
4. Dota2 has both continuous actions and discrete — you can click anywhere on the map and your hero will move there. This creates a constant 2D action space for movement, similar to how a robot might move in a 2D plane. The x and y coordinates where you can click are continuous values within the bounds of the map.
  
  However, many other actions are discrete: selecting which ability to use (typically 4 unique abilities per hero), choosing which item to use (6 item slots), targeting abilities (either on a unit or a location), attack commands, and shop purchases.
  
  ^^ This is tricky as fuck, our algorithm falls apart because we expect actions to be discrete and the resulting states to be discrete as well. This combination makes Dota2's action space what we call a "hybrid" or "mixed" action space.
  
  A non-hybrid but continuous action space example would be self-driving cars. When you're driving, you don't press the gas pedal in discrete steps - it's a continuous range from 0% to 100%. Similarly, the steering wheel can be turned to any angle.
Before I address the problems that I mentioned above I want to talk about “rewards” in reinforcement learning. Think about learning to play chess. If we only had V(s) = max(γ * V(s')) with no rewards, and we only knew that checkmate states were good or bad, we'd face a serious problem: most chess positions would appear equally valuable until we could see all the way to checkmate. This is because the value would only "flow backwards" from the final states.

This problem can be easily addressed by having intermediate rewards. Consider the following equation:

V(s) = max( R(s, a) + γ * V(s’) ) where R(s,a) is a reward function that outputs a reward when provided an action for the current state s.

Btw, this equation is known as the “Bellman Equation” and is one of the most foundational building block of reinforcement learning.

Modelling your world, states, actions and finding meaningful rewards for your problem in reinforcement learning is an art form. Chess is an extremely studied game and rewards for different states are generally standard and explored. Here is a pretty standard reward table for the game of chess:
```
chess_reward_table = {
    # Material values (traditional piece values)
    'piece_values': {
        'pawn': 1.0,
        'knight': 3.0,
        'bishop': 3.0,
        'rook': 5.0,
        'queen': 9.0
    },
    
    # Positional rewards
    'position_rewards': {
        'control_center': 0.2,     # Bonus for controlling center squares
        'connected_rooks': 0.3,    # Bonus for connected rooks
        'bishop_pair': 0.3,        # Small bonus for having both bishops
        'doubled_pawns': -0.2,     # Penalty for doubled pawns
        'isolated_pawns': -0.2,    # Penalty for isolated pawns
    },
    
    # Development rewards (early game)
    'development': {
        'piece_developed': 0.1,    # Piece moved from starting square
        'king_castled': 0.5,       # Successfully castled king
        'controlling_open_file': 0.2  # Rook on open file
    },
    
    # King safety
    'king_safety': {
        'pawn_shield': 0.3,        # Pawns protecting king
        'king_exposed': -0.4,      # Penalty for exposed king
    },
    
    # Game ending states
    'game_end': {
        'checkmate_win': 100.0,    # Winning the game
        'checkmate_loss': -100.0,  # Losing the game
        'stalemate': 0.0,          # Draw
    }
}
```
I want you to really understand that these rewards are extremely important to future algorithms and optimizations. They really help guide and speed up search exponentially. In the above example the reward table can understood like:

R(s, a) = +100 for checkmates, +9 for capturing the enemy's queen, and +9.3 (9+0.3) for capturing a queen and connecting 2 rooks together.
We have an updated equation with rewards — V(s) = max( R(s, a) + γ * V(s’) ) does this make our search more efficient? No! Not yet. Why not? Because to calculate the state’s value we still have to traverse the practically infinitely massive tree of states (like chess board states).

This problem of too many (sometimes infinite) states and actions is the heart of all reinforcement learning.

Reinforcement learning problems can’t be “solved” but can be “estimated”. From now on, we will learn about the most efficient and accurate ways researchers have found to estimate problems (never really solve).
Okay, so the tree can’t be fully traversed but what if we partially traversed a state tree by pruning or **picking only “sensible” actions. In reinforcement learning this is called “exploration vs exploitation”.

Exploration: When learning in an uncertain environment, an agent needs to try new, potentially suboptimal actions to discover better strategies it might have missed.

Exploitation: Once an agent has discovered reliable strategies, it can exploit this knowledge by repeatedly choosing actions that lead to rewards.

But why exploration? The reason why we need exploration (and try potentially suboptimal actions) is because sometimes taking suboptimal short-term actions leads to better long-term outcomes. Examples:
1. Navigation: A longer initial route might connect to a highway that's ultimately faster than the obvious shorter path
2. Dota2: Higher initial costs for expensive items often yield better long-term returns than can win you the game.
3. Chess: Sacrificing a high-value piece (seemingly suboptimal) can lead to checkmate sometimes.
4. Game Theory: In multi-agent scenarios, appearing unpredictable through occasional suboptimal moves can prevent opponents from exploiting fixed strategies
The exploration-exploitation dilemma is fundamental to learning - like a pathfinding algorithm for road networks balancing between routing through established highways (exploitation) and investigating side streets and alternate routes (exploration). Too much exploration wastes computational effort testing every possible road combination, while pure exploitation might miss that optimal route through local roads that bypass congested highways entirely.
Something I want to talk about before we move on to more complex algorithms — Q(s,a) function also known as the action-value function!

Q(s,a) represents state-action value - value at starting from state s, taking action a:

Q(s,a) = R(s,a) + γ * V(s')

reminder that V(s) = max( R(s, a) + γ * V(s’) )

which means: V(s) = max( Q(s,a) ) for all actions a that can be taken at

(remember these 3 equations, they are pretty fundamental to rl)

Why do we need Q(s, a)? Why can’t we just use V(s) everywhere? — The reason is obvious but it’s never addressed in papers, books and research. Reinforcement learning primarily comprisies of algorithms and code-style logic expressed as mathematical equations. In translating algorithmic logic to mathematical notation, we need intermediate variables (like Q and V) to represent different computational steps. Just as code uses multiple variables to break down complex logic, RL math uses Q(s, a) and V(s) to express the process of evaluating actions and selecting maximums. These functions are derivatives of each other similar to how variables in code build on each other to express program logic.

This is where I have a fundamental problem with the way reinforcement learning is taught and studied & is also partially why I wanted to write this blog in my learning style. Anyway, my rant is over let’s move on.
Now that you know about exploration and exploitation, let’s talk about an algorithm called Real-time dynamic programming (RTDP):

In RTDP, exploration is typically done using an epsilon-greedy algorithm where:
1. With probability ε: Choose a random action
2. With probability (1-ε): Choose the action that maximizes the expected value
The “exploration rate ε” usually starts high (like 0.9) and decays over time, allowing the algorithm to:
1. Initially explore broadly (because of the high starting exploration rate) to avoid local optima
2. Eventually, exploit known good paths. As the exploration rate decays towards 0, we essentially take action greedily. We do this because over time our confidence in value estimates increases.
  
  notice how initially the tree is wide but then as the exploration rate decays the tree exploits only the good nodes
Let’s quickly read some pseudocode & find issues with this algorithm. Look at the code below:
```
def rtdp(state_space, actions, initial_state):
   V = {s: 0 for s in state_space}
   epsilon = epsilon_start
   
   gamma=0.99 # discount factor from bellman equation
   epsilon=0.9 # exploration rate, *ε* as described above
   max_episodes=1000 # number of iterations
   epsilon_decay = 0.995 # epsilon decay amount after each episode

   for episode in range(max_episodes):
       state = initial_state
       while not terminal(state):
           # next action decided using epsilon-greedy method
           best_action = argmax(actions, lambda a: V[get_next_state(state, a)])
           next_action = random_action() if random() < epsilon else best_action
           next_state = transition(state, next_action)
           # Bellman update 
           max_future_value = max(V[get_next_state(state, a)] for a in actions)
           V[state] = reward(state,next_action) + gamma * max_future_value
           state = next_state
       epsilon *= epsilon_decay
   return V
```
The code explained:
1. We run the for loop max_episodes times. I realize now that I haven’t spoken about the term “episodes” yet in the context of reinforcement learning. In RL an iteration (in our case one pass of our tree) is called an episode. So next time you see the episode, think of one iteration of the algorithm.
2. For each episode, we start from the initial state and explore our tree until we hit a terminal state. Initially, our tree only comprises of the root node (the initial state) but after each iteration,n the tree grows as we compute values of new states.
3. We now compute the next_action in an epsilon-greedy manner. If the random() returns a value < 0.9 (our epsilon value) then we select a random action and if it’s above 0.9 we select the best action. This means that there is a 90% chance we explore a new path and a 10% chance we exploit the nodes in our tree already. After we have the next action we then compute the next_state by performing that action.
4. We then update the value of the state using the bellman equation we learnt about (but, with a small modification, i.e., the next_action is already chosen.)
5. After this the state is updated to the next_state. we move on to the next state and then repeat steps a-d.
6. Once we reach the terminal state of our explored tree, we can update the exploration rate using the decay. Notice that the epsilon after 1000 episodes would be 0.9 * 0.995^1000 = 0.006. This means that towards the final episodes we are barely exploring (0.06% of the time) and are exploiting the states with the highest ROI.
  
  A chess game equivalent of the above statement is something like — after enough episodes the algorithm has largely crystallized its strategy, almost always selecting moves it's learned lead to advantageous positions rather than experimenting with novel approaches.
The problems faced by RTDP:
1. Initial values - Performance heavily depends on initial value function estimates
2. Coverage - May ignore relevant states if they're not frequently encountered
3. Exploration decay - Hard to set an optimal decay rate; too fast means missing important states, too slow wastes computation
Solving RTDP problems using Monte Carlo Tree Search (MCTS):
I want to pause and reflect on everything we understand so far. Idk if you noticed but all of the algorithms, formulas and examples we spoke about so far only talk about learning by experience (and computing a value table) or search and simulations (like in MCTS). But, imagine you're Netflix building a movie recommendation system. You have millions of users watching different movies, but each user only interacts with a tiny fraction of your catalogue. The challenge: how do you learn what movies they might like without forcing them to watch everything?
Okay, let’s try and build a Netflix recommendation algorithm (can be extended to Youtube, Tiktok, Instagram etc). My assumptions here — we have a massive dataset of people’s preferences and profile.