Markov chains and Markov decision processes

Sanchit Tanwar
5 min read · Jan 24, 2019

This is the second part of the reinforcement learning tutorial series for beginners. If you have not read part 1, please follow this link to jump to it. Most of the content in this tutorial is adapted from different sources, as I am a student myself.

Markov chain and Markov process

The Markov property states that the future depends only on the present and not on the past. A Markov chain is a probabilistic model that depends solely on the current state and not on the previous states; that is, the future is conditionally independent of the past given the present.

Moving from one state to another is called a transition, and its probability is called a transition probability. Any process in which the next state depends only on the present state can serve as an example.

Example of Markov chain
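As a rough sketch of the idea, the snippet below simulates a tiny two-state weather Markov chain in Python. The states ("sunny", "rainy") and the transition probabilities are made up purely for illustration.

```python
import random

# Hypothetical two-state weather Markov chain.
# Each row gives the transition probabilities out of a state.
transition_probs = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current):
    """Sample the next state using only the current state (Markov property)."""
    states = list(transition_probs[current].keys())
    weights = list(transition_probs[current].values())
    return random.choices(states, weights=weights, k=1)[0]

# Simulate a short chain starting from "sunny".
state = "sunny"
for _ in range(5):
    state = next_state(state)
    print(state)
```

Notice that `next_state` only ever looks at the current state, which is exactly the Markov property described above.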

Markov decision process

An MDP is an extension of the Markov chain. It provides a mathematical framework for modeling decision-making situations. Almost all reinforcement learning problems can be modeled as MDPs. An MDP is represented by 5 important elements (a minimal code sketch follows the list).

  1. A set of states (S) the agent can be in.
  2. A set of actions (A) that the agent can perform to move from one state to another.
  3. A transition probability (Pᵃₛ₁ₛ₂), which is the probability of moving from state s₁ to state s₂ by performing action a.
  4. A reward probability (Rᵃₛ₁ₛ₂), which is the probability of the reward acquired by the agent for moving from state s₁ to state s₂ by performing action a.
  5. A discount factor (γ), which controls the importance of immediate versus future rewards. We will discuss this in detail in the upcoming sections.
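To make these five elements concrete, here is a minimal sketch of how a tiny MDP could be stored as plain Python data structures. The states, actions, probabilities, and rewards below are invented for illustration and do not correspond to any particular environment.

```python
# Hypothetical 2-state, 2-action MDP stored as plain dictionaries.
states = ["s1", "s2"]                      # S: set of states
actions = ["left", "right"]                # A: set of actions

# transition_prob[(s, a, s_next)] = probability of landing in s_next
# after performing action a in state s.
transition_prob = {
    ("s1", "right", "s2"): 0.9, ("s1", "right", "s1"): 0.1,
    ("s1", "left",  "s1"): 1.0,
    ("s2", "left",  "s1"): 0.8, ("s2", "left",  "s2"): 0.2,
    ("s2", "right", "s2"): 1.0,
}

# reward[(s, a, s_next)] = reward for that transition (made-up values).
reward = {
    ("s1", "right", "s2"): +1.0, ("s1", "right", "s1"): 0.0,
    ("s1", "left",  "s1"): 0.0,
    ("s2", "left",  "s1"): -1.0, ("s2", "left",  "s2"): 0.0,
    ("s2", "right", "s2"): 0.0,
}

gamma = 0.9  # discount factor
```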

Rewards

Based on the action our agent performs, it receives a reward. A reward is nothing but a numerical value, say, +1 for a good action and -1 for a bad action. An agent tries to maximize the total amount of reward (the cumulative reward) it receives from the environment rather than only the immediate reward. The total amount of reward the agent receives from the environment is called the return. We can formulate the return as

R(t) = r(t+1) + r(t+2) + r(t+3) + r(t+4) + … + r(T)

where r(t+1) is the reward received by the agent at time step t+1 (after taking an action at time step t), and r(T) is the reward received at the final time step T.
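As a quick sketch, the return is just the sum of the rewards collected after time step t. The reward values below are made up for illustration.

```python
# Rewards r(t+1), r(t+2), ..., r(T) collected after time step t (made-up values).
rewards = [1, 0, -1, 2, 5]

# Undiscounted return R(t): the plain sum of rewards up to the terminal step T.
R_t = sum(rewards)
print(R_t)  # 7
```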

Episodic and continuous tasks

Episodic tasks are tasks that have a terminal state (an end). For example, in a car racing game, the end of the game is a terminal state. Once the game is over, you start the next episode by restarting the game, which is a whole new beginning. In the formula above, r(T) is the reward at the terminal state, which marks the end of the episode.

In a continuous task, there is no terminal state. For example, a personal assistant does not have a terminal state.

Discount factor

Since we don’t have a final state for a continuous task, the return R(t) = r(t+1) + r(t+2) + r(t+3) + … has no last term and can sum up to ∞. That’s why we introduce the notion of a discount factor. We can redefine the return with a discount factor as follows:

R(t) = r(t+1) + γ·r(t+2) + γ²·r(t+3) + …

The discount factor decides how much importance we give to future rewards relative to immediate rewards. Its value lies between 0 and 1. A very low discount factor gives more weight to immediate rewards, while a high discount factor gives more weight to future rewards. The right value of the discount factor is application dependent, but it typically lies between 0.2 and 0.8.
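The discounted return can be computed the same way as before, weighting each reward by an increasing power of γ. The reward sequence and the choice γ = 0.9 below are illustrative.

```python
# Same made-up reward sequence as before.
rewards = [1, 0, -1, 2, 5]
gamma = 0.9  # illustrative discount factor

# Discounted return: R(t) = r(t+1) + gamma*r(t+2) + gamma^2*r(t+3) + ...
R_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(R_t)  # ≈ 4.93, less than the undiscounted sum of 7
```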

The policy function

As we studied in the previous article, a policy is a function that maps states to actions. It is denoted by π. So, basically, a policy function says what action to perform in each state. Our ultimate goal is to find the optimal policy, which specifies the correct action to perform in each state so as to maximize the reward.
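Since a policy simply maps states to actions, a deterministic policy can be sketched as a plain dictionary. The state and action names here are hypothetical.

```python
# A deterministic policy pi: state -> action (hypothetical names).
policy = {
    "s1": "right",
    "s2": "left",
}

def pi(state):
    """Return the action the policy chooses in the given state."""
    return policy[state]

print(pi("s1"))  # right
```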

State value function

The state value function specifies how good it is for an agent to be in a particular state while following a policy π. A value function is often denoted by V(s). It denotes the value of a state when following a policy, and it therefore varies depending on the policy we choose.

The state value function can be written as

Vπ(s) = Eπ[R(t) | sₜ = s]

that is, the expected return obtained by starting in state s and following policy π thereafter.

We can view value functions in a table. The greater the value, the better the state is:

State value table

Here, state 2 has the higher value, so it is the better state to be in.
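A value table can likewise be sketched as a dictionary from states to values. The numbers below are invented, chosen only so that state 2 comes out best, matching the table above.

```python
# Hypothetical value table V(s) under some policy pi.
V = {
    "state 1": 0.3,
    "state 2": 0.9,  # highest value: the best state to be in under this policy
}

best_state = max(V, key=V.get)
print(best_state)  # state 2
```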

State-action value function (Q function)

The state-action value function is also called the Q function. It specifies how good it is for an agent to perform a particular action in a state while following a policy π. The Q function is denoted by Q(s, a). It denotes the value of taking action a in state s while following policy π. We can define the Q function as follows:

Qπ(s, a) = Eπ[R(t) | sₜ = s, aₜ = a]

that is, the expected return obtained by starting in state s, taking action a, and following policy π thereafter.

The difference between the value function and the Q function is that the value function specifies the goodness of a state, while the Q function specifies the goodness of an action in a state. Similar to the state value table, we can build a Q table that shows the value of all possible state-action pairs. Whenever we say value function V(s) or Q function Q(s, a), we really mean the value table and the Q table.

Q table
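Similarly, a Q table can be sketched as a dictionary keyed by (state, action) pairs. Again, the state names, action names, and numbers are made up for illustration.

```python
# Hypothetical Q table: Q(s, a) for every state-action pair.
Q = {
    ("state 1", "action 1"): 0.03,
    ("state 1", "action 2"): 0.02,
    ("state 2", "action 1"): 0.5,
    ("state 2", "action 2"): 0.9,
}

def greedy_action(state):
    """Pick the action with the highest Q value in the given state."""
    candidates = {a: q for (s, a), q in Q.items() if s == state}
    return max(candidates, key=candidates.get)

print(greedy_action("state 2"))  # action 2
```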

This is the end of this article. In the next article, we will discuss the Bellman equation and dynamic programming and start with programming.

Reinforcement learning series

  1. Introduction to reinforcement learning.
  2. Markov chains and Markov decision processes. → You are here.
  3. Bellman equation and dynamic programming

References

  1. Hands-On Reinforcement Learning with Python by Sudarshan Ravichandran
