
Fundamentals of Markov Decision Processes and algorithms

December 21, 2024

What are Markov Decision Processes?

Markov Decision Processes (MDPs) are a mathematical framework used to model decision-making in situations where outcomes are partly random and partly under the control of an agent. An MDP provides a formalism for modeling sequential decision problems, where an agent interacts with an environment in discrete time steps. The "Markov" in the name refers to the Markov property: the next state depends only on the current state and the chosen action, not on the full history of how the agent got there.

Markov Decision Processes (MDPs) are fundamental to many areas of artificial intelligence, especially in reinforcement learning and optimization problems, enabling systems to learn or make decisions under uncertainty. By modeling the environment’s dynamics and rewards, MDPs provide a structured approach to solving complex decision-making problems. 

Core Components of an MDP

The core components of an MDP are the tuple (S, A, P, R, γ), where (a minimal Python encoding of this tuple is sketched just after the list):

  • States (S): A finite or infinite set of states that represent all possible situations in which the agent can find itself. Each state s ∈ S is a snapshot of the system at a given time.
  • Actions (A): A set of all actions available to the agent. At any state s, the agent can choose an action a ∈ A.
  • Transitions (P): A transition model that defines the probability of moving from one state to another, given a specific action. This is often written P(s'|s, a), the probability of transitioning to state s' from state s after taking action a.
  • Rewards (R): A reward function that assigns a numerical value to each transition, representing the immediate reward received after moving from one state to another due to an action. This is often denoted R(s, a, s').
  • Discount Factor (γ): A discount factor (0 ≤ γ ≤ 1) that represents the importance of future rewards. A discount factor close to 1 values future rewards almost as much as immediate rewards, while a discount factor close to 0 prioritizes immediate rewards.
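
To make the tuple concrete, here is a minimal sketch of how these components can be written down as plain Python data, mirroring the representation used by the implementation later in this post. The state names, probabilities, and rewards below are made up purely for illustration.

# Hypothetical toy MDP used only to illustrate the (S, A, P, R, gamma) tuple.
states = ['sunny', 'rainy']                       # S
actions = ['walk', 'drive']                       # A
transitions = {                                   # P[state][action] = [(prob, next_state)]
    'sunny': {'walk': [(0.8, 'sunny'), (0.2, 'rainy')],
              'drive': [(1.0, 'sunny')]},
    'rainy': {'walk': [(0.6, 'rainy'), (0.4, 'sunny')],
              'drive': [(1.0, 'rainy')]},
}
rewards = {                                       # R[state][action][next_state] = reward
    'sunny': {'walk': {'sunny': 2, 'rainy': 0}, 'drive': {'sunny': 1}},
    'rainy': {'walk': {'rainy': -1, 'sunny': 1}, 'drive': {'rainy': 0}},
}
gamma = 0.9                                       # discount factor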

Objective of MDPs

The primary objectives of Markov Decision Processes (MDPs) are:

  1. Model Decision-Making: MDPs provide a structured framework to model decision-making problems where outcomes are influenced by both deterministic actions and probabilistic events.
  2. Optimize Long-Term Rewards: The main goal of solving an MDP is to find an optimal policy that maximizes the expected cumulative reward over time, considering both immediate and future rewards (written out formally after this list).
  3. Handle Uncertainty: MDPs are designed to handle uncertainty in both state transitions and rewards, making them suitable for real-world applications where outcomes are not always predictable.
  4. Sequential Decision-Making: MDPs are used to model problems that involve a sequence of decisions over time, where each decision impacts future states and rewards.
  5. Policy Derivation: MDPs aim to derive a policy, which is a mapping from states to actions, that guides the decision-maker on the best action to take in each state to achieve the optimal outcome.
  6. Evaluate and Improve Policies: MDPs provide methods to evaluate the effectiveness of a given policy and iteratively improve it to reach the optimal policy.
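
The long-term-reward objective in point 2 can be stated compactly. Using the notation introduced above, the discounted return from time step t and the optimal policy are:

G_t = \sum_{k=0}^{\infty} \gamma^k R(s_{t+k}, a_{t+k}, s_{t+k+1}), \qquad \pi^* = \arg\max_{\pi} E_{\pi}[ G_t ]

where the expectation is taken over the randomness in the transitions P and the actions selected by the policy π.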

Applications of MDPs

Markov Decision Processes (MDPs) are widely used in various fields to model decision-making problems where outcomes are partly random and partly under the control of a decision-maker. Here are some common applications of MDPs:

  1. Robotics: MDPs are used to model the decision-making process of robots, helping them navigate and perform tasks in uncertain environments.
  2. Finance: In financial modeling, MDPs can be used to optimize investment strategies, portfolio management, and pricing of financial derivatives.
  3. Healthcare: MDPs assist in medical decision-making, such as determining optimal treatment plans for patients over time, considering the probabilistic nature of disease progression and treatment outcomes.
  4. Operations Research: MDPs are applied to optimize supply chain management, inventory control, and logistics, where decisions need to be made under uncertainty.
  5. Reinforcement Learning: MDPs form the theoretical foundation for reinforcement learning algorithms, which are used in various AI applications, including game playing, recommendation systems, and autonomous driving.
  6. Telecommunications: MDPs help in optimizing network routing, resource allocation, and managing communication protocols in uncertain and dynamic network environments.
  7. Manufacturing: In manufacturing, MDPs are used to optimize production processes, maintenance scheduling, and quality control under uncertainty.

By modeling these problems as MDPs, decision-makers can systematically analyze and derive optimal policies that maximize long-term rewards while accounting for the inherent uncertainties in the system.

Algorithms to Solve MDPs

The goal in an MDP is to find an optimal policy, a mapping from states to actions, that maximizes the expected cumulative reward over time. Two common algorithms used to solve MDPs are Value Iteration and Policy Iteration. These algorithms iteratively update the value of each state and improve the policy until convergence, ensuring that the agent makes the best possible decisions to maximize long-term rewards.

Here is a brief overview of how these algorithms work:

Value Iteration

Value Iteration is an algorithm that iteratively updates the value of each state based on the expected rewards and transitions until the values converge. Once the value function has converged, the policy is derived by taking, in each state, the action that maximizes the expected return.
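
Concretely, the update applied to every state in each sweep is the standard Bellman optimality backup:

V_{k+1}(s) = \max_{a \in A} \sum_{s'} P(s'|s, a) [ R(s, a, s') + \gamma V_k(s') ]

and once the values have converged, the greedy policy \pi(s) = \arg\max_{a \in A} \sum_{s'} P(s'|s, a) [ R(s, a, s') + \gamma V(s') ] is read off the value table. This is exactly the computation performed by the ValueIteration class below.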

Policy Iteration

Policy Iteration starts with a random policy and iteratively evaluates and improves it. The policy evaluation step updates the value table based on the current policy, and the policy improvement step updates the policy based on the new value table. This process repeats until the policy stabilizes.
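
In equations, the two alternating steps for a current policy π are:

Policy evaluation (repeated until the values stop changing):
V^{\pi}(s) = \sum_{s'} P(s'|s, \pi(s)) [ R(s, \pi(s), s') + \gamma V^{\pi}(s') ]

Policy improvement:
\pi'(s) = \arg\max_{a \in A} \sum_{s'} P(s'|s, a) [ R(s, a, s') + \gamma V^{\pi}(s') ]

The algorithm terminates when the improvement step no longer changes the policy.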

Python Implementation of MDP Algorithms

The Python implementation below follows a modular approach to simulate and solve a Markov Decision Process using the algorithms described in the previous section. It covers the components, the overall structure, and an organized implementation. The key components are:

MDP Class

The MDP class represents a Markov Decision Process, initializing with states, actions, transitions, rewards, and a discount factor (gamma). It includes methods to retrieve possible next states and their probabilities for a given state-action pair (get_transition), and to get the reward for a specific state-action-next_state triplet (get_reward). This class serves as the foundational structure for defining the MDP components and their interactions.

class MDP:
   def __init__(self, states, actions, transitions, rewards, gamma=0.9):
       """
       Initialize MDP components.
       :param states: List of states
       :param actions: List of actions
       :param transitions: Nested dict, P[state][action] = [(prob, next_state)]
       :param rewards: Dict, R[state][action][next_state] = reward
       :param gamma: Discount factor (0 <= gamma <= 1)
       """
       self.states = states
       self.actions = actions
       self.transitions = transitions
       self.rewards = rewards
       self.gamma = gamma


   def get_transition(self, state, action):
       """Return possible next states and probabilities for a state-action pair."""
       try:
           return self.transitions[state][action]
       except KeyError:
           raise ValueError(f"Invalid state-action pair: ({state}, {action})")


   def get_reward(self, state, action, next_state):
       """Return reward for a state-action-next_state triplet."""
       try:
           return self.rewards[state][action][next_state]
       except KeyError:
           raise ValueError(f"Invalid state-action-next_state triplet: ({state}, {action}, {next_state})")
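
As a quick illustration (not part of the original code), the class can be instantiated and queried as shown below. The sketch assumes the MDP class above is saved as markov_decision_process.py, which is the module name the solver files import from; the two-state model itself is hypothetical.

from markov_decision_process import MDP

# Hypothetical two-state MDP used only to demonstrate the API.
tiny = MDP(
    states=['A', 'B'],
    actions=['stay', 'go'],
    transitions={'A': {'stay': [(1.0, 'A')], 'go': [(1.0, 'B')]},
                 'B': {'stay': [(1.0, 'B')], 'go': [(1.0, 'A')]}},
    rewards={'A': {'stay': {'A': 0}, 'go': {'B': 1}},
             'B': {'stay': {'B': 0}, 'go': {'A': 1}}},
    gamma=0.9,
)
print(tiny.get_transition('A', 'go'))   # [(1.0, 'B')]
print(tiny.get_reward('A', 'go', 'B'))  # 1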

ValueIteration Class

The ValueIteration class implements the value iteration algorithm to solve the MDP. It initializes with an MDP instance and a convergence threshold. The class maintains a value table and a placeholder policy. The iterate method performs the value iteration process, updating the value of each state based on expected rewards and transitions until the values converge. It then derives the optimal policy from the value table. The class also provides methods to display the value table (display_value_table) and the optimal policy (display_policy).

import numpy as np
from markov_decision_process import MDP

class ValueIteration:
   def __init__(self, mdp, threshold=1e-4):
       """
       Initialize value iteration solver.
       :param mdp: An instance of the MDP class
       :param threshold: Convergence threshold for value updates
       """
       self.mdp = mdp
       self.threshold = threshold
       self.value_table = {state: 0 for state in mdp.states}  # Initialize all values to 0
       self.policy = {state: None for state in mdp.states}  # Placeholder policy

   def iterate(self):
       """Perform value iteration until convergence."""
       while True:
           delta = 0
           for state in self.mdp.states:
               # Calculate the value for each action
               action_values = []
               for action in self.mdp.actions:
                   try:
                       value = sum(prob * (self.mdp.get_reward(state, action, next_state) +
                                           self.mdp.gamma * self.value_table[next_state])
                                   for prob, next_state in self.mdp.get_transition(state, action))
                       action_values.append(value)
                   except ValueError as e:
                       print(e)
                       return

               # Update value table
               best_value = max(action_values)
               delta = max(delta, abs(best_value - self.value_table[state]))
               self.value_table[state] = best_value

               # Update policy
               self.policy[state] = self.mdp.actions[np.argmax(action_values)]

           if delta < self.threshold:
               break

   def get_policy(self):
       """Return the optimal policy."""
       return self.policy

   def display_value_table(self):
       """Display the value table."""
       print("State-Value Table:")
       for state in self.mdp.states:
           action_values = {action: sum(prob * (self.mdp.get_reward(state, action, next_state) +
                                                self.mdp.gamma * self.value_table[next_state])
                                        for prob, next_state in self.mdp.get_transition(state, action))
                            for action in self.mdp.actions}
           print(f"State {state}: {action_values}")

   def display_policy(self):
       """Display the policy."""
       print("Policy:")
       for state, action in self.policy.items():
           print(f"State {state}: {action}")

def value_iteration(mdp: MDP):
   try:
       vi = ValueIteration(mdp)
       vi.iterate()
       print("Optimal Policy (Value Iteration):")
       vi.display_value_table()
       vi.display_policy()
   except Exception as e:
       print(f"Error during Value Iteration: {e}")

PolicyIteration Class

The PolicyIteration class implements the policy iteration algorithm to solve the MDP. It initializes with an MDP instance and a convergence threshold. The class maintains a value table and a randomly initialized policy. The evaluate_policy method updates the value table based on the current policy, while the improve_policy method updates the policy based on the new value table. The iterate method alternates between policy evaluation and improvement until the policy stabilizes. The class also provides methods to display the value table (display_value_table) and the optimal policy (display_policy).

import numpy as np
from markov_decision_process import MDP

class PolicyIteration:
   def __init__(self, mdp, threshold=1e-4):
       """
       Initialize policy iteration solver.
       :param mdp: An instance of the MDP class
       :param threshold: Convergence threshold for policy evaluation
       """
       self.mdp = mdp
       self.threshold = threshold
       self.value_table = {state: 0 for state in mdp.states}  # Initialize all values to 0
       self.policy = {state: np.random.choice(mdp.actions) for state in mdp.states}  # Random initial policy

   def evaluate_policy(self):
       """Evaluate the current policy."""
       while True:
           delta = 0
           for state in self.mdp.states:
               action = self.policy[state]
               try:
                   value = sum(prob * (self.mdp.get_reward(state, action, next_state) +
                                       self.mdp.gamma * self.value_table[next_state])
                               for prob, next_state in self.mdp.get_transition(state, action))
               except ValueError as e:
                   print(e)
                   return
               delta = max(delta, abs(value - self.value_table[state]))
               self.value_table[state] = value

           if delta < self.threshold:
               break

   def improve_policy(self):
       """Improve the current policy based on value function."""
       policy_stable = True
       for state in self.mdp.states:
           old_action = self.policy[state]
           action_values = []
           for action in self.mdp.actions:
               try:
                   value = sum(prob * (self.mdp.get_reward(state, action, next_state) +
                                       self.mdp.gamma * self.value_table[next_state])
                               for prob, next_state in self.mdp.get_transition(state, action))
                   action_values.append(value)
               except ValueError as e:
                   print(e)
                   return

           # Choose the best action
           self.policy[state] = self.mdp.actions[np.argmax(action_values)]
           if old_action != self.policy[state]:
               policy_stable = False

       return policy_stable

   def iterate(self):
       """Perform policy iteration until convergence."""
       while True:
           self.evaluate_policy()
           if self.improve_policy():
               break

   def get_policy(self):
       """Return the optimal policy."""
       return self.policy

   def display_value_table(self):
       """Display the value table."""
       print("State-Value Table:")
       for state in self.mdp.states:
           action_values = {action: sum(prob * (self.mdp.get_reward(state, action, next_state) +
                                                self.mdp.gamma * self.value_table[next_state])
                                        for prob, next_state in self.mdp.get_transition(state, action))
                            for action in self.mdp.actions}
           print(f"State {state}: {action_values}")

   def display_policy(self):
       """Display the policy."""
       print("Policy:")
       for state, action in self.policy.items():
           print(f"State {state}: {action}")

def policy_iteration(mdp: MDP):
   try:
       pi = PolicyIteration(mdp)
       pi.iterate()
       print("Optimal Policy (Policy Iteration):")
       pi.display_value_table()
       pi.display_policy()
   except Exception as e:
       print(f"Error during Policy Iteration: {e}")

The execute function in the provided code demonstrates the creation and solution of a Markov Decision Process (MDP) using both Value Iteration and Policy Iteration algorithms. It defines three states (‘A’, ‘B’, ‘C’), two actions (‘left’, ‘right’), and specifies the transitions and rewards for each state-action pair. An MDP instance is created with these parameters and a discount factor (gamma) of 0.9. The code then initializes and runs the Value Iteration and Policy Iteration algorithms on this MDP, displaying the optimal policy and value table for each method.

from markov_decision_process import MDP
from mdp_policy_iteration import policy_iteration
from mdp_value_iteration import value_iteration

def execute():
   states = ['A', 'B', 'C']
   actions = ['left', 'right']
   transitions = {
       'A': {
           'left': [(1.0, 'B')],
           'right': [(1.0, 'C')],
       },
       'B': {
           'left': [(1.0, 'A')],
           'right': [(1.0, 'C')],
       },
       'C': {
           'left': [(1.0, 'A')],
           'right': [(1.0, 'B')],
       }
   }
   rewards = {
       'A': {
           'left': {'B': 1},
           'right': {'C': 0},
       },
       'B': {
           'left': {'A': 0},
           'right': {'C': 2},
       },
       'C': {
           'left': {'A': 1},
           'right': {'B': 2},
       }
   }
   mdp = MDP(states, actions, transitions, rewards, gamma=0.9)
   # Solve using Value Iteration
   value_iteration(mdp)
   # Solve using Policy Iteration
   policy_iteration(mdp)

The main block in the provided code serves as the application entrypoint.

# Application entrypoint
if __name__ == "__main__":
   execute()

This implementation is organized for clarity rather than raw speed; with modest modifications (for example, vectorizing the value updates with NumPy arrays instead of Python dictionaries), it can scale to larger MDP problems.

Use Case Example

The provided code defines a Markov Decision Process (MDP) and implements two algorithms to solve it: Value Iteration and Policy Iteration. The MDP class initializes the states, actions, transitions, rewards, and discount factor (gamma). It also includes methods to retrieve transitions and rewards for given state-action pairs.

The ValueIteration class iteratively updates the value of each state based on the expected rewards and transitions until the values converge within a specified threshold. It then derives the optimal policy from the value table.

The PolicyIteration class starts with a random policy and iteratively evaluates and improves it. The policy evaluation step updates the value table based on the current policy, and the policy improvement step updates the policy based on the new value table. This process repeats until the policy stabilizes.

The example usage demonstrates the creation of an MDP with three states and two actions, along with their respective transitions and rewards. Both Value Iteration and Policy Iteration are applied to this MDP to find the optimal policy. The results, including the optimal policy and the value table, are displayed for both algorithms.

  • Problem Setup:
    • States: S = {A, B, C}
    • Actions: A = {left, right}
  • Transitions: Deterministic transitions for simplicity:
    • P(B|A, left) = 1, P(C|A, right) = 1
    • P(A|B, left) = 1, P(C|B, right) = 1
    • P(A|C, left) = 1, P(B|C, right) = 1
  • Rewards: Immediate rewards depend on the state-action-next-state triplet:
    • R(A, left, B) = 1, R(A, right, C) = 0
    • R(B, left, A) = 0, R(B, right, C) = 2
    • R(C, left, A) = 1, R(C, right, B) = 2

Results
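
The original post displays the script's output here. As a sanity check (not part of that output), the optimal solution of this small MDP can be worked out by hand from the Bellman optimality equations. Choosing right in both B and C gives:

V(B) = 2 + 0.9 V(C), \quad V(C) = 2 + 0.9 V(B) \;\Rightarrow\; V(B) = V(C) = 2 / (1 - 0.9) = 20
V(A) = \max( 1 + 0.9 V(B), \; 0 + 0.9 V(C) ) = \max(19, 18) = 19

So both Value Iteration and Policy Iteration should converge (to within the 1e-4 threshold) to the policy A → left, B → right, C → right, with values V(A) ≈ 19 and V(B) = V(C) ≈ 20.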

Conclusions

Markov Decision Processes (MDPs) provide a powerful framework for modeling decision-making problems where outcomes are partly random and partly under the control of a decision-maker. By defining states, actions, transitions, rewards, and a discount factor, MDPs allow for the systematic analysis and optimization of policies to maximize long-term rewards.

Key Takeaways:

  1. Structured Framework: MDPs offer a structured approach to model complex decision-making problems in various fields such as robotics, finance, healthcare, and more.
  2. Optimal Policy: The goal of solving an MDP is to find an optimal policy that maximizes the expected cumulative reward over time.
  3. Algorithms: Common algorithms for solving MDPs include Value Iteration, Policy Iteration, and Q-Learning. These algorithms iteratively improve the value function and policy until convergence.
  4. Handling Uncertainty: MDPs are well-suited for environments with uncertainty, as they account for probabilistic transitions and rewards.

By leveraging MDPs and their associated algorithms, decision-makers can develop robust strategies that account for uncertainty and optimize long-term outcomes.

