---
title: "pomdp: Introduction to Partially Observable Markov Decision Processes"
author: "Michael Hahsler and Hossein Kamalzadeh"
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{pomdp: Introduction to Partially Observable Markov Decision Processes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  rmarkdown::html_vignette
---


```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library("pomdp")
```

# Introduction

The R package **pomdp** [@Hahsler2024] provides the infrastructure to
define and analyze the solutions of Partially Observable Markov Decision
Process (POMDP) models. The package is a companion to package
**pomdpSolve**, which provides the executable for
'[pomdp-solve](http://www.pomdp.org/code/)' [@Cassandra2015], a
well-known fast C implementation of a variety of algorithms to solve
POMDPs. **pomdp** can also use package **sarsop** [@Bottiger2021] which
provides an implementation of the SARSOP (Successive Approximations of
the Reachable Space under Optimal Policies) algorithm.

The package provides the following algorithms:

-   Exact value iteration

    -   **Enumeration algorithm** [@Sondik1971].
    -   **Two pass algorithm** [@Sondik1971].
    -   **Witness algorithm** [@Littman1995].
    -   **Incremental pruning algorithm** [@Zhang1996],
        [@Cassandra1997].

-   Approximate value iteration

    -   **Finite grid algorithm** [@Cassandra2015], a variation of
        point-based value iteration to solve larger POMDPs (**PBVI**;
        see [@Pineau2003]) without dynamic belief set expansion.
    -   **SARSOP** [@Kurniawati2008], Successive Approximations of the
        Reachable Space under Optimal Policies, a point-based algorithm
        that approximates optimally reachable belief spaces for
        infinite-horizon problems (via package **sarsop**
        [@Bottiger2021]).

The package enables the user to define all components of a POMDP model
and solve the problem using several methods. It also contains functions
to analyze and visualize the POMDP solutions (e.g., the optimal policy)
and supports regular MDPs as well.

In this document, we will give a very brief introduction to the concept
of POMDPs, describe the features of the R package, and illustrate the
usage with a toy example.

# Partially Observable Markov Decision Processes

A partially observable Markov decision process (POMDP) combines a
regular Markov decision process, which models the system dynamics, with
a hidden Markov model that connects the unobservable system states
probabilistically to observations.

The agent can perform actions which affect the system (i.e., may cause
the system state to change) with the goal of maximizing the expected
future rewards, which depend on the sequence of system states and the
agent's actions. The objective is to find an optimal policy that guides
the agent's actions. In contrast to MDPs, the agent in a POMDP cannot
directly observe the complete system state; instead, it makes
observations that depend on the state. The agent uses these observations
to form a belief about the state the system is currently in. This belief
is called a belief state and is expressed as a probability distribution
over all possible states. The solution of the POMDP is a policy
prescribing which action to take in each belief state. Note that the
belief space is continuous, resulting in an infinite state set, which
makes POMDPs much harder to solve than MDPs.

The POMDP framework is general enough to model a variety of real-world
sequential decision-making problems. Applications include robot
navigation problems, machine maintenance, and planning under uncertainty
in general. The general framework of Markov decision processes with
incomplete information was described by Karl Johan Åström [@Astrom1965]
in the case of a discrete state space, and it was further studied in the
operations research community where the acronym POMDP was coined. It was
later adapted for problems in artificial intelligence and automated
planning by Leslie P. Kaelbling and Michael L. Littman [@Kaelbling1998].

A discrete-time POMDP can formally be described as a 7-tuple
$$\mathcal{P} = (S, A, T, R, \Omega , O, \gamma),$$ where

-   $S = \{s_1, s_2, \dots, s_n\}$ is a set of partially observable
    states,

-   $A = \{a_1, a_2, \dots, a_m\}$ is a set of actions,

-   $T$ is a set of conditional transition probabilities $T(s' \mid s,a)$
    for the state transition $s \rightarrow s'$ given the taken
    action $a$,

-   $R: S \times A \rightarrow \mathbb{R}$ is the reward function,

-   $\Omega = \{o_1, o_2, \dots, o_k\}$ is a set of observations,

-   $O$ is a set of observation probabilities $O(o \mid s',a)$
    conditioned on the reached state and the taken action, and

-   $\gamma \in [0, 1]$ is the discount factor.

At each time period, the environment is in some unknown state $s \in S$.
The agent chooses an action $a \in A$, which causes the environment to
transition to state $s' \in S$ with probability $T(s' \mid s,a)$. At the
same time, the agent receives an observation $o \in \Omega$ which
depends on the new state of the environment with probability
$O(o \mid s',a)$. Finally, the agent receives a reward $R(s,a)$. Then
the process repeats. The goal is for the agent to choose actions at
each time $t$ that maximize the expected sum of discounted future
rewards, i.e.,
$$\max E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right].$$

For a finite time horizon, only the expectation over the sum up to the
time horizon is used.
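
Because the agent cannot observe the state $s_t$ directly, it maintains
a belief state $b$, a probability distribution over $S$. After taking
action $a$ and observing $o$, the belief is updated by Bayes' rule as
$$b'(s') = \frac{O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)}{\sum_{s'' \in S} O(o \mid s'', a) \sum_{s \in S} T(s'' \mid s, a)\, b(s)},$$
and the optimal policy maps belief states to actions.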

# Package Functionality

Solving a POMDP problem with the **pomdp** package consists of two
steps:

1.  Define a POMDP problem using the function `POMDP()`, and\
2.  solve the problem using `solve_POMDP()`.

## Defining a POMDP Problem

The `POMDP()` function has the following arguments, each of which
corresponds to one of the elements of a POMDP.

```{r}
str(args(POMDP))
```

where

-   `states` defines the set of states $S$,

-   `actions` defines the set of actions $A$,

-   `observations` defines the set of observations $\Omega$,

-   `transition_prob` defines the conditional transition probabilities
    $T(s' \mid s,a)$,

-   `observation_prob` specifies the conditional observation
    probabilities $O(o \mid s',a)$,

-   `reward` specifies the reward function $R$,

-   `discount` is the discount factor $\gamma$ in range $[0,1]$,

-   `horizon` is the problem horizon as the number of periods to
    consider,

-   `terminal_values` is a vector of state utilities at the end of the
    horizon,

-   `start` is the initial probability distribution over the system
    states $S$,

-   `max` indicates whether the problem is a maximization or a
    minimization, and

-   `name` is used to give the POMDP problem a name.

While specifying the discount factor and the sets of states,
observations, and actions is straightforward, some arguments can be
specified in different ways. The initial belief state `start` can be
specified as

-   A vector of $n$ probabilities in $[0,1]$ that add up to 1, where
    $n$ is the number of states.

    ```{r, eval = FALSE}
    start = c(0.5 , 0.3 , 0.2)
    ```

-   The string `"uniform"` for a uniform distribution over all states.

    ```{r, eval = FALSE}
    start = "uniform"
    ```

-   A vector of integer indices specifying a subset of states as start
    states. The initial probability is uniform over these states. For
    example, only state 3, or states 1 and 3:

    ```{r, eval = FALSE}
    start = 3
    start = c(1, 3)
    ```

-   A vector of strings specifying a subset as equally likely start
    states.

    ```{r, eval = FALSE}
    start <- "state3"
    start <- c("state1" , "state3") 
    ```

-   A vector of strings with the first element `"-"`, specifying the
    states to exclude from the uniform initial probability distribution.

    ```{r, eval = FALSE}
    start = c("-" , "state2")
    ```

The transition probabilities (`transition_prob`), observation
probabilities (`observation_prob`) and reward function (`reward`) can be
specified in several ways:

-   As a `data.frame` created using `rbind()` and the helper functions
    `T_()`, `O_()` and `R_()`.
-   A named list of matrices (one element per action) representing the
    transition probabilities, observation probabilities or rewards.
-   A function with the same arguments as `T_()`, `O_()` or `R_()` that
    returns the corresponding probability or reward (see the sketch
    below).
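
As an illustration, the sketch below writes the same transition
distribution for a hypothetical model with two states `s1` and `s2` and
a single action `a1` in each of the three forms. The argument names used
in the function form mirror the `T_()` helper and are an assumption
here; the manual page for `POMDP()` gives the exact signatures.

```{r, eval = FALSE}
## 1. a data.frame built with rbind() and the T_() helper
transition_prob = rbind(
  T_("a1", "s1", "s1", 0.9),
  T_("a1", "s1", "s2", 0.1),
  T_("a1", "s2", "s1", 0.1),
  T_("a1", "s2", "s2", 0.9)
)

## 2. a named list with one matrix (or a keyword like "identity") per action
transition_prob = list(
  "a1" = matrix(c(0.9, 0.1,
                  0.1, 0.9), nrow = 2, byrow = TRUE)
)

## 3. a function returning the individual probabilities
transition_prob = function(action, start.state, end.state) {
  if (start.state == end.state) 0.9 else 0.1
}
```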

More details can be found in the manual page for `POMDP()`.

## Solving a POMDP

POMDP problems are solved with the function `solve_POMDP()` with the
following arguments.

```{r}
str(args(solve_POMDP))
```

The `model` argument is a POMDP problem created using the `POMDP()`
function, but it can also be the name of a POMDP file using the format
described in the [file specification section of
'pomdp-solve'](http://www.pomdp.org/code/pomdp-file-spec.html). The
`horizon` argument specifies the finite time horizon (i.e., the number
of time steps) considered in solving the problem. If the horizon is
unspecified (i.e., `NULL`), then the algorithm continues running
iterations until it converges to the infinite-horizon solution. The
`method` argument specifies which algorithm the solver should use.
Available methods include `"grid"`, `"enum"`, `"twopass"`, `"witness"`,
and `"incprune"`. Further solver parameters can be specified as a list
in `parameters`. The list of available parameters can be obtained using
the function `solve_POMDP_parameter()`. Details on the other arguments
can be found in the manual page for `solve_POMDP()`.
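
As an example, the sketch below solves a previously defined POMDP object
(called `model` here) with the grid method and passes one solver
parameter. The parameter name `fg_points` (the number of belief points
used by the grid method) is used only for illustration;
`solve_POMDP_parameter()` returns the parameters each method actually
accepts.

```{r, eval = FALSE}
## sketch: choose an approximate method and pass a solver parameter
sol <- solve_POMDP(model,
  method = "grid",
  parameters = list(fg_points = 100))

## sketch: solve the same problem exactly for a finite horizon of 10 periods
sol_finite <- solve_POMDP(model, method = "incprune", horizon = 10)
```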

# The Tiger Problem Example

We will demonstrate how to use the package with the Tiger Problem
[@Cassandra1994]. The problem is defined as:

> An agent is facing two closed doors, and a tiger is put with equal
> probability behind one of the two doors, represented by the states
> `tiger-left` and `tiger-right`, while treasure is put behind the other
> door. The possible actions are to `listen` for tiger noises or to open
> a door (actions `open-left` and `open-right`). Listening is neither
> free (the action has a reward of -1) nor entirely accurate. There is
> a 15% probability that the agent hears the tiger behind the left door
> while it is actually behind the right door and vice versa. If the
> agent opens the door with the tiger behind it, it will get hurt (a
> negative reward of -100), but if it opens the door with the treasure,
> it will receive a positive reward of 10. After a door is opened, the
> problem is reset (i.e., the tiger is randomly assigned to a door with
> a 50/50 chance) and the agent gets another try.

## Specifying the Tiger Problem

The problem can be specified using function `POMDP()` as follows.

```{r}
library("pomdp")

Tiger <- POMDP(
  name = "Tiger Problem",
  
  discount = 0.75,
  
  states = c("tiger-left" , "tiger-right"),
  actions = c("listen", "open-left", "open-right"),
  observations = c("tiger-left", "tiger-right"),
  
  start = "uniform",
  
  transition_prob = list(
    "listen" = "identity", 
    "open-left" = "uniform", 
    "open-right" = "uniform"),

  observation_prob = list(
    "listen" = matrix(c(0.85, 0.15, 0.15, 0.85), nrow = 2, byrow = TRUE), 
    "open-left" = "uniform",
    "open-right" = "uniform"),
    
  reward = rbind(
    R_("listen",     "*",           "*", "*", -1  ),
    R_("open-left",  "tiger-left",  "*", "*", -100),
    R_("open-left",  "tiger-right", "*", "*", 10  ),
    R_("open-right", "tiger-left",  "*", "*", 10  ),
    R_("open-right", "tiger-right", "*", "*", -100)
  )
)

Tiger
```

Note that for each component we use the representation that is easiest
to specify (i.e., a list for the transition and observation
probabilities, and a data frame created with the `R_()` helper for the
rewards).

## Solving the Tiger Problem

Now, we can solve the problem. We use the default method (finite grid),
which implements a form of point-based value iteration that can also
find approximate solutions for larger problems.

```{r}
sol <- solve_POMDP(Tiger)
sol
```

The output is an object of class POMDP, which contains the solution as
an additional list component. The solution can be accessed directly in
the list.

```{r}
sol$solution
```

The solution contains the following elements (see the access example
after the list):

-   **`total_expected_reward`:** The total expected reward of the
    optimal solution.

-   **`initial_belief_state`:** The index of the node in the policy
    graph that represents the initial belief state.

-   **`belief_states`:** A data frame of all the belief states (rows)
    used while solving the problem. The last column indicates which node
    in the policy graph is associated with each belief state, i.e.,
    which segment of the value function (specified in `alpha` below)
    provides the best value for that belief state.

-   **`pg`:** A data frame containing the optimal policy graph. Each
    row is a node in the graph, corresponding to a segment of the value
    function, and represents one or more belief states. The second
    column indicates the optimal action for the node. Columns three and
    onward give the transitions to the next node in the policy graph
    depending on the observation made.

-   **`alpha`:** A data frame with the coefficients of the optimal
    hyperplanes for the value function.

-   **`policy`:** A data frame that combines the information from `pg`
    and `alpha`. The first few columns specify the hyperplane of the
    value function (from `alpha`) and the last column indicates the
    optimal action (from `pg`).
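
For example, individual elements of the solution can be accessed
directly from the solved object `sol`:

```{r, eval = FALSE}
## value of the optimal solution and the policy graph
sol$solution$total_expected_reward
sol$solution$pg

## hyperplane coefficients of the value function
sol$solution$alpha
```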

## Visualization

In this section, we will visualize the policy graph provided in the
solution by the `solve_POMDP()` function.

```{r fig.width = 10, fig.asp = .7}
plot_policy_graph(sol)
```

The policy graph can be easily interpreted. Without prior information,
the agent starts at the node marked with "initial belief." In this case,
the agent believes that there is a 50-50 chance that the tiger is behind
either door. The optimal action is displayed inside the node and in this
case is to listen. The observations are the labels on the arcs. Let us
assume that the observation is "tiger-left"; then the agent follows the
appropriate arc and ends up in a node representing a belief (one or more
belief states) with a very high probability of the tiger being on the
left. However, the optimal action is still to listen. If the agent again
hears the tiger on the left, then it ends up in a node with a close to
100% belief that the tiger is to the left, and `open-right` is the
optimal action. The arcs from the nodes with the open actions back to
the initial node reset the problem.
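
The high belief after a single `listen` observation follows directly
from Bayes' rule and the 15% error rate of listening. The quick check
below hard-codes the observation probabilities for `listen` (the
transition for `listen` is the identity, so no prediction step is
needed) and does not use any solver output.

```{r}
# P(observe "tiger-left" | true state) for the listen action
o_left <- c("tiger-left" = 0.85, "tiger-right" = 0.15)

# Bayes update of the uniform initial belief after hearing "tiger-left"
b <- c("tiger-left" = 0.5, "tiger-right" = 0.5)
b1 <- o_left * b / sum(o_left * b)
b1

# hearing "tiger-left" a second time pushes the belief close to 100%
b2 <- o_left * b1 / sum(o_left * b1)
b2
```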

Since we only have two states, we can visualize the piecewise linear
convex value function as a simple plot.

```{r}
alpha <- sol$solution$alpha
alpha

plot_value_function(sol, ylim = c(0,20))
```

The lines represent the nodes in the policy graph and the optimal
actions are shown in the legend.
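
The value of any belief state $b$ is the maximum over these hyperplanes,
$V(b) = \max_k \alpha_k \cdot b$. A minimal sketch, assuming `alpha`
holds one row per segment with one numeric coefficient column per state:

```{r, eval = FALSE}
## value of the uniform initial belief: the best hyperplane's dot product
b <- c(0.5, 0.5)
max(as.matrix(alpha) %*% b)
```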

# References