---
author: "Joseph Larmarange"
title: "About missing values: regular NAs, tagged NAs and user NAs"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{About missing values: regular NAs, tagged NAs and user NAs}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

In base **R**, missing values are indicated using the specific value `NA`. **Regular NAs** can be used with any type of vector (double, integer, character, factor, Date, etc.).

Other statistical software packages have implemented ways to differentiate several types of missing values.

**Stata** and **SAS** have a system of **tagged NAs**, where NA values are tagged with a letter (from a to z). **SPSS** allows users to indicate that certain non-missing values should be treated as missing in some analyses (**user NAs**). The `haven` package implements **tagged NAs** and **user NAs** in order to keep this information when importing files from **Stata**, **SAS** or **SPSS**.

```{r}
library(labelled)
```


## Tagged NAs

### Creation and tests

**Tagged NAs** are proper `NA` values with a tag attached to them. They can be created with `tagged_na()`. The attached tag should be a single letter, lowercase (a-z) or uppercase (A-Z).


```{r}
x <- c(1:5, tagged_na("a"), tagged_na("z"), NA)
```


Most **R** functions treat tagged NAs as regular NAs. By default, they are printed like any other NA.

```{r}
x
is.na(x)
```

To show/print their tags, you need to use `na_tag()`, `print_tagged_na()` or `format_tagged_na()`.

```{r}
na_tag(x)
print_tagged_na(x)
format_tagged_na(x)
```

To test if a certain NA is a regular NA or a tagged NA, you should use `is_regular_na()` or `is_tagged_na()`.

```{r}
is.na(x)
is_tagged_na(x)
# You can test for specific tagged NAs with the second argument
is_tagged_na(x, "a")
is_regular_na(x)
```
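Because `na_tag()` returns the tag letter for each tagged NA (and `NA` for everything else), it can be combined with `table()` to count how often each tag occurs. The following is a minimal sketch; the vector `x2` and the object names `tags`, `tag_counts` and `n_regular` are made up for illustration.

```{r}
library(labelled)

# Illustrative vector with two "a" tags, one "z" tag and one regular NA
x2 <- c(1:5, tagged_na("a"), tagged_na("z"), tagged_na("a"), NA)

# Tag letter for tagged NAs, NA_character_ for all other values
tags <- na_tag(x2)

# Frequency of each tag (table() drops NA entries by default)
tag_counts <- table(tags)
tag_counts

# Number of regular (untagged) NAs
n_regular <- sum(is_regular_na(x2))
n_regular
```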


Tagged NAs can be defined **only** for double vectors. If you add a tagged NA to a character vector, it will be converted into a regular NA. If you add a tagged NA to an integer vector, the vector will be converted into a double vector.

```{r, error=TRUE}
y <- c("a", "b", tagged_na("z"))
y
is_tagged_na(y)
format_tagged_na(y)

z <- c(1L, 2L, tagged_na("a"))
typeof(z)
format_tagged_na(z)
```

### Unique values, duplicates and sorting with tagged NAs

By default, functions such as `base::unique()`, `base::duplicated()`, `base::order()` or `base::sort()` do not distinguish tagged NAs from regular NAs. You can use `unique_tagged_na()`, `duplicated_tagged_na()`, `order_tagged_na()` and `sort_tagged_na()` as alternatives that treat two tagged NAs with different tags as separate values.

```{r}
x <- c(1, 2, tagged_na("a"), 1, tagged_na("z"), 2, tagged_na("a"), NA)
print_tagged_na(x)

print_tagged_na(unique(x))
print_tagged_na(unique_tagged_na(x))

duplicated(x)
duplicated_tagged_na(x)

print_tagged_na(sort(x, na.last = TRUE))
print_tagged_na(sort_tagged_na(x))
```
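`order_tagged_na()` returns an index vector rather than sorted values, so it can be used for subsetting, in the same way as `base::order()`. The sketch below calls it with its default arguments only; `xs` and `o` are illustrative names.

```{r}
library(labelled)

xs <- c(1, 2, tagged_na("a"), 1, tagged_na("z"), 2, tagged_na("a"), NA)

# order_tagged_na() returns a permutation of the indices of xs
o <- order_tagged_na(xs)
o

# Reorder xs with that permutation and display the tags
print_tagged_na(xs[o])
```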

### Tagged NAs and value labels

It is possible to define value labels for tagged NAs.

```{r}
x <- c(1, 0, 1, tagged_na("r"), 0, tagged_na("d"), tagged_na("z"), NA)
val_labels(x) <- c(
  no = 0, yes = 1,
  "don't know" = tagged_na("d"),
  refusal = tagged_na("r")
)
x
```

When converting such a labelled vector into a factor, tagged NAs are, by default, converted into regular NAs (it is not possible to define tagged NAs with factors).

```{r}
to_factor(x)
```

However, the `explicit_tagged_na` option of `to_factor()` lets you transform tagged NAs into explicit factor levels.

```{r}
to_factor(x, explicit_tagged_na = TRUE)
to_factor(x, levels = "prefixed", explicit_tagged_na = TRUE)
```

### Conversion into user NAs

Tagged NAs can be converted into user NAs with `tagged_na_to_user_na()`.

```{r}
tagged_na_to_user_na(x)
tagged_na_to_user_na(x, user_na_start = 10)
```

Use `tagged_na_to_regular_na()` to convert tagged NAs into regular NAs.

```{r}
tagged_na_to_regular_na(x)
is_tagged_na(tagged_na_to_regular_na(x))
```


## User NAs


`haven` introduced a `haven_labelled_spss` class to deal with user-defined missing values in a way similar to **SPSS**. In this case, additional attributes indicate which values should be considered missing, but these values are not stored as internal `NA` values. Note that most R functions will not take this information into account. Therefore, you will have to convert these values into `NA` before analysis if required. User-defined missing values can coexist with internal `NA` values.

### Creation

User NAs can be created directly with `labelled_spss()`. You can also manipulate them with `na_values()` and `na_range()`.

```{r}
v <- labelled(c(1, 2, 3, 9, 1, 3, 2, NA), c(yes = 1, no = 3, "don't know" = 9))
v
na_values(v) <- 9
v

na_values(v) <- NULL
v

na_range(v) <- c(5, Inf)
na_range(v)
v
```
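For the direct route, the value labels and the user NAs can be declared in a single call to `labelled_spss()`. This is a minimal sketch reusing the same data as above; the name `v2` is only illustrative and is chosen so that `v` is left unchanged.

```{r}
library(labelled)

# Declare value labels and user NAs at creation time
v2 <- labelled_spss(
  c(1, 2, 3, 9, 1, 3, 2, NA),
  labels = c(yes = 1, no = 3, "don't know" = 9),
  na_values = 9
)
v2

# The user NA definition is stored as an attribute of the vector
na_values(v2)
class(v2)
```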

NB: you can also use `set_na_range()` and `set_na_values()` for a `dplyr`-like syntax.

```{r}
library(dplyr)
# setting value labels and user NAs
df <- tibble(s1 = c("M", "M", "F", "F"), s2 = c(1, 1, 2, 9)) %>%
  set_value_labels(s2 = c(yes = 1, no = 2)) %>%
  set_na_values(s2 = 9)
df$s2

# removing user NAs
df <- df %>% set_na_values(s2 = NULL)
df$s2
```

### Tests

Note that `is.na()` returns `TRUE` for user NAs as well as for regular NAs. Use `is_user_na()` to test if a specific value is a user NA and `is_regular_na()` to test if it is a regular NA.

```{r}
v
is.na(v)
is_user_na(v)
is_regular_na(v)
```
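Since `is.na()` alone cannot tell the two kinds of missing values apart, the two tests can be combined to classify each element. The following sketch assumes a small illustrative vector `w` with one user NA and one regular NA; `kind` and `kind_counts` are made-up names.

```{r}
library(labelled)

w <- labelled_spss(
  c(1, 2, 3, 9, 1, 3, 2, NA),
  labels = c(yes = 1, no = 3, "don't know" = 9),
  na_values = 9
)

# Classify each element as user NA, regular NA or valid value
kind <- ifelse(
  is_user_na(w), "user NA",
  ifelse(is_regular_na(w), "regular NA", "valid")
)
kind_counts <- table(kind)
kind_counts
```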

### Conversion

For most **R** functions, user NAs are **still** regular values.

```{r}
x <- c(1:5, 11:15)
na_range(x) <- c(10, Inf)
val_labels(x) <- c("dk" = 11, "refused" = 15)
x
mean(x)
```

You can convert user NAs into regular NAs with `user_na_to_na()` or `user_na_to_regular_na()` (both functions are identical).

```{r}
user_na_to_na(x)
mean(user_na_to_na(x), na.rm = TRUE)
```

Alternatively, if the vector is numeric, you can convert user NAs into tagged NAs with `user_na_to_tagged_na()`.

```{r}
user_na_to_tagged_na(x)
mean(user_na_to_tagged_na(x), na.rm = TRUE)
```

Finally, you can also remove the user NA definitions without converting these values to `NA`, using `remove_user_na()`.

```{r}
remove_user_na(x)
mean(remove_user_na(x))
```