%\VignetteIndexEntry{Working with Unknown Values}
%\VignettePackage{gdata}
%\VignetteKeywords{unknown, missing, manip}

\documentclass[a4paper]{report}
\usepackage{Rnews}
\usepackage[round]{natbib}
\bibliographystyle{abbrvnat}

\usepackage{Sweave}
\SweaveOpts{strip.white=all, keep.source=TRUE}
\SweaveOpts{concordance=TRUE}

\begin{document}

\begin{article}

\title{Working with Unknown Values}
\subtitle{The \pkg{gdata} package}
\author{by Gregor Gorjanc}

\maketitle

This vignette has been published as \cite{Gorjanc}.

\section{Introduction}

Unknown or missing values can be represented in various ways. For example
SAS uses \code{.}~(dot), while \R{} uses \code{NA}, which we can read as
Not Available. When we import data into \R{}, say via \code{read.table} or
its derivatives, conversion of blank fields to \code{NA} (according to
\code{read.table} help) is done for \code{logical}, \code{integer},
\code{numeric} and \code{complex} classes. Additionally, the
\code{na.strings} argument can be used to specify values that should also
be converted to \code{NA}. Inversely, there is an argument \code{na} in
\code{write.table} and its derivatives to define value that will replace
\code{NA} in exported data. There are also other ways to import/export data
into \R{} as described in the {\emph R Data Import/Export} manual
\citep{RImportExportManual}.  However, all approaches lack the possibility
to define unknown value(s) for some particular column. It is possible that
an unknown value in one column is a valid value in another column. For
example, I have seen many datasets where values such as 0, -9, 999 and
specific dates are used as column specific unknown values.

This note describes a set of functions in package \pkg{gdata}\footnote{
  package version 2.3.1} \citep{WarnesGdata}: \code{isUnknown},
\code{unknownToNA} and \code{NAToUnknown}, which can help with testing for
unknown values and conversions between unknown values and \code{NA}. All
three functions are generic (S3) and were tested (at the time of writing)
to work with: \code{integer}, \code{numeric}, \code{character},
\code{factor}, \code{Date}, \code{POSIXct}, \code{POSIXlt}, \code{list},
\code{data.frame} and \code{matrix} classes.

\section{Description with examples}

The following examples show simple usage of these functions on
\code{numeric} and \code{factor} classes, where value \code{0} (beside
\code{NA}) should be treated as an unknown value:

<<ex01>>=
library("gdata")
xNum <- c(0, 6, 0, 7, 8, 9, NA)
isUnknown(x=xNum)
@

The default unknown value in \code{isUnknown} is \code{NA}, which means
that output is the same as \code{is.na} --- at least for atomic
classes. However, we can pass the argument \code{unknown} to define which
values should be treated as unknown:

<<ex02>>=
isUnknown(x=xNum, unknown=0)
@

This skipped \code{NA}, but we can get the expected answer after
appropriately adding \code{NA} into the argument \code{unknown}:

<<ex03>>=
isUnknown(x=xNum, unknown=c(0, NA))
@

Now, we can change all unknown values to \code{NA} with \code{unknownToNA}.
There is clearly no need to add \code{NA} here. This step is very handy
after importing data from an external source, where many different unknown
values might be used. Argument \code{warning=TRUE} can be used, if there is
a need to be warned about ``original'' \code{NA}s:

<<ex04>>=
(xNum2 <- unknownToNA(x=xNum, unknown=0))
@

Prior to export from \R{}, we might want to change unknown values
(\code{NA} in \R{}) to some other value. Function \code{NAToUnknown} can be
used for this:

<<ex05>>=
NAToUnknown(x=xNum2, unknown=999)
@

Converting \code{NA} to a value that already exists in \code{x} issues an
error, but \code{force=TRUE} can be used to overcome this if needed. But be
warned that there is no way back from this step:

<<ex06>>=
NAToUnknown(x=xNum2, unknown=7, force=TRUE)
@

Examples below show all peculiarities with class \code{factor}.
\code{unknownToNA} removes \code{unknown} value from levels and inversely
\code{NAToUnknown} adds it with a warning. Additionally, \code{"NA"} is
properly distinguished from \code{NA}. It can also be seen that the
argument \code{unknown} in functions \code{isUnknown} and
\code{unknownToNA} need not match the class of \code{x} (otherwise factor
should be used) as the test is internally done with \code{\%in\%}, which
nicely resolves coercing issues.

<<ex07>>=
(xFac <- factor(c(0, "BA", "RA", "BA", NA, "NA")))
isUnknown(x=xFac)
isUnknown(x=xFac, unknown=0)
isUnknown(x=xFac, unknown=c(0, NA))
isUnknown(x=xFac, unknown=c(0, "NA"))
isUnknown(x=xFac, unknown=c(0, "NA", NA))

(xFac <- unknownToNA(x=xFac, unknown=0))
(xFac <- NAToUnknown(x=xFac, unknown=0))
@

These two examples with classes \code{numeric} and \code{factor} are fairly
simple and we could get the same results with one or two lines of \R{}
code. The real benefit of the set of functions presented here is in
\code{list} and \code{data.frame} methods, where \code{data.frame} methods
are merely wrappers for \code{list} methods.

We need additional flexibility for \code{list}/\code{data.frame} methods,
due to possibly having multiple unknown values that can be different among
\code{list} components or \code{data.frame} columns. For these two methods,
the argument \code{unknown} can be either a \code{vector} or \code{list},
both possibly named. Of course, greater flexibility (defining multiple
unknown values per component/column) can be achieved with a \code{list}.

When a \code{vector}/\code{list} object passed to the argument
\code{unknown} is not named, the first value/component of a
\code{vector}/\code{list} matches the first component/column of a
\code{list}/\code{data.frame}. This can be quite error prone, especially
with \code{vectors}. Therefore, I encourage the use of a \code{list}. In
case \code{vector}/\code{list} passed to argument \code{unknown} is named,
names are matched to names of \code{list} or \code{data.frame}. If lengths
of \code{unknown} and \code{list} or \code{data.frame} do not match,
recycling occurs.

The example below illustrates the application of the described functions to
a list which is composed of previously defined and modified numeric
(\code{xNum}) and factor (\code{xFac}) classes. First, function
\code{isUnknown} is used with \code{0} as an unknown value. Note that we
get \code{FALSE} for \code{NA}s as has been the case in the first example.

<<ex08>>=
(xList <- list(a=xNum, b=xFac))
isUnknown(x=xList, unknown=0)
@

We need to add \code{NA} as an unknown value. However, we do not get the
expected result this way!

<<ex09>>=
isUnknown(x=xList, unknown=c(0, NA))
@

This is due to matching of values in the argument \code{unknown} and
components in a \code{list}; i.e., \code{0} is used for component \code{a}
and \code{NA} for component \code{b}.  Therefore, it is less error prone
and more flexible to pass a \code{list} (preferably a named list) to the
argument \code{unknown}, as shown below.

<<ex10>>=
(xList1 <- unknownToNA(x=xList,
                       unknown=list(b=c(0, "NA"),
                                    a=0)))
@

Changing \code{NA}s to some other value (only one per component/column) can
be accomplished as follows:

<<ex11>>=
NAToUnknown(x=xList1,
            unknown=list(b="no", a=0))
@

A named component \code{.default} of a \code{list} passed to argument
\code{unknown} has a special meaning as it will match a component/column
with that name and any other not defined in \code{unknown}. As such it is
very useful if the number of components/columns with the same unknown
value(s) is large. Consider a wide \code{data.frame} named \code{df}. Now
\code{.default} can be used to define unknown value for several columns:

<<ex12, echo=FALSE>>=
df <- data.frame(col1=c(0, 1, 999, 2),
                 col2=c("a", "b", "c", "unknown"),
                 col3=c(0, 1, 2, 3),
                 col4=c(0, 1, 2, 2))
@

<<ex13>>=
tmp <- list(.default=0,
            col1=999,
            col2="unknown")
(df2 <- unknownToNA(x=df,
                    unknown=tmp))
@

If there is a need to work only on some components/columns you can of
course ``skip'' columns with standard \R{} mechanisms, i.e.,
by subsetting \code{list} or \code{data.frame} objects:

<<ex14>>=
df2 <- df
cols <- c("col1", "col2")
tmp <- list(col1=999,
            col2="unknown")
df2[, cols] <- unknownToNA(x=df[, cols],
                           unknown=tmp)
df2
@

\section{Summary}

Functions \code{isUnknown}, \code{unknownToNA} and \code{NAToUnknown}
provide a useful interface to work with various representations of
unknown/missing values. Their use is meant primarily for shaping the data
after importing to or before exporting from \R{}. I welcome any comments or
suggestions.

% \bibliography{refs}

\begin{thebibliography}{1}
\providecommand{\natexlab}[1]{#1}
\providecommand{\url}[1]{\texttt{#1}}
\expandafter\ifx\csname urlstyle\endcsname\relax
  \providecommand{\doi}[1]{doi: #1}\else
  \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi

\bibitem[Gorjanc(2007)]{Gorjanc}
G.~Gorjanc.
\newblock Working with unknown values: the gdata package.
\newblock \emph{R News}, 7\penalty0 (1):\penalty0 24--26, 2007.
\newblock URL \url{http://CRAN.R-project.org/doc/Rnews/Rnews_2007-1.pdf}.

\bibitem[{R Development Core Team}(2006)]{RImportExportManual}
{R Development Core Team}.
\newblock \emph{R Data Import/Export}, 2006.
\newblock URL \url{http://cran.r-project.org/manuals.html}.
\newblock ISBN 3-900051-10-0.

\bibitem[Warnes (2006)]{WarnesGdata}
G.~R. Warnes.
\newblock \emph{gdata: Various R programming tools for data manipulation},
  2006.
\newblock URL
  \url{http://cran.r-project.org/src/contrib/Descriptions/gdata.html}.
\newblock R package version 2.3.1. Includes R source code and/or documentation
  contributed by Ben Bolker, Gregor Gorjanc and Thomas Lumley.

\end{thebibliography}

\address{Gregor Gorjanc\\
  University of Ljubljana, Slovenia\\
\email{gregor.gorjanc@bfro.uni-lj.si}}

\end{article}

\end{document}