\documentclass[a4paper]{article}

\setlength{\textwidth}{140mm}
\setlength{\oddsidemargin}{10mm}

\usepackage[utf8]{inputenc}
\usepackage{url}

\newcommand{\strong}[1]{{\normalfont\fontseries{b}\selectfont #1}}
\newcommand{\code}[1]{\mbox{\texttt{#1}}}
\newcommand{\pkg}[1]{\strong{#1}}
\newcommand{\proglang}[1]{\textsf{#1}}
\newcommand{\acronym}[1]{\textsc{#1}}
\newcommand{\file}[1]{\textsf{#1}}

\sloppy

%% \VignetteIndexEntry{Introduction to the RKEA Package}

\begin{document}
<<echo=FALSE>>=
options(width = 75)
### for sampling
set.seed <- 1234
@
\title{Introduction to the \pkg{RKEA} Package}
\author{Ingo Feinerer \and Kurt Hornik}
\maketitle

\begin{abstract}
  A short introduction to the \pkg{RKEA} package.
\end{abstract}

\section*{Introduction}

The \pkg{RKEA} package provides a \proglang{R} interface to
\proglang{Kea} (\url{http://www.nzdl.org/Kea/}), a tool for keyword
extraction in texts.  See \url{https://code.google.com/p/kea-algorithm/}
and \url{http://www.nzdl.org/Kea/Download/Kea-5.0-Readme.txt} for
further information on \proglang{Kea}.

Note that \proglang{Maui} (\url{http://maui-indexer.googlecode.com/}),
an algorithm for topic indexing, can be used for the same tasks as
\proglang{Kea}, but offers additional features, including indexing using
Wikipedia as a controlled vocabulary.  See
\url{https://www.airpair.com/nlp/keyword-extraction-tutorial} for a
tutorial on NLP keyword extraction with \proglang{Maui} and RAKE (Rapid
Automatic Keyword Extraction): note however that currently there is no
\proglang{R} interface to \proglang{Maui}, nor an \proglang{R}
implementation of RAKE.

\section*{Loading the Package}

Before actually working we need to load the package:
<<>>=
library("RKEA")
@

\section*{Creating a Keyword Extraction Model}

\proglang{Kea} needs a keyword extraction model for keyword
extraction. You can build your own models by manually indexing the
keywords in a small set of texts, and then call \code{createModel()}.

<<keep.source=TRUE>>=
library("tm")
data("crude")

keywords <- list(c("Diamond", "crude oil", "price"),
                 c("OPEC", "oil", "price"),
                 c("Texaco", "oil", "price", "decrease"),
                 c("Marathon Petroleum", "crude", "decrease"),
                 c("Houston Oil", "revenues", "decrease"),
                 c("Kuwait", "OPEC", "quota"))

tmpdir <- tempfile()
dir.create(tmpdir)
model <- file.path(tmpdir, "crudeModel")

createModel(crude[1:6], keywords, model)
@

Please note that we just wrap the functionality of the original
\proglang{Kea} program which always uses files for in- and output (and
that is the reason you also need to use a directory in \proglang{R} as
shown in the above example). We deliberately decided not to modify the
\proglang{Kea} Java archive shipped with this \proglang{R} package for
compatibility reasons. However this may induce some warnings in
\proglang{R} (e.g., because some internal \proglang{Kea} paths might not
be available) but nevertheless you should get the full functionality out
of it.

\section*{Keyword Extraction}

Once you have a \proglang{Kea} model you can extract keywords from texts.

<<keep.source=TRUE>>=
extractKeywords(crude, model)

unlink(tmpdir, recursive = TRUE)
@

\section*{Working with Controlled Vocabularies}

The data used for the keyword extraction tutorial with \proglang{Maui}
and RAKE can be downloaded from
\url{https://maui-indexer.googlecode.com/files/fao780.tar.gz}; the
AGROVOC Agricultural Thesaurus can be obtained from
\url{http://www.nzdl.org/Kea/Download/vocabularies/agrovoc.skos.zip}
(SKOS format) or
\url{http://www.nzdl.org/Kea/Download/vocabularies/agrovoc.text.zip}
(text format).

With the data unpacked to subdirectory \file{fao780} and
\file{agrovoc.skos.zip} unzipped in the working directory, one can use 
<<eval=false>>=
txts <- Sys.glob(file.path("fao780", "*.txt"))
keys <- sub("txt$", "key", txts)
txts <- lapply(txts, readLines)
keys <- lapply(keys, readLines)
build <- seq_len(100)
xtrct <- seq(101, 105)
model <- "fao780_model"
createModel(txts[build], keys[build], model, "agrovoc", "skos")
extractKeywords(txts[xtrct], model, "agrovoc", "skos")
@ 
to build a keyword model using the first 100 texts, and use the model to
extract the keywords from the next 5 texts.

\end{document}