---
title: "Get Started"
author: "R Validation Hub"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Quick Start}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
library(riskmetric)
library(dplyr)
library(tibble)

options(repos = "https://cran.rstudio.com")

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/"
)
```

# Introduction

`riskmetric` provides a workflow for evaluating the quality of a set of R
packages in five major steps. The workflow can help users choose high quality R
packages, improve package reliability and demonstrate the validity of R packages
in a regulated industry. In concept, these steps are:

### 1. Finding a source for package information

First we need to identify a source of package metadata. There are a number of
places one may want to look for this information, be it a source code directory,
local package library or remote package repository. Once we find a source of
package data, we begin to collect it in a _package reference_ (`pkg_ref`)
object.

> Learn more: `?pkg_ref`
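
For example, the same package can be referenced from several kinds of sources.
A minimal sketch (the `source` argument and its accepted values are assumptions,
mirroring the reference classes described later in this vignette):

```{r, eval = FALSE}
library(riskmetric)

# reference a locally installed copy of the package
pkg_ref("riskmetric")

# reference the package's CRAN record instead of a local install
pkg_ref("riskmetric", source = "pkg_cran_remote")
```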

### 2. Caching package metadata

If more information is needed to perform a given risk assessment, we will use
what metadata we already have to continue to search for more fine-grained
information about the package. For example, if we have a location of a locally
installed package, we can use that path to search for that package's
`DESCRIPTION` file, and from there read in the `DESCRIPTION` contents. To avoid
repeatedly processing the same metadata, these intermediate results are cached
within the `pkg_ref` object so that they can be used in the derivation of
multiple risk metrics.

> Learn more: `?pkg_ref_cache`
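
As a sketch of the idea, requesting the same piece of metadata twice only does
the underlying work once (the `description` field name is an assumption; the
fields that are available depend on the reference type):

```{r, eval = FALSE}
ref <- pkg_ref("riskmetric")

# the first request locates and reads the DESCRIPTION file, then caches the result
ref$description

# subsequent requests reuse the cached value instead of re-reading the file
ref$description
```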

### 3. Assessing the metadata against a risk criterion

For each measure of risk, we first try to boil down that measure into some
fundamental nugget of the package metadata that is comparable across packages
and sources of information. The cross-comparable result of assessing a package
in this way is what we refer to as a _package metric_ (`pkg_metric`). 

For example, with that `DESCRIPTION` file content, we might look at whether a
maintainer is identified in the authors list. To ensure we can easily compare
this information between packages that use the `Author` field and the
`Authors@R` field, we would boil this information down to a single logical
value indicating whether or not a maintainer was identified.

> Learn more: `?pkg_assess`
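
The maintainer example above maps onto a single `assess_*` function. A sketch,
assuming an assessment such as `assess_has_maintainer()` is available in your
installed version of `riskmetric`:

```{r, eval = FALSE}
# returns a pkg_metric recording whether a maintainer could be identified
assess_has_maintainer(pkg_ref("riskmetric"))
```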

### 4. Scoring our metrics

After we have these atomic representations of metrics, we want to score them so
that they can be meaningfully compared to one another. In practice this just
means converting from the datatype of the metric to a numeric value on a fixed
scale from 0 (worst) to 1 (best).

Given our maintainer metric example, we might rate a package as `1` (great)
if a maintainer is identified or `0` (poor) if no maintainer is found.

> Learn more: `?pkg_score`
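
Continuing that sketch, scoring collapses the metric down to a value on the 0-1
scale using `riskmetric`'s `metric_score()` function (described later in this
vignette), again assuming an `assess_has_maintainer()` assessment exists:

```{r, eval = FALSE}
# expected to score 1 when a maintainer is identified and 0 when none is found
metric_score(assess_has_maintainer(pkg_ref("riskmetric")))
```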

### 5. Summarizing across metric scores

Finally, we may want to look at these scores of individual metrics in some sort
of aggregate risk score. Naturally, not all metric scores may warrant the same
weight. Having scores normalized to a fixed range allows us to define a 
summarizing algorithm to consistently assess and compare packages. 

Notably, risk runs on an inverse scale to metric scores: high metric scores are
favorable, whereas high risk scores are unfavorable.

> Learn more: `?summarize_scores`



# The `riskmetric` Workflow

These five steps are broken down into just a handful of primary functions. 

```{r, echo = FALSE}
knitr::include_graphics("../man/figures/core-workflow.svg")
```

## Creating a package reference object

First, we create a _package reference_ class object using the `pkg_ref`
constructor function. This object will contain metadata as it's collected in the
various risk assessments. 

```{r, eval = FALSE}
library(riskmetric)
riskmetric_pkg_ref <- pkg_ref("riskmetric")
print(riskmetric_pkg_ref)
```

```{r, echo = FALSE, warning = FALSE}
rver <- gsub("\\.\\d+$", "", paste0(R.version$major, ".", R.version$minor))
package <- pkg_ref("riskmetric")

# hack in order to mutate package environment directly so nobody accidentally
# publishes any personal info in their library path
invisible(riskmetric:::bare_env(package, {
  package$path <- sprintf(
    "/home/user/username/R/%s/Resources/library/riskmetric", 
    rver)
}))

package
```

Here we see that the `riskmetric` `pkg_ref` object is actually subclassed as a
`pkg_install`. There is a hierarchy of `pkg_ref` object classes including
`pkg_source` for source code directories, `pkg_install` for locally installed
packages, and `pkg_remote` for references to package information pulled from the
internet, including `pkg_cran_remote` and `pkg_bioc_remote` for CRAN- and
Bioconductor-hosted packages respectively.

Throughout all of `riskmetric`, S3 classes are used extensively so that generic
functions can provide divergent, reference-mechanism-dependent behaviors for
caching metadata, assessing packages and scoring metrics.

Likewise, some fields have a trailing `...` indicating that they haven't yet
been computed, but that the reference type has knowledge of how to go out and
grab that information if the field is requested. Behind the scenes, this is done
using the `pkg_ref_cache` function, which itself is an S3 generic, using the
name of the field and `pkg_ref` class to dispatch to appropriate functions for
retrieving metadata.
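
As a quick illustration (the exact class vector and available fields may differ
between `riskmetric` versions and reference types):

```{r, eval = FALSE}
# the subclass records where the package information comes from,
# e.g. "pkg_install" alongside the parent "pkg_ref" class
class(riskmetric_pkg_ref)

# requesting a not-yet-computed field dispatches to a pkg_ref_cache method
# and stores the result in the reference for reuse by later assessments
riskmetric_pkg_ref$description
```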

## Assessing a package

There are a number of prespecified assessments, all prefixed by convention with
`assess_*`. Every assessment function takes a single argument, a `pkg_ref`
object, and produces a `pkg_metric` object corresponding to the `assess_*`
function that was applied.

```{r, eval = FALSE}
riskmetric_export_help_metric <- assess_export_help(riskmetric_pkg_ref)
print(riskmetric_export_help_metric[1:5])
```

```{r, echo = FALSE}
rver <- gsub("\\.\\d+$", "", paste0(R.version$major, ".", R.version$minor))
package <- pkg_ref("riskmetric")

riskmetric_export_help_metric <- assess_export_help(package)
print(riskmetric_export_help_metric[1:5])

# hack in order to mutate package environment directly so nobody accidentally
# publishes any personal info in their library path
invisible(riskmetric:::bare_env(package, {
  package$path <- sprintf(
    "/home/user/username/R/%s/Resources/library/riskmetric", 
    rver)
}))
```

Every function in the `assess_*` family of functions is expected to return a
basic measure of a package. In this case, we return a named logical vector
indicating whether each exported function has an associated help document.

The return type also leaves a trail of what assessment produced this metric. In
addition to the `pkg_metric` class, we now have a `pkg_metric_export_help` 
subclass which is used for dispatching to an appropriate scoring method.
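
Inspecting the class of the returned object shows this trail (a sketch; the
exact class vector may include additional classes):

```{r, eval = FALSE}
# expected to include "pkg_metric_export_help" and "pkg_metric"
class(riskmetric_export_help_metric)
```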

It's worth pointing out that the act of calling this function has had the 
side-effect of mutating our `riskmetric_pkg_ref` object.

```{r, eval = FALSE}
riskmetric_pkg_ref
```

```{r, echo = FALSE}
package
```

Here `riskmetric_pkg_ref$help_aliases` has a known value because it was needed
to assess whether the package has documentation for its exports.

> _a note on caching_  
>
> This happens because `pkg_ref` objects are really just `environment`s with some
syntactic sugar, and `environment`s in R are always modified by reference. This
globally mutable behavior is used so that operations performed by one assessment
can be reused by others. Likewise, computing one field may require that a
previous field has been computed first, triggering a chain of metadata
retrieval. In this case, `$help_aliases` required that `$path` be available.
>
>This chaining behavior comes for free by implementing the `pkg_ref_cache`
caching function for each field. For contributors, this alleviates the need to
remember an order of operations, and for users this behavior means that subsets
of assessments can be run in an arbitrary order without pulling superfluous
metadata, keeping track of ever-growing objects or ensuring certain assessments
get called before others.

In addition to the metric-specific `assess_*` family of functions, a more
comprehensive `pkg_assess` function is provided. Notably, `pkg_assess` accepts a
`pkg_ref` object and a list of assessments to apply, defaulting to
`all_assessments()`, which returns a list of all `assess_*` functions in the
`riskmetric` namespace.

```{r, eval = FALSE}
pkg_assess(riskmetric_pkg_ref)
```

```{r, echo = FALSE}
pkg_assess(pkg_ref("riskmetric"))
```
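
Because the list of assessments can be supplied explicitly, it is also easy to
run just a subset of them. A sketch, assuming the list is passed through an
`assessments` argument and that `assess_has_news` and `assess_export_help` are
among the available assessments:

```{r, eval = FALSE}
pkg_assess(
  riskmetric_pkg_ref,
  assessments = list(assess_has_news, assess_export_help)
)
```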

Since the full set of assessments is a lot to take in, `pkg_assess` also
operates on `tibble`s, returning a cleaner output that might be easier to sort
through when assessing a package.

```{r, eval = FALSE}
pkg_assess(as_tibble(riskmetric_pkg_ref))
```

```{r, echo = FALSE}
pkg_assess(as_tibble(pkg_ref("riskmetric")))
```


## Scoring package metrics

After a metric has been collected, we "score" the metric to convert it to a 
quantified representation of risk. 

There is a single scoring function, `metric_score`, which dispatches on the
class of the metric passed to it in order to interpret the atomic metric result.

```{r}
metric_score(riskmetric_export_help_metric)
```

For convenience, `pkg_score` is provided to operate on `pkg_ref` objects
directly. It can also operate on the `tibble` produced by applying `pkg_assess`
to a `pkg_ref` `tibble`, providing a new `tibble` with scored metrics.

```{r, warning  = FALSE}
pkg_score(pkg_assess(as_tibble(pkg_ref("riskmetric"))))
```

> Note that `pkg_assess` and `pkg_score` accept an `error_handler` argument
which determines how errors are escalated for communication. We've chosen to
default to being cautious, displaying warnings liberally to ensure thorough
documentation of the risk assessment process. If these warnings are bothersome,
there are alternative reporting schemes in the `assessment_error_*` and
`score_error_*` families of functions.
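
For example, a quieter run might swap in one of those handlers. A sketch,
assuming `assessment_error_empty` is one of the exported `assessment_error_*`
handlers:

```{r, eval = FALSE}
pkg_assess(
  as_tibble(pkg_ref("riskmetric")),
  error_handler = assessment_error_empty
)
```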

# Cohort assessments

Packages are often part of a larger cohort, so we've made sure to accommodate
assessments of multiple packages simultaneously.

## Creating a `tibble` from `pkg_ref`s

We start by calling our `pkg_ref` constructor function with a list or vector.
Doing so will return a list of `pkg_ref` objects. With this list, we can use
`tibble::as_tibble` to convert the `pkg_ref` list into a `tibble`, automatically
populating some useful index columns like `package` and `version`. To clean
things up further we can use the `magrittr` pipe (`%>%`) to chain these commands
together.

```{r}
package_tbl <- pkg_ref(c("riskmetric", "utils", "tools")) %>%
  as_tibble()
```

## The `riskmetric` workflow on multiple packages

`pkg_assess` and `pkg_score` can operate on `tibble`s, making it easy to
assess an entire cohort of packages at once.

```{r, warning = FALSE}
package_tbl %>%
  pkg_assess() %>%
  pkg_score()
```

Notice that a summary column, `pkg_score`, is included in addition to our metric
scores. This value is a shorthand for aggregating a weighted average of risk
scores across `tibble` columns using `summarize_scores`.

```{r, warning = FALSE}
package_tbl %>%
  pkg_assess() %>%
  pkg_score() %>%
  summarize_scores()
```

# How you can help...

As you can see, the package is currently quite bare-bones and nobody would
reasonably choose packages based solely on the existence of a NEWS file. 

Our priority so far has been to set up an extensible framework as the foundation
for a community effort, and that's where you come in! There are a few things you
can do to get started.

1. [Propose a new metric on the `riskmetric` GitHub](https://github.com/pharmaR/riskmetric/issues/new?labels=Metric%20Proposal)
1. [Take part in the discussion](https://github.com/pharmaR/riskmetric/issues?q=is%3Aopen+is%3Aissue+label%3A%22Metric+Proposal%22) about which metrics are captured and how they are measured
1. Check out the `extending-riskmetric` vignette to see how to extend the
functionality with your own metrics
1. Help us to develop new metrics and package functionality