---
title: "origin"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{origin}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

In contrast to other programming languages, R has no widely established and 
undisputed style guide comparable to, e.g., PEP 8 for Python. As a data scientist, 
I helped to establish a company-wide R style guide. While it mainly relies on the 
[tidyverse style guide](https://style.tidyverse.org/), we generally decided
to be more explicit in our coding practice. This includes always referring
to functions from non-base R packages with the double colon operator `::`. While it
is relatively easy to establish such a convention in new projects, it is 
challenging to adapt ongoing projects and legacy code. `origin` allows for much
faster conversion of both legacy code and code that is currently being written.


## Purpose of `origin`

The main purpose is to add `pkg::` to an R function call, i.e. it changes
code like this:

<img src="vignette_quickview.gif" width="650px" />
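For readers who cannot view the animation, the following minimal sketch contrasts an implicit call with its explicit `pkg::` counterpart. It deliberately uses only packages that ship with R ({stats} and {tools}) so that it runs anywhere; the concrete calls are illustrative choices and not taken from `origin` itself.

```r
x <- c(3, 1, 2)

# Implicit call: relies on {stats} being attached to the search path
implicit <- median(x)

# Explicit call: states where the function comes from
explicit <- stats::median(x)

# Both forms call the same function
same_result <- identical(implicit, explicit)

# {tools} is not attached by default, yet `::` reaches it without library()
ext <- tools::file_ext("testscript.R")
```

The explicit form is what `origin` writes into your scripts for the packages you ask it to consider.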

## Usage of `origin`

In general, you can originize some selected text (more on that later 
in Addins), a whole script, or all scripts in a specific folder, e.g. your
project folder. There is a specifically designed function for each purpose, yet
they all share the same options. Therefore, only `originize_file()` is 
presented in detail as an example, with its default options.

### Code Usage
```{r, eval=FALSE}
originize_file(file = "testscript.R",
               pkgs = .packages(), 
               overwrite = TRUE,
               ask_before_applying_changes = TRUE,
               ignore_comments = TRUE,
               check_conflicts = TRUE,
               add_base_packages = FALSE,
               check_base_conflicts = TRUE, 
               check_local_conflicts = TRUE,
               excluded_functions = list(dplyr = c("%>%", "across"),
                                         data.table = c(":=", "%like%"),
                                         # exclude from all packages:
                                         c("first", "last")), 
               verbose = TRUE, 
               use_markers = TRUE)
```


### Common Arguments
* `pkgs`: which packages to check for functions used in the code (see **Considered Packages**).
Defaults to all packages attached via `library` or `require`.
* `overwrite`: whether to actually insert `pkg::` into the code. Otherwise,
the logging only shows what *would* happen. Note that `ask_before_applying_changes`
still lets you keep control over your code before `origin` changes anything.
* `ask_before_applying_changes`: whether the user must approve the changes 
before they are applied (`TRUE`) or changes are applied immediately (`FALSE`).
* `check_conflicts`: whether `origin` should check for potential 
namespace conflicts, i.e. a used function is exported by more than one considered
package. User input is required to resolve such a conflict. 
Setting this to `TRUE` is strongly encouraged.
* `add_base_packages`: whether base packages should also be added, e.g. `base::sum()`.
* `check_base_conflicts`: whether `origin` should also check for conflicts
with base R functions.
* `check_local_conflicts`: whether `origin` should also check for conflicts
with locally defined functions anywhere in your project. Note that it does not
check the environment but solely parses files and scans them for function definitions.
* `excluded_functions`: a (named) list of functions to exclude from checking.
* `verbose`: whether logging is performed, either in the 
console or via the Markers tab in RStudio.
* `use_markers`: whether to use the Markers tab in RStudio.
* `filetypes`: which file types to consider. 
Currently, `origin` supports .R, .Rmd, and .Qmd (Quarto) files.
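The shape of `excluded_functions` is easiest to see by inspecting such a list in plain R. The sketch below reuses the list from the call above and only demonstrates how named and unnamed entries can be told apart; how `origin` processes the list internally is not shown here, and the helper variable names are made up for illustration.

```r
excluded_functions <- list(
  dplyr = c("%>%", "across"),      # excluded for dplyr only
  data.table = c(":=", "%like%"),  # excluded for data.table only
  c("first", "last")               # unnamed entry: excluded for all packages
)

# In a partially named list, unnamed entries get the empty string as name
entry_names <- names(excluded_functions)

# Hypothetical split into package-specific and global exclusions
package_exclusions <- excluded_functions[entry_names != ""]
global_exclusions <- unlist(excluded_functions[entry_names == ""],
                            use.names = FALSE)
```

Named entries restrict the exclusion to one package, while an unnamed entry applies regardless of which package exports the function.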



### Addins

Besides using regular R functions to originize files, there are also useful
addins delivered with `origin`. These addins are designed to be used on-the-fly
while coding. You can either originize selected text, the currently opened file,
or all scripts in the currently opened project. However, to have as much control
as when using functions, each function argument corresponds to an option that 
can be set and used inside the addins, e.g.

```{r, eval=FALSE}
options(origin.pkgs = c("dplyr", "data.table"),
        origin.overwrite = TRUE)
```

In fact, most function arguments of `origin` first check whether a corresponding 
option has been set and use its value as their default. This yields the same
outcome regardless of whether you use an addin or call a function directly.
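The pattern of an argument falling back to an option can be sketched in a few lines of base R. The function `originize_sketch()` below is hypothetical and only mimics the described behaviour; it is not `origin`'s actual source code. The option names `origin.pkgs` and `origin.overwrite` are the ones used above.

```r
# Hypothetical stand-in for an origin function: each argument defaults to
# the matching option and falls back to a hard-coded value if it is unset
originize_sketch <- function(pkgs = getOption("origin.pkgs", .packages()),
                             overwrite = getOption("origin.overwrite", TRUE)) {
  list(pkgs = pkgs, overwrite = overwrite)
}

# Set options once, as you would for the addins; keep the old values
old_options <- options(origin.pkgs = c("dplyr", "data.table"),
                       origin.overwrite = FALSE)

from_options <- originize_sketch()              # picks up both options
explicit <- originize_sketch(pkgs = "stats")    # an explicit argument wins

# Restore the previous state
options(old_options)
```

An explicitly passed argument always takes precedence, so options act purely as user-level defaults.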


### Safety Measures
Since `origin` changes files on disk, it is very important that the user has 
full control over what happens and user input is required before critical steps.

#### Logging
Most importantly, the user must be aware of what the originized file(s) 
would look like. For this, all changes **and** potential missed changes
are presented, either in the Markers tab (recommended) or in the console.

<img src="markers_logging.png" width="650px" />
<img src="console_logging.png" width="650px" />

* <span style="color: #00c7cc">insertion</span>: `pkg::` is inserted prior to a function
* <span style="color: #ff0000">missing</span>: an object that has the same name as a function 
but is not unambiguously used as a function. In R it is usually no problem
to have variables that are named like functions (`data` or `df` are popular examples).
While it is always clear when a function is called directly, functions
can also be passed as arguments to other functions, most famously in functional programming 
with the *apply family or purrr. `origin` highlights such cases in 
the logging output.
* <span style="color: #ffa500">infix</span>: functions like `%>%` are exported by packages but cannot be called
with the `pkg::fun()` convention. Such functions are highlighted by default
to point out to the user that they stem from a package. When using 
dplyr-style code, consider excluding the pipe operator via 
`excluded_functions`.
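The two situations behind the "missing" category can be reproduced with base R alone. The snippet is only an illustration of the ambiguity described above, using {stats} functions as stand-ins for functions from any package.

```r
values <- list(a = c(1, 5, 3), b = c(2, 4))

# A function passed as an argument: `median` appears without parentheses,
# so a parser cannot be sure it refers to stats::median
implicit <- vapply(values, median, numeric(1))

# The explicit variant has to be written by hand in such cases
explicit <- vapply(values, stats::median, numeric(1))

# A variable that shares its name with a function: `df` is a data frame here,
# while stats::df is the density function of the F distribution
df <- data.frame(x = 1:3)
name_clash <- is.data.frame(df) && is.function(stats::df)
```

In both cases the name alone does not reveal whether a package function is meant, which is why such occurrences are reported rather than changed.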


#### Same Function Name in Multiple Packages
Due to the variety of R packages, function names need not be unique across all
packages out there. By default, R masks previously attached functions by those 
attached afterwards. `origin` mimics this rule by assigning a higher priority
to those packages that are listed first. In case there is a conflict regarding
a **used** function, these functions are listed along with the packages from 
which they stem.

```{asis, eval=FALSE}
Used functions in mutliple Packages!  
  
  filter: dplyr, stats
first: data.table, dplyr 

Order in which relevant packges are evaluated:
  data.table >> dplyr >> stats 

Do you want to proceed?
1: YES
2: NO
```
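The priority rule can be sketched with a toy lookup table. The list `exports` below contains only the handful of functions mentioned in the message above, and the helpers `exported_by()` and `resolve_origin()` are hypothetical names for this illustration; `origin` itself works on the full export lists of the considered packages.

```r
# Toy export table; the order of the list encodes the package priority
exports <- list(
  data.table = c("first", "last"),
  dplyr = c("first", "last", "filter"),
  stats = c("filter")
)

# All packages that export a given function, in priority order
exported_by <- function(fun, exports) {
  names(exports)[vapply(exports, function(e) fun %in% e, logical(1))]
}

# The package a function is assigned to: the first one that exports it
resolve_origin <- function(fun, exports) {
  hits <- exported_by(fun, exports)
  if (length(hits) == 0) NA_character_ else hits[1]
}

filter_candidates <- exported_by("filter", exports)
filter_origin <- resolve_origin("filter", exports)
first_origin <- resolve_origin("first", exports)
unknown_origin <- resolve_origin("map", exports)
```

With the order `data.table >> dplyr >> stats`, `filter` resolves to dplyr and `first` to data.table, matching the evaluation order printed in the message.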


#### Custom Functions Mask Exported Functions
Just as packages mask each other's functions, locally defined custom 
functions mask functions exported by packages. In case you defined your own `last` function
in your project, `origin` should **not** add `dplyr::` to it. Therefore,
your project is searched for function definitions and local functions 
have higher priority than those exported by packages. Note that, 
depending on the project size, this process can take quite some time.
In this case, set the argument/option `path_to_local_functions` to a 
subdirectory or `check_local_conflicts` to `FALSE` to skip this feature.

```{asis, eval=FALSE}
Locally defined and used functions mask exported functions from packages

last: dplyr

Local functions have higher priority. In case you want to use an
exported version of a function listed above set pkg::fun manually

Got it?
1: YES
2: NO
3: Show files
```
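Detecting function definitions by parsing, without evaluating any code, could look roughly like the sketch below. The helper `find_local_functions()` is hypothetical and only covers top-level assignments of the form `name <- function(...)` or `name = function(...)`; it is not `origin`'s actual implementation.

```r
# Return the names of functions defined at the top level of some R code
find_local_functions <- function(code) {
  exprs <- parse(text = code, keep.source = FALSE)

  is_function_definition <- function(e) {
    is.call(e) &&
      as.character(e[[1]])[1] %in% c("<-", "=") &&      # an assignment ...
      is.name(e[[2]]) &&                                # ... to a plain name ...
      is.call(e[[3]]) &&
      identical(e[[3]][[1]], as.name("function"))       # ... of a function
  }

  definitions <- Filter(is_function_definition, as.list(exprs))
  vapply(definitions, function(e) as.character(e[[2]]), character(1))
}

script <- c(
  "last <- function(x) x[length(x)]",
  "df <- data.frame(a = 1)",
  "helper = function() NULL",
  "library(stats)"
)
local_functions <- find_local_functions(script)
```

Because nothing is evaluated, such a scan is safe to run on arbitrary project files, but it has to read every file, which explains the run time on large projects.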




#### Many Files Selected
When originizing a complete folder or project, many R scripts might be checked,
which can result in a long run time of `origin`. Since the user may be unaware 
of how many files the selected folder contains,
a warning is triggered and user input is required.

```{asis, eval=FALSE}
You are about to originize 99 files.

Proceed?
1: YES
2: NO
3: Show files
```



#### Final Check
Before the proposed changes are finally applied, one last user input 
is required.

```{asis, eval=FALSE}
Happy with the result? 😀

1: YES
2: NO
```


## Discussion
Whether or not to add `pkg::` to each (imported) function is a [controversial](https://stackoverflow.com/q/4372145/8107362)
[issue](https://stackoverflow.com/q/23232791/8107362) in the R community. While the tidyverse style guide does not mention explicit namespacing, [R Packages](https://r-pkgs.org) and the [Google R style guide](https://google.github.io/styleguide/Rguide.html#qualifying-namespaces) are in favor of it.

Pros

+ very explicit
+ completely avoid namespace conflicts
+ no need to attach the complete namespace of a package
+ keep track of which function belongs to which package

Cons 

- (minimal) performance issue
- more writing required
- longer code
- infix functions like `%>%` cannot be called via `magrittr::%>%`
  and workarounds are still required here. Use either of
  ```
  # attach only the pipe ...
  library(magrittr, include.only = "%>%")
  # ... or assign it to a local name
  `%>%` <- magrittr::`%>%`
  ```
- calling `library()` at the top of a script clearly indicates which packages are
  needed. A package that is not yet installed throws an error right away rather
  than only when a function cannot be found later in the script. However, one can use 
  the `include.only` argument and set it to `NULL`. No functions are attached
  to the search path then.
  ```
  library(magrittr, include.only = NULL)
  ``` 
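The infix limitation and the local-alias workaround can be tried out with a base R operator, so no additional package is needed. The operator name `%is_in%` is made up for this illustration.

```r
# An infix operator can be reached via `::` only in prefix form with backticks
prefix_call <- base::`%in%`(2, 1:3)

# Assigning it to a local name makes it usable as an infix operator again
`%is_in%` <- base::`%in%`
infix_call <- 2 %is_in% 1:3
```

The same mechanism is what the assignment `` `%>%` <- magrittr::`%>%` `` relies on.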


## Check Package Usage since origin 1.0.0

As a new feature, `origin` exports the function `check_pkg_usage()`. 
Suppose you take over a project or have built up a huge barrage of `library` calls
over time. Which of those packages are actually still needed?
Just run all those `library(...)` calls and then call `check_pkg_usage()`.

### Interpreting the Output of check_pkg_usage

```
== Package Usage Report ================================================
-- Used Packages: 2 ----------------------------------------------------
v data.table
v testthat  

-- Unused Packages: 1 --------------------------------------------------
i dplyr

-- Possible Namespace Conflicts:  1 -----------------------------------
x last     	data.table >> dplyr

-- Specifically (`pkg::fun()`) further used Packages: 2 ----------------
i purrr

-- Functions with unknown origin: 1 ------------------------------------
x map
```
The output shows that

- we had attached 3 packages: {data.table}, {testthat}, and {dplyr}
- functions from {data.table} and {testthat} are used
- {dplyr} functions are not used
- there is a namespace conflict for the function `last` between {data.table} and {dplyr}
- additionally, we use `purrr::` on some occasions
- we use the `map()` function, which is not exported from {data.table}, 
  {testthat}, or {dplyr}. Note that `map` is exported from {purrr}, which is 
  used elsewhere, but here our code would fail since {purrr} is not attached 
  and `map` cannot be found. 

A Markers tab shows all unknown functions and all unknown packages that are 
used explicitly.
 
 
### Interpreting the Result of check_pkg_usage
A closer look at the return value, stored here as `result` (e.g. via 
`result <- check_pkg_usage()`), reveals more detail:
 
 ```
 as.data.frame(result)
#>       pkg         fun n_calls namespaced conflict conflict_pkgs
#> 1    base        %in%      53      FALSE    FALSE            NA
#> 2    base   .packages       8      FALSE    FALSE            NA
#> 3    base      Filter       3      FALSE    FALSE            NA
#> 4    base         Map       1      FALSE    FALSE            NA
#> 5    base      Reduce       5      FALSE    FALSE            NA
#> ...
```

It first shows a lot of base functions. Even though they are not explicitly 
attached by the user, base R packages are always attached. The print output does 
not show them, but they are available if you want to dive deeper into the 
functions that are used in the project.

```
#>             pkg              fun n_calls namespaced conflict conflict_pkgs
#> 110  data.table           %like%      10      FALSE    FALSE            NA
#> 111  data.table               :=       1      FALSE    FALSE            NA
#> 112  data.table               CJ       1      FALSE    FALSE            NA
#> 113  data.table    as.data.table       1      FALSE    FALSE            NA
#> 114  data.table    as.data.table       3       TRUE    FALSE            NA
#> 115  data.table             last       2       TRUE    FALSE            NA
#> 116  data.table             last       1      FALSE     TRUE         dplyr
```
Going further, there are a number of {data.table} functions that have been used.
Some are listed twice because they were sometimes called via `data.table::` and
sometimes not. Furthermore, `last` is marked with `conflict = TRUE`.
This is because {dplyr} exports a `last` function as well. However, since 
{data.table} has a higher priority than {dplyr} in this project, {origin} 
considers it a {data.table} function. Note that if a function is 
namespaced via `::`, no conflict is reported.
 
 
 Finally, at the end of the output:
 
 ```
#>         pkg  fun n_calls namespaced conflict conflict_pkgs
#> 219    <NA>  map       1      FALSE       NA            NA
#> 220  dplyr  <NA>       0         NA       NA            NA
```
Here we see the `map` function, which could not be assigned to any of the given packages,
and the {dplyr} package, which has not been used.
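Since the result converts to an ordinary data frame, the findings above can also be extracted programmatically. The sketch below rebuilds four of the rows shown in this section by hand, so it runs without `origin`; it assumes that `as.data.frame(result)` has exactly the columns displayed above, and the variable names are chosen for illustration.

```r
# Hand-copied rows 115, 116, 219 and 220 from the output shown above
usage <- data.frame(
  pkg = c("data.table", "data.table", NA, "dplyr"),
  fun = c("last", "last", "map", NA),
  n_calls = c(2, 1, 1, 0),
  namespaced = c(TRUE, FALSE, FALSE, NA),
  conflict = c(FALSE, TRUE, NA, NA),
  conflict_pkgs = c(NA, "dplyr", NA, NA),
  stringsAsFactors = FALSE
)

# Packages without any used function
unused_pkgs <- usage$pkg[!is.na(usage$pkg) & is.na(usage$fun)]

# Functions that could not be assigned to a package
unknown_funs <- usage$fun[is.na(usage$pkg) & !is.na(usage$fun)]

# Calls affected by a namespace conflict; which() drops the NA entries
conflicts <- usage[which(usage$conflict), c("pkg", "fun", "conflict_pkgs")]
```

Applied to a real result, the same filters could be used to prune unneeded `library()` calls or to list calls that deserve an explicit `pkg::` prefix.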

### Final Remarks
Locally defined functions are also detected via parsing. They have a
higher priority than functions exported by packages.