---
title: "explore"
author: "Roland Krasser"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{explore}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The explore package simplifies Exploratory Data Analysis (EDA). Get faster insights with less code! 

explore package on Github: [https://github.com/rolkra/explore](https://github.com/rolkra/explore)

As the explore-functions fits well into the tidyverse, we load the dplyr-package as well.

```{r message=FALSE, warning=FALSE}
library(dplyr)
library(explore)
```

### Interactive data exploration

Explore your data set (in this case the iris data set) in one line of code:

```{R eval=FALSE, echo=TRUE}
explore(iris)
```

A shiny app is launched, you can inspect individual variable, explore their relation to a target (binary / categorical / numerical), grow a decision tree or create a fully automated report of all variables with a few "mouse clicks".
 
![](../man/figures/explore-shiny-iris-target-species.png){width=600px}

You can choose each variable containing as a target, that is binary (0/1, FALSE/TRUE or "no"/"yes"), categorical or numeric.


### Report variables

Create a rich HTML report of all variables with one line of code:

```{R eval=FALSE, echo=TRUE}
# report of all variables
iris %>% report(output_file = "report.html", output_dir = tempdir())
```

![](../man/figures/report-attributes.png){width=600px}

Or you can simply add a target and create the report. In this case we use a binary target, but a categorical or numerical target would work as well.

```{R eval=FALSE, echo=TRUE}
# report of all variables and their relationship with a binary target
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris %>% 
  report(output_file = "report.html", 
         output_dir = tempdir(),
         target = is_versicolor)

```

If you use a binary target, the parameter ***split = FALSE*** (or `targetpct = TRUE`) will give you a different view on the data.

![](../man/figures/report-target.png){width=600px}

### Grow a decision tree

Grow a decision tree with one line of code:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=4}
iris %>% explain_tree(target = Species)
```

You can grow a decision tree with a binary target too.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=4}
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris %>% select(-Species) %>% explain_tree(target = is_versicolor)
```

Or using a numerical target. The syntax stays the same.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=4}
iris %>% explain_tree(target = Sepal.Length)
```

You can control the growth of the tree using the parameters `maxdepth`, `minsplit` and `cp`.

To create other types of models use `explain_forest()`, `explain_xgboost()` and `explain_logreg()`.

### Explore dataset

Explore your table with one line of code to see which type of variables it contains.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore_tbl()
```

You can also use `describe_tbl()` if you just need the main facts without visualization.

```{r message=FALSE, warning=FALSE}
iris %>% describe_tbl()
```

### Explore variables

Explore a variable with one line of code. You don't have to care if a variable is numerical or categorical.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Species)
```

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length)
```

### Explore variables with a target

Explore a variable and its relationship with a binary target with one line of code. You don't have to care if a variable is numerical or categorical.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length, target = is_versicolor)
```

Using split = FALSE will change the plot to %target:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length, target = is_versicolor, split = FALSE)
```

The target can have more than two levels:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length, target = Species)
```

Or the target can even be numeric:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length, target = Petal.Length)
```

### Explore multiple variables

```{r message=FALSE, warning=FALSE, fig.width=8, fig.height=2.5}
iris %>% 
  select(Sepal.Length, Sepal.Width) %>% 
  explore_all()
```

```{r message=FALSE, warning=FALSE, fig.width=8, fig.height=2.5}
iris %>% 
  select(Sepal.Length, Sepal.Width, is_versicolor) %>% 
  explore_all(target = is_versicolor)
```

```{r message=FALSE, warning=FALSE, fig.width=8, fig.height=2.5}
iris %>% 
  select(Sepal.Length, Sepal.Width, is_versicolor) %>% 
  explore_all(target = is_versicolor, split = FALSE)
```

```{r message=FALSE, warning=FALSE, fig.width=8, fig.height=2.5}
iris %>% 
  select(Sepal.Length, Sepal.Width, Species) %>% 
  explore_all(target = Species)
```

```{r message=FALSE, warning=FALSE, fig.width=8, fig.height=2.5}
iris %>% 
  select(Sepal.Length, Petal.Width, Petal.Length) %>% 
  explore_all(target = Petal.Length)
```

```{r message=FALSE, warning=FALSE}
data(iris)
```

To use a high number of variables with `explore_all()` in a RMarkdown-File, it is necessary to set a meaningful fig.width and fig.height in the junk. The function `total_fig_height()` helps to automatically set fig.height: ```fig.height=total_fig_height(iris)```

```{r message=FALSE, warning=FALSE, fig.width=8, fig.height=total_fig_height(iris, size=2.5)}
iris %>% 
  explore_all()
```

If you use a target: ```fig.height=total_fig_height(iris, var_name_target = "Species")```

```{r message=FALSE, warning=FALSE, fig.width=8, fig.height=total_fig_height(iris, var_name_target = "Species", size=2.5)}
iris %>% explore_all(target = Species)
```

You can control total_fig_height() by parameters ncols (number of columns of the plots) and size (height of 1 plot)

### Explore correlation between two variables

Explore correlation between two variables with one line of code:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length, Petal.Length)
```

You can add a target too:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length, Petal.Length, target = Species)
```

### Explore options

If you use explore to explore a variable and want to set lower and upper limits for values, you can use the `min_val` and `max_val` parameters. All values below min_val will be set to min_val. All values above max_val will be set to max_val.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length, min_val = 4.5, max_val = 7)
```

`explore` uses auto-scale by default. To deactivate it use the parameter `auto_scale = FALSE`

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% explore(Sepal.Length, auto_scale = FALSE)
```

### Describing data

Describe your data in one line of code:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% describe()
```

The result is a data-frame, where each row is a variable of your data. You can use `filter` from dplyr for quick checks:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
# show all variables that contain less than 5 unique values
iris %>% describe() %>% filter(unique < 5)
```

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
# show all variables contain NA values
iris %>% describe() %>% filter(na > 0)
```

You can use `describe` for describing variables too. You don't need to care if a variale is numerical or categorical. The output is a text.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
# describe a numerical variable
iris %>% describe(Species)
```

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
# describe a categorical variable
iris %>% describe(Sepal.Length)
```

### Use data

Use one of the prepared datasets to explore:

* `use_data_beer()`
* `use_data_diamonds()`
* `use_data_iris()`
* `use_data_mpg()`
* `use_data_mtcars()`
* `use_data_penguins()`
* `use_data_starwars()`
* `use_data_titanic()`
* `use_data_wordle()`

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
use_data_beer() %>% describe()
```

### Create data

Use one of the prepared datasets to explore:

* `create_data_abtest()`
* `create_data_app()`
* `create_data_buy()`
* `create_data_churn()`
* `create_data_esoteric()`
* `create_data_newsletter()`
* `create_data_person()`
* `create_data_unfair()`
* `create_data_random()`

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
# create dataset and describe it
data <- create_data_app(obs = 100)
describe(data)
```

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
# create dataset and describe it
data <- create_data_random(obs = 100, vars = 5)
describe(data)
```

You can build you own random dataset by using ```create_data_empty()``` and  ```add_var_random_*()``` functions:

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
# create dataset and describe it
data <- create_data_empty(obs = 1000) %>% 
  add_var_random_01("target") %>% 
  add_var_random_dbl("age", min_val = 18, max_val = 80) %>% 
  add_var_random_cat("gender", 
                     cat = c("male", "female", "other"), 
                     prob = c(0.4, 0.4, 0.2)) %>% 
  add_var_random_starsign() %>%
  add_var_random_moon()
describe(data)
```

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
data %>% select(random_starsign, random_moon) %>% explore_all()
```

### Basic data cleaning

To clean a variable you can use `clean_var`. With one line of code you can rename a variable, replace NA-values and set a minimum and maximum for the value.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
iris %>% 
  clean_var(Sepal.Length, 
            min_val = 4.5, 
            max_val = 7.0, 
            na = 5.8, 
            name = "sepal_length") %>% 
  describe()
```

To drop variables or observations you can use ```drop_var_*()``` and ```drop_obs_*()``` functions.

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
use_data_penguins() %>% 
  describe_tbl()
```

```{r message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
use_data_penguins() %>%
  drop_obs_with_na() %>%
  describe_tbl()
```

### Create notebook

Create an RMarkdown template to explore your own data. Set output_dir (existing file may be overwritten)

```{r eval=FALSE, message=FALSE, warning=FALSE}
create_notebook_explore(
  output_dir = tempdir(),
  output_file = "notebook-explore.Rmd")
```

![](../man/figures/notebook-explore.png){width=600px}