---
title: "Cluster Management"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Cluster Management}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

`{brickster}` has 1:1 mappings with the clusters REST API, enabling full control of Databricks clusters from your R session.

## Cluster Creation

Clusters have a number of parameters and can be configured to match the needs of a given workload. `db_cluster_create()` creates a cluster in a Databricks workspace and supports all cloud platforms (AWS, Azure, GCP).

Depending on the cloud you will need to change the node types and set `cloud_attrs` to one of `aws_attributes()`, `azure_attributes()`, or `gcp_attributes()`.

Below we will create a cluster on AWS and then step through using the other supporting functions.

```{r setup}
library(brickster)

# create a small cluster on AWS with DBR 9.1 LTS
new_cluster <- db_cluster_create(
  name = "brickster-cluster",
  spark_version = "9.1.x-scala2.12",
  num_workers = 2,
  node_type_id = "m5a.xlarge",
  cloud_attrs = aws_attributes(
    ebs_volume_count = 3,
    ebs_volume_size = 100
  )
)
```

```{r, echo=FALSE, results='hide'}
temp <- get_and_start_cluster(cluster_id = new_cluster$cluster_id)
```

Refer to the documentation for details on parameters not covered here (e.g. `spark_conf`, shown below).
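As a minimal sketch, `spark_conf` takes a named list of Spark configuration properties; the specific property and value below are illustrative only.

```{r, results='hide'}
# illustrative only: create a cluster with additional Spark configuration
db_cluster_create(
  name = "brickster-cluster-tuned",
  spark_version = "9.1.x-scala2.12",
  num_workers = 2,
  node_type_id = "m5a.xlarge",
  cloud_attrs = aws_attributes(
    ebs_volume_count = 3,
    ebs_volume_size = 100
  ),
  spark_conf = list(
    "spark.sql.shuffle.partitions" = "200"
  )
)
```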

Before creating a cluster you may want to check the supported values for a number of the parameters. There are functions to assist with this:

|                        Function | Purpose                                                                                                                      |
|----------------:|-------------------------------------------------------|
| `db_cluster_runtime_versions()` | List of runtime versions available for the workspace, useful for finding relevant `spark_version`                            |
|  `db_cluster_list_node_types()` | List of supported node types available in workspace/region, useful for finding relevant `node_type_id`/`driver_node_type_id` |
|       `db_cluster_list_zones()` | AWS Only, lists availability zones (AZ) clusters can occupy                                                                  |
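For example, you might inspect the available values before deciding on a runtime and node type (a sketch; the structure of each result mirrors the underlying REST API response):

```{r}
# runtime versions available in the workspace (for `spark_version`)
runtimes <- db_cluster_runtime_versions()

# node types available in the workspace/region (for `node_type_id`)
node_types <- db_cluster_list_node_types()

# AWS only: availability zones clusters can occupy
zones <- db_cluster_list_zones()
```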

`db_cluster_get()` will provide details for the cluster we just created, including information such as the state.

This can be useful as you may wish to wait for the cluster to be `RUNNING`, which is exactly what `get_and_start_cluster()` does internally before returning.

```{r}
cluster_info <- db_cluster_get(cluster_id = new_cluster$cluster_id)
cluster_info$state
```
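If you prefer to manage the wait yourself, a simple polling loop is sufficient (a sketch of what `get_and_start_cluster()` does on your behalf):

```{r}
# poll until the cluster reaches the RUNNING state
while (db_cluster_get(cluster_id = new_cluster$cluster_id)$state != "RUNNING") {
  Sys.sleep(10)
}
```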

## Editing Clusters

You can edit Databricks clusters to change various parameters using `db_cluster_edit()`. For example, we may decide we want our cluster to autoscale between 2 and 8 workers and to add some tags.

```{r, results='hide'}
# when editing, all parameters must be supplied, not just those being changed
db_cluster_edit(
  cluster_id = new_cluster$cluster_id,
  name = "brickster-cluster",
  spark_version = "9.1.x-scala2.12",
  node_type_id = "m5a.xlarge",
  autoscale = cluster_autoscale(min_workers = 2, max_workers = 8),
  cloud_attrs = aws_attributes(
    ebs_volume_count = 3,
    ebs_volume_size = 100
  ),
  custom_tags = list(
    purpose = "brickster_cluster_demo"
  )
)
```

However, if the intention is only to change the size of a given cluster, `db_cluster_resize()` is a simpler alternative.

You can either adjust the number of workers or change the autoscale range. If the range is adjusted via `autoscale`, the number of workers active on the cluster will be increased or decreased to fall within the new bounds.

```{r, results='hide'}
# adjust the autoscale range to between 4 and 6 workers
db_cluster_resize(
  cluster_id = new_cluster$cluster_id,
  autoscale = cluster_autoscale(min_workers = 4, max_workers = 6)
)
```

It's important to note that specifying `num_workers` instead of `autoscale` on a cluster that has an existing autoscale range will convert it to a fixed number of workers from that point onward.
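For illustration (not run as part of this walkthrough), the following call would convert the cluster to a fixed size of 4 workers:

```{r, eval=FALSE}
# switch from an autoscale range to a fixed number of workers
db_cluster_resize(
  cluster_id = new_cluster$cluster_id,
  num_workers = 4
)
```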

Databricks clusters can be "pinned", which prevents them from being removed 30 days after termination. `db_cluster_pin()` and `db_cluster_unpin()` are used to change whether a cluster is "pinned".

```{r, results='hide'}
# pin the cluster
db_cluster_pin(cluster_id = new_cluster$cluster_id)

# unpin the cluster
# db_cluster_unpin(cluster_id = new_cluster$cluster_id)
```

## Cluster State

There are a few functions that can be used to manage the state of an existing cluster:

|                                         Function | Purpose                                                                                      |
|-----------------:|------------------------------------------------------|
|                             `db_cluster_start()` | Start a cluster that is inactive                                                             |
|                           `db_cluster_restart()` | Restart a cluster, cluster must already be running                                           |
| `db_cluster_delete()` / `db_cluster_terminate()` | Terminate an active cluster, does not remove the cluster configuration from Databricks       |
|                      `db_cluster_perm_delete()` | Stops (if active) and permanently deletes a cluster; it will no longer appear in Databricks |
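For example, terminating and later restarting the cluster from this walkthrough would look like the following sketch (the cluster configuration is retained between terminations):

```{r, eval=FALSE}
# terminate the cluster (configuration is retained)
db_cluster_terminate(cluster_id = new_cluster$cluster_id)

# start it again when needed
db_cluster_start(cluster_id = new_cluster$cluster_id)
```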

## Cluster Libraries

Databricks clusters can have libraries installed from a number of sources using `db_libs_install()` and the associated `lib_*()` functions:

|      Function | Library Source      |
|--------------:|---------------------|
|  `lib_cran()` | CRAN                |
|  `lib_pypi()` | PyPI                |
|   `lib_egg()` | Python egg (file)   |
|   `lib_whl()` | Python wheel (file) |
| `lib_maven()` | Maven               |
|   `lib_jar()` | JAR (file)          |

```{r, results='hide'}
# install packages from CRAN on the cluster
db_libs_install(
  cluster_id = new_cluster$cluster_id,
  libraries = libraries(
    lib_cran(package = "palmerpenguins"),
    lib_cran(package = "dplyr")
  )
)
```

For convenience, the `wait_for_lib_installs()` function will block until all libraries on the specified cluster have finished installing.

```{r, results='hide'}
wait_for_lib_installs(cluster_id = new_cluster$cluster_id)
```

Installation of libraries is asynchronous and completes in the background. `db_libs_cluster_status()` checks the installation status of libraries on a given cluster, while `db_libs_all_cluster_statuses()` returns the status of all libraries across all clusters in the workspace.

```{r}
db_libs_cluster_status(cluster_id = new_cluster$cluster_id)
```
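To review the status of libraries across all clusters in the workspace:

```{r}
db_libs_all_cluster_statuses()
```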

Libraries can be uninstalled using `db_libs_uninstall()`.

```{r, results='hide'}
db_libs_uninstall(
  cluster_id = new_cluster$cluster_id,
  libraries = libraries(
    lib_cran(package = "palmerpenguins")
  )
)
```

Using `db_libs_cluster_status()` again shows that the library is marked for removal and will be uninstalled when the cluster is restarted (e.g. via `db_cluster_restart()`).

```{r}
db_libs_cluster_status(cluster_id = new_cluster$cluster_id)
```

## Events

A list of events regarding the cluster's activity can be fetched via `db_cluster_events()`. There are many [event types](https://docs.databricks.com/api/workspace/clusters/events#events) that can occur, and by default the 50 most recent events are returned.

```{r}
events <- db_cluster_events(cluster_id = new_cluster$cluster_id)
head(events, 1)
```
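As a sketch, assuming `db_cluster_events()` returns the list of event records directly and each record carries a `type` field (as in the REST response), you could summarise which event types were returned:

```{r}
# tally the event types of the returned events
table(sapply(events, function(x) x$type))
```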

```{r, echo=FALSE, results='hide'}
db_cluster_unpin(cluster_id = new_cluster$cluster_id)
db_cluster_perm_delete(cluster_id = new_cluster$cluster_id)
```