--- title: "SCDB: Basic principles" output: rmarkdown::html_vignette bibliography: "references.bib" vignette: > %\VignetteIndexEntry{SCDB: Basic principles} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r library} library(SCDB) ``` ```{r setup, echo = FALSE, results = "hide", eval = rlang::is_installed("RSQLite")} # Setup conn to be used for examples conn <- get_connection() # Use a wrapper for update_snapshot which uses LoggerNull to suppress all logging if (!"update_snapshot" %in% ls(envir = globalenv())) { update_snapshot <- function(...) { return(SCDB::update_snapshot(logger = LoggerNull$new(), ...)) } } # Setup example_data table in conn example_data <- dplyr::copy_to(conn, dplyr::transmute(datasets::mtcars, car = rownames(mtcars), hp), name = "example_data", overwrite = TRUE) ``` The basic principle of the SCDB package is to enable the user to easily implement and maintain a database of time-versioned data. In practice, this is done by labeling each record in the data with three additional fields: * checksum: An md5 hash of information stored the record * from\_ts: the timestamp at which the record was introduced to the dataset * until\_ts: the timestamp at which the record was removed from the dataset This strategy of time versioning is often called "type 2" history [@Kimball2013]. The SCDB package provides the function `update_snapshot` to handle the insertion and deactivation of records using this strategy. It further includes several functions to improve the Quality of life for working with database data. A simple example of usage is shown below.
For this example, we use a temporary, on-disk SQLite database. Note that `get_connection()` tries to establish connection using `DBI::dbConnect()` with as few additional arguments as possible. Different drivers may require authentication which can be read from a configuration file.^[ In the context of the SCDB package, this is most notably `RPostgres::Postgres()`, which may read from a `.pgpass` file. See also the [PostgreSQL documentation](https://www.postgresql.org/docs/current/libpq-pgpass.html). ] Our example data is `datasets::mtcars` reduced to only two columns: row names converted to a column `car`, and `hp` ```{r example_data, eval = FALSE} conn <- get_connection() ``` ```{r example_data_hidden, eval = requireNamespace("RSQLite", quietly = TRUE)} example_data <- dplyr::tbl(conn, DBI::Id(table = "example_data")) example_data ``` Imagine on Day 1, in this case January 1st, 2020, our currently available data is the first three records of the `example_data`. We then store this data in a table `mtcars`: ```{r example_1, eval = requireNamespace("RSQLite", quietly = TRUE)} data <- head(example_data, 3) update_snapshot( .data = data, conn = conn, db_table = "mtcars", # the name of the DB table to store the data in timestamp = as.POSIXct("2020-01-01 11:00:00") ) ``` We can then access out data using the `get_table()` function, and include information on data validity period using `include_slice_info = TRUE`: ```{r example_1_results, eval = requireNamespace("RSQLite", quietly = TRUE)} get_table(conn, "mtcars") get_table(conn, "mtcars", include_slice_info = TRUE) ``` Note that where e.g. `dplyr::tbl()` requires a more exact specification of the table identity (`tbl(conn, DBI::Id(table = "mtcars"))`), `get_table()` will parse any character to a `DBI::Id()` object input using `id()`. The following day, the current data is now the first five rows of our example data. We then store this data in the database using `update_snapshot()`: ```{r example_2, eval = requireNamespace("RSQLite", quietly = TRUE)} # Let's say that the next day, our data set is now the first 5 of our example data data <- head(example_data, 5) update_snapshot( .data = data, conn = conn, db_table = "mtcars", # the name of the DB table to store the data in timestamp = as.POSIXct("2020-01-02 12:00:00") ) ``` We can again use the `get_table()` function to see the latest available data, including time-keeping with `include_slice_info = TRUE`: ```{r example_2_results_a, eval = requireNamespace("RSQLite", quietly = TRUE)} get_table(conn, "mtcars") get_table(conn, "mtcars", include_slice_info = TRUE) ``` Since our data is time-versioned, we can recover the data from the day before ```{r example_2_results_b, eval = requireNamespace("RSQLite", quietly = TRUE)} get_table(conn, "mtcars", slice_ts = "2020-01-01 11:00:00") ``` On day 3, we imagine that we have the same 5 records, but one of them is altered ```{r example_3, eval = requireNamespace("RSQLite", quietly = TRUE)} data <- head(example_data, 5) %>% dplyr::mutate(hp = ifelse(car == "Mazda RX4", hp / 2, hp)) update_snapshot( .data = data, conn = conn, db_table = "mtcars", # the name of the DB table to store the data in timestamp = as.POSIXct("2020-01-03 10:00:00") ) ``` We can again access our data using the `get_table()` function and see that the currently available data (with the changed hp value for Mazda RX4) ```{r example_3_results_a, eval = requireNamespace("RSQLite", quietly = TRUE)} get_table(conn, "mtcars") ``` Finally, using `slice_ts = NULL`, the full history (and time-keeping information) is returned: ```{r example_3_results_b, eval = requireNamespace("RSQLite", quietly = TRUE)} get_table(conn, "mtcars", slice_ts = NULL) ``` ### References