---
title: "How to prepare data for tna"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How to prepare data for tna}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  fig.width = 6,
  fig.height = 4,
  out.width = "100%",
  dpi = 30,
  dev = "svg",
  fig.ext="svg",
  message = FALSE,
  fig.show = TRUE,
  results = "hide",  
  comment = "#>"
)
suppressPackageStartupMessages({
  library("tna")
  library("tibble")
  library("dplyr")
})
```

The `tna` package accepts sequence data (a `stslist` object created by the `TraMineR` package), a wide `data.frame` where each row represents a sequence and each column represents a timepoint, or a transition matrix. When our data does not follow any of those formats, we can use the `prepare_data` function from `tna` to get it in the right shape. 
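
As a minimal sketch of the wide layout described above, the following chunk builds a small data frame by hand. The object name `wide_example`, the column names `T1` to `T3`, and the values are made up for illustration, and padding shorter sequences with `NA` is an assumption about how unequal lengths would be represented.

```{r}
# Hypothetical wide-format data: one row per sequence, one column per timepoint.
# The third sequence is shorter, so its last timepoint is padded with NA.
wide_example <- data.frame(
  T1 = c("Plan", "Goals", "Help"),
  T2 = c("Goals", "Task", "Plan"),
  T3 = c("Environment", "Environment", NA),
  stringsAsFactors = FALSE
)

dim(wide_example) # 3 sequences (rows) by 3 timepoints (columns)
```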

Let's start by creating an example data frame containing event logs where each row represents an action (`event`) performed by a `user` at a specific `timestamp`. The `achievement` column indicates whether the user was associated with high or low achievement, and the `order` column gives the position of each action within that user's sequence.


```{r}
df <- tribble(
  ~user, ~timestamp, ~event, ~achievement, ~order,
  2, "2025-02-27 18:01:32", "Plan",           "High", 1,
  2, "2025-02-27 18:03:32", "Goals",          "High", 2,
  1, "2025-02-27 18:08:32", "Goals",          "High", 1,
  3, "2025-02-27 18:15:32", "Plan",           "High", 1,
  5, "2025-02-27 18:16:32", "Help",           "Low",  1,
  3, "2025-02-27 18:19:32", "Goals",          "High", 2,
  5, "2025-02-27 18:19:32", "Plan",           "Low",  2,
  2, "2025-02-27 18:20:32", "Environment",    "High", 3,
  1, "2025-02-27 18:25:32", "Task",           "High", 2,
  4, "2025-02-27 18:25:32", "Help",           "Low",  1,
  5, "2025-02-27 18:26:32", "Task",           "Low",  3,
  3, "2025-02-27 18:32:32", "Metacognition",  "High", 3,
  4, "2025-02-27 18:33:32", "Goals",          "Low",  2,
  1, "2025-02-27 18:36:32", "Environment",    "High", 3,
  1, "2025-02-27 18:44:32", "Task",           "High", 4,
  1, "2025-02-27 18:45:32", "Task",           "High", 5,
  5, "2025-02-27 18:46:32", "Help",           "Low",  4,
  1, "2025-02-27 19:01:32", "Plan",           "High", 6,
  2, "2025-02-27 19:06:32", "Environment",    "High", 4,
  4, "2025-02-27 19:06:32", "Plan",           "Low",  3,
  1, "2025-02-27 19:13:32", "Metacognition",  "High", 7,
  3, "2025-02-27 19:13:32", "Metacognition",  "High", 4,
  4, "2025-02-27 19:15:32", "Goals",          "Low",  4,
  4, "2025-02-27 19:20:32", "Metacognition",  "Low",  5,
  4, "2025-02-27 19:20:32", "Environment",    "Low",  6,
  5, "2025-02-27 19:23:32", "Metacognition",  "Low",  5,
  3, "2025-02-27 19:25:32", "Help",           "High", 5,
  2, "2025-02-27 19:27:32", "Metacognition",  "High", 5,
  2, "2025-02-27 19:33:32", "Environment",    "High", 6,
  4, "2025-02-27 19:46:32", "Environment",    "Low",  7,
  5, "2025-02-27 19:49:32", "Plan",           "Low",  6,
  2, "2025-02-27 19:55:32", "Goals",          "High", 7
)

```

When we are interested in the whole dataset as a single sequence and the data is already ordered chronologically, we only need to specify the `action` argument, which in this case is the `event` column. We can pass the result of `prepare_data` directly to the `tna` function to create a `tna` model.

```{r}
by_classroom <- prepare_data(df, action = "event")
tna_by_classroom <- tna(by_classroom)
plot(tna_by_classroom)
```

When we want to create one sequence per user, we need to specify the `actor` argument in addition to `action`; in this case the actor column is `user`. If the data is already ordered, no further arguments are needed. If it is not, we can provide a column to order the data by, in this case `order`.

```{r}
by_user <- prepare_data(df, actor = "user", action = "event", order = "order")
tna_by_user <- tna(by_user)
plot(tna_by_user)
```
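
To make the reshaping idea concrete, here is a conceptual sketch in base R of turning a long log into one row per actor. It is not the implementation used by `prepare_data`. The objects `log_example` and `wide_sketch`, the column names `T1`, `T2`, ..., and the use of `NA` for padding are all illustrative assumptions.

```{r}
# Illustrative long-format log with the same kind of columns as `df`.
log_example <- data.frame(
  user  = c(2, 2, 1, 2, 1),
  event = c("Plan", "Goals", "Goals", "Environment", "Task"),
  order = c(1, 2, 1, 3, 2)
)

# Sort by actor, then by the ordering column.
log_sorted <- log_example[order(log_example$user, log_example$order), ]

# One character vector of events per user.
per_user <- split(log_sorted$event, log_sorted$user)

# Pad every sequence with NA up to the longest length and stack them as rows.
max_len <- max(lengths(per_user))
wide_sketch <- t(sapply(per_user, function(x) c(x, rep(NA, max_len - length(x)))))
colnames(wide_sketch) <- paste0("T", seq_len(max_len))

wide_sketch # one row per user, one column per position in the sequence
```

Sorting before splitting is what makes the `order` column matter: without it, each user's events would stay in the order in which they appear in the log.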

If we only have timestamps rather than an explicit order, we can provide them through the `time` argument. By default, events that happen less than 15 minutes apart are grouped into the same sequence, while an event that follows a longer gap marks the start of a new sequence (i.e., a session). If both `time` and `order` are provided, the data is ordered first by `time` and, in case of a tie, by `order`.

```{r}
by_session <- prepare_data(df, actor = "user", time = "timestamp", action = "event")
tna_by_session <- tna(by_session)
plot(tna_by_session)
```
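
The session rule can be illustrated with a small base R sketch using the timestamps of user 1 from `df`. This is a conceptual illustration rather than the package's own code. It assumes that a gap strictly longer than the threshold opens a new session and that the timestamps are in UTC; the variable names are made up.

```{r}
# Timestamps of user 1 in `df`, parsed as date-times (UTC is an assumption).
times <- as.POSIXct(
  c("2025-02-27 18:08:32", "2025-02-27 18:25:32", "2025-02-27 18:36:32",
    "2025-02-27 18:44:32", "2025-02-27 18:45:32", "2025-02-27 19:01:32",
    "2025-02-27 19:13:32"),
  tz = "UTC"
)

# Gap to the previous event in seconds; the first event has no predecessor.
gaps <- c(0, diff(as.numeric(times)))

# Assumption: a gap strictly longer than 15 minutes (900 s) opens a new session.
session <- cumsum(gaps > 15 * 60) + 1

session # 1 2 2 2 2 3 3
```

The 17-minute gap after the first event and the 16-minute gap before the sixth event each start a new session, so this user contributes three sequences instead of one.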

If we want to change the time gap that marks the start of a new sequence, we can set the `time_threshold` argument (in seconds).

```{r}
by_session_custom <- prepare_data(df, actor = "user", time = "timestamp", action = "event",
                                  time_threshold = 10 * 60) # 600 seconds = 10 minutes
tna_by_session_custom <- tna(by_session_custom)
plot(tna_by_session_custom)
```
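
As a rough sketch of how the threshold changes the result, the chunk below counts the sessions implied by two thresholds for the gaps between consecutive events of user 1 in `df`. It assumes the threshold is expressed in seconds, as the `10*60` in the call above suggests, and that a gap strictly above the threshold starts a new session. `count_sessions` is a hypothetical helper, not a `tna` function.

```{r}
# Gaps in seconds between consecutive events of user 1 in `df`.
gaps_sec <- c(1020, 660, 480, 60, 960, 720)

# Hypothetical helper: every gap above the threshold adds one session
# to the initial one.
count_sessions <- function(gaps, threshold) {
  sum(gaps > threshold) + 1
}

count_sessions(gaps_sec, 15 * 60) # 3 sessions with a 15-minute threshold
count_sessions(gaps_sec, 10 * 60) # 5 sessions with a 10-minute threshold
```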

Another advantage of using `prepare_data` before constructing the `tna` model is that the other variables in the data are kept and can be used in the analysis. For instance, we can use `group_tna` to create a `tna` model for each achievement group by passing the result of `prepare_data` as the first argument and naming the column to group by.

```{r,fig.width=8,fig.height=3,fig.dpi=900,echo=2:3}
layout(t(1:2))
gtna <- group_tna(by_user, group = "achievement")
plot(gtna)
```