---
title: "Data Requirements"
output: 
  rmarkdown::html_vignette:
    toc: true 
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Data Requirements}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
library(httptest2)
.mockPaths("../tests/mocks")
start_vignette(dir = "../tests/mocks")

original_options <- options("NIXTLA_API_KEY"="dummy_api_key", digits=7)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", 
  fig.width = 7, 
  fig.height = 4
)
```

```{r}
library(nixtlar)
```

This vignette explains the data requirements for using any of the core functions of `nixtlar`:  

```{r, eval=FALSE}
# Core functions of `nixtlar` 
- nixtlar::nixtla_client_forecast()
- nixtlar::nixtla_client_historic()
- nixtlar::nixtla_client_detect_anomalies()
- nixtlar::nixtla_client_cross_validation()
- nixtlar::nixtla_client_plot()
```

## 1. Input Requirements

`nixtlar` now supports the following data structures: data frames, tibbles, and tsibbles. The output format will always be a data frame. 

Regardless of your data structure, the following two columns must always be included when using any core functions of `nixtlar`:

- **Date Column**: This column must contain timestamps formatted as `YYYY-MM-DD` or `YYYY-MM-DD hh:mm:ss`, either as characters or date-time objects. For date-time objects, we recommend using the `as.POSIX*` functions from base R, although `as.Date` is also supported. The default name for this column is `ds`. If your dataset uses a different name, please specify it by setting the parameter `time_col="your_time_column_name"`.

- **Target Column**: This column should contain the numeric target variable for forecasting. The default name for this column is `y`. If your dataset uses a different name, specify it by setting the parameter `target_col="your_target_column_name"`.

## 2. Multiple Series

If you are working with multiple series, you must include a column with a unique identifier for each series. This column can contain characters or integers, and its default name is `unique_id`. If your dataset uses a different name for the identifier column, please specify it by setting the parameter `id_col="your_id_column_name"`. If your dataset contains only one series and does not need an identifier, set `id_col` to `NULL`.

Please be aware that in earlier versions of `nixtlar`, the default name for `id_col` was `NULL`, but it is now `unique_id`. 

```{r}
# sample valid input 
df <- nixtlar::electricity
head(df)
str(df)
```

## 3. Exogenous Variables

When using exogenous variables, `nixtlar` distinguishes between historical and future exogenous variables:

- **Historical Exogenous Variables**: These should be included in the input data immediately following the `id_col`, `ds`, and `y` columns. If your dataset contains additional columns that are not exogenous variables, you must remove them before using any core functions of `nixtlar`.

- **Future Exogenous Variables**: These correspond to the `X_df` parameter and should cover the entire forecast horizon. This dataset must include columns with the appropriate timestamps and, if applicable, unique identifiers, formatted as described in the previous sections.

```{r}
# sample valid input with exogenous variables 
df <- nixtlar::electricity_exo_vars
head(df)

future_exo_vars <- nixtlar::electricity_future_exo_vars
head(future_exo_vars)
```

To learn more about how to use exogenous variables, please refer to the [Exogenous variables vignette](https://nixtla.github.io/nixtlar/articles/exogenous-variables.html). 

## 4. Missing values 

When using `TimeGPT` via `nixtlar`, ensure the following:

1. **No Missing Values in the Target Column**: The target column must not contain any missing values (`NA`).

2. **Continuous Date Sequence**: The dates must be continuous, without any gaps, from the start date to the end date, matching the frequency of the data.

Currently, **nixtlar** does not provide any functionality to fill missing values or dates. To learn more about this, please refer to the vignette on [Special Topics](https://nixtla.github.io/nixtlar/articles/special-topics.html). 

## 5. Minimum data requirements 

The minimum size **per series** to obtain results from `nixtlar::nixtla_client_forecast` is one, regardless of the frequency of the data. Keep in mind, however, that this will produce results with limited accuracy. 

For certain scenarios, more than one observation may be necessary:

- When using the parameters `level`, `quantiles`, or `finetune_steps`. 
- When incorporating exogenous variables. 
- When including historical forecasts by setting `add_history=TRUE`. 

The minimum data requirement varies with the frequency of the data, detailed in the official [TimeGPT documentation](https://docs.nixtla.io/docs/getting-started-data_requirements). 

When using `nixtlar::nixtla_client_cross_validation`, you also need to consider the forecast horizon (`h`), the number of windows (`n_windows`) and the step size (`step_size`). The formula for the minimum data points required per series is:

\begin{equation}
\text{Min per series} = \text{Min per frequency}+h+\text{step_size}*(\text{n_windows}-1)
\end{equation}

Here, $\text{Min per frequency}$ refers to the values specified in the table from the official documentation.

```{r, include=FALSE}
options(original_options)
end_vignette()
```