---
title: "Under the hood - tokenlist"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Under the hood - tokenlist}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(textrecipes)
```

**textrecipes** uses lists of character vectors to carry tokens around. A simple S3 vector class, implemented with the **vctrs** package, handles these lists of tokens; it is referred to from here on as a `tokenlist`.

If you are only using this package for preprocessing, you most likely won't even notice that this change has happened. However, if you are thinking of contributing to **textrecipes**, knowing about `tokenlist`s will be essential.

A `tokenlist` is built around a simple list of character vectors and has three attributes: `lemma`, `pos`, and `tokens`.
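
The following is a minimal sketch of creating a `tokenlist` by hand; it assumes that `tokenlist()` accepts a plain list of character vectors, and it uses base R's `attributes()` to peek at the metadata attached to the object.

```{r}
# A small list of character vectors, as a tokenizer would produce
tokens <- list(
  c("not", "eat", "them", "here", "or", "there"),
  c("not", "eat", "them", "anywhere")
)

# Wrap the list in a tokenlist
tkn <- tokenlist(tokens)
tkn

# Look at the attributes attached to the object
attributes(tkn)
```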

# `tokens` attribute

The `tokens` attribute is a character vector of the unique tokens contained in the underlying list.
This attribute is calculated automatically when `tokenlist()` is used. If a function is applied to a tokenlist and the unique tokens of the result can be derived from the existing attribute, `new_tokenlist()` can be used to create the result with a known `tokens` attribute instead of recomputing it from the full data.
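
As an illustration of why this matters, here is a sketch in plain base R (not the package's internal code): when every token is lowercased, the unique tokens of the result follow directly from the old unique tokens, which is exactly the situation where passing a precomputed `tokens` attribute to `new_tokenlist()` saves work.

```{r}
tokens <- list(
  c("Not", "eat", "them", "HERE", "or", "there"),
  c("Not", "eat", "them", "anywhere")
)

# Unique tokens of the original data
old_unique <- unique(unlist(tokens))

# Apply a transformation, such as lowercasing, to every token
lowercased <- lapply(tokens, tolower)

# The unique tokens of the result can be derived from the old unique
# tokens alone, without scanning all of the data again
derived_unique <- unique(tolower(old_unique))

# It matches what a full recomputation would find
setequal(derived_unique, unique(unlist(lowercased)))
```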

# `lemma` and `pos` attributes

Both the `lemma` and `pos` attributes work the same way. They default to `NULL` but can be filled depending on which engine is used in `step_tokenize()`. When present, the attribute is a list of character vectors with exactly the same shape and size as the tokenlist, giving a one-to-one relationship between each token and its lemma or part-of-speech tag.
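
As a hedged sketch (assuming `tokenlist()` exposes optional `lemma` and `pos` arguments matching the attribute names), a tokenlist carrying lemmas could be built by hand like this; the shape check makes the one-to-one requirement explicit.

```{r}
# Tokens and their lemmas, kept in lists of identical shape
tokens <- list(
  c("dogs", "are", "running"),
  c("cats", "ran")
)
lemmas <- list(
  c("dog", "be", "run"),
  c("cat", "run")
)

# The one-to-one relationship means the shapes must match exactly
identical(lengths(tokens), lengths(lemmas))

# Assumed interface: supplying the lemma attribute at construction time
tkn <- tokenlist(tokens, lemma = lemmas)
attributes(tkn)$lemma
```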

If an element is removed from the tokenlist, the corresponding elements in `lemma` and `pos` must be removed as well so that the one-to-one relationship is preserved.
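
The sketch below shows this invariant in plain base R rather than the package's internal code: a keep-mask is computed from the tokens and the same mask is applied to the lemmas, so both stay aligned.

```{r}
tokens <- list(
  c("the", "dogs", "are", "running"),
  c("a", "cat", "ran")
)
lemmas <- list(
  c("the", "dog", "be", "run"),
  c("a", "cat", "run")
)

# Tokens to drop, e.g. a small stop word list
stop_words <- c("the", "a", "are")

# Compute a keep-mask from the tokens ...
keep <- lapply(tokens, function(x) !x %in% stop_words)

# ... and apply the same mask to both tokens and lemmas
tokens <- Map(`[`, tokens, keep)
lemmas <- Map(`[`, lemmas, keep)

tokens
lemmas
```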