---
title: "Documenting Datasets"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Documenting Datasets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, message=FALSE, warning=FALSE}
library(qtkit)
library(fs)
library(tibble)
library(dplyr)
library(glue)
library(readr)
```

# Introduction

Proper dataset documentation is crucial for reproducible research and effective data sharing. The {qtkit} package provides two main functions to help standardize and automate the documentation process:

- `create_data_origin()`: Creates standardized metadata about data(set) sources
- `create_data_dictionary()`: Generates the scaffolding for a detailed variable-level documentation or can use AI to generate descriptions to be reviewed and updated as necessary

# Creating Dataset Origin Documentation

## Basic Usage

Let's start with documenting the built-in `mtcars` dataset:

```{r}
# Create a temporary file for our documentation
origin_file <- file_temp(ext = "csv")

# Create the origin documentation template
origin_doc <- create_data_origin(
  file_path = origin_file,
  return = TRUE
)

# View the template
origin_doc |>
  glimpse()
```

The template provides fields for essential metadata. You can either open the CSV file in a spreadsheet editor or fill it out programmatically, as shown below.

Here's how you might fill it out for `mtcars`:

```{r}
origin_doc |>
  mutate(description = c(
    "Motor Trend Car Road Tests",
    "Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391â€“411.",
    "US automobile market, passenger vehicles",
    "1973-74",
    "Built-in R dataset (.rda)",
    "Single data frame with 32 observations of 11 variables",
    "Public Domain",
    "Citation: Henderson and Velleman (1981)"
  )) |>
  write_csv(origin_file)
```

## Customizing Origin Documentation

You can force overwrite existing documentation:

```{r}
create_data_origin(
  file_path = origin_file,
  force = TRUE
)
```

# Creating Data Dictionaries

## Basic Dictionary Creation

Create a basic data dictionary without AI assistance:

```{r}
# Create a temporary file for our dictionary
dict_file <- file_temp(ext = "csv")

# Generate dictionary for iris dataset
iris_dict <- create_data_dictionary(
  data = iris,
  file_path = dict_file
)

# View the results
iris_dict |>
  glimpse()
```

## AI-Enhanced Data Dictionaries

If you have an OpenAI API key, you can generate more detailed descriptions:

```{r, eval=FALSE}
# Not run - requires API key
Sys.setenv(OPENAI_API_KEY = "your-api-key")

iris_dict_ai <- create_data_dictionary(
  data = iris,
  file_path = dict_file,
  model = "gpt-4",
  sample_n = 5
)
```

Example output might look like:

```{r echo=FALSE}
# Simulated AI output
tibble(
  variable = c("Sepal.Length", "Sepal.Width"),
  name = c("Sepal Length", "Sepal Width"),
  type = c("numeric", "numeric"),
  description = c(
    "Length of the sepal in centimeters",
    "Width of the sepal in centimeters"
  )
)
```

## Working with Larger Datasets

For larger datasets, you can use sampling and grouping:

```{r, eval=FALSE}
diamonds_dict <- diamonds |>
  create_data_dictionary(
    file_path = "diamonds_dict.csv",
    model = "gpt-4",
    sample_n = 3,
    grouping = "cut" # Sample across different cut categories
  )
```

## Best Practices

1. Create documentation when first obtaining/creating a dataset
2. Update documentation when:
   - Adding new variables
   - Modifying data structure
   - Changing data sources
3. Store documentation alongside data in version control
4. Include documentation paths in your project README

# Conclusion

The {qtkit} package provides flexible tools for standardizing dataset documentation. By combining `create_data_origin()` and `create_data_dictionary()`, you can create comprehensive documentation that enhances reproducibility and data sharing.

## Additional Resources

- Package documentation: `help(package = "qtkit")`
- Related packages: {dataMaid}, {codebook}
- [Project homepage](https://github.com/qtalr/qtkit)