---
title: "Example 1: Basic usage"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Example 1: Basic usage}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

# Use tidyfst just like dplyr
This part of vignette has referred to `dplyr`'s vignette in <https://dplyr.tidyverse.org/articles/dplyr.html>. We'll try to reproduce all the results. First load the needed packages.

```{r}
library(tidyfst)
library(nycflights13)
library(data.table)

data.table(flights)
```

## Filter rows with `filter_dt()`
```{r}
filter_dt(flights, month == 1 & day == 1)
```
Note that comma could not be used in the expressions. Which means `filter_dt(flights, month == 1,day == 1)` would return error.
## Arrange rows with `arrange_dt()`
```{r}
arrange_dt(flights, year, month, day)
```

  Use `-` (minus symbol) to order a column in descending order:
```{r}
arrange_dt(flights, -arr_delay)
```

## Select columns with `select_dt()`

```{r}
select_dt(flights, year, month, day)
```

  `select_dt(flights, year:day)` and `select_dt(flights, -(year:day))` are not supported. But I have added a feature to help select with regular expression, which means you can:
```{r}
select_dt(flights, "^dep")
```
  The rename process is almost the same as that in `dplyr`:
```{r}
select_dt(flights, tail_num = tailnum)
rename_dt(flights, tail_num = tailnum)
```
  
## Add new columns with `mutate_dt()`
```{r}
mutate_dt(flights,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)
```
  
  However, if you just create the column, please split them. The following codes would not work:
```{r,eval=FALSE}
mutate_dt(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
```
  Instead, use:
```{r}
mutate_dt(flights,gain = arr_delay - dep_delay) %>%
  mutate_dt(gain_per_hour = gain / (air_time / 60))
```
  
  If you only want to keep the new variables, use `transmute_dt()`:
```{r}
transmute_dt(flights,
  gain = arr_delay - dep_delay
)
```
  
## Summarise values with `summarise_dt()`
```{r}
summarise_dt(flights,
  delay = mean(dep_delay, na.rm = TRUE)
)
```

## Randomly sample rows with `sample_n_dt()` and `sample_frac_dt()`
```{r}
sample_n_dt(flights, 10)
sample_frac_dt(flights, 0.01)
```

## Grouped operations
  For the below `dplyr` codes:
```{r,eval=FALSE}
by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)
```
  We could get it via:
```{r}
flights %>% 
  summarise_dt( count = .N,
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE),by = tailnum)
```
  `summarise_dt` (or `summarize_dt`) has a parameter "by", you can specify the group.
  We could find the number of planes and the number of flights that go to each possible destination:
```{r}
# the dplyr syntax:
# destinations <- group_by(flights, dest)
# summarise(destinations,
#   planes = n_distinct(tailnum),
#   flights = n()
# )

summarise_dt(flights,planes = uniqueN(tailnum),flights = .N,by = dest) %>% 
  arrange_dt(dest)

```
  If you need to group by many variables, use:
```{r}
# the dplyr syntax:
# daily <- group_by(flights, year, month, day)
# (per_day   <- summarise(daily, flights = n()))

flights %>% 
  summarise_dt(by = .(year,month,day),flights = .N)

# (per_month <- summarise(per_day, flights = sum(flights)))
flights %>% 
  summarise_dt(by = .(year,month,day),flights = .N) %>% 
  summarise_dt(by = .(year,month),flights = sum(flights))

# (per_year  <- summarise(per_month, flights = sum(flights)))
flights %>% 
  summarise_dt(by = .(year,month,day),flights = .N) %>% 
  summarise_dt(by = .(year,month),flights = sum(flights)) %>% 
  summarise_dt(by = .(year),flights = sum(flights))
```
  
# Comparison with data.table syntax
  *tidyfst* provides a tidy syntax for *data.table*. For such design, *tidyfst* never runs faster than the analogous *data.table* codes. Nevertheless, it facilitate the dplyr-users to gain the computation performance in no time and guide them to learn more about data.table for speed.
  Below, we'll compare the syntax of `tidyfst` and `data.table` (referring to  [Introduction to data.table](https://rdatatable.gitlab.io/data.table/articles/datatable-intro.html)). This could let you know how they are different, and let users to choose their preference. Ideally, *tidyfst* will lead even more users to learn more about *data.table* and its wonderful features, so as to design more extentions for *tidyfst* in the future.
  
## Data
Because we want a more stable data source, here we'll use the flight data from the above `nycflights13` package.
```{r}
library(tidyfst)
library(data.table)
library(nycflights13)

flights = data.table(flights) %>% na.omit()
```

## Subset rows
```{r}
# data.table
head(flights[origin == "JFK" & month == 6L])
flights[1:2]
flights[order(origin, -dest)] 

# tidyfst
flights %>% 
  filter_dt(origin == "JFK" & month == 6L) %>% 
  head()
flights %>% slice_dt(1:2)
flights %>% arrange_dt(origin,-dest)
```

## Select column(s)
```{r}
# data.table
flights[, list(arr_delay)]
flights[, .(arr_delay, dep_delay)]
flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]

# tidyfst
flights %>% select_dt(arr_delay)
flights %>% select_dt(arr_delay, dep_delay)
flights %>% transmute_dt(delay_arr = arr_delay, delay_dep = dep_delay)
```

## Mixed computation
```{r}
# data.table
flights[, sum( (arr_delay + dep_delay) < 0)]
flights[origin == "JFK" & month == 6L,
               .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
flights[origin == "JFK" & month == 6L, length(dest)]
flights[origin == "JFK" & month == 6L, .N]

# tidyfst
flights %>% summarise_dt(sum( (arr_delay + dep_delay) < 0))
flights %>% 
  filter_dt(origin == "JFK" & month == 6L) %>% 
  summarise_dt(m_arr = mean(arr_delay), m_dep = mean(dep_delay))
flights %>% 
  filter_dt(origin == "JFK" & month == 6L) %>% 
  nrow()
flights %>% 
  filter_dt(origin == "JFK" & month == 6L) %>% 
  count_dt()
flights %>% 
  filter_dt(origin == "JFK" & month == 6L) %>% 
  summarise_dt(.N)
```
  In the above examples, we could learn that in *tidyfst*, you could still use the methods in data.table, such as `.N`.
  
## Refer to columns by names
```{r}
# data.table
flights[, c("arr_delay", "dep_delay")]

select_cols = c("arr_delay", "dep_delay")
flights[ , ..select_cols]
flights[ , select_cols, with = FALSE]

flights[, !c("arr_delay", "dep_delay")]
flights[, -c("arr_delay", "dep_delay")]

# returns year,month and day
flights[, year:day]
# returns day, month and year
flights[, day:year]
# returns all columns except year, month and day
flights[, -(year:day)]
flights[, !(year:day)]

# tidyfst
flights %>% select_dt(c("arr_delay", "dep_delay"))

select_cols = c("arr_delay", "dep_delay")
flights %>% select_dt(cols = select_cols)

flights %>% select_dt(-arr_delay,-dep_delay)

flights %>% select_dt(year:day)
flights %>% select_dt(day:year)
flights %>% select_dt(-(year:day))
flights %>% select_dt(!(year:day))
```

## Aggregations
```{r}
# data.table
flights[, .N, by = .(origin)]
flights[carrier == "AA", .N, by = origin]
flights[carrier == "AA", .N, by = .(origin, dest)]
flights[carrier == "AA",
        .(mean(arr_delay), mean(dep_delay)),
        by = .(origin, dest, month)]

# tidyfst
flights %>% count_dt(origin) # sort by default
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin)
flights %>% filter_dt(carrier == "AA") %>% count_dt(origin,dest)
flights %>% filter_dt(carrier == "AA") %>% 
  summarise_dt(mean(arr_delay), mean(dep_delay),
               by = .(origin, dest, month))
```
  Note that currently `keyby` is not used in *tidyfst*. This featuer might be included in the future for better performance in order-independent tasks. Moreover, `count_dt` is sorted automatically by the counted number, this could be controlled by the parameter "sort".
```{r}
# data.table
flights[carrier == "AA", .N, by = .(origin, dest)][order(origin, -dest)]
flights[, .N, .(dep_delay>0, arr_delay>0)]

# tidyfst
flights %>% 
  filter_dt(carrier == "AA") %>% 
  count_dt(origin,dest,sort = FALSE) %>% 
  arrange_dt(origin,-dest)
flights %>% 
  summarise_dt(.N,by = .(dep_delay>0, arr_delay>0))
```
  Now let's try a more complex example:
```{r}
# data.table
flights[carrier == "AA", 
        lapply(.SD, mean), 
        by = .(origin, dest, month), 
        .SDcols = c("arr_delay", "dep_delay")] 

# tidyfst
flights %>% 
  filter_dt(carrier == "AA") %>% 
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay",summarise_dt,mean)
           )
```
  Let me explain what happens here, especially in `group_dt`. First filter by condition `carrier == "AA"`, then group by three variables, which are `origin, dest, month`. Last, summarise by columns with "_delay" in the column names and get the mean value of all such variables(with "_delay" in their column names). This is a very creative design, utilizing `.SD` in *data.table* and upgrade the `group_by` function in *dplyr* (because you never need to `ungroup` now, just put the group operations in the `group_dt`). And **you can pipe in the group_dt function**. Let's play with it a little bit further:
```{r}
flights %>% 
  filter_dt(carrier == "AA") %>% 
  group_dt(
    by = .(origin, dest, month),
    at_dt("_delay",summarise_dt,mean) %>% 
      mutate_dt(sum = dep_delay + arr_delay)
           )
```
  However, I don't recommend using it if you don't acutually need it for group computation (just start another pipe follows `group_dt`).
  Now let's end with some easy examples:
```{r}
# data.table
flights[, head(.SD, 2), by = month]

# tidyfst
flights %>% 
  group_dt(by = month,head(2))
```
  Deep inside, *tidyfst* is born from *dplyr* and *data.table*, and use *stringr* to make flexible APIs, so as to bring their superiority into full play.