---
title: "Filtering trajectories"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Filtering trajectories}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(sf_max_print = 5L)
if (Sys.info()["user"] != "bart") {
  if (Sys.getenv("MBPWD") != "") {
    options(keyring_backend = "env")
    move2::movebank_store_credentials("move2_user", Sys.getenv("MBPWD"))
  } else {
    knitr::opts_chunk$set(eval = FALSE)
  }
}
```

```{r setup, message=FALSE}
library(move2)
library(dplyr)
library(units)
library(sf)
```
Download the example data, restricting the event and track attributes to keep the printed output short.
```{r}
galapagos_albatrosses <- movebank_download_study(2911040,
  attributes = c(
    "ground_speed",
    "heading",
    "height_above_ellipsoid",
    "eobs_temperature",
    "individual_local_identifier"
  )
) %>%
  select_track_data(study_site, weight, animal_life_stage)
```

# Filtering locations

## Omit empty locations

```{r}
galapagos_albatrosses %>%
  filter(!st_is_empty(.))
```

## Temporal filtering

First location in each 6-hour window

```{r}
galapagos_albatrosses %>%
  filter(!st_is_empty(.)) %>%
  mt_filter_per_interval(unit = "6 hours")
```

A random location each day

```{r}
galapagos_albatrosses %>%
  filter(!st_is_empty(.)) %>%
  mt_filter_per_interval(criterion = "random", unit = "days")
```

# Finding and filtering duplicated records

When dealing with trajectories, duplicated records frequently occur. There are
many reasons these can appear, ranging from the way in which data is recorded to
duplicated data transmissions and uploads. These data are often stored, but for
analysis the duplicates need to be removed. A simple definition of a duplicated
record would be an observation of the same individual at exactly the same time.
However, many tracking devices record additional information, such as
acceleration, and these records frequently share their timestamp with location
records. This means that not all records with duplicated timestamps can simply
be deleted.

Duplicated records can be found in the following way:

```{r}
galapagos_albatrosses %>%
  group_by(mt_time(), mt_track_id()) %>%
  filter(n() != 1) %>%
  arrange(mt_time())
```

If you are only interested in finding duplicated records where there is a
location, this can be done as follows (in this case there are none):

```{r}
galapagos_albatrosses %>%
  filter(!st_is_empty(.)) %>%
  group_by(mt_time(), mt_track_id()) %>%
  filter(n() != 1) %>%
  arrange(mt_time())
```

The package also has built-in functions for filtering unique records, offering several strategies for omitting duplicated records.

First, it is possible to omit all records that are a subset of other records, i.e. records that were added later with more information are retained. This happens with some tracking devices when data is downloaded directly from the tag. As no information is lost, this is the default strategy.

```{r}
simulated_data <- mt_sim_brownian_motion(1:2)[rep(1:4, 2), ]
simulated_data$temperature <- c(1:3, NA, 1:2, 7:8)
simulated_data
simulated_data %>% mt_filter_unique()
```

This strategy, however, does not guarantee that no duplicates are left, as two records might not be subsets of each other.
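
In the simulated data above, for example, the pair of records with temperatures 3 and 7 contain conflicting values, so neither is a subset of the other and both are retained. One way to verify whether any duplicates remain is the same grouping approach used earlier; a minimal check:

```{r}
simulated_data %>%
  mt_filter_unique() %>%
  group_by(mt_time(), mt_track_id()) %>%
  filter(n() != 1)
```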

An alternative is to take a random record from each set of duplicates. This is not advised for formal analysis but might help for a quick inspection of the data; it is also a lot quicker than checking for subsets. However, care needs to be taken: the example below, for instance, results in empty points being retained at the cost of informative locations.

```{r}
galapagos_albatrosses %>% mt_filter_unique("sample")
```
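
As shown earlier, this dataset contains no duplicated records with a location, so dropping the empty points before taking unique records avoids losing informative locations; a sketch of this combination:

```{r}
galapagos_albatrosses %>%
  filter(!st_is_empty(.)) %>%
  mt_filter_unique("sample")
```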

# Filtering tracks

## Tracks with at least `n` locations

```{r}
galapagos_albatrosses %>%
  group_by(mt_track_id()) %>%
  filter(n() > 500)
```

## Tracks having a minimal duration

```{r}
galapagos_albatrosses %>%
  group_by(mt_track_id()) %>%
  # track duration: the time difference between the first and last record
  filter(as_units(diff(range(mt_time()))) > set_units(1, "week"))
```

## Tracks that visit foraging area at least once

```{r, fig.width=7, fig.height=4.2, fig.alt="Plot of the tracking data from albatrosses including coastlines and the foraging area"}
foraging_area <- st_as_sfc(st_bbox(c(
  xmin = -82, xmax = -77,
  ymax = -0.5, ymin = -13
), crs = 4326))
library(ggplot2, quietly = TRUE)
ggplot() +
  geom_sf(data = rnaturalearth::ne_coastline(returnclass = "sf", 50)) +
  theme_linedraw() +
  geom_sf(data = foraging_area, fill = "red", alpha = 0.3, color = "red") +
  geom_sf(
    data = galapagos_albatrosses %>% filter(!st_is_empty(.)),
    aes(color = `individual_local_identifier`)
  ) +
  coord_sf(
    crs = sf::st_crs("+proj=aeqd +lon_0=-83 +lat_0=-6 +units=km"),
    xlim = c(-1000, 600), ylim = c(-800, 700)
  )
# Filter to tracks making it at least once to the foraging area
galapagos_albatrosses %>%
  group_by(mt_track_id()) %>%
  filter(any(st_intersects(geometry, foraging_area, sparse = FALSE)))
```

## Filter by track attribute

To use track attributes for filtering there is the `filter_track_data` function. This function works in the same way as `filter` from `dplyr`, except that it operates on the track data. As soon as individuals are omitted from the track data, the associated event data is also omitted.

```{r}
galapagos_albatrosses %>%
  filter_track_data(study_site == "Punta Suarez")
```

# Reorganizing trajectories

## Split on time gaps

```{r}
galapagos_albatrosses %>%
  filter(!st_is_empty(.)) %>%
  mutate(
    # flag records after which a gap of more than 4 hours follows
    # (mt_time_lags() returns NA for the last record of a track)
    next_new_track = mt_time_lags(.) > set_units(4, "h") |
      is.na(mt_time_lags(.)),
    # each flagged record increments the index, starting a new track
    track_index = cumsum(lag(next_new_track, default = FALSE))
  ) %>%
  mt_set_track_id("track_index")
```

## Monthly tracks

```{r}
library(lubridate, quietly = TRUE)
galapagos_albatrosses %>%
  mt_set_track_id(paste(mt_track_id(.),
    sep = "_", month.name[month(mt_time(.))]
  ))
```
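
Note that this pools the same calendar month across years. For trajectories spanning multiple years, the year can be included in the new track identifier as well; one possible variant:

```{r}
galapagos_albatrosses %>%
  mt_set_track_id(paste(mt_track_id(.),
    format(mt_time(.), "%Y"),
    month.name[month(mt_time(.))],
    sep = "_"
  ))
```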