---
title: "Object-Oriented Programming"
author: "Martin Westgate & Dax Kellie"
date: '2024-11-19'
output:
  rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Object-Oriented Programming}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


The default method for building queries in `galah` is to first use `galah_call()`
to create a query object called a "`data_request`". This object class is specific 
to `galah`.


``` r
galah_call() |>
  filter(genus == "Crinia") |>
  class()
```

```
## [1] "data_request"
```

When a piped object is of class `data_request`, galah can trigger functions to 
use specific methods for this object class, even if a function name is used by 
another package. For example, users can use `filter()` and `group_by()` functions 
from [dplyr](https://dplyr.tidyverse.org/index.html) instead 
of `galah_filter()` and `galah_group_by()` to construct a query. Consequently,
the following queries are synonymous:


``` r
galah_call() |>
  galah_filter(genus == "Crinia", year == 2020) |>
  galah_group_by(species) |>
  atlas_counts()
```

``` r
galah_call() |> 
  filter(genus == "Crinia", year == 2020) |>
  group_by(species) |>
  atlas_counts()
```

```
## # A tibble: 16 Ã— 2
##    species                 count
##    <chr>                   <int>
##  1 Crinia signifera        42621
##  2 Crinia parinsignifera    8664
##  3 Crinia glauerti          3111
##  4 Crinia georgiana         1509
##  5 Crinia remota             718
##  6 Crinia sloanei            682
##  7 Crinia insignifera        530
##  8 Crinia tinnula            291
##  9 Crinia deserticola        253
## 10 Crinia pseudinsignifera   223
## 11 Crinia tasmaniensis       181
## 12 Crinia bilingua            74
## 13 Crinia subinsignifera      46
## 14 Crinia riparia             10
## 15 Crinia flindersensis        3
## 16 Crinia nimba                1
```

Thanks to object-oriented programming, galah "masks" `filter()` and `group_by()` 
functions to use methods defined for `data_request` objects instead. The full 
list of masked functions is: 

- `arrange()` (`{dplyr}`)
- `count()` (`{dplyr}`)
- `identify()` (`{graphics}`) as a synonym for `galah_identify()`
- `select()` (`{dplyr}`) as a synonym for `galah_select()`
- `group_by()` (`{dplyr}`) as a synonym for `galah_group_by()`
- `slice_head()` (`{dplyr}`) as a synonym for the `limit` argument in `atlas_counts()`
- `st_crop()` (`{sf}`) as a synonym for `galah_polygon()`

Note that these functions are all evaluated lazily; they amend the underlying 
object, but do not amend the nature of the data until the call is evaluated. To 
actually build and run the query, we'll need to use one or more of a different 
set of dplyr verbs: `collapse()`, `compute()` and `collect()`.

## Advanced query building

The usual way to begin a query to request data in galah is using `galah_call()`. 
However, this function now calls one of three types of `request_` functions. 
If you prefer, you can begin your pipe with one of these dedicated `request_`
functions (rather than `galah_call()`) depending on the type of data you 
want to collect. 

For example, if you want to download occurrences, use `request_data()`:


``` r
x <- request_data("occurrences") |> # note that "occurrences" is the default `type`
  filter(species == "Crinia tinnula", 
         year == 2010) |>
  collect()
```

You'll notice that this query differs slightly from the query structure used in 
earlier versions of `galah`. The desired data type, `"occurrences"`, 
is specified at the beginning of the query within `request_data()` rather than 
at the end using `atlas_occurrences()`. Specifying the data type at the start 
allows users to make use of advanced query building using three newly 
implemented stages of query building: `collapse()`, `compute()` and `collect()`.
These stages mirror existing [functions in dplyr for querying 
databases](https://dplyr.tidyverse.org/reference/compute.html), and act in the 
following way:

- `collapse()` converts the object to a `query`. This allows users to inspect  
   their API calls before they are sent. Depending on the request, this function
   may also call 'supplementary' APIs to collect required information,
   such as Taxon Concept Identifiers or field names.
- `compute()` is intended to send the query in question to the requested API 
   for processing. This is particularly important for occurrences, where
   it can be useful to submit a query and retrieve it at a later time. If the 
   `compute()` stage is not required, however, `compute()` simply converts
   the `query` to a new class (`computed_query`).
- `collect()` retrieves the requested data into your workspace, returning a 
  `tibble`.

We can use these in sequence, or just leap ahead to the stage we want:


``` r
x <- request_data() |>
  filter(genus == "Crinia", year == 2020) |>
  group_by(species) |>
  arrange(species) |>
  count()

collapse(x)
```

```
## Object of class query with type data/occurrences-count-groupby 
## url: https://api.ala.org.au/occurrences/occurrences/facets?fq=%28genus%3A%2... 
## arrange: species (ascending)
```

``` r
compute(x)
```

```
## Object of class computed_query with type data/occurrences-count-groupby 
## url: https://api.ala.org.au/occurrences/occurrences/facets?fq=%28genus%3A%2... 
## arrange: species (ascending)
```

``` r
collect(x) |> head()
```

```
## # A tibble: 6 Ã— 2
##   species              count
##   <chr>                <int>
## 1 Crinia bilingua         74
## 2 Crinia deserticola     253
## 3 Crinia flindersensis     3
## 4 Crinia georgiana      1509
## 5 Crinia glauerti       3111
## 6 Crinia insignifera     530
```

The benefit of using `collapse()`, `compute()` and `collect()` is that queries 
are more modular. This is particularly useful for large data requests in galah. 
Users can send their query using `compute()`, and download data once the query 
has finished â€” downloading with `collect()` later â€” rather than waiting for the 
request to finish within R.


``` r
# Create and send query to be calculated server-side
request <- request_data() |>
  identify("perameles") |>
  filter(year > 1900) |>
  compute()
  
# Download data
request |>
  collect()
```

Additionally, functions that are more modular are generally easier to 
interrogate and debug. Previously some functions did several different things, 
making it difficult to know which APIs were being called, when, and for what 
purpose. Partitioning queries into three distinct stages is much more transparent, 
and allows users to check their query construction prior to sending a request. 
For example, the query above is constructed with the following information, 
returned by `collapse()`.


``` r
request_data() |>
  identify("perameles") |>
  filter(year > 1900) |>
  collapse()
```

```
## Object of class query with type data/occurrences 
## url: https://api.ala.org.au/occurrences/occurrences/offline/download?fq=%28...
```

The `collapse()` stage includes an additional argument (`.expand`) that, 
when set to `TRUE`, shows all the APIs called to construct the user-requested
query. This is especially useful for debugging.

## Object classes

Under the hood, the different query-building verbs each amend the supplied 
object to a new class:

- `collapse()` returns class `query`, which is a list containing a `type` slot
   and one or more `url`s
- `compute()` returns a single object of class `computed_query`
- `collect()` returns a `tibble`

These can be called directly, or via the `method` and `type` arguments of 
`galah_call()`, which specify which dedicated `request_` function and data type 
to return. To demonstrate what we mean, take the following calls, which despite 
using different syntax, all return the number of records available for the year 
2020:


``` r
# new syntax
request_data() |>
  filter(year == 2020) |>
  count() |>
  collect()

# similar, but using `galah_call()`
galah_call(method = "data",
           type = "occurrences-count") |>
  filter(year == 2020) |>
  collect()

# original syntax
galah_call() |>
  galah_filter(year == 2020) |>
  atlas_counts()
```

Another example is to list available `fields` in the selected atlas:


``` r
request_metadata(type = "fields") |>
  collect()

galah_call(method = "metadata", 
           type = "fields") |>
  collect()

show_all(fields)
```

Or to show values for states and territories:


``` r
request_metadata() |>
  filter(field == "cl22") |>
  unnest() |>
  collect()

galah_call(method = "metadata", 
           type = "fields-unnest") |>
  galah_filter(id == "cl22") |>
  collect()

search_all(fields, "cl22") |>
  show_values()
```

While `request_metadata()` is more modular than `show_all()`, there is 
little benefit to using it for most applications. However, in some cases,
larger databases like GBIF return huge `data.frame`s of metadata when called 
via `show_all()`. Using `request_metdata()` allows users to specify a 
`slice_head()` line within their pipe to get around this issue.

## Which syntax should I prefer?

Despite these benefits, we have no plans to _require_ users to call masked 
functions. Functions prefixed with `galah_` or `atlas_` are not going away. 
Indeed, while there is perfect redundancy between old and new syntax in some 
cases, in others they serve different purposes. In `atlas_media()` for example, 
several calls are made and joined in a way that reduces the number of steps 
required by the user. Under the hood, however, all `atlas_` functions are now 
entirely built using the above syntax.