---
title: "Transform and Validate JSON and NDJSON"
author:
- name: Martin Morgan
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY, US
- name: Marcel Ramos
  affiliation: CUNY School of Public Health at Hunter College, New York, NY, US
output:
  BiocStyle::html_document
vignette: |
  %\VignetteIndexEntry{Transform and Validate JSON and NDJSON}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE
)
```

# Introduction & installation

Use [rjsoncons][] for querying, transforming, and searching JSON,
NDJSON, or R objects using [JMESpath][], [JSONpath][], or
[JSONpointer][]. [rjsoncons][] supports [JSON patch][] for document
editing, and [JSON schema][] validation. Link to the package for
direct access to additional features in the [jsoncons][] C++ library.

[jsoncons]: https://github.com/danielaparker/jsoncons/
[rjsoncons]: https://mtmorgan.github.io/rjsoncons/
[JSONpath]: https://goessner.net/articles/JsonPath/
[JMESpath]: https://jmespath.org/
[JSONpointer]: https://datatracker.ietf.org/doc/html/rfc6901
[JSON patch]: https://jsonpatch.com/
[JSON schema]: https://json-schema.org/

Install the released package version from CRAN

```{r install, eval = FALSE}
install.packages("rjsoncons", repos = "https://CRAN.R-project.org")
```

Install the development version with

```{r install_github, eval = FALSE}
if (!requireNamespace("remotes", quiety = TRUE))
    install.packages("remotes", repos = "https://CRAN.R-project.org")
remotes::install_github("mtmorgan/rjsoncons")
```

Attach the installed package to your *R* session, and check the
version of the C++ library in use

```{r library, messages = FALSE}
library(rjsoncons)
rjsoncons::version()
```

# Query and pivot

Functions in this package work on JSON or NDJSON character vectors,
file paths and URLs to JSON or NDJSON documents, and *R* objects that
can be transformed to a JSON string.

## Select, filter and transform with `j_query()`

Here is a simple JSON example document

```{r json_example}
json <- '{
  "locations": [
    {"name": "Seattle", "state": "WA"},
    {"name": "New York", "state": "NY"},
    {"name": "Bellevue", "state": "WA"},
    {"name": "Olympia", "state": "WA"}
  ]
}'
```

There are several common use cases. Use [rjsoncons][] to query the
JSON string using [JSONpath][], [JMESpath][] or [JSONpointer][]
syntax to filter larger documents to records of interest, e.g., only
cities in New York state, using 'JMESpath' syntax.

```{r j_query}
j_query(json, "locations[?state == 'NY']") |>
    cat("\n")
```

Use the `as = "R"` argument to extract deeply nested elements as *R*
objects, e.g., a character vector of city names in Washington state.

```{r as_arg}
j_query(json, "locations[?state == 'WA'].name", as = "R")
```

The JSON Pointer specification is simpler, indexing a single object in
the document. JSON arrays are 0-based.

```{r jsonpointer}
j_query(json, "/locations/0/state")
```

The examples above use `j_query()`, which automatically infers query
specification from the form of `path` using `j_path_type()`. It may be
useful to indicate query specification more explicitly using
`jsonpointer()`, `jsonpath()`, or `jmespath()`; examples illustrating
features available for each query specification are on the help pages
`?jsonpointer`, `?jsonpath`, and `?jmespath`.

## Array-of-objects to *R* data.frame with `j_pivot()`

The following transforms a nested JSON document into a format that can
be incorporated directly in *R* as a `data.frame`.

```{r array_of_objects}
path <- '{
    name: locations[].name,
    state: locations[].state
}'
j_query(json, path, as = "R") |>
    data.frame()
```

The transformation from JSON 'array-of-objects' to 'object-of-arrays'
suitable for direct representation as a `data.frame` is common, and is
implemented directly as `j_pivot()`

```{r j_pivot}
j_pivot(json, "locations", as = "data.frame")
```

`j_pivot()` also support `as = "tibble"` when the [dplyr][] package is
installed.

[purrr]: https://CRAN.R-project.org/package=purrr
[dplyr]: https://CRAN.R-project.org/package=dplyr
[tidyr]: https://CRAN.R-project.org/package=tidyr
[r4ds]: https://r4ds.hadley.nz/
[hierarchical data]: https://r4ds.hadley.nz/rectangling

## NDJSON support

[rjsoncons][] supports [NDJSON][] (new-line delimited JSON). NDJSON
consists of a file or character vector where each line / element
represents a JSON record. This example uses data from the [GitHub
Archive][] project recording all actions on public GitHub
repositories. The data included in the package are the first 10 lines
of <https://data.gharchive.org/2023-02-08-0.json.gz>.

```{r}
ndjson_file <-
    system.file(package = "rjsoncons", "extdata", "2023-02-08-0.json")
```

NDJSON can be read into R (`ndjson <- readLines(ndjson_file)`) and
used in `j_query()` / `j_pivot()`, but it is often better to leave
full NDJSON files on disk. Thus the first argument to `j_query()` or
`j_pivot()` is usually a (text or gz-compressed) file path or URL.
Two additional options are available when working with
NDJSON. `n_records` limits the number of records processed. Using
`n_records` can be very useful when exploring the data. For instance,
the first record of a file can be viewed interactively with

```{r ndjson_listviewer, eval = FALSE}
j_query(ndjson_file, n_records = 1) |>
    listviewer::jsonedit()
```

The option `verbose = TRUE` adds a progress indicator, which provides
confidence that progress is being made while parsing large files. The
progress bar requires the [cli][] package.

`j_query()` provides a one-to-one mapping of NDJSON lines / elements
to the return value, e.g., `j_query(ndjson_file, "@", as = "string")`
on an NDJSON file with 1000 lines will return a character vector of
1000 elements, or with `j_query(ndjson, "@", as = "R")` an *R* list
with length 1000.

```{r ndjson_j_query}
j_query(ndjson_file, "{id: id, type: type}", n_records = 5)
```

`j_pivot()` transforms an NDJSON file or character vector of objects
into a format convenient for input in *R*. `j_pivot()` with NDJSON
files and JMESpath paths work particularly well together, because
JMESpath provides flexibility in creating JSON objects to be pivoted.

```{r ndjson_j_pivot}
j_pivot(ndjson_file, "{id: id, type: type}", as = "data.frame")
```

Filtering NDJSON files can require relatively more complicated paths,
e.g., to filter 'PushEvent' types from organizations, construct a
query that acts on each NDJSON record to return an array of a single
object, then apply a filter to replace uninteresting elements with
0-length arrays (using `as = "tibble"` often transforms the *R*
list-of-vectors to a tibble in a more pleasing and robust manner
compared to `as = "data.frame"`).

```{r ndjson_j_pivot_filter}
path <-
    "[{id: id, type: type, org: org}]
         [?@.type == 'PushEvent' && @.org != null] |
             [0]"
j_pivot(ndjson_file, path, as = "data.frame")
```

A more complete example is used in the [NDJSON extended
vignette][ndjson-extended]

[NDJSON]: https://ndjson.org/
[GitHub Archive]: https://www.gharchive.org/
[ndjson-extended]: https://mtmorgan.github.io/rjsoncons/articles/b_ndjson_extended.html
[cli]: https://CRAN.R-project.org/package=cli

## *R* objects as input

[rjsoncons][] can filter and transform _R_ objects. These are
converted to JSON using `jsonlite::toJSON()` before queries are made;
`toJSON()` arguments like `auto_unbox = TRUE` can be added to the
function call.

```{r r_list}
## `lst` is an *R* list
lst <- jsonlite::fromJSON(json, simplifyVector = FALSE)
j_query(lst, "locations[?state == 'WA'].name | sort(@)", auto_unbox = TRUE) |>
    cat("\n")
```

# Patch

[JSON Patch][] provides a simple way to edit or transform a JSON
document using JSON commands.

## Applying a patch with `j_patch_apply()`

Starting with the JSON document

```{r}
json <- '{
  "biscuits": [
    { "name": "Digestive" },
    { "name": "Choco Leibniz" }
  ]
}'
```

one can `"add"` another biscuit, and copy a favorite biscuit to a new
locations using the following patch

```{r}
patch <- '[
    {"op": "add", "path": "/biscuits/1", "value": { "name": "Ginger Nut" }},
    {"op": "copy", "from": "/biscuits/2", "path": "/best_biscuit"}
]'
```

The paths are specified using [JSONpointer][] notation; remember that
JSON arrays are 0-based, compared to 1-based *R* arrays. Applying the
patch results in a new JSON document.

```{r}
j_patch_apply(json, patch)
```

Patches can also be created from *R* objects with the helper function
`j_patch_op()`.

```{r}
ops <- c(
    j_patch_op(
        "add", "/biscuits/1", value = list(name = "Ginger Nut"),
        auto_unbox = TRUE
    ),
    j_patch_op("copy", "/best_biscuit", from = "/biscuits/2")
)
identical(j_patch_apply(json, patch), j_patch_apply(json, ops))
```

`j_patch_op()` takes care of unboxing `op=`, `path=`, and `from=`, but
some care must be taken in 'unboxing' the `value=` argument for
operations such as 'add'; it may also be appropriate to unbox only
specific fields, e.g.,

```{r}
value <- list(name = jsonlite::unbox("Ginger Nut"))
j_patch_op("add", "/biscuits/1", value = value)
```

From the [JSON patch][] web site, available operations and example
JSON are:

- `add` -- add elements to an existing document.

    ```
    {"op": "add", "path": "/biscuits/1", "value": {"name": "Ginger Nut"}}
    ```

- `remove` -- remove elements from a document.

    ```
    {"op": "remove", "path": "/biscuits/0"}
    ```

- `replace` -- replace one element with another

    ```
    {
        "op": "replace", "path": "/biscuits/0/name",
        "value": "Chocolate Digestive"
    }
    ```

- `copy` -- copy a path to another location.

    ```
    {"op": "copy", "from": "/biscuits/0", "path": "/best_biscuit"}
    ```

- `move` -- move a path to another location.

    ```
    {"op": "move", "from": "/biscuits", "path": "/cookies"}
    ```

- `test` -- test for the existence of a path; if the path does not
  exist, do not apply any of the patch.

    ```
    {"op": "test", "path": "/best_biscuit/name", "value": "Choco Leibniz"}
    ```

Formal description of these operations is provided in Section 4 of
[RFC6902][]. A patch command is *always* an array, even when a single
operation is involved.

## Difference between documents with `j_patch_from()`

The `j_patch_from()` function constructs a patch from the difference
between two documents

```{r}
j_patch_from(j_patch_apply(json, patch), json)
```

[JSON Patch]: https://jsonpatch.com/
[RFC6902]: https://datatracker.ietf.org/doc/html/rfc6902/#section-4

# Schema validation

[JSON schema][] provides structure to JSON
documents. `j_schema_is_valid()` checks that a JSON document is valid
against a specified schema, and `j_schema_validate()` tries to
illustrate how a document deviates from the schema.

As an example consider `j_patch_op()`, where the operation is supposed
to conform to the [JSON patch][] schema. For convenience, a copy of
this schema is available in [rjsoncons][].

```{r}
## alternatively: schema <- "https://json.schemastore.org/json-patch"
schema <- system.file(package = "rjsoncons", "extdata", "json-patch.json")
cat(readLines(schema), sep = "\n")
```

The well-formed 'op' is valid, and `j_schema_validate()` produces no output

```{r valid-schema}
op <- '[{
    "op": "add", "path": "/biscuits/1",
    "value": { "name": "Ginger Nut" }
}]'
j_schema_is_valid(op, schema)
j_schema_validate(op, schema)
```

Introduce an invalid 'op', `"op": "invalid_op"`, and the schema is no
longer valid.

```{r invalid-schema}
op <- '[{
    "op": "invalid_op", "path": "/biscuits/1",
    "value": { "name": "Ginger Nut" }
}]'
j_schema_is_valid(op, schema)
```

The reason can be understood from (careful!) consideration of the
output of `j_schema_validate()`, with reference to the schema itself.

```{r invalid-schema-tibble}
j_schema_validate(op, schema, as = "tibble") |>
    tibble::glimpse()
```

The validation indicates that the schema `evaluationPath`
'/items/oneOf' is not satisfied, because of the `error` 'No schema
[i.e., 'oneOf' elements] matched, ...'.

The 'details' column summarizes why each of the 3 elements of
`/items/oneOf` fails the schema specification; use `as = "details"` to
extract this directly

```{r invalid-schema-details}
j_schema_validate(op, schema, as = "details") |>
    tibble::glimpse()
```

This indicates that the first item in the schema is rejected because
'invalid_op' is not a valid enum

```{r invalid-schema-0}
j_query(schema, "/items/oneOf/0/properties/op/enum") |>
    noquote()
```

Reasons for rejecting other items can be explored using similar steps.

# Flatten and find

It can sometimes be helpful to explore JSON documents by 'flattening'
the JSON to an object of path / value pairs, where the path is the
[JSONpointer][] path to the corresponding value. It is then
straight-forward to search this flattened object for, e.g., the path
to a known field or value. As an example, consider the object

```{r}
codes <- '{
    "discards": {
        "1000": "Record does not exist",
        "1004": "Queue limit exceeded",
        "1010": "Discarding timed-out partial msg"
    },
    "warnings": {
        "0": "Phone number missing country code",
        "1": "State code missing",
        "2": "Zip code missing"
    }
}'
```

The 'flat' JSON of this can be represented as named list (using
`str()` to provide a compact visual representation)

```{r}
j_flatten(codes, as = "R") |>
    str()
```

The names of the list are JSONpointer (default) or JSONpath, so can be
used in `j_query()` and `j_pivot()` as appropriate

```{r}
j_query(codes, "/discards/1010")
```

There are two ways to find known keys and values. The first is to use
exact matching to one or more keys or values, e.g.,

```{r}
j_find_values(
    codes, c("Record does not exist", "State code missing"),
    as = "tibble"
)
j_find_keys(codes, "warnings", as = "tibble")
```

It is also possible to match using a regular expression.

```{r}
j_find_values_grep(codes, "missing", as = "tibble")
j_find_keys_grep(codes, "card.*/100", as = "tibble") # span key delimiters
```

Keys are always character vectors, but values can be of different
type; `j_find_values()` supports searches on these.

```{r}
j <- '{"x":[1,[2, 3]],"y":{"a":4}}'
j_flatten(j, as = "R") |> str()
j_find_values(j, c(2, 4), as = "tibble")
```

A common operation might be to find the path to a know value, and then
to query the original JSON to find the object in which the value is
contained.

```{r}
j_find_values(j, 3, as = "tibble")
## path to '3' is '/x/1/1', so containing object is at '/x/1'
j_query(j, "/x/1")
j_query(j, "/x/1", as = "R")
```

Both JSONpointer and JSONpath are supported; an advantage of the
latter is that the path distinguishes between integer-valued
(unquoted) and string-valued (quoted) keys

```{r}
j_find_values(j, 3, as = "tibble", path_type = "JSONpath")
```

The first argument to `j_find_*()` can be an *R* object, JSON or
NDJSON string, file, or URL. Using `j_find_values()` with an *R*
object and JSONpath `path_type` leads to a path that is easily
converted into an *R* index: double the `[` and `]` in the path and
increment each numerical index by 1:

```{r}
l <- j |> as_r()
j_find_values(l, 3, auto_unbox = TRUE, path_type = "JSONpath", as = "tibble")
l[['x']][[2]] # siblings
```

NDJSON files are flattened into character vectors, with
each element the flattened version of the corresponding NDJSON record.

# The JSON parser

The package includes a JSON parser, used with the argument `as = "R"`
or directly with `as_r()`

``` r
as_r('{"a": 1.0, "b": [2, 3, 4]}') |>
    str()
#> List of 2
#>  $ a: num 1
#>  $ b: int [1:3] 2 3 4
```

The main rules of this transformation are outlined here.  JSON arrays
of a single type (boolean, integer, double, string) are transformed to
*R* vectors of the same length and corresponding type.

```{r as_r}
as_r('[true, false, true]') # boolean -> logical
as_r('[1, 2, 3]')           # integer -> integer
as_r('[1.0, 2.0, 3.0]')     # double  -> numeric
as_r('["a", "b", "c"]')     # string  -> character
```

JSON arrays mixing integer and double values are transformed to
*R* numeric vectors.

```{r as_r_integer_numeric}
as_r('[1, 2.0]') |> class() # numeric
```

If a JSON integer array contains a value larger than *R*'s 32-bit
integer representation, the array is transformed to an *R* numeric
vector. NOTE that this results in loss of precision for JSON integer
values greater than `2^53`.

```{r as_r_64_bit}
as_r('[1, 2147483648]') |> class()  # 64-bit integers -> numeric
```

JSON objects are transformed to *R* named lists.

```{r as_r_objects}
as_r('{}')
as_r('{"a": 1.0, "b": [2, 3, 4]}') |> str()
```

There are several additional details. A JSON scalar and a JSON vector
of length 1 are represented in the same way in *R*.

```{r as_r_scalars}
identical(as_r("3.14"), as_r("[3.14]"))
```

JSON arrays mixing types other than integer and double are transformed to
*R* lists

```{r as_r_mixed_arrays}
as_r('[true, 1, "a"]') |> str()
```

JSON `null` values are represented as *R* `NULL` values; arrays of
`null` are transformed to lists

```{r as_r_null}
as_r('null')                  # NULL
as_r('[null]') |> str()       # list(NULL)
as_r('[null, null]') |> str() # list(NULL, NULL)
```

Ordering of object members is controlled by the `object_names=`
argument. The default preserves names as they appear in the JSON
definition; use `"sort"` to sort names alphabetically. This argument
is applied recursively.

```{r as_r_field_order}
json <- '{"b": 1, "a": {"d": 2, "c": 3}}'
as_r(json) |> str()
as_r(json, object_names = "sort") |> str()
```

The parser corresponds approximately to `jsonlite::fromJSON()` with
arguments `simplifyVector = TRUE, simplifyDataFrame = FALSE,
simplifyMatrix = FALSE)`. Unit tests (using the [tinytest][]
framework) providing additional details are available at

```{r as_r_tiny_test_source, eval = FALSE}
system.file(package = "rjsoncons", "tinytest", "test_as_r.R")
```

[tinytest]: https://CRAN.R-project.org/package=tinytest

## Using `jsonlite::fromJSON()`

The built-in parser can be replaced by alternative parsers by returning
the query as a JSON string, e.g., using the `fromJSON()` in the
[jsonlite][] package.

```{r jsonlite_fromJSON}
json <- '{
  "locations": [
    {"name": "Seattle", "state": "WA"},
    {"name": "New York", "state": "NY"},
    {"name": "Bellevue", "state": "WA"},
    {"name": "Olympia", "state": "WA"}
  ]
}'
j_query(json, "locations[?state == 'WA']") |>
    ## `fromJSON()` simplifies list-of-objects to data.frame
    jsonlite::fromJSON()
```

[jsonlite]: https://CRAN.R-project.org/package=jsonlite

The [rjsoncons][] package is particularly useful when accessing
elements that might otherwise require complicated application of
nested `lapply()`, [purrr][] expressions, or [tidyr][] `unnest_*()`
(see [R for Data Science][r4ds] chapter '[Hierarchical data][]').

# C++ library use in other packages

The package includes the complete 'jsoncons' C++ header-only library,
available to other R packages by adding

```
LinkingTo: rjsoncons
SystemRequirements: C++11
```

to the DESCRIPTION file. Typical use in an R package would also
include `LinkingTo:` specifications for the [cpp11][] or [Rcpp][]
(this package uses [cpp11][]) packages to provide a C / C++ interface
between R and the C++ 'jsoncons' library.

[cpp11]: https://cran.r-project.org/package=cpp11
[Rcpp]: https://cran.r-project.org/package=Rcpp

# Session information

This vignette was compiled using the following software versions

```{r session_info}
sessionInfo()
```