Data Resource

Peter Desmet

Data Resource is a simple format to describe a data resource such as an individual table or file, including its name, format, path, etc.

In this document we use the terms “package” for Data Package, “resource” for Data Resource, “dialect” for Table Dialect, and “schema” for Table Schema.

General implementation

Frictionless supports reading, manipulating and writing resources, but much of its functionality is limited to Tabular Data Resources.

Read

resources() lists all resources in a package:

library(frictionless)
package <- example_package()

# List the resources
resources(package)
#> [1] "deployments"  "observations" "media"

read_resource() reads data from a tabular resource to a data frame:

read_resource(package, "deployments")
#> # A tibble: 3 × 5
#>   deployment_id longitude latitude start      comments                     
#>   <chr>             <dbl>    <dbl> <date>     <chr>                        
#> 1 1                  4.62     50.8 2020-09-25  <NA>                        
#> 2 2                  4.64     50.8 2020-10-01 "On \"forêt\" road."         
#> 3 3                  4.65     50.8 2020-10-05 "Malfunction/no photos, data"

Frictionless does not support reading data from non-tabular resources.

Manipulate

remove_resource() removes a resource (of any type):

remove_resource(package, "deployments")
#> A Data Package with 2 resources:
#> • observations
#> • media
#> Use `unclass()` to print the Data Package as a list.

# This and many other functions return "package", which you can update with
# package <- remove_resource(package, "deployments")

add_resource() adds or replaces a tabular resource. The provided data must be a data frame or a tabular data file (e.g. CSV):

# Add a resource with data from a data frame
add_resource(package, "iris", data = iris)
#> A Data Package with 4 resources:
#> • deployments
#> • observations
#> • media
#> • iris
#> Use `unclass()` to print the Data Package as a list.

# Replace a resource with one where data is stored in a tabular file
path <- system.file("extdata", "deployments.csv", package = "frictionless")
add_resource(package, "deployments", data = path, replace = TRUE)
#> A Data Package with 3 resources:
#> • deployments
#> • observations
#> • media
#> Use `unclass()` to print the Data Package as a list.

Note that you can pipe most functions (see vignette("data-package")).
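
As a minimal sketch of such a pipeline, the steps below chain the functions introduced above with the native pipe (`|>`, which assumes R 4.1 or later). The choice of resources to remove and add is only illustrative:

```r
library(frictionless)

# Start from the example package, drop one resource and add another,
# passing the package along as the first argument of each function
package <-
  example_package() |>
  remove_resource("media") |>
  add_resource("iris", data = iris)

resources(package)
#> [1] "deployments"  "observations" "iris"
```

Because each function returns the updated package, the result only needs to be assigned once, at the start of the chain.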

Write

write_package() writes a package to disk as a datapackage.json file. This file includes the metadata of all the resources. write_package() also writes resource data to CSV files, unless the data are referenced by URL or included inline. See the function documentation for details.
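
A minimal sketch of writing a package, assuming a temporary directory as the target (the directory name `my_package` is made up for this example):

```r
library(frictionless)

package <- example_package()

# Write the package to a directory inside the session's temporary directory
directory <- file.path(tempdir(), "my_package")
write_package(package, directory)

# The directory should now contain datapackage.json and the data files
list.files(directory)
```

Writing to tempdir() keeps the example from touching the working directory. In practice you would pass the directory where the package should live.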

Properties implementation

name

name is required. It is used to identify a resource in read_resource(), add_resource() and remove_resource() (always as the second argument):

deployments <- read_resource(package, resource_name = "deployments")

add_resource() sets name to the provided resource_name:

add_resource(package, resource_name = "iris", data = iris)
#> A Data Package with 4 resources:
#> • deployments
#> • observations
#> • media
#> • iris
#> Use `unclass()` to print the Data Package as a list.

path

path or data (see further) is required. Providing both is not allowed.

path is for data in files (e.g. a CSV file). It can be a local path or URL. Supported protocols are http, https, ftp and sftp. Absolute paths (/) and relative parent paths (../) are not allowed, to avoid security vulnerabilities.
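
To make these rules concrete, here is a small base R sketch that classifies paths the same way. The helpers is_url() and is_allowed_path() are hypothetical and written for this illustration only. They are not the checks frictionless performs internally:

```r
# Hypothetical helpers (not part of frictionless) that mirror the rules above

# A path is a URL if it starts with one of the supported protocols
is_url <- function(path) {
  grepl("^(https?|s?ftp)://", path)
}

# A local path is rejected if it is absolute or refers to a parent directory
is_allowed_path <- function(path) {
  is_absolute <- startsWith(path, "/")
  has_parent <- grepl("(^|/)\\.\\.(/|$)", path)
  is_url(path) | !(is_absolute | has_parent)
}

paths <- c(
  "data/file.csv",
  "/data/file.csv",
  "../file.csv",
  "https://example.com/file.csv"
)
is_allowed_path(paths)
#> [1]  TRUE FALSE FALSE  TRUE
```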

When multiple paths are provided ("path": ["myfile1.csv", "myfile2.csv"]), the files are expected to have the same structure. read_resource() merges these into a single data frame in the order the paths are provided (using dplyr::bind_rows()):

# The "observations" resource has multiple files in path
package$resources[[2]]$path
#> [1] "observations_1.tsv" "observations_2.tsv"
# These are combined into a single data frame when reading
read_resource(package, "observations")
#> # A tibble: 8 × 7
#>   observation_id deployment_id timestamp           scientific_name     count
#>   <chr>          <chr>         <dttm>              <chr>               <dbl>
#> 1 1-1            1             2020-09-28 00:13:07 Capreolus capreolus     1
#> 2 1-2            1             2020-09-28 15:59:17 Capreolus capreolus     1
#> 3 1-3            1             2020-09-28 16:35:23 Lepus europaeus         1
#> 4 1-4            1             2020-09-28 17:04:04 Lepus europaeus         1
#> 5 1-5            1             2020-09-28 19:19:54 Sus scrofa              2
#> 6 2-1            2             2021-10-01 01:25:06 Sus scrofa              1
#> 7 2-2            2             2021-10-01 01:25:06 Sus scrofa              1
#> 8 2-3            2             2021-10-01 04:47:30 Sus scrofa              1
#> # ℹ 2 more variables: life_stage <fct>, comments <chr>

add_resource() sets path to the path(s) provided in data:

path <- system.file("extdata", "deployments.csv", package = "frictionless")
add_resource(package, "deployments", data = path, replace = TRUE)
#> A Data Package with 3 resources:
#> • deployments
#> • observations
#> • media
#> Use `unclass()` to print the Data Package as a list.

data

Note: Support for inline data is currently limited: data provided as a JSON object or string are not supported, and schema, mediatype and format are ignored.

data is for inline data (included in the datapackage.json). read_resource() attempts to read data if it is provided as a JSON array:

# The "media" resource has inline data
str(package$resources[[3]]$data)
#> List of 3
#>  $ :List of 5
#>   ..$ media_id      : chr "aed5fa71-3ed4-4284-a6ba-3550d1a4de8d"
#>   ..$ deployment_id : chr "1"
#>   ..$ observation_id: chr "1-1"
#>   ..$ timestamp     : chr "2020-09-28 02:14:59+02:00"
#>   ..$ file_path     : chr "https://multimedia.agouti.eu/assets/aed5fa71-3ed4-4284-a6ba-3550d1a4de8d/file"
#>  $ :List of 5
#>   ..$ media_id      : chr "da81a501-8236-4cbd-aa95-4bc4b10a05df"
#>   ..$ deployment_id : chr "1"
#>   ..$ observation_id: chr "1-1"
#>   ..$ timestamp     : chr "2020-09-28 02:15:00+02:00"
#>   ..$ file_path     : chr "https://multimedia.agouti.eu/assets/da81a501-8236-4cbd-aa95-4bc4b10a05df/file"
#>  $ :List of 5
#>   ..$ media_id      : chr "0ba57608-3cf1-49d6-a5a2-fe680851024d"
#>   ..$ deployment_id : chr "1"
#>   ..$ observation_id: chr "1-1"
#>   ..$ timestamp     : chr "2020-09-28 02:15:01+02:00"
#>   ..$ file_path     : chr "https://multimedia.agouti.eu/assets/0ba57608-3cf1-49d6-a5a2-fe680851024d/file"
read_resource(package, "media")
#> # A tibble: 3 × 5
#>   media_id                      deployment_id observation_id timestamp file_path
#>   <chr>                         <chr>         <chr>          <chr>     <chr>    
#> 1 aed5fa71-3ed4-4284-a6ba-3550… 1             1-1            2020-09-… https://…
#> 2 da81a501-8236-4cbd-aa95-4bc4… 1             1-1            2020-09-… https://…
#> 3 0ba57608-3cf1-49d6-a5a2-fe68… 1             1-1            2020-09-… https://…

add_resource() adds the provided data frame to data:

df <- data.frame("col_1" = c(1, 2), "col_2" = c("a", "b"))
package <- add_resource(package, "df", df)
package$resources[[4]]$data
#>   col_1 col_2
#> 1     1     a
#> 2     2     b

write_package() writes that data frame to a CSV file, adds its path to path and removes data.
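
The following sketch shows that round trip, assuming a temporary directory as the target (the directory name `inline_df` is made up). It reads the written datapackage.json back in to inspect the resource as stored on disk:

```r
library(frictionless)

# Add a data frame as a fourth resource to the example package
df <- data.frame("col_1" = c(1, 2), "col_2" = c("a", "b"))
package <- add_resource(example_package(), "df", df)

# Write the package, then read the resulting descriptor back in
directory <- file.path(tempdir(), "inline_df")
write_package(package, directory)
written <- read_package(file.path(directory, "datapackage.json"))

# The resource now points to a file and no longer carries inline data
written$resources[[4]]$path
is.null(written$resources[[4]]$data)
```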

profile

profile is required to have the value "tabular-data-resource". add_resource() sets profile to that value.
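
As a quick check, the sketch below adds a data frame to the example package and inspects the profile of the new resource. The index 4 assumes the example package starts with three resources, as listed earlier:

```r
library(frictionless)

package <- add_resource(example_package(), "iris", data = iris)

# The added resource is the fourth one in the package
package$resources[[4]]$profile
#> [1] "tabular-data-resource"
```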

schema

schema is required. It is used by read_resource() to parse data types and missing values. It can either be a JSON object or a path or URL referencing a JSON object. See vignette("table-schema") for details.
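
A minimal sketch of working with a schema as a list (the R equivalent of a JSON object): it creates a schema from a data frame with create_schema(), passes it to add_resource() and retrieves it again with get_schema(). Using iris as the data is only an illustration:

```r
library(frictionless)

# Derive a schema from the data frame, which could be edited before use
iris_schema <- create_schema(iris)

# Provide the schema explicitly when adding the resource
package <- add_resource(
  example_package(),
  "iris",
  data = iris,
  schema = iris_schema
)

# Retrieve the schema and list the names of its fields
schema <- get_schema(package, "iris")
vapply(schema$fields, function(field) field$name, character(1))
```

If no schema is provided, add_resource() is expected to create one from the data, so passing it explicitly is mainly useful when you want to adjust it first.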

dialect

dialect is used by read_resource() to parse a tabular data file. It can either be a JSON object or a path or URL referencing a JSON object. See vignette("table-dialect") for details.

title

title is ignored by read_resource() and not set by add_resource(), unless provided:

add_resource(
  package,
  "iris",
  iris,
  title = "Edgar Anderson's Iris Data",
  replace = TRUE
)
#> A Data Package with 5 resources:
#> • deployments
#> • observations
#> • media
#> • df
#> • iris
#> Use `unclass()` to print the Data Package as a list.

description

description is ignored by read_resource() and not set by add_resource() unless provided (cf. title).
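
For instance, a description can be passed in the same way as a title. The sketch below assumes the example package with its three resources, and the description text is made up for illustration:

```r
library(frictionless)

package <- add_resource(
  example_package(),
  "iris",
  iris,
  title = "Edgar Anderson's Iris Data",
  description = "Measurements of sepals and petals for three iris species."
)

# The property is stored as provided on the new (fourth) resource
package$resources[[4]]$description
#> [1] "Measurements of sepals and petals for three iris species."
```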

format

format is ignored by read_resource(). add_resource() sets format when data are provided as a file, based on the provided delim:

delim              format
"," (default)      "csv"
"\t"               "tsv"
any other value    "csv"

path <- system.file("extdata", "observations_1.tsv", package = "frictionless")
package <- add_resource(package, "observations", data = path, delim = "\t", replace = TRUE)
package$resources[[2]]$format
#> [1] "tsv"

add_resource() leaves format undefined when data are provided as a data frame. write_package() sets it to "csv" when writing to disk.

mediatype

mediatype is ignored by read_resource(). add_resource() sets mediatype when data are provided as a file, based on the provided delim:

delim              mediatype
"," (default)      "text/csv"
"\t"               "text/tab-separated-values"
any other value    "text/csv"

path <- system.file("extdata", "observations_1.tsv", package = "frictionless")
package <- add_resource(package, "observations", data = path, delim = "\t", replace = TRUE)
package$resources[[2]]$mediatype
#> [1] "text/tab-separated-values"

add_resource() leaves mediatype undefined when data are provided as a data frame. write_package() sets it to "text/csv" when writing to disk.
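
The data frame case for format and mediatype can be sketched as follows. The temporary directory name `iris_package` is made up, and the written descriptor is read back in to inspect what was stored on disk:

```r
library(frictionless)

# A resource added from a data frame has no format or mediatype yet
package <- add_resource(example_package(), "iris", data = iris)
is.null(package$resources[[4]]$format)
is.null(package$resources[[4]]$mediatype)

# After writing, both properties describe the CSV file on disk
directory <- file.path(tempdir(), "iris_package")
write_package(package, directory)
written <- read_package(file.path(directory, "datapackage.json"))
written$resources[[4]]$format
written$resources[[4]]$mediatype
```

Following the description above, the first two calls should return TRUE and the last two should return "csv" and "text/csv".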

encoding

encoding (e.g. "windows-1252") is used by read_resource() to parse the file. It defaults to UTF-8 if no encoding is provided or if it cannot be recognized. The returned data frame is always UTF-8.

add_resource() guesses the encoding (using readr::guess_encoding()) when data are provided as a file. It leaves the encoding undefined when data are provided as a data frame. write_package() sets it to "utf-8" when writing to disk.

path <- system.file("extdata", "deployments.csv", package = "frictionless")
package <- add_resource(package, "deployments", data = path, delim = ",", replace = TRUE)
package$resources[[1]]$encoding
#> [1] "UTF-8"

bytes

bytes is ignored by read_resource() and not set by add_resource() unless provided (cf. title).

hash

hash is ignored by read_resource() and not set by add_resource() unless provided (cf. title).

sources

sources is ignored by read_resource() and not set by add_resource() unless provided (cf. title).

licenses

licenses is ignored by read_resource() and not set by add_resource() unless provided (cf. title).

compression

compression (a recipe) is ignored by read_resource() and not set by add_resource().

Compression is derived from the provided path instead. If the path ends in .gz, .bz2, .xz, or .zip, the files are automatically decompressed by read_resource() (using default readr::read_delim() functionality). Only .gz files can be read directly from URL paths.