Data Resource

Data Resource is a simple format to describe a data resource such as an individual table or file, including its name, format, path, etc.

Properties implementation

name

name is required. It is used to identify a resource in read_resource(), add_resource() and remove_resource() (always as the second argument):

deployments <- read_resource(package, resource_name = "deployments")

add_resource() sets name to the provided resource_name:

add_resource(package, resource_name = "iris", data = iris)
#> A Data Package with 4 resources:
#> • deployments
#> • observations
#> • media
#> • iris
#> Use `unclass()` to print the Data Package as a list.
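
Because resources are stored as a list, looking one up by name amounts to matching on the name property of each element. The following is a minimal base R sketch of that idea; the resources list is a made-up stand-in for package$resources, and this is not necessarily how read_resource() does the lookup internally:

```r
# Hypothetical stand-in for package$resources (only the name property is shown)
resources <- list(
  list(name = "deployments"),
  list(name = "observations"),
  list(name = "media")
)

# Collect all resource names, then find the position of the requested one
resource_names <- vapply(resources, function(x) x$name, character(1))
position <- which(resource_names == "observations")
position
```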

path

path or data (see further) is required. Providing both is not allowed.

path is for data in files (e.g. a CSV file). It can be a local path or URL. Supported protocols are http, https, ftp and sftp. Absolute paths (/) or relative parent paths (../) are not allowed to avoid security vulnerabilities.
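
The path rules above can be sketched as a small check in base R. The helper is_allowed_path() is hypothetical and written only for illustration; it is not a frictionless function, and rejecting ".." anywhere in the path is an assumption that is stricter than a leading ../ only:

```r
# Hypothetical helper (not part of frictionless): does a path follow the rules?
is_allowed_path <- function(path) {
  # URLs: accept only the protocols listed as supported
  if (grepl("^[A-Za-z][A-Za-z0-9+.-]*://", path)) {
    protocol <- tolower(sub("://.*$", "", path))
    return(protocol %in% c("http", "https", "ftp", "sftp"))
  }
  # Local paths: reject absolute paths and parent directory references
  is_absolute <- startsWith(path, "/")
  has_parent <- any(strsplit(path, "/", fixed = TRUE)[[1]] == "..")
  !is_absolute && !has_parent
}

is_allowed_path("data/deployments.csv")
is_allowed_path("../deployments.csv")
is_allowed_path("https://example.com/deployments.csv")
```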

When multiple paths are provided ("path": ["myfile1.csv", "myfile2.csv"]), the files are expected to have the same structure. read_resource() merges these into a single data frame in the order the paths are provided (using dplyr::bind_rows()):

# The "observations" resource has multiple files in path
package$resources[[2]]$path
#> [1] "observations_1.tsv" "observations_2.tsv"
# These are combined into a single data frame when reading
read_resource(package, "observations")
#> # A tibble: 8 × 7
#>   observation_id deployment_id timestamp           scientific_name     count
#>   <chr>          <chr>         <dttm>              <chr>               <dbl>
#> 1 1-1            1             2020-09-28 00:13:07 Capreolus capreolus     1
#> 2 1-2            1             2020-09-28 15:59:17 Capreolus capreolus     1
#> 3 1-3            1             2020-09-28 16:35:23 Lepus europaeus         1
#> 4 1-4            1             2020-09-28 17:04:04 Lepus europaeus         1
#> 5 1-5            1             2020-09-28 19:19:54 Sus scrofa              2
#> 6 2-1            2             2021-10-01 01:25:06 Sus scrofa              1
#> 7 2-2            2             2021-10-01 01:25:06 Sus scrofa              1
#> 8 2-3            2             2021-10-01 04:47:30 Sus scrofa              1
#> # ℹ 2 more variables: life_stage <fct>, comments <chr>
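
The merge itself is a row-wise stack of the files in path order. The sketch below shows that idea with two small made-up files and base R only; it uses rbind() to stay dependency-free, whereas read_resource() uses dplyr::bind_rows(), and it does no schema-based parsing:

```r
# Write two small files with the same columns to a temporary directory
dir <- tempfile("multi_path_")
dir.create(dir)
paths <- file.path(dir, c("observations_1.tsv", "observations_2.tsv"))
writeLines(c("observation_id\tcount", "1-1\t1", "1-2\t2"), paths[1])
writeLines(c("observation_id\tcount", "2-1\t1"), paths[2])

# Read each file, then stack the rows in the order of `paths`
parts <- lapply(paths, read.delim, stringsAsFactors = FALSE)
merged <- do.call(rbind, parts)
merged
```

Unlike dplyr::bind_rows(), rbind() requires all files to have identical column names, which matches the expectation that the files share one structure.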

add_resource() sets path to the path(s) provided in data:

path <- system.file("extdata", "v1", "deployments.csv", package = "frictionless")
add_resource(package, "deployments", data = path, replace = TRUE)
#> A Data Package with 3 resources:
#> • deployments
#> • observations
#> • media
#> Use `unclass()` to print the Data Package as a list.

data

Support for inline data is currently limited: JSON objects and strings are not supported, and schema, mediatype and format are ignored.

data is for inline data (included in the datapackage.json). read_resource() attempts to read data if it is provided as a JSON array:

# The "media" resource has inline data
str(package$resources[[3]]$data)
#> List of 3
#>  $ :List of 5
#>   ..$ media_id      : chr "aed5fa71-3ed4-4284-a6ba-3550d1a4de8d"
#>   ..$ deployment_id : chr "1"
#>   ..$ observation_id: chr "1-1"
#>   ..$ timestamp     : chr "2020-09-28 02:14:59+02:00"
#>   ..$ file_path     : chr "https://multimedia.agouti.eu/assets/aed5fa71-3ed4-4284-a6ba-3550d1a4de8d/file"
#>  $ :List of 5
#>   ..$ media_id      : chr "da81a501-8236-4cbd-aa95-4bc4b10a05df"
#>   ..$ deployment_id : chr "1"
#>   ..$ observation_id: chr "1-1"
#>   ..$ timestamp     : chr "2020-09-28 02:15:00+02:00"
#>   ..$ file_path     : chr "https://multimedia.agouti.eu/assets/da81a501-8236-4cbd-aa95-4bc4b10a05df/file"
#>  $ :List of 5
#>   ..$ media_id      : chr "0ba57608-3cf1-49d6-a5a2-fe680851024d"
#>   ..$ deployment_id : chr "1"
#>   ..$ observation_id: chr "1-1"
#>   ..$ timestamp     : chr "2020-09-28 02:15:01+02:00"
#>   ..$ file_path     : chr "https://multimedia.agouti.eu/assets/0ba57608-3cf1-49d6-a5a2-fe680851024d/file"
read_resource(package, "media")
#> # A tibble: 3 × 5
#>   media_id                      deployment_id observation_id timestamp file_path
#>   <chr>                         <chr>         <chr>          <chr>     <chr>    
#> 1 aed5fa71-3ed4-4284-a6ba-3550… 1             1-1            2020-09-… https://…
#> 2 da81a501-8236-4cbd-aa95-4bc4… 1             1-1            2020-09-… https://…
#> 3 0ba57608-3cf1-49d6-a5a2-fe68… 1             1-1            2020-09-… https://…
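
Conceptually, a JSON array of records becomes one row per record. The following base R sketch illustrates that conversion on a shortened, made-up list of records; it is not the code read_resource() uses:

```r
# A list of records, shaped like the parsed JSON array (placeholder values)
records <- list(
  list(media_id = "a", deployment_id = "1", observation_id = "1-1"),
  list(media_id = "b", deployment_id = "1", observation_id = "1-1")
)

# Turn each record into a one-row data frame, then stack the rows
rows <- lapply(records, function(x) as.data.frame(x, stringsAsFactors = FALSE))
media <- do.call(rbind, rows)
media
```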

add_resource() adds the provided data frame to data:

df <- data.frame("col_1" = c(1, 2), "col_2" = c("a", "b"))
package <- add_resource(package, "df", df)
package$resources[[4]]$data
#>   col_1 col_2
#> 1     1     a
#> 2     2     b

write_package() writes that data frame to a CSV file, adds its path to path and removes data.
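
The three steps in that sentence can be sketched on a plain list. This is an illustration only: the resource list is a simplified stand-in, and naming the file after the resource (df.csv) is an assumption, not something stated here:

```r
# Simplified stand-in for a resource that holds a data frame in `data`
resource <- list(
  name = "df",
  data = data.frame(col_1 = c(1, 2), col_2 = c("a", "b"))
)

# 1. Write the data frame to a CSV file (file name is an assumption)
directory <- tempfile("package_")
dir.create(directory)
file_name <- paste0(resource$name, ".csv")
write.csv(resource$data, file.path(directory, file_name), row.names = FALSE)

# 2. Point `path` at the file and 3. remove the inline `data`
resource$path <- file_name
resource$data <- NULL
names(resource)
```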

profile

profile is required to have the value "tabular-data-resource". add_resource() sets profile to that value.

schema

schema is required. It is used by read_resource() to parse data types and missing values. It can either be a JSON object or a path or URL referencing a JSON object. See vignette("table-schema") for details.

dialect

dialect is used by read_resource() to parse a tabular data file. It can either be a JSON object or a path or URL referencing a JSON object. See vignette("table-dialect") for details.

title

title is ignored by read_resource() and not set by add_resource(), unless provided:

add_resource(
  package,
  "iris",
  iris,
  title = "Edgar Anderson's Iris Data",
  replace = TRUE
)
#> A Data Package with 4 resources:
#> • deployments
#> • observations
#> • media
#> • df
#> Use `unclass()` to print the Data Package as a list.

description

description is ignored by read_resource() and not set by add_resource() unless provided (cf. title).

format

format is ignored by read_resource(). add_resource() sets format when data are provided as a file, based on the provided delim:

| delim | format |
| --- | --- |
| `","` (default) | `"csv"` |
| `"\t"` | `"tsv"` |
| any other value | `"csv"` |

path <- system.file("extdata", "v1", "observations_1.tsv", package = "frictionless")
package <- add_resource(package, "observations", data = path, delim = "\t", replace = TRUE)
package$resources[[2]]$format
#> [1] "tsv"

add_resource() leaves format undefined when data are provided as a data frame. write_package() sets it to "csv" when writing to disk.

mediatype

mediatype is ignored by read_resource(). add_resource() sets mediatype when data are provided as a file, based on the provided delim:

| delim | mediatype |
| --- | --- |
| `","` (default) | `"text/csv"` |
| `"\t"` | `"text/tab-separated-values"` |
| any other value | `"text/csv"` |

path <- system.file("extdata", "v1", "observations_1.tsv", package = "frictionless")
package <- add_resource(package, "observations", data = path, delim = "\t", replace = TRUE)
package$resources[[2]]$mediatype
#> [1] "text/tab-separated-values"

add_resource() leaves mediatype undefined when data are provided as a data frame. write_package() sets it to "text/csv" when writing to disk.
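
The delim rules for format and mediatype reduce to a single branch: tab gets the TSV values, everything else gets the CSV values. A minimal sketch, using a hypothetical helper describe_delim() that is not part of frictionless:

```r
# Hypothetical helper: map a delimiter to the format and mediatype
# that the two delim tables describe
describe_delim <- function(delim = ",") {
  if (identical(delim, "\t")) {
    list(format = "tsv", mediatype = "text/tab-separated-values")
  } else {
    # Comma (the default) and any other delimiter are treated as CSV
    list(format = "csv", mediatype = "text/csv")
  }
}

describe_delim("\t")$format
describe_delim(";")$mediatype
```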

encoding

encoding (e.g. "windows-1252") is used by read_resource() to parse the file. It defaults to UTF-8 if no encoding is provided or if it cannot be recognized. The returned data frame is always UTF-8.

add_resource() guesses the encoding (using readr::guess_encoding()) when data are provided as a file. It leaves the encoding undefined when data are provided as a data frame. write_package() sets it to "utf-8" when writing to disk.

path <- system.file("extdata", "v1", "deployments.csv", package = "frictionless")
package <- add_resource(package, "deployments", data = path, delim = ",", replace = TRUE)
package$resources[[1]]$encoding
#> [1] "UTF-8"
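
To illustrate why a declared encoding matters when reading, the base R sketch below writes a file whose é is stored as the single byte 0xE9 and converts it to UTF-8. It uses the encoding name "latin1" for portability, which is an assumption made for this example (that byte means é in both latin1 and windows-1252); read_resource() handles this through its own reading functions, not through iconv():

```r
# Write "name" and "café" with é as the single byte 0xE9 (not valid UTF-8)
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("name\ncaf"), as.raw(0xE9), charToRaw("\n")), path)

# Read the bytes as-is, then convert from the declared encoding to UTF-8
raw_lines <- readLines(path)
utf8_lines <- iconv(raw_lines, from = "latin1", to = "UTF-8")
Encoding(utf8_lines[2])
```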

bytes

bytes is ignored by read_resource() and not set by add_resource() unless provided (cf. title).

hash

hash is ignored by read_resource() and not set by add_resource() unless provided (cf. title).

sources

sources is ignored by read_resource() and not set by add_resource() unless provided (cf. title).

licenses

licenses is ignored by read_resource() and not set by add_resource() unless provided (cf. title).

compression

compression (a recipe) is ignored by read_resource() and not set by add_resource().

Compression is derived from the provided path instead. If the path ends in .gz, .bz2, .xz, or .zip, the files are automatically decompressed by read_resource() (using default readr::read_delim() functionality). Only .gz files can be read directly from URL paths.
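
Deriving compression from the path comes down to inspecting the file extension. A minimal sketch with a hypothetical helper detect_compression(), which is not a frictionless function and only reports the extension without decompressing anything:

```r
# Hypothetical helper: return the compression extension of a path, or NA
detect_compression <- function(path) {
  extension <- tolower(tools::file_ext(path))
  if (extension %in% c("gz", "bz2", "xz", "zip")) extension else NA_character_
}

detect_compression("deployments.csv.gz")
detect_compression("deployments.csv")
detect_compression("https://example.com/observations.tsv.zip")
```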