Table Schema

Peter Desmet

Table Schema is a simple format to describe tabular data, including field names, types, constraints, missing values, foreign keys, etc.

In this document we use the terms “package” for Data Package, “resource” for Data Resource, “dialect” for Table Dialect, and “schema” for Table Schema.

General implementation

Frictionless supports schema$fields and schema$missingValues to parse data types and missing values when reading Tabular Data Resources. Schema manipulation is limited to extracting a schema from a resource, creating one from a data frame, and providing one back to a resource. Schema metadata is including when writing a package.

Read

get_schema() extracts the schema from a resource:

library(frictionless)
package <- example_package()

# Get the Table Schema for the resource "observations"
schema <- get_schema(package, "observations")
str(schema)
#> List of 4
#>  $ fields       :List of 7
#>   ..$ :List of 3
#>   .. ..$ name       : chr "observation_id"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 2
#>   .. .. ..$ required: logi TRUE
#>   .. .. ..$ unique  : logi TRUE
#>   ..$ :List of 3
#>   .. ..$ name       : chr "deployment_id"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ required: logi TRUE
#>   ..$ :List of 4
#>   .. ..$ name       : chr "timestamp"
#>   .. ..$ type       : chr "datetime"
#>   .. ..$ format     : chr "%Y-%m-%dT%H:%M:%S%z"
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ required: logi TRUE
#>   ..$ :List of 3
#>   .. ..$ name       : chr "scientific_name"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ required: logi FALSE
#>   ..$ :List of 3
#>   .. ..$ name       : chr "count"
#>   .. ..$ type       : chr "integer"
#>   .. ..$ constraints:List of 2
#>   .. .. ..$ required: logi FALSE
#>   .. .. ..$ minimum : int 1
#>   ..$ :List of 3
#>   .. ..$ name       : chr "life_stage"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 2
#>   .. .. ..$ required: logi FALSE
#>   .. .. ..$ enum    : chr [1:5] "adult" "subadult" "juvenile" "offspring" ...
#>   ..$ :List of 3
#>   .. ..$ name       : chr "comments"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ required: logi FALSE
#>  $ missingValues: chr [1:3] "" "NA" "NaN"
#>  $ primaryKey   : chr "observation_id"
#>  $ foreignKeys  :List of 1
#>   ..$ :List of 2
#>   .. ..$ fields   : chr "deployment_id"
#>   .. ..$ reference:List of 2
#>   .. .. ..$ resource: chr "deployments"
#>   .. .. ..$ fields  : chr "deployment_id"

read_resource() uses schema$fields to parse the names and data types of the columns in a tabular data file. For example, the third field in the schema (timestamp) is defined as a datetime type with a specific format:

str(schema$fields[[3]])
#> List of 4
#>  $ name       : chr "timestamp"
#>  $ type       : chr "datetime"
#>  $ format     : chr "%Y-%m-%dT%H:%M:%S%z"
#>  $ constraints:List of 1
#>   ..$ required: logi TRUE

read_resource() uses that information to correctly parse the data type and to assign the name timestamp to the column:

observations <- read_resource(package, "observations")
observations$timestamp
#> [1] "2020-09-28 00:13:07 UTC" "2020-09-28 15:59:17 UTC"
#> [3] "2020-09-28 16:35:23 UTC" "2020-09-28 17:04:04 UTC"
#> [5] "2020-09-28 19:19:54 UTC" "2021-10-01 01:25:06 UTC"
#> [7] "2021-10-01 01:25:06 UTC" "2021-10-01 04:47:30 UTC"

The sixth field life_stage has an enum defined as one of its constraints:

str(schema$fields[[6]])
#> List of 3
#>  $ name       : chr "life_stage"
#>  $ type       : chr "string"
#>  $ constraints:List of 2
#>   ..$ required: logi FALSE
#>   ..$ enum    : chr [1:5] "adult" "subadult" "juvenile" "offspring" ...

read_resource() uses that information to parse the column as a factor, using enum as the factor levels:

class(observations$life_stage)
#> [1] "factor"
levels(observations$life_stage)
#> [1] "adult"     "subadult"  "juvenile"  "offspring" "unknown"

Manipulate

A schema is a list which you can manipulate, but frictionless does not provide functions to do that. Use {purrr} or base R instead (see vignette("frictionless")). You do not have to start a schema from scratch though: use get_schema() (see above) or create_schema() instead.

create_schema() creates a schema from a data frame and defines the name, type (and if a factor constraints$enum) for each field:

# Create a schema from the built-in dataset "iris"
iris_schema <- create_schema(iris)
str(iris_schema)
#> List of 1
#>  $ fields:List of 5
#>   ..$ :List of 2
#>   .. ..$ name: chr "Sepal.Length"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Sepal.Width"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Petal.Length"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Petal.Width"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 3
#>   .. ..$ name       : chr "Species"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ enum: chr [1:3] "setosa" "versicolor" "virginica"

add_resource() allows to include the schema with a resource. If no schema is provided, one is created with create_schema():

package <- add_resource(
  package,
  resource_name = "iris",
  data = iris,
  schema = iris_schema
)

Write

write_package() writes a package to disk as a datapackage.json file. This file includes the metadata of all the resources, including the schema. To directly write a schema to disk, use jsonlite::write_json().

Schema properties implementation

fields

fields is required. It is used by read_resource() to parse the names and data types of the columns in a tabular data file. create_schema() sets fields based on information in a data frame. See Field properties implementation for details.

missingValues

missingValues is used by read_resource() and defaults to "". It is passed to na in readr::read_delim(). create_schema() does not set missingValues. write_package() converts NA values to "" when writing a data frame to a CSV file. Since this is the default, no missingValues property is set.

primaryKey

primaryKey is ignored by read_resource() and not set by create_schema().

foreignKeys

foreignKeys is ignored by read_resource() and not set by create_schema().

Field properties implementation

name

name is used by read_resource() to assign a column name. The vector of names is passed as col_names to readr::read_delim(), ignoring names provided in the header of the data file. create_schema() uses the data frame column name to set name.

type and format

type and (for some types) format is used by read_resource() to understand the column type. The vector of types is passed as col_types to readr::read_delim(), which warns if there are parsing issues (inspect with problems()). create_schema() uses the data frame column type to set type. See Field types implementation for details.

read_resource() interprets type as follows:

field type column type
string character or factor
number double or factor
integer double or factor
boolean logical
object character
array character
datetime POSIXct
date Date
time hms::hms()
year Date
yearmonth Date
duration character
geopoint character
geojson character
any character
other value error
undefined guessed

create_schema() sets type as follows:

column type field type
character string
Date date
difftime number
factor string with factor levels as constraints$enum
hms::hms() time
integer integer
logical boolean
numeric number
POSIXct/POSIXlt datetime
any other type any

create_schema() does not set a format, since defaults are used for all types. This is also the case for datetimes, dates and times, since readr::write_csv() used by write_package() formats those to ISO8601, which is considered the default.

title

title is ignored by read_resource() and not set by create_schema().

description

description is ignored by read_resource() and not set by create_schema().

example

example is ignored by read_resource() and not set by create_schema().

constraints

constraints is ignored by read_resource() and not set by create_schema(), except for constraints$enum. read_resource() uses it set the column type to factor, with enum values as factor levels. create_schema() does the reverse.

rdfType

rdfType is ignored by read_resource() and not set by create_schema().

Field types implementation

string

string is interpreted as character. Or factor when constraints$enum is defined.

number

number is interpreted as double. Or factor when constraints$enum is defined.

integer

integer is interpreted as double (to avoid issues with big numbers). Or factor when constraints$enum is defined.

boolean

boolean is interpreted as logical.

object

object is interpreted as character

array

array is interpreted as character.

datetime

datetime is interpreted as POSIXct.

date

date is interpreted as Date.

time

time is interpreted as hms::hms().

year

year is interpreted as Date with month and day set to 01.

yearmonth

yearmonth is interpreted as Date with day set to 01.

duration

duration is interpreted as character. You can parse these values with lubridate::duration().

geopoint

geopoint is interpreted as character.

geojson

geojson is interpreted as character.

any

any is interpreted as character