Table Schema

Table Schema is a simple format to describe tabular data, including field names, types, constraints, missing values, foreign keys, etc.

General implementation

Frictionless supports schema$fields and schema$missingValues to parse data types and missing values when reading Tabular Data Resources. Schema manipulation is limited to extracting a schema from a resource, creating one from a data frame, and providing one back to a resource. Schema metadata is including when writing a package.

Read

get_schema() extracts the schema from a resource:

library(frictionless)
package <- example_package()

# Get the Table Schema for the resource "observations"
schema <- get_schema(package, "observations")
str(schema)
#> List of 4
#>  $ fields       :List of 7
#>   ..$ :List of 3
#>   .. ..$ name       : chr "observation_id"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 2
#>   .. .. ..$ required: logi TRUE
#>   .. .. ..$ unique  : logi TRUE
#>   ..$ :List of 3
#>   .. ..$ name       : chr "deployment_id"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ required: logi TRUE
#>   ..$ :List of 4
#>   .. ..$ name       : chr "timestamp"
#>   .. ..$ type       : chr "datetime"
#>   .. ..$ format     : chr "%Y-%m-%dT%H:%M:%S%z"
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ required: logi TRUE
#>   ..$ :List of 3
#>   .. ..$ name       : chr "scientific_name"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ required: logi FALSE
#>   ..$ :List of 3
#>   .. ..$ name       : chr "count"
#>   .. ..$ type       : chr "integer"
#>   .. ..$ constraints:List of 2
#>   .. .. ..$ required: logi FALSE
#>   .. .. ..$ minimum : int 1
#>   ..$ :List of 3
#>   .. ..$ name       : chr "life_stage"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 2
#>   .. .. ..$ required: logi FALSE
#>   .. .. ..$ enum    : chr [1:5] "adult" "subadult" "juvenile" "offspring" ...
#>   ..$ :List of 3
#>   .. ..$ name       : chr "comments"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ required: logi FALSE
#>  $ missingValues: chr [1:3] "" "NA" "NaN"
#>  $ primaryKey   : chr "observation_id"
#>  $ foreignKeys  :List of 1
#>   ..$ :List of 2
#>   .. ..$ fields   : chr "deployment_id"
#>   .. ..$ reference:List of 2
#>   .. .. ..$ resource: chr "deployments"
#>   .. .. ..$ fields  : chr "deployment_id"

read_resource() uses schema$fields to parse the names and data types of the columns in a tabular data file. For example, the third field in the schema (timestamp) is defined as a datetime type with a specific format:

str(schema$fields[[3]])
#> List of 4
#>  $ name       : chr "timestamp"
#>  $ type       : chr "datetime"
#>  $ format     : chr "%Y-%m-%dT%H:%M:%S%z"
#>  $ constraints:List of 1
#>   ..$ required: logi TRUE

read_resource() uses that information to correctly parse the data type and to assign the name timestamp to the column:

observations <- read_resource(package, "observations")
observations$timestamp
#> [1] "2020-09-28 00:13:07 UTC" "2020-09-28 15:59:17 UTC"
#> [3] "2020-09-28 16:35:23 UTC" "2020-09-28 17:04:04 UTC"
#> [5] "2020-09-28 19:19:54 UTC" "2021-10-01 01:25:06 UTC"
#> [7] "2021-10-01 01:25:06 UTC" "2021-10-01 04:47:30 UTC"

The sixth field life_stage has an enum defined as one of its constraints:

str(schema$fields[[6]])
#> List of 3
#>  $ name       : chr "life_stage"
#>  $ type       : chr "string"
#>  $ constraints:List of 2
#>   ..$ required: logi FALSE
#>   ..$ enum    : chr [1:5] "adult" "subadult" "juvenile" "offspring" ...

read_resource() uses that information to parse the column as a factor, using enum as the factor levels:

class(observations$life_stage)
#> [1] "factor"
levels(observations$life_stage)
#> [1] "adult"     "subadult"  "juvenile"  "offspring" "unknown"

Manipulate

A schema is a list which you can manipulate, but frictionless does not provide functions to do that. Use {purrr} or base R instead (see vignette("frictionless")). You do not have to start a schema from scratch though: use get_schema() (see above) or create_schema() instead.

create_schema() creates a schema from a data frame and defines the name, type (and if a factor constraints$enum) for each field:

# Create a schema from the built-in dataset "iris"
iris_schema <- create_schema(iris)
str(iris_schema)
#> List of 1
#>  $ fields:List of 5
#>   ..$ :List of 2
#>   .. ..$ name: chr "Sepal.Length"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Sepal.Width"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Petal.Length"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 2
#>   .. ..$ name: chr "Petal.Width"
#>   .. ..$ type: chr "number"
#>   ..$ :List of 3
#>   .. ..$ name       : chr "Species"
#>   .. ..$ type       : chr "string"
#>   .. ..$ constraints:List of 1
#>   .. .. ..$ enum: chr [1:3] "setosa" "versicolor" "virginica"

add_resource() allows to include the schema with a resource. If no schema is provided, one is created with create_schema():

package <- add_resource(
  package,
  resource_name = "iris",
  data = iris,
  schema = iris_schema
)

Write

write_package() writes a package to disk as a datapackage.json file. This file includes the metadata of all the resources, including the schema. To directly write a schema to disk, use jsonlite::write_json().

Schema properties implementation

fields

fields is required. It is used by read_resource() to parse the names and data types of the columns in a tabular data file. create_schema() sets fields based on information in a data frame. See Field properties implementation for details.

missingValues

missingValues is used by read_resource() and defaults to "". It is passed to na in readr::read_delim(). create_schema() does not set missingValues. write_package() converts NA values to "" when writing a data frame to a CSV file. Since this is the default, no missingValues property is set.

primaryKey

primaryKey is ignored by read_resource() and not set by create_schema().

foreignKeys

foreignKeys is ignored by read_resource() and not set by create_schema().

Field properties implementation

name

name is used by read_resource() to assign a column name. The vector of names is passed as col_names to readr::read_delim(), ignoring names provided in the header of the data file. create_schema() uses the data frame column name to set name.

type and format

type and (for some types) format is used by read_resource() to understand the column type. The vector of types is passed as col_types to readr::read_delim(), which warns if there are parsing issues (inspect with problems()). create_schema() uses the data frame column type to set type. See Field types implementation for details.

read_resource() interprets type as follows:

field type	column type
`string`	`character` or `factor`
`number`	`double` or `factor`
`integer`	`double` or `factor`
`boolean`	`logical`
`object`	`character`
`array`	`character`
`datetime`	`POSIXct`
`date`	`Date`
`time`	`hms::hms()`
`year`	`Date`
`yearmonth`	`Date`
`duration`	`character`
`geopoint`	`character`
`geojson`	`character`
`any`	`character`
other value	error
undefined	guessed

create_schema() sets type as follows:

column type	field type
`character`	`string`
`Date`	`date`
`difftime`	`number`
`factor`	`string` with factor levels as `constraints$enum`
`hms::hms()`	`time`
`integer`	`integer`
`logical`	`boolean`
`numeric`	`number`
`POSIXct`/`POSIXlt`	`datetime`
any other type	`any`

create_schema() does not set a format, since defaults are used for all types. This is also the case for datetimes, dates and times, since readr::write_csv() used by write_package() formats those to ISO8601, which is considered the default.

title

title is ignored by read_resource() and not set by create_schema().

description

description is ignored by read_resource() and not set by create_schema().

example

example is ignored by read_resource() and not set by create_schema().

constraints

constraints is ignored by read_resource() and not set by create_schema(), except for constraints$enum. read_resource() uses it set the column type to factor, with enum values as factor levels. create_schema() does the reverse.

rdfType

rdfType is ignored by read_resource() and not set by create_schema().

Field types implementation

string

string is interpreted as character. Or factor when constraints$enum is defined.

format is ignored.

number

number is interpreted as double. Or factor when constraints$enum is defined.

bareNumber is supported. If false, whitespace and non-numeric characters are ignored.
decimalChar (. by default) is supported, but as a single value for all number fields. If different values are defined, the most occurring one is selected.
groupChar (undefined by default) is supported, but as a single value for all number fields. If different values are defined, the most occurring one is selected.

integer

integer is interpreted as double (to avoid issues with big numbers). Or factor when constraints$enum is defined.

bareNumber is supported. If false, whitespace and non-numeric characters are ignored.

boolean

boolean is interpreted as logical.

trueValues that are not defaults are not supported.
falseValues that are not defaults are not supported.

object

object is interpreted as character

array

array is interpreted as character.

datetime

datetime is interpreted as POSIXct.

format is supported for the values default (ISO datetime), any (ISO datetime) and the same patterns as for date and time. The value %c is not supported.

date

date is interpreted as Date.

format is supported for the values default (ISO date), any (guess ymd) and Python/C strptime patterns, such as %a, %d %B %Y for Sat, 23 November 2013. %x is interpreted as %m/%d/%y. The values %j, %U, %w and %W are not supported.

time

time is interpreted as hms::hms().

format is supported for the values default (ISO time), any (guess hms) and Python/C strptime patterns, such as %I%p%M:%S.%f%z for 8AM30:00.300+0200.

year

year is interpreted as Date with month and day set to 01.

yearmonth

yearmonth is interpreted as Date with day set to 01.

duration

duration is interpreted as character. You can parse these values with lubridate::duration().

geopoint

geopoint is interpreted as character.

geojson

geojson is interpreted as character.

any

any is interpreted as character