---
title: "Reading REDATAM files in R"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{reading-redatam-in-r}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Please read our article for the full context of this project (Open Access):

> Vargas SepÃºlveda, Mauricio and Barkai, Lital. 2025. "The REDATAM format and
  its challenges for data access and information creation in public policy."
  *Data & Policy* 7 (January): e18. [https://dx.doi.org/10.1017/dap.2025.4](https://dx.doi.org/10.1017/dap.2025.4). 

We start by downloading the Chilean Census 2017 from ECLAC website (link valid
as of 2024-10-01):

```{r download, eval = FALSE}
url <- "https://redatam.org/cdr/descargas/censos/poblacion/CP2017CHL.zip"
zip <- "CP2017CHL.zip"

if (!file.exists(zip)) {
  download.file(url, zip, method = "wget")
}
```

Now we can extract the files:

```{r extract, eval = FALSE}
# install.packages("archive")
dout <- basename(zip)
dout <- sub("\\.zip$", "", dout)
archive::archive_extract(zip, dir = dout)
```

You can use `unzip()` from base R, but this file in particular gave me an
error. The `archive` package is a wrapper around 'libarchive' that provides
multi-format archive and compression support.

The REDATAM files are now stored in the `CP2017CHL` directory. We can read the
REDATAM dictionary file (DIC or DICX for this particular case):

```{r read_dic, eval = FALSE}
library(redatam)

fout <- "chile2017.rds"

if (!file.exists(fout)) {
  chile2017 <- read_redatam("CP2017CHL/BaseOrg16/CPV2017-16.dicx")
  saveRDS(chile2017, fout)
} else {
  chile2017 <- readRDS(fout)
}
```

One of the many possibilities with this census is to obtain the number of houses
with overcrowding. For this, the Secretary for Social Development and Family
(Ministerio de Desarrollo Social y Familia) divides the number of people
residing in a dwelling and the number of bedrooms in the dwelling, with the
special case of adding one to studio apartments and similar units, and the
result is discretized as follows.

* No overcrowding $[0,2.5)$
* Mean $[2.5,3.5)$
* High $[3.5,4.9)$
* Critical $[5,Inf)$

According to the census documentation in the previous ZIP file, this consists in
dividing the variables `cant_pers` and `p04` from the `vivienda` (housing) table
to then discretize the result. The documentation also states that we must join
the `vivienda` table with `zonaloc` (zones), `area`, `distrito` (district) and
`communa` (municipality) to match each house with its corresponding
municipality. This can be done with `dplyr`:

```{r, eval = FALSE}
library(dplyr)

overcrowding <- chile2017$comuna %>%
  select(ncomuna, comuna_ref_id) %>%
  inner_join(
    chile2017$distrito %>%
    select(distrito_ref_id, comuna_ref_id)
  ) %>%
  inner_join(
    chile2017$area %>%
      select(area_ref_id, distrito_ref_id)
  ) %>%
  inner_join(
    chile2017$zonaloc %>%
      select(zonaloc_ref_id, area_ref_id)
  ) %>%
  inner_join(
    chile2017$vivienda %>%
      select(zonaloc_ref_id, vivienda_ref_id, cant_per, p04) %>%
      mutate(
        p04 = case_when(
          p04 == 98 ~ NA_integer_,
          p04 == 99 ~ NA_integer_,
          TRUE ~ p04
        )
      ) %>%
      filter(!is.na(p04))
  ) %>%  
  mutate(
    overcrowding = case_when(
      p04 >=1 ~ cant_per / p04,
      p04 ==0 ~ cant_per / (p04 + 1)
    )
  ) %>% 
  mutate(
    overcrowding_discrete = case_when(
      overcrowding  < 2.5                      ~ "No Overcrowding",
      overcrowding >= 2.5 & overcrowding < 3.5 ~ "Mean",
      overcrowding >= 3.5 & overcrowding < 5   ~ "High",
      overcrowding >= 5                        ~ "Critical"
    )
  ) %>%
  group_by(comuna = ncomuna, overcrowding_discrete) %>%
  count()
```

Now we can filter for any municipality of our interest, for example:

```{r metropolitana, eval = FALSE}
overcrowding %>%
  filter(comuna == "VITACURA")

overcrowding %>%
  filter(comuna == "LA PINTANA")
```

```
# A tibble: 4 x 3
# Groups:   comuna, overcrowding_discrete [4]
  comuna   overcrowding_discrete     n
  <fct>    <chr>                 <int>
1 VITACURA Critical                  9
2 VITACURA High                     18
3 VITACURA Mean                    174
4 VITACURA No Overcrowding       26752

# A tibble: 4 x 3
# Groups:   comuna, overcrowding_discrete [4]
  comuna     overcrowding_discrete     n
  <fct>      <chr>                 <int>
1 LA PINTANA Critical                497
2 LA PINTANA High                   1112
3 LA PINTANA Mean                   4522
4 LA PINTANA No Overcrowding       39163
```