https://github.com/cynkra/historian

Three ways of storing temporal tables

---
output: github_document
---

# historian

```{r}
library(conflicted)
library(tidyverse)
conflicts_prefer(dplyr::filter)
```

## Problem

We want to keep track of the state of a table at different points in time.
The table has a primary key `id` and a column `x` that we want to keep track of.
The `id` column is essential to identify rows across different points in time, and the `x` column is a proxy for arbitrary payload data.
In this example, `V1` is the initial state of the table, `V2` is the state of the table after adding a row, `V3` is the state of the table after modifying a row, and `V4` is the state of the table after deleting a row.

```{r}
V1 <- tibble(id = 1L, x = letters[1])
V1

# Adding a row
V2 <- tibble(id = 1:2, x = letters[1:2])
V2

# Modifying a row
V3 <- tibble(id = 1:2, x = letters[3:2])
V3

# Deleting a row
V4 <- tibble(id = 2L, x = letters[2])
V4
```

## History (temporal) table

At each point in time, there is a table `H` that contains the history of the table `V` at that point in time.
The table `H` has columns `from` and `to` that define the time interval for which the row is valid.
The table `H` also contains the details from table `V` at that point in time.

```{r}
H0 <- tibble(from = integer(), to = integer(), V1[integer(), ])
H1 <- tibble(from = 1L, to = NA_integer_, V1)
H2 <- tibble(from = 1:2, to = NA_integer_, V2)
H3 <- tibble(
  from = 1:3,
  to = c(3L, NA_integer_, NA_integer_),
  bind_rows(V1[1, ], V3[2:1, ])
)
H4 <- tibble(
  from = 1:3,
  to = c(3L, NA_integer_, 4L),
  bind_rows(V1[1, ], V3[2:1, ])
)
```

`H4` is smaller than `V1`, `V2`, `V3`, and `V4` combined because we do not store the same data multiple times:

```{r}
nrow(H4)
nrow(V1) + nrow(V2) + nrow(V3) + nrow(V4)
```

With that, we can define a function `at_time()` that takes a history table and a point in time, and returns the observation table at that point in time.

```{r}
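# `H` is a history table (or a combination of observation and difference tables)
at_time <- function(H, time) {
  H |>
    # Keep rows whose validity interval [from, to) contains `time`;
    # a missing bound counts as open-ended
    filter(coalesce(from <= !!time, TRUE), coalesce(to > !!time, TRUE)) |>
    select(-from, -to) |>
    arrange(id)
}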

H1 |>
at_time(1) |>
waldo::compare(V1)

H2 |>
at_time(2) |>
waldo::compare(V2)

H2 |>
at_time(1) |>
waldo::compare(V1)

H3 |>
at_time(3) |>
waldo::compare(V3)

H3 |>
at_time(2) |>
waldo::compare(V2)

H3 |>
at_time(1) |>
waldo::compare(V1)

H4 |>
at_time(4) |>
waldo::compare(V4)

H4 |>
at_time(3) |>
waldo::compare(V3)

H4 |>
at_time(2) |>
waldo::compare(V2)

H4 |>
at_time(1) |>
waldo::compare(V1)
```
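
The comparisons above rely on the validity interval being half-open: a row is visible from `from` onwards and stops being visible exactly at `to`. The following sketch illustrates that boundary on a small table with the same content as `H4`. The helper `is_valid_at()` is a hypothetical name introduced only for this illustration, and unlike `at_time()` it does not treat a missing `from` as valid.

```{r}
library(dplyr)
library(tibble)

# Hypothetical helper (not part of the original code): TRUE if the half-open
# interval [from, to) contains `time`; a missing `to` means "still valid".
is_valid_at <- function(from, to, time) {
  from <= time & (is.na(to) | time < to)
}

# Same content as H4, written out so that the chunk stands on its own
H_demo <- tibble(
  from = c(1L, 2L, 3L),
  to = c(3L, NA, 4L),
  id = c(1L, 2L, 1L),
  x = c("a", "b", "c")
)

# At time 3, the row (id = 1, x = "a") has just expired,
# and the row (id = 1, x = "c") has taken over.
snapshot_3 <- H_demo |>
  filter(is_valid_at(from, to, 3L)) |>
  select(-from, -to) |>
  arrange(id)
snapshot_3
```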

## Decomposition

The history tables can be decomposed into two tables: `O` (observation) and `D` (difference).
The observation table contains the details from the history table at the point in time, and is identical to the data at that point in time, save for the `from` and `to` columns.
The difference table contains the changes that happened compared to the prior point in time.

Because we want to avoid storing the same data multiple times, we omit rows in a difference table that are identical to rows found in previous difference tables.

```{r}
O1 <- H1
O1 |>
select(-from, -to) |>
waldo::compare(V1)

D1 <- H2[0, ]

O2 <- H2
O2 |>
select(-from, -to) |>
waldo::compare(V2)

D2 <- H3[1, ]

O3 <- H3[3:2, ]
O3 |>
select(-from, -to) |>
waldo::compare(V3)

# This does not contain H4[1, ], on purpose:
D3 <- H4[3, ]

O4 <- H4[2, ]

O4 |>
select(-from, -to) |>
waldo::compare(V4)
```

Binding an observation table with the difference tables that precede it gives exactly the history table at that point in time.

```{r}
bind_rows(O1) |>
arrange(from, id) |>
waldo::compare(H1)

bind_rows(O2, D1) |>
arrange(from, id) |>
waldo::compare(H2)

bind_rows(O3, D2, D1) |>
arrange(from, id) |>
waldo::compare(H3)

bind_rows(O4, D3, D2, D1) |>
arrange(from, id) |>
waldo::compare(H4)
```

Therefore, the `at_time()` function also works when combining an observation table with difference tables.

```{r}
O4 |>
at_time(4)

bind_rows(O4, D3) |>
at_time(3)

bind_rows(O4, D3, D2) |>
at_time(2)

bind_rows(O4, D3, D2, D1) |>
at_time(1)
```

## Generalization

Because the observation and difference tables together contain every row of the history table, a subset of them already suffices for recent points in time: combining, e.g., one observation table and two difference tables allows reconstructing the original data for the three most recent points in time.

```{r}
bind_rows(O4, D3, D2, D1) |>
at_time(1) |>
waldo::compare(V1)

bind_rows(O4, D3, D2, D1) |>
at_time(2) |>
waldo::compare(V2)

bind_rows(O4, D3, D2) |>
at_time(2) |>
waldo::compare(V2)

bind_rows(O4, D3, D2) |>
at_time(3) |>
waldo::compare(V3)

bind_rows(O4, D3) |>
at_time(3) |>
waldo::compare(V3)

bind_rows(O4, D3) |>
at_time(4) |>
waldo::compare(V4)
```

## Updating observation and difference tables

How to construct `O4` and `D3` from `O3`, `D2`, `D1`, and `V4`?
Same question for constructing `O3` and `D2` from `O2`, `D1`, and `V3`?
Or for constructing `O2` and `D1` from `O1` and `V2`?
Or for the initialization, constructing `O1` from `V1`?

We know that we can reconstruct the history table from the observation and difference tables.
This then boils down to the question of how to construct `O4` and `D3` from `O3`, `H3`, and `V4`.

```{r}
O4
D3
O3
H3
V4
```

We know how to extract `V3` from `O3`:

```{r}
O3 |>
at_time(3) |>
waldo::compare(V3)
```

We can then compute the new or updated rows and the deleted rows.
We also define `V0` and `O0` as the empty tables.

```{r}
V0 <- V1[0, ]
O0 <- O1[0, ]

compute_diff <- function(old, new, time) {
  # Contains both new and updated rows
  P <-
    new |>
    anti_join(old, by = names(new)) |>
    mutate(from = as.integer(!!time), to = NA_integer_, .before = 1)

  # The id values of the deleted rows
  M <-
    old |>
    anti_join(new, by = "id") |>
    select(id)

  # The id values of the changed (new, updated, or deleted) rows
  PM <-
    P |>
    select(id) |>
    bind_rows(M)

  list(P = P, M = M, PM = PM)
}

X4 <- compute_diff(H3, V4, 4)
X4

X3 <- compute_diff(H2, V3, 3)
X3

X2 <- compute_diff(H1, V2, 2)
X2

X1 <- compute_diff(H0, V1, 1)
X1
```

The observation table is the same as the new table with `from` and `to` set to the relevant points in time.
For new and updated rows, `from` is set to the current point in time; otherwise, the point in time from the old observation table is used.
The `to` column is always set to missing.
Deleted rows must be removed from the observation table.

```{r}
X4$P |>
bind_rows(O3) |>
distinct(id, .keep_all = TRUE) |>
anti_join(X4$M, by = "id") |>
arrange(id) |>
waldo::compare(O4)

X3$P |>
bind_rows(O2) |>
distinct(id, .keep_all = TRUE) |>
anti_join(X3$M, by = "id") |>
arrange(id) |>
waldo::compare(O3)

X2$P |>
bind_rows(O1) |>
distinct(id, .keep_all = TRUE) |>
anti_join(X2$M, by = "id") |>
arrange(id) |>
waldo::compare(O2)

X1$P |>
bind_rows(O0) |>
distinct(id, .keep_all = TRUE) |>
anti_join(X1$M, by = "id") |>
arrange(id) |>
waldo::compare(O1)
```
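
The four pipelines above share one shape, so they can be folded into a helper. The following is a minimal sketch rather than part of the original code: `update_observation()` is a hypothetical name, and it takes a slightly different route than the pipelines above. It compares the delivery against the previous observation table only (not the full history table passed to `compute_diff()`), and keeps unchanged rows with a `semi_join()` on all payload columns, so that deleted and updated rows drop out in one step.

```{r}
library(dplyr)
library(tibble)

# Hypothetical helper: derive the next observation table from the previous
# observation table and a new delivery.
update_observation <- function(O_old, V_new, time) {
  # New or updated rows start their validity at `time`
  P <- V_new |>
    anti_join(O_old, by = names(V_new)) |>
    mutate(from = as.integer(!!time), to = NA_integer_, .before = 1)

  # Unchanged rows keep their original `from`; deleted and updated rows drop out
  kept <- O_old |>
    semi_join(V_new, by = names(V_new))

  bind_rows(P, kept) |>
    arrange(id)
}

# Same content as O3 and V4, written out so that the chunk stands on its own
O_prev <- tibble(from = c(3L, 2L), to = NA_integer_, id = 1:2, x = c("c", "b"))
V_next <- tibble(id = 2L, x = "b")

O_next <- update_observation(O_prev, V_next, 4)
O_next
```

With these inputs the result has the same content as `O4`: the row for `id = 2` survives with its original `from`.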

The new difference table consists of the most recent history row for each changed `id`, with `to` set to the current point in time.

```{r}
H3 |>
semi_join(X4$PM, by = "id") |>
filter(.by = id, row_number(from) == n()) |>
mutate(to = 4L) |>
waldo::compare(D3)

H2 |>
semi_join(X3$PM, by = "id") |>
filter(.by = id, row_number(from) == n()) |>
mutate(to = 3L) |>
waldo::compare(D2)

H1 |>
semi_join(X2$PM, by = "id") |>
filter(.by = id, row_number(from) == n()) |>
mutate(to = 2L) |>
waldo::compare(D1)
```
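
Since the observation table already holds the most recent row for each `id`, the same difference table can presumably be derived without the full history table. The sketch below works under that assumption. `new_difference()` is a hypothetical helper, not part of the original code: it selects the rows of the previous observation table that no longer appear unchanged in the new delivery (deleted or superseded rows) and closes their validity interval.

```{r}
library(dplyr)
library(tibble)

# Hypothetical helper: rows of the previous observation table that are
# deleted or superseded by the new delivery, with `to` set to `time`.
new_difference <- function(O_old, V_new, time) {
  O_old |>
    anti_join(V_new, by = names(V_new)) |>
    mutate(to = as.integer(!!time))
}

# Same content as O3 and V4, written out so that the chunk stands on its own
O_prev <- tibble(from = c(3L, 2L), to = NA_integer_, id = 1:2, x = c("c", "b"))
V_next <- tibble(id = 2L, x = "b")

D_next <- new_difference(O_prev, V_next, 4)
D_next
```

For these inputs, `D_next` has the same content as `D3`: the row for `id = 1` with its validity ending at time 4.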

The first observation table is the same as the first table with `from` set to the first point in time and `to` set to missing.

```{r}
V1 |>
mutate(from = 1L, to = NA_integer_, .before = 1) |>
waldo::compare(O1)
```

This defines a process for efficiently maintaining the observation and difference tables as new data arrives.

## Maintaining an inline history table

The approach above is useful if the data is stored in multiple flat files.
Given `H3` and `V4`, how to update `H3` in the most efficient way so that it becomes `H4`?
Can we use a variant of `compute_diff()` and a combination of `rows_append()`, `rows_update()`, `rows_upsert()` and/or `rows_delete()` for this task?

```{r}
compute_diff_history <- function(old, new, time) {
  # Contains both new and updated rows
  P <-
    new |>
    anti_join(old, by = names(new)) |>
    mutate(from = as.integer(!!time), to = NA_integer_, .before = 1)

  # The most recent history row for each id
  last <-
    old |>
    select(id, from) |>
    arrange(from) |>
    filter(.by = id, row_number(from) == n())

  deleted <- anti_join(last, new, by = "id")

  changed <- semi_join(last, P, by = "id")

  # The keys (id, from) of the rows whose validity ends now,
  # because the row was deleted or superseded by an update
  C <-
    bind_rows(deleted, changed) |>
    mutate(to = as.integer(!!time))

  list(P = P, C = C)
}

Y4 <- compute_diff_history(H3, V4, 4)
Y4

H3 |>
rows_patch(Y4$C, by = c("id", "from")) |>
rows_append(Y4$P) |>
waldo::compare(H4)

Y3 <- compute_diff_history(H2, V3, 3)
Y3

H2 |>
rows_patch(Y3$C, by = c("id", "from")) |>
rows_append(Y3$P) |>
waldo::compare(H3)

Y2 <- compute_diff_history(H1, V2, 2)
Y2

H1 |>
rows_patch(Y2$C, by = c("id", "from")) |>
rows_append(Y2$P) |>
waldo::compare(H2)

Y1 <- compute_diff_history(H0, V1, 1)
Y1

H0 |>
rows_patch(Y1$C, by = c("id", "from")) |>
rows_append(Y1$P) |>
waldo::compare(H1)

Y0 <- compute_diff_history(H0, V0, 0)
```

Because `rows_patch()` and `rows_append()` work on data frames and databases alike, and can persist the changes to a database with `in_place = TRUE`, the approach above defines a process for efficiently maintaining the history table as new data arrives.
Using a single `rows_upsert()` call would also work, but it is worse because it would rewrite the payload of old rows instead of only their `to` value.
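
The patch-then-append step can be wrapped into one function and applied delivery by delivery. The following is a sketch under stated assumptions, not the original implementation: `update_history()` is a hypothetical name, and it identifies the currently valid rows by a missing `to` rather than by the last `from` per `id` used in `compute_diff_history()`. It runs on in-memory data frames only and does not pass `in_place = TRUE`.

```{r}
library(dplyr)
library(tibble)

# Hypothetical helper: bring a history table up to date with a new delivery.
update_history <- function(H, V_new, time) {
  time <- as.integer(time)

  # Assumption: rows with a missing `to` are the currently valid ones
  current <- filter(H, is.na(to))

  # Keys of rows that were deleted or superseded: their validity ends now
  closed <- current |>
    anti_join(V_new, by = names(V_new)) |>
    select(id, from) |>
    mutate(to = !!time)

  # New or updated rows: their validity starts now
  appended <- V_new |>
    anti_join(current, by = names(V_new)) |>
    mutate(from = !!time, to = NA_integer_, .before = 1)

  H |>
    rows_patch(closed, by = c("id", "from")) |>
    rows_append(appended)
}

# Same content as V1 to V4, written out so that the chunk stands on its own
deliveries <- list(
  tibble(id = 1L, x = "a"),
  tibble(id = 1:2, x = c("a", "b")),
  tibble(id = 1:2, x = c("c", "b")),
  tibble(id = 2L, x = "b")
)

H_run <- tibble(from = integer(), to = integer(), id = integer(), x = character())
for (i in seq_along(deliveries)) {
  H_run <- update_history(H_run, deliveries[[i]], i)
}
H_run
```

After the fourth delivery, `H_run` has the same content as `H4`. Only the `to` column of existing rows is ever patched; the payload `x` is written once, when a row is appended.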

## Data that is changing for each data delivery

The example above assumes that only a few rows change with each data delivery.
In real-world datasets, situations can occur where a few columns are changing for each data delivery across the entire dataset.
In this case, no compression can be achieved by storing only the changed rows.
A viable solution is to store the ever-changing columns in a separate table and join them with the history table when needed.
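
One possible layout for this split is sketched below. The column `updated_at`, the table names, and the delivery time `4L` are all hypothetical and chosen only for illustration. The stable columns go through the history-table maintenance as before, while the volatile columns are stored once per delivery, keyed by `id` and the delivery time.

```{r}
library(dplyr)
library(tibble)

# Hypothetical delivery: `x` is stable payload, `updated_at` changes every time
delivery <- tibble(
  id = 1:3,
  x = c("a", "b", "c"),
  updated_at = c(101L, 102L, 103L)
)

volatile_cols <- "updated_at"

# Stable part: this is what the history table would track
stable <- delivery |>
  select(-all_of(volatile_cols))

# Volatile part: stored in full for each delivery, tagged with the delivery time
volatile <- delivery |>
  select(id, all_of(volatile_cols)) |>
  mutate(time = 4L, .before = 1)

# Joining both parts for a given time restores the full delivery
restored <- stable |>
  left_join(filter(volatile, time == 4L), by = "id") |>
  select(-time)
restored
```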

## Conclusion

The naive approach to maintaining different versions of a table is to store the entire table for each version (`V#` in our example).
This is inefficient in terms of storage but offers the best performance for querying.

A good compromise is to maintain a history or temporal table (`H#` in our example).
This requires each row to be identified by a unique identifier (the `id` column).
The `id` column can be an integer, a GUID, or any other unique identifier.
Composite keys are also possible.
A temporal table contains two extra columns, `from` and `to`, that define the time period during which a row is valid.
These columns can be of any ordered type, such as integers, dates, or timestamps.
The `at_time()` function provides a way to query such a table at a specific point in time.
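
As a small illustration of a non-integer time axis, the sketch below uses `Date` columns for `from` and `to`. The dates are made up for this example, and the filter is written inline so that the chunk stands on its own; it applies the same half-open interval logic as `at_time()`.

```{r}
library(dplyr)
library(tibble)

# Hypothetical history table with dates as validity bounds
H_dates <- tibble(
  from = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01")),
  to = as.Date(c("2024-03-01", NA, "2024-04-01")),
  id = c(1L, 2L, 1L),
  x = c("a", "b", "c")
)

as_of <- as.Date("2024-02-15")

# Rows valid on `as_of`: started on or before it, and not yet ended
snapshot <- H_dates |>
  filter(from <= as_of, is.na(to) | to > as_of) |>
  select(-from, -to) |>
  arrange(id)
snapshot
```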

The maintenance of a temporal table as new data arrives differs slightly depending on the storage medium, because flat files and databases have different trade-offs.
Flat files are easy to work with but require the entire table to be read and written.
To maintain efficiency, the history table can be split into observation (`O#`) and difference (`D#`) tables.
In contrast, a database table can be changed in-place but requires the changesets to be specified in bulk for efficiency.

For flat files, the `compute_diff()` function provides a way to efficiently maintain the observation and difference tables as new data arrives.
For each new data delivery, only the most recent observation table must be replaced with a new difference table, and the new delivery essentially becomes the new observation table.

The `compute_diff_history()` function provides a way to efficiently maintain a history table in a database as new data arrives.
It specifies precisely the rows to be updated and appended.
For updated rows, the payload `x` is never touched.