---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(tidyverse)
```

# duckdb-mem

The goal of duckdb-mem is to analyze the memory usage of DuckDB.

## `dbWriteTable()`

The script `run_setup.R` runs variants of the code in `setup.R` (a sketch of the `manual` variant follows the list):

- duckdb: Baseline
- r: Without `dbWriteTable()`
- limited: With a duckdb memory limit of 10 MB
- limited_20: With a duckdb memory limit of 20 MB
- register: With `duckdb_register()` instead of `dbWriteTable()`
- manual: With `duckdb_register()` and `CREATE TABLE` instead of `dbWriteTable()`
- manual_limited: With `duckdb_register()`, `CREATE TABLE`, and a duckdb memory limit of 10 MB
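
For reference, here is a minimal sketch of what the `manual` variant might look like; the table names and the payload are assumptions for illustration, not taken from `setup.R`:

```{r manual-sketch, eval = FALSE}
# Hypothetical sketch of the "manual" workload: register the data frame as a
# virtual table, then materialize it with CREATE TABLE instead of
# dbWriteTable(). Names and data are illustrative only.
library(duckdb)

con <- dbConnect(duckdb())

data <- data.frame(x = runif(1e6))  # assumed payload

# Register the data frame without copying it into DuckDB
duckdb_register(con, "data_view", data)

# Materialize inside DuckDB, replacing dbWriteTable()
dbExecute(con, "CREATE TABLE data AS SELECT * FROM data_view")

dbDisconnect(con, shutdown = TRUE)
```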

```{r echo = FALSE, message = FALSE}
setup <- readRDS("setup.rds")

# Extract the "resident" line from each run's captured output and scale the
# reported value by 2^20 to get MB
resident_size <-
  setup |>
  mutate(res = map_chr(out, ~ grep("resident", .x, value = TRUE))) |>
  mutate(mem = map_dbl(res, ~ as.numeric(str_extract(.x, "\\d+")) / 2^20))

ggplot(resident_size, aes(x = n, y = mem, color = workload)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Resident memory usage",
    x = "Number of rows",
    y = "Resident memory (MB)"
  ) +
  theme_minimal()
```
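
The `out` column presumably holds each run's captured output, including a line containing the word "resident". As a sketch only, one way such a line could be produced on Linux (not necessarily how the scripts report it):

```{r rss-sketch, eval = FALSE}
# Hypothetical helper: read the current process's resident set size from
# /proc (Linux only). The actual scripts may report this differently.
resident_bytes <- function() {
  status <- readLines("/proc/self/status")
  rss <- grep("^VmRSS:", status, value = TRUE)  # e.g. "VmRSS:  123456 kB"
  as.numeric(gsub("[^0-9]", "", rss)) * 1024    # kB -> bytes
}

cat("resident:", resident_bytes(), "\n")
```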

### Linear model

```{r echo = FALSE}
lm(mem ~ workload, data = resident_size)
```
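
This model pools all row counts and only compares workload means. The per-workload slopes drawn in the plot correspond to an interaction model, shown here as a suggestion rather than part of the original analysis:

```{r eval = FALSE}
# Estimates a per-row memory slope for each workload, matching the
# per-group fits in the plot above
lm(mem ~ n * workload, data = resident_size)
```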

### Overhead

```{r echo = FALSE}
# Memory at the largest and smallest row counts, per workload
max <-
  resident_size |>
  filter(.by = workload, n == max(n)) |>
  select(workload, mem_max = mem)

min <-
  resident_size |>
  filter(.by = workload, n == min(n)) |>
  select(workload, mem_min = mem)

left_join(min, max, join_by(workload)) |>
  mutate(mem_delta = mem_max - mem_min) |>
  arrange(mem_delta) |>
  mutate(overhead = mem_delta / mem_delta[[1]])
```
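
Here, `mem_delta` is the memory growth between the smallest and largest row counts, and `overhead` expresses that growth relative to the leanest workload (the first row after sorting).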

### Conclusion

- Registering the data frame consumes a bit of memory, but not much.
- The `setup-manual.R` script is equivalent to `setup.R` in terms of memory usage, but uses lower-level functions than `dbWriteTable()`.
- The `CREATE TABLE` statement in `setup-manual.R` seems to be responsible for the memory overhead.
- Despite the 10 MB DuckDB memory limit in `setup-manual-limited.R`, the memory overhead exceeds 25 MB.

## `dbGetQuery()`

The script `run_read.R` runs variants of the code in `read.R` (a sketch of the memory limit and the lazy-table variant follows the list):

- duckdb: Baseline, `dbGetQuery()`
- limited: With a duckdb memory limit of 10 MB
- limited_20: With a duckdb memory limit of 20 MB
- limited_collect: With a duckdb memory limit of 10 MB, using `collect(n = n)`
- limited_collect_from: With a duckdb memory limit of 10 MB, using `tbl(con, "FROM data LIMIT ...") |> collect()`
- limited_n: With a duckdb memory limit of 10 MB, using `dbGetQuery(n = n)`
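
A minimal sketch of how the memory limit and the lazy-table variant might be set up; the connection details and the table contents are assumptions, not taken from `read.R`:

```{r read-sketch, eval = FALSE}
# Illustrative sketch only: a DuckDB connection capped at 10 MB, plus the
# two main ways of reading discussed above.
library(duckdb)
library(dplyr)

con <- dbConnect(duckdb(), config = list(memory_limit = "10MB"))

# Assumed payload so the sketch is self-contained
dbExecute(con, "CREATE TABLE data AS SELECT random() AS x FROM range(1000000)")

# Baseline: materialize the whole table at once
df <- dbGetQuery(con, "FROM data")

# Lazy-table variant: push the row limit into DuckDB before collecting
df_head <- tbl(con, sql("FROM data LIMIT 1000")) |> collect()

dbDisconnect(con, shutdown = TRUE)
```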

```{r read, echo = FALSE, message = FALSE}
read <- readRDS("read.rds")

# Same extraction as above: pull the "resident" line and scale to MB
resident_size <-
  read |>
  mutate(res = map_chr(out, ~ grep("resident", .x, value = TRUE))) |>
  mutate(mem = map_dbl(res, ~ as.numeric(str_extract(.x, "\\d+")) / 2^20))

ggplot(resident_size, aes(x = n, y = mem, color = workload)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Resident memory usage for reading",
    x = "Number of rows",
    y = "Resident memory (MB)"
  ) +
  theme_minimal()
```

### Linear model

```{r echo = FALSE}
lm(mem ~ workload, data = resident_size)
```

### Overhead

```{r echo = FALSE}
max <-
  resident_size |>
  filter(.by = workload, n == max(n)) |>
  select(workload, mem_max = mem)

min <-
  resident_size |>
  filter(.by = workload, n == min(n)) |>
  select(workload, mem_min = mem)

left_join(min, max, join_by(workload)) |>
  mutate(mem_delta = mem_max - mem_min) |>
  arrange(mem_delta) |>
  mutate(overhead = mem_delta / mem_delta[[1]])
```

### Conclusion

- The size of the data is about 48 MB, so the memory overhead is about twofold.
- `collect(n = n)` is poison, with far worse overhead, surpassed only by `dbGetQuery(n = n)` (which is very surprising).
- `tbl(con, "FROM data LIMIT ...") |> collect()` is the best option for a lazy table.
- Action items:
  - Understand the doubled memory usage in `dbGetQuery()`
  - Understand the behavior of `dbGetQuery(n = n)`
  - See if ALTREP or a different way of fetching partial results (e.g., in the C++ glue) can help; a chunked-fetch sketch follows
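
For the last action item, one alternative worth comparing is DBI's incremental fetch interface, sketched here with assumed table contents; whether duckdb streams these chunks or materializes the full result first is exactly the kind of question raised above:

```{r fetch-sketch, eval = FALSE}
# Illustrative sketch: fetch a result in fixed-size chunks via DBI instead
# of materializing it in one dbGetQuery() call.
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "CREATE TABLE data AS SELECT random() AS x FROM range(1000000)")

res <- dbSendQuery(con, "FROM data")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 10000)  # one chunk at a time
  # process `chunk` here; it can be garbage-collected before the next one
}
dbClearResult(res)

dbDisconnect(con, shutdown = TRUE)
```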