---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(tidyverse)
```

# duckdb-mem

The goal of duckdb-mem is to analyze the memory usage of DuckDB.

## `dbWriteTable()`

The script `run_setup.R` runs variants of the code in `setup.R` (a sketch of the `manual` variant follows the list):

- duckdb: Baseline
- r: Without `dbWriteTable()`
- limited: With a duckdb memory limit of 10 MB
- limited_20: With a duckdb memory limit of 20 MB
- register: With `duckdb_register()` instead of `dbWriteTable()`
- manual: With `duckdb_register()` and `CREATE TABLE` instead of `dbWriteTable()`
- manual_limited: With `duckdb_register()`, `CREATE TABLE`, and a duckdb memory limit of 10 MB
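
For reference, here is a minimal sketch of what the `manual` variant might look like; the table names and the payload are assumptions for illustration, not taken from `setup.R`:

```{r manual-sketch, eval = FALSE}
# Hypothetical sketch of the "manual" workload: register the data frame as a
# virtual table, then materialize it with CREATE TABLE instead of
# dbWriteTable(). Names and data are illustrative only.
library(duckdb)

con <- dbConnect(duckdb())

data <- data.frame(x = runif(1e6))  # assumed payload

# Register the data frame without copying it into DuckDB
duckdb_register(con, "data_view", data)

# Materialize inside DuckDB, replacing dbWriteTable()
dbExecute(con, "CREATE TABLE data AS SELECT * FROM data_view")

dbDisconnect(con, shutdown = TRUE)
```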

```{r echo = FALSE, message = FALSE}
setup <- readRDS("setup.rds")

# Extract the "resident" line from each run's captured output and scale the
# reported value by 2^20 to get MB
resident_size <-
  setup |>
  mutate(res = map_chr(out, ~ grep("resident", .x, value = TRUE))) |>
  mutate(mem = map_dbl(res, ~ as.numeric(str_extract(.x, "\\d+")) / 2^20))

ggplot(resident_size, aes(x = n, y = mem, color = workload)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Resident memory usage",
    x = "Number of rows",
    y = "Resident memory (MB)"
  ) +
  theme_minimal()
```
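
The `out` column presumably holds each run's captured output, including a line containing the word "resident". As a sketch only, one way such a line could be produced on Linux (not necessarily how the scripts report it):

```{r rss-sketch, eval = FALSE}
# Hypothetical helper: read the current process's resident set size from
# /proc (Linux only). The actual scripts may report this differently.
resident_bytes <- function() {
  status <- readLines("/proc/self/status")
  rss <- grep("^VmRSS:", status, value = TRUE)  # e.g. "VmRSS:  123456 kB"
  as.numeric(gsub("[^0-9]", "", rss)) * 1024    # kB -> bytes
}

cat("resident:", resident_bytes(), "\n")
```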

### Linear model

```{r echo = FALSE}
lm(mem ~ workload, data = resident_size)
```
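
This model pools all row counts and only compares workload means. The per-workload slopes drawn in the plot correspond to an interaction model, shown here as a suggestion rather than part of the original analysis:

```{r eval = FALSE}
# Estimates a per-row memory slope for each workload, matching the
# per-group fits in the plot above
lm(mem ~ n * workload, data = resident_size)
```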

### Overhead

```{r echo = FALSE}
# Memory at the largest and smallest row counts, per workload
max <-
  resident_size |>
  filter(.by = workload, n == max(n)) |>
  select(workload, mem_max = mem)

min <-
  resident_size |>
  filter(.by = workload, n == min(n)) |>
  select(workload, mem_min = mem)

left_join(min, max, join_by(workload)) |>
  mutate(mem_delta = mem_max - mem_min) |>
  arrange(mem_delta) |>
  mutate(overhead = mem_delta / mem_delta[[1]])
```
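
Here, `mem_delta` is the memory growth between the smallest and largest row counts, and `overhead` expresses that growth relative to the leanest workload (the first row after sorting).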

### Conclusion

- Registering the data frame consumes a bit of memory, but not much.
- The `setup-manual.R` script is equivalent to `setup.R` in terms of memory usage, but uses lower-level functions than `dbWriteTable()`.
- The `CREATE TABLE` statement in `setup-manual.R` seems to be responsible for the memory overhead.
- Despite the 10 MB DuckDB memory limit in `setup-manual-limited.R`, the memory overhead exceeds 25 MB.

## `dbGetQuery()`

The script `run_read.R` runs variants of the code in `read.R` (a sketch of the memory limit and the lazy-table variant follows the list):

- duckdb: Baseline, `dbGetQuery()`
- limited: With a duckdb memory limit of 10 MB
- limited_20: With a duckdb memory limit of 20 MB
- limited_collect: With a duckdb memory limit of 10 MB, using `collect(n = n)`
- limited_collect_from: With a duckdb memory limit of 10 MB, using `tbl(con, "FROM data LIMIT ...") |> collect()`
- limited_n: With a duckdb memory limit of 10 MB, using `dbGetQuery(n = n)`
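
A minimal sketch of how the memory limit and the lazy-table variant might be set up; the connection details and the table contents are assumptions, not taken from `read.R`:

```{r read-sketch, eval = FALSE}
# Illustrative sketch only: a DuckDB connection capped at 10 MB, plus the
# two main ways of reading discussed above.
library(duckdb)
library(dplyr)

con <- dbConnect(duckdb(), config = list(memory_limit = "10MB"))

# Assumed payload so the sketch is self-contained
dbExecute(con, "CREATE TABLE data AS SELECT random() AS x FROM range(1000000)")

# Baseline: materialize the whole table at once
df <- dbGetQuery(con, "FROM data")

# Lazy-table variant: push the row limit into DuckDB before collecting
df_head <- tbl(con, sql("FROM data LIMIT 1000")) |> collect()

dbDisconnect(con, shutdown = TRUE)
```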

```{r read, echo = FALSE, message = FALSE}
read <- readRDS("read.rds")

# Same extraction as above: pull the "resident" line and scale to MB
resident_size <-
  read |>
  mutate(res = map_chr(out, ~ grep("resident", .x, value = TRUE))) |>
  mutate(mem = map_dbl(res, ~ as.numeric(str_extract(.x, "\\d+")) / 2^20))

ggplot(resident_size, aes(x = n, y = mem, color = workload)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Resident memory usage for reading",
    x = "Number of rows",
    y = "Resident memory (MB)"
  ) +
  theme_minimal()
```

### Linear model

```{r echo = FALSE}
lm(mem ~ workload, data = resident_size)
```

### Overhead

```{r echo = FALSE}
max <-
  resident_size |>
  filter(.by = workload, n == max(n)) |>
  select(workload, mem_max = mem)

min <-
  resident_size |>
  filter(.by = workload, n == min(n)) |>
  select(workload, mem_min = mem)

left_join(min, max, join_by(workload)) |>
  mutate(mem_delta = mem_max - mem_min) |>
  arrange(mem_delta) |>
  mutate(overhead = mem_delta / mem_delta[[1]])
```

### Conclusion

- The size of the data is about 48 MB, so the memory overhead is about twofold.
- `collect(n = n)` is poison, with far worse overhead, surpassed only by `dbGetQuery(n = n)` (which is very surprising).
- `tbl(con, "FROM data LIMIT ...") |> collect()` is the best option for a lazy table.
- Action items:
  - Understand the doubled memory usage in `dbGetQuery()`
  - Understand the behavior of `dbGetQuery(n = n)`
  - See if ALTREP or a different way of fetching partial results (e.g., in the C++ glue) can help; a chunked-fetch sketch follows
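
For the last action item, one alternative worth comparing is DBI's incremental fetch interface, sketched here with assumed table contents; whether duckdb streams these chunks or materializes the full result first is exactly the kind of question raised above:

```{r fetch-sketch, eval = FALSE}
# Illustrative sketch: fetch a result in fixed-size chunks via DBI instead
# of materializing it in one dbGetQuery() call.
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "CREATE TABLE data AS SELECT random() AS x FROM range(1000000)")

res <- dbSendQuery(con, "FROM data")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 10000)  # one chunk at a time
  # process `chunk` here; it can be garbage-collected before the next one
}
dbClearResult(res)

dbDisconnect(con, shutdown = TRUE)
```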