https://github.com/krlmlr/duckdb-mem
Analyze duckdb memory usage
- Host: GitHub
- URL: https://github.com/krlmlr/duckdb-mem
- Owner: krlmlr
- Created: 2024-03-03T15:55:35.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-03-11T16:30:00.000Z (about 1 year ago)
- Last Synced: 2025-02-10T11:12:37.226Z (3 months ago)
- Language: R
- Size: 590 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
- Metadata Files:
  - Readme: README.Rmd
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
library(tidyverse)
```

# duckdb-mem
The goal of duckdb-mem is to analyze the memory usage of DuckDB.
## `dbWriteTable()`
Running variants of the code in `setup.R`, driven by `run_setup.R` (a rough sketch of the variants follows the list):
- duckdb: Baseline
- r: Without `dbWriteTable()`
- limited: With a duckdb memory limit of 10 MB
- limited_20: With a duckdb memory limit of 20 MB
- register: With `duckdb_register()` instead of `dbWriteTable()`
- manual: With `duckdb_register()` and `CREATE TABLE` instead of `dbWriteTable()`
- manual_limited: With `duckdb_register()`, `CREATE TABLE`, and a duckdb memory limit of 10 MB
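The actual code lives in `setup.R` and its variants; the block below is only a rough sketch of what the workloads do, with an illustrative data frame and made-up object names (`df`, `data`, `data_df`):

```r
library(DBI)
library(duckdb)

# Illustrative data; the real scripts vary the number of rows (n).
df <- data.frame(x = runif(1e6), y = runif(1e6))

# "duckdb" baseline: copy the data frame into DuckDB with dbWriteTable().
con <- dbConnect(duckdb())
dbWriteTable(con, "data", df)
dbDisconnect(con, shutdown = TRUE)

# "limited" / "limited_20": the same, but with a DuckDB memory limit.
con <- dbConnect(duckdb())
dbExecute(con, "SET memory_limit = '10MB'")
dbWriteTable(con, "data", df)
dbDisconnect(con, shutdown = TRUE)

# "register": only expose the data frame as a virtual table.
# "manual" / "manual_limited": additionally materialize it with CREATE TABLE.
con <- dbConnect(duckdb())
duckdb_register(con, "data_df", df)
dbExecute(con, "CREATE TABLE data AS SELECT * FROM data_df")
dbDisconnect(con, shutdown = TRUE)
```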
```{r echo = FALSE, message = FALSE}
setup <- readRDS("setup.rds")

resident_size <-
setup |>
mutate(res = map_chr(out, ~ grep("resident", .x, value = TRUE))) |>
mutate(mem = map_dbl(res, ~ as.numeric(str_extract(.x, "\\d+")) / 2^20))

ggplot(resident_size, aes(x = n, y = mem, color = workload)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Resident memory usage",
x = "Number of rows",
y = "Resident memory (MB)") +
theme_minimal()
```
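The `out` column holds the text output captured for each run; the parsing above assumes it contains a line mentioning "resident" with the resident set size in bytes. A toy illustration of the extraction, with a made-up sample line:

```r
library(stringr)

line <- "  1234567  maximum resident set size"  # made-up sample line
as.numeric(str_extract(line, "\\d+")) / 2^20    # bytes -> MB, roughly 1.18
```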
### Linear model

```{r echo = FALSE}
lm(mem ~ workload, data = resident_size)
```

### Overhead
```{r echo = FALSE}
max <-
resident_size |>
filter(.by = workload, n == max(n)) |>
select(workload, mem_max = mem)

min <-
resident_size |>
filter(.by = workload, n == min(n)) |>
select(workload, mem_min = mem)

left_join(min, max, join_by(workload)) |>
mutate(mem_delta = mem_max - mem_min) |>
arrange(mem_delta) |>
mutate(overhead = mem_delta / mem_delta[[1]])
```

### Conclusion
- Registering the data frame consumes a bit of memory, but not that much.
- The `setup-manual.R` script is equivalent to `setup.R` in terms of memory usage, but uses lower-level functions instead of `dbWriteTable()`.
- The `CREATE TABLE` statement in `setup-manual.R` seems to be responsible for the memory overhead.
- Despite the DuckDB memory limit of 10 MB in `setup-manual-limited.R`, the memory overhead is over 25 MB (see the sketch below).
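One way to narrow this down (a sketch, not part of the repository's scripts) is to ask DuckDB for its own memory accounting; resident memory well beyond what DuckDB reports would then point at the R side rather than at DuckDB's buffer manager:

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "SET memory_limit = '10MB'")

# Confirm the limit is in effect and inspect DuckDB's own memory accounting.
dbGetQuery(con, "SELECT current_setting('memory_limit') AS memory_limit")
dbGetQuery(con, "PRAGMA database_size")

dbDisconnect(con, shutdown = TRUE)
```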
## `dbGetQuery()`

Running variants of the code in `read.R`, driven by `run_read.R` (a rough sketch of the variants follows the list):
- duckdb: Baseline, `dbGetQuery()`
- limited: With a duckdb memory limit of 10 MB
- limited_20: With a duckdb memory limit of 20 MB
- limited_collect: With a duckdb memory limit of 10 MB, using `collect(n = n)`
- limited_collect_from: With a duckdb memory limit of 10 MB, using `tbl(con, "FROM data LIMIT ...") |> collect()`
- limited: With a duckdb memory limit of 10 MB, using `dbGetQuery(n = n)`
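Again, the actual code is in `read.R` and its variants; the block below is only a rough sketch of the read strategies, assuming a table `data` created as in the setup scripts (row count and query text are illustrative, and the raw query is wrapped in `sql()` here):

```r
library(DBI)
library(duckdb)
library(dplyr)
library(dbplyr)

con <- dbConnect(duckdb())
# ... assume a table `data` has been created as in the setup scripts ...
dbExecute(con, "SET memory_limit = '10MB'")

n <- 100000L

# "duckdb" / "limited" / "limited_20": fetch the full result at once.
df1 <- dbGetQuery(con, "SELECT * FROM data")

# Fetch at most n rows via the n argument of dbGetQuery().
df2 <- dbGetQuery(con, "SELECT * FROM data", n = n)

# "limited_collect": lazy table, then collect(n = n).
df3 <- tbl(con, "data") |> collect(n = n)

# "limited_collect_from": push the limit into the query itself.
df4 <- tbl(con, sql(paste("FROM data LIMIT", n))) |> collect()

dbDisconnect(con, shutdown = TRUE)
```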
```{r read, echo = FALSE, message = FALSE}
read <- readRDS("read.rds")

resident_size <-
read |>
mutate(res = map_chr(out, ~ grep("resident", .x, value = TRUE))) |>
mutate(mem = map_dbl(res, ~ as.numeric(str_extract(.x, "\\d+")) / 2^20))

ggplot(resident_size, aes(x = n, y = mem, color = workload)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Resident memory usage for reading",
x = "Number of rows",
y = "Resident memory (MB)") +
theme_minimal()
```

### Linear model
```{r echo = FALSE}
lm(mem ~ workload, data = resident_size)
```

### Overhead
```{r echo = FALSE}
max <-
resident_size |>
filter(.by = workload, n == max(n)) |>
select(workload, mem_max = mem)

min <-
resident_size |>
filter(.by = workload, n == min(n)) |>
select(workload, mem_min = mem)

left_join(min, max, join_by(workload)) |>
mutate(mem_delta = mem_max - mem_min) |>
arrange(mem_delta) |>
mutate(overhead = mem_delta / mem_delta[[1]])
```

### Conclusion
- The size of the data is about 48 MB, so the memory overhead is about twofold.
- `collect(n = n)` is poison: its overhead is far worse, surpassed only by `dbGetQuery(n = n)` (which is very surprising).
- `tbl(con, "FROM data LIMIT ...") |> collect()` is the best option for a lazy table.
- Action items:
  - Understand double memory usage in `dbGetQuery()`
  - Understand `dbGetQuery(n = n)`
  - See if ALTREP or a different way of fetching partial results (e.g., in the C++ glue) can help; a chunked-fetch sketch follows below
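A possible starting point for the last item (a sketch only; whether the duckdb driver actually streams here, or materializes the full result internally, is exactly what needs to be understood):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
# ... assume a table `data` exists ...

res <- dbSendQuery(con, "SELECT * FROM data")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 100000)
  # process `chunk`, then drop it so it can be garbage-collected
}
dbClearResult(res)
dbDisconnect(con, shutdown = TRUE)
```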