https://github.com/jmacd/duckpond
Small telemetry system
- Host: GitHub
- URL: https://github.com/jmacd/duckpond
- Owner: jmacd
- Created: 2024-04-23T07:16:55.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2026-02-15T08:04:35.000Z (4 months ago)
- Last Synced: 2026-02-15T08:59:34.406Z (4 months ago)
- Language: Rust
- Homepage:
- Size: 6.72 MB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSES/Apache-2.0.txt
- Security: SECURITY.md
# Duckpond
Duckpond is a very small data lake. :-)
Duckpond is built by the [Caspar Water System](https://github.com/jmacd/caspar.water).
Duckpond writes timeseries data into multi-Parquet file databases and
assembles them for export using DuckDB. A sibling project [Noyo Blue
Economy](https://github.com/jmacd/noyo-blue-econ) shows how to combine
this output within [Observable
Framework](https://observablehq.com/framework/) markdown.
Warning! Work-in-progress. This works and needs testing! :-)

## Resource categories
### HydroVu
A receiver for www.hydrovu.com environmental monitoring data. For
example, to instantiate one:
```
apiVersion: www.hydrovu.com/v1
kind: HydroVu
name: noyo-harbor
desc: Noyo Harbor Blue Economy
spec:
  key: ...
  secret: ...
```
This downloads the complete set of data for all locations and
instruments. Configure key and secret fields using values supplied by
the HydroVu system.
### Backup
An exporter for backing up all files in the Pond to a storage bucket.
```
apiVersion: github.com/jmacd/duckpond/v1
kind: Backup
name: cloudflare
desc: Backup to Cloudflare R2
spec:
  bucket: noyoharbor
  region: ...
  key: ...
  secret: ...
  endpoint: ...
```
Configure region, key, secret, and endpoint fields using values
supplied by your service provider.
### Copy
Loads from a Backup storage bucket.
```
apiVersion: github.com/jmacd/duckpond/v1
kind: Copy
name: cloudflare
desc: Backup from storage
spec:
  bucket: noyoharbor
  region: ...
  key: ...
  secret: ...
  endpoint: ...
  backup_uuid: 6fe1e0e1-b262-4d99-ae6f-3cae39e1196a
```
Configure region, key, secret, and endpoint the same as you
would for the corresponding Backup.
### Inbox
Ingest files from a local directory on the host file system.
```
apiVersion: github.com/jmacd/duckpond/v1
kind: Inbox
name: csvin
desc: CSV inbox
spec:
  pattern: /home/user/inbox/csv/**
```
The files will be placed in a directory named `/Inbox/{UUID}`.
### Derive
Register a SQL query to transform arbitrary data (e.g., from the
Inbox) into a desired representation. The query will be executed once
per target file matching the pattern in the Pond.
```
apiVersion: github.com/jmacd/duckpond/v1
kind: Derive
name: csvextract
desc: Extract from CSV
spec:
  collections:
  - pattern: /Inbox/csvdata/*surface*.csv
    name: SurfaceMeasurements
    query: >
      WITH INPUT as ...
      SELECT ...
      FROM read_csv('$1')
      ...
```
The placeholder `$1` is replaced by the real path of each file
matching the configured pattern, and the resulting derived files are
populated with the results of the `query` in a synthetic Pond
directory named `/Derive/{UUID}/{spec.name}`.
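The `$1` substitution described above can be pictured with a minimal Rust sketch. The function names `expand_query` and `expand_all`, and the quote-escaping step, are assumptions made for illustration and are not taken from the Duckpond source; the sketch only shows one query string being produced per matching file.

```rust
/// Hypothetical helper (not from the Duckpond source): substitutes a file's
/// real path for the `$1` placeholder in a Derive query.
fn expand_query(query: &str, real_path: &str) -> String {
    // Assumption: double any single quote so the path stays a valid SQL
    // string literal inside read_csv('...').
    let escaped = real_path.replace('\'', "''");
    query.replace("$1", &escaped)
}

/// Produces one expanded query per file that matched the pattern.
fn expand_all(query: &str, real_paths: &[&str]) -> Vec<String> {
    real_paths.iter().map(|p| expand_query(query, p)).collect()
}
```

Calling `expand_all` with the query text and the list of matched paths yields the statements that would each populate one derived file.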
### Combine
Use the Combine resource to merge a set of files with different names
and identical schemas. This is accomplished by an automatically
generated UNION statement executed by DuckDB.
```
apiVersion: www.hydrovu.com/v1
kind: Combine
name: noyo-data
desc: Noyo harbor instrument collections
spec:
  scopes:
  - name: FieldStation-Surface
    series:
    - pattern: /Derive/othersource/SurfaceMeasurements/*
```
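To make the generated UNION more concrete, here is a minimal Rust sketch of how such a statement could be assembled from a list of matching files. The function name `union_query`, the use of DuckDB's `read_parquet`, and the exact SQL text are assumptions; the statement Duckpond actually generates is not shown in this README.

```rust
/// Hypothetical sketch: builds a single UNION query over files that share a
/// schema. Returns None when there are no files to combine.
fn union_query(parquet_paths: &[&str]) -> Option<String> {
    if parquet_paths.is_empty() {
        return None;
    }
    let selects: Vec<String> = parquet_paths
        .iter()
        // Assumption: inputs are Parquet files; single quotes are doubled so
        // each path stays a valid SQL string literal.
        .map(|p| format!("SELECT * FROM read_parquet('{}')", p.replace('\'', "''")))
        .collect();
    // Plain UNION follows the README's wording; the real statement may differ.
    Some(selects.join(" UNION "))
}
```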
### Template
Use the Template resource to synthesize content generated through the
Rust `Tera` template engine. Each named collection applies the
supplied pattern, which must capture a single variable. For each
match, the captured value becomes the basename of a file with the
contents of the expanded template.
```
apiVersion: github.com/jmacd/duckpond/v1
kind: Template
name: website
desc: observable
spec:
  collections:
  - name: details
    in_pattern: "/Combine/noyodata/*/combine"
    out_pattern: "$0.md"
    template: |-
      ---
      title: Detail with combined data
      ---
      {% for field in schema.fields %}
      {{ field.name }}
      {% endfor %}
```
The template context includes the schema of the matching file in a
variable named "schema". The schema is an array of fields:
```
// The schema exposed to templates as `schema`.
pub struct Schema {
    fields: Vec<Field>,
}
// One column of the matching file.
pub struct Field {
    name: String,
    instrument: String,
    unit: String,
}
```
TODO: The schema is structured according to the HydroVu data model,
which is not general purpose.
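The pairing of `in_pattern` and `out_pattern` in the Template example can be sketched in Rust as follows. This is an illustration under stated assumptions, not Duckpond's implementation: the helper names `capture_single` and `output_name` are hypothetical, and the sketch assumes that `*` matches within a single path component.

```rust
/// Hypothetical helper: matches `path` against a pattern containing exactly
/// one `*` and returns the text that the wildcard captured.
fn capture_single<'a>(pattern: &str, path: &'a str) -> Option<&'a str> {
    let (prefix, suffix) = pattern.split_once('*')?;
    if suffix.contains('*') {
        return None; // more than one wildcard is out of scope for this sketch
    }
    let captured = path.strip_prefix(prefix)?.strip_suffix(suffix)?;
    // Assumption: `*` matches one non-empty path component only.
    if captured.is_empty() || captured.contains('/') {
        return None;
    }
    Some(captured)
}

/// Expands `$0` in the output pattern with the captured value.
fn output_name(in_pattern: &str, out_pattern: &str, path: &str) -> Option<String> {
    capture_single(in_pattern, path).map(|c| out_pattern.replace("$0", c))
}
```

With the patterns from the example, a path such as `/Combine/noyodata/FieldStation/combine` (a made-up path) would produce the basename `FieldStation.md`.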
#### Template functions
##### Group
The `group(by=..., in=...)` built-in groups any array of objects,
returning a map of arrays of objects. It can be used to group fields
by instrument name, for example.
```
{% for key, fields in group(by="name",in=schema.fields) %}
Name is {{ key }}
{% for field in fields %}
Instrument is {{ field.instrument }}
{% endfor %}
{% endfor %}
```
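The grouping that `group(by=..., in=...)` performs can be approximated in plain Rust. The sketch below is not the Tera function itself: `group_by` is a hypothetical name, the `Field` struct repeats the one shown earlier so the block stands alone, and the key is chosen with a closure instead of a `by` string.

```rust
use std::collections::BTreeMap;

/// Mirrors the `Field` struct from the schema description.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub instrument: String,
    pub unit: String,
}

/// Hypothetical sketch: groups fields into a map of arrays, keyed by the
/// string that `key` extracts from each field.
fn group_by<F>(fields: &[Field], key: F) -> BTreeMap<String, Vec<Field>>
where
    F: Fn(&Field) -> String,
{
    let mut groups: BTreeMap<String, Vec<Field>> = BTreeMap::new();
    for field in fields {
        // Create the group on first sight, then append the field to it.
        groups.entry(key(field)).or_default().push(field.clone());
    }
    groups
}
```

For instance, `group_by(&fields, |f| f.instrument.clone())` groups by instrument, while `|f| f.name.clone()` corresponds to `by="name"` in the template above.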
### Reduce
Use the Reduce resource to aggregate timeseries into larger time
buckets. Each collection's datasets are synthesized, expanding the
input wildcard (e.g., `.../*/...`) to the output placeholder (e.g.,
`output-$0`).
Multiple datasets may be listed within a collection, which must not
generate overlapping output names.
```
apiVersion: duckpond/v1
kind: Reduce
name: downsampled
desc: Downsampled datasets
spec:
  collections:
  - name: single_instrument
    resolutions: [1h, 2h, 4h, 12h, 24h]
    datasets:
    - in_pattern: "/Combine/noyodata/*/combine"
      out_pattern: "reduce-$0"
      columns:
      - "AT500_Bottom.DO.mg/L"
      - "AT500_Surface.DO.mg/L"
```
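As a rough illustration of aggregating a timeseries into larger time buckets, the Rust sketch below averages samples into fixed-width buckets. The README does not say which aggregates Reduce computes, so the choice of the mean, the function name `downsample_mean`, and the representation of samples as `(timestamp_seconds, value)` pairs are all assumptions.

```rust
use std::collections::BTreeMap;

/// Hypothetical sketch: averages (timestamp_seconds, value) samples into
/// fixed-width buckets, keyed by each bucket's start time.
fn downsample_mean(samples: &[(i64, f64)], bucket_secs: i64) -> BTreeMap<i64, f64> {
    assert!(bucket_secs > 0, "bucket width must be positive");
    // bucket start -> (running sum, sample count)
    let mut acc: BTreeMap<i64, (f64, u32)> = BTreeMap::new();
    for &(ts, value) in samples {
        // div_euclid rounds toward negative infinity, so timestamps before
        // the epoch still land in the bucket that contains them.
        let start = ts.div_euclid(bucket_secs) * bucket_secs;
        let entry = acc.entry(start).or_insert((0.0, 0));
        entry.0 += value;
        entry.1 += 1;
    }
    acc.into_iter()
        .map(|(start, (sum, n))| (start, sum / n as f64))
        .collect()
}
```

A resolution of `1h` would correspond to `bucket_secs = 3600` in this sketch.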
### Scribble
Synthetic data generator for testing.
```
apiVersion: github.com/jmacd/duckpond/v1
kind: Scribble
name: first-scribble
desc: Test data generator
spec:
  count_min: 5
  count_max: 10
  probs:
    tree: 0.05
    table: 0.4
```
## Usage
### Init
Initializes a new pond in the current working directory, or in `$POND`
if set. The pond directory is named `.pond`. To create a new pond:
```
duckpond init
```
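The lookup order for the pond location can be sketched in Rust. This is a guess at the behavior under one reading of the text above: the function name `pond_dir` is hypothetical, and the sketch assumes that `$POND` names the parent directory that holds `.pond`, which the README does not state explicitly.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical sketch of locating the pond directory.
/// `pond_env` is the value of `$POND`, if set; `cwd` is the working directory.
fn pond_dir(pond_env: Option<&str>, cwd: &Path) -> PathBuf {
    match pond_env {
        // Assumption: `$POND` is the directory in which `.pond` lives.
        Some(dir) if !dir.is_empty() => Path::new(dir).join(".pond"),
        // Unset or empty: fall back to the current working directory.
        _ => cwd.join(".pond"),
    }
}
```

In a real program the arguments would come from `std::env::var("POND").ok()` and `std::env::current_dir()`.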
### Apply
Applies a resource definition, for example to create a new temporal
dataset:
```
duckpond apply -f noyo.yaml
```
### Run
Fetches new data from registered resources. For example:
```
duckpond run
```
### Check
Checks whether expected files are present in the pond and performs
various other consistency checks. WIP: currently it simply prints the
unexpected files.
```
duckpond check
```
### List
Produces a directory listing. Accepts shell wildcards.
```
duckpond list PATTERN
```
### Cat
Writes a file to the standard output.
```
duckpond cat PATH
```
### Export
Writes the set of files matching one or more patterns to corresponding
paths in the host file system.
```
duckpond export -d OUTPUT_DIR -p PATTERN [-p PATTERN ...] --temporal=year,month
```
If the file is tabular and the `--temporal` flag is set, data will be
exported in date-partitioned Parquet files. The `--temporal` argument
determines the partition keys for the output (which must include
`year`).
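The rule that the partition keys must include `year` can be sketched in Rust along with the directory layout it leads to. The function names `parse_temporal` and `partition_dir` are hypothetical and the error text is invented; only the `key=value` path layout is taken from the example output in this README.

```rust
/// Hypothetical sketch: parses a `--temporal` value such as "year,month" and
/// enforces the documented rule that `year` must be one of the keys.
fn parse_temporal(arg: &str) -> Result<Vec<String>, String> {
    let keys: Vec<String> = arg
        .split(',')
        .map(|k| k.trim().to_string())
        .filter(|k| !k.is_empty())
        .collect();
    if !keys.iter().any(|k| k == "year") {
        return Err(format!("--temporal must include `year`, got {arg:?}"));
    }
    Ok(keys)
}

/// Builds a partition directory such as "year=2024/month=1" from the keys and
/// one value per key. Returns None if the lengths differ.
fn partition_dir(keys: &[String], values: &[u32]) -> Option<String> {
    if keys.len() != values.len() {
        return None;
    }
    let parts: Vec<String> = keys
        .iter()
        .zip(values)
        .map(|(k, v)| format!("{k}={v}"))
        .collect();
    Some(parts.join("/"))
}
```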
The set of files exported by each pattern is placed into the
`Template` context object, under the `export` key, making it possible
for exported templates to refer to exported files. For each wildcard
in the export pattern, a nested map is built for the captured
variable. If a pattern `/var/*/data/*` matched a path
`/var/log/data/messages`, the `export` context would include a list of
generated files, for example:
```
{
  "log": {
    "messages": [
      {
        "file": "log/messages/year=2024/month=1/data_01.parquet",
        "start_time": 1234,
        "end_time": 5678
      },
      ...
    ]
  }
}
```
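The way wildcard captures turn into nested map keys can be sketched in Rust. This is an illustrative model only: `ExportNode`, `captures`, and `insert` are hypothetical names, the sketch assumes each `*` matches exactly one path component, and the leaves hold only file names where the real context also carries `start_time` and `end_time`.

```rust
use std::collections::BTreeMap;

/// Hypothetical model of the `export` context: nested maps keyed by captured
/// wildcard values, with exported file names at the leaves.
#[derive(Debug, Default, PartialEq)]
struct ExportNode {
    children: BTreeMap<String, ExportNode>,
    files: Vec<String>,
}

/// Matches `path` against `pattern` one component at a time and returns the
/// text captured by each `*`, in order.
fn captures(pattern: &str, path: &str) -> Option<Vec<String>> {
    let pat: Vec<&str> = pattern.split('/').collect();
    let segs: Vec<&str> = path.split('/').collect();
    if pat.len() != segs.len() {
        return None;
    }
    let mut out = Vec::new();
    for (p, s) in pat.iter().zip(&segs) {
        if *p == "*" {
            if s.is_empty() {
                return None; // a wildcard must capture something
            }
            out.push(s.to_string());
        } else if p != s {
            return None; // literal component mismatch
        }
    }
    Some(out)
}

/// Records `file` under the nested keys given by `captured`.
fn insert(root: &mut ExportNode, captured: &[String], file: &str) {
    let mut node = root;
    for key in captured {
        node = node.children.entry(key.clone()).or_default();
    }
    node.files.push(file.to_string());
}
```

For the pattern and path from the text, `captures` returns `["log", "messages"]`, and `insert` then files the exported name under `log` and `messages`, matching the shape of the JSON above.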
TODO: This example belongs with `Template`. Also document `args`, the
matching wildcard variables.
## Configuration template expansion
Configuration values that are string-typed have template expansion
applied, meaning that a line of configuration such as:
```
property: "{{ variable }}"
```
can have its variables supplied during `duckpond apply` using `-v`. As a
special case, the Template resource's `template` field itself is not
expanded. For example:
```
duckpond apply -f config.yaml -v variable=value
```
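A minimal Rust sketch of this expansion is shown below. It uses plain string replacement as a stand-in for whatever template engine Duckpond applies to configuration values, so it handles only the exact `{{ name }}` spacing. The function names `parse_vars` and `expand` are hypothetical, and passing `-v` more than once is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: parses `name=value` strings as given to `-v`.
fn parse_vars(args: &[&str]) -> Result<HashMap<String, String>, String> {
    let mut vars = HashMap::new();
    for arg in args {
        let (name, value) = arg
            .split_once('=')
            .ok_or_else(|| format!("expected name=value, got {arg:?}"))?;
        vars.insert(name.to_string(), value.to_string());
    }
    Ok(vars)
}

/// Replaces each `{{ name }}` occurrence with the variable's value.
/// Unknown placeholders are left untouched in this sketch.
fn expand(text: &str, vars: &HashMap<String, String>) -> String {
    let mut out = text.to_string();
    for (name, value) in vars {
        // "{{{{ {name} }}}}" formats to the literal text "{{ name }}".
        out = out.replace(&format!("{{{{ {name} }}}}"), value);
    }
    out
}
```

With `-v variable=value`, the configuration line `property: "{{ variable }}"` would expand to `property: "value"` under this sketch.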