https://github.com/freshmag/scarfolder-py
Data and file scaffolding via configurable YAML pipelines in an ETL fashion
docker python scaffolder scaffolding utility
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/freshmag/scarfolder-py
- Owner: FreshMag
- License: mit
- Created: 2026-04-05T08:45:21.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-04-05T10:08:47.000Z (3 months ago)
- Last Synced: 2026-04-05T10:20:27.754Z (3 months ago)
- Topics: docker, python, scaffolder, scaffolding, utility
- Language: Python
- Homepage: https://freshmag.github.io/scarfolder-py/
- Size: 176 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
Data and file scaffolding via configurable YAML pipelines.
Define generators, transformers, and loaders — wire them together in YAML — run anywhere.
---
## Table of Contents
- [Concepts](#concepts)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Pipeline Configuration](#pipeline-configuration)
  - [Structure](#structure)
  - [Inline Chaining](#inline-chaining)
  - [Args & Placeholders](#args--placeholders)
  - [External Refs](#external-refs)
- [CLI Reference](#cli-reference)
- [Built-in Plugins](#built-in-plugins)
- [Writing Custom Plugins](#writing-custom-plugins)
- [Running with Docker](#running-with-docker)
---
## Concepts
A **Scarf** is a full pipeline defined in a single `.yaml` file. It contains one or more **Steps**. Each step wires together plugins in up to three roles:
| Plugin | Role |
|---|---|
| **Generator** | Produces a list of values |
| **Transformer** | Receives a list and returns a new list |
| **Loader** | Consumes a list — writes files, runs queries, prints, etc. |
Each step can be given an `id` so its output can be referenced by downstream steps via `${steps.id}`.
Steps are executed in **topological order** — declaration order in the file does not matter.
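The ordering rule can be sketched with the standard library's `graphlib`. This is a minimal illustration, not Scarfolder's actual scheduler: the `step_dependencies` and `execution_order` helpers, the dictionary-shaped steps, and the fallback name for anonymous steps are all assumptions made for the example.

```python
import re
from graphlib import TopologicalSorter

# Matches ${steps.some_id} anywhere inside a string value.
STEP_REF = re.compile(r"\$\{steps\.([A-Za-z_][A-Za-z0-9_]*)\}")


def step_dependencies(node) -> set[str]:
    """Collect every step id referenced via ${steps.id} in a nested config."""
    if isinstance(node, str):
        return set(STEP_REF.findall(node))
    if isinstance(node, dict):
        return set().union(*(step_dependencies(v) for v in node.values()))
    if isinstance(node, list):
        return set().union(*(step_dependencies(v) for v in node))
    return set()


def execution_order(steps: list[dict]) -> list[str]:
    """Return step ids so that each step runs after the steps it references."""
    graph = {}
    for index, step in enumerate(steps):
        # Hypothetical fallback name for steps declared without an id.
        step_id = step.get("id", f"<anonymous-{index}>")
        graph[step_id] = step_dependencies(step)
    return list(TopologicalSorter(graph).static_order())


# The consumer is declared first, yet it is scheduled after `names`.
steps = [
    {"generator": {"name": "Combine", "args": {"streams": ["${steps.names}"]}}},
    {"id": "names", "generator": {"name": "Name"}},
]
print(execution_order(steps))  # ['names', '<anonymous-0>']
```

A side effect of this approach is that a cycle between steps surfaces as `graphlib.CycleError` rather than as an endless loop.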
---
## Installation
**Requirements:** Python 3.11+
```bash
git clone https://github.com/freshmag/scarfolder-py
cd scarfolder-py
python3.11 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e . # add [dev] for pytest
```
The `scarfolder` command is now available in your shell.
---
## Quick Start
```bash
# Run the included hello-world example
scarfolder run examples/hello_world/scarf.yaml
# Override a config arg at runtime
scarfolder run examples/hello_world/scarf.yaml -pcount=10 -poutput=out.txt
# Check a config file without running it
scarfolder validate examples/hello_world/scarf.yaml
# Inspect the steps of a pipeline
scarfolder list-steps examples/hello_world/scarf.yaml
```
---
## Pipeline Configuration
### Structure
```yaml
name: my-pipeline
description: Optional description

# (optional) External YAML files accessible via ${ref_name.key}
refs:
  queries: ./sql/queries.yaml

# Default argument values.
# Set a value to null to mark it as required; the CLI will prompt for it.
args:
  language: en
  count: 10
  output: null   # required: must be supplied via -p or interactive prompt

steps:
  - id: names    # optional; required if referenced downstream
    generator:
      name: my_pkg.generators.Name
      args:
        language: ${args.language}
        count: ${args.count}

  - generator:
      name: scarfolder.generators.util.Combine
      args:
        streams:
          - ${steps.names}
          - ${steps.surnames}   # assumes a `surnames` step, not shown here
    transformer: scarfolder.transformers.text.join
    loader:
      name: scarfolder.loaders.file.WriteLines
      args:
        path: ${args.output}
```
### Inline Chaining
A step can combine a generator, one or more transformers, and one or more loaders into a single declaration. The pipeline automatically injects the output of each stage as `values` into the next — no intermediate steps or explicit `${steps.*}` references needed.
```yaml
- id: greetings
  generator:
    name: scarfolder.generators.util.Constant
    args:
      value: hello
      count: 5
  transformers:
    - name: scarfolder.transformers.text.capitalize_first   # values auto-injected
    - name: scarfolder.transformers.text.format_template    # values auto-injected
      args:
        template: "Greeting: {value}"
  loaders:
    - name: scarfolder.loaders.console.Print                # values auto-injected
    - name: scarfolder.loaders.file.WriteLines              # values auto-injected
      args:
        path: output.txt
```
Use `transformer` (singular) and `loader` (singular) for the common single-item case. The plural forms accept a YAML list.
When a step has **no generator**, the first transformer is the primary producer and must declare its input explicitly:
```yaml
- id: upper_names
  transformer:
    name: scarfolder.transformers.text.upper
    args:
      values: ${steps.names}   # explicit: no generator to inject from
```
### Args & Placeholders
Placeholders use `${namespace.key}` syntax and are resolved before each step runs.
| Placeholder | Resolves to |
|---|---|
| `${args.key}` | A runtime argument (CLI or config default) |
| `${key}` | Shorthand for `${args.key}` |
| `${steps.id}` | The output list of a previously executed step |
| `${refname.key}` | A value from an external YAML file (see `refs:`) |
| `${env.VAR}` | An OS environment variable |
**Type preservation:** a value that is entirely a placeholder (e.g. `${steps.names}`) receives the actual Python object — not its string representation. This allows lists to flow between steps.
**Required args** are declared with a `null` default. If not provided via `-p`, the CLI prompts interactively.
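Type preservation is easiest to see in code. The following is a minimal sketch of a resolver that follows the rules in the table above; it is not the project's implementation, and `lookup`, `resolve`, and the shape of the `context` dictionary are assumptions made for illustration.

```python
import os
import re
from typing import Any

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def lookup(key: str, context: dict[str, dict[str, Any]]) -> Any:
    """Resolve 'namespace.key'; a bare 'key' is shorthand for 'args.key'."""
    namespace, sep, name = key.partition(".")
    if not sep:
        namespace, name = "args", key
    if namespace == "env":
        return os.environ[name]
    return context[namespace][name]


def resolve(value: Any, context: dict[str, dict[str, Any]]) -> Any:
    """Resolve placeholders recursively in strings, lists, and dicts."""
    if isinstance(value, dict):
        return {k: resolve(v, context) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, context) for v in value]
    if not isinstance(value, str):
        return value
    whole = PLACEHOLDER.fullmatch(value)
    if whole:
        # The entire value is one placeholder: hand back the object itself.
        return lookup(whole.group(1), context)
    # Placeholder embedded in text: interpolate its string form.
    return PLACEHOLDER.sub(lambda m: str(lookup(m.group(1), context)), value)


context = {
    "args": {"count": 3, "output": "out.txt"},
    "steps": {"names": ["Alice", "Bob"]},
    "queries": {"select_users": "SELECT * FROM users"},
}
print(resolve("${steps.names}", context))      # ['Alice', 'Bob'] (a real list)
print(resolve("${count}", context))            # 3 (an int, via the shorthand)
print(resolve("dir/${args.output}", context))  # dir/out.txt
```

Only the whole-value case returns the underlying object; as soon as a placeholder is mixed with other text, the result is necessarily a string.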
### External Refs
```yaml
refs:
  queries: ./sql/queries.yaml

steps:
  - generator:
      name: my_pkg.generators.SqlRows
      args:
        query: ${queries.select_users}
```
---
## CLI Reference
```
scarfolder [OPTIONS] COMMAND [ARGS]
```
### `run`
```bash
scarfolder run SCARF_FILE [OPTIONS]

Options:
  -p, --param KEY=VALUE   Override or supply a config arg. Repeatable.
  --dry-run               Validate config without executing any steps.
```
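How `-p KEY=VALUE` overrides combine with config defaults, and how a `null` default marks an arg as required, can be sketched as below. `parse_params` and `merge_args` are hypothetical helpers, not part of the CLI. The README does not say whether overrides are coerced to the default's type, so this sketch leaves them as strings.

```python
def parse_params(params: list[str]) -> dict[str, str]:
    """Turn ['count=10', 'output=out.txt'] into a dict; values stay strings."""
    overrides = {}
    for item in params:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {item!r}")
        overrides[key] = value
    return overrides


def merge_args(defaults: dict, overrides: dict) -> tuple[dict, list[str]]:
    """Apply overrides to defaults and report args still left as None."""
    merged = {**defaults, **overrides}
    missing = [key for key, value in merged.items() if value is None]
    return merged, missing


defaults = {"language": "en", "count": 10, "output": None}
merged, missing = merge_args(defaults, parse_params(["count=20"]))
print(merged)   # {'language': 'en', 'count': '20', 'output': None}
print(missing)  # ['output']
```

The `missing` list corresponds to the args the CLI would prompt for interactively.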
### `validate`
Parse and validate a Scarf file without running it.
```bash
scarfolder validate SCARF_FILE
```
### `list-steps`
Print a summary of all steps and their plugin chains. Each step shows its full chain with role labels — `[G]` Generator, `[T]` Transformer, `[L]` Loader.
```bash
scarfolder list-steps SCARF_FILE
```
---
## Built-in Plugins
### Generators
| Path | Description |
|---|---|
| `scarfolder.generators.util.Constant` | Repeat a single value `count` times |
| `scarfolder.generators.util.Range` | Integer sequence (`start`, `stop`, `step`) |
| `scarfolder.generators.util.Combine` | Zip multiple streams into tuples |
| `scarfolder.generators.util.Enumerate` | Pair each item with its index |
### Transformers
All built-in text transformers operate on `list[str]`. When chained to a generator, `values` is auto-injected; when used standalone, declare `values: ${steps.id}` in args, where `id` is the id of the producing step.
| Path | Description |
|---|---|
| `scarfolder.transformers.text.capitalize_first` | Capitalise first letter of each string |
| `scarfolder.transformers.text.upper` | Upper-case every string |
| `scarfolder.transformers.text.lower` | Lower-case every string |
| `scarfolder.transformers.text.strip` | Strip leading/trailing whitespace |
| `scarfolder.transformers.text.join` | Join each inner sequence into a string |
| `scarfolder.transformers.text.prefix` | Prepend a fixed string |
| `scarfolder.transformers.text.suffix` | Append a fixed string |
| `scarfolder.transformers.text.format_template` | Apply `{value}` format template |
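As a rough illustration of what the `greetings` chain from Inline Chaining computes, here are plain-Python stand-ins for two of the transformers above. They are written from the table descriptions, not copied from the Scarfolder source, so edge-case behaviour (empty strings, non-string items) is an assumption.

```python
def capitalize_first(values: list[str]) -> list[str]:
    """Upper-case only the first character, leaving the rest untouched."""
    return [v[:1].upper() + v[1:] for v in values]


def format_template(values: list[str], template: str) -> list[str]:
    """Substitute each item for the {value} field of the template."""
    return [template.format(value=v) for v in values]


values = ["hello"] * 3  # stand-in for Constant with value=hello, count=3
values = capitalize_first(values)
values = format_template(values, template="Greeting: {value}")
print(values)  # ['Greeting: Hello', 'Greeting: Hello', 'Greeting: Hello']
```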
### Loaders
When chained within a step, `values` is auto-injected; when used standalone, declare `values: ${steps.id}` in args, where `id` is the id of the producing step.
| Path | Description |
|---|---|
| `scarfolder.loaders.file.WriteLines` | Write one value per line to a text file |
| `scarfolder.loaders.file.WriteJson` | Serialise values as a JSON array |
| `scarfolder.loaders.console.Print` | Print values to stdout with optional template/header/footer |
| `scarfolder.loaders.file.print_values` | Print values to stdout (simple function) |
| `scarfolder.loaders.sql.ExecuteStatements` | Execute each value as a raw SQL statement |
| `scarfolder.loaders.sql.ExecuteMany` | Execute a parameterised query for each row |
---
## Writing Custom Plugins
Any Python class or plain callable can be a plugin — reference it by its fully qualified dotted path.
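Resolving a dotted path to a Python object typically looks like the sketch below. It shows the general technique with `importlib`, not Scarfolder's own loader; `import_dotted` is a hypothetical name, and the example resolves standard-library targets so it runs without Scarfolder installed.

```python
import importlib
from typing import Any


def import_dotted(path: str) -> Any:
    """Import 'package.module.attr' and return the attribute."""
    module_path, _, attr = path.rpartition(".")
    if not module_path:
        raise ValueError(f"not a fully qualified path: {path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, attr)


dumps = import_dotted("json.dumps")           # a plain function
print(dumps([1, 2]))                          # [1, 2]

PurePath = import_dotted("pathlib.PurePosixPath")  # a class
print(PurePath("output") / "names.csv")       # output/names.csv
```

Because the lookup goes through the normal import system, a plugin is found only if its package is importable, which is why `PYTHONPATH` matters later in this section.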
### Class-based (recommended for stateful plugins)
All data arrives through the constructor. Action methods take no positional arguments.
```python
# my_project/generators.py
import random

from scarfolder.core.base import Generator


class Name(Generator):
    def __init__(self, language: str = "en", count: int = 5):
        self.pool = ["Alice", "Bob"] if language == "en" else ["Luca", "Sofia"]
        self.count = count

    def generate(self) -> list[str]:
        return [random.choice(self.pool) for _ in range(self.count)]
```
```python
# my_project/loaders.py
import csv
from pathlib import Path

from scarfolder.core.base import Loader


class WriteCsv(Loader):
    def __init__(self, values: list, path: str, headers: list[str] | None = None):
        self.values = values  # auto-injected when chained; explicit via ${steps.*} otherwise
        self.path = Path(path)
        self.headers = headers

    def load(self) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("w", newline="") as f:
            writer = csv.writer(f)
            if self.headers:
                writer.writerow(self.headers)
            writer.writerows([[v] for v in self.values])
```
### Function-based (simpler for stateless transforms)
All resolved args are passed as keyword arguments. The data input is just another named keyword argument.
```python
# my_project/transforms.py
def shout(values: list[str], mark: str = "!") -> list[str]:
    return [v.upper() + mark for v in values]
```
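The two calling conventions differ only in where the resolved args go. The sketch below shows one way a runner could dispatch between them; `call_plugin` is hypothetical, and while `generate` and `load` appear in the examples above, the `transform` method name for class-based transformers is an assumption.

```python
import inspect
from typing import Any

# `generate` and `load` appear in the README examples; `transform` is assumed.
ACTION_METHODS = ("generate", "transform", "load")


def call_plugin(plugin: Any, **kwargs: Any) -> Any:
    """Run a plugin with resolved args, whether it is a class or a function."""
    if inspect.isclass(plugin):
        instance = plugin(**kwargs)  # all data goes to the constructor
        for method_name in ACTION_METHODS:
            method = getattr(instance, method_name, None)
            if callable(method):
                return method()  # action methods take no arguments
        raise TypeError(f"{plugin.__name__} has no action method")
    return plugin(**kwargs)  # functions receive everything as keywords


class Repeat:
    """Toy class-based generator, standing in for a real plugin."""

    def __init__(self, value: str, count: int):
        self.value, self.count = value, count

    def generate(self) -> list[str]:
        return [self.value] * self.count


def shout(values: list[str], mark: str = "!") -> list[str]:
    return [v.upper() + mark for v in values]


names = call_plugin(Repeat, value="bob", count=2)
print(call_plugin(shout, values=names, mark="!!!"))  # ['BOB!!!', 'BOB!!!']
```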
### Referencing in YAML
```yaml
steps:
  - id: names
    generator:
      name: my_project.generators.Name
      args:
        language: it
        count: 20
    transformer:   # chained: values auto-injected
      name: my_project.transforms.shout
      args:
        mark: "!!!"
    loader:        # chained: values auto-injected
      name: my_project.loaders.WriteCsv
      args:
        path: output/names.csv
        headers: [name]
```
Make sure your project directory is on `PYTHONPATH`:
```bash
PYTHONPATH=. scarfolder run pipeline.yaml
```
---
## Running with Docker
A pre-built image is available. Mount your project to `/workspace` — that directory is automatically on `PYTHONPATH`, so your custom plugins are importable with no extra setup.
### One-off run
```bash
docker run --rm \
-v ./my_project:/workspace \
ghcr.io/freshmag/scarfolder:latest \
run scarf.yaml -planguage=it
```
### With Docker Compose
```yaml
# docker-compose.yml
services:
  scarfolder:
    image: ghcr.io/freshmag/scarfolder:latest
    volumes:
      - .:/workspace
    command: ["run", "scarf.yaml", "-planguage=it"]
```
```bash
docker compose run --rm scarfolder
```
### Plugins outside the project directory
Use `SCARFOLDER_PLUGINS_PATH` (colon-separated) to inject additional paths:
```bash
docker run --rm \
-v ./my_project:/workspace \
-v ./shared_plugins:/plugins \
-e SCARFOLDER_PLUGINS_PATH=/plugins \
ghcr.io/freshmag/scarfolder:latest \
run scarf.yaml
```
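For intuition, a colon-separated variable such as `SCARFOLDER_PLUGINS_PATH` is usually turned into import paths along these lines. This is a sketch of the general technique, not the image's actual entrypoint; `plugin_paths` and `extend_sys_path` are hypothetical names.

```python
import os
import sys


def plugin_paths(raw: str) -> list[str]:
    """Split a colon-separated path list, dropping empty entries."""
    return [part for part in raw.split(":") if part]


def extend_sys_path(raw: str) -> None:
    """Append each plugin directory to sys.path once, keeping order."""
    for path in plugin_paths(raw):
        if path not in sys.path:
            sys.path.append(path)


extend_sys_path(os.environ.get("SCARFOLDER_PLUGINS_PATH", ""))
print(plugin_paths("/plugins:/opt/shared"))  # ['/plugins', '/opt/shared']
```

Appending rather than prepending keeps the mounted `/workspace` project ahead of shared plugin directories when two modules share a name; whether Scarfolder orders them this way is not stated in the README.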
---
Made with ❤️ and a warm scarf.