https://github.com/mta-tech/seeknal
Seeknal is an all-in-one platform for data and AI/ML engineering
https://github.com/mta-tech/seeknal
analytics-engineering data-engineering data-science duckdb feature-engineering feature-management feature-store machine-learning mlops
Last synced: 11 days ago
JSON representation
Seeknal is an all-in-one platform for data and AI/ML engineering
- Host: GitHub
- URL: https://github.com/mta-tech/seeknal
- Owner: mta-tech
- License: apache-2.0
- Created: 2024-11-05T10:46:51.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2026-03-06T06:46:48.000Z (about 2 months ago)
- Last Synced: 2026-03-06T09:36:12.975Z (about 2 months ago)
- Topics: analytics-engineering, data-engineering, data-science, duckdb, feature-engineering, feature-management, feature-store, machine-learning, mlops
- Language: Python
- Homepage: https://mta-tech.github.io/seeknal/
- Size: 14.2 MB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
README
Seeknal
Transform data with SQL and Python. Build ML features with point-in-time joins. Materialize to PostgreSQL and Iceberg — all from one CLI.
Seeknal is an all-in-one platform for data and AI/ML engineering. Define pipelines in YAML or Python, run them through a safe `draft → dry-run → apply` workflow, and materialize outputs to PostgreSQL and Apache Iceberg simultaneously. Python 3.11+ required.
## Quick Start
```bash
pip install seeknal
seeknal init --name my_project
seeknal draft --name my_pipeline --type transform
seeknal dry-run
seeknal apply
```
Explore your data interactively or search docs from the terminal:
```bash
seeknal repl # Interactive SQL on pipeline outputs
seeknal docs query # Search documentation from the CLI
```
```sql
SELECT customer_id, COUNT(*) as order_count
FROM target.my_transform
GROUP BY customer_id;
```
## Key Features
**Dual Pipeline Authoring** — Write pipelines in YAML, Python decorators, or both:
```python
from seeknal.pipeline import source, transform
@source(name="orders", source="csv", table="data/orders.csv")
def orders():
pass
@transform(name="order_metrics", inputs=["source.orders"])
def order_metrics(ctx):
df = ctx.ref("source.orders")
return ctx.duckdb.sql(
"SELECT customer_id, SUM(amount) as total FROM df GROUP BY customer_id"
).df()
```
**Multi-Target Materialization** — Write to PostgreSQL and Iceberg from a single node:
```yaml
materializations:
- type: postgresql
connection: local_pg
table: analytics.my_table
mode: upsert_by_key
unique_keys: [id]
- type: iceberg
table: atlas.namespace.my_table
```
**Environment Management** — Isolated namespaces with per-environment profiles:
```bash
seeknal env plan dev --profile profiles-dev.yml
seeknal env apply dev
seeknal run --env dev
```
**Feature Store** — Define ML features in YAML or Python with entity keys, point-in-time joins, and automatic versioning. Supports offline (batch) and online (real-time) serving.
```yaml
# seeknal/feature_groups/customer_features.yml
kind: feature_group
name: customer_features
entity:
name: customer
join_keys: ["customer_id"]
materialization:
event_time_col: latest_order_date
offline: { enabled: true, format: parquet }
online: { enabled: false, ttl: 7d }
features:
total_orders: { dtype: integer }
total_spent: { dtype: float }
avg_order_value: { dtype: float }
inputs:
- ref: transform.customer_orders
```
```python
# Or use Python decorators
@feature_group(name="customer_rfm", entity="customer")
def customer_rfm(ctx):
df = ctx.ref("transform.clean_transactions")
return ctx.duckdb.sql("""
SELECT CustomerID, COUNT(DISTINCT InvoiceNo) as frequency,
SUM(TotalAmount) as monetary_value
FROM df GROUP BY CustomerID
""").df()
```
```bash
seeknal entity list # Cross-feature-group consolidation
seeknal entity show customer # Inspect entity schema and feature groups
```
**Interactive SQL REPL** — Auto-registers parquets, PostgreSQL, and Iceberg sources at startup. Query pipeline outputs, explore data, iterate on SQL — all without leaving the terminal.
**AI-Powered Thinking Partner** — `seeknal ask chat` is your collaborative partner for data work. The agent uses 16 tools for fast data access and 11 built-in skills for multi-step workflows like report generation, pipeline building, and data profiling — all loaded on demand to keep responses fast:
```bash
seeknal ask chat # Start a brainstorm / build session
seeknal ask "What are the top 5 customers by revenue?" # Quick one-shot question
seeknal ask report "customer analysis" # Generate interactive HTML dashboard
seeknal ask chat --web # Enable web search for benchmarks
```
Ask it to build a pipeline from scratch, and it will draft a plan, walk you through the design, and wait for your go-ahead before generating code. Publish reports to a self-hosted **Seeknal Report Server** and share them with your team via a URL.
```bash
seeknal report-server start # Host published reports
seeknal gateway start # Expose ask as an API (WebSocket/SSE/REST)
```
Supports Google Gemini (default) and Ollama (local). Use `--provider ollama` for fully local, private analysis.
## Documentation
| | |
|---|---|
| **[Getting Started](docs/index.md)** | Installation, configuration, first pipeline |
| **[CLI Reference](docs/reference/cli.md)** | All commands and flags |
| **[YAML Schema](docs/reference/yaml-schema.md)** | Pipeline YAML reference |
| **[CLI Docs Search](docs/cli/docs.md)** | Search documentation from the terminal (`seeknal docs`) |
| **Tutorials** | [YAML Pipelines](docs/tutorials/yaml-pipeline-tutorial.md) · [Python Pipelines](docs/tutorials/python-pipelines-tutorial.md) · [Mixed](docs/tutorials/mixed-yaml-python-pipelines.md) · [Seeknal Ask Agent](docs/tutorials/seeknal-ask-agent.md) · [Report Exposures](docs/tutorials/report-exposures.md) |
| **Guides** | [Python Pipelines](docs/guides/python-pipelines.md) · [Testing & Audits](docs/guides/testing-and-audits.md) · [Iceberg Materialization](docs/iceberg-materialization.md) · [Training to Serving](docs/guides/training-to-serving.md) |
| **Servers** | [Gateway Server](docs/cli/gateway.md) · [Report Server](docs/cli/report-server.md) |
| **Concepts** | [Point-in-Time Joins](docs/concepts/point-in-time-joins.md) · [Virtual Environments](docs/concepts/virtual-environments.md) · [Exposures](docs/concepts/exposures.md) · [Glossary](docs/concepts/glossary.md) |
## Changelog
### v2.6.0 (April 2026)
**Skills-Powered Agent + Report Server** — The ask agent now uses a thin-tools/fat-skills architecture: 16 lean tools for fast data access, 11 built-in skills for multi-step workflows (reports, pipelines, profiling, metrics, publishing). Skills load on demand via progressive disclosure, keeping the agent's context lean.
- **Seeknal Report Server** (`seeknal report-server start`): self-hosted server for publishing and sharing reports via unique URLs — publish from the chat TUI or the agent tool
- **11 built-in skills**: report generation, pipeline building, data profiling, Python analysis, semantic model bootstrap, metric query/save, report exposure codification, Proof Editor publishing
- **Chat enhancements**: `--style` (concise/explanatory/formal/conversational), `--budget` (USD cap), `--web` (DuckDuckGo search), `--session`/`--name` (named session resume)
- **Gateway improvements**: cloud-only backend mode, standalone workers, Redis multi-replica, split topology
- **Auto `.env` loading**: `--project ` loads `/.env` automatically
- **Error UX**: network errors classified with actionable hints; error logs saved to `~/.seeknal/logs/`
### v2.5.0 (April 2026)
**Seeknal as Your Thinking Partner** — `seeknal ask chat` is now a collaborative partner that brainstorms, builds pipelines, and trains models with you through conversation. It always asks for confirmation before acting — you stay in control.
- **Interactive chat mode** (`seeknal ask chat`): multi-turn brainstorm and build sessions with persistent history, streaming UI with Claude Code-inspired visual hierarchy
- **Confirmation-first workflow**: the agent proposes plans and analysis directions, then waits for your go-ahead via interactive menus before executing
- **Pipeline and ML building**: describe what you want to build in plain language — the agent drafts YAML pipelines, feature groups, or model training code and checks in before generating
- **Session management**: create, resume, list, and delete sessions with full message persistence (`seeknal session list/show/delete`)
- **Iceberg REST catalog support**: integrates with any Iceberg REST catalog provider (Lakekeeper, Tabular, Polaris, etc.)
- **Gateway server**: WebSocket, SSE, and REST endpoints for web clients; optional Telegram bot integration
- **UI refresh**: animated fox mascot, interactive arrow-key menus, real token/tool counters, subordinate reasoning display
### v2.4.0 (March 2026)
**Seeknal Ask — AI-Powered Data Agent** — Natural language data analysis with 12 built-in tools:
```bash
seeknal ask "What are the top 5 customers by revenue?"
seeknal ask chat # Interactive multi-turn session
seeknal ask report "customer segmentation" # AI-guided HTML dashboard
seeknal ask report --exposure monthly_kpis # Deterministic report exposure
seeknal ask report serve my-report # Live-preview with Evidence dev server
```
- **One-shot & chat modes**: Ask questions or start multi-turn sessions with conversation memory
- **12 agent tools**: Data discovery, SQL execution, Python analysis (pandas/scipy/matplotlib), pipeline inspection, and report generation
- **Report exposures**: Define repeatable reports in YAML with pinned SQL queries, chart types (BigValue, BarChart, LineChart, AreaChart, DataTable), and LLM-generated narratives
- **Deterministic reports**: `sections` key pins SQL and charts — LLM only writes commentary
- **Dual output**: Both interactive HTML dashboards and standalone Markdown reports
- **LLM providers**: Google Gemini (default) and Ollama (local, no API key)
- **Subprocess sandbox**: Python execution runs in isolated subprocess with restricted imports
### v2.3.0 (March 2026)
**Incremental Detection** — Automatically skip unchanged data sources and process only new data:
```yaml
# PostgreSQL watermark-based incremental detection
- kind: source
name: events
source: postgresql
table: public.events
freshness:
time_column: created_at # Tracks MAX(created_at) watermark
params:
connection: my_pg
```
- **PostgreSQL Incremental**: Watermark-based detection using `MAX(time_column)` comparison. Automatically generates `WHERE time_col > 'watermark' OR time_col IS NULL` for incremental reads.
- **Iceberg Incremental**: Snapshot-based detection comparing current snapshot ID. Supports partition pruning for time-partitioned tables.
- **Skip Optimization**: If fingerprint and watermark match, source execution is skipped entirely.
- **Cascade Invalidation**: Dependent nodes are automatically invalidated when source data changes.
- **Full Refresh**: Use `--full` flag to ignore stored watermarks and reload all data.
**Other Changes**:
- Enhanced QA automation with multi-spec execution support
- Pipeline error logging with `--verbose` mode
- Security fix: Updated `cryptography` to 46.0.5 (CVE-2026-26007)
### v2.2.2 (February 2026)
- Entity consolidation for per-entity feature views
- Multi-target materialization (PostgreSQL + Iceberg from single node)
- Environment-aware execution with namespace prefixing
## Install from Source
For development or contributing:
```bash
git clone https://github.com/mta-tech/seeknal.git
cd seeknal
uv venv --python 3.11 && source .venv/bin/activate
uv pip install -e ".[all]"
```
## Contributing
Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for setup, code style, testing, and PR guidelines.
## License
Seeknal is [Apache 2.0 licensed](LICENSE).