# Flowbase

**A declarative ML platform for tabular data that eliminates infrastructure complexity.**

Flowbase lets data scientists build production-ready machine learning pipelines using SQL and YAML configs—no DevOps, no infrastructure setup, no complexity. Just clean data, features, and models.

## Why Flowbase?

Traditional ML platforms force you to choose between:
- **Simple tools** that don't scale (notebooks, pandas scripts)
- **Complex platforms** that require dedicated engineering teams (Kubernetes, Airflow, MLflow)

Flowbase gives you the power of enterprise ML infrastructure with the simplicity of local development:

✅ **SQL-first feature engineering** - All features defined in SQL
✅ **Declarative configs** - YAML-based, version-controlled
✅ **Type-safe data cleaning** - Handle messy data with explicit type casting
✅ **Multiple models, one feature set** - Test different algorithms on the same features
✅ **Local-first development** - Work on your laptop, deploy anywhere
✅ **Built-in comparisons** - Automatic model evaluation and comparison

## Core Concepts

Flowbase follows a clear, linear workflow:

```
Raw Data → Datasets → Features → Models → Evaluations
```

1. **Datasets**: Clean, typed data with quality checks
2. **Features**: SQL-based feature engineering with transformations, aggregations, and window functions
3. **Models**: Train multiple models with different algorithms and feature subsets
4. **Evaluations**: Compare models and select the best performer

## Quick Start

### Installation

```bash
# Clone the repository
git clone <repo-url>
cd flowbase

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -e .
```

### Your First Pipeline

Let's predict California housing prices:

**1. Define a dataset** (`configs/datasets/housing_clean.yaml`):

```yaml
name: housing_clean
description: Clean California housing data

columns:
  - name: median_income
    type: DOUBLE
  - name: median_house_value
    type: DOUBLE
  - name: ocean_proximity
    type: VARCHAR
    transform: trim

filters:
  - column: median_income
    operator: ">"
    value: 0
  - column: median_house_value
    operator: ">"
    value: 0
```

**2. Compile the dataset**:

```bash
flowbase dataset compile configs/datasets/housing_clean.yaml data/housing.csv
# Output: data/datasets/housing_clean.parquet
```

**3. Define features** (`configs/features/housing_features.yaml`):

```yaml
name: housing_features
dataset: housing_clean

features:
  # Basic ratios
  - name: rooms_per_household
    expression: "total_rooms / households"

  # Income transformations
  - name: income_log
    expression: "LOG(median_income + 1)"

  # One-hot encoding
  - name: ocean_inland
    expression: "CASE WHEN ocean_proximity = 'INLAND' THEN 1 ELSE 0 END"

window_features:
  # Regional averages
  - name: avg_income_by_ocean
    function: AVG
    column: median_income
    partition_by: [ocean_proximity]
```

**4. Compile features**:

```bash
flowbase features compile configs/features/housing_features.yaml \
  -d data/datasets/housing_clean.parquet
# Output: data/features/housing_features.parquet
```

**5. Train models** (`configs/models/random_forest.yaml`):

```yaml
name: housing_random_forest
feature_set: housing_features
target: median_house_value

features:
  - median_income
  - income_log
  - rooms_per_household
  - ocean_inland
  - avg_income_by_ocean

model:
  type: sklearn
  class: ensemble.RandomForestRegressor
  hyperparameters:
    n_estimators: 100
    random_state: 42

split:
  method: random
  test_size: 0.2
  random_state: 42
```

```bash
flowbase model train configs/models/random_forest.yaml \
  -f data/features/housing_features.parquet
# Output: data/models/housing_random_forest.pkl
```

**6. Compare models**:

```bash
flowbase eval compare \
  data/models/model1.pkl \
  data/models/model2.pkl \
  data/models/model3.pkl \
  -n housing_comparison
```

Output:
```
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Model             ┃ Type    ┃ Test Score┃ RMSE   ┃ MAE    ┃ R²     ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ baseline_simple   │ sklearn │ 0.6139    │ 71,578 │ 51,921 │ 0.6139 │
│ random_forest     │ sklearn │ 0.8167    │ 49,319 │ 31,931 │ 0.8167 │
│ advanced_gbm      │ sklearn │ 0.8411    │ 45,921 │ 29,483 │ 0.8411 │
└───────────────────┴─────────┴───────────┴────────┴────────┴────────┘

Best model: advanced_gbm (test_score: 0.8411)
```

## Examples

Flowbase includes three complete, progressively complex examples that demonstrate the full workflow from raw data to model comparison.

### 1. Iris Classification (`examples/iris/`)

**Perfect for getting started** - A clean, simple dataset to learn the basics.

**The Problem**: Classify iris flowers into 3 species (setosa, versicolor, virginica) based on petal and sepal measurements.

**Dataset**: 150 samples, 4 measurements, 1 target
- `sepal_length`, `sepal_width`, `petal_length`, `petal_width` → `species`

**What You'll Learn**:
- Basic dataset compilation with type casting
- Simple feature engineering (ratios, areas, rankings)
- Training multiple models
- Model comparison for classification

**Run the example**:
```bash
# 1. Clean and type the data
flowbase dataset compile examples/iris/configs/datasets/iris_clean.yaml examples/iris/data/iris.csv

# 2. Engineer features (5 → 19 columns)
#    - Ratios: sepal_ratio, petal_ratio
#    - Areas: sepal_area, petal_area
#    - Rankings: petal_length_rank by species
flowbase features compile examples/iris/configs/features/iris_features.yaml -d data/datasets/iris_clean.parquet

# 3. Train two models
flowbase model train examples/iris/configs/models/random_forest.yaml -f data/features/iris_features.parquet
flowbase model train examples/iris/configs/models/logistic_regression.yaml -f data/features/iris_features.parquet

# 4. Compare results
flowbase eval compare data/models/iris_*.pkl -n iris_comparison
```

**Results**:
```
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┓
┃ Model                 ┃ Type    ┃ Test Score ┃ Accuracy ┃ F1     ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
│ iris_random_forest    │ sklearn │ 0.9778     │ 0.9778   │ 0.9776 │
│ iris_logistic_regr... │ sklearn │ 1.0000     │ 1.0000   │ 1.0000 │ ✨
└───────────────────────┴─────────┴────────────┴──────────┴────────┘
```

**Winner**: Logistic Regression with a perfect 100% accuracy on the test set.

---

### 2. Titanic Survival Prediction (`examples/titanic/`)

**Real-world data messiness** - Learn how to handle missing values, mixed types, and data quality issues.

**The Problem**: Predict passenger survival on the Titanic based on demographics, ticket class, and family relationships.

**Dataset**: 891 passengers, 12 features (messy!)
- **Missing values**: 177 missing ages, 687 missing cabin numbers, 2 missing embarkation ports
- **Type issues**: Mixed case text, numbers stored as strings
- **Complex strings**: Names with titles ("Mr.", "Mrs.", "Master.")

**What You'll Learn**:
- Handling missing values with `allow_null` filters
- Type casting messy data (Age strings → DOUBLE)
- Text normalization (`transform: lower`, `transform: trim`)
- String manipulation (extracting titles from names with `REGEXP_EXTRACT`)
- Creating complex features from multiple columns
- Imputing missing values during feature engineering

**Key Dataset Config Techniques**:
```yaml
columns:
  - name: Age
    type: DOUBLE  # Handles missing values → NULL

  - name: Sex
    type: VARCHAR
    transform: lower  # "Male" → "male"

filters:
  - column: Embarked
    operator: in
    value: ["S", "C", "Q"]
    allow_null: true  # ← Keep rows with missing embarkation
```

**Run the example**:
```bash
# 1. Clean data: handle nulls, normalize text, cast types
flowbase dataset compile examples/titanic/configs/datasets/titanic_clean.yaml examples/titanic/data/titanic.csv

# 2. Engineer 35 features from 12 base columns:
#    - Family features: family_size, is_alone
#    - Age features: age_filled (median imputation), is_child, age_group
#    - Fare features: fare_per_person, fare_category
#    - Text extraction: title from name (Mr, Mrs, Miss, Master, etc.)
#    - Cabin features: has_cabin, cabin_deck
#    - One-hot encoding: sex, class, embarkation port
#    - Window functions: avg_fare_by_class, title_count
flowbase features compile examples/titanic/configs/features/survival_features.yaml -d data/datasets/titanic_clean.parquet

# 3. Train three different models
flowbase model train examples/titanic/configs/models/random_forest.yaml -f data/features/survival_features.parquet
flowbase model train examples/titanic/configs/models/gradient_boosting.yaml -f data/features/survival_features.parquet
flowbase model train examples/titanic/configs/models/logistic_regression.yaml -f data/features/survival_features.parquet

# 4. Compare all models
flowbase eval compare data/models/titanic_*.pkl -n titanic_comparison
```

**Results**:
```
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┓
┃ Model                 ┃ Type    ┃ Test Score ┃ Accuracy ┃ F1     ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
│ titanic_random_forest │ sklearn │ 0.8156     │ 0.8156   │ 0.7591 │
│ titanic_gradient_bo...│ sklearn │ 0.8380     │ 0.8380   │ 0.7883 │ ✨
│ titanic_logistic_re...│ sklearn │ 0.8101     │ 0.8101   │ 0.7671 │
└───────────────────────┴─────────┴────────────┴──────────┴────────┘
```

**Winner**: Gradient Boosting with 83.8% accuracy.

**Key Insight**: Even with 20% missing age data and 77% missing cabin data, proper feature engineering (family_size, title extraction, fare_per_person) enables strong predictive performance.

---

### 3. California Housing Price Prediction (`examples/housing/`)

**Production-scale complexity** - The ultimate demonstration of Flowbase's power.

**The Problem**: Predict median house prices for California districts based on location, demographics, and housing characteristics.

**Dataset**: 20,640 districts, 10 base features
- Geographic: `longitude`, `latitude`
- Housing: `housing_median_age`, `total_rooms`, `total_bedrooms` (207 nulls), `households`
- Demographics: `population`
- Economic: `median_income`
- Categorical: `ocean_proximity` (5 categories)
- Target: `median_house_value`

**What You'll Learn**:
- **Scale**: Working with 20K+ rows
- **Comprehensive feature engineering**: 10 base features → 50 engineered features
- **Advanced SQL**: Window functions, aggregations, complex CASE statements
- **Feature strategy**: Building one rich feature set, then selecting different subsets for different models
- **Model comparison**: Testing simple vs. complex approaches on the same data

**The Feature Engineering Strategy**:

We create **50 total features** organized into categories:

**1. Basic Ratios** (4 features)
```yaml
- name: rooms_per_household
  expression: "total_rooms / households"

- name: bedrooms_per_room
  expression: "COALESCE(total_bedrooms, total_rooms * 0.2) / total_rooms"
```

**2. Geographic Features** (7 features)
```yaml
- name: distance_to_sf
  expression: "SQRT(POW(latitude - 37.77, 2) + POW(longitude + 122.42, 2))"

- name: min_distance_to_city
  expression: |
    LEAST(
      SQRT(POW(latitude - 34.05, 2) + POW(longitude + 118.24, 2)),  -- LA
      SQRT(POW(latitude - 37.77, 2) + POW(longitude + 122.42, 2)),  -- SF
      SQRT(POW(latitude - 32.72, 2) + POW(longitude + 117.16, 2))   -- SD
    )
```

**3. Income Transformations** (4 features)
```yaml
- name: income_squared
  expression: "median_income * median_income"

- name: income_log
  expression: "LOG(median_income + 1)"
```

**4. Interaction Features** (2 features)
```yaml
- name: income_age_interaction
  expression: "median_income * housing_median_age"
```

**5. Window Aggregations** (7 features)
```yaml
window_features:
  - name: avg_income_by_ocean
    function: AVG
    column: median_income
    partition_by: [ocean_proximity]

  - name: income_percentile_by_region
    function: PERCENT_RANK
    partition_by: [ocean_proximity]
    order_by: [median_income]
```

**6. One-Hot Encoding** (9 features for ocean_proximity + 4 for income_category)

**The Model Strategy** - Different feature subsets for different approaches:

**Model 1: Baseline Simple** (12 features)
- Just the basics: raw features + simple ratios
- Algorithm: Ridge Regression (linear)
- Philosophy: "Keep it simple, establish a baseline"

**Model 2: Random Forest Medium** (22 features)
- Core features + some engineering, no heavy interactions
- Algorithm: Random Forest (ensemble)
- Philosophy: "Moderate complexity, let trees handle interactions"

**Model 3: Advanced Engineered** (45 features)
- Everything: geographic distances, transformations, interactions, aggregations
- Algorithm: Gradient Boosting (advanced ensemble)
- Philosophy: "Give the model everything we've got"

**Run the example**:
```bash
# 1. Clean and validate data
flowbase dataset compile examples/housing/configs/datasets/housing_clean.yaml examples/housing/data/housing.csv

# 2. Create comprehensive feature set (10 → 50 columns)
flowbase features compile examples/housing/configs/features/housing_features_comprehensive.yaml \
  -d data/datasets/housing_clean.parquet

# 3. Train three models with DIFFERENT feature subsets from the SAME feature file
#    This is the key insight: one feature set, multiple strategies

# Baseline: 12 simple features
flowbase model train examples/housing/configs/models/baseline_simple.yaml \
  -f data/features/housing_features_comprehensive.parquet

# Medium: 22 moderate features
flowbase model train examples/housing/configs/models/random_forest_medium.yaml \
  -f data/features/housing_features_comprehensive.parquet

# Advanced: 45 complex features
flowbase model train examples/housing/configs/models/advanced_engineered.yaml \
  -f data/features/housing_features_comprehensive.parquet

# 4. Compare all three strategies
flowbase eval compare data/models/housing_*.pkl -n housing_comparison
```

**Results**:
```
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Model                 ┃ Type    ┃ Test Score ┃ RMSE   ┃ MAE    ┃ R²     ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ housing_baseline_si...│ sklearn │ 0.6139     │ 71,578 │ 51,921 │ 0.6139 │
│ housing_random_fore...│ sklearn │ 0.8167     │ 49,319 │ 31,931 │ 0.8167 │
│ housing_advanced_en...│ sklearn │ 0.8411     │ 45,921 │ 29,483 │ 0.8411 │ ✨
└───────────────────────┴─────────┴────────────┴────────┴────────┴────────┘
```

**Winner**: Advanced GBM with R² = 0.841 (84.1% variance explained)

**Key Insights**:

1. **Feature engineering matters**: Going from 12 → 45 features lifted R² from 0.614 to 0.841 (a **37%** relative improvement)

2. **RMSE tells the story**:
   - Baseline: $71,578 RMSE
   - Advanced: $45,921 RMSE
   - A **$25,657** reduction in typical prediction error

3. **One feature set, multiple experiments**: We engineered features once, then tried different combinations. This mirrors a real-world ML workflow.

4. **Geographic features are powerful**: Distance to major cities (LA, SF, SD) was highly predictive.

5. **Window aggregations help**: Knowing the average income in your region (ocean_proximity group) provides valuable context.

6. **Model selection via comparison**: Flowbase's automatic comparison made it easy to identify the best approach.

---

## Example Comparison Summary

| Example | Complexity | Samples | Features | Key Learning |
|---------|-----------|---------|----------|--------------|
| **Iris** | Simple | 150 | 5 → 19 | Basic workflow, feature ratios, rankings |
| **Titanic** | Messy | 891 | 12 → 35 | Missing values, type casting, text extraction |
| **Housing** | Production | 20,640 | 10 → 50 | Scale, window functions, feature strategy |

**Recommended Learning Path**:
1. Start with **Iris** to understand the basic workflow
2. Move to **Titanic** to learn data cleaning and handling messiness
3. Master **Housing** to see production-scale feature engineering and model comparison

## Key Features

### 1. SQL-First Feature Engineering

Define features in SQL—the universal language of data:

```yaml
features:
  # Simple expressions
  - name: age_squared
    expression: "age * age"

  # Complex logic
  - name: risk_score
    expression: |
      CASE
        WHEN age < 25 THEN income * 0.5
        WHEN age < 40 THEN income * 1.0
        ELSE income * 1.5
      END

# Window functions
window_features:
  - name: income_rank
    function: RANK
    partition_by: [state, city]
    order_by: [income DESC]
```

### 2. Type-Safe Data Cleaning

Handle messy real-world data explicitly:

```yaml
columns:
  - name: age
    type: DOUBLE  # Casts VARCHAR "25" → 25.0
    transform: trim

  - name: total_bedrooms
    type: DOUBLE  # Handles NULLs gracefully

  - name: sex
    type: VARCHAR
    transform: lower  # "MALE" → "male"

filters:
  - column: age
    operator: ">"
    value: 0
  - column: ocean_proximity
    operator: in
    value: ["INLAND", "NEAR OCEAN"]
    allow_null: true
```

### 3. Multiple Models, One Feature Set

Test different algorithms and feature combinations:

```yaml
# baseline_simple.yaml - 12 features
features: [longitude, latitude, median_income, rooms_per_household, ...]
model:
  type: sklearn
  class: linear_model.Ridge

# advanced_engineered.yaml - 45 features
features: [longitude, latitude, median_income, income_squared, income_log,
           distance_to_la, distance_to_sf, avg_income_by_ocean, ...]
model:
  type: sklearn
  class: ensemble.GradientBoostingRegressor
```

### 4. Automatic Model Comparison

Built-in evaluation with proper metrics:

- **Classification**: Accuracy, Precision, Recall, F1
- **Regression**: RMSE, MAE, R², MSE

Results saved to JSON for further analysis.

### 5. Model Inference

Once trained, use your models for predictions on new data:

**Example: Iris Classification**
```bash
# Predict iris species from measurements
flowbase model predict iris_simple \
  --input '{"sepal length (cm)": 5.1, "sepal width (cm)": 3.5, "petal length (cm)": 1.4, "petal width (cm)": 0.2}'
```

Output:
```
Making prediction with: iris_simple

Input features:
  sepal length (cm): 5.1
  sepal width (cm): 3.5
  petal length (cm): 1.4
  petal width (cm): 0.2

✓ Prediction complete

Prediction: 0

Class Probabilities:
  Class 0: 0.9766 (97.66%)  ← setosa
  Class 1: 0.0234 (2.34%)   ← versicolor
  Class 2: 0.0000 (0.00%)   ← virginica
```

**Example: Titanic Survival**
```bash
# Predict passenger survival
flowbase model predict titanic_simple \
  --input '{"pclass": 3, "is_male": 1, "age": 22, "sibsp": 1, "parch": 0, "fare": 7.25}'
```

Output:
```
Prediction: 0

Class Probabilities:
  Class 0: 0.8800 (88.00%)  ← died
  Class 1: 0.1200 (12.00%)  ← survived
```

**Example: Housing Price**
```bash
# Predict median house value (in $100k)
flowbase model predict housing_simple \
  --input '{"MedInc": 8.3, "HouseAge": 41, "AveRooms": 6.98, "AveBedrms": 1.02, "Population": 322, "AveOccup": 2.55, "Latitude": 37.88, "Longitude": -122.23}'
```

Output:
```
Prediction: 4.242422900000001

# This means ~$424k median house value
```

**Input from JSON File**

For complex inputs, use a JSON file (`input.json`):

```json
{
  "sepal length (cm)": 5.1,
  "sepal width (cm)": 3.5,
  "petal length (cm)": 1.4,
  "petal width (cm)": 0.2
}
```

```bash
flowbase model predict iris_simple --input input.json
```

**Key Features**:
- ✅ **Automatic validation**: Checks for missing required features
- ✅ **Correct ordering**: Ensures features are passed in the right order
- ✅ **Probability scores**: Shows class probabilities for classifiers
- ✅ **Type handling**: Automatically handles type conversions
- ✅ **Model metadata**: Uses saved feature names and preprocessing

### 6. Inference Jobs (Serverless-Ready)

For production deployments, especially serverless environments (AWS Lambda, Cloud Functions), use **inference jobs** to coordinate:
- Data scraping from APIs/sources
- Feature generation from historical data
- Batch predictions
- Result storage in partitioned tables

**Why Inference Jobs?**
- 🚀 **Serverless-first**: Designed to run on Lambda without a data warehouse
- 📊 **Historical data**: Access bulk storage (S3 partitioned tables) for features
- 🔄 **Orchestration**: Coordinate scraping → features → inference → storage
- 💰 **Cost-effective**: No expensive data warehouse needed

**Project Structure**:
```
your-project/
├── datasets/         # Data cleaning configs
├── features/         # Feature engineering configs
├── models/           # Model training configs
├── scrapers/         # Data collection configs
├── tables/           # Table/storage configs
└── inference/        # 👈 Inference job configs
    ├── daily_batch/
    │   └── config.yaml
    ├── real_time/
    │   └── config.yaml
    └── event_predict/
        └── config.yaml
```

**Inference Config Example** (`examples/iris/configs/inference/iris_batch_daily/config.yaml`):

```yaml
name: iris_batch_daily
description: Batch predictions for iris flowers
version: 1.0

# Just specify the model (features auto-resolved from model → feature_set → data/features/)
model: iris_logistic_regression

# Identifier columns to include in output
select_columns:
  - sepal_length
  - petal_length
  - species  # Ground truth for evaluation

# Dynamic filters passed as CLI parameters (all optional)
filters:
  - param: species
    column: species
    operator: "="
    type: string
    required: false

# Where to save predictions
output:
  file:
    directory: ../../../data/predictions
    filename: iris_predictions.parquet
    format: parquet
```

**Running Inference Jobs**:

```bash
# Run predictions on all iris flowers
flowbase infer run iris_batch_daily \
  --config examples/iris/configs/inference/iris_batch_daily/config.yaml

# Filter by species
flowbase infer run iris_batch_daily \
  --config examples/iris/configs/inference/iris_batch_daily/config.yaml \
  --species "setosa"

# Preview without saving
flowbase infer run iris_batch_daily \
  --config examples/iris/configs/inference/iris_batch_daily/config.yaml \
  --species "setosa" \
  --preview --skip-outputs

# List available inference configs
flowbase infer list --base-dir examples/iris/configs/inference
```

**Output**:
```
✓ Inference complete for iris_batch_daily: 50 row(s)
Feature source: data/features/iris_features.parquet
WHERE: species = 'setosa'

Outputs:
  file: /path/to/flowbase/data/predictions/iris_predictions.parquet
```

**Serverless Deployment Example**:

The inference runner is designed to work in Lambda/Cloud Functions:

```python
# lambda_handler.py
from flowbase.inference.runner import InferenceRunner

def lambda_handler(event, context):
    """
    Triggered daily or by API Gateway
    Event: {"date": "2025-10-11", "venue": "Sandown Park"}
    """
    runner = InferenceRunner(base_dir="s3://my-bucket/configs/inference")

    result = runner.run(
        model_name="daily_batch",
        params=event,
        skip_outputs=False
    )

    return {
        "statusCode": 200,
        "predictions": len(result["results"]),
        "outputs": result["outputs"]
    }
```

**Key Features**:
- ✅ **Read from S3**: DuckDB can query S3 partitioned tables directly
- ✅ **Dynamic filters**: Pass parameters via CLI or Lambda events
- ✅ **Flexible outputs**: Save to files, tables, or both
- ✅ **Date partitioning**: Automatic date-based table partitioning
- ✅ **Dependencies**: Optionally scrape/generate features before inference
- ✅ **No data warehouse**: Query parquet files directly from S3

## CLI Commands

```bash
# Dataset management
flowbase dataset compile <config.yaml> <source.csv> [--output <path>] [--preview]

# Feature engineering
flowbase features compile <config.yaml> --dataset <dataset.parquet> [--output <path>] [--preview]

# Model training
flowbase model train <config.yaml> --features <features.parquet> [--output <models-dir>]

# Model prediction (single record)
flowbase model predict <model_name> --input <json-input> [--models-dir <dir>]

# Batch inference jobs (serverless-ready)
flowbase infer run <job_name> [--param value] [--preview] [--skip-outputs]
flowbase infer list [--base-dir <inference-configs-dir>]

# Model evaluation
flowbase eval compare <model1.pkl> <model2.pkl> ... --name <eval-name>
```

## Project Structure

```
flowbase/
├── flowbase/
│   ├── cli/              # Command-line interface
│   ├── core/             # Configuration loading
│   ├── pipelines/        # Dataset & feature compilers
│   ├── models/           # Model training
│   ├── inference/        # Inference runner for batch jobs
│   ├── query/            # DuckDB query engine
│   └── storage/          # Storage abstraction
│
├── examples/
│   ├── iris/             # Simple classification
│   │   └── configs/
│   │       ├── datasets/
│   │       ├── features/
│   │       ├── models/
│   │       └── inference/  # Inference job configs
│   ├── titanic/          # Messy data handling
│   └── housing/          # Production-scale example
│
└── data/                 # Generated outputs
    ├── datasets/         # Cleaned, typed data
    ├── features/         # Engineered features
    ├── models/           # Trained models
    ├── predictions/      # Inference outputs
    └── evals/            # Evaluation results
```

## How It Works

### 1. Dataset Compiler

Transforms raw CSV → clean, typed Parquet:

**Input**: `housing.csv` with mixed types, nulls, outliers
**Config**: Type specifications, transformations, quality filters
**Output**: `housing_clean.parquet` - clean, typed, validated

Generated SQL:
```sql
SELECT
    TRY_CAST(longitude AS DOUBLE) AS longitude,
    TRY_CAST(median_income AS DOUBLE) AS median_income,
    CAST(TRIM(ocean_proximity) AS VARCHAR) AS ocean_proximity
FROM raw_data
WHERE median_income > 0
  AND ocean_proximity IN ('INLAND', 'NEAR OCEAN')
ORDER BY latitude DESC
```

### 2. Feature Compiler

Transforms datasets → feature-rich training data:

**Input**: Clean dataset parquet
**Config**: Feature expressions, window functions
**Output**: Feature-engineered parquet ready for training

Supports:
- **Computed features**: `income * age`, `CASE WHEN ...`
- **Window functions**: `RANK()`, `AVG() OVER (PARTITION BY ...)`
- **One-hot encoding**: Automatic categorical expansion
- **Aggregations**: Group-by operations

### 3. Model Trainer

Trains sklearn/XGBoost/LightGBM models from config:

**Input**: Feature parquet + model config
**Processing**:
- Automatic train/test split
- Missing value imputation
- Model training with hyperparameters
- Metric calculation (classification or regression)

**Output**:
- Trained model (`.pkl`)
- Metadata JSON (features, metrics, hyperparameters)

### 4. Model Evaluator

Compares multiple models with rich tables:

**Input**: Multiple trained model files
**Output**:
- Comparison table (formatted)
- Best model selection
- JSON results file

## Design Principles

1. **SQL is the interface** - All data transformations in SQL
2. **Declarative over imperative** - YAML configs, not code
3. **Type safety matters** - Explicit type casting and validation
4. **Local-first** - Develop on your laptop, deploy anywhere
5. **One feature set, many models** - Reuse features across experiments
6. **Reproducible** - Configs are version-controlled, deterministic

## Roadmap

- [ ] Time-series train/test splits
- [ ] S3/GCS dataset support
- [ ] Trino query engine (for production scale)
- [ ] Model serving API
- [ ] Streaming pipelines
- [ ] Automated hyperparameter tuning
- [ ] Feature store
- [ ] Model monitoring & drift detection

## Contributing

Contributions welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md).

## License

MIT License - see [LICENSE](LICENSE) for details.

---

**Built for data scientists who want to focus on models, not infrastructure.**