https://github.com/mta-tech/seeknal

Seeknal is an all-in-one platform for data and AI/ML engineering
https://github.com/mta-tech/seeknal

analytics-engineering data-engineering data-science duckdb feature-engineering feature-management feature-store machine-learning mlops

Last synced: 1 day ago
JSON representation

Seeknal is an all-in-one platform for data and AI/ML engineering

Host: GitHub
URL: https://github.com/mta-tech/seeknal
Owner: mta-tech
License: apache-2.0
Created: 2024-11-05T10:46:51.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2026-02-23T05:30:15.000Z (4 days ago)
Last Synced: 2026-02-23T12:05:07.608Z (4 days ago)
Topics: analytics-engineering, data-engineering, data-science, duckdb, feature-engineering, feature-management, feature-store, machine-learning, mlops
Language: Python
Homepage:
Size: 13.7 MB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

README

          
    


        Seeknal

    

    

        An all-in-one platform for data and AI/ML engineering

    


Seeknal is a platform that abstracts away the complexity of data transformation and AI/ML engineering. It is a collection of tools that help you transform data, store it, and use it for machine learning and data analytics.

Seeknal lets you:

- **Define** data and feature transformations from raw data sources using Pythonic APIs and YAML.

- **Register** transformations and feature groups by names and get transformed data and features for various use cases including AI/ML modeling, data engineering, business metrics calculation and more.

- **Share** transformations and feature groups across teams and company.

Seeknal is useful in multiple use cases including:

- AI/ML modeling: computes your feature transformations and incorporates them into your training data, using point-in-time joins to prevent data leakage while supporting the materialization and deployment of your features for online use in production.

- Data analytics: build data pipelines to extract features and metrics from raw data for Analytics and AI/ML modeling.

Seeknal is designed as a comprehensive data processing tool that enables you to create an end-to-end pipeline by allowing you to utilize one or more data processing engines (such as Apache Spark combined with DuckDB). To facilitate execution across various engines, Seeknal defines the pipeline in JSON format, which the respective engine processes. In this context, the engines need to support JSON input for the pipeline to function correctly. Since some data processors do not naturally handle YAML input, we enhance these data processors to incorporate this feature, which we refer to as engines. These engines are located in the`engines` folder.

## Getting started

We recommend to use uv for installing Seeknal. The following steps are expecting you to have [UV](https://docs.astral.sh/uv/guides/install-python/) installed.

To install Seeknal, follow these steps:

1. Download the Seeknal package:

    

    - Visit the [releases](https://github.com/mta-tech/seeknal/releases) page and download the latest package.

2. Extract the Downloaded File:

    - Unzip the downloaded zip file to your working directory.

3. Initialize the environment using uv:

    - Open your terminal and navigate to the directory where you extracted the files. Then, run the following command to initialize the environment:

    ```

    $ cd seeknal_build

    $ uv venv --python 3.11

    ```

    - Activate the environment:

    ```

    source .venv/bin/activate  

    ```

4. Install Seeknal using `uv pip`:

    ```

    uv pip install seeknal--py3-none-any.whl

    ```

    Replace  with the actual version number of the wheel file you downloaded.

5. Verify the Installation:

    To ensure that Seeknal has been installed correctly, you can run:

    

    ```

    uv pip show seeknal

    ```

    This command will display information about the installed package, confirming that the installation was successful.

6. Edit `.env` variable `SEEKNAL_BASE_CONFIG_PATH` and `SEEKNAL_USER_CONFIG_PATH` to point to the directory where you have `config.toml` file. For getting started, we have an example config.toml which you can find inside the `seeknal_build` directory. This case necessary update to the .env to point to the directory.

    ```

    SEEKNAL_BASE_CONFIG_PATH="path/to/seeknal_build"

    SEEKNAL_USER_CONFIG_PATH="path/to/seeknal_build/config.toml"

    ```

Congratulation!

Your seeknal has been installed on your machine and ready to use in your projects. To see it in action, check out the `feature-store-demo.ipynb` notebook or see it below.

## Seeknal in action

1. Create a data pipeline

    ```python

    from seeknal.project import Project

    from seeknal.flow import (

        Flow,

        FlowInput,

        FlowOutput,

        FlowInputEnum,

        FlowOutputEnum,

    )

    from seeknal.tasks.sparkengine import SparkEngineTask

    from seeknal.tasks.duckdb import DuckDBTask

    project = Project(name="my_project", description="My project")

    project.get_or_create()

    flow_input = FlowInput(kind=FlowInputEnum.HIVE_TABLE, value="my_df")

    flow_output = FlowOutput(kind=FlowOutputEnum.SPARK_DATAFRAME)

    # Develop a pipeline that mixes Spark and DuckDB.

    task_on_spark = SparkEngineTask().add_sql("SELECT * FROM __THIS__ WHERE day = date_format(current_date(), 'yyyy-MM-dd')")

    task_on_duckdb = DuckDBTask().add_sql("SELECT id, lat, lon, movement_type, day FROM __THIS__")

    flow = Flow(

        name="my_flow",

        input=flow_input,

        tasks=[task_on_spark, task_on_duckdb],

        output=FlowOutput(),

    )

    # save the data pipeline

    flow.get_or_create()

    res = flow.run()

    ```

2. Load the saved data pipeline

    ```python

    project = Project(name="my_project", description="My project")

    project.get_or_create()

    flow = Flow(name="my_flow").get_or_create()

    res = flow.run()

    ```

3. Save the results to a feature group

    ```python

    from datetime import datetime

    from seeknal.entity import Entity

    from seeknal.featurestore.feature_group import (

        FeatureGroup,

        Materialization,

        OfflineMaterialization,

        OfflineStore,

        OfflineStoreEnum,

        FeatureStoreFileOutput,

        OnlineStore,

        OnlineStoreEnum,

        HistoricalFeatures,

        FeatureLookup,

        FillNull,

        GetLatestTimeStrategy,

        OnlineFeatures,

    )

    # Define a materialization for the offline feature store

    materialization = Materialization(event_time_col="day", 

    offline_materialization=OfflineMaterialization(

        store=OfflineStore(kind=OfflineStoreEnum.FILE, 

                           name="object_storage",

                           value=FeatureStoreFileOutput(path="s3a://warehouse/feature_store")), 

                           mode="overwrite", ttl=None),

        offline=True)

    # Define feature group

    loc_feature_group = FeatureGroup(

        name="location_feature_group",

        entity=Entity(name="user_movement", join_keys=["msisdn", "movement_type"]).get_or_create(),

        materialization=materialization,

    )

    # Attach transformation for create the feature group

    loc_feature_group.set_flow(flow)

    # Register all columns as features

    loc_feature_group.set_features()

    # Save feature group

    loc_feature_group.get_or_create()

    # materialize the feature group to offline feature store

    loc_feature_group.write(

        # store features from specific date to the latest

        feature_start_time=datetime(2019, 3, 5)

    )

    ```

4. Load feature group from offline feature store

    ```python

    loc_feature_group = FeatureGroup(name="location_feature_group").get_or_create()

    # lookup for all features of loc_feature_group

    fs = FeatureLookup(source=loc_feature_group)

    # impute null to 0.0

    fillnull = FillNull(value="0.0", dataType="double")

    # load the features from offline feature store

    hist = HistoricalFeatures(lookups=[fs], fill_nulls=[fillnull])

    df = hist.to_dataframe(feature_start_time=datetime(2019, 3, 5))

    ```

5. Serve features to online feature store

    ```python

    latest_features = hist.using_latest.serve()

    user_one = Entity(name="user_movement").get_or_create().set_key_values("05X5wBWKN3")

    user_one_features = latest_features.get_features(keys=[user_one])

    ```

## Use Turso as Database

Seeknal uses an SQLite database to store internal data. For production or collaborative use of Seeknal, we suggest using [Turso](https://turso.com/) as your database provider. This allows you to share your Seeknal projects seamlessly across teams and environments, given that it operates using the same database. To set up Turso as your database, edit the `config.toml` file and adjust the `context.database` setting accordingly:

```toml

[context.database]

TURSO_DATABASE_URL = ""

TURSO_AUTH_TOKEN = ""

```

## Contributing

Contributions are welcome! Please read our contributing guidelines before submitting pull requests.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/mta-tech/seeknal

Awesome Lists containing this project

README

Seeknal

An all-in-one platform for data and AI/ML engineering