{"id":13571145,"url":"https://github.com/alibaba/feathub","last_synced_at":"2025-10-14T08:45:03.070Z","repository":{"id":59900454,"uuid":"539819783","full_name":"alibaba/feathub","owner":"alibaba","description":"FeatHub - A stream-batch unified feature store for real-time machine learning","archived":false,"fork":false,"pushed_at":"2024-05-27T11:29:47.000Z","size":4782,"stargazers_count":338,"open_issues_count":104,"forks_count":57,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-08-20T03:16:05.487Z","etag":null,"topics":["apache-flink","data","data-engineering","data-quality","data-science","feature-engineering","feature-store","machine-learning","mlops","streaming"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alibaba.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2022-09-22T05:50:44.000Z","updated_at":"2025-08-16T17:17:19.000Z","dependencies_parsed_at":"2024-01-14T04:07:24.107Z","dependency_job_id":null,"html_url":"https://github.com/alibaba/feathub","commit_stats":{"total_commits":89,"total_committers":8,"mean_commits":11.125,"dds":0.5617977528089888,"last_synced_commit":"cb46d4ce222584b3db30459ee4a7359548090601"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/alibaba/feathub","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alibaba%2Ffeathub","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alibaba%2Ffeathub/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/alibaba%2Ffeathub/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alibaba%2Ffeathub/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alibaba","download_url":"https://codeload.github.com/alibaba/feathub/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alibaba%2Ffeathub/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279018302,"owners_count":26086345,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-14T02:00:06.444Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-flink","data","data-engineering","data-quality","data-science","feature-engineering","feature-store","machine-learning","mlops","streaming"],"created_at":"2024-08-01T14:00:59.156Z","updated_at":"2025-10-14T08:45:03.045Z","avatar_url":"https://github.com/alibaba.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"FeatHub is a stream-batch unified feature store that simplifies feature\ndevelopment, deployment, monitoring, and sharing for machine learning\napplications.\n\n- [Introduction](#introduction)\n- [Core Benefits](#core-benefits)\n- [What you can do with FeatHub](#what-you-can-do-with-feathub)\n- [Architecture Overview](#architecture-overview)\n- [Supported Compute 
Engines](#supported-compute-engines)\n- [FeatHub SDK Highlights](#feathub-sdk-highlights)\n- [User Guide](#user-guide)\n  * [Prerequisites](#prerequisites)\n  * [Install FeatHub Nightly Build](#install-feathub-nightly-build)\n  * [Quickstart](#quickstart)\n  * [Examples](#examples)\n- [Developer Guide](#developer-guide)\n- [Roadmap](#roadmap)\n- [Contact Us](#contact-us)\n- [Additional Resources](#additional-resources)\n\n## Introduction\n\nFeatHub is an open-source feature store designed to simplify the development and\ndeployment of machine learning models. It supports feature ETL and provides an\neasy-to-use Python SDK that abstracts away the complexities of point-in-time\ncorrectness needed to avoid training-serving skew. With FeatHub, data scientists\ncan speed up the feature deployment process and optimize feature ETL by\nautomatically compiling declarative feature definitions into performant\ndistributed ETL jobs using state-of-the-art computation engines of their choice,\nsuch as Flink or Spark.\n\nCheck out the [Documentation](docs/content) for guidance on compute\nengines, connectors, expression language, and more.\n\n\n## Core Benefits\n\nSimilar to other feature stores, FeatHub provides the following core benefits:\n\n- **Simplified feature development**: The Pythonic [FeatHub\nSDK](docs/content/feathub-sdk) makes it easy to develop features without worrying\nabout point-in-time correctness. This helps to avoid training-serving skew,\nwhich can negatively impact the accuracy of machine learning models.\n- **Faster feature deployment**: FeatHub automatically compiles user-specified\ndeclarative feature definitions into performant distributed ETL jobs using\nstate-of-the-art computation engines, such as Flink or Spark. 
This speeds up\nthe feature deployment process and eliminates the need for data engineers to\nrewrite Python programs into distributed stream or batch processing jobs.\n- **Performant feature generation**: FeatHub offers a range of [built-in\n  optimizations](docs/content/deep-dive/optimizations.md) that leverage commonly\nobserved feature ETL job patterns. These optimizations are automatically applied\nto ETL jobs compiled from the declarative feature definitions, much like how SQL\noptimizations are applied.\n- **Facilitated feature sharing**: FeatHub allows developers to register and\nquery feature definitions in a persistent [feature\nregistry](docs/content/registries). This capability reduces the duplication of\ndata engineering efforts and the resource cost of feature generation by\nallowing developers in the organization to share and reuse existing feature\ndefinitions and datasets.\n\nIn addition to the above benefits, FeatHub provides several architectural\nbenefits compared to other feature stores, including:\n\n- **Real-time feature generation**: FeatHub supports real-time feature\ngeneration using [Apache Flink](docs/content/engines/flink.md) as the stream\ncomputation engine with millisecond latency. This provides better performance\nthan other open-source feature stores that only support feature generation\nusing Apache Spark.\n\n- **Assisted feature monitoring**: FeatHub provides [built-in\nmetrics](docs/content/metric-stores) to monitor the quality of features and\nalert users to issues such as feature drift. 
This helps to improve the accuracy\nand reliability of machine learning models.\n\n- **Stream-batch unified computation**: FeatHub allows for consistent feature\ncomputation across offline, nearline, and online stacks using [Apache\nFlink](docs/content/engines/flink.md) for real-time features with low latency,\n[Apache Spark](docs/content/engines/spark.md) for offline features with high\nthroughput, and FeatureService for computing features online when the request\nis received.\n\n- **Extensible framework**: FeatHub's Python SDK is decoupled from the APIs of\nthe underlying computation engines, providing flexibility and avoiding lock-in.\nThis allows for the support of additional computation engines in the future.\nFor example, FeatHub supports a [Local\nProcessor](docs/content/engines/local.md) that is implemented using the Pandas\nlibrary, in addition to its support for Apache Flink and Apache Spark.\n\nUsability is a crucial factor that sets feature store projects apart. Our SDK is\ndesigned to be **Pythonic**, **declarative**, intuitive, and highly expressive to\nsupport all the necessary feature transformations. We understand that a feature\nstore's success depends on its usability as it directly affects developers'\nproductivity. Check out the [FeatHub SDK Highlights](#feathub-sdk-highlights)\nsection below to learn more about the exceptional usability of our SDK.\n\n\n\u003c!-- TODO: provide examples showing the advantage of python SDK over SQL. 
--\u003e\n\n## What you can do with FeatHub\n\nWith FeatHub, you can:\n- **Define new features**: Define features as the result of applying\nexpressions, aggregations, and cross-table joins on existing features, all with\npoint-in-time correctness.\n- **Read and write feature data**: Read and write feature data into a variety\n  of offline, nearline, and online [storage\nsystems](docs/content/connectors) for both offline training and online\nserving.\n- **Backfill feature data**: Process historical data with the given time range\nand/or keys to backfill feature data.\n- **Run experiments**: Run experiments on the local machine using\nLocalProcessor without connecting to an Apache Flink or Apache Spark cluster. Then\ndeploy the FeatHub program in a distributed Apache Flink or Apache Spark\ncluster by changing the program configuration.\n\n## Architecture Overview\n\nThe architecture of FeatHub and its key components are shown in the figure below.\n\n\u003cimg src=\"docs/static/img/architecture_1.png\" width=\"50%\" height=\"auto\"\u003e\n\nThe workflow of defining, computing, and serving features using FeatHub is illustrated in the figure below.\n\n\u003cimg src=\"docs/static/img/architecture_2.png\" width=\"70%\" height=\"auto\"\u003e\n\nSee [Basic Concepts](docs/content/basic-concepts.md) for more details about the key components in FeatHub.\n\n## Supported Compute Engines\n\nFeatHub supports the following compute engines to execute feature ETL pipelines:\n- [Apache Flink 1.16](docs/content/engines/flink.md)\n- [Apache Spark 3.3](docs/content/engines/spark.md)\n- [Local Processor](docs/content/engines/local.md)\n\n## FeatHub SDK Highlights\n\nThe following examples demonstrate how to define a variety of features\nconcisely using FeatHub SDK. 
See [FeatHub\nSDK](docs/content/feathub-sdk) for more details.\n\nSee [NYC Taxi Demo](docs/examples/nyc_taxi.ipynb) to learn more about how to\ndefine, generate and serve features using FeatHub SDK.\n\n- Define features via table joins with point-in-time correctness:\n\n```python\nf_price = Feature(\n    name=\"price\",\n    transform=JoinTransform(\n        table_name=\"price_update_events\",\n        feature_name=\"price\"\n    ),\n    keys=[\"item_id\"],\n)\n```\n\n- Define over-window aggregation features:\n\n```python\nf_total_payment_last_two_minutes = Feature(\n    name=\"total_payment_last_two_minutes\",\n    transform=OverWindowTransform(\n        expr=\"item_count * price\",\n        agg_func=\"SUM\",\n        window_size=timedelta(minutes=2),\n        group_by_keys=[\"user_id\"]\n    )\n)\n```\n\n- Define sliding-window aggregation features:\n\n```python\nf_total_payment_last_two_minutes = Feature(\n    name=\"total_payment_last_two_minutes\",\n    transform=SlidingWindowTransform(\n        expr=\"item_count * price\",\n        agg_func=\"SUM\",\n        window_size=timedelta(minutes=2),\n        step_size=timedelta(minutes=1),\n        group_by_keys=[\"user_id\"]\n    )\n)\n```\n\n- Define features via built-in functions and the FeatHub expression language:\n\n```python\nf_trip_time_duration = Feature(\n    name=\"f_trip_time_duration\",\n    transform=\"UNIX_TIMESTAMP(taxi_dropoff_datetime) - UNIX_TIMESTAMP(taxi_pickup_datetime)\",\n)\n```\n\n- Define a feature via Python UDF:\n\n```python\nf_lower_case_name = Feature(\n    name=\"lower_case_name\",\n    dtype=types.String,\n    transform=PythonUdfTransform(lambda row: row[\"name\"].lower()),\n)\n```\n\n\u003c!-- TODO: Add SqlFeatureView. --\u003e\n\n## User Guide\n\nCheck out the [Documentation](docs/content) for guidance on compute\nengines, connectors, expression language, and more.\n\n### Prerequisites\n\nYou need the following to run FeatHub installed using pip:\n- Unix-like operating system (e.g. 
Linux, Mac OS X)\n- Python 3.7/3.8/3.9\n\n### Install FeatHub Nightly Build\n\n\nTo install the nightly version of FeatHub and the corresponding extra\nrequirements based on the compute engine you plan to use, run one of the\nfollowing commands:\n\n```bash\n# Run the following command if you plan to run FeatHub using a local process\n$ python -m pip install --upgrade feathub-nightly\n\n# Run the following command if you plan to use Apache Flink cluster\n$ python -m pip install --upgrade \"feathub-nightly[flink]\"\n\n# Run the following command if you plan to use Apache Spark cluster, or to use\n# Spark-supported storage in a local process. \n$ python -m pip install --upgrade \"feathub-nightly[spark]\"\n```\n\n### Quickstart\n\n#### Quickstart using Local Processor\n\nExecute the following command to compute features defined in\n[nyc_taxi.py](python/feathub/examples/nyc_taxi.py) in the given Python process.\n\n```bash\n$ python python/feathub/examples/nyc_taxi.py\n```\n\n#### Quickstart using Flink Processor\n\nYou can use the following quickstart guides to compute features in a Flink\ncluster with different deployment modes:\n\n- [Flink Processor Session Mode Quickstart](docs/content/quickstarts/flink-session-mode.md)\n- [Flink Processor Cli Mode Quickstart](docs/content/quickstarts/flink-cli-mode.md)\n\n#### Quickstart using Spark Processor\n\nYou can use the following quickstart guides to compute features in a standalone\nSpark cluster.\n\n- [Spark Processor Client Mode Quickstart](docs/content/quickstarts/spark-client-mode.md)\n\n### Examples\n\nThe following examples can be run on Google Colab.\n\n| Name                                                         | Description                                                  |\n| ------------------------------------------------------------ | ------------------------------------------------------------ |\n| [NYC Taxi Demo](./docs/examples/nyc_taxi.ipynb)              | Quickstart notebook that demonstrates how to 
define, extract, transform and materialize features with NYC taxi-fare prediction sample data. |\n| [Feature Embedding Demo](./docs/examples/feature_embedding.ipynb) | FeatHub UDF example showing how to define and use feature embedding with a pre-trained Transformer model and hotel review sample data. |\n| [Fraud Detection Demo](./docs/examples/fraud_detection.ipynb) | An example to demonstrate usage with multiple data sources such as user account and transaction data. |\n\nExamples in [this](https://github.com/flink-extended/feathub-examples)\nrepo can be run using docker-compose.\n\n\n## Developer Guide\n\n### Prerequisites\n\nYou need the following to build FeatHub from source:\n- Unix-like operating system (e.g. Linux, Mac OS X)\n- x86_64 architecture\n- Python 3.7/3.8/3.9\n- Java 8\n- Maven \u003e= 3.1.1\n\n### Install Development Dependencies\n\n1. Install the required Python libraries.\n\n```bash\n$ python -m pip install -r python/dev-requirements.txt\n```\n \n2. Start the Docker engine and pull the required images.\n\n```bash\n$ docker image pull redis:latest\n$ docker image pull confluentinc/cp-kafka:5.4.3\n```\n\n3. Increase the open file limit to at least 1024.\n\n```bash\n$ ulimit -n 1024\n```\n\n### Build and Install FeatHub from Source\n\u003c!-- TODO: Add instruction to install \"./python[all]\" after the dependency conflict between PyFlink and PySpark is resolved. --\u003e\n```bash\n$ mvn clean package -DskipTests -f ./java\n$ python -m pip install \"./python[flink]\"\n$ python -m pip install \"./python[spark]\"\n```\n\n### Run Tests\n\nPlease execute the following commands under FeatHub's root folder to run tests.\n\n```bash\n$ mvn clean package -f ./java\n$ pytest --tb=line -W ignore::DeprecationWarning ./python\n```\n\nWhile the commands above cover most of FeatHub's tests, some of FlinkProcessor's\nPython tests, such as tests related to the Parquet format, have been ignored by\ndefault as they require a Hadoop environment to function correctly. 
In order to\nrun these tests, please install Hadoop on your local machine and set up\nenvironment variables as follows before executing the commands above.\n\n```bash\nexport FEATHUB_TEST_HADOOP_CLASSPATH=`hadoop classpath`\n```\n\nYou may refer to [Flink's documentation for the Hive\nconnector](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/table/hive/overview/#supported-hive-versions)\nfor supported Hadoop \u0026 Hive versions.\n\n### Format Code Style\n\nFeatHub uses the following tools to maintain code quality:\n- [Black](https://black.readthedocs.io/en/stable/index.html) to format Python code\n- [flake8](https://flake8.pycqa.org/en/latest/) to check Python code style\n- [mypy](https://mypy.readthedocs.io/en/stable/) to check type annotations\n\nBefore uploading pull requests (PRs) for review, format the code, check the code\nstyle, and check type annotations using the following commands:\n\n```bash\n# Format python code\n$ python -m black ./python\n\n# Check python code style\n$ python -m flake8 --config=python/setup.cfg ./python\n\n# Check python type annotation\n$ python -m mypy --config-file python/setup.cfg ./python\n```\n\n## Roadmap\n\nHere is a list of key features that we plan to support:\n\n- [x] Support all FeatureView transformations with FlinkProcessor\n- [x] Support all FeatureView transformations with LocalProcessor\n- [x] Support all FeatureView transformations with SparkProcessor\n- [x] Support common online and offline feature storages (e.g. Kafka, Redis, Hive, MySQL)\n- [x] Support persisting feature metadata in MySQL\n- [x] Support exporting pre-defined and user-defined feature metrics to Prometheus\n- [ ] Support online transformation with feature service\n- [ ] Support feature metadata exploration (e.g. definition, lineage, metrics) with FeatHub UI\n\n## Contact Us\n\nChinese-speaking users are recommended to join the following DingTalk group for\nquestions and discussion. 
You need to join the \"Apache Flink China\" DingTalk\norganization via\n[this](https://wx-in-i.dingtalk.com/invite-page/weixin.html?bizSource=____source____\u0026corpId=ding82d2a9eeaf9e30ff35c2f4657eb6378f\u0026inviteCode=zmC5CSqct5jEXoi)\nlink first in order to join the following DingTalk Group.\n\n\u003cimg src=\"docs/static/img/dingtalk.png\" width=\"20%\" height=\"auto\"\u003e\n\nEnglish-speaking users can use this [invitation\nlink](https://join.slack.com/t/feathubworkspace/shared_invite/zt-1ik9wk0xe-MoMEotpCEYvRRc3ulpvg2Q)\nto join our [Slack channel](https://feathub.slack.com/) for questions and\ndiscussion.\n\nWe are actively looking for user feedback and contributors from the community.\nPlease feel free to create pull requests and open GitHub issues for feedback and\nfeature requests.\n\nCome join us!\n\n\n## Additional Resources\n- [Documentation](docs/content): Our documentation provides guidance\non compute engines, connectors, expression language, and more. Check it out if\nyou need help getting started or want to learn more about FeatHub.\n- [FeatHub Examples](https://github.com/flink-extended/feathub-examples): This\nrepository provides a wide variety of FeatHub demos that can be executed using\nDocker Compose. It's a great resource if you want to try out FeatHub and see\nwhat it can do.\n- Tech Talks and Articles\n  - DataFun 2023 ([slides](https://www.slideshare.net/DongLin1/feathubdatafun2023pptx))\n  - Flink Forward Asia 2022 ([slides](https://www.slideshare.net/DongLin1/feathub), [video](https://www.bilibili.com/video/BV1714y1E7fQ/?spm_id_from=333.337.search-card.all.click), [article](https://mp.weixin.qq.com/s/ZFKRNaQODe0LwRT1nlwZgA))\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falibaba%2Ffeathub","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falibaba%2Ffeathub","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falibaba%2Ffeathub/lists"}