{"id":15021503,"url":"https://github.com/spotify/featran","last_synced_at":"2025-05-15T03:08:42.651Z","repository":{"id":20698751,"uuid":"90654004","full_name":"spotify/featran","owner":"spotify","description":"A Scala feature transformation library for data science and machine learning","archived":false,"fork":false,"pushed_at":"2025-02-07T19:39:26.000Z","size":3055,"stargazers_count":467,"open_issues_count":11,"forks_count":68,"subscribers_count":28,"default_branch":"main","last_synced_at":"2025-05-11T10:57:00.787Z","etag":null,"topics":["algebird","breeze","data","flink","ml","scala","scalding","scio","spark","tensorflow","xgboost"],"latest_commit_sha":null,"homepage":"https://spotify.github.io/featran","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/spotify.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-05-08T17:20:27.000Z","updated_at":"2025-04-03T07:13:19.000Z","dependencies_parsed_at":"2025-02-28T17:19:40.442Z","dependency_job_id":null,"html_url":"https://github.com/spotify/featran","commit_stats":{"total_commits":798,"total_committers":35,"mean_commits":22.8,"dds":0.6541353383458647,"last_synced_commit":"6359cc941c95c3d4574f8a36c07dd7664d3b1bd0"},"previous_names":[],"tags_count":39,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spotify%2Ffeatran","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spotify%2Ffeatran/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spotify
%2Ffeatran/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spotify%2Ffeatran/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/spotify","download_url":"https://codeload.github.com/spotify/featran/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254264771,"owners_count":22041794,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algebird","breeze","data","flink","ml","scala","scalding","scio","spark","tensorflow","xgboost"],"created_at":"2024-09-24T19:56:39.158Z","updated_at":"2025-05-15T03:08:37.624Z","avatar_url":"https://github.com/spotify.png","language":"Scala","funding_links":[],"categories":["数据科学"],"sub_categories":[],"readme":"featran\n=======\n\n[![Build Status](https://img.shields.io/github/actions/workflow/status/spotify/featran/ci.yml?branch=main)](https://github.com/spotify/featran/actions?query=workflow%3Aci)\n[![codecov.io](https://codecov.io/github/spotify/featran/coverage.svg?branch=master)](https://codecov.io/github/spotify/featran?branch=master)\n[![Maven Central](https://img.shields.io/maven-central/v/com.spotify/featran-core_2.12.svg)](https://maven-badges.herokuapp.com/maven-central/com.spotify/featran-core_2.12)\n[![Scaladoc](https://img.shields.io/badge/scaladoc-latest-blue.svg)](https://spotify.github.io/featran/api/com/spotify/featran/index.html)\n[![Scala Steward 
badge](https://img.shields.io/badge/Scala_Steward-helping-brightgreen.svg?style=flat\u0026logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAA4AAAAQCAMAAAARSr4IAAAAVFBMVEUAAACHjojlOy5NWlrKzcYRKjGFjIbp293YycuLa3pYY2LSqql4f3pCUFTgSjNodYRmcXUsPD/NTTbjRS+2jomhgnzNc223cGvZS0HaSD0XLjbaSjElhIr+AAAAAXRSTlMAQObYZgAAAHlJREFUCNdNyosOwyAIhWHAQS1Vt7a77/3fcxxdmv0xwmckutAR1nkm4ggbyEcg/wWmlGLDAA3oL50xi6fk5ffZ3E2E3QfZDCcCN2YtbEWZt+Drc6u6rlqv7Uk0LdKqqr5rk2UCRXOk0vmQKGfc94nOJyQjouF9H/wCc9gECEYfONoAAAAASUVORK5CYII=)](https://scala-steward.org)\n\nFeatran, also known as Featran77 or F77 (get it?), is a Scala library for feature transformation. It aims to simplify the time consuming task of feature engineering in data science and machine learning processes. It supports various collection types for feature extraction and output formats for feature representation.\n\n# Introduction\n\nMost feature transformation logic requires two steps, one global aggregation to summarize data followed by one element-wise mapping to transform them. For example:\n\n- Min-Max Scaler\n  - Aggregation: global min \u0026 max\n  - Mapping: scale each value to `[min, max]`\n- One-Hot Encoder\n  - Aggregation: distinct labels\n  - Mapping: convert each label to a binary vector\n\nWe can implement this in a naive way using `reduce` and `map`.\n\n```scala\ncase class Point(score: Double, label: String)\nval data = Seq(Point(1.0, \"a\"), Point(2.0, \"b\"), Point(3.0, \"c\"))\n\nval a = data\n  .map(p =\u003e (p.score, p.score, Set(p.label)))\n  .reduce((x, y) =\u003e (math.min(x._1, y._1), math.max(x._2, y._2), x._3 ++ y._3))\n\nval features = data.map { p =\u003e\n  (p.score - a._1) / (a._2 - a._1) :: a._3.toList.sorted.map(s =\u003e if (s == p.label) 1.0 else 0.0)\n}\n```\n\nBut this is unmanageable for complex feature sets. 
The above logic can be easily expressed in Featran.\n\n```scala\nimport com.spotify.featran._\nimport com.spotify.featran.transformers._\n\nval fs = FeatureSpec.of[Point]\n  .required(_.score)(MinMaxScaler(\"min-max\"))\n  .required(_.label)(OneHotEncoder(\"one-hot\"))\n\nval fe = fs.extract(data)\nval names = fe.featureNames\nval features = fe.featureValues[Seq[Double]]\n```\n\nFeatran also supports these additional features.\n\n- Extract from Scala collections, [Flink](http://flink.apache.org/) `DataSet`s, [Scalding](https://github.com/twitter/scalding) `TypedPipe`s, [Scio](https://github.com/spotify/scio) `SCollection`s and [Spark](https://spark.apache.org/) `RDD`s\n- Output as Scala collections, [Breeze](https://github.com/scalanlp/breeze) dense and sparse vectors,  [TensorFlow](https://www.tensorflow.org/) `Example` Protobuf, [XGBoost](https://github.com/dmlc/xgboost) `LabeledPoint` and [NumPy](http://www.numpy.org/) `.npy` file\n- Import aggregation from a previous extraction for training, validation and test sets\n- Compose feature specifications and separate outputs\n\nSee [Examples](https://spotify.github.io/featran/examples/Examples.scala.html) ([source](https://github.com/spotify/featran/blob/master/examples/src/main/scala/Examples.scala)) for detailed examples. 
See [transformers](https://spotify.github.io/featran/api/index.html#com.spotify.featran.transformers.package) package for a complete list of available feature transformers.\n\nSee [ScalaDocs](https://spotify.github.io/featran) for current API documentation.\n\n# Presentations\n\n- [Featran - Type safe and generic feature transformation in Scala](https://www.lyh.me/slides/featran.html) - NABD Conf Palo Alto 2017 talk\n\n# Artifacts\n\nFeatran includes the following artifacts:\n\n- `featran-core` - core library, support for extraction from Scala collections and output as Scala collections, Breeze dense and sparse vectors\n- `featran-java` - Java interface, see [JavaExample.java](https://github.com/spotify/featran/blob/master/java/src/test/java/com/spotify/featran/java/examples/JavaExample.java)\n- `featran-flink` - support for extraction from Flink `DataSet`\n- `featran-scalding` - support for extraction from Scalding `TypedPipe`\n- `featran-scio` - support for extraction from Scio `SCollection`\n- `featran-spark` - support for extraction from Spark `RDD`\n- `featran-tensorflow` - support for output as TensorFlow `Example` Protobuf\n- `featran-xgboost` - support for output as XGBoost `LabeledPoint`\n- `featran-numpy` - support for output as NumPy `.npy` file\n\n# License\n\nCopyright 2016-2017 Spotify AB.\n\nLicensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspotify%2Ffeatran","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fspotify%2Ffeatran","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspotify%2Ffeatran/lists"}