{"id":13933021,"url":"https://github.com/TalkingData/Fregata","last_synced_at":"2025-07-19T16:32:35.623Z","repository":{"id":43801156,"uuid":"68701822","full_name":"TalkingData/Fregata","owner":"TalkingData","description":"A light weight, super fast, large scale machine learning library on spark .","archived":false,"fork":false,"pushed_at":"2018-03-23T06:23:19.000Z","size":200,"stargazers_count":680,"open_issues_count":6,"forks_count":187,"subscribers_count":84,"default_branch":"master","last_synced_at":"2024-08-08T21:19:53.475Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TalkingData.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-09-20T10:39:55.000Z","updated_at":"2024-07-17T02:25:16.000Z","dependencies_parsed_at":"2022-09-26T21:51:25.545Z","dependency_job_id":null,"html_url":"https://github.com/TalkingData/Fregata","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TalkingData%2FFregata","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TalkingData%2FFregata/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TalkingData%2FFregata/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TalkingData%2FFregata/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TalkingData","download_url":"https://codeload.github.com/TalkingData/Fregata/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github
.com","kind":"github","repositories_count":226643914,"owners_count":17662968,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-07T21:01:29.327Z","updated_at":"2024-11-26T23:30:53.586Z","avatar_url":"https://github.com/TalkingData.png","language":"Scala","funding_links":[],"categories":["Scala","人工智能"],"sub_categories":[],"readme":"Fregata: Machine Learning\n==================================\n\n[![GitHub license](http://og41w30k3.bkt.clouddn.com/apache2.svg)](./LICENSE)\n\n- [Fregata](http://talkingdata.com) is a lightweight, super fast, large-scale machine learning library based on [Apache Spark](http://spark.apache.org/), and it provides high-level APIs in Scala.\n\n- More accurate: For various problems, Fregata can achieve higher accuracy than MLlib.\n\n- Higher speed: For Generalized Linear Models, Fregata often converges in one data epoch. For a 1 billion X 1 billion data set, Fregata can train a Generalized Linear Model in 1 minute with memory caching or 10 minutes without it. Usually, Fregata is 10-100 times faster than MLlib.\n\n- Parameter Free: Fregata uses [GSA](http://arxiv.org/abs/1611.03608) SGD optimization, which doesn't require learning rate tuning, because we found a way to calculate an appropriate learning rate during training. When confronted with very high-dimensional problems, Fregata dynamically calculates the remaining memory to determine the sparseness of the output, balancing accuracy and efficiency automatically. 
Both features enable Fregata to be treated as a standard module in data processing for different problems.\n\n- Lighter weight: Fregata uses only Spark's standard API, which allows it to be integrated into most businesses' data processing flows on Spark quickly and seamlessly.\n\n## Architecture\nThis documentation is about Fregata version 0.1\n\n- core : mainly implements stand-alone algorithms based on GSA, including **Classification**, \u003cfont color=#808080\u003e **Regression**\u003c/font\u003e and \u003cfont color=#808080\u003e  **Clustering** \u003c/font\u003e\n  - Classification: supports both binary and multi-class classification\n  - Regression: to be released later\n  - Clustering: to be released later\n- spark : mainly implements large-scale machine learning algorithms based on **spark** by wrapping **core.jar** and supplies the corresponding algorithms\n\n**Fregata supports Spark 1.x and 2.x with Scala 2.10 and Scala 2.11.**\n\n## Algorithms\n- [Trillion LR](./docs/largescale_lr.md)\n- [Trillion SoftMax](./docs/largescale_softmax.md)\n- [Logistic Regression](./docs/logistic_regression.md)\n- [Combine Features Logistic Regression](./docs/clr.md)\n- [SoftMax](./docs/softmax.md)\n- [RDT](./docs/rdt.md)\n\n## Installation\n\nFregata can be obtained in two ways, through Maven or SBT:\n\n- Maven's pom.xml\n\n```xml\n    \u003cdependency\u003e\n        \u003cgroupId\u003ecom.talkingdata.fregata\u003c/groupId\u003e\n        \u003cartifactId\u003ecore\u003c/artifactId\u003e\n        \u003cversion\u003e0.0.3\u003c/version\u003e\n    \u003c/dependency\u003e\n    \u003cdependency\u003e\n        \u003cgroupId\u003ecom.talkingdata.fregata\u003c/groupId\u003e\n        \u003cartifactId\u003espark\u003c/artifactId\u003e\n        \u003cversion\u003e0.0.3\u003c/version\u003e\n    \u003c/dependency\u003e\n```\n\n- SBT's build.sbt\n\n```scala\n    // if you deployed Fregata to your local Maven repository, also add\n    // resolvers += Resolver.mavenLocal\n    libraryDependencies += \"com.talkingdata.fregata\" % 
\"core\" % \"0.0.3\"\n    libraryDependencies += \"com.talkingdata.fregata\" % \"spark\" % \"0.0.3\"\n```\n\nTo deploy Fregata manually to your local Maven repository, run:\n```\ngit clone https://github.com/TalkingData/Fregata.git\ncd Fregata\nmvn clean package install\n```\n\n## Quick Start\nAssuming you are familiar with Spark, the example below shows how to use Fregata's **Logistic Regression**. Experimental data sets can be obtained from [LIBSVM Data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/)\n\n- adding Fregata to the project by Maven or SBT, as described in the **Installation** section\n- importing packages\n\n```scala\n\timport fregata.spark.data.LibSvmReader\n\timport fregata.spark.metrics.classification.{AreaUnderRoc, Accuracy}\n\timport fregata.spark.model.classification.LogisticRegression\n\timport org.apache.spark.{SparkConf, SparkContext}\n```\n\n- loading the training and test data with Fregata's LibSvmReader API\n\n```scala\n    val (_, trainData)  = LibSvmReader.read(sc, trainPath, numFeatures.toInt)\n    val (_, testData)  = LibSvmReader.read(sc, testPath, numFeatures.toInt)\n```\n\n- building the Logistic Regression model from the training data\n\n```scala\n    val model = LogisticRegression.run(trainData)\n```\n\n- predicting the scores of instances\n\n```scala\n    val pd = model.classPredict(testData)\n```\n\n- evaluating the quality of the model's predictions by AUC or other metrics\n\n```scala\n    val auc = AreaUnderRoc.of( pd.map{\n      case ((x,l),(p,c)) =\u003e\n        p -\u003e l\n    })\n```\n\n## Input Data Format\nFregata's training API requires *RDD[(fregata.Vector, fregata.Num)]*; the prediction API accepts the same type, or *RDD[fregata.Vector]* without labels\n\n```scala\n\timport breeze.linalg.{Vector =\u003e BVector , SparseVector =\u003e BSparseVector , DenseVector =\u003e BDenseVector}\n\timport fregata.vector.{SparseVector =\u003e VSparseVector }\n\n\tpackage object fregata {\n\t  type Num = Double\n\t  type Vector = BVector[Num]\n\t  type 
SparseVector = BSparseVector[Num]\n\t  type SparseVector2 = VSparseVector[Num]\n\t  type DenseVector = BDenseVector[Num]\n\t  def zeros(n:Int) = BDenseVector.zeros[Num](n)\n\t  def norm(x:Vector) = breeze.linalg.norm(x,2.0)\n\t  def asNum(v:Double) : Num = v\n\t}\n\n```\n\n- if the data is in LibSVM format, then *Fregata's LibSvmReader.read() API* can be used directly\n\n```scala\n\t// sc is the SparkContext\n\t// path is the location of the input data on HDFS\n\t// numFeatures is the number of features of a single instance\n\t// minPartition is the minimum number of partitions of the returned RDD over the input data\n\tread(sc:SparkContext, path:String, numFeatures:Int=-1, minPartition:Int=-1):(Int, RDD[(fregata.Vector, fregata.Num)])\n```\n\n- otherwise the vectors have to be constructed manually\n\n\t- Using SparseVector\n\n\t```scala\n\t\t// indices is a 0-based Array holding the positions of the nonzero features\n\t\t// values  is an Array storing the values corresponding to indices\n\t\t// length  is the total number of features of each instance\n\t\t// label   is the instance's label\n\n\t\t// input data with label\n\t\tsc.textFile(input).map{\n\t\t\tval indices = ...\n\t\t\tval values   = ...\n\t\t\tval label    = ...\n\t\t\t...\n\t\t\t(new SparseVector(indices, values, length).asInstanceOf[Vector], asNum(label))\n\t\t}\n\n\t\t// input data without label (only for the prediction API)\n\t\tsc.textFile(input).map{\n\t\t\tval indices = ...\n\t\t\tval values   = ...\n\t\t\t...\n\t\t\tnew SparseVector(indices, values, length).asInstanceOf[Vector]\n\t\t}\n\t```\n\t- Using DenseVector\n\n\t```scala\n\t\t// datas   holds the value of each feature\n\t\t// label   is the instance's label\n\n\t\t// input data with label\n\t\tsc.textFile(input).map{\n\t\t\tval datas = ...\n\t\t\tval label = ...\n\t\t\t...\n\t\t\t(new DenseVector(datas).asInstanceOf[Vector], asNum(label))\n\t\t}\n\n\t\t// input data without label (only for the prediction API)\n\t\tsc.textFile(input).map{\n\t\t\tval datas = 
...\n\t\t\t...\n\t\t\tnew DenseVector(datas).asInstanceOf[Vector]\n\t\t}\n\t```\n\n## Mailing List:\n   - yongjun.tian@tendcloud.com\n   - haijun.liu@tendcloud.com\n   - xiatian.zhang@tendcloud.com\n   - fan.yao@tendcloud.com\n\n## Contributors:\n\nContributed by [TalkingData](https://github.com/TalkingData/Fregata/contributors).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTalkingData%2FFregata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FTalkingData%2FFregata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTalkingData%2FFregata/lists"}