{"id":23080625,"url":"https://github.com/chen0040/java-glm","last_synced_at":"2025-08-15T22:31:04.532Z","repository":{"id":57719653,"uuid":"89673817","full_name":"chen0040/java-glm","owner":"chen0040","description":"Generalized linear models for regression and classification problems","archived":false,"fork":false,"pushed_at":"2017-05-24T03:09:01.000Z","size":230,"stargazers_count":4,"open_issues_count":3,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-01-29T16:33:57.925Z","etag":null,"topics":["classifier-model","forecasting","generalized-linear-models","glm","java","libsvm-format","regression","regression-models"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chen0040.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-04-28T06:15:00.000Z","updated_at":"2019-06-16T16:05:19.000Z","dependencies_parsed_at":"2022-09-02T13:10:47.846Z","dependency_job_id":null,"html_url":"https://github.com/chen0040/java-glm","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chen0040%2Fjava-glm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chen0040%2Fjava-glm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chen0040%2Fjava-glm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chen0040%2Fjava-glm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chen0040","download_url":"https://codeload.github.com/chen
0040/java-glm/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":229964387,"owners_count":18152034,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classifier-model","forecasting","generalized-linear-models","glm","java","libsvm-format","regression","regression-models"],"created_at":"2024-12-16T13:15:51.408Z","updated_at":"2024-12-16T13:15:53.725Z","avatar_url":"https://github.com/chen0040.png","language":"Java","readme":"# Generalized Linear Model implementation in Java\n\nThis package implements generalized linear models (GLM) in Java.\n\n[![Build Status](https://travis-ci.org/chen0040/java-glm.svg?branch=master)](https://travis-ci.org/chen0040/java-glm) [![Coverage Status](https://coveralls.io/repos/github/chen0040/java-glm/badge.svg?branch=master)](https://coveralls.io/github/chen0040/java-glm?branch=master)\n\n![GLM](glm.png)\n\n# Install\n\nAdd the following to the dependencies section of your pom file:\n\n```xml\n\u003cdependency\u003e\n  \u003cgroupId\u003ecom.github.chen0040\u003c/groupId\u003e\n  \u003cartifactId\u003ejava-glm\u003c/artifactId\u003e\n  \u003cversion\u003e1.0.6\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n# Features\n\nThe current implementation of GLM supports as many distribution families as the glm package in R:\n\n* Normal\n* Exponential\n* Gamma\n* InverseGaussian\n* Poisson\n* Bernoulli\n* Binomial\n* Categorical\n* Multinomial\n\nFor the solvers, the current implementation of GLM supports a number of variants of the iteratively re-weighted least squares (IRLS) estimation algorithm:\n\n* IRLS\n* IRLS with QR factorization\n* IRLS with SVD factorization\n\n# Usage\n\n## Step 1: Create and train the GLM against the training data\n\nSuppose you want to create a logistic regression model from GLM and train it against a data frame:\n\n```java\nimport com.github.chen0040.glm.solvers.Glm;\nimport com.github.chen0040.glm.enums.GlmSolverType;\n\nDataFrame trainingData = loadTrainingData();\n\nGlm glm = Glm.logistic();\nglm.setSolverType(GlmSolverType.GlmIrls);\nglm.fit(trainingData);\n```\n\nThe \"trainingData\" is a data frame (please refer to this [link](https://github.com/chen0040/java-data-frame) for how to create a data frame from a file or from scratch).\n\nThe line \"Glm.logistic()\" creates the logistic regression model; this can easily be changed to create other regression models (for example, calling \"Glm.linear()\" creates a linear regression model).\n\nThe line \"glm.fit(..)\" performs the GLM training.\n\n## Step 2: Use the trained regression model to predict on new data\n\nThe trained GLM can then be run on the testing data; below is a Java code example for logistic regression:\n\n```java\nDataFrame testingData = loadTestingData();\nfor(int i = 0; i \u003c testingData.rowCount(); ++i){\n    boolean predicted = glm.transform(testingData.row(i)) \u003e 0.5;\n    boolean actual = testingData.row(i).target() \u003e 0.5;\n    System.out.println(\"predicted(Irls): \" + predicted + \"\\texpected: \" + actual);\n}\n```\n\nThe \"testingData\" is a data frame.\n\nThe line \"glm.transform(..)\" performs the regression.\n\n# Sample code\n\n### Sample code for linear regression\n\nThe sample code below shows a linear regression example:\n\n```java\nDataQuery.DataFrameQueryBuilder schema = DataQuery.blank()\n      .newInput(\"x1\")\n      .newInput(\"x2\")\n      .newOutput(\"y\")\n      .end();\n\n// y = 4 + 0.5 * x1 + 0.2 * x2\nSampler.DataSampleBuilder sampler = new Sampler()\n      .forColumn(\"x1\").generate((name, index) -\u003e randn() * 0.3 + index)\n      .forColumn(\"x2\").generate((name, index) -\u003e randn() * 0.3 + index * index)\n      .forColumn(\"y\").generate((name, index) -\u003e 4 + 0.5 * index + 0.2 * index * index + randn() * 0.3)\n      .end();\n\nDataFrame trainingData = schema.build();\n\ntrainingData = sampler.sample(trainingData, 200);\n\nSystem.out.println(trainingData.head(10));\n\nDataFrame crossValidationData = schema.build();\n\ncrossValidationData = sampler.sample(crossValidationData, 40);\n\nGlm glm = Glm.linear();\nglm.setSolverType(GlmSolverType.GlmIrlsQr);\nglm.fit(trainingData);\n\nfor(int i = 0; i \u003c crossValidationData.rowCount(); ++i){\n    double predicted = glm.transform(crossValidationData.row(i));\n    double actual = crossValidationData.row(i).target();\n    System.out.println(\"predicted: \" + predicted + \"\\texpected: \" + actual);\n}\n\nSystem.out.println(\"Coefficients: \" + glm.getCoefficients());\n```\n\n### Sample code for logistic regression\n\nThe sample code below performs binary classification using logistic regression:\n\n```java\nInputStream inputStream = new FileInputStream(\"heart_scale.txt\");\nDataFrame dataFrame = DataQuery.libsvm().from(inputStream).build();\n\nfor(int i = 0; i \u003c dataFrame.rowCount(); ++i){\n    DataRow row = dataFrame.row(i);\n    String targetColumn = row.getTargetColumnNames().get(0);\n    row.setTargetCell(targetColumn, row.getTargetCell(targetColumn) == -1 ? 
0 : 1); // change output from (-1, +1) to (0, 1)\n}\n\nTupleTwo\u003cDataFrame, DataFrame\u003e miniFrames = dataFrame.shuffle().split(0.9);\nDataFrame trainingData = miniFrames._1();\nDataFrame crossValidationData = miniFrames._2();\n\nGlm algorithm = Glm.logistic();\nalgorithm.setSolverType(GlmSolverType.GlmIrlsQr);\nalgorithm.fit(trainingData);\n\ndouble threshold = 1.0;\nfor(int i = 0; i \u003c trainingData.rowCount(); ++i){\n    double prob = algorithm.transform(trainingData.row(i));\n    if(trainingData.row(i).target() == 1 \u0026\u0026 prob \u003c threshold){\n        threshold = prob;\n    }\n}\nlogger.info(\"threshold: {}\", threshold);\n\nBinaryClassifierEvaluator evaluator = new BinaryClassifierEvaluator();\n\nfor(int i = 0; i \u003c crossValidationData.rowCount(); ++i){\n    double prob = algorithm.transform(crossValidationData.row(i));\n    boolean predicted = prob \u003e 0.5;\n    boolean actual = crossValidationData.row(i).target() \u003e 0.5;\n    evaluator.evaluate(actual, predicted);\n    System.out.println(\"probability of positive: \" + prob);\n    System.out.println(\"predicted: \" + predicted + \"\\tactual: \" + actual);\n}\n\nevaluator.report();\n```\n\n### Sample code for multi-class classification\n\nThe sample code below performs multi-class classification, using the logistic regression model as the generator of the underlying binary classifiers:\n\n```java\nInputStream irisStream = FileUtils.getResource(\"iris.data\");\nDataFrame irisData = DataQuery.csv(\",\")\n      .from(irisStream)\n      .selectColumn(0).asNumeric().asInput(\"Sepal Length\")\n      .selectColumn(1).asNumeric().asInput(\"Sepal Width\")\n      .selectColumn(2).asNumeric().asInput(\"Petal Length\")\n      .selectColumn(3).asNumeric().asInput(\"Petal Width\")\n      .selectColumn(4).asCategory().asOutput(\"Iris Type\")\n      .build();\n\nTupleTwo\u003cDataFrame, DataFrame\u003e parts = irisData.shuffle().split(0.9);\n\nDataFrame trainingData = parts._1();\nDataFrame crossValidationData = parts._2();\n\nSystem.out.println(crossValidationData.head(10));\n\nOneVsOneGlmClassifier multiClassClassifier = Glm.oneVsOne(Glm::logistic);\nmultiClassClassifier.fit(trainingData);\n\nClassifierEvaluator evaluator = new ClassifierEvaluator();\n\nfor(int i = 0; i \u003c crossValidationData.rowCount(); ++i) {\n    String predicted = multiClassClassifier.classify(crossValidationData.row(i));\n    String actual = crossValidationData.row(i).categoricalTarget();\n    System.out.println(\"predicted: \" + predicted + \"\\tactual: \" + actual);\n    evaluator.evaluate(actual, predicted);\n}\n\nevaluator.report();\n```\n\n# Background on GLM\n\n### Introduction\n\nA GLM is a generalized linear model for the exponential family of distributions, with model b = g(a), where g(a) is the inverse link function.\n\nTherefore, for a regression characterized by the inverse link function g(a), the regression problem can be formulated as looking for the model coefficient vector x in\n\n```math\ng(A * x) = b + e\n```\n\nand the objective is to find x that solves\n\n```math\nmin (g(A * x) - b).transpose * W * (g(A * x) - b)\n```\n\nSuppose we assume that e consists of uncorrelated noise variables with identical variance; then W = sigma^(-2) * I, and the objective\n\n```math\nmin (g(A * x) - b).transpose * W * (g(A * x) - b)\n```\n\nreduces to the OLS form:\n\n```math\nmin || g(A * x) - b ||^2\n```\n\n### Iteratively Re-weighted Least Squares estimation (IRLS)\n\nIn regression, we try to find a set of model coefficients x such that:\n\n```math\nA * x = b + e\n```\n\nHere A is known as the model matrix, b as the response vector, and e as the vector of error terms.\n\nIn OLS (Ordinary Least Squares), we assume that the variance-covariance matrix of e is\n\n```math\nV(e) = sigma^2 * W\n```\n\nwhere W is a symmetric positive definite diagonal matrix and sigma is the standard error of e.\n\nIn OLS, the objective is to find x_bar such that e.transpose * W * e is minimized (note that since W is positive definite, e.transpose * W * e is always positive). In other words, we are looking for x_bar such that (A * x_bar - b).transpose * W * (A * x_bar - b) is minimized.\n\nLet\n\n```math\ny = (A * x - b).transpose * W * (A * x - b)\n```\n\nDifferentiating y with respect to x, we have\n\n```math\ndy / dx = 2 * A.transpose * W * (A * x - b)\n```\n\nTo find min y, set dy / dx = 0 at x = x_bar; this gives\n\n```math\nA.transpose * W * (A * x_bar - b) = 0\n```\n\nRearranging, we have\n\n```math\nA.transpose * W * A * x_bar = A.transpose * W * b\n```\n\nMultiplying both sides by (A.transpose * W * A).inverse, we have\n\n```math\nx_bar = (A.transpose * W * A).inverse * A.transpose * W * b\n```\n\nThese normal equations are commonly solved using IRLS; the implementation of Glm here is based on iteratively re-weighted least squares (IRLS) estimation.\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchen0040%2Fjava-glm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchen0040%2Fjava-glm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchen0040%2Fjava-glm/lists"}