{"id":13467172,"url":"https://github.com/mbok/elasticsearch-linear-regression","last_synced_at":"2025-03-26T01:30:25.560Z","repository":{"id":91966521,"uuid":"84003083","full_name":"mbok/elasticsearch-linear-regression","owner":"mbok","description":"A machine learning plugin for Elasticsearch providing aggregations to compute multiple linear regression on search results in real-time for predictive analytics.","archived":false,"fork":false,"pushed_at":"2018-10-07T20:31:19.000Z","size":242,"stargazers_count":64,"open_issues_count":5,"forks_count":21,"subscribers_count":9,"default_branch":"master","last_synced_at":"2024-08-01T15:07:56.957Z","etag":null,"topics":["elasticsearch","elasticsearch-plugin","linear-regression","machine-learning","predictive-analytics"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mbok.png","metadata":{"files":{"readme":"README.adoc","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2017-03-05T21:21:08.000Z","updated_at":"2024-04-14T09:39:07.000Z","dependencies_parsed_at":"2024-01-16T06:09:13.082Z","dependency_job_id":"836720e2-b553-4fad-a08c-5c2946168bfd","html_url":"https://github.com/mbok/elasticsearch-linear-regression","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbok%2Felasticsearch-linear-regression","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbok%2Felasticsearch-linear-regression/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbok%2Felasti
csearch-linear-regression/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbok%2Felasticsearch-linear-regression/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mbok","download_url":"https://codeload.github.com/mbok/elasticsearch-linear-regression/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":222100827,"owners_count":16931670,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["elasticsearch","elasticsearch-plugin","linear-regression","machine-learning","predictive-analytics"],"created_at":"2024-07-31T15:00:53.801Z","updated_at":"2024-10-29T19:32:14.589Z","avatar_url":"https://github.com/mbok.png","language":"Java","funding_links":[],"categories":["Java"],"sub_categories":[],"readme":"# A multiple linear regression plugin for Elasticsearch\n\nimage:https://travis-ci.org/mbok/elasticsearch-linear-regression.svg?branch=master[\"Build Status\", link=\"https://travis-ci.org/mbok/elasticsearch-linear-regression\"]\n\nThe linear regression model has been a mainstay of statistics and machine learning\nin the past decades and remains one of the most important tools in the context of supervised learning algorithms.\nIt's a powerful technique for predicting the value of a dependent variable `y` (called the response variable) given the values of other independent\nvariables `x = (x~1~, x~2~,...,x~C~)` (called explanatory variables) based on a training data set. 
The prediction of the response variable for given input values\n of the explanatory variables is described by the linear hypothesis function ``h(x)`` with\n\nimage:http://latex.codecogs.com/gif.latex?h(x)%20=%20\\theta_{0}%20+%20\\sum_{j=1}^C%20\\theta_{j}%20x_{j}[]\n\nThis plugin extends Elasticsearch's query engine with two new aggregations, which use the index data during search\nas training data for estimating a linear regression model. The estimated model exposes information such as a predicted value for the target variable,\nanomaly detection and measures of the accuracy, or rather predictiveness, of the model.\nEstimation uses the https://en.wikipedia.org/wiki/Ordinary_least_squares[OLS]\n(ordinary least-squares) approach over the search result set.\n\n\n## Aggregations\nBoth aggregations are numeric aggregations that estimate the linear regression coefficients\nimage:http://latex.codecogs.com/gif.latex?\\theta_0,%20\\theta_1,%20\\theta_2,.%20.%20.,%20\\theta_C%20[]\nbased on the documents returned by a search query. 
Each search result\ndocument is treated as an observation and its numerical fields as variables (explanatory and response)\nfor the linear model.\n\n=== Aggregation for prediction\n\nThe `linreg_predict` aggregation computes the predicted outcome for the response variable\nfrom the estimated model, given a set of input values for the explanatory variables.\n\n[horizontal]\n`value`:: The predicted value for the response variable computed using the estimated linear hypothesis\n          function ``h(x)`` with `x` given by `C` input values for the explanatory variables\n          `x = [x~1~, x~2~,...,x~C~]`.\n`coefficients`:: Estimated coefficients\n  image:http://latex.codecogs.com/gif.latex?\\theta_0,%20\\theta_1,%20\\theta_2,%20\\theta_3,.%20.%20.,%20\\theta_C%20[]\n    of the linear hypothesis function ``h(x)``.\n\nAssuming the data consists of documents representing sold houses with their prices and features\n such as number of bedrooms, number of bathrooms and size, we can predict or validate\n the price for our house in Morro Bay with 2000 square feet, 4 bedrooms and 2 bathrooms with:\n\n[source,js]\n--------------------------------------------------\n/houses/_search?size=0\n{\n    \"query\": {\n        \"match\" : {\n            \"location\" : \"Morro Bay\"\n        }\n    },\n    \"aggs\": {\n        \"house_prices\": {\n            \"linreg_predict\": {\n                \"fields\": [\"size\", \"bedrooms\", \"bathrooms\", \"price\"],\n                \"inputs\": [2000, 4, 2]\n            }\n        }\n    }\n}\n--------------------------------------------------\n\n\u003c1\u003e `fields` instructs this aggregation to use the house feature fields `size`, `bedrooms` and `bathrooms`\n    as explanatory variables and the `price` field as the response variable of the linear regression model. 
The size of the `fields` array is `C + 1`\n    with `C` entries for the explanatory variables and one entry for the response variable.\n\u003c2\u003e `inputs` passes the feature values of the house whose price we want to predict. The numeric input values\n    have to be passed in array form in the order corresponding to the features listed in the `fields` attribute.\n    The size of the `inputs` array is `C`, the number of explanatory variables.\n\nThe response might look as follows, with an estimated price of around $581,458 for our house:\n\n[source,js]\n--------------------------------------------------\n{\n    ...\n    \"aggregations\": {\n        \"house_prices\": {\n            \"value\": 581458.3087492324,\n            \"coefficients\": [\n                227990.63952712028,\n                248.92285661317254,\n                -68297.7720278421,\n                64406.52205356777\n            ]\n        }\n    }\n}\n--------------------------------------------------\n\n\n=== Aggregation for linear regression statistics\n\nThe `linreg_stats` aggregation computes statistics for the estimated linear regression model.\n\n[horizontal]\n`rss`:: Residual sum of squares as a measure of the discrepancy between the data and the estimated model.\n        The lower the `rss` number, the smaller the error of the prediction, and the better the model.\n`mse`:: Mean squared error, that is, `rss` divided by the number of documents used for model estimation.\n`r2`:: Coefficient of determination, denoted R², as a statistical measure of how well the regression model\n        approximates the real data points. 
R² ranges from 0 to 1, where 1 indicates that the estimated hypothesis function perfectly fits the data.\n        (Available since 5.5.1.2)\n`coefficients`:: Estimated coefficients\n  image:http://latex.codecogs.com/gif.latex?\\theta_0,%20\\theta_1,%20\\theta_2,%20\\theta_3,.%20.%20.,%20\\theta_C%20[]\n    of the linear hypothesis function ``h(x)``.\n\nAssuming the data consists of documents representing house prices, we can compute statistics for\nthe estimated best-fitting linear hypothesis function, which predicts house prices based on number of\nbedrooms, bathrooms and size, with:\n[source,js]\n--------------------------------------------------\n/houses/_search?size=0\n{\n    \"aggs\": {\n        \"house_prices\": {\n            \"linreg_stats\": {\n                \"fields\": [\"bedrooms\", \"bathrooms\", \"size\", \"price\"]\n            }\n        }\n    }\n}\n--------------------------------------------------\n\nThe aggregation type is `linreg_stats` and the `fields` setting defines the set of fields (as an array)\nto be used for building the linear model. All fields except the last are the explanatory variables,\nand the last field is the response variable. 
The above request returns the following response:\n\n[source,js]\n--------------------------------------------------\n{\n    ...\n    \"aggregations\": {\n        \"house_prices\": {\n            \"rss\": 49523788338938.75,\n            \"mse\": 63410740510.80505,\n            \"r2\": 0.4788369924642064,\n            \"coefficients\": [\n                47553.1873756476,\n                -100544.07258945837,\n                45981.15827544975,\n                309.6013051477474\n            ]\n        }\n    }\n}\n--------------------------------------------------\n\n=== Data conditions\nDue to algorithmic constraints, both aggregations return an empty response if\n\n* the number of documents in the search result is less than or equal to the number of specified explanatory variables,\n* the values of the explanatory variables in the search result set are linearly dependent (that is,\n  a column can be written as a linear combination of the other columns).\n\n\n## Algorithm\nThis implementation is based on a new parallel, single-pass OLS estimation algorithm for multiple linear regression\n(not yet published). 
By aggregating\nover the data only once and in parallel, the algorithm is ideally suited for large-scale, distributed data sets and\nin this respect surpasses the majority of existing multi-pass analytical OLS estimators or iterative optimization algorithms.\n\nThe overall complexity of the implemented algorithm to estimate the regression coefficients is `O(N C² + C³)`, where\n`N` denotes the size of the training data set (the number of documents in the search result set) and `C` the number\nof specified explanatory variables (fields).\n\n## Installation\n\n### Elasticsearch 5.x\nTo install this plugin, first choose the plugin version in the compatibility\nmatrix that matches your Elasticsearch version, then use its download link in the following command.\n\n[source]\n----\n./bin/elasticsearch-plugin install https://github.com/scaleborn/elasticsearch-linear-regression/releases/download/5.5.2.1/elasticsearch-linear-regression-5.5.2.1.zip\n----\nThe plugin will be installed under the name \"linear-regression\".\nDo not forget to restart the node after installing.\n\n.Compatibility matrix\n[frame=\"all\"]\n|===\n| Plugin version | Elasticsearch version | Release date\n| https://github.com/scaleborn/elasticsearch-linear-regression/releases/download/5.5.2.1/elasticsearch-linear-regression-5.5.2.1.zip[5.5.2.1]        | 5.5.2 | Aug  29, 2017\n| https://github.com/scaleborn/elasticsearch-linear-regression/releases/download/5.5.1.2/elasticsearch-linear-regression-5.5.1.2.zip[5.5.1.2]        | 5.5.1 | Aug  29, 2017\n| https://github.com/scaleborn/elasticsearch-linear-regression/releases/download/5.5.1.1/elasticsearch-linear-regression-5.5.1.1.zip[5.5.1.1]        | 5.5.1 | Jul  27, 2017\n| https://github.com/scaleborn/elasticsearch-linear-regression/releases/download/5.5.0.1/elasticsearch-linear-regression-5.5.0.1.zip[5.5.0.1]        | 5.5.0 | Jul  18, 2017\n| 
https://github.com/scaleborn/elasticsearch-linear-regression/releases/download/5.3.0.2/elasticsearch-linear-regression-5.3.0.2.zip[5.3.0.2]        | 5.3.0 | Jul  16, 2017\n| https://github.com/scaleborn/elasticsearch-linear-regression/releases/download/5.3.0.1/elasticsearch-linear-regression-5.3.0.1.zip[5.3.0.1]        | 5.3.0 | Jun  30, 2017\n|===\n\n## Examples\n### Predicting house prices\nThe idea is very simple. We have data in our Elasticsearch index representing\nsold houses in our region with their prices and some features like square footage of\nthe house, # of bathrooms, # of bedrooms etc. Now we want to find out what\nprice we have to pay for the house of our dreams.\n\nIn this example we use test data from: http://wiki.csc.calpoly.edu/datasets/attachment/wiki/Houses/RealEstate.csv?format=raw\n\nTo import the data into Elasticsearch we use Logstash and this pipeline config\nhttps://github.com/scaleborn/elasticsearch-linear-regression/tree/master/examples/houseprices/house-prices-import.conf[house-prices-import.conf]:\n....\n./bin/logstash -f house-prices-import.conf\n....\n\nThe indexed documents will have this form:\n[source,js]\n--------------------------------------------------\n{\n  \"_index\": \"houses\",\n  \"_type\": \"prices\",\n  \"_id\": \"AV0zjVhTomRh2LZNgmfJ\",\n  \"_source\": {\n      \"bathrooms\": 3,\n      \"bedrooms\": 4,\n      \"size\": 4168,\n      \"mls\": \"140077\",\n      \"price\": 1100000,\n      \"location\": \"Morro Bay\",\n      \"price_sq_ft\": 263.92,\n      \"status\": \"Short Sale\"\n  }\n}\n--------------------------------------------------\n\nWe can now query the index for houses in \"Morro Bay\" and predict the price\nfor our dream house with the desired features: 3 bedrooms,\n2 bathrooms and 2000 square feet:\n[source,js]\n--------------------------------------------------\n/houses/_search?size=0\n{\n    \"query\": {\n        \"match\" : {\n            \"location\" : \"Morro Bay\"\n        }\n    },\n  
  \"aggs\": {\n        \"dream_house_price\": {\n            \"linreg_predict\": {\n                \"fields\": [\"size\", \"bedrooms\", \"bathrooms\", \"price\"],\n                \"inputs\": [2000, 3, 2]\n            }\n        }\n    }\n}\n--------------------------------------------------\n\nAccording to the following prediction response, we should expect to pay about\n$650,000 for the desired house in \"Morro Bay\".\n[source,js]\n--------------------------------------------------\n{\n    \"aggregations\": {\n        \"dream_house_price\": {\n            \"value\": 649918.0709489314,\n            \"coefficients\": [\n                228318.6161854365,\n                249.02340193904183,\n                -68314.4830871133,\n                64248.05007337558\n            ]\n        }\n    }\n}\n--------------------------------------------------\n\nBy using sub-aggregations we can find out the estimated prices per location:\n[source,js]\n--------------------------------------------------\n/houses/_search?size=0\n{\n    \"aggs\": {\n        \"locations\": {\n            \"terms\": {\n                \"field\": \"location.keyword\",\n                \"size\": 15\n            },\n            \"aggs\": {\n                \"dream_house_price\": {\n                    \"linreg_predict\": {\n                        \"fields\": [\"size\", \"bedrooms\", \"bathrooms\", \"price\"],\n                        \"inputs\": [2000, 3, 2]\n                    }\n                }\n            }\n        }\n    }\n}\n--------------------------------------------------\n\nThe response reveals that \"Arroyo Grande\" would be\nthe most expensive region for our dream house:\n\n[source,js]\n--------------------------------------------------\n{\n    \"aggregations\": {\n        \"locations\": {\n            \"buckets\": [\n                {\n                    \"key\": \"Santa Maria-Orcutt\",\n                    \"doc_count\": 265,\n                    \"dream_house_price\": {\n   
                     \"value\": 256251.9105297585,\n                        \"coefficients\": [\n                            26437.192829649313,\n                            81.19071633227178,\n                            6825.9128627023265,\n                            23477.773223729317\n                        ]\n                    }\n                },\n                {\n                    \"key\": \"Paso Robles\",\n                    \"doc_count\": 85,\n                    \"dream_house_price\": {\n                        \"value\": 365620.0386191703,\n                        \"coefficients\": [\n                            42958.257094706176,\n                            151.7000907380368,\n                            6486.477078139843,\n                            -98.91559301451247\n                        ]\n                    }\n                },\n                ...\n                {\n                    \"key\": \" Arroyo Grande\",\n                    \"doc_count\": 12,\n                    \"dream_house_price\": {\n                        \"value\": 1140196.791331573,\n                        \"coefficients\": [\n                            728566.7474390095,\n                            1956.6474540196602,\n                            -706891.620925945,\n                            -690495.0006844609\n                        ]\n                    }\n                }\n                ...\n            ]\n        }\n    }\n}\n--------------------------------------------------\n\n\n## License\n\nLicensed under the Apache License 2.0.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmbok%2Felasticsearch-linear-regression","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmbok%2Felasticsearch-linear-regression","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmbok%2Felasticsearch-linear-regression/lists"}