{"id":34057262,"url":"https://github.com/bschulz81/robustregression","last_synced_at":"2026-04-06T02:02:26.262Z","repository":{"id":191588962,"uuid":"684969424","full_name":"bschulz81/robustregression","owner":"bschulz81","description":"a c++ library with statistical machine learning algorithms for linear and non-linear robust regression that can be used with python.","archived":false,"fork":false,"pushed_at":"2024-01-10T21:35:17.000Z","size":239,"stargazers_count":6,"open_issues_count":0,"forks_count":4,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-15T22:59:14.939Z","etag":null,"topics":["cmake","cpp","forward-search","gaussian-elimination","least-squares","levenberg-marquardt","linear-algebra","linear-regression","machine-learning","nonlinear-regression","open-mp","outlier-detection","outlier-removal","python3","ransac","regression","robust-regression","robust-statistics","s-estimator","statistics"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bschulz81.png","metadata":{"files":{"readme":"README.md","changelog":"Changelog.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-08-30T08:24:09.000Z","updated_at":"2025-06-30T21:55:17.000Z","dependencies_parsed_at":"2024-01-02T05:22:46.949Z","dependency_job_id":"b8736e40-32d2-4674-8c70-3cf154dcce88","html_url":"https://github.com/bschulz81/robustregression","commit_stats":null,"previous_names":["bschulz81/robustregression"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/bschulz81/robustregression","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bsc
hulz81%2Frobustregression","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bschulz81%2Frobustregression/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bschulz81%2Frobustregression/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bschulz81%2Frobustregression/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bschulz81","download_url":"https://codeload.github.com/bschulz81/robustregression/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bschulz81%2Frobustregression/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31456664,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-05T21:22:52.476Z","status":"online","status_checked_at":"2026-04-06T02:00:07.287Z","response_time":112,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cmake","cpp","forward-search","gaussian-elimination","least-squares","levenberg-marquardt","linear-algebra","linear-regression","machine-learning","nonlinear-regression","open-mp","outlier-detection","outlier-removal","python3","ransac","regression","robust-regression","robust-statistics","s-estimator","statistics"],"created_at":"2025-12-14T03:24:12.235Z","updated_at":"2026-04-06T02:02:26.256Z","avatar_url":"https://github.com/bschulz81.png","language":"C++","readme":"# RobustregressionLib\nThis is 
a c++ library with statistical machine learning algorithms for linear and non-linear robust regression.\n\nIt implements the statistical algorithms that were originally developed by the author for an autofocus application for telescopes\n\nand published in \tarXiv:2201.12466 [astro-ph.IM], https://doi.org/10.1093/mnras/stac189\n\nIn addition to these, two other robust algorithms were added and the curve fitting library has been brought into the form of a\nclear and simple API that can be easily used for very broad and general applications.\n\nThe library offers Python bindings for most functions, so the programmer has the choice between c++ and Python. In order to \ncompile the library with Python bindings, pybind11 and Python 3 should be installed and be found by CMake. \nOtherwise, only the C++ standard template library is used, together with OpenMP. \n\nThe documentation of the functions that the library uses is written in the C++ header files and in the __doc__ strings of the Python bindings.\n\nIn addition, a c++ application and a Python script are provided that show the functions of the library with very simple data.\n\nThe library is released under the MIT License.\n\nApart from his own publication, the author has not found the main robust curve fitting algorithms from this library in the statistical literature.\n\nOne of the algorithms presented here is a modification of the forward search algorithms by Hadi and Simonoff, Atkinson and Riani and the least trimmed squares\nmethod of Rousseeuw. 
The modification of the author is to use various estimators to include data after the algorithm tried a random initial combination.\n\nThe algorithm was originally developed for physical problems, where one has outliers but also data that is often subject to random fluctuations, like astronomical seeing.\nAs we observed during trials with the astronomy application, including the S estimator in the forward search removes large outliers but allows for small random fluctuations \nin the data, which resulted in more natural curve fits than if we had simply selected the \"best\" model that we would get from the forward search. If some degree of randomness is present,\nthe \"best\" model chosen by such a method would have the smallest error almost certainly by accident and would not include enough points for a precise curve fit.\nThe usage of the statistical estimators in the forward search appeared to prevent this problem.\n\nThe modified least trimmed squares method has also been used by the author in arXiv:2201.12466 with the various estimators to judge the quality of measurement data, which was \ndefined as \"Better\" when the algorithm, if used successively with several different estimators, comes to a more similar result. \n\nAnother algorithm presented in this library is an iterative method which also employs various estimators. It has the advantage that it should work with larger datasets, but its statistical \nproperties have not been extensively tested yet.\n\nBecause of the use of various statistical estimators and methods, the library builds on previous articles from the statistical literature. \nSome references are:\n\n1. Smiley W. Cheng, James C. Fu, Statistics \u0026 Probability Letters 1 (1983), 223-227, for the t distribution\n2. B. Peirce,  Astronomical Journal II 45 (1852) for the Peirce criterion\n3. Peter J. Rousseeuw, Christophe Croux, J. of the Amer. Statistical Assoc. (Theory and Methods), 88 (1993), p. 1273, for the S, Q, and T estimators\n5. T. 
C. Beers, K. Flynn and K. Gebhardt,  Astron. J. 100 (1), 32 (1990), for the Biweight Midvariance\n6. Transtrum, Mark K, Sethna, James P (2012). \"Improvements to the Levenberg-Marquardt algorithm for nonlinear least-squares minimization\". arXiv:1201.5885, for the Levenberg-Marquardt algorithm\n7. Rousseeuw, P. J. (1984). Journal of the American Statistical Association. 79 (388): 871–880. doi:10.1080/01621459.1984.10477105. JSTOR 2288718.\n   Rousseeuw, P. J.; Leroy, A. M. (2005) [1987]. Robust Regression and Outlier Detection. Wiley. doi:10.1002/0471725382. ISBN 978-0-471-85233-9, for the least trimmed squares algorithm\n8. Hadi and Simonoff, J. Amer. Statist. Assoc. 88 (1993) 1264-1272, Atkinson and Riani, Robust Diagnostic Regression Analysis (2000), Springer, for the forward search\n9. Croux, C., Rousseeuw, P.J. (1992). Time-Efficient Algorithms for Two Highly Robust Estimators of Scale. In: Dodge, Y., Whittaker, J. (eds) Computational Statistics. Physica, Heidelberg. https://doi.org/10.1007/978-3-662-26811-7_58 \n (For the faster version of the S and Q estimators.) The versions of the S and Q estimators in this library are now adapted from Croux and Rousseeuw to the C language. Note that it is not the same code because of some optimizations. Since many variables act on array indices in this algorithm, it was actually non-trivial to convert from Fortran to C. Especially for the Q estimator, the naive algorithm is faster for fewer than 100 datapoints. For the S estimator this is the case for fewer than 10 datapoints. Therefore, in these cases the naive versions are still used.\n10. Andrew F. Siegel. Robust regression using repeated medians. Biometrika, 69(1):242–244, 1982; Andrew Stein and Michael Werman. 1992. Finding the repeated median regression line. In Proceedings of the third annual ACM-SIAM symposium on Discrete algorithms (SODA '92). Society for Industrial and Applied Mathematics, USA, 409–413. 
https://dl.acm.org/doi/10.5555/139404.139485\n\n# Compiling and Installing the library:\n\nThe library needs CMake and a C++ compiler that is at least able to generate code according to the C++14 standard\n(by default, if one does not use Clang or macOS, it uses C++17, but with a CXX_STANDARD 14 switch set in the \nCMakeLists.txt for the library, it can use C++14, which is the default for Clang and macOS.) \n\nThe library also makes use of OpenMP and needs Python in version 3 and pybind11 to compile. \n\nBy default, the library also contains two test applications. \n\nIf the variable $WITH_TESTAPP is set to ON, a c++ test application is compiled and put in an /output directory. \n\nThe library also ships with a Python module. By default, the CMake variable $WITH_PYTHON is set to ON and a Python module will\nbe generated in addition to a c++ library.\n\nIf $WITH_TESTAPP and $WITH_PYTHON are set to ON, which is the default, then a Python test application will be generated in addition to\nthe C++ test application.\n\n## Installing with CMake\nOne can also compile the library traditionally via CMake. Typing \n\n\u003e cmake . \n\nin the package directory will generate the files necessary to compile the library, depending on the CMake generator set by the user.\n\nUnder Linux, the command\n\n\u003e make \n\nwill then compile the library.\n\nAfter compilation, an /output directory will appear with the library in binary form. \nBy default, the library also contains two test applications. \n\nIf the variable $WITH_TESTAPP is set, a c++ test application is compiled and put in an /output directory. \n\nThe library also ships with a Python module. 
By default, the CMake variable $WITH_PYTHON is ON and a Python module will\nbe generated in addition to a c++ library and copied into the /output directory.\n\nIf $WITH_TESTAPP and $WITH_PYTHON are set to ON, which is the default, then a Python test application will be generated and copied into the /output directory.\n\nBy compiling with CMake, the Python module is just compiled into the /output directory. It is not installed in a path for system libraries\nor python packages. So if one wants to use the Python module, one has either a) to write the script in the same folder where the module is, or b) load\nit by pointing Python to the explicit path of the module, or c) copy the module to a place where Python can find it.\n\n\nIf one does not want the Python module to be compiled, one may set the cmake variable $WITH_PYTHON to OFF.\n\n## Installing with PIP (This option is mostly for Windows since Linux distributions have their own package managers)\n\nIf one wants the module to be installed into a library path where Python scripts can find it easily, one can also compile and install the module \nwith pip instead of CMake by typing\n\n\u003e pip install .\n\nin the package directory. \n\nAfter that, the module is compiled and the binaries are copied into a path for Python site-packages, where Python scripts should be able to find it.\n\nNote that the binary code which is installed by pip is compiled with python 3.12. If your default python interpreter has a different version, the module will not load, unless you use a python 3.12 interpreter.\nIf you do not have a python 3.12 interpreter, pip can attempt to compile the module if a c++ compiler, cmake and pybind11 are found on the system.\n\nThe preferred way, especially under linux, is to compile the library with CMake.\n\nIf the module was compiled by pip, one can uninstall it by typing\n\n\u003e pip uninstall pyRobustRegressionLib\n\nUnder Linux, compiling with cmake should be preferred. 
Not least because linux package managers (e.g. portage from gentoo) sometimes have conflicts with pip.\n\nAdditionally, the python environment will select ninja as a default generator, which requires cleaning the build files\nif an earlier generation based on cmake was done that may have used a different generator.\n\n\n# Documentation of the library functions\n\n## For the Python language\n\n### Calling the documentation\nDocumentation of the API is provided in the C++ header files in the /library/include directory and in the docstrings of the Python module src/pyRobustRegressionLib.\nThe module can be loaded in Python scripts with \n\n\u003e import pyRobustRegressionLib as rrl\n\nThe command \n\n\u003e print(rrl.\\_\\_doc__)\n\nwill list the sub-modules of the library, which are \n\n- StatisticFunctions, \n- LinearRegression, \n- MatrixCode, \n- NonLinearRegression, \n- RobustRegression and \n- LossFunctions\n\nTheir docstrings can be called by\n\u003e print(rrl.*SubModuleName*.\\_\\_doc__)\n\ne.g.\n\n\u003e print(rrl.StatisticFunctions.\\_\\_doc__)\n\nwill list the functions and classes of the sub-module StatisticFunctions. The free functions and classes all have more detailed docstrings\nthat can be called as below, for example\n\n\u003e print (rrl.MatrixCode.Identity.\\_\\_doc__)\n\nMore convenient documentation is provided in the header files of the C++ source code of the package,\nwhich can be found in the /library/include directory.\n\nIn the testapp folder, two example programs, one in Python and one in C++, are provided.\nThese test applications have extensive comments and call many functions of the library, which shows the basic usage. 
\n\nThe curve fits that are done in the provided example programs are, however, very simple.\nThis was done in order to keep the demonstration short.\nThe library is of course intended to be used with larger and much more complicated data.\n\n### Simple linear regression\n\nLet us now define some data vectors X and Y, which we want to fit to a line.\n\n\u003e print(\"\\nDefine some arrays X and Y\")\n\u003e \n\u003e X=[-3.0,5.0,7.0,10.0,13.0,16.0,20.0,22.0]\n\u003e \n\u003e Y=[-210.0,430.0,590.0,830.0,1070.0,1310.0,1630.0,1790.0]\n\nA simple linear fit can be called as follows:\nAt first we create an instance of the result structure, where the result is stored.\n\n\u003e res=rrl.LinearRegression.result()\n\nThen, we call the linear regression function\n\u003e rrl.LinearRegression.linear_regression(X, Y, res)\n\n\n\nAnd finally, we print out the slope and intercept\n\u003e print(\"Slope\")\n\u003e \n\u003e print(res.main_slope)\n\u003e \n\u003e print(\"Intercept\")\n\u003e \n\u003e print(res.main_intercept)\n\n### Robust regression\nThe robust regression is just slightly more complicated. 
Let us first add two outliers into the data:\n\n\u003e X2=[-3.0, 5.0,7.0, 10.0,13.0,15.0,16.0,20.0,22.0,25.0]\n\u003e \n\u003e Y2=[ -210.0, 430.0, 590.0,830.0,1070.0,20.0,1310.0,1630.0,1790.0,-3.0]\n\n\n#### Siegel's repeated Median regression\nFor linear regression, the library also has the median linear regression from Siegel, which can be called in the same way\n\n\u003e rrl.LinearRegression.median_linear_regression(X2, Y2, res)\n\nbut is slightly more robust.\n\n\n#### Modified forward search/modified Lts regression\nMedian linear regression is a bit slower than simple linear regression and can give wrong results if many outliers are present.\n\nTherefore, the library has two methods for robust regression that can remove outliers.\n\nThe structure that stores the result for robust linear regression now includes the indices of the used and rejected points.\n\nWe instantiate it with\n\n\u003e res= rrl.RobustRegression.linear_algorithm_result()\n\nAdditionally, there is a struct that determines the behavior of the algorithm.  \nUpon construction without arguments, it gets filled with default values.\n\n\u003e ctrl= rrl.RobustRegression.modified_lts_control_linear()\n\nBy default, the S-estimator is used with an outlier_tolerance parameter of 3, and the method can flag at most 30% of the points as outliers. 
\nBut all this can be changed, of course.\n\nNow we call the modified forward search/modified lts algorithm\n\u003e rrl.RobustRegression.modified_lts_regression_linear(X2, Y2, ctrl, res)\n\nand then print the result, first the slope and intercept \n\u003e print(\"Slope\")\n\u003e \n\u003e print(res.main_slope)\n\u003e \n\u003e print(\"Intercept\")\n\u003e \n\u003e print(res.main_intercept)\n\nThen the indices of the outliers\n\n\u003e print(\"\\nOutlier indices\")\n\u003e \n\u003e for ind in res.indices_of_removedpoints:\n\u003e \n\u003e \u0026emsp;print(ind)\n\nWhen we want to change the estimator, the outlier tolerance parameter, the loss function, the maximum number of outliers we can find, or other details, we can simply\nset this in the control struct.\n\nBy default, the S estimator is used with an outlier_tolerance of 3, i.e. one has\n\n\u003e ctrl.rejection_method=rrl.RobustRegression.estimator_name.tolerance_is_decision_in_S_ESTIMATION\n\nand a point is an outlier if \n\u003e |err-median(errs)|/S_estimator(errs)\u003eoutlier_tolerance\n\nwhere err is the residual of the point and errs is the array of residuals. \nThey are measured by a specified loss function. If none was given, squared errors are used by default. \n\nBy default, the outlier_tolerance parameter is set to 3.\n\nIf we want to have a different value, e.g. 
3.5, for the outlier_tolerance parameter, we can easily set e.g.\n\n\u003e ctrl.outlier_tolerance=3.5\n\nThe command below would imply that the Q estimator is used:\n\n\u003e ctrl= rrl.RobustRegression.modified_lts_control_linear()\n\u003e \n\u003e ctrl.rejection_method=rrl.RobustRegression.estimator_name.tolerance_is_decision_in_Q_ESTIMATION\n\nThen a point with residual err is an outlier if, given the array of all residuals errs, we have:\n\u003e |err-median(errs)|/Q_estimator(errs)\u003eoutlier_tolerance\n\n\nThe command below would change the estimator to the interquartile range method when the modified lts/modified forward search algorithm is used. \n\nWith the setting below, a point is an outlier if its residual is below Q1 − tolerance * IQR or above Q3 + tolerance * IQR.\nIf we do not change the loss function, least squares is used by default.\n\n\u003e ctrl= rrl.RobustRegression.modified_lts_control_linear()\n\u003e \n\u003e ctrl.rejection_method=rrl.RobustRegression.estimator_name.tolerance_is_interquartile_range\n\nFor the interquartile range estimator, we should usually set the tolerance to 1.5\n\n\u003e ctrl.outlier_tolerance=1.5\n\nbefore we call the regression function.\n\nSimilarly, the loss function can be changed. For example, the absolute value of the residuals is usually more statistically robust than the square of the residuals. The command\n\n\u003e ctrl.lossfunction=rrl.LossFunctions.absolutevalue\n\nchanges the lossfunction to the absolute value. \n\nOne can also specify Huber's loss function, but then one also has to supply a border parameter beyond which the\nfunction starts its linear behavior.\nOne also can set a log cosh loss function, or a quantile loss function. 
The latter needs a gamma parameter to be specified within the interval of 0 and 1.\nFinally, one can define a custom loss function with a callback mechanism.\n\nNote that if we use the linear versions of the robust regression, then these methods would just make simple linear\nfits or repeated median fits, which minimize their own loss function, and the selected loss function of the library\nis then only used for outlier removal.\n\nWith the robust non-linear regression algorithms, the custom error functions are used for the curve fits as well \nas for the outlier removal procedures.\n\nIf one needs a linear fit where the custom error function is used as a metric for the curve fit as well as for the outlier\nremoval, one has to use the non-linear algorithm with a linear callback function. \n\nNote also that the quantile loss function is asymmetric. Therefore, the quantile loss function should mostly be used with\nthe linear robust curve fitting algorithms, since then it is only used for outlier removal. \nIf the quantile loss function is used with the non-linear robust algorithms it is likely to confuse the Levenberg-Marquardt algorithm \nbecause of its asymmetry.\n\n\n\n\nNote that the forward search can be very time consuming, as its running time scales with the binomial coefficient of the number of points over the maximum number of \noutliers it can find (which by default is 30% of the number of points).\n\n#### Iterative outlier removal\n\nA faster algorithm is the iterative outlier removal method, which makes a curve fit with all points, then removes the points whose residuals are outliers and\nthen makes another curve fit with the remaining point set, repeating this until no outliers are found anymore.\n\nIt can be called similarly:\n\nWe define the control structure. Note that it is different from the modified forward search/modified lts control structure. 
\n\u003e ctrl= rrl.RobustRegression.linear_algorithm_control()\n\nAnd again create a structure for the result\n\u003e res= rrl.RobustRegression.linear_algorithm_result()\n\nThen we start the algorithm\n\u003e rrl.RobustRegression.iterative_outlier_removal_regression_linear(X2, Y2, ctrl, res)\n\nAnd print the result\n\u003e print(\"Slope\")\n\u003e \n\u003e print(res.main_slope)\n\u003e \n\u003e print(\"Intercept\")\n\u003e \n\u003e print(res.main_intercept)\n\u003e \n\u003e print(\"\\nOutlier indices\")\n\u003e \n\u003e for ind in res.indices_of_removedpoints:\n\u003e \n\u003e \u0026emsp;print(ind)\n\n\n\n### Non Linear Regression\nNon-linear regression uses an implementation of the Levenberg-Marquardt algorithm.\n\nThe Levenberg-Marquardt algorithm needs an initial guess for an array of parameters beta, a function f(X,beta) to be fitted and a Jacobi matrix J(X,beta).\nFor example, a curve fit can be made by initialising the result, control and initdata structures as follows.\nFirst we call the constructors:\n\n\u003e res=rrl.NonLinearRegression.result()\n\u003e \n\u003e ctrl=rrl.NonLinearRegression.control()\n\u003e \n\u003e init=rrl.NonLinearRegression.initdata()\n\nThen we supply a Jacobi matrix, a function and an initial guess for the parameters to be found (assuming the function has just 2 curve fitting parameters):\n\n\u003e init.Jacobian=Jacobi\n\u003e\n\u003e init.f=linear\n\u003e \n\u003e init.initialguess = [1,1]\n\nHere, Jacobi and linear are two user defined functions. 
\n\nIf we wanted to fit a line, we would have to implement the function f(X,beta) and the Jacobian J(X,beta) as follows\n\n\u003e def linear(X, beta):\n\u003e\n\u003e \u0026emsp;   Y=[]\n\u003e\n\u003e \u0026emsp;   for i in range(0,len(X)):\n\u003e\n\u003e \u0026emsp;   \u0026emsp;    Y.append(beta[0] * X[i] + beta[1])\n\u003e\n\u003e \u0026emsp;    return Y\n\n\n\u003e def Jacobi(X, beta):\n\u003e \n\u003e\u0026emsp;\tm=rrl.MatrixCode.Matrix (len(X), len(beta))\n\u003e \n\u003e\u0026emsp;\tfor i in range(0,len(X)):\n\u003e \n\u003e\u0026emsp;\t\u0026emsp;\tm[i, 0] = X[i]\n\u003e \n\u003e\u0026emsp;\t\u0026emsp;\tm[i, 1] = 1\n\u003e \n\u003e\u0026emsp;\treturn m\n\nThen we can call the Levenberg-Marquardt algorithm\n\n\u003e rrl.NonLinearRegression.non_linear_regression(X2, Y2, init, ctrl, res)\n\nand then print the result:\n\u003e print(\"Slope\")\n\u003e \n\u003e print(res.beta[0])\n\u003e \n\u003e print(\"Intercept\")\n\u003e \n\u003e print(res.beta[1])\n\nThe class NonLinearRegression.control has various parameters that control the behavior of the Levenberg-Marquardt algorithm. Among them are various\nconditions for when the algorithm should stop (e.g. after some time, after there is no improvement, or after the error is below a certain margin). \nThey may be set as desired.\n\nAdditionally, there are parameters that control the step size of the algorithm. 
These parameters have the same names as described at\nhttps://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm and, if not otherwise specified,\ndefaults are used for them, which usually work.\n\n\n### Non-linear robust regression\nFor non-linear curve fits the library also has the modified forward search/modified lts algorithm and the iterative outlier removal as for linear regression.\nFor a modified forward search/lts algorithm with default parameters (S estimator, outlier_tolerance=3, loss function as least squares, 30% of the points are outliers at maximum), a call looks as follows:\n\nFirst the initialisation:\n\u003e res=rrl.RobustRegression.nonlinear_algorithm_result()\n\u003e \n\u003e ctrl=rrl.RobustRegression.modified_lts_control_nonlinear()\n\u003e \n\u003e init=rrl.NonLinearRegression.initdata()\n\u003e \n\u003e init.Jacobian=Jacobi\n\u003e \n\u003e init.f=linear\n\u003e \n\u003e init.initialguess = [1,1]\n\nThen the function call:\n\u003e rrl.RobustRegression.modified_lts_regression_nonlinear(X2, Y2, init, ctrl, res)\n\nFinally, we print the result:\n\u003e print(\"Slope\")\n\u003e \n\u003e print(res.beta[0])\n\u003e \n\u003e print(\"Intercept\")\n\u003e \n\u003e print(res.beta[1])\n\u003e \n\u003e print(\"\\nOutlier indices\")\n\u003e \n\u003e for ind in res.indices_of_removedpoints:\n\u003e \n\u003e \u0026emsp;  print(ind)\n\n\n\n\nFor the iterative outlier removal algorithm, a call to the regression function with default parameters (S estimator, outlier_tolerance=3, loss function as least squares, 30% of the points are outliers at maximum) would look as follows:\n\n\u003e res=rrl.RobustRegression.nonlinear_algorithm_result()\n\u003e \n\u003e ctrl=rrl.RobustRegression.nonlinear_algorithm_control()\n\u003e \n\u003e init=rrl.NonLinearRegression.initdata()\n\u003e \n\u003e init.Jacobian=Jacobi\n\u003e \n\u003e init.f=linear\n\u003e \n\u003e init.initialguess = [1,1]\n\n\u003e rrl.RobustRegression.iterative_outlier_removal_regression_nonlinear(X2, 
Y2, init, ctrl, res)\n\n### Custom error functions\nBy default, the library uses the sum of the squared residuals divided by the pointnumber as a loss function.\nOne can also specify Huber's loss function, but then one also has to supply a border parameter beyond which the loss function starts its linear behavior.\nOne also can set a log cosh loss function, or a quantile loss function. The latter needs a gamma parameter to be specified within the interval of 0 and 1.\n\nFinally, one can define a custom loss function with a callback mechanism.\n\nA user defined loss function is set up in two steps. A function\n\n\u003e def err_pp(Y,fY,pointnumber):\n\u003e \u0026emsp;   return (Y-fY)*(Y-fY)/pointnumber\n\ncomputes a residual between the data and the curve fit for a single point.\nAnother function\n\n\u003e def aggregate_err(errs):\n\u003e \u0026emsp;    res=0 \n\u003e \u0026emsp;   for i in range(0,len(errs)):\n\u003e \u0026emsp;  \u0026emsp;     res+=errs[i]\n\u003e \u0026emsp;   return res\n\ncomputes an entire error from a list of residuals generated by the function err_pp.\nNote that if the data does not correspond perfectly to the curve, this should be some kind of average error instead of a simple sum,\nsince otherwise removing a point will always reduce the error. Because the robust methods delete points based on the aggregate error, a simple sum would usually lead to\ncurve fits which do not have enough points taken into consideration. 
The division by the pointnumber can be done in err_pp (as in this example) or in aggregate_err.\n\nThe following call will then make a robust curve fit with the custom error function\n\n\u003e res9=rrl.RobustRegression.nonlinear_algorithm_result() \n\u003e ctrl9=rrl.RobustRegression.modified_lts_control_nonlinear()\n\u003e ctrl9.lossfunction=rrl.LossFunctions.custom\n\u003e ctrl9.loss_perpoint=err_pp\n\u003e ctrl9.aggregate_error=aggregate_err\n\n\u003e init9=rrl.NonLinearRegression.initdata() \n\u003e init9.Jacobian=Jacobi\n\u003e init9.f=linear\n\u003e init9.initialguess = [1,1]\n\u003e rrl.RobustRegression.modified_lts_regression_nonlinear(X2, Y2, init9, ctrl9, res9)\n\nNote that if we use the linear versions of the robust regression, then these methods would just make simple linear\nfits or repeated median fits, which minimize their own loss function, and the selected loss function of the library\nis then only used for outlier removal.\n\nWith the robust non-linear regression algorithms, the custom error functions are used for the curve fits as well \nas for the outlier removal procedures.\n\nIf one needs a linear fit where the custom error function is used as a metric for the curve fit as well as for the outlier\nremoval, one has to use the non-linear algorithm with a linear call back function. \n\nNote also that the quantile loss function is asymmetric. Therefore, the quantile loss function should mostly be used with\nthe linear robust curve fitting algorithms, since then it is only used for outlier removal. 
\nIf the quantile loss function is used with the non-linear robust algorithms it is likely to confuse the Levenberg-Marquardt algorithm \nbecause of its asymmetry.\n\n\n\n\n## For the C++ language:\n\nIn general, one has to include the library headers as follows:\n\n\u003e #include \"statisticfunctions.h\"\n\u003e\n\u003e #include \"Matrixcode.h\"\n\u003e \n\u003e #include \"linearregression.h\"\n\u003e \n\u003e #include \"robustregression.h\"\n\u003e \n\u003e #include \"nonlinearregression.h\"\n\u003e\n\u003e #include \"lossfunctions.h\" \n\u003e\n\n### Simple Linear Regression\nThe usage of the library in C++ is essentially the same as in Python. The file testapplication.cpp demonstrates the same function calls.\nThe X and Y data are stored in C++ valarrays. The control, result and initdata are not classes, but structs.\n\nFor example, if we define some X,Y data:\n\n\u003e valarray\u003cdouble\u003e X = { -3, 5,7, 10,13,16,20,22 };\n\u003e \n\u003e valarray\u003cdouble\u003e Y = { -210, 430, 590,830,1070,1310,1630,1790 };\n\nand initialize the struct where we store the result,\n\n\u003e Linear_Regression::result res;\n\nwe can call a linear regression as follows:\n\n\u003e Linear_Regression::linear_regression(X, Y, res);\n\nand we may print the result:\n\n\u003e printf(\" Slope \");\n\u003e \n\u003e printf(\"%f\", res.main_slope);\n\u003e \n\u003e printf(\"\\n Intercept \");\n\u003e \n\u003e printf(\"%f\", res.main_intercept);\n\n### Robust Regression methods\n\nLet us first define X and Y data with two outliers added.\n\u003e\tvalarray\u003cdouble\u003e X2 = { -3, 5,7, 10,13,15,16,20,22,25 };\n\u003e\n\u003e\tvalarray\u003cdouble\u003e Y2 = { -210, 430, 590,830,1070,20,1310,1630,1790,-3 };\n\n#### Median Linear Regression: \nMedian regression is slower but more robust than simple linear regression.\nThis command calls a robust curve fit with median regression\n\n\u003e Linear_Regression::result res;\n\u003e \n\u003e Linear_Regression::median_linear_regression(X2, 
Y2, res);\n\nand then we print the result\n\n\u003e printf(\" Slope \");\n\u003e \n\u003e printf(\"%f\", res.main_slope);\n\u003e \n\u003e printf(\"\\n Intercept \");\n\u003e \n\u003e printf(\"%f\", res.main_intercept);\n\n#### Modified forward search\nWhen many and large outliers are present, median regression sometimes does not deliver the desired results.\n\nThe library therefore has the modified forward search/modified lts algorithm and the iterative outlier removal algorithm that can\nfind outliers and remove them from the curve fit entirely.\n\nBelow is a call for a robust modified forward search/modified lts algorithm initialized with default\nparameters:\n\nFirst the structs for controlling the algorithm and storing the results are initialised,\n\n\u003e Robust_Regression::modified_lts_control_linear ctrl;\n\u003e \n\u003e Robust_Regression::linear_algorithm_result res;\n\nThen, we call the function for the curve fit:\n\u003e\tRobust_Regression::modified_lts_regression_linear(X2, Y2, ctrl, res);\n\nThen we print the result:\n\n\u003e\tprintf(\" Slope \");\n\u003e\n\u003e\tprintf(\"%f\", res.main_slope);\n\u003e\n\u003e\tprintf(\"\\n Intercept \");\n\u003e\n\u003e\tprintf(\"%f\", res.main_intercept);\n\u003e\n\u003e\tprintf(\"\\n Indices of outliers \");\n\u003e\n\u003e\tfor (size_t i = 0; i \u003c res.indices_of_removedpoints.size(); i++)\n\u003e\t{\n\u003e\t\u0026emsp; \tsize_t w = res.indices_of_removedpoints[i];\n\u003e\n\u003e\t\u0026emsp;\tprintf(\"%lu\", (unsigned long)w);\n\u003e\n\u003e\t\u0026emsp;\tprintf(\", \");\n\u003e\n\u003e\t}\n\n\nThe default parameters are as follows:\n\nThe S estimator is used, outlier_tolerance=3, at most 30% of the points are treated as outliers, and the loss function is given by the least squares of the residuals.\n\nAs in the Python documentation, a point with residual err is then an outlier if \n\n\u003e |err-median(errs)|/S_estimator\u003e3\n\nwhere errs is the array of residuals.\n\n\nIf the Q-estimator is used instead, the 
initialisation for the modified forward search/lts algorithm looks like\n\n\u003e\tRobust_Regression::modified_lts_control_linear ctrl;\n\u003e\n\u003e\tRobust_Regression::linear_algorithm_result res;\n\u003e\n\u003e\tctrl.rejection_method = Robust_Regression::tolerance_is_decision_in_Q_ESTIMATION;\n\nThen we call the regression function:\n\u003e\tRobust_Regression::modified_lts_regression_linear(X2, Y2, ctrl, res);\n\nIf the interquartile range estimator should be used, so that a point is removed if it is below Q1 − outlier_tolerance * IQR or above Q3 + outlier_tolerance * IQR, \nwe would have to set:\n\n\u003e ctrl.rejection_method = Robust_Regression::tolerance_is_interquartile_range;\n\nFor the interquartile range estimator, outlier_tolerance should be set to 1.5, so we additionally have to write:\n\u003e\tctrl.outlier_tolerance = 1.5;\n\nbefore we call the regression function:\n\u003e\tRobust_Regression::modified_lts_regression_linear(X2, Y2, ctrl, res);\n\nSimilarly, some may prefer to set the outlier_tolerance parameter to 3.5 when the S, Q, MAD or other estimators are used.\n\nThe loss function can also be changed. For example, the absolute value |err| of the residuals is usually more robust than the square err^2.\nThe command\n\n\u003e ctrl.lossfunction = LossFunctions::absolutevalue;\n\nchanges the loss function to the absolute value. \n\nOne can also specify Huber's loss function, but then one also has to supply a border parameter beyond which the function starts its linear behavior.\nOne can also set a log cosh loss function, or a quantile loss function. 
The latter needs a gamma parameter to be specified within the interval of 0 and 1.\nFinally, one can define a custom loss function with a callback mechanism.\n\n\nNote that if we use the linear versions of the robust regression, then these methods would just make simple linear\nfits or repeated median fits, which minimize their own loss function, and the selected loss function of the library\nis then only used for outlier removal.\n\nWith the robust non-linear regression algorithms, the custom error functions are used for the curve fits as well \nas for the outlier removal procedures.\n\nIf one needs a linear fit where the custom error function is used as a metric for the curve fit as well as for the outlier\nremoval, one has to use the non-linear algorithm with a linear call back function. \n\nNote also that the quantile loss function is asymmetric. Therefore, the quantile loss function should mostly be used with\nthe linear robust curve fitting algorithms, since then it is only used for outlier removal. \nIf the quantile loss function is used with the non-linear robust algorithms it is likely to confuse the Levenberg-Marquardt algorithm \nbecause of its asymmetry.\n\n#### Iterative outlier removal\nThe modified forward search/modified lts algorithm can be slow since its complexity is given by the binomial coefficient of the pointnumber over the maximum\nnumber of outliers to be found. The iterative outlier removal algorithm is faster. 
It makes a curve fit of all points, then removes the outliers based on the \nloss function and estimator and makes another curve fit and repeats the procedure until no outliers are found.\n\nA call to this method with the default parameters S estimator, outlier_tolerance=3, the loss function as least_squares and at most 30% of the points\ndesignated as outliers, would look as follows:\n\n\u003e\tRobust_Regression::linear_algorithm_result res;\n\u003e\n\u003e\tRobust_Regression::linear_algorithm_control ctrl;\n\u003e\n\u003e\tRobust_Regression::iterative_outlier_removal_regression_linear(X2, Y2, ctrl, res);\n\n### Non-linear Regression\nThe non-linear curve fitting algorithm implements a Levenberg-Marquardt algorithm.\n\nFor it to work, we must first supply initialisation data in the form of a valarray of initial parameters beta,\na function f(X,beta) to be found and a Jacobian J(X,beta).\n\nIf we wanted to fit a line, we would have to define the function f(X,beta) to be fitted\n\n\u003e valarray\u003cdouble\u003e linear(const valarray\u003cdouble\u003e\u0026X,const  valarray\u003cdouble\u003e\u0026 beta)\n\u003e {\n\u003e \n\u003e\t\u0026emsp;valarray\u003cdouble\u003e Y(X.size());\n\u003e \n\u003e\t\u0026emsp;for (size_t i = 0; i \u003c X.size(); i++)\n\u003e \n\u003e\t\u0026emsp;\u0026emsp;\tY[i] = beta[0] * X[i] + beta[1];\n\u003e \n\u003e\t\u0026emsp;return Y;\n\u003e \n\u003e }\n\u003e\n\nand its Jacobian Matrix J(X,beta)\n\u003e Matrix Jacobi(const valarray\u003cdouble\u003e\u0026 X, const  valarray\u003cdouble\u003e\u0026 beta)\n\u003e {\n\u003e \n\u003e\t\u0026emsp;Matrix ret(X.size(), beta.size());\n\u003e \n\u003e\t\u0026emsp;for (size_t i = 0; i \u003c X.size(); i++)\n\u003e \n\u003e\t\u0026emsp;{\n\u003e \n\u003e\t\u0026emsp;\u0026emsp;\tret(i, 0) = X[i];\n\u003e \n\u003e\t\u0026emsp;\u0026emsp;\tret(i, 1) = 1;\n\u003e \n\u003e\t\u0026emsp;}\n\u003e \n\u003e\t\u0026emsp;return ret;\n\u003e \n\u003e }\n\nA non-linear fit can then be initialised with default 
control parameters for the Levenberg-Marquardt algorithm like this:\n\n\n\u003e\tNon_Linear_Regression::result res;\n\u003e\n\u003e\tNon_Linear_Regression::control ctrl;\n\u003e\n\u003e\tNon_Linear_Regression::initdata init;\n\u003e\n\u003e\tinit.f = linear;\n\u003e\n\u003e\tinit.J = Jacobi;\n\u003e\n\u003e\tvalarray\u003cdouble\u003ebeta = { 1,1 };\n\u003e\n\u003e\tinit.initialguess = beta;\n\nThen one can call the function:\n\u003e\tNon_Linear_Regression::non_linear_regression(X, Y, init, ctrl, res);\n\nThen we may print the result:\n\n\u003e\tprintf(\"\\n Slope \");\n\u003e\n\u003e\tprintf(\"%f\", res.beta[0]);\n\u003e\n\u003e\tprintf(\"\\n intercept \");\n\u003e\n\u003e\tprintf(\"%f\", res.beta[1]);\n\nThe struct Non_Linear_Regression::control has various control parameters that control the behavior of the Levenberg-Marquardt algorithm, among them are various\nconditions when the algorithm should stop (e.g. after some time, or after there is no improvement or after the error is below a certain margin).\nThey may be set as desired.\n\nAdditionally, there are parameters that control the step size of the algorithm. 
These parameters have the same names as described at\nhttps://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm and if not otherwise specified,\ndefaults are used for them, which usually work.\n\n\n### Non-linear robust curve fits\n\nAs for the linear regression, the library has the same modified forward search/lts and iterative outlier removal algorithms for the non-linear case.\n\nFor a modified forward search/lts algorithm, a call looks as follows:\n\nFirst the initialisation, here with default parameters for the algorithm control:\n\n\u003e Robust_Regression::modified_lts_control_nonlinear ctrl;\n\u003e \n\u003e Robust_Regression::nonlinear_algorithm_result res;\n\u003e \n\u003e Non_Linear_Regression::initdata init;\n\u003e \n\u003e init.f = linear;\n\u003e \n\u003e init.J = Jacobi;\n\u003e \n\u003e valarray\u003cdouble\u003ebeta = { 1,1 };\n\u003e \n\u003e init.initialguess = beta;\n\nThen the call:\n\n\u003e Robust_Regression::modified_lts_regression_nonlinear(X2, Y2, init, ctrl, res);\n\nThen we print the result:\n\u003e \tprintf(\" Slope \");\n\u003e \n\u003e\tprintf(\"%f\", res.beta[0]);\n\u003e \n\u003e\tprintf(\"\\n Intercept \");\n\u003e\n\u003e\tprintf(\"%f\", res.beta[1]);\n\u003e \n\u003e\tprintf(\"\\n Indices of outliers \");\n\u003e\n\u003e\tfor (size_t i = 0; i \u003c res.indices_of_removedpoints.size(); i++){\n\u003e\t\u0026emsp;\tsize_t w = res.indices_of_removedpoints[i];\n\u003e  \t\u0026emsp;\tprintf(\"%lu\", (unsigned long)w);\n\u003e   \u0026emsp;  printf(\", \");\n\u003e\t}\n\n\nFor the iterative outlier removal algorithm, the call to the regression function would look as follows:\nFirst the initialisation, here again with default parameters for the algorithm control:\n\n\u003e Non_Linear_Regression::initdata init;\n\u003e \n\u003e Robust_Regression::nonlinear_algorithm_control ctrl;\n\u003e \n\u003e Robust_Regression::nonlinear_algorithm_result res;\n\u003e \n\u003e init.f = linear;\n\u003e \n\u003e init.J = Jacobi;\n\u003e \n\u003e valarray\u003cdouble\u003ebeta = { 1,1 };\n\u003e \n\u003e 
init.initialguess = beta;\n\nThen the call,\n\n\u003e Robust_Regression::iterative_outlier_removal_regression_nonlinear(X2, Y2, init, ctrl, res);\n\nPrinting the result works in the same way as above.\n\n### Custom error functions\nBy default, the library uses the sum of the squared residuals divided by the pointnumber as a loss function.\nOne can also specify Huber's loss function, but then one also has to supply a border parameter beyond which the loss function becomes linear.\nOne can also set a log cosh loss function, or a quantile loss function. The latter needs a gamma parameter to be specified within the interval of 0 and 1.\n\nFinally, one can define a custom loss function with a callback mechanism.\n\nWe may define user-defined loss functions. This is done in two steps. A function\n\n\u003e double err_pp(const double Y, double fY, const size_t pointnumber) {\n\u003e \u0026emsp;\treturn ((Y - fY)* (Y - fY)) /(double) pointnumber;\n\u003e }\n\ncomputes a residual between the data and the curve fit for a single point.\n\nAnother function\n\n\u003e double aggregate_err(valarray\u003cdouble\u003e\u0026 err){\n\u003e \u0026emsp;\treturn err.sum();\n\u003e }\n\ncomputes an entire error from a list of residuals generated by the function err_pp.\nNote that if the data does not correspond perfectly to the curve, this should preferably be some kind of average error instead of a simple sum,\nsince otherwise removing a point will always reduce the error. Since the robust methods delete points based on the aggregate error, this would usually lead to\ncurve fits which do not have enough points taken into consideration. 
The division by the pointnumber can be done in err_pp (as in this example) or in aggregate_err.\n\nThe following will then make a robust curve fit with the custom error function.\nFirst, the usual initialisation:\n\u003e Non_Linear_Regression::initdata init13;\n\u003e\tinit13.f = linear;\n\u003e\tinit13.J = Jacobi;\n\u003e\tinit13.initialguess = beta;\n\u003e  Robust_Regression::nonlinear_algorithm_result res13;\n\nThen we set the custom loss function:\n\n\u003e Robust_Regression::modified_lts_control_nonlinear ctrl13;\n\u003e ctrl13.lossfunction = LossFunctions::custom;\n\u003e ctrl13.loss_pp = err_pp;\n\u003e ctrl13.agg_err = aggregate_err;\n\nNote that if the aggregate error were not defined here, the results of the calls of the loss functions per point would just be summed.\n\nFinally, we can make the function call:\n\n\u003e Robust_Regression::modified_lts_regression_nonlinear(X2, Y2, init13, ctrl13, res13);\n\nNote that if we use the linear versions of the robust regression, then these methods would just make simple linear\nfits or repeated median fits, which minimize their own loss function, and the selected loss function of the library\nis then only used for outlier removal.\n\nWith the robust non-linear regression algorithms, the custom error functions are used for the curve fits as well \nas for the outlier removal procedures.\n\nIf one needs a linear fit where the custom error function is used as a metric for the curve fit as well as for the outlier\nremoval, one has to use the non-linear algorithm with a linear call back function. \n\nNote also that the quantile loss function is asymmetric. Therefore, the quantile loss function should mostly be used with\nthe linear robust curve fitting algorithms, since then it is only used for outlier removal. 
\nIf the quantile loss function is used with the non-linear robust algorithms it is likely to confuse the Levenberg-Marquardt algorithm \nbecause of its asymmetry.\n\n# Further documentation\nThe library has an online repository at https://github.com/bschulz81/robustregression where the source code can be accessed. \n\nThe detailed documentation of all the various control parameters of the curve fitting algorithms is in the docstrings of the Python module and in the C++ header file. \n\nAlso, the C++/Python test applications in the folder testapp are documented and show many function calls.\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbschulz81%2Frobustregression","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbschulz81%2Frobustregression","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbschulz81%2Frobustregression/lists"}