{"id":15008492,"url":"https://github.com/gyrdym/ml_algo","last_synced_at":"2025-04-06T22:08:33.268Z","repository":{"id":21325725,"uuid":"87011370","full_name":"gyrdym/ml_algo","owner":"gyrdym","description":"Machine learning algorithms in Dart programming language","archived":false,"fork":false,"pushed_at":"2024-09-07T15:34:20.000Z","size":9924,"stargazers_count":193,"open_issues_count":7,"forks_count":32,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-30T20:11:13.815Z","etag":null,"topics":["algorithm","batch-gradient-descent","classifier","dart","dartlang","data-science","hyperparameters","lasso-regression","linear-regression","logistic-regression","machine-learning","machine-learning-algorithms","mini-batch-gradient-descent","regression","sgd","softmax","softmax-algorithm","softmax-classifier","softmax-regression","stochastic-gradient-descent"],"latest_commit_sha":null,"homepage":"https://gyrdym.github.io/ml_algo/","language":"Dart","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gyrdym.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["gyrdym"]}},"created_at":"2017-04-02T19:45:35.000Z","updated_at":"2025-03-19T08:47:47.000Z","dependencies_parsed_at":"2023-01-13T21:24:33.091Z","dependency_job_id":"ff08a4e3-8944-46b7-9f09-b1a85bd6f366","html_url":"https://github.com/gyrdym/ml_algo","commit_stats":{"total_commits":958,"total_committers":3,"mean_commits":319.3333333333333,"dds":0.03340292275574108,"last_synced_commit":"c118e968308b42f218c19d7b1b326f4b2f987792"},"prev
ious_names":[],"tags_count":83,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gyrdym%2Fml_algo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gyrdym%2Fml_algo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gyrdym%2Fml_algo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gyrdym%2Fml_algo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gyrdym","download_url":"https://codeload.github.com/gyrdym/ml_algo/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247557767,"owners_count":20958047,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithm","batch-gradient-descent","classifier","dart","dartlang","data-science","hyperparameters","lasso-regression","linear-regression","logistic-regression","machine-learning","machine-learning-algorithms","mini-batch-gradient-descent","regression","sgd","softmax","softmax-algorithm","softmax-classifier","softmax-regression","stochastic-gradient-descent"],"created_at":"2024-09-24T19:19:06.300Z","updated_at":"2025-04-06T22:08:33.243Z","avatar_url":"https://github.com/gyrdym.png","language":"Dart","funding_links":["https://github.com/sponsors/gyrdym"],"categories":[],"sub_categories":[],"readme":"[![Build Status](https://github.com/gyrdym/ml_algo/workflows/CI%20pipeline/badge.svg)](https://github.com/gyrdym/ml_algo/actions?query=branch%3Amaster+)\n[![pub 
package](https://img.shields.io/pub/v/ml_algo.svg)](https://pub.dartlang.org/packages/ml_algo)\n\n# Machine learning algorithms for Dart developers - ml_algo library\n\nThe library is a part of the ecosystem:\n\n- [ml_algo library](https://github.com/gyrdym/ml_algo) - implementation of popular machine learning algorithms \n- [ml_preprocessing library](https://github.com/gyrdym/ml_preprocessing) - a library for data preprocessing\n- [ml_linalg library](https://github.com/gyrdym/ml_linalg) - a library for linear algebra \n- [ml_dataframe library](https://github.com/gyrdym/ml_dataframe)- a library for storing and manipulating data \n\n**Table of contents**\n\n- [What is ml_algo for](#what-is-ml_algo-for)\n- [The library content](#the-library-content)\n- [Examples](#examples)\n    - [Logistic regression](#logistic-regression)\n    - [Linear regression](#linear-regression)\n    - [Decision tree-based classification](#decision-tree-based-classification)\n    - [KDTree-based data retrieval](#kdtree-based-data-retrieval)\n- [Models retraining](#models-retraining)\n- [Notes on gradient-based optimisation algorithms](#a-couple-of-words-about-linear-models-which-use-gradient-optimisation-methods)\n- [Helpful articles on algorithms standing behind the library](#helpful-articles-on-algorithms-standing-behind-the-library)\n- [Contacts](#contacts)\n\n\n\n## What is ml_algo for?\n\nThe main purpose of the library is to give native Dart implementation of machine learning algorithms to those who are \ninterested both in Dart language and data science. This library aims at Dart VM and Flutter. It is also possible to use \nits core features in web applications using web assembly.\n\n## The library content\n\n- #### Model selection\n    - [CrossValidator](https://pub.dev/documentation/ml_algo/latest/ml_algo/CrossValidator-class.html). \n    A factory that creates instances of cross validators. 
Cross-validation allows researchers to fit different \n    [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)) of machine learning algorithms \n    assessing prediction quality on different parts of a dataset. \n\n- #### Classification algorithms\n    - [LogisticRegressor](https://pub.dev/documentation/ml_algo/latest/ml_algo/LogisticRegressor-class.html). \n    A class that performs linear binary classification of data. To use this kind of classifier your data has to be \n    [linearly separable](https://en.wikipedia.org/wiki/Linear_separability).\n    \n        - [LogisticRegressor.SGD](https://pub.dev/documentation/ml_algo/latest/ml_algo/LogisticRegressor/LogisticRegressor.SGD.html). \n    Implementation of the logistic regression algorithm based on stochastic gradient descent with L2 regularisation. \n    To use this kind of classifier your data has to be [linearly separable](https://en.wikipedia.org/wiki/Linear_separability).\n    \n        - [LogisticRegressor.BGD](https://pub.dev/documentation/ml_algo/latest/ml_algo/LogisticRegressor/LogisticRegressor.BGD.html). \n    Implementation of the logistic regression algorithm based on batch gradient descent with L2 regularisation. \n    To use this kind of classifier your data has to be [linearly separable](https://en.wikipedia.org/wiki/Linear_separability).\n        \n        - [LogisticRegressor.newton](https://pub.dev/documentation/ml_algo/latest/ml_algo/LogisticRegressor/LogisticRegressor.newton.html). \n    Implementation of the logistic regression algorithm based on Newton-Raphson method with L2 regularisation. \n    To use this kind of classifier your data has to be [linearly separable](https://en.wikipedia.org/wiki/Linear_separability).\n\n    - [SoftmaxRegressor](https://pub.dev/documentation/ml_algo/latest/ml_algo/SoftmaxRegressor-class.html). \n    A class that performs linear multiclass classification of data. 
To use this kind of classifier your data has to be \n    [linearly separable](https://en.wikipedia.org/wiki/Linear_separability).\n        \n    - [DecisionTreeClassifier](https://pub.dev/documentation/ml_algo/latest/ml_algo/DecisionTreeClassifier-class.html)\n    A class that performs classification using decision trees. May work with data with non-linear patterns.\n    \n    - [KnnClassifier](https://pub.dev/documentation/ml_algo/latest/ml_algo/KnnClassifier-class.html)\n    A class that performs classification using `k nearest neighbours algorithm` - it makes predictions based on \n    the first `k` closest observations to the given one.\n\n- #### Regression algorithms\n    - [LinearRegressor](https://pub.dev/documentation/ml_algo/latest/ml_algo/LinearRegressor-class.html). \n    A general class for finding a linear pattern in training data and predicting outcomes as real numbers.\n    \n        - [LinearRegressor.lasso](https://pub.dev/documentation/ml_algo/latest/ml_algo/LinearRegressor/LinearRegressor.lasso.html)\n    Implementation of the linear regression algorithm based on coordinate descent with lasso regularisation\n    \n        - [LinearRegressor.SGD](https://pub.dev/documentation/ml_algo/latest/ml_algo/LinearRegressor/LinearRegressor.SGD.html)\n    Implementation of the linear regression algorithm based on stochastic gradient descent with L2 regularisation\n    \n        - [LinearRegressor.BGD](https://pub.dev/documentation/ml_algo/latest/ml_algo/LinearRegressor/LinearRegressor.BGD.html)\n        Implementation of the linear regression algorithm based on batch gradient descent with L2 regularisation\n    \n        - [LinearRegressor.newton](https://pub.dev/documentation/ml_algo/latest/ml_algo/LinearRegressor/LinearRegressor.newton.html)\n        Implementation of the linear regression algorithm based on Newton-Raphson method with L2 regularisation\n     \n    - 
[KnnRegressor](https://pub.dev/documentation/ml_algo/latest/ml_algo/KnnRegressor-class.html)\n    A class that makes predictions for each new observation based on the first `k` closest observations from \n    training data. It may catch non-linear patterns of the data.\n    \n- #### Clustering and retrieval algorithms\n    - [KDTree](https://pub.dev/documentation/ml_algo/latest/kd_tree/KDTree-class.html) An algorithm for\n    efficient data retrieval.\n    - **Locality sensitive hashing.** A family of algorithms that randomly partition all reference data points into \n    different bins, which makes it possible to perform efficient K Nearest Neighbours search, since there is no need \n    to search for the neighbours through the entire data. The family is represented by the following classes:\n        - [RandomBinaryProjectionSearcher](https://pub.dev/documentation/ml_algo/latest/ml_algo/RandomBinaryProjectionSearcher-class.html)\n    \nFor more information on the library's API, please visit the [API reference](https://pub.dev/documentation/ml_algo/latest/ml_algo/ml_algo-library.html) \n\n## Examples\n\n### Logistic regression\n\nLet's classify records from a well-known dataset - [Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database)\nvia [Logistic regressor](https://github.com/gyrdym/ml_algo/blob/master/lib/src/classifier/logistic_regressor/logistic_regressor.dart)\n\n**Important note:**\n\nPlease pay attention to problems that classifiers and regressors exposed by the library solve. 
For e.g., \n[Logistic regressor](https://github.com/gyrdym/ml_algo/blob/master/lib/src/classifier/logistic_regressor/logistic_regressor.dart)\nsolves only **binary classification** problems, and that means that you can't use this classifier with a dataset \nwith more than two classes, keep that in mind - in order to find out more about regressors and classifiers, please refer to\nthe [API documentation](https://pub.dev/documentation/ml_algo/latest/ml_algo/ml_algo-library.html) of the package\n\nImport all necessary packages. First, it's needed to ensure if you have `ml_preprocessing` and `ml_dataframe` packages \nin your dependencies:\n\n````\ndependencies:\n  ml_dataframe: ^1.5.0\n  ml_preprocessing: ^7.0.2\n````\n\nWe need these repos to parse raw data in order to use it further. For more details, please\nvisit [ml_preprocessing](https://github.com/gyrdym/ml_preprocessing) repository page. \n\n**Important note:**\n\nRegressors and classifiers exposed by the library do not handle strings, booleans and nulls, they can only deal with \nnumbers! 
You necessarily need to convert all the improper values of your dataset to numbers, please refer to [ml_preprocessing](https://github.com/gyrdym/ml_preprocessing)\nlibrary to find out more about data preprocessing.\n\n````dart  \nimport 'package:ml_algo/ml_algo.dart';\nimport 'package:ml_dataframe/ml_dataframe.dart';\nimport 'package:ml_preprocessing/ml_preprocessing.dart';\n````\n\n### Read a dataset's file\n\nWe have 2 options here:\n\n- Download the dataset from [Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database).\n\n\u003cdetails\u003e\n\u003csummary\u003eInstructions\u003c/summary\u003e\n\n#### For a desktop application: \n\nJust provide a proper path to your downloaded file and use a function-factory `fromCsv` from `ml_dataframe` package to \nread the file:\n\n````dart\nfinal samples = await fromCsv('datasets/pima_indians_diabetes_database.csv');\n````\n\n#### For a flutter application:\n\nIt's needed to add the dataset to the flutter assets by adding the following config in the pubspec.yaml:\n\n````\nflutter:\n  assets:\n    - assets/datasets/pima_indians_diabetes_database.csv\n````\n\nYou need to create the assets directory in the file system and put the dataset's file there. After that you \ncan access the dataset:\n\n```dart\nimport 'package:flutter/services.dart' show rootBundle;\nimport 'package:ml_dataframe/ml_dataframe.dart';\n\nvoid main() async {\n  final rawCsvContent = await rootBundle.loadString('assets/datasets/pima_indians_diabetes_database.csv');\n  final samples = DataFrame.fromRawCsv(rawCsvContent);\n}\n```\n\u003c/details\u003e\n\n- Or we may simply use [getPimaIndiansDiabetesDataFrame](https://pub.dev/documentation/ml_dataframe/latest/ml_dataframe/getPimaIndiansDiabetesDataFrame.html) function\nfrom [ml_dataframe](https://pub.dev/packages/ml_dataframe) package. 
The function returns a ready to use [DataFrame](https://pub.dev/documentation/ml_dataframe/latest/ml_dataframe/DataFrame-class.html) instance\nfilled with `Pima Indians Diabetes Database` data.\n\n\u003cdetails\u003e\n\u003csummary\u003eInstructions\u003c/summary\u003e\n\n```dart\nimport 'package:ml_dataframe/ml_dataframe.dart';\n\nvoid main() {\n  final samples = getPimaIndiansDiabetesDataFrame();\n}\n```\n\n\u003c/details\u003e\n\n### Prepare datasets for training and testing\n\nData in this file is represented by 768 records and 8 features. The 9th column is a label column, it contains either 0 or 1 \non each row. This column is our target - we should predict a class label for each observation. The column's name is\n`Outcome`. Let's store it:\n\n````dart\nfinal targetColumnName = 'Outcome';\n````\n\nNow it's the time to prepare data splits. Since we have a smallish dataset (only 768 records), we can't afford to\nsplit the data into just train and test sets and evaluate the model on them, the best approach in our case is Cross-Validation. \nAccording to this, let's split the data in the following way using the library's [splitData](https://github.com/gyrdym/ml_algo/blob/master/lib/src/model_selection/split_data.dart) \nfunction:\n\n```dart\nfinal splits = splitData(samples, [0.7]);\nfinal validationData = splits[0];\nfinal testData = splits[1];\n```\n\n`splitData` accepts a `DataFrame` instance as the first argument and ratio list as the second one. Now we have 70% of our\ndata as a validation set and 30% as a test set for evaluating generalization errors.\n\n### Set up a model selection algorithm \n\nThen we may create an instance of `CrossValidator` class to fit the [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning))\nof our model. 
We should pass validation data (our `validationData` variable), and a number of folds into CrossValidator \nconstructor.\n \n````dart\nfinal validator = CrossValidator.kFold(validationData, numberOfFolds: 5);\n````\n\nLet's create a factory for the classifier with desired hyperparameters. We have to decide after the cross-validation \nif the selected hyperparameters are good enough or not:\n\n```dart\nfinal createClassifier = (DataFrame samples) =\u003e\n  LogisticRegressor(\n    samples,\n    targetColumnName,\n  );\n```\n\nIf we want to evaluate the learning process more thoroughly, we may pass `collectLearningData` argument to the classifier\nconstructor:\n\n```dart\nfinal createClassifier = (DataFrame samples) =\u003e\n  LogisticRegressor(\n    ...,\n    collectLearningData: true,\n  );\n```\n\nThis argument enables collecting the cost value at each optimization iteration, and you can see the cost values right after \nthe model creation.\n\n### Evaluate the performance of the model\n\nAssume we chose good hyperparameters. 
In order to validate this hypothesis, let's use CrossValidator instance \ncreated before:\n\n````dart\nfinal scores = await validator.evaluate(createClassifier, MetricType.accuracy);\n````\n\nSince the CrossValidator instance returns a [Vector](https://github.com/gyrdym/ml_linalg/blob/master/lib/vector.dart) of scores as a result of our predictor evaluation, we may choose\nany way to reduce all the collected scores to a single number, for instance, we may use Vector's `mean` method:\n\n```dart\nfinal accuracy = scores.mean();\n```  \n\nLet's print the score:\n````dart\nprint('accuracy on k fold validation: ${accuracy.toStringAsFixed(2)}');\n````\n\nWe can see something like this:\n\n````\naccuracy on k fold validation: 0.75\n````\n\nLet's assess our hyperparameters on the test set in order to evaluate the model's generalization error:\n\n```dart\nfinal testSplits = splitData(testData, [0.8]);\nfinal classifier = createClassifier(testSplits[0]);\nfinal finalScore = classifier.assess(testSplits[1], MetricType.accuracy);\n```\n\nThe final score is like:\n\n```dart\nprint(finalScore.toStringAsFixed(2)); // approx. 
0.75\n```\n\nIf we specified `collectLearningData` parameter, we may inspect the cost at each iteration in order to evaluate how it \nchanged from iteration to iteration during the learning process:\n\n```dart\nprint(classifier.costPerIteration);\n```\n\n### Write the model to a json file\n\nIt seems our model generalizes well, and that means we may use it in the future.\nTo do so we may store the model in a file as JSON:\n\n```dart\nawait classifier.saveAsJson('diabetes_classifier.json');\n```\n\nAfter that we can simply read the model from the file and make predictions:\n\n```dart\nimport 'dart:io';\n\nimport 'package:ml_algo/ml_algo.dart';\nimport 'package:ml_dataframe/ml_dataframe.dart';\n\nvoid main() async {\n  // ...\n  final fileName = 'diabetes_classifier.json';\n  final file = File(fileName);\n  final encodedModel = await file.readAsString();\n  final classifier = LogisticRegressor.fromJson(encodedModel);\n  final unlabelledData = await fromCsv('some_unlabelled_data.csv');\n  final prediction = classifier.predict(unlabelledData);\n\n  print(prediction.header); // ('class variable (0 or 1)')\n  print(prediction.rows); // [ \n                        //   (1),\n                        //   (0),\n                        //   (0),\n                        //   (1),\n                        //   ...,\n                        //   (1),\n                        // ]\n  // ...\n}\n```\n\nPlease note that all the hyperparameters that we used to generate the model are persisted as the model's read-only \nfields, and we can access them anytime:\n\n```dart\nprint(classifier.iterationsLimit);\nprint(classifier.probabilityThreshold);\n// and so on\n``` \n\n\u003cdetails\u003e\n\u003csummary\u003eAll the code for a desktop application:\u003c/summary\u003e\n\n````dart\nimport 'package:ml_algo/ml_algo.dart';\nimport 'package:ml_dataframe/ml_dataframe.dart';\nimport 'package:ml_preprocessing/ml_preprocessing.dart';\n\nvoid main() async {\n  // Another option - to use a toy dataset:\n  // final samples = getPimaIndiansDiabetesDataFrame();\n  final samples 
= await fromCsv('datasets/pima_indians_diabetes_database.csv', headerExists: true);\n  final targetColumnName = 'Outcome';\n  final splits = splitData(samples, [0.7]);\n  final validationData = splits[0];\n  final testData = splits[1];\n  final validator = CrossValidator.kFold(validationData, numberOfFolds: 5);\n  final createClassifier = (DataFrame samples) =\u003e\n    LogisticRegressor(\n      samples,\n      targetColumnName,\n    );\n  final scores = await validator.evaluate(createClassifier, MetricType.accuracy);\n  final accuracy = scores.mean();\n  \n  print('accuracy on k fold validation: ${accuracy.toStringAsFixed(2)}');\n\n  final testSplits = splitData(testData, [0.8]);\n  final classifier = createClassifier(testSplits[0]);\n  final finalScore = classifier.assess(testSplits[1], MetricType.accuracy);\n  \n  print(finalScore.toStringAsFixed(2));\n\n  await classifier.saveAsJson('diabetes_classifier.json');\n}\n````\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eAll the code for a flutter application:\u003c/summary\u003e\n\n````dart\nimport 'package:flutter/services.dart' show rootBundle;\nimport 'package:ml_algo/ml_algo.dart';\nimport 'package:ml_dataframe/ml_dataframe.dart';\nimport 'package:ml_preprocessing/ml_preprocessing.dart';\n\nvoid main() async {\n  final rawCsvContent = await rootBundle.loadString('assets/datasets/pima_indians_diabetes_database.csv');\n  // Another option - to use a toy dataset:\n  // final samples = getPimaIndiansDiabetesDataFrame();\n  final samples = DataFrame.fromRawCsv(rawCsvContent);\n  final targetColumnName = 'Outcome';\n  final splits = splitData(samples, [0.7]);\n  final validationData = splits[0];\n  final testData = splits[1];\n  final validator = CrossValidator.kFold(validationData, numberOfFolds: 5);\n  final createClassifier = (DataFrame samples) =\u003e\n    LogisticRegressor(\n      samples,\n      targetColumnName,\n    );\n  final scores = await 
validator.evaluate(createClassifier, MetricType.accuracy);\n  final accuracy = scores.mean();\n  \n  print('accuracy on k fold validation: ${accuracy.toStringAsFixed(2)}');\n\n  final testSplits = splitData(testData, [0.8]);\n  final classifier = createClassifier(testSplits[0]);\n  final finalScore = classifier.assess(testSplits[1], MetricType.accuracy);\n  \n  print(finalScore.toStringAsFixed(2));\n\n  await classifier.saveAsJson('diabetes_classifier.json');\n}\n````\n\u003c/details\u003e\n\n### Linear regression\n\nLet's try to predict house prices using linear regression and the famous [Boston Housing](https://www.kaggle.com/c/boston-housing) dataset.\nThe dataset contains 13 independent variables and 1 dependent variable - `medv` which is the target one (you can find\nthe dataset in [e2e/_datasets/housing.csv](https://github.com/gyrdym/ml_algo/blob/master/e2e/_datasets/housing.csv)).\n\nAgain, first we need to download the file and create a dataframe. The dataset has no header row, so we may either use autoheader or provide our own header. \nLet's use autoheader in our example:\n\n#### For a desktop application: \n\nJust provide a proper path to your downloaded file and use a function-factory `fromCsv` from `ml_dataframe` package to \nread the file:\n\n```dart\nfinal samples = await fromCsv('datasets/housing.csv', headerExists: false, columnDelimiter: ' ');\n``` \n\n#### For a flutter application:\n\nYou need to add the dataset to the flutter assets by adding the following config to the pubspec.yaml:\n\n````\nflutter:\n  assets:\n    - assets/datasets/housing.csv\n````\n\nYou need to create the assets directory in the file system and put the dataset's file there. 
After that you \ncan access the dataset:\n\n```dart\nimport 'package:flutter/services.dart' show rootBundle;\nimport 'package:ml_dataframe/ml_dataframe.dart';\n\nfinal rawCsvContent = await rootBundle.loadString('assets/datasets/housing.csv');\nfinal samples = DataFrame.fromRawCsv(rawCsvContent, fieldDelimiter: ' ');\n```\n\n### Prepare the dataset for training and testing\n\nData in this file is represented by 505 records and 13 features. The 14th column is a target. Since we use autoheader, the\ntarget's name is autogenerated and it is `col_13`. Let's store it in a variable:\n\n````dart\nfinal targetName = 'col_13';\n````\n\nThen let's shuffle the data:\n\n```dart\nfinal shuffledSamples = samples.shuffle();\n```\n\nNow it's time to prepare data splits. Let's split the shuffled data into train and test subsets using the library's [splitData](https://github.com/gyrdym/ml_algo/blob/master/lib/src/model_selection/split_data.dart) \nfunction:\n\n```dart\nfinal splits = splitData(shuffledSamples, [0.8]);\nfinal trainData = splits[0];\nfinal testData = splits[1];\n```\n\n`splitData` accepts a `DataFrame` instance as the first argument and ratio list as the second one. Now we have 80% of our\ndata as a train set and 20% as a test set.\n\nLet's train the model:\n\n```dart\nfinal model = LinearRegressor(trainData, targetName);\n```\n\nBy default, `LinearRegressor` uses a closed-form solution to train the model. One can also use a different solution type,\ne.g. 
stochastic gradient descent algorithm: \n\n```dart\nfinal model = LinearRegressor.SGD(\n  trainData,\n  targetName,\n  iterationLimit: 90,\n);\n```\n\nor linear regression based on coordinate descent with Lasso regularization:\n\n```dart\nfinal model = LinearRegressor.lasso(\n  trainData,\n  targetName,\n  iterationLimit: 90,\n);\n```\n\nNext, we should evaluate the performance of our model:\n\n```dart\nfinal error = model.assess(testData, MetricType.mape);\n\nprint(error);\n``` \n\nIf we are fine with the error, we can save the model for future use:\n\n```dart\nawait model.saveAsJson('housing_model.json');\n```\n\nLater we may use our trained model for prediction:\n\n```dart\nimport 'dart:io';\nimport 'package:ml_algo/ml_algo.dart';\nimport 'package:ml_dataframe/ml_dataframe.dart';\n\nvoid main() async {\n  final file = File('housing_model.json');\n  final encodedModel = await file.readAsString();\n  final model = LinearRegressor.fromJson(encodedModel);\n  final unlabelledData = await fromCsv('some_unlabelled_data.csv');\n  final prediction = model.predict(unlabelledData);\n    \n  print(prediction.header);\n  print(prediction.rows);\n}\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eAll the code for a desktop application:\u003c/summary\u003e\n\n````dart\nimport 'package:ml_algo/ml_algo.dart';\nimport 'package:ml_dataframe/ml_dataframe.dart';\n\nvoid main() async {\n  final samples = (await fromCsv('datasets/housing.csv', headerExists: false, columnDelimiter: ' ')).shuffle();\n  final targetName = 'col_13';\n  final splits = splitData(samples, [0.8]);\n  final trainData = splits[0];\n  final testData = splits[1];\n  final model = LinearRegressor(trainData, targetName);\n  final error = model.assess(testData, MetricType.mape);\n  \n  print(error);\n\n  await model.saveAsJson('housing_model.json');\n}\n````\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eAll the code for a flutter application:\u003c/summary\u003e\n\n````dart\nimport 
'package:flutter/services.dart' show rootBundle;\nimport 'package:ml_algo/ml_algo.dart';\nimport 'package:ml_dataframe/ml_dataframe.dart';\n\nvoid main() async {\n  final rawCsvContent = await rootBundle.loadString('assets/datasets/housing.csv');\n  final samples = DataFrame.fromRawCsv(rawCsvContent, fieldDelimiter: ' ').shuffle();\n  final targetName = 'col_13';\n  final splits = splitData(samples, [0.8]);\n  final trainData = splits[0];\n  final testData = splits[1];\n  final model = LinearRegressor(trainData, targetName);\n  final error = model.assess(testData, MetricType.mape);\n    \n  print(error);\n  \n  await model.saveAsJson('housing_model.json');\n}\n````\n\u003c/details\u003e\n\n### Decision tree-based classification\n\nLet's try to classify data from the well-known [Iris](https://www.kaggle.com/datasets/uciml/iris) dataset using a non-linear algorithm - [decision trees](https://en.wikipedia.org/wiki/Decision_tree).\n\nFirst, you need to download the data and place it in a proper place in your file system. To do so you should follow the\ninstructions which are given in the [Logistic regression](#logistic-regression) section. Or you may use [getIrisDataFrame](https://pub.dev/documentation/ml_dataframe/latest/ml_dataframe/getIrisDataFrame.html)\nfunction that returns a ready to use [DataFrame](https://pub.dev/documentation/ml_dataframe/latest/ml_dataframe/DataFrame-class.html) instance filled with the `Iris` dataset. \n\nAfter loading the data, we need to preprocess it. We should drop the `Id` column since it is just a row identifier and carries no information about the class. 
\nAlso, we need to encode the 'Species' column - originally, it contains 3 repeated string labels, and to feed it to the classifier\nwe need to convert the labels into numbers:\n\n```dart\nimport 'package:ml_algo/ml_algo.dart';\nimport 'package:ml_dataframe/ml_dataframe.dart';\nimport 'package:ml_preprocessing/ml_preprocessing.dart';\n\nvoid main() async {\n    final samples = getIrisDataFrame()\n      .shuffle()\n      .dropSeries(names: ['Id']);\n    \n    final pipeline = Pipeline(samples, [\n      toIntegerLabels(\n        columnNames: ['Species'], // Here we convert strings from 'Species' column into numbers\n      ),\n    ]);\n\n    // Apply the pipeline to get the encoded data used to train the model below\n    final processed = pipeline.process(samples);\n}\n```\n\nNext, let's create a model:\n\n```dart\nfinal model = DecisionTreeClassifier(\n  processed,\n  'Species',\n  minError: 0.3,\n  minSamplesCount: 5,\n  maxDepth: 4,\n);\n``` \n\nAs you can see, we specified 3 hyperparameters: `minError`, `minSamplesCount` and `maxDepth`. Let's look at the \nparameters in more detail:\n\n- `minError`. A minimum error on a tree node. If the error is less than or equal to the value, the node is considered a leaf.\n- `minSamplesCount`. A minimum number of samples on a node. If the number of samples is less than or equal to the value, the node is considered a leaf.\n- `maxDepth`. A maximum depth of the resulting decision tree. Once the tree reaches the `maxDepth`, all the level's nodes are considered leaves.\n\nAll the parameters serve as stopping criteria for the tree building algorithm.\n\nNow we have a ready to use model. As usual, we can save the model to a JSON file:\n\n```dart\nawait model.saveAsJson('path/to/json/file.json');\n```\n\nUnlike other models, in the case of a decision tree, we can visualise the algorithm result - we can save the model as an SVG file:\n\n```dart\nawait model.saveAsSvg('path/to/svg/file.svg');\n```\n\nOnce we have saved it, we can open the file in any image viewer, e.g. in a web browser. 
An example of the \nresulting SVG image:\n\n\u003cp align=\"center\"\u003e\n    \u003cimg height=\"600\" src=\"https://raw.github.com/gyrdym/ml_algo/master/e2e/decision_tree_classifier/iris_tree.svg?sanitize=true\"\u003e \n\u003c/p\u003e\n\n### KDTree-based data retrieval\n\nLet's take a look at another field of machine learning - data retrieval. The field is represented by a family of algorithms,\none of them is `KDTree` which is exposed by the library.\n\n`KDTree` is an algorithm that divides the whole search space into partitions in form of the binary tree which makes it \nefficient to retrieve data.\n\nLet's retrieve some data points through a kd-tree built on the [Iris](https://www.kaggle.com/datasets/uciml/iris) dataset.\n\nFirst, we need to prepare the data. To do so, it's needed to load the dataset. For this purpose, we may use \n[getIrisDataFrame](https://pub.dev/documentation/ml_dataframe/latest/ml_dataframe/getIrisDataFrame.html) function from [ml_dataframe](https://pub.dev/packages/ml_dataframe). The function returns prefilled with the Iris data DataFrame instance:\n\n```dart\nimport 'package:ml_algo/ml_algo.dart';\nimport 'package:ml_dataframe/ml_dataframe.dart';\n\nvoid main() {\n  final originalData = getIrisDataFrame();\n}\n```\n\nSince the dataset contains `Id` column that doesn't make sense and `Species` column that contains text data, we need to\ndrop these columns:\n\n```dart\nimport 'package:ml_algo/ml_algo.dart';\nimport 'package:ml_dataframe/ml_dataframe.dart';\n\nvoid main() {\n  final originalData = getIrisDataFrame();\n  final data = originalData.dropSeries(names: ['Id', 'Species']);\n}\n```\n\nNext, we can build the tree:\n\n```dart\nimport 'package:ml_algo/ml_algo.dart';\nimport 'package:ml_dataframe/ml_dataframe.dart';\n\nvoid main() {\n  final originalData = getIrisDataFrame();\n  final data = originalData.dropSeries(names: ['Id', 'Species']);\n  final tree = KDTree(data);\n}\n```\n\nAnd query nearest neighbours for an arbitrary point. 
Let's say we want to find the 5 nearest neighbours of the point `[6.5, 3.01, 4.5, 1.5]`:\n\n```dart\nimport 'package:ml_algo/ml_algo.dart';\nimport 'package:ml_dataframe/ml_dataframe.dart';\nimport 'package:ml_linalg/vector.dart';\n\nvoid main() {\n  final originalData = getIrisDataFrame();\n  final data = originalData.dropSeries(names: ['Id', 'Species']);\n  final tree = KDTree(data);\n  final neighbourCount = 5;\n  final point = Vector.fromList([6.5, 3.01, 4.5, 1.5]);\n  final neighbours = tree.query(point, neighbourCount);\n \n  print(neighbours);\n}\n```\n\nThe last instruction prints the following:\n\n```\n(Index: 75, Distance: 0.17349341930302867), (Index: 51, Distance: 0.21470911402365767), (Index: 65, Distance: 0.26095956499211426), (Index: 86, Distance: 0.29681616124778537), (Index: 56, Distance: 0.4172527193942372))\n```\n\nThe nearest point has index 75 in the original data. Let's check the record at that index:\n\n```dart\nimport 'package:ml_dataframe/ml_dataframe.dart';\n\nvoid main() {\n  final originalData = getIrisDataFrame();\n \n  print(originalData.rows.elementAt(75));\n}\n```\n\nIt prints the following:\n\n```\n(76, 6.6, 3.0, 4.4, 1.4, Iris-versicolor)\n```\n\nRemember, we dropped the `Id` and `Species` columns, which are the very first and the very last elements in the output. The\nremaining elements, `6.6, 3.0, 4.4, 1.4`, are close to our target point - `6.5, 3.01, 4.5, 1.5`, so the query result makes \nsense. 
\n\nIf you want to use `KDTree` outside the ml_algo ecosystem, meaning you don't want to use the [ml_linalg](https://pub.dev/packages/ml_linalg) and [ml_dataframe](https://pub.dev/packages/ml_dataframe)\npackages in your application, you may import only the [KDTree](https://pub.dev/documentation/ml_algo/latest/kd_tree/kd_tree-library.html) library and use the [fromIterable](https://pub.dev/documentation/ml_algo/latest/kd_tree/KDTree/KDTree.fromIterable.html) constructor and the [queryIterable](https://pub.dev/documentation/ml_algo/latest/kd_tree/KDTree/queryIterable.html)\nmethod to perform the query: \n\n```dart\nimport 'package:ml_algo/kd_tree.dart';\n\nvoid main() {\n  final tree = KDTree.fromIterable([\n    // some data here\n  ]);\n  final neighbourCount = 5;\n  final neighbours = tree.queryIterable([/* some point here */], neighbourCount);\n \n  print(neighbours);\n}\n```\n\nAs usual, we can persist our tree by saving it to a JSON file:\n\n```dart\nimport 'dart:convert';\nimport 'dart:io';\nimport 'package:ml_algo/ml_algo.dart';\nimport 'package:ml_dataframe/ml_dataframe.dart';\n\n// `main` is async because saving and reading the file are awaited\nvoid main() async {\n  final originalData = getIrisDataFrame();\n  final data = originalData.dropSeries(names: ['Id', 'Species']);\n  final tree = KDTree(data);\n \n  // ...\n\n  await tree.saveAsJson('path/to/json/file.json');\n \n  // ...\n\n  final file = await File('path/to/json/file.json').readAsString();\n  // `jsonDecode` comes from dart:convert\n  final encodedTree = jsonDecode(file) as Map\u003cString, dynamic\u003e;\n  final restoredTree = KDTree.fromJson(encodedTree);\n\n  print(restoredTree);\n}\n```\n\n## Models retraining\n\nOver time, a model that used to perform well can degrade in terms of prediction accuracy - in this case, we can retrain it. 
\nRetraining simply means re-running the same learning algorithm that was used to generate our current model,\nkeeping the same hyperparameters but using a new data set with the same features:\n\n```dart\nimport 'dart:io';\nimport 'package:ml_algo/ml_algo.dart';\nimport 'package:ml_dataframe/ml_dataframe.dart';\n\nvoid main() async {\n  final fileName = 'diabetes_classifier.json';\n  final file = File(fileName);\n  final encodedModel = await file.readAsString();\n  final classifier = LogisticRegressor.fromJson(encodedModel);\n\n  // ... \n  // here we do something and realize that our classifier performance is not so good\n  // ...\n\n  final newData = await fromCsv('path/to/dataset/with/new/data/to/retrain/the/classifier');\n  final retrainedClassifier = classifier.retrain(newData);\n}\n```\n\nThe workflow with other predictors (SoftmaxRegressor, DecisionTreeClassifier and so on) is quite similar to the one described\nabove for LogisticRegressor; feel free to experiment with other models.\n\n## A couple of words about linear models which use gradient optimisation methods\n\nSometimes you may get NaN or Infinity as a value of your score, or it may be equal to some implausible value \n(extremely large or extremely small). 
To prevent this, you need to find a proper value of the initial learning rate. You may also \nchoose between the following learning rate strategies: `constant`, `timeBased`, `stepBased` and `exponential`:\n\n```dart\nfinal createClassifier = (DataFrame samples) =\u003e\n    LogisticRegressor(\n      ...,\n      initialLearningRate: 1e-5,\n      learningRateType: LearningRateType.timeBased,\n      ...,\n    );\n```\n\n## Helpful articles on algorithms standing behind the library\n\n- [Linear Regression in Dart](https://medium.com/mlearning-ai/a-gentle-introduction-to-linear-regression-the-dart-way-9750214e6fa2?source=friends_link\u0026sk=e199d8f5b0bb71c97525be2ee7f5819b)\n- [Ordinary Least Squares (OLS) problem](https://medium.com/mlearning-ai/linear-regression-ordinary-least-squares-in-a-nutshell-c2e0d7ed260f?source=friends_link\u0026sk=5c8bc0228d29bc67ebe524a91d687619)\n- [Closed-Form solution for OLS in Dart](https://medium.com/mlearning-ai/ordinary-least-squares-closed-form-solution-the-dart-way-d7c0ee0e0d02?source=friends_link\u0026sk=9ba5a9da7fd3160b28c450ff6dc446a4)\n- [Gradient Descent in Dart](https://medium.com/mlearning-ai/gradient-descent-the-dart-way-2d6c39416a8a?source=friends_link\u0026sk=992b52c85a51ecea1c1e9e4afe2a8c1e)\n\n### Contacts\nIf you have questions, feel free to text me on\n - [X](https://x.com/ilgyrd) \n - [Telegram](https://t.me/Gyrdym)\n - [Facebook](https://www.facebook.com/ilya.gyrdymov)\n - [Linkedin](https://www.linkedin.com/in/gyrdym/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgyrdym%2Fml_algo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgyrdym%2Fml_algo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgyrdym%2Fml_algo/lists"}