{"id":16647441,"url":"https://github.com/gyrdym/ml_preprocessing","last_synced_at":"2025-03-21T16:31:02.990Z","repository":{"id":34050736,"uuid":"167606603","full_name":"gyrdym/ml_preprocessing","owner":"gyrdym","description":"Implementation of popular data preprocessing algorithms for Machine learning","archived":false,"fork":false,"pushed_at":"2022-04-22T13:40:21.000Z","size":5701,"stargazers_count":20,"open_issues_count":6,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-18T02:51:38.854Z","etag":null,"topics":["data-preprocessing","data-science","machine-learning","machine-learning-algorithms","onehot-encoder","ordinal-encoder"],"latest_commit_sha":null,"homepage":"","language":"Dart","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gyrdym.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-01-25T20:06:17.000Z","updated_at":"2024-10-05T01:03:50.000Z","dependencies_parsed_at":"2022-08-08T00:00:15.433Z","dependency_job_id":null,"html_url":"https://github.com/gyrdym/ml_preprocessing","commit_stats":null,"previous_names":[],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gyrdym%2Fml_preprocessing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gyrdym%2Fml_preprocessing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gyrdym%2Fml_preprocessing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gyrdym%2Fml_preprocessing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/ow
ners/gyrdym","download_url":"https://codeload.github.com/gyrdym/ml_preprocessing/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244829421,"owners_count":20517303,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-preprocessing","data-science","machine-learning","machine-learning-algorithms","onehot-encoder","ordinal-encoder"],"created_at":"2024-10-12T08:44:45.114Z","updated_at":"2025-03-21T16:31:01.321Z","avatar_url":"https://github.com/gyrdym.png","language":"Dart","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Build Status](https://github.com/gyrdym/ml_preprocessing/workflows/CI%20pipeline/badge.svg)](https://github.com/gyrdym/ml_preprocessing/actions?query=branch%3Amaster+)\n[![Coverage Status](https://coveralls.io/repos/github/gyrdym/ml_preprocessing/badge.svg)](https://coveralls.io/github/gyrdym/ml_preprocessing)\n[![pub package](https://img.shields.io/pub/v/ml_preprocessing.svg)](https://pub.dartlang.org/packages/ml_preprocessing)\n[![Gitter Chat](https://badges.gitter.im/gyrdym/gyrdym.svg)](https://gitter.im/gyrdym/)\n\n# ml_preprocessing\nData preprocessing algorithms\n\n## What is data preprocessing?\n*Data preprocessing* is a set of techniques for preparing data before it can be used in machine learning algorithms.\n\n## Why is it needed?\nLet's say you have a dataset:\n\n````\n    ----------------------------------------------------------------------------------------\n    | Gender | Country | Height (cm) | Weight (kg) | Diabetes (1 - Positive, 0 - Negative) |\n    
----------------------------------------------------------------------------------------\n    | Female | France  |     165     |     55      |                    1                  |\n    ----------------------------------------------------------------------------------------\n    | Female | Spain   |     155     |     50      |                    0                  |\n    ----------------------------------------------------------------------------------------\n    | Male   | Spain   |     175     |     75      |                    0                  |\n    ----------------------------------------------------------------------------------------\n    | Male   | Russia  |     173     |     77      |                   N/A                 |\n    ----------------------------------------------------------------------------------------\n````\n\nEverything seems fine so far. Say you're about to train a classifier to predict whether a person has diabetes. But there is an obstacle: how can the data be used in mathematical equations when it contains string-valued columns (`Gender`, `Country`)? Things get even worse with the empty (N/A) value in the `Diabetes` column. There should be a way to convert this data to a valid numerical representation. This is where data preprocessing techniques come into play. You should decide how to convert string data (aka *categorical data*) to numbers and how to treat empty values. Of course, you could come up with your own algorithms for all of these operations, but there are plenty of well-known techniques for doing the conversions.\n\nThe aim of this library is to give data scientists who are interested in the Dart programming language access to these preprocessing techniques.\n\n## Prerequisites\n\nThe library depends on the [DataFrame class](https://github.com/gyrdym/ml_dataframe/blob/master/lib/src/data_frame/data_frame.dart) from the [ml_dataframe repo](https://github.com/gyrdym/ml_dataframe). 
It's necessary to add it as a dependency to your project, because the data must be packed into a [DataFrame](https://github.com/gyrdym/ml_dataframe/blob/master/lib/src/data_frame/data_frame.dart) before preprocessing. An example of the relevant part of pubspec.yaml:\n\n````\ndependencies:\n  ...\n  ml_dataframe: ^1.0.0\n  ...\n````\n\n## Usage examples\n\n### Getting started\n\nLet's download some data from [Kaggle](https://www.kaggle.com) - the amazing [Black Friday](https://www.kaggle.com/datasets/sdolezel/black-friday) dataset. It's a pretty interesting dataset with a huge number of observations (approx. 538,000 rows) and a good number of categorical features.\n\nFirst, import all the necessary libraries:\n\n````dart\nimport 'package:ml_dataframe/ml_dataframe.dart';\nimport 'package:ml_preprocessing/ml_preprocessing.dart';\n````\n\nThen read the CSV and create a data frame:\n\n````dart\nfinal dataFrame = await fromCsv('example/black_friday/black_friday.csv', \n  columns: [2, 3, 5, 6, 7, 11]);\n````\n\n### Categorical data\n\nOnce we have a dataframe, we can encode all the needed features. Let's analyze the dataset and decide which features should be encoded. In our case these are:\n\n````dart\nfinal featureNames = ['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status'];\n````\n\n### One-hot encoding\n\nLet's fit the one-hot encoder.\n\nWhy do we need to fit it? Fitting a categorical data encoder is the process of collecting all the unique category values in order to build the list of encoded labels. After the fitting is complete, one may use the fitted encoder on new data from the same source. 
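Before wiring up the library's encoder, it may help to see what "fitting" and one-hot encoding mean conceptually. The following is a minimal plain-Dart sketch, not the library's API; `fit` and `oneHot` are hypothetical helper names used only for illustration:

````dart
// Conceptual sketch of one-hot encoding, independent of ml_preprocessing.
// "Fitting" collects the unique category values; "encoding" then maps each
// value to a binary vector with a single 1 at the category's index.
List<String> fit(List<String> column) => column.toSet().toList();

List<int> oneHot(List<String> categories, String value) =>
    [for (final c in categories) c == value ? 1 : 0];

void main() {
  final categories = fit(['Female', 'Male', 'Female']); // ['Female', 'Male']
  print(oneHot(categories, 'Female')); // [1, 0]
  print(oneHot(categories, 'Male'));   // [0, 1]
}
````

Because the category list is collected once during fitting, the same fitted encoder produces vectors of a consistent width for any later data from the same source.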
\n\nIn order to fit the encoder, create an instance of the `Encoder` class and pass the fitting data to the constructor, along with the features to be encoded:\n\n````dart\nfinal encoder = Encoder.oneHot(\n  dataFrame,\n  columnNames: featureNames,\n);\n````\n\nNow let's encode the features:\n\n````dart\nfinal encoded = encoder.process(dataFrame);\n````\n\nWe used the same dataframe here, which is perfectly normal: when we created the encoder, we merely fitted it with the dataframe, and now it's time to apply the fitted encoder to that dataframe.\n\nIt's time to take a look at our processed data. Let's read it:\n\n````dart\nfinal data = encoded.toMatrix();\n\nprint(data);\n````\n\nThe output contains only numerical data, which is exactly what we wanted to achieve.\n\n### Label encoding\n\nLabel encoding is another well-known encoding method. The workflow is the same: first we fit the encoder, and after that we may use the \"trained\" encoder on data:\n\n````dart\n// fit the encoder\nfinal encoder = Encoder.label(\n  dataFrame,\n  columnNames: featureNames,\n);\n\n// apply the fitted encoder to the data\nfinal encoded = encoder.process(dataFrame);\n````\n\n### Numerical data normalization\n\nSometimes we need our numerical features normalized, which means treating every dataframe row as a vector and dividing it element-wise by its norm (Euclidean, Manhattan, etc.). For this, the library exposes the `Normalizer` class:\n\n````dart\nfinal normalizer = Normalizer(); // by default, the Euclidean norm is used\nfinal transformed = normalizer.process(dataFrame);\n````\n\nNote that if your data contains raw categorical values, normalization will fail, since it requires numerical values only. In that case, encode the data (e.g. using one-hot encoding) before normalization.\n\n### Data standardization\n\nA lot of machine learning algorithms require normally distributed data as their input. 
In practice, this means that every column in the data should have zero mean and unit variance, i.e. the data should be standardized. One may meet this requirement using the `Standardizer` class. When the class instance is created, each column's mean and standard deviation are extracted from the passed data and stored as fields of the instance, so that they can later be applied to standardize other data (or the same data that was used to create the `Standardizer`):\n\n````dart\nfinal dataFrame = DataFrame([\n  [  1,   2,   3],\n  [ 10,  20,  30],\n  [100, 200, 300],\n], headerExists: false);\n\n// fit the standardizer\nfinal standardizer = Standardizer(dataFrame);\n\n// apply the fitted standardizer to the data\nfinal transformed = standardizer.process(dataFrame);\n````\n\n### Pipeline\n\nThere is a convenient way to organize a sequence of data preprocessing operations - the `Pipeline` class:\n\n````dart\nfinal pipeline = Pipeline(dataFrame, [\n  toOneHotLabels(columnNames: ['Gender', 'Age', 'City_Category']),\n  toIntegerLabels(columnNames: ['Stay_In_Current_City_Years', 'Marital_Status']),\n  normalize(),\n  standardize(),\n]);\n````\n\nOnce you create (or rather fit) a pipeline, you may use it further in your application:\n\n````dart\nfinal processed = pipeline.process(dataFrame);\n````\n\n`toOneHotLabels`, `toIntegerLabels`, `normalize` and `standardize` are pipeable operator functions. A pipeable operator function is a factory that takes fitting data and creates a fitted pipeable entity (e.g. a `Normalizer` instance).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgyrdym%2Fml_preprocessing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgyrdym%2Fml_preprocessing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgyrdym%2Fml_preprocessing/lists"}