{"id":15017561,"url":"https://github.com/pharo-ai/data-preprocessing","last_synced_at":"2026-02-09T15:35:54.452Z","repository":{"id":168994406,"uuid":"617050953","full_name":"pharo-ai/data-preprocessing","owner":"pharo-ai","description":"Project including data pre-processing algo. We aim to include scaling, centering, normalization, binarization methods.","archived":false,"fork":false,"pushed_at":"2025-09-29T11:33:23.000Z","size":33,"stargazers_count":1,"open_issues_count":4,"forks_count":1,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-09-29T13:25:35.346Z","etag":null,"topics":["data","pharo","pharo-smalltalk","preprocessing","smalltalk"],"latest_commit_sha":null,"homepage":"","language":"Smalltalk","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pharo-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-21T15:50:48.000Z","updated_at":"2025-09-29T11:33:27.000Z","dependencies_parsed_at":"2024-12-19T03:37:09.991Z","dependency_job_id":null,"html_url":"https://github.com/pharo-ai/data-preprocessing","commit_stats":{"total_commits":22,"total_committers":1,"mean_commits":22.0,"dds":0.0,"last_synced_commit":"162743f3c8392da34cf0765e438bff8fa5fdb5bf"},"previous_names":["pharo-ai/data-preprocessing"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/pharo-ai/data-preprocessing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pharo-ai%2Fdata-preprocessing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pharo-ai%2Fdata-preprocessing/
tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pharo-ai%2Fdata-preprocessing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pharo-ai%2Fdata-preprocessing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pharo-ai","download_url":"https://codeload.github.com/pharo-ai/data-preprocessing/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pharo-ai%2Fdata-preprocessing/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29271044,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-09T13:47:44.167Z","status":"ssl_error","status_checked_at":"2026-02-09T13:47:43.721Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","pharo","pharo-smalltalk","preprocessing","smalltalk"],"created_at":"2024-09-24T19:50:41.069Z","updated_at":"2026-02-09T15:35:54.424Z","avatar_url":"https://github.com/pharo-ai.png","language":"Smalltalk","funding_links":[],"categories":[],"sub_categories":[],"readme":"# data-preprocessing\n\n[![CI](https://github.com/pharo-ai/data-preprocessing/actions/workflows/ci.yml/badge.svg)](https://github.com/pharo-ai/data-preprocessing/actions/workflows/ci.yml)\n[![Coverage 
Status](https://coveralls.io/repos/github/pharo-ai/data-preprocessing/badge.svg?branch=master)](https://coveralls.io/github/pharo-ai/data-preprocessing?branch=master)\n[![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://raw.githubusercontent.com/PharoAI/data-imputers/master/LICENSE)\n[![Pharo version](https://img.shields.io/badge/Pharo-11-%23aac9ff.svg)](https://pharo.org/download)\n[![Pharo version](https://img.shields.io/badge/Pharo-10-%23aac9ff.svg)](https://pharo.org/download)\n[![Pharo version](https://img.shields.io/badge/Pharo-9.0-%23aac9ff.svg)](https://pharo.org/download)\n\nProject including data pre-processing algo. We aim to include scaling, centering, normalization, binarization methods.\n\n- [data-preprocessing](#data-preprocessing)\n\t- [How to install it?](#how-to-install-it)\n\t- [How to depend on it?](#how-to-depend-on-it)\n\t- [Quick Start](#quick-start)\n\t\t- [Encoding](#encoding)\n\t\t- [Normalization](#normalization)\n\t- [Encoding](#encoding-1)\n\t\t- [Ordinal Encoder](#ordinal-encoder)\n\t- [Normalization](#normalization-1)\n\t\t- [Min-Max Normalization (a.k.a. 
Rescaling)](#min-max-normalization-aka-rescaling)\n\t\t- [Standardization](#standardization)\n\t\t- [Usage](#usage)\n\t\t- [How to define new normalization strategies?](#how-to-define-new-normalization-strategies)\n\n\n\n## How to install it?\n\nTo install the project, go to the Playground (Ctrl+OW) in your [Pharo](https://pharo.org/) image and execute the following Metacello script (select it and press the Do-it button or Ctrl+D):\n\n```Smalltalk\nMetacello new\n  baseline: 'AIDataPreProcessing';\n  repository: 'github://pharo-ai/data-preprocessing/src';\n  load.\n```\n\n## How to depend on it?\n\nIf you want to add a dependency on this project to your project, include the following lines in your baseline method:\n\n```Smalltalk\nspec\n  baseline: 'AIDataPreProcessing'\n  with: [ spec repository: 'github://pharo-ai/data-preprocessing/src' ].\n```\n\nIf you are new to baselines and Metacello, check out this wonderful [Baselines](https://github.com/pharo-open-documentation/pharo-wiki/blob/master/General/Baselines.md) tutorial on Pharo Wiki.\n\n## Quick Start\n\n### Encoding\n\nIt is possible to encode a 2D collection with numerical categories easily like this:\n\n```st\n| collection |\ncollection := #( #( 'Female' 3 ) #( 'Male' 1 ) #( 'Female' 2 ) ).\n\t\nAIOrdinalEncoder new\n\tfitAndTransform: collection.  \"#(#(2 3) #(1 1) #(2 2))\"\n```\n\nFor more details, check the documentation below.\n\n### Normalization\n\nIt is possible to normalize your collections like this:\n\n```st\nnormalizer := AIStandardizationNormalizer new.\nnumbers := #(10 -3 4 2 -7 1000 0.1 -4.05).\nnormalizer normalize: numbers. \"#(-0.3261 -0.3628 -0.343 -0.3487 -0.3741 2.475 -0.3541 -0.3658)\"\n```\n\nFor more details, check the documentation below.\n\n## Encoding \n\n### Ordinal Encoder\n\n`AIOrdinalEncoder` is a simple encoder whose responsibility is to associate a number with each unique value of a 2D collection. 
(This can also be a `DataFrame`.)\n\nI can be fitted with a collection to compute the categories to use and then transform another collection (possibly the same one).\nAll values of the collection to transform must be present in the collection used for fitting; otherwise an `AIMissingCategory` exception is raised.\n\nI can be used like this:\n\n```st\n| collection |\ncollection := #( #( 'Female' 3 ) #( 'Male' 1 ) #( 'Female' 2 ) ).\n\t\nAIOrdinalEncoder new\n\tfit: collection;\n\ttransform: collection.  \"#(#(2 3) #(1 1) #(2 2))\"\n```\n\nOr\n\n```st\n| collection |\ncollection := #( #( 'Female' 3 ) #( 'Male' 1 ) #( 'Female' 2 ) ).\n\t\nAIOrdinalEncoder new\n\tfitAndTransform: collection.  \"#(#(2 3) #(1 1) #(2 2))\"\n```\n\nI can also be used on a `DataFrame` in the same way:\n\n```st\n| collection |\ncollection := DataFrame withRows: #( #( 'Female' 3 ) #( 'Male' 1 ) #( 'Female' 2 ) ).\n\t\nAIOrdinalEncoder new\n\tfitAndTransform: collection.  \"#(#(2 3) #(1 1) #(2 2))\"\n```\n\nThe user can also directly provide the categories to use like this:\n\n```st\n| collection |\ncollection := #( #( 'Female' 3 ) #( 'Male' 1 ) #( 'Female' 2 ) ).\n\t\nAIOrdinalEncoder new\n\tcategories: #( #( 'Female' 'Male' ) #( 3 1 2 ) );\n\ttransform: collection. \t\"#(#(1 1) #(2 2) #(1 3))\"\n```\n\nIn that case, the index of each element will be used as its category.\n\n## Normalization\n\nNormalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information.\n\nFor example, consider that you have two collections, `ages` and `salaries`:\n\n```Smalltalk\nages := #(25 19 30 32 41 50 24).\nsalaries := #(1600 1000 2500 2400 5000 3500 2500).\n```\n\nThose collections are on very different scales. The differences in salaries have a larger magnitude than differences in age. 
This can confuse some machine learning algorithms and force them to \"think\" that if the difference in salaries is 600 (euros) and the difference in age is 6 (years), then the salary difference is 100 times greater than the age difference. Such algorithms require data to be normalized - for example, both ages and salaries can be transformed to a scale of [0, 1].\n\nThere are different types of normalization. In this repository, we implement the two most commonly used strategies: [Min-Max Normalization](https://en.wikipedia.org/wiki/Feature_scaling) and [Standardization](https://en.wikipedia.org/wiki/Standard_score). You can easily define your own strategy by adding a subclass of `AINormalizer`.\n\n### Min-Max Normalization (a.k.a. Rescaling)\n\n[Min-Max or Rescaling](https://en.wikipedia.org/wiki/Feature_scaling) is a type of normalization in which every element of the numeric collection is transformed to a scale of [0, 1]:\n\n```\nx'[i] = (x[i] - x min) / (x max - x min)\n```\n\n### Standardization\n\n[Standardization](https://en.wikipedia.org/wiki/Standard_score) is a type of normalization in which every element of the numeric collection is scaled to be centered around the mean with a unit standard deviation:\n\n```\nx'[i] = (x[i] - x mean) / x std\n```\n\n### Usage\n\nYou can normalize any numeric collection by calling the `normalized` method on it:\n\n```Smalltalk\nnumbers := #(10 -3 4 2 -7 1000 0.1 -4.05).\nnumbers normalized. \"#(0.0169 0.004 0.0109 0.0089 0.0 1.0 0.007 0.0029)\"\n```\n\nBy default, it will use the `AIMinMaxNormalizer`. If you want to use a different normalization strategy, you can call `normalizedUsing:` on a collection:\n\n```Smalltalk\nnormalizer := AIStandardizationNormalizer new.\nnumbers normalizedUsing: normalizer. \"#(-0.3261 -0.3628 -0.343 -0.3487 -0.3741 2.475 -0.3541 -0.3658)\"\n```\n\nOr you can also give the collection to normalize to the normalizer:\n\n```st\nnormalizer := AIStandardizationNormalizer new.\nnormalizer normalize: numbers. 
\"#(-0.3261 -0.3628 -0.343 -0.3487 -0.3741 2.475 -0.3541 -0.3658)\"\n```\n\nFor the two normalization strategies that are defined in this package, we provide alias methods:\n\n```Smalltalk\nnumbers rescaled.\n\n\"is the same as\"\nnumbers normalizedUsing: AIMinMaxNormalizer new.\n```\n```Smalltalk\nnumbers standardized.\n\n\"is the same as\"\nnumbers normalizedUsing: AIStandardizationNormalizer new.\n```\n\nEach normalizer remembers the parameters of the original collection (e.g., min/max or mean/std) and can use them to restore the normalized collection to its original state:\n\n```Smalltalk\nnumbers := #(10 -3 4 2 -7 1000 0.1 -4.05).\n\nnormalizer := AIMinMaxNormalizer new.\nnormalizedNumbers := normalizer normalize: numbers. \"#(0.0169 0.004 0.0109 0.0089 0.0 1.0 0.007 0.0029)\"\nrestoredNumbers := normalizer restore: normalizedNumbers. \"#(10 -3 4 2 -7 1000 0.1 -4.05)\"\n```\n\n### How to define new normalization strategies?\n\nNormalization is implemented using a [strategy design pattern](https://en.wikipedia.org/wiki/Strategy_pattern). The `AI-Normalization` package defines an abstract class `AINormalizer` which has two abstract methods `AINormalizer class \u003e\u003e normalize: aCollection` and `AINormalizer class \u003e\u003e restore: aCollection`. To define a normalization strategy, please implement a subclass of `AINormalizer` and provide your own definitions of `normalize:` and `restore:` methods. Keep in mind that those methods must not modify the given collection but return a new one.\n\nTo normalize a collection using your own strategy, call:\n\n```Smalltalk\nnormalizer := YourCustomNormalizer new.\nnumbers normalizedUsing: normalizer.\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpharo-ai%2Fdata-preprocessing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpharo-ai%2Fdata-preprocessing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpharo-ai%2Fdata-preprocessing/lists"}