https://github.com/juliaml/mllabelutils.jl

Utility package for working with classification targets and label-encodings
https://github.com/juliaml/mllabelutils.jl

classification julia machine-learning preprocessing

Host: GitHub
URL: https://github.com/juliaml/mllabelutils.jl
Owner: JuliaML
License: other
Created: 2016-12-21T19:09:18.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2021-12-31T13:22:32.000Z (almost 3 years ago)
Last Synced: 2024-10-28T00:13:27.081Z (about 2 months ago)
Topics: classification, julia, machine-learning, preprocessing
Language: Julia
Homepage: http://mllabelutilsjl.readthedocs.io/
Size: 170 KB
Stars: 31
Watchers: 9
Forks: 13
Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE.md

# MLLabelUtils

_Utility package for working with classification targets. As such, this package provides the necessary functionality for interpreting class-predictions, as well as converting classification targets from one encoding to another._

| **Package Status** | **Package Evaluator** | **Build Status**  |
|:------------------:|:---------------------:|:-----------------:|
| [![License][license-img]][license-url] [![Documentation Status][docs-dev-img]][docs-dev-url] | [![PkgEval][pkgeval-img]][pkgeval-url] | [![CI][github-action-img]][github-action-url] [![Coverage Status][coveralls-img]][coveralls-url] |

## Introduction

In a classification setting, one usually treats the desired
output variable (also called *ground truths*, or *targets*) as a
discrete categorical variable. That is true even if the values
themselves are of numerical type, which they often are for
practical reasons.

In fact, it is a common requirement in Machine Learning related
experiments to encode the classification targets of some
supervised dataset in a very specific way.

There are multiple conventions that all have their own merits
and reasons to exist. Some models, such as the probabilistic
version of logistic regression, require the targets in the form
of numbers in the set {1,0}. On the other hand, margin-based
classifiers, such as SVMs, expect the targets to be in the set
{1,−1}.
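
As a rough illustration of the two conventions, the following plain-Julia
sketch maps between {1,0} and {1,−1} targets while keeping the element
type. The helper names `zeroone_to_margin` and `margin_to_zeroone` are
hypothetical and not part of this package; the sketch only assumes that
any value greater than zero denotes the positive class.

```julia
# Hypothetical helpers (not part of MLLabelUtils).
# Assumption: any value > 0 denotes the positive class.
zeroone_to_margin(y::AbstractVector{T}) where {T<:Real} =
    ifelse.(y .> 0, one(T), -one(T))

margin_to_zeroone(y::AbstractVector{T}) where {T<:Real} =
    ifelse.(y .> 0, one(T), zero(T))

y01 = Int8[0, 1, 0, 1, 1]
ym = zeroone_to_margin(y01)    # Int8 targets in the set {1,-1}
back = margin_to_zeroone(ym)   # converts back to the {1,0} convention
```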

This package provides the functionality needed to deal with these
different scenarios in an efficient, consistent, and convenient
manner. In particular, this library is designed with package
developers in mind who require their classification targets to
be in a specific format. To that end, the core focus of this
package is to provide all the tools needed to deal with
classification targets of arbitrary format. This includes
asserting if the targets are of a desired encoding, inferring the
concrete encoding the targets are in and how many classes they
represent, and converting from their native encoding to the
desired one.

## Example

The following code snippets show a simple "hello world" scenario
of how this package can be used to work with classification targets.

```julia
using MLLabelUtils
```

We can automatically derive the used encoding from the targets using
`labelenc`. This function looks at all elements and tries to determine
which specific encoding best describes the target array.

```julia
julia> true_targets = Int8[0, 1, 0, 1, 1];

julia> le = labelenc(true_targets)
# MLLabelUtils.LabelEnc.ZeroOne{Int8,Float64}(0.5)
```
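
To give an intuition for what such an inference involves, here is a
simplified heuristic in plain Julia. It is a sketch under stated
assumptions, not the algorithm this package uses: the function
`guess_encoding` and the symbols it returns are hypothetical, and the
order in which the candidate encodings are tried is an arbitrary choice.

```julia
# Hypothetical heuristic (not the package's implementation):
# inspect the set of distinct values and pick the first convention that fits.
function guess_encoding(targets::AbstractVector{<:Real})
    vals = Set(targets)
    vals ⊆ Set((0, 1)) && return :ZeroOne
    vals ⊆ Set((-1, 1)) && return :MarginBased
    all(v -> isinteger(v) && v >= 1, vals) && return :Indices
    return :NativeLabels   # fallback: treat the values as opaque labels
end

guess_encoding(Int8[0, 1, 0, 1, 1])
guess_encoding([-1, 1, 1])
guess_encoding([1, 2, 3])
```

Note that an array such as `[1, 1, 1]` is ambiguous under this heuristic,
which is one reason to store an encoding explicitly instead of inferring
it repeatedly.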

To just determine if a specific encoding is appropriate, one can use
the function `islabelenc`.

```julia
julia> islabelenc(true_targets, LabelEnc.ZeroOne)
# true

julia> islabelenc(true_targets, LabelEnc.MarginBased)
# false
```

Furthermore, we can compute a label map, which stores the indices
of all elements that belong to each class. This information is useful
for resampling strategies, such as stratified sampling.

```julia
julia> true_targets = [:yes,:no,:maybe,:yes];

julia> labelmap(true_targets)
# Dict{Symbol,Array{Int64,1}} with 3 entries:
#   :yes   => [1,4]
#   :maybe => [3]
#   :no    => [2]
```
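
To illustrate why such a map is handy for stratified sampling, the
following self-contained sketch draws a fixed fraction of indices from
every class. Both `build_labelmap` and `stratified_indices` are
hypothetical helpers written in plain Julia for this example; they are
not functions of this package, and the "at least one index per class"
rule is an assumption of the sketch.

```julia
using Random

# Hypothetical helper (not part of MLLabelUtils): class => indices map.
function build_labelmap(targets::AbstractVector{T}) where {T}
    lm = Dict{T,Vector{Int}}()
    for (i, y) in enumerate(targets)
        push!(get!(lm, y, Int[]), i)
    end
    return lm
end

# Draw roughly `frac` of each class's indices, but at least one per class.
function stratified_indices(lm::Dict, frac::Real; rng = Random.default_rng())
    picked = Int[]
    for idx in values(lm)
        n = max(1, round(Int, frac * length(idx)))
        append!(picked, shuffle(rng, idx)[1:n])
    end
    return sort!(picked)
end

targets = [:yes, :no, :maybe, :yes, :no, :yes]
lm = build_labelmap(targets)
sample = stratified_indices(lm, 0.5; rng = MersenneTwister(42))
```

Because the indices are drawn per class, every class present in
`targets` is also present in `targets[sample]`.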

If need be, we can convert to other encodings. Note that unless
explicitly specified, we try to preserve the `eltype` of the
input. However, this behaviour only comes into play in the case of
numbers.

```julia
julia> true_targets = Int8[0, 1, 0, 1, 1];

julia> convertlabel([:yes,:no], true_targets) # Equivalent to LabelEnc.NativeLabels([:yes,:no])
# 5-element Array{Symbol,1}:
#  :no
#  :yes
#  :no
#  :yes
#  :yes

julia> convertlabel(LabelEnc.MarginBased, true_targets) # Preserves eltype
# 5-element Array{Int8,1}:
#  -1
#   1
#  -1
#   1
#   1

julia> convertlabel(LabelEnc.MarginBased(Float32), true_targets) # Force new eltype
# 5-element Array{Float32,1}:
#  -1.0
#   1.0
#  -1.0
#   1.0
#   1.0
```

For encodings that can be multiclass, the number of classes can
be inferred from the targets, or specified explicitly.

```julia
julia> convertlabel(LabelEnc.Indices{Int}, true_targets) # number of classes inferred
# 5-element Array{Int64,1}:
#  2
#  1
#  2
#  1
#  1

julia> convertlabel(LabelEnc.Indices(Int,2), true_targets)
# 5-element Array{Int64,1}:
#  2
#  1
#  2
#  1
#  1

julia> convertlabel(LabelEnc.OneOfK{Bool}, true_targets)
# 2×5 Array{Bool,2}:
#  false   true  false   true   true
#   true  false   true  false  false
```

Note that the `OneOfK` encoding is special: because it is a
matrix-based encoding, it supports `ObsDim`, which can be used to
specify which dimension of the array denotes the observations.

```julia
julia> convertlabel(LabelEnc.OneOfK{Int}, true_targets, obsdim = 1)
# 5×2 Array{Int64,2}:
#  0  1
#  1  0
#  0  1
#  1  0
#  1  0
```
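
The role of the observation dimension can be sketched in plain Julia.
The function `onehot` below is a hypothetical helper, not part of this
package; it takes class indices (as in the `Indices` output shown
earlier) and only assumes that `obsdim = 2` means "one column per
observation" while `obsdim = 1` means "one row per observation".

```julia
# Hypothetical helper (not part of MLLabelUtils): one-of-k matrix from class indices.
function onehot(indices::AbstractVector{<:Integer}, k::Integer; obsdim::Integer = 2)
    n = length(indices)
    out = obsdim == 2 ? zeros(Int, k, n) : zeros(Int, n, k)
    for (obs, cls) in enumerate(indices)
        if obsdim == 2
            out[cls, obs] = 1   # classes along rows, observations along columns
        else
            out[obs, cls] = 1   # observations along rows, classes along columns
        end
    end
    return out
end

idx = [2, 1, 2, 1, 1]
by_column = onehot(idx, 2)           # 2×5 matrix
by_row = onehot(idx, 2; obsdim = 1)  # 5×2 matrix, the transpose of the above
```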

We also provide a `OneVsRest` encoding, which makes it possible to
transform a multiclass problem into a binary one.

```julia
julia> true_targets = [:yes,:no,:maybe,:yes];

julia> convertlabel(LabelEnc.OneVsRest(:yes), true_targets)
# 4-element Array{Symbol,1}:
#  :yes
#  :not_yes
#  :not_yes
#  :yes

julia> convertlabel(LabelEnc.TrueFalse, true_targets, LabelEnc.OneVsRest(:yes))
# 4-element Array{Bool,1}:
#   true
#  false
#  false
#   true
```

`NativeLabels` maps between data of an arbitrary type (e.g. Strings) and
the other label types (normally `LabelEnc.Indices{Int}` for an integer index).
When using it, you should always save the encoding in a variable
and pass it as an argument to `convertlabel`. Otherwise the encoding will
be inferred on each call, and different inputs will normally be encoded
differently.

```julia
julia> enc = LabelEnc.NativeLabels(["copper", "tin", "gold"])
# MLLabelUtils.LabelEnc.NativeLabels{String,3}(...)

julia> convertlabel(LabelEnc.Indices, ["gold", "copper"], enc)
# 2-element Array{Int64,1}:
#  3
#  1
```
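
The reason for saving the encoding can be shown without the package.
In the sketch below, `make_encoder` is a hypothetical helper that
assigns indices in the order the labels are listed. Deriving the order
from each batch (here via `unique`) is only a stand-in for "inferring
the encoding each time" and is an assumption of this example, not a
description of the package's internals.

```julia
# Hypothetical helper (not part of MLLabelUtils): label => index in listed order.
make_encoder(labels) = Dict(l => i for (i, l) in enumerate(labels))

fixed = make_encoder(["copper", "tin", "gold"])   # created once, reused for every batch
batch = ["gold", "copper"]

stable = [fixed[x] for x in batch]        # indices relative to the saved label order
per_batch = make_encoder(unique(batch))   # order derived from this batch only
inferred = [per_batch[x] for x in batch]  # "gold" no longer maps to 3
```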

Encodings such as `ZeroOne`, `MarginBased`, and `OneOfK` also provide
a `classify` function.

`ZeroOne` has a threshold parameter which represents the decision
boundary.

```julia
julia> classify(0.3, 0.5)
# 0.0

julia> classify(0.3, LabelEnc.ZeroOne) # equivalent to before
# 0.0

julia> classify(0.3, LabelEnc.ZeroOne(0.2)) # custom threshold
# 1.0

julia> classify(0.3, LabelEnc.ZeroOne(Int,0.2)) # custom type
# 1

julia> classify.([0.3,0.5], LabelEnc.ZeroOne(Int,0.4)) # broadcast support
# 2-element Array{Int64,1}:
#  0
#  1
```

`MarginBased` uses the sign to determine the class.

```julia
julia> classify(-5, LabelEnc.MarginBased)
# -1

julia> classify(0.2, LabelEnc.MarginBased)
# 1.0

julia> classify(-5, LabelEnc.MarginBased(Float64))
# -1.0

julia> classify.([-5,5], LabelEnc.MarginBased(Float64))
# 2-element Array{Float64,1}:
#  -1.0
#   1.0
```
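
The two decision rules described above (a threshold for `ZeroOne`, the
sign for `MarginBased`) can be written down in a few lines of plain
Julia. The functions `threshold_decision` and `sign_decision` are
hypothetical names for this sketch. How ties are resolved (a score
exactly on the boundary) is an assumption made here and may differ from
what the package does.

```julia
# Hypothetical helpers (not part of MLLabelUtils).
# Assumption: a score exactly on the boundary goes to the positive class.
threshold_decision(score::Real, threshold::Real = 0.5) = score >= threshold ? 1 : 0
sign_decision(score::Real) = score >= 0 ? 1 : -1

threshold_decision(0.3)                     # below the default threshold of 0.5
threshold_decision(0.3, 0.2)                # above a custom threshold
threshold_decision.([0.3, 0.5], 0.4)        # broadcasting over several scores
sign_decision.([-5, 5])                     # sign-based rule for margin outputs
```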

`OneOfK` determines the index of the largest element.

```julia
julia> pred_output = [0.1 0.4 0.3 0.2; 0.8 0.3 0.6 0.2; 0.1 0.3 0.1 0.6]
# 3×4 Array{Float64,2}:
#  0.1  0.4  0.3  0.2
#  0.8  0.3  0.6  0.2
#  0.1  0.3  0.1  0.6

julia> classify(pred_output, LabelEnc.OneOfK)
# 4-element Array{Int64,1}:
#  2
#  1
#  2
#  3

julia> classify(pred_output', LabelEnc.OneOfK, obsdim = 1) # note the transpose
# 4-element Array{Int64,1}:
#  2
#  1
#  2
#  3

julia> classify([0.1,0.2,0.6,0.1], LabelEnc.OneOfK) # single observation
# 3
```
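
Conceptually, this corresponds to an argmax over each observation. The
following plain-Julia sketch reproduces the result for the matrix used
above; `argmax_classify` is a hypothetical helper, not a function of
this package, and it assumes Julia 1.1 or later for `eachcol` and
`eachrow`.

```julia
# Hypothetical helper (not part of MLLabelUtils): index of the largest
# score for every observation along the chosen observation dimension.
function argmax_classify(scores::AbstractMatrix; obsdim::Integer = 2)
    observations = obsdim == 2 ? eachcol(scores) : eachrow(scores)
    return [argmax(o) for o in observations]
end

pred_output = [0.1 0.4 0.3 0.2; 0.8 0.3 0.6 0.2; 0.1 0.3 0.1 0.6]
by_column = argmax_classify(pred_output)              # one column per observation
by_row = argmax_classify(pred_output'; obsdim = 1)    # one row per observation
```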

## Documentation

For a much more detailed treatment, check out the
[latest documentation](http://mllabelutilsjl.readthedocs.io/en/latest/).

Additionally, you can make use of Julia's native docsystem. The
following example shows how to get additional information on
`convertlabel` within Julia's REPL:

```
?convertlabel
```

## Installation

This package is registered in Julia's General registry (the successor
of `METADATA.jl`) and can be installed as usual. Just start up Julia
and type the following code snippet into the REPL. It makes use of the
native Julia package manager.

```julia
using Pkg
Pkg.add("MLLabelUtils")
```

Additionally, for example if you encounter any sudden issues, or
in case you would like to contribute to the package, you can
manually choose to be on the latest (untagged) version.

```julia
using Pkg
# `Pkg.checkout` was removed in Julia 1.0; track the default branch instead.
Pkg.add(PackageSpec(name = "MLLabelUtils", rev = "master"))
```

## License

This code is free to use under the terms of the MIT license.

[pkgeval-img]: https://juliaci.github.io/NanosoldierReports/pkgeval_badges/M/MLLabelUtils.svg

[pkgeval-url]: https://juliaci.github.io/NanosoldierReports/pkgeval_badges/report.html

[license-img]: https://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat

[license-url]: LICENSE.md

[github-action-img]: https://github.com/JuliaML/MLLabelUtils.jl/actions/workflows/CI.yml/badge.svg

[github-action-url]: https://github.com/JuliaML/MLLabelUtils.jl/actions/workflows/CI.yml

[coveralls-img]: https://coveralls.io/repos/JuliaML/MLLabelUtils.jl/badge.svg?branch=master&service=github

[coveralls-url]: https://coveralls.io/github/JuliaML/MLLabelUtils.jl?branch=master

[docs-dev-img]: https://img.shields.io/badge/docs-latest-blue.svg?style=flat

[docs-dev-url]: https://mllabelutilsjl.readthedocs.io/en/latest/?badge=latest