https://github.com/paulbrodersen/entropy_estimators

Estimators for the entropy and other information theoretic quantities of continuous distributions
https://github.com/paulbrodersen/entropy_estimators

density-estimation entropy

Last synced: 2 months ago
JSON representation

Estimators for the entropy and other information theoretic quantities of continuous distributions

Host: GitHub
URL: https://github.com/paulbrodersen/entropy_estimators
Owner: paulbrodersen
License: gpl-3.0
Created: 2016-03-10T13:44:27.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2024-05-14T11:09:29.000Z (about 2 years ago)
Last Synced: 2025-09-27T21:14:48.576Z (9 months ago)
Topics: density-estimation, entropy
Language: Python
Homepage:
Size: 51.8 KB
Stars: 145
Watchers: 7
Forks: 28
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          > [!CAUTION]

> This package implements the KL-estimator and the KSG-estimator for the entropy and mutual information of continuous variables (and a few variants thereof).

> These estimators have issues, and for most purposes, should likely no longer be the first choice.

> Please read [this comment](https://github.com/paulbrodersen/entropy_estimators/issues/11#issuecomment-2109577671) for a better explanation than I could give, as well as a link to an alternative implementation.

# Entropy estimators

This module implements estimators for the entropy and other

information theoretic quantities of continuous distributions, including:

* entropy / Shannon information (`get_h`),

* mutual information (`get_mi`),

* partial mutual information & transfer entropy (`get_pmi`),

* specific information (`get_imin`), and

* partial information decomposition (`get_pid`).

The estimators derive from the

[Kozachenko and Leonenko (1987)](https://www.mathnet.ru/php/archive.phtml?wshow=paper&jrnid=ppi&paperid=797&option_lang=eng)

estimator, which uses k-nearest neighbour distances to compute the entropy of distributions, and extension thereof developed by

[Kraskov et al. (2004)](https://arxiv.org/abs/cond-mat/0305641),

and

[Frenzel and Pombe (2007)](https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.99.204101).

For **multivariate normal distributions**, the following quantities can be computed analytically from the covariance matrix.

* entropy (`get_h_mvn`),

* mutual information (`get_mi_mvn`), and

* partial mutual information & transfer entropy (`get_pmi_mvn`).

## Installation

Easiest via pip:

``` shell

pip install entropy_estimators

```

## Examples

```python

import numpy as np

from entropy_estimators import continuous

# create some normal test data

X = np.random.randn(10000, 2)

# compute the entropy from the determinant of the multivariate normal distribution:

analytic = continuous.get_h_mvn(X)

# compute the entropy using the k-nearest neighbour approach

# developed by Kozachenko and Leonenko (1987):

kozachenko = continuous.get_h(X, k=5)

print(f"analytic result: {analytic:.5f}")

print(f"K-L estimator: {kozachenko:.5f}")

```

## Frequently asked questions

#### Why is the estimate of the mutual information negative? Shouldn't it always be positive?

Mutual information is a strictly positive quantity. However, its *estimate* need not be, and in fact, the nearest neighbour estimators are known to be biased estimators (Kraskov et al. 2004). Unfortunately, the bias appears to depend on multiple factors, primarily the number of samples and the choice of the `k` parameter, and thus cannot be known *a priori*. However, the bias itself can be estimated using a straightforward permutation / bootstrap approach:

1. Compute the mutual information estimate between two variables, X and Y.

2. Permute either variable (or both), and re-compute the estimate. The mutual information between randomised variables is zero, so this estimate represents the bias.

3. Repeat the previous step many times to obtain a robust estimate of the bias.

``` python

import numpy as np

from scipy.stats import multivariate_normal

from entropy_estimators import continuous

# create two variables with a mutual information that can be computed analytically

means = [0, 1]

covariance = np.array([[1, 0.5], [0.5, 1]])

def get_entropy(covariance):

    """Compute the entropy of multivariate normal distribution from the covariance matrix."""

    if np.size(covariance) > 1:

        dim = covariance.shape[0]

        det = np.linalg.det(covariance)

    else: # scalar

        dim = 1

        det = covariance

    return 0.5 * np.log((2 * np.pi * np.e)**dim * det)

hx  = get_entropy(covariance[0, 0])

hy  = get_entropy(covariance[1, 1])

hxy = get_entropy(covariance)

analytic_result = hx + hy - hxy

# compute the mutual information from samples using the KSG estimator

distribution = multivariate_normal(means, covariance)

X, Y = distribution.rvs(1000).T

k = 5

ksg_estimate = continuous.get_mi(X, Y, k=k)

print(f"Analytic result: {analytic_result:.3f} nats")

print(f"KSG estimate: {ksg_estimate:.3f} nats")

print(f"Difference: {analytic_result - ksg_estimate:.3f} nats")

# Analytic result: 0.144

# KSG estimate: 0.113 nats

# Difference: 0.031 nats

# bootstrap to determine the bias

total_repeats = 100

bias = 0

Y_shuffled = Y.copy()

for ii in range(total_repeats):

    np.random.shuffle(Y_shuffled) # shuffling occurs in-place!

    bias += continuous.get_mi(X, Y_shuffled, k=k)

bias /= total_repeats

print("--------------------------------------------------------------------------------")

print(f"Bias estimat: {bias:.3f} nats")

print(f"Corrected KSG estimate: {ksg_estimate - bias:.3f}")

print(f"Difference to analytic result: {analytic_result - (ksg_estimate - bias):.3f} nats")

# Bias estimat: -0.020 nats

# Corrected KSG estimate: 0.132

# Difference to analytic result: 0.012 nats

```

## Alternative Implementations

### Scipy

[`scipy.stats.entropy`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html) : entropy of a categorical variable

### Scikit-learn

 * [`sklearn.metrics.mutual_info_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html#sklearn.metrics.mutual_info_score) : mutual information between two categorical variables

 * [`skelarn.metrics.mutual_info_regression`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html) :

 mutual information between two continuous variables; note that their implementation does not report negative mutual information scores and thus makes it impossible to compute bias corrections using the bootstrap approach outlined above.

### Non-parametric Entropy Estimation Toolbox (NPEET)

Alternative python implementations of the nearest-neighbour estimators for the entropy of continuous variables, the mutual information and the partial/conditioned mutual information ([link](https://github.com/gregversteeg/NPEET)). In principle, there are no major differences between their implementation and this repository. However, for large samples, their implementation may run a little slower as it uses lists as the primary data structure and doesn't support parallelisation. The implementation in this repository mostly uses numpy arrays, which allows vectorization of many calculations, and supports running operations on multiple cores by setting the `workers` argument to valus larger than one.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/paulbrodersen/entropy_estimators

Awesome Lists containing this project

README