# entropy

`entropy` is a tiny utility for calculating the Shannon entropy of a given file.

```
tuxⒶlattice:[~] => ./entropy --help
entropy 1.0.0
tux
A utility to calculate Shannon entropy of a given file

USAGE:
    entropy [FLAGS]

ARGS:
    The target file to measure

FLAGS:
    -h, --help              Prints help information
    -m, --metric-entropy    Returns metric entropy instead of Shannon entropy
    -V, --version           Prints version information
```

## Usage

To calculate the Shannon entropy of a given file, simply:
```
tuxⒶlattice:[~] => ./entropy path/to/file.bin
4.142214
```

To calculate the metric entropy of a given file, add the `--metric-entropy` flag:
```
tuxⒶlattice:[~] => ./entropy path/to/file.bin --metric-entropy
0.5177767
```

## What is Shannon entropy?

Shannon entropy can be described as the amount of "information" in a string.
It can be calculated from the following equation:
![Shannon Entropy Equation](https://wikimedia.org/api/rest_v1/media/math/render/svg/527fa6ed7da2d6fcfb64cc71b4fc09b4c248887a)

The output of this equation (when computed with `log_2`) tells you the minimum
number of bits required, on average, to encode each "symbol" of information in binary form.
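
As a concrete illustration of that equation, here is a minimal Rust sketch that treats each byte of a file as a symbol. This is only an example of the formula under that assumption, not necessarily how this tool is actually implemented:

```
// Illustrative sketch: byte-level Shannon entropy, assuming each byte of the
// file is one symbol. Not this tool's actual source code.
use std::env;
use std::fs;

/// Shannon entropy (in bits per byte) of a slice of bytes.
fn shannon_entropy(data: &[u8]) -> f64 {
    // Count how many times each of the 256 possible byte values appears.
    let mut counts = [0u64; 256];
    for &byte in data {
        counts[byte as usize] += 1;
    }

    // H = -sum(p_i * log2(p_i)) over the symbols that actually occur.
    let len = data.len() as f64;
    counts
        .iter()
        .filter(|&&count| count > 0)
        .map(|&count| {
            let p = count as f64 / len;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // Hypothetical driver: read the file named on the command line and print H.
    let path = env::args().nth(1).expect("usage: entropy <file>");
    let data = fs::read(&path).expect("failed to read file");
    println!("{:.7}", shannon_entropy(&data));
}
```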

Metric entropy is calculated by dividing the Shannon entropy by the length of
the symbol. Since we calculate Shannon entropy in bits (via `log_2`) and use
bytes as our symbols, we divide the Shannon entropy by eight (the number of bits in a byte).

Metric entropy is a number between 0 and 1, where 1 indicates that the
information (or symbols) is uniformly distributed across the string. This
can be used to assess how "random" or "uncertain" a particular string is. A
metric entropy closer to 0 also indicates that the data can likely be
compressed effectively.
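
Continuing the illustrative sketch above (again, an assumption about the approach rather than this tool's actual code), metric entropy is then just the byte-level Shannon entropy scaled into the range 0 to 1:

```
/// Metric entropy: byte-level Shannon entropy normalized by the 8 bits
/// carried by each byte symbol, yielding a value between 0 and 1.
fn metric_entropy(data: &[u8]) -> f64 {
    shannon_entropy(data) / 8.0
}
```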

## Demonstration

Let's calculate the Shannon entropy and metric entropy of a _really_ random file from `/dev/urandom`:
```
tuxⒶlattice:[~] => cat /dev/urandom | head -c 1000000 > random.bin
```

We have now filled a 1 MB file with random data from `/dev/urandom`. The data
inside should be uniformly distributed, but let's verify this:
```
tuxⒶlattice:[~] => ./entropy random.bin
7.9998097
tuxⒶlattice:[~] => ./entropy random.bin --metric-entropy
0.9999762
```

As you can see above, the Shannon entropy is very close to 8 bits per symbol, the
maximum possible for byte symbols (since `log_2(256) = 8`), so we need roughly eight
bits to encode each symbol in the file. The metric entropy, likewise close to 1,
indicates that the information in the `random.bin` file is uniformly distributed;
it's chock-full of information!

Now what happens if we do the same thing but from a file filled with all zeros? Let's find out:
```
tuxⒶlattice:[~] => cat /dev/zero | head -c 1000000 > zero.bin
tuxⒶlattice:[~] => ./entropy zero.bin
0
tuxⒶlattice:[~] => ./entropy zero.bin --metric-entropy
0
```

Both the Shannon and metric entropies are zero! Why? Because the file contains only
one distinct symbol. The probability of finding a zero in this file is exactly 1; it's
impossible to find a non-zero symbol in the file. Therefore, we don't need any
information at all to encode it in a binary sequence.
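
You can see this directly from the entropy equation: only one symbol (the zero byte) ever occurs, and its probability is exactly 1, so the sum collapses to a single term, `-(1 * log2(1)) = -(1 * 0) = 0` bits per symbol.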

[For more information, see the excellent Wikipedia entry on this topic](https://en.wikipedia.org/wiki/Entropy_(information_theory)).

If this repo helped you at all, please reach out and tell me how! I'd love to hear it!