https://github.com/simonschoelly/informationdistances.jl
A small Julia library for calculating the normalized compression distance.
- Host: GitHub
- URL: https://github.com/simonschoelly/informationdistances.jl
- Owner: simonschoelly
- License: mit
- Created: 2021-01-06T23:32:55.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2021-05-24T20:13:07.000Z (about 5 years ago)
- Last Synced: 2025-06-19T22:06:27.436Z (about 1 year ago)
- Topics: compression, hacktoberfest, information-distance, kolmogorov-complexity, normalized-compression-distance, string-distance
- Language: Julia
- Homepage:
- Size: 268 KB
- Stars: 5
- Watchers: 1
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
# InformationDistances
[Stable documentation](https://simonschoelly.github.io/InformationDistances.jl/stable) |
[Development documentation](https://simonschoelly.github.io/InformationDistances.jl/dev) |
[Build status](https://github.com/simonschoelly/InformationDistances.jl/actions) |
[Code coverage](https://codecov.io/gh/simonschoelly/InformationDistances.jl)
This package provides methods for calculating the [Normalized Compression Distance (NCD)](https://en.wikipedia.org/wiki/Normalized_compression_distance), a metric that measures how similar two strings are by using a real-world compression algorithm such as [bzip2](https://en.wikipedia.org/wiki/Bzip2).
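
For reference, the NCD is usually defined in the literature as shown below. Here $C(x)$ denotes the length of $x$ after compression with the chosen compressor and $xy$ is the concatenation of the two strings. This is the textbook definition, not a quotation from this package's source, so the implementation may differ in details.

```math
\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}
```

Values close to 0 indicate that the two strings share most of their structure; values close to 1 indicate that they share little.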
## Installation
InformationDistances.jl is registered in the [General registry](https://github.com/JuliaRegistries/General) and can be installed from the package mode of the Julia REPL (entered by pressing `]`) with
```julia
] add InformationDistances
```
## Quick example
```julia
julia> using InformationDistances
# Create three strings to compare; we expect s1 and s2 to be more similar to each other than either is to s3
julia> s1 = repeat("ab", 100)
"abababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababab"
julia> s2 = repeat("ba", 100)
"babababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababababa"
julia> s3 = String(rand(('a', 'b'), 200))
"aabaaabaaababaabababbaaaaabaaaaaabbabbaaabbbabbbbaaaaababaabbbbaababbbbaaaaaaaaabababaaabbbbbbbabbbaabbabababbaababbbbabbbababaaaababaaababbababaaaaababbabbbbaabbaabbbaabaababbbaaaaaababbbabbbabbabbaa"
# Create a normalized compression distance with the default parameters
julia> d = NormalizedCompressionDistance();
julia> d(s1, s2)
0.125
julia> d(s1, s3)
0.4482758620689655
julia> d(s2, s3)
0.4482758620689655
# Create another distance that uses Bzip2 for compression
julia> using CodecBzip2: Bzip2Compressor
julia> d_bzip2 = NormalizedCompressionDistance(CodecCompressor{Bzip2Compressor}(workfactor=250));
julia> d_bzip2(s1, s2)
0.1
julia> d_bzip2(s1, s3)
0.5903614457831325
julia> d_bzip2(s2, s3)
0.5783132530120482
```
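
To illustrate what such a distance computes, here is a minimal sketch that evaluates the standard NCD formula by hand with bzip2. It assumes that CodecBzip2.jl is installed. The helpers `compressed_length` and `ncd_sketch` are hypothetical names used only for this illustration and are not part of InformationDistances.jl. The values it prints may differ from those of `NormalizedCompressionDistance`, for example because of different compressor settings.

```julia
using CodecBzip2: Bzip2Compressor

# Compressed size in bytes of a string under bzip2 (default codec settings).
compressed_length(s::AbstractString) =
    length(transcode(Bzip2Compressor, Vector{UInt8}(String(s))))

# Hypothetical helper, not part of InformationDistances.jl:
# NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
function ncd_sketch(x::AbstractString, y::AbstractString)
    cx = compressed_length(x)
    cy = compressed_length(y)
    cxy = compressed_length(x * y)  # compress the concatenation of x and y
    return (cxy - min(cx, cy)) / max(cx, cy)
end

s1 = repeat("ab", 100)
s2 = repeat("ba", 100)
println(ncd_sketch(s1, s2))
```

Because real compressors add a fixed header to every output, the result for very short inputs is dominated by that overhead, so longer strings give more meaningful distances.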
## Example Notebooks
The examples folder contains an interactive notebook that can be run with [Pluto.jl](https://github.com/fonsp/Pluto.jl). A static, non-interactive version is also available for quick viewing online; in that version it is currently not possible to choose different options.
* [mitochondrial-genome-phylogency.jl](https://github.com/simonschoelly/InformationDistances.jl/blob/master/examples/mitochondrial-genome-phylogency.jl) ([non-interactive version](https://simonschoelly.github.io/InformationDistances.jl/examples/mitochondrial-genome-phylogency.jl.html))
## References
[Li, Ming, Xin Chen, Xin Li, Bin Ma, and Paul M. B. Vitányi. "The similarity metric." IEEE Transactions on Information Theory 50, no. 12 (2004): 3250-3264.](https://homepages.cwi.nl/~paulv/papers/similarity.pdf)