An open API service indexing awesome lists of open source software.

https://github.com/jxskiss/base62

Compact and high performace implementation of base62 algorithm for Golang.
https://github.com/jxskiss/base62

base62 go golang

Last synced: 5 months ago
JSON representation

Compact and high performace implementation of base62 algorithm for Golang.

Awesome Lists containing this project

README

          

# base62

[![GoDoc](https://img.shields.io/badge/api-Godoc-blue.svg)][godoc]
[![Go Report Card](https://goreportcard.com/badge/github.com/jxskiss/base62)][goreport]
[![Issues](https://img.shields.io/github/issues/jxskiss/base62.svg)][issues]
[![GitHub release](http://img.shields.io/github/release/jxskiss/base62.svg)][release]
[![MIT License](http://img.shields.io/badge/license-MIT-blue.svg)][license]

[godoc]: https://pkg.go.dev/github.com/jxskiss/base62
[goreport]: https://goreportcard.com/report/github.com/jxskiss/base62
[issues]: https://github.com/jxskiss/base62/issues
[release]: https://github.com/jxskiss/base62/releases
[license]: https://github.com/jxskiss/base62/blob/master/LICENSE

base62 is a correctly implemented, compact and fast implementation of `Base62` encoding/decoding algorithm.
It is inspired by the [java implementation by glowfall](https://github.com/glowfall/base62).

This `Base62` implementation can encode/decode both integers and bytes of arbitrary length,
the correctness is tested using a large number of randomly generated bytes.

It is much faster than `big.Int` based implementation and is not much slower than typical `Base64`
implementations. See the benchmark results below.

## Why Base62

In comparison with Base64/Base32, Base62 is more friendly to human:

- it contains only alpha-numerical symbols, no special characters
- can be validated by eyes and more simple regexp
- can be fully selected by mouse double-click in any text editors and browser address bar
- it's compact and generates shorter strings than Base32
- it's the most natural and unambiguous way to encode your data in human-readable form :)

Variations of Base62 algorithm cann be widely used to represent authentication data in printable and
easy-copyable form, for example to encode OAuth 2.0 `access_token` data.

## Usage

```go
// Basic usage.
Encode(src []byte) []byte
EncodeToString(src []byte) string
Decode(src []byte) ([]byte, error)
DecodeString(src string) ([]byte, error)
FormatInt(num int64) []byte
FormatUint(num uint64) []byte
ParseInt(src []byte) (int64, error)
ParseUint(src []byte) (uint64, error)

// Providing a dst buffer, you may reuse buffers to reduce memory allocation.
EncodeToBuf(dst []byte, src []byte) []byte
DecodeToBuf(dst []byte, src []byte) ([]byte, error)
AppendInt(dst []byte, num int64) []byte
AppendUint(dst []byte, num uint64) []byte

// Or you may use a custom encoding alphabet.
enc := NewEncoding("...my-62-byte-string-alphabet...")
enc.XXX()
```

## Benchmark

```text
Benchmark_Encode-12 11654754 97.41 ns/op
Benchmark_Decode-12 15481666 73.60 ns/op
Benchmark_EncodeToString-12 11950086 99.54 ns/op
Benchmark_DecodeString-12 16301325 74.36 ns/op

Benchmark_EncodeToBuf-12 13855840 84.68 ns/op
Benchmark_DecodeToBuf-12 97695962 12.21 ns/op

Benchmark_EncodeInteger-12 29119437 41.30 ns/op
Benchmark_DecodeInteger-12 120328183 9.917 ns/op

Benchmark_Encode_BigInt-12 1000000 1048 ns/op

Benchmark_Base64_EncodeToString-12 17803440 70.12 ns/op
Benchmark_Base64_DecodeString-12 19884616 55.09 ns/op

Benchmark_Base64_Encode-12 68163142 17.93 ns/op
Benchmark_Base64_Decode-12 41990004 28.25 ns/op
```

## How it works

Encoding Base64 and Base32 both are bit-aligned, 6 bits for Base64 (2^6 = 64)
and 5 bits for Base32 (2^5 = 32), but Base62 is not bit-aligned, it's inefficient
to do divmod operation for non-bit-aligned integers, typical Base62 implementations
are BigInt based, which encodes input data block by block to get better performance,
(e.g. https://github.com/keybase/saltpack/blob/master/encoding/basex).

64 characters can fully represent 6 bits, but 62 characters can not, if each character
represents 5 bits, that is the Base32 encoding.

A naive BigInt based algorithm gives the shortest result, which is roughly like this
(e.g. https://github.com/eknkc/basex/blob/6baac8ea8b19cc66d125286d213770fec0691867/basex.go#L46):

```go
digits := []int{0}
for i := 0; i < len(src); i++ {
carry := int(src[i])

for j := 0; j < len(digits); j++ {
carry += digits[j] << 8
digits[j] = carry % e.base
carry = carry / e.base
}

for carry > 0 {
digits = append(digits, carry%e.base)
carry = carry / e.base
}
}
// map to encoding alphabet
_ = digits
```

This works but the time complexity is O(n^2) where n is the length of src. If the
input can be very large, the cost is unacceptable.

Improved algorithm splits the input into blocks, which reduce the time complexity to
O(kn) where k is the block size, and n is the length of src, it generates slightly
longer result than naive BigInt algorithm, but it is worth for the reduced time
complexity. When k is n, it degrades to the naive algorithm, when k is 1, the output
length is twice of the input, we don't want that. 32 is chosen as the block size
in library [saltpack/encoding/basex](https://github.com/keybase/saltpack/blob/master/encoding/basex).

Inspired by the [java implementation by glowfall](https://github.com/glowfall/base62),
this library is not BigInt based, it encodes and decodes any arbitrary bytes in O(n)
time complexity. Here is a brief description of the algorithm.

This library uses a variadic length encoding, for each 6 bits, if the value is in range
[0, 62), it can be directly map to the 62 characters, if the value is 62 or 63 which
exceeds 61, we turn to encoding just the lower 5 bits. If we find a way to recognize
the 5 bits pattern, then we can correctly decode it back to the source data.

The binary representation of 62 and 63 is:

62 - 0b_0011_1110
63 - 0b_0011_1111

They have a common mask `0b_0011_1110`, in range [0, 62), there are another two integers
have a similar mask `0b_0001_1110`, 30 and 31, which is:

30 - 0b_0001_1110
31 - 0b_0001_1111

The four integers 30, 31, 62, 63 share a common mask `0b_0001_1110`, while all other
integers in range [0, 64) don't share the mask, i.e. for all other integers, the
expression `value & 0b_0001_1110 == 0b_0001_1110` evaluates to false.

**This is the key point!**

We define a `compactMask` as `0b_0001_1110`.

When encoding, for each 6 bits integer x, if `x & compactMask == compactMask` is true,
it must be one of 30, 31, 62 or 63, we just encode the lower 5 bits, which are
`0b_0001_1110` (30) and `0b_0001_1111` (31), we leave the 6-th bit to next byte.

When decoding, for each encoded byte x, we check `x & compactMask == compactMask`, when
it is false, we know that the byte represents 6 bits in the raw data, it is in range
[0, 30) or [32, 62), else it represents 5 bits in the raw data, it is 30 or 31.

That is it, by using variadic length encoding, we successfully limit the value range
to [0, 62), we get very compact result and a simple O(n) time complexity for encoding
and decoding data of arbitrary length.

## Compatibility

This library guarantees that it can correctly decode data encoded by itself.

The encoded result is not compatible with BigInt based algorithm,
(e.g. [saltpack/encoding/basex, GMP and GnuPG](https://github.com/jxskiss/base62/issues/2)).

The algorithm may be ported to other languages in future (if someone does a porting,
I'm glad to link it here :-) ).

## Changelog

### v1.1.0 - 2022/1/3

1. Refactor the encoding code to be simpler and cleaner, as a bonus, it gives better
performance which is 1.5X~ faster than the old.
2. Add a brief description of the algorithm and state the compatibility with other
Base62 implementations.

### v1.0.0 - 2021/10/23

First stable release, the package has been used in several small and medium projects.

This release adds new APIs which help to reuse buffers to reduce memory allocation.