https://github.com/jxskiss/base62
Compact and high performace implementation of base62 algorithm for Golang.
https://github.com/jxskiss/base62
base62 go golang
Last synced: 5 months ago
JSON representation
Compact and high performace implementation of base62 algorithm for Golang.
- Host: GitHub
- URL: https://github.com/jxskiss/base62
- Owner: jxskiss
- License: mit
- Created: 2019-08-29T12:05:37.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2022-08-24T00:50:00.000Z (almost 4 years ago)
- Last Synced: 2024-06-18T16:54:11.745Z (about 2 years ago)
- Topics: base62, go, golang
- Language: Go
- Homepage:
- Size: 18.6 KB
- Stars: 82
- Watchers: 3
- Forks: 5
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# base62
[][godoc]
[][goreport]
[][issues]
[][release]
[][license]
[godoc]: https://pkg.go.dev/github.com/jxskiss/base62
[goreport]: https://goreportcard.com/report/github.com/jxskiss/base62
[issues]: https://github.com/jxskiss/base62/issues
[release]: https://github.com/jxskiss/base62/releases
[license]: https://github.com/jxskiss/base62/blob/master/LICENSE
base62 is a correctly implemented, compact and fast implementation of `Base62` encoding/decoding algorithm.
It is inspired by the [java implementation by glowfall](https://github.com/glowfall/base62).
This `Base62` implementation can encode/decode both integers and bytes of arbitrary length,
the correctness is tested using a large number of randomly generated bytes.
It is much faster than `big.Int` based implementation and is not much slower than typical `Base64`
implementations. See the benchmark results below.
## Why Base62
In comparison with Base64/Base32, Base62 is more friendly to human:
- it contains only alpha-numerical symbols, no special characters
- can be validated by eyes and more simple regexp
- can be fully selected by mouse double-click in any text editors and browser address bar
- it's compact and generates shorter strings than Base32
- it's the most natural and unambiguous way to encode your data in human-readable form :)
Variations of Base62 algorithm cann be widely used to represent authentication data in printable and
easy-copyable form, for example to encode OAuth 2.0 `access_token` data.
## Usage
```go
// Basic usage.
Encode(src []byte) []byte
EncodeToString(src []byte) string
Decode(src []byte) ([]byte, error)
DecodeString(src string) ([]byte, error)
FormatInt(num int64) []byte
FormatUint(num uint64) []byte
ParseInt(src []byte) (int64, error)
ParseUint(src []byte) (uint64, error)
// Providing a dst buffer, you may reuse buffers to reduce memory allocation.
EncodeToBuf(dst []byte, src []byte) []byte
DecodeToBuf(dst []byte, src []byte) ([]byte, error)
AppendInt(dst []byte, num int64) []byte
AppendUint(dst []byte, num uint64) []byte
// Or you may use a custom encoding alphabet.
enc := NewEncoding("...my-62-byte-string-alphabet...")
enc.XXX()
```
## Benchmark
```text
Benchmark_Encode-12 11654754 97.41 ns/op
Benchmark_Decode-12 15481666 73.60 ns/op
Benchmark_EncodeToString-12 11950086 99.54 ns/op
Benchmark_DecodeString-12 16301325 74.36 ns/op
Benchmark_EncodeToBuf-12 13855840 84.68 ns/op
Benchmark_DecodeToBuf-12 97695962 12.21 ns/op
Benchmark_EncodeInteger-12 29119437 41.30 ns/op
Benchmark_DecodeInteger-12 120328183 9.917 ns/op
Benchmark_Encode_BigInt-12 1000000 1048 ns/op
Benchmark_Base64_EncodeToString-12 17803440 70.12 ns/op
Benchmark_Base64_DecodeString-12 19884616 55.09 ns/op
Benchmark_Base64_Encode-12 68163142 17.93 ns/op
Benchmark_Base64_Decode-12 41990004 28.25 ns/op
```
## How it works
Encoding Base64 and Base32 both are bit-aligned, 6 bits for Base64 (2^6 = 64)
and 5 bits for Base32 (2^5 = 32), but Base62 is not bit-aligned, it's inefficient
to do divmod operation for non-bit-aligned integers, typical Base62 implementations
are BigInt based, which encodes input data block by block to get better performance,
(e.g. https://github.com/keybase/saltpack/blob/master/encoding/basex).
64 characters can fully represent 6 bits, but 62 characters can not, if each character
represents 5 bits, that is the Base32 encoding.
A naive BigInt based algorithm gives the shortest result, which is roughly like this
(e.g. https://github.com/eknkc/basex/blob/6baac8ea8b19cc66d125286d213770fec0691867/basex.go#L46):
```go
digits := []int{0}
for i := 0; i < len(src); i++ {
carry := int(src[i])
for j := 0; j < len(digits); j++ {
carry += digits[j] << 8
digits[j] = carry % e.base
carry = carry / e.base
}
for carry > 0 {
digits = append(digits, carry%e.base)
carry = carry / e.base
}
}
// map to encoding alphabet
_ = digits
```
This works but the time complexity is O(n^2) where n is the length of src. If the
input can be very large, the cost is unacceptable.
Improved algorithm splits the input into blocks, which reduce the time complexity to
O(kn) where k is the block size, and n is the length of src, it generates slightly
longer result than naive BigInt algorithm, but it is worth for the reduced time
complexity. When k is n, it degrades to the naive algorithm, when k is 1, the output
length is twice of the input, we don't want that. 32 is chosen as the block size
in library [saltpack/encoding/basex](https://github.com/keybase/saltpack/blob/master/encoding/basex).
Inspired by the [java implementation by glowfall](https://github.com/glowfall/base62),
this library is not BigInt based, it encodes and decodes any arbitrary bytes in O(n)
time complexity. Here is a brief description of the algorithm.
This library uses a variadic length encoding, for each 6 bits, if the value is in range
[0, 62), it can be directly map to the 62 characters, if the value is 62 or 63 which
exceeds 61, we turn to encoding just the lower 5 bits. If we find a way to recognize
the 5 bits pattern, then we can correctly decode it back to the source data.
The binary representation of 62 and 63 is:
62 - 0b_0011_1110
63 - 0b_0011_1111
They have a common mask `0b_0011_1110`, in range [0, 62), there are another two integers
have a similar mask `0b_0001_1110`, 30 and 31, which is:
30 - 0b_0001_1110
31 - 0b_0001_1111
The four integers 30, 31, 62, 63 share a common mask `0b_0001_1110`, while all other
integers in range [0, 64) don't share the mask, i.e. for all other integers, the
expression `value & 0b_0001_1110 == 0b_0001_1110` evaluates to false.
**This is the key point!**
We define a `compactMask` as `0b_0001_1110`.
When encoding, for each 6 bits integer x, if `x & compactMask == compactMask` is true,
it must be one of 30, 31, 62 or 63, we just encode the lower 5 bits, which are
`0b_0001_1110` (30) and `0b_0001_1111` (31), we leave the 6-th bit to next byte.
When decoding, for each encoded byte x, we check `x & compactMask == compactMask`, when
it is false, we know that the byte represents 6 bits in the raw data, it is in range
[0, 30) or [32, 62), else it represents 5 bits in the raw data, it is 30 or 31.
That is it, by using variadic length encoding, we successfully limit the value range
to [0, 62), we get very compact result and a simple O(n) time complexity for encoding
and decoding data of arbitrary length.
## Compatibility
This library guarantees that it can correctly decode data encoded by itself.
The encoded result is not compatible with BigInt based algorithm,
(e.g. [saltpack/encoding/basex, GMP and GnuPG](https://github.com/jxskiss/base62/issues/2)).
The algorithm may be ported to other languages in future (if someone does a porting,
I'm glad to link it here :-) ).
## Changelog
### v1.1.0 - 2022/1/3
1. Refactor the encoding code to be simpler and cleaner, as a bonus, it gives better
performance which is 1.5X~ faster than the old.
2. Add a brief description of the algorithm and state the compatibility with other
Base62 implementations.
### v1.0.0 - 2021/10/23
First stable release, the package has been used in several small and medium projects.
This release adds new APIs which help to reuse buffers to reduce memory allocation.