Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/rainlanguage/rain.lib.hash
Solidity library for hashing without abi.encode (avoiding allocation)
https://github.com/rainlanguage/rain.lib.hash
Last synced: 17 days ago
JSON representation
Solidity library for hashing without abi.encode (avoiding allocation)
- Host: GitHub
- URL: https://github.com/rainlanguage/rain.lib.hash
- Owner: rainlanguage
- License: other
- Created: 2023-04-03T09:47:20.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-12-07T11:24:31.000Z (about 2 months ago)
- Last Synced: 2024-12-07T12:20:28.672Z (about 2 months ago)
- Language: Solidity
- Size: 1.11 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# rain.lib.hash
Docs at https://rainprotocol.github.io/rain.lib.hash
## Problem
When producing hashes of just about anything that isn't already `bytes` the
common suggestions look something like `keccak256(abi.encode(...))` or
`keccak256(abi.encodePacked(...))`. This appears reasonable as Solidity itself
does not provide an "any type" `keccak256` function but does provide one (almost)
for abi encoding.When I say "common suggestion" I mean literally the compiler itself gives outputs
like this for any type other than `bytes`.```
β keccak256(address(0))
Compiler errors:
error[7556]: TypeError: Invalid type for argument in function call. Invalid implicit conversion from address to bytes memory requested. This function requires a single bytes argument. Use abi.encodePacked(...) to obtain the pre-0.5.0 behaviour or abi.encode(...) to use ABI encoding.
--> ReplContract.sol:14:19:
|
14 | keccak256(address(0));
```This approach raises two questions for me:
- Why are we relying on interface encodings to satisfy cryptographic properties?
- Encoding requires complex recursive/nested data processing, memory expansion,
and making more than a full copy of the data to include headers, is that
significant gas cost strictly necessary?### Non goals
For the purpose of this document we are NOT attempting any specific compatibility
with external systems or standards, etc.The basic use case is that we are writing contracts that need to convince
themselves that they should authorize some state change.Often we find ourselves with a lot of state that informs the authorization
verification logic. Too much to store, sign, etc. so first we want to
"hash the data" and just store, compare, sign the hash.It doesn't really matter in this case what the hashing algorithm is, as long as
it gives us the security guarantees that the contract needs. If the hash is
needed to be known offchain, e.g. so it can be passed back to a future call on
the contract, then the contract can emit the hash into the logs etc.We are even fine with changing the patterns described in this document over time.
There's no requirement that a hash produced by one contract is compatible with
the hash produced by another. Our goal is that contracts implementing these
patterns can securely accept arbitrary inputs, NOT that the basic approach
ossifies due to unrelated contracts doing different things to each other.**That is to say, there's no "upgradeable contract" support.**
Further, while we do want to be able to support "any" data type in our pattern,
we do NOT need to support "every" data type in our implementations. It may be
relatively onerous to implement and maintain the required assembly logic to
safely hash some structure, relative to just slapping an abi encoding on the
problem and walking away. The intention is that this work would only be needed
for maybe 1 or 2 structs within a codebase, because there wouldn't be a large
variety of security sensitive hashing to be done for some given contract/context.It's also assumed that, because these structs are used on the critical security
path, there would be good reasons to design them for stability and simplicity
already. In this case, the maintainability concerns that always arise when
handling structs in assembly (adding/removing/reordering fields!) are naturally
less of a concern due to their context/usage.### Encoding and cryptography
An earlier version of the EIP712 spec outlined the difficulty in relying on
encoding formats to provide cryptographic guarantees.> A good hashing algorithm should satisfy security properties such as
> determinism, second pre-image resistance and collision resistance. The
> keccak256 function satisfies the above criteria when applied to bytestrings. If
> we want to apply it to other sets we first need to map this set to bytestrings.
> It is critically important that this encoding function is deterministic and
> injective. If it is not deterministic then the hash might differ from the
> moment of signing to the moment of verifying, causing the signature to
> incorrectly be rejected. If it is not injective then there are two different
> elements in our input set that hash to the same value, causing a signature to
> be valid for a different unrelated message.
>
> An illustrative example of the above breakage can be found in Ethereum.
> Ethereum has two kinds of messages, transactions π and bytestrings πΉβΈβΏ. These
> are signed using eth_sendTransaction and eth_sign respectively. Originally the
> encoding function encode : π βͺ πΉβΈβΏ β πΉβΈβΏ was defined as follows:
>
> - encode(t : π) = RLP_encode(t)
> - encode(b : πΉβΈβΏ) = b
>
> encode(b : πΉβΈβΏ) = "\x19Ethereum Signed Message:\n" β len(b) β b where len(b) is
> the ascii-decimal encoding of the number of bytes in b.
>
> This solves the collision between the legs since RLP_encode(t : π) never starts
> with \x19. There is still the risk of the new encoding function not being
> deterministic or injective. It is instructive to consider those in detail.
>
> As is, the definition above is not deterministic. For a 4-byte string b both
> encodings with len(b) = "4" and len(b) = "004" are valid. This can be solved by
> further requiring that the decimal encoding of the length has no leading zeros
> and len("") = "0".
>
> The above definition is not obviously collision free. Does a bytestring
> starting with "\x19Ethereum Signed Message:\n42aβ¦" mean a 42-byte string
> starting with a or a 4-byte string starting with 2a?. This was pointed out in
> Geth issue #14794 and motivated Trezor to not implement the standard as-is.
> Fortunately this does not lead to actual collisions as the total length of the
> encoded bytestring provides sufficient information to disambiguate the cases.
>
> Both determinism and injectiveness would be trivially true if len(b) was left
> out entirely. The point is, it is difficult to map arbitrary sets to
> bytestrings without introducing security issues in the encoding function. Yet
> the current design of eth_sign still takes a bytestring as input and expects
> implementors to come up with an encoding.This is good context that has sadly been removed from current versions of the
EIP.The key takeaways are:
- We need something determinstic and injective, which can probably be summarised
in a single word as "unambiguous"
- Hashing bytes is secure by default and any encoding scheme's security can only
be less than or equal to the security of the hash of the raw data before it is
encoded
- It is difficult to assess the cryptographic qualities of an encoding scheme and
high profile mistakes can be found in the wild, including formal standards#### Collisions with ABI encoding
Perhaps unsurprisingly we can find one of these issues in `abi.encodePacked` as
this encoding scheme simply concatenates bytes together.This means that `"abc" + "def"` and `"ab" + "cdef"` will pack to the same final
bytestring, `"abcdef"`.We can suggest potential workarounds like "only use packed encoding for fixed
length input data", but it's clear that packed encoding is situationally useful
at best, and dangerous at worst.The suggested fix is usually to use `abi.encode`, which adds additional
information to the raw data as part of the encoding. Something like
`3"abc" + 3"def"` with length prefixes and `2"ab" + 4"cdef"`, and then additional
head/tail structures that encode the offsets of the dynamic length data in an
overall prefix to the encoded data.https://docs.soliditylang.org/en/develop/abi-spec.html#formal-specification-of-the-encoding
Importantly, in light of the discussion in EIP712, the lengths are fixed length
themselves, always represented as a `uint256`, so the full `abi.encode` encoding
of the underlying data is probably safe.So `abi.encode` doesn't have the problems of `abi.encodePacked` nor early geth
implementations, but is that a strong proof that it doesn't introduce new
problems?#### Gas cost of encoding
If `abi.encode` was an efficient function, we could probably be happy that it has
sufficient adoption and time without exploit to use it. Even if there was some
issue, "nobody ever got fired for using `abi.encode`", right?Doing the same thing as everyone else, and as the security researchers recommend
in audits, is usually a good idea.The issue here is that `abi.encode` is not particularly gas efficient. This is a
fundamental issue and not at all the "fault" of Solidity. To encode anything with
any algorithm and not cause the original data to be corrupted/unsafe to use, the
EVM must allocate a new region of memory to house the encoded data. If we allow
for dynamic length data types, the UNAVOIDABLE runtime overhead of ANY
schemaless/uncompressed encoding algorithm is:- Calculate the size of memory to allocate for the encoded output by recursively
traversing the input data
- Allocate the memory and pay nonlinear gas for expansion costs
- Make a complete copy of the input data
- Write additional data for the encoding itself, e.g. type/length prefixes,
headers, magic numbers, etc.The Solidity type system can definitely make a lot of this more efficient,
especially the traversal bit, by generating the traversal process at compile time
but it can't hand wave away the need for allocating and copying.**Typically, my experience has shown that if some algorithm `f(x)` is implemented
in a functionally equivalent way, where one implementation internally encodes `x`
and another avoids it, the no-encode solution often costs 40-80%+ less gas.**
This saving is of course most noticeable when the algorithm is relatively
efficient, or involves a tight internal loop over encoding, such that the
encoding then starts to dominate the profile. Even in cases where that is not
true, such as comparing the reference SSTORE2 implementation to
[LibDataContract](https://github.com/rainprotocol/sol.lib.datacontract/blob/main/src/LibDataContract.sol) we still can see 1k+ gas savings per-write for common usage patterns, with
identical outcomes.It really just seems to come down to the fact that memory expansion and bulk
copying nested/dynamic is not a cheap thing to do. It's typically not millions of
gas, but it can easily be 1-10k+ gas for what is often unneccessary work.Note however that `keccak256` itself is non destructive, it can happily produce
a hash on the stack without modifying or allocating any memory at all. Even in
the case that some data is NOT in memory yet and we want to hash it
(e.g. on the stack), there is a dedicated region of memory from `0-0x40` called
"scratch space for hashing methods". We can put any two words in the scratch
space and hash them together without interacting with the allocator at all.What perhaps is the "fault" of Solidity is that they don't implement `keecak256`
for any type other than `bytes` so we are forced to go all the way to Yul and
write assembly the moment we want to do anything other than `abi.encode`.## Solution
- Define a pattern to hash any Solidity data structure without allocation and
minimal memory reads/writes, and is generally efficient
- Convince ourselves the pattern is unambiguous/secure, being both deterministic
and injective
- Provide a reference implementation of the pattern that can be fuzzed against
to show inline implementations of the pattern provide valid outputs### The pattern
The memory layout of data in Solidity is very regular across all data types.
https://docs.soliditylang.org/en/v0.8.19/internals/layout_in_memory.html
It is optimised so that the allocator always has the free memory pointer at a
multiple of 32.Note that the memory layout is completely different to e.g. the storage layout.
Everything discussed here is specific to data in memory and does not generalise
at all.All non-struct types end up in one of 3 buckets:
- 1 or more 32 byte words, of `length` defined by the type
- A 32 byte `length` followed by `length` 32 byte words (most dynamic types)
- A 32 byte `length` followed by `length` bytes (`bytes` and `string` only)Note that in the last case, the allocator will still move the free memory pointer
to a multiple of 32 bytes even if that points past the end of the data structure.The variables and structs that reference these things are pointers to them, or
even nested pointers in the case of structs. The pointers can be either on the
stack or in memory, depending on context.Consider the struct
```solidity
struct Foo {
uint256 a;
address b;
uint256[] c;
bytes d;
}
```If we had some `foo_` such that `Foo memory foo_ = Foo(...);` then `foo_` will
be a pointer, either on the stack or in memory, depending on compiler
optimisations.The thing it points to falls into the first bucket, a 4-word region of memory
defined by its type. This may not be intuitive but all of `uint256`, `address`,
`uint256[]` and `bytes`, and all other types, are all a full singular word in
the struct.Any types that are smaller than 1 word are padded with 0's such that they retain
the same `uint256` equivalent value.Any types that are larger, or potentially larger than 1 word are pointers to that
data, from the perspective of the struct.This logic is applied recursively.
This means that structs are NOT dynamic length, regardless of how nested or how
many dynamic types appear in their definition, or the definitions of types within
their fields. For example, a `Foo` is ALWAYS 4 words, i.e. 0x80 bytes long.Given the above, we can
- Define a pattern for hashing each of the 3 possible memory layouts
- Explain how to handle pointers across non-contigous regions of memory
- Discuss the security of the composition
- Provide a guide for implementation, maintenance and quality assurance#### Hashing contigious words
In all cases where the size of the data is a known number of words at compile
time we are free to simply hash the known memory region.For example, we could hash a `foo_` as above like so
```solidity
assembly ("memory-safe") {
let hash_ := keccak256(foo_, add(foo_, 0x80))
}
```Ignore for now that `c` and `d` are pointers, as that will be discussed later in
this document.The basic point is that the code example shows that Yul handles what we need
for known memory regions very naturally.Other than implementation bugs, there's no potential for
- Collisions
- Including data what we did not intend to in the hash input
- Failing to include some part of the structBecause the size of the data never changes, we can just hardcode it per-type.
#### Hashing dynamic length list of words
Most dynamic length types in Solidity are a list of 32 byte words. This includes
lists of pointers like `Foo[]`, single byte values `bytes1[]`, etc.The ONLY exceptions to the rule are `bytes` and `string` types.
Again, ignoring pointers for now, we can hash any dynamic length word list as
```solidity
assembly ("memory-safe") {
// Assume bar_ is some dynamic length list of words
let hash_ := keccak256(
// Skip the length prefix
add(bar_, 0x20),
// Read the length prefix and multiply by 0x20 to know how many _words_
// to hash
mul(mload(bar_), 0x20)
)
}
```Note that here we DO NOT include the length prefix in the bytes that we hash.
This gives us the same behaviour as the case of hashing static length data, but
with lengths known only at runtime.When it comes to composition we do not want to rely on length prefixes for safety
guarantees, as they are not always available. As EIP712 explained, length
prefixes can introduce ambiguity as easily as they can resolve them as in the
case of packed encoding. Ideally we can show security without the need for any
additional metadata about our words.#### Hashing dynamic length byte strings
The two byte length types `bytes` and `string` are the only types in Solidity
that may not have whole-world lengths. Even though the allocator retains a
multiple of 0x20 on the free memory pointer, we MUST respect the true length of
`bytes` and `string` in bytes, otherwise we introduce ambiguity.If we did not respect the length then `hex"01"` and `hex"0100"` would hash to
the same value, as they both have 0x20 bytes _allocated_ to them in memory, even
though the first is 1 byte and the second is 2 bytes in _length_.The assembly for this is actually simpler than dealing with words as we do not
need to convert between length/bytes. It is the same for `string` and `bytes`.```solidity
assembly ("memory-safe") {
// Assume baz_ is some bytes/string
let hash_ := keccak256(
// Skip the length prefix
add(baz_, 0x20),
// Read the length prefix to know how many _bytes_ to hash
mload(baz_)
)
}
```Note that pointers never appear in `bytes` nor `string`, or if they do, they are
not going to be dereferenced by our hashing logic. This assembly above is all
that is needed to hash `bytes` and `string` types.#### Handling pointers
It would be pointless to hash pointers (no pun intended). A pointer is merely an
offset in memory, which has very little to do with the data on the other side of
it, and is not even deterministic.We find pointers in Solidity wherever something that is potentially larger than
1 word needs to fit in a single word slot. For example, any time a struct or
dynamic type is an item or field in another struct or dynamic type.Solidity does not allow mixed type lists so all pointers are at least found in
predictable positions. We always know at compile time whether something is a
pointer or not, either because it's a field at a known offset, or we are dealing
with an individual or list or pointers directly.To reliably handle pointers without allocations:
- Compute the hash of all data up to the pointer
- Compute the hash of all data referenced by the pointer
- Hash these two hashes togetherUsing our `Foo` struct from above as an example this would look like:
- Hash the first two words as a contigious memory region of known size as `A`
- Hash the dynamic word list `foo_.c` as `B`
- Write `A` and `B` to scratch space at `0` and `0x20` respectively
- Hash the scratch space to produce `C`
- Hash the bytes `foo_.d` as `D`
- Write `C` and `D` to scratch space as above
- Hash the scratch space to produce `E`, which is our final hash of `Foo`As assembly it would look like
```solidity
assembly ("memory-safe") {
// hash foo_.a and foo_.b together to produce hash A
// store A in scratch
mstore(0, keccak256(foo_, add(foo_, 0x40)))// Follow the pointer to hash foo_.c into B
let deref_ := mload(add(foo_, 0x60))
// Store B in scratch
mstore(0x20, keccak256(add(deref_, 0x20), mul(mload(deref_), 0x20)))// Hash A and B to produce C which can be stored direct in scratch
mstore(0, keccak256(0, 0x40))// Follow the pointer to hash foo_.d
deref_ := mload(add(foo_, 0x80))
// Store D in scratch
mstore(0x20, keccak256(add(deref_, 0x20), mload(deref_)))// Write C and D to scratch to produce the final hash E
let E := keccak256(0, 0x40)
}
```If we had a list of pointers, such as a `Foo[]` then this would be modelled as
a simple fold/reduce style accumulator where each item is hashed as above
individually then hashed into the accumulator. I.e. Hash `foos_[0]` to hash A,
then hash `foos_[1]` to hash B, then write both to scratch and hash to produce C,
then hash `foos_[2]` to hash D, and hash C and D to produce E, etc.This process of iterating and accumulating a hash incrementally seems to be about
40% cheaper in gas terms than ABI encoding then hashing, based on some simple
testing with Foundry.##### Nil hash prefix
If we are handling a pointer and have no hash, e.g. we're directly hashing an
array of pointers, we start with the hash of nil bytes, i.e. `keccak256(0, 0)`.If the array is 0 length then the hash will be the nil hash, regardless of the
type behind the pointers.#### Security of composition
Assume that we're comfortable with concepts like blockchains and merkle trees,
that rely on hashes of hashes to iteratively build a single hash out of a system
of hashes.We need to convince ourselves that the hashing process above is unambiguous for
all permutations.This relies on us accepting that `hash(hash(a) + hash(b))` will not collide with
`hash(hash(c) + hash(d))` for all values of `a`, `b`, `c`, `d` except the trivial
case where `a` = `c` and `b` = `d`.As long as `hash(x)` and `hash(x) + hash(y)` don't produce collisions then this
is true. This guarantee seems to come down to the strength of `keccak256` itself.As `keccak256` always produces hashes exactly 32 bytes long for all inputs, we avoid
all the issues seen with both including and excluding length prefixes, such as
those seen in `abi.encodePacked`.I'm not sure this constitutes a formal mathematical proof, but I'm not sure if
one exists for `abi.encode` either :)## Dev stuff
### Local environment & CI
Uses nixos.
Install `nix develop` - https://nixos.org/download.html.
Run `nix develop` in this repo to drop into the shell. Please ONLY use the nix
version of `foundry` for development, to ensure versions are all compatible.Read the `flake.nix` file to find some additional commands included for dev and
CI usage.## Legal stuff
Everything is under DecentraLicense 1.0 (DCL-1.0) which can be found in `LICENSES/`.
This is basically `CAL-1.0` which is an open source license
https://opensource.org/license/cal-1-0The non-legal summary of DCL-1.0 is that the source is open, as expected, but
also user data in the systems that this code runs on must also be made available
to those users as relevant, and that private keys remain private.Roughly it's "not your keys, not your coins" aware, as close as we could get in
legalese.This is the default situation on permissionless blockchains, so shouldn't require
any additional effort by dev-users to adhere to the license terms.This repo is REUSE 3.2 compliant https://reuse.software/spec-3.2/ and compatible
with `reuse` tooling (also available in the nix shell here).```
nix develop -c rainix-sol-legal
```## Contributions
Contributions are welcome **under the same license** as above.
Contributors agree and warrant that their contributions are compliant.