# berty

![Erlang Punch Berty License](https://img.shields.io/github/license/erlang-punch/berty)
![Erlang Punch Berty Top Language](https://img.shields.io/github/languages/top/erlang-punch/berty)
![Erlang Punch Berty Workflow Status (main branch)](https://img.shields.io/github/actions/workflow/status/erlang-punch/berty/test.yaml?branch=main)
![Erlang Punch Berty Last Commit](https://img.shields.io/github/last-commit/erlang-punch/berty)
![Erlang Punch Berty Code Size (bytes)](https://img.shields.io/github/languages/code-size/erlang-punch/berty)
![Erlang Punch Berty Repository File Count](https://img.shields.io/github/directory-file-count/erlang-punch/berty)
![Erlang Punch Berty Repository Size](https://img.shields.io/github/repo-size/erlang-punch/berty)

A clean, safe and flexible implementation of BERT, a data-structure
format inspired by Erlang ETF.

This project is in active development, and should not be used in
production yet.

## Features

Primary features:

- [x] High level implementation of ETF in pure Erlang
- [x] Atom protection and limitation
- [ ] Fine grained filtering based on type
- [ ] Callback function or MFA
- [ ] Fallback to `binary_to_term` function on demand
- [ ] Drop terms on demand
- [ ] Term size limitation
- [ ] Custom options for term
- [ ] Property based testing
- [ ] BERT parser subset
- [ ] Depth type protection
- [ ] Fully documented
- [ ] +90% coverage
- [ ] 100% compatible with standard ETF
- [ ] 100% compatible with BERT

Secondary features:

- [ ] Global or fine grained statistics
- [ ] Profiling and benchmarking facilities
- [ ] Logging facilities
- [ ] Tracing facilities
- [ ] ETF path
- [ ] ETF schema
- [ ] Custom parser subset based on behaviors
- [x] ETF as stream of data
- [ ] Usage example with ETF, BERT and/or custom parser
- [ ] Low level optimization (optimized module with merl)

## Usage

Berty was created to easily replace the `binary_to_term/1` and
`binary_to_term/2` built-in functions; in many cases the replacement is
transparent. The big idea is to protect your system from the outside
world, in particular from atom and memory exhaustion.

```erlang
% serialize an atom
Atom = term_to_binary(test).

% by default, an atom is decoded as a binary
{ok, <<"test">>} = berty:decode(Atom).

% different methods can be used to deal with atoms
{ok, test} = berty:decode(Atom, #{ atoms => {create, 0.2, warning} }).

% other terms are supported
Terms = term_to_binary([{ok,1.0,"test",<<>>}]),
{ok, [{ok,1.0,"test",<<>>}]} = berty:decode(Terms).
```

More features are available, for example dropping terms or defining
custom callbacks.

```erlang
Lists = term_to_binary([1024,<<>>,"test"]).

% let's drop all integers
{ok, [<<>>, "test"]}
  = berty:decode(Lists, #{ integer_ext => drop
                         , small_integer_ext => drop
                         }).

% let's create a custom callback
Callback = fun
  (_Term, Rest) ->
    {ok, doh, Rest}
end.
{ok, [doh, <<>>, "test"]}
  = berty:decode(Lists, #{ integer_ext => {callback, Callback}
                         , small_integer_ext => {callback, Callback}
                         }).

% let's create another one
Callback2 = fun
  (Term, Rest) when 1024 =:= Term ->
    logger:warning("catch term ~p", [1024]),
    {ok, Term, Rest};
  (Term, Rest) -> {ok, Term, Rest}
end.

{ok, [1024, <<>>, "test"]}
  = berty:decode(Lists, #{ integer_ext => {callback, Callback2}
                         , small_integer_ext => {callback, Callback2}
                         }).
```

These are simple examples; more features are available and will be
added. Here are the most important functions (a short round-trip sketch
follows the list):

- `berty:decode/1`: standard BERT decoder with default options
- `berty:decode/2`: standard BERT decoder with custom options
- `berty:decode/3`: custom decoder with custom options
- `berty:encode/1`: standard BERT encoder with default options
- `berty:encode/2`: standard BERT encoder with custom options
- `berty:encode/3`: custom encoder with custom options
- `berty:binary_to_term/1`: wrapper around `binary_to_term/1`
- `berty:term_to_binary/1`: wrapper around `term_to_binary/1`
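
For completeness, here is a hedged round-trip sketch using the functions
listed above. Whether `encode/1` wraps its result in an `{ok, _}` tuple
like `decode/1` does is an assumption, not something taken from the
current implementation.

```erlang
% Hypothetical round trip; the {ok, _} return shape of encode/1 is an
% assumption.
Term = #{ name => <<"berty">>, version => 1 },
{ok, Encoded} = berty:encode(Term),
{ok, Term} = berty:decode(Encoded).
```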

## Build

```sh
rebar3 compile
rebar3 shell
```

## Test

```sh
rebar3 as test eunit
rebar3 as test shell
```

# FAQ

## Why create another BERT implementation?

Mainly because of atom management. In fact, `binary_to_term/1` and
`term_to_binary/1` are not safe: if unknown data comes from an
untrusted source, it's quite easy to kill the node by overflowing the
atom table of the node itself, and probably a full cluster as well if
this data is shared.

```erlang
% first erlang shell
file:write_file("atom1", term_to_binary([ list_to_atom("$test-" ++ integer_to_list(X)) || X <- lists:seq(1,1_000_000) ])).
% second erlang shell
file:write_file("atom2", term_to_binary([ list_to_atom("$test-" ++ integer_to_list(X)) || X <- lists:seq(1_000_000,2_000_000) ])).
```

Now restore those two files on another node.

```erlang
% third erlang shell
f(D), {ok, D} = file:read_file("atom1"), binary_to_term(D).
f(D), {ok, D} = file:read_file("atom2"), binary_to_term(D).
no more index entries in atom_tab (max=1048576)

Crash dump is being written to: erl_crash.dump...done
```

Doh. The Erlang VM crashed. We can fix that in many different ways;
here are a few examples:

- avoid the `binary_to_term/1` and `term_to_binary/1` functions and
  instead create our own parser based on the ETF specification. When
  terms are deserialized, atoms can be (1) converted to existing atoms,
  (2) converted to binaries or lists, or (3) simply dropped or replaced
  with something that alerts the VM that this part of the data is
  dangerous.

- keep our own local atom table containing all deserialized atoms, with
  a configurable soft/hard limit (see the sketch after this list).
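
To illustrate the second idea, here is a minimal sketch of a local atom
table with a soft and a hard limit. This is not berty's implementation;
the module name, API and limit handling are assumptions.

```erlang
-module(atom_registry).
-export([new/2, to_atom/2]).

% Create a registry with a soft and a hard limit (hypothetical API).
new(Soft, Hard) ->
    #{table => ets:new(?MODULE, [set, public]), soft => Soft, hard => Hard}.

% Convert a binary to an atom, reusing existing atoms when possible and
% refusing to create new ones past the hard limit.
to_atom(Bin, #{table := Table, soft := Soft, hard := Hard}) ->
    try
        {ok, binary_to_existing_atom(Bin, utf8)}
    catch
        error:badarg ->
            case ets:info(Table, size) of
                Size when Size >= Hard ->
                    {error, atom_limit_reached};
                Size ->
                    Size >= Soft andalso
                        logger:warning("soft atom limit reached (~p atoms)", [Size]),
                    Atom = binary_to_atom(Bin, utf8),
                    ets:insert(Table, {Atom}),
                    {ok, Atom}
            end
    end.
```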

## Oh? Really? Is it serious?

In fact, a simple solution already exists: the `safe` (and `used`)
options of
[`binary_to_term/2`](https://www.erlang.org/doc/man/erlang.html#binary_to_term-2).
The `safe` option will protect you from creating non-existing atoms,
but how many projects are using it?

- [`mojombo/bert.erl`](https://github.com/mojombo/bert.erl):
https://github.com/mojombo/bert.erl/blob/master/src/bert.erl#L25

```erlang
-spec decode(binary()) -> term().

decode(Bin) ->
    decode_term(binary_to_term(Bin)).

```

- [`mojombo/ernie`](https://github.com/mojombo/ernie):
https://github.com/mojombo/ernie/blob/master/elib/ernie_server.erl#L178

```erlang
receive_term(Request, State) ->
    Sock = Request#request.sock,
    case gen_tcp:recv(Sock, 0) of
        {ok, BinaryTerm} ->
            logger:debug("Got binary term: ~p~n", [BinaryTerm]),
            Term = binary_to_term(BinaryTerm),

- [`sync/n2o`](https://github.com/synrc/n2o):
https://github.com/synrc/n2o/blob/master/src/services/n2o_bert.erl#L8

```erlang
encode(#ftp{}=FTP) -> term_to_binary(setelement(1,FTP,ftpack));
encode(Term) -> term_to_binary(Term).
decode(Bin) -> binary_to_term(Bin).
```

- [`ferd/bertconf`](https://github.com/ferd/bertconf):
https://github.com/ferd/bertconf/blob/master/src/bertconf_lib.erl#L10

```erlang
decode(Bin) ->
    try validate(binary_to_term(Bin)) of
        Terms -> {ok, Terms}
    catch
        throw:Reason -> {error, Reason}
    end.
```

- [`a13x/aberth`](https://github.com/a13x/aberth):
https://github.com/a13x/aberth/blob/master/src/bert.erl#L25

```erlang
-spec decode(binary()) -> term().

decode(Bin) ->
    decode_term(binary_to_term(Bin)).
```

- [`yuce/bert.erl`](https://github.com/yuce/bert.erl):
https://github.com/yuce/bert.erl/blob/master/src/bert.erl#L24

```erlang
-spec decode(binary()) -> term().
decode(Bin) ->
    decode_term(binary_to_term(Bin)).
```

- And probably many more, as this search on
  [`searchcode.com`](https://searchcode.com/?lan=25&q=binary_to_term)
  or
  [`github.com`](https://github.com/search?q=binary_to_term+language%3AErlang&type=code&l=Erlang)
  suggests.

It's highly probable that many of those functions are hard to reach
with untrusted input, but it could be the case. In situations where
unknown data is coming in, `erlang:binary_to_term/1` and even
`erlang:binary_to_term/2` should be avoided or used carefully.
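
For reference, this is how the `safe` option mentioned earlier behaves;
the wrapper fun is only an illustrative sketch.

```erlang
% Decoding trusted data still works with the `safe` option.
ok = binary_to_term(term_to_binary(ok), [safe]).

% Unknown atoms make binary_to_term/2 raise badarg instead of filling
% the atom table; a small wrapper turns that into an error tuple.
SafeDecode = fun(Bin) ->
    try {ok, binary_to_term(Bin, [safe])}
    catch error:badarg -> {error, unsafe_payload}
    end
end.
```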

## Why am I not aware of that?

A few articles[^erlef-atom-exhaustion][^paraxial-atom-dos] have been
written in the past to explain these problems. On my side, if I were in
charge of fixing this issue, I would probably do it in two steps.

As a first step, I would probably create a workaround around the atom
creation functions, with a soft/hard limit. When the soft limit is
reached, warnings are emitted saying we reached it, but new atoms can
still be created. When the hard limit is reached, atoms can't be
created anymore, and exceptions are raised instead of crashing the
host.

As a second step, I would probably create a flexible interface to deal
with atoms and split the problem in two:

1. create a fixed atom store containing only atoms from the source code
(Erlang release and project); this one can't grow.

2. create a second atom store containing atoms created dynamically at
runtime; this one can grow.

What worries me is Mnesia. What could happen if someone creates more
than 2M unwanted atoms that end up stored in Mnesia or DETS? How would
the cluster behave? And how could that be fixed if it's critical?

Unfortunately, I think it would hurt atom performance badly, but it
could be an interesting project to learn how the Erlang BEAM works
under the hood.

[^erlef-atom-exhaustion]: https://erlef.github.io/security-wg/secure_coding_and_deployment_hardening/atom_exhaustion.html
[^paraxial-atom-dos]: https://paraxial.io/blog/atom-dos

## Are atoms the only issue there?

Well, it depends. If you are receiving a (very) long string or a list
containing many terms, it will have a direct impact on memory, and it
will eventually lead to memory exhaustion:

```erlang
% size of the list should be checked
% if not, memory exhaustion can happen
[ $1 || _ <- lists:seq(0,160_000_000) ].
% eheap_alloc: Cannot allocate 3936326656 bytes of memory (of type "heap").
% Crash dump is being written to: erl_crash.dump...
```

The same behavior can be triggered with binaries:

```erlang
% big binaries can crash the BEAM
binary_to_term(<<131, 111, 4294967294:32/unsigned-integer, 0:8/integer, 255:8, 0:4294967280/unsigned-integer>>).
% binary_alloc: Cannot allocate 4294967293 bytes of memory (of type "binary").
% Crash dump is being written to: erl_crash.dump...
```

Generating ETF payloads with very long big integers can also have an
impact on CPU usage; the following code can be used for a DoS,
especially if many processes decode such payloads at once.

```erlang
% big payload, high cpu usage, no crash.
% size of the big integer must be checked
% size: 2**18-1, binary byte size: 262_150 (~262kB)
_ = binary_to_term(<<131, 111, 262_143:32/unsigned-integer, 0:8/integer, 255:2_097_144/unsigned-integer>>).

% size: 2**19-1, binary byte size: 524_294 (~524kB)
_ = binary_to_term(<<131, 111, 524_287:32/unsigned-integer, 0:8/integer, 255:4_194_296/unsigned-integer>>).

% size: 2**20-1, binary byte size: 1_048_582 (~1MB)
_ = binary_to_term(<<131, 111, 1_048_575:32/unsigned-integer, 0:8/integer, 255:8_388_600/unsigned-integer>>).
```
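
Outside of a dedicated parser, a crude mitigation is to bound the
payload size before decoding at all. A minimal sketch, where the 1 MB
limit is an arbitrary example value:

```erlang
% Refuse to decode payloads larger than an arbitrary limit (1 MB here).
MaxSize = 1_048_576.
BoundedDecode = fun
    (Bin) when is_binary(Bin), byte_size(Bin) =< MaxSize ->
        {ok, binary_to_term(Bin, [safe])};
    (_) ->
        {error, payload_too_large}
end.
```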

Passing a long node name can crash the VM during startup, because the
name of the node is encoded as an `atom_ext` term and atoms are limited
to 255 characters. If the node name exceeds that limit, the VM crashes.

```sh
erl -sname $(pwgen -A0 252 1)
# Crash dump is being written to: erl_crash.dump...done

erl -name $(pwgen -A0 246 1)@localhost
# Crash dump is being written to: erl_crash.dump...done
```

It's highly probable that other terms can have a deadly impact on a
node or a cluster.

## How to fix the root cause?

The root of the problem is atoms; at least one
paper[^atom-garbage-collection] has discussed it. Fixing the garbage
collection issue could help a lot, but if that is not possible for any
number of reasons, using a high-level implementation of ETF with some
way to control what kind of data is coming in might be an "okayish"
solution.

The "let it crash" philosophy is quite nice when developing high-level
applications interacting in a safe place, but it can't be applied where
uncontrolled data is coming in. Some functions, like
`binary_to_term/1`, must be avoided at all costs.

[^atom-garbage-collection]: Atom garbage collection by Thomas Lindgren, https://dl.acm.org/doi/10.1145/1088361.1088369

## What about ETF schema?

This answer is a draft, a sandbox to design an Erlang ETF Schema
feature.

It might be great to have a syntax to create ETF schemas, a bit like
protobuf[^protobuf], JSON Schema[^json-schema], XML[^xml] (with
XSLT[^xlst]) or ASN.1[^asn.1]. In fact, when I started looking for
something around this feature, I also found the UBF[^ubf] project from
Joe Armstrong.

```erlang
schema1() ->
    integer().

schema2() ->
    tuple([[atom(ok), integer()]
          ,[atom(error), string(1024)]]).

% fun ({ok, X}) when is_integer(X) -> true;
%     ({error, X}) when is_list(X) andalso length(X) =< 1024 -> is_string(X);
%     (_) -> false.

schema3() ->
    tuple(
```

Here is the final representation:

```erlang
[{tuple, [{atom, [ok]}, {integer, []}]}
,{tuple, [{atom, [error]}, {string, [1024]}]}
]
% or
[[tuple, [2]]
,[atom, [ok,error]]
,[integer, []]
,[string, [1024]]
].
```
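
To make the draft slightly more concrete, here is a hypothetical
validator over the first representation above; the function names and
supported types are assumptions for illustration only.

```erlang
% Check a term against a list of schemas (hypothetical helper).
validate(Schemas, Term) ->
    lists:any(fun(Schema) -> match(Schema, Term) end, Schemas).

match({tuple, Elements}, Term)
  when is_tuple(Term), tuple_size(Term) =:= length(Elements) ->
    Pairs = lists:zip(Elements, tuple_to_list(Term)),
    lists:all(fun({Schema, Value}) -> match(Schema, Value) end, Pairs);
match({atom, Allowed}, Term) when is_atom(Term) ->
    lists:member(Term, Allowed);
match({integer, []}, Term) ->
    is_integer(Term);
match({string, [MaxLength]}, Term) when is_list(Term) ->
    length(Term) =< MaxLength andalso lists:all(fun is_integer/1, Term);
match(_, _) ->
    false.
```

With the example schemas above, `validate(Schemas, {ok, 42})` returns
`true` while `validate(Schemas, {error, 42})` returns `false`.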

[^protobuf]: https://protobuf.dev/overview/
[^json-schema]: https://json-schema.org/
[^xml]: https://en.wikipedia.org/wiki/XML
[^xlst]: https://en.wikipedia.org/wiki/XSLT
[^asn.1]: https://en.wikipedia.org/wiki/ASN.1
[^ubf]: https://ubf.github.io/ubf/ubf-user-guide.en.html

## What about an ETF path feature?

Another feature like XPath or JSONPath is also needed; an easy and
comprehensible syntax has to be created. I would like to include:

1. pattern matching

```erlang
% how to create an etf path?
% first example
% ETF = #{ key => #{ key2 => { ok, "test"} } }.
"test" = path(ETF, "#key#key2{ok,@}")

% second example
% ETF = [{ok, "test"}, {error, badarg}, {ok, "data"}].
[{ok, "test"},{ok, "data"}] = path(ETF, "[{ok,_}]")
% or
[]{ok,_}

% third example
% ETF = {ok, #{ <<"data">> => [<<"test">>] }}.
[<<"test">>] = path(ETF, "{ok,@}#!data").
```
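
The string syntax is still to be designed; as a stopgap, the same idea
can be expressed with a list of steps. This is a hypothetical helper,
not part of berty, and the step names are made up for illustration.

```erlang
% Hypothetical path helper taking steps as a list instead of the string
% syntax sketched above (string parsing left out).
path(Term, []) -> Term;
path(Map, [{key, Key} | Rest]) when is_map(Map) ->
    path(maps:get(Key, Map), Rest);
path(Tuple, [{element, N} | Rest]) when is_tuple(Tuple) ->
    path(element(N, Tuple), Rest);
path(List, [{filter, Fun} | Rest]) when is_list(List) ->
    path([Item || Item <- List, Fun(Item)], Rest).

% First example from above, expressed with steps:
% ETF = #{ key => #{ key2 => {ok, "test"} } },
% "test" = path(ETF, [{key, key}, {key, key2}, {element, 2}]).
```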

## Nothing to add?

When I wrote [Serialization series — Do you speak Erlang ETF or BERT?
(part
1)](https://medium.com/@niamtokik/serialization-series-do-you-speak-erlang-etf-or-bert-part-1-ff70096b50c0)
in 2017, someone told me to check another project called
[`jem.js`](https://github.com/inaka/jem.js) and read [Replacing JSON
when talking to Erlang](http://inaka.net/blog/2016/08/17/why-json/)
([archive](https://web.archive.org/web/20180301221900/http://inaka.net/blog/2016/08/17/why-json/))
blog post. What's funny here is that:

```erlang
handle_post(Req, State) ->
{ok, Body, Req1} = cowboy_req:body(Req),
Decoded = erlang:binary_to_term(Body),
Reply = do_whatever(Decoded),
{erlang:term_to_binary(Reply), Req1, State}.
```

Yes, "Faster and more efficient", but can destroy your whole platform
in few second. Don't do that. Please. Unfortunately,
[inaka.net](inaka.net) seems to be down, it would have been funny to
play with that.

## Is there a "risk analysis" for each terms somewhere?

Probably, but I did not find a lot on that. Here a short summary of
each terms is it safe or not and with the risk(s).

| Terms | Code | Safe? | Risks
|:----------------------|-----:|---------:|--------------------------|
| `ATOM_CACHE_REF` | 82 | no | atom exhaustion
| `ATOM_EXT` | 100 | no | atom exhaustion
| `ATOM_UTF8_EXT` | 118 | no | atom exhaustion
| `BINARY_EXT` | 109 | maybe | dynamic binary length (32bits)
| `BIT_BINARY_EXT` | 77 | maybe | dynamic bitstring length (32bits)
| `EXPORT_EXT` | 113 | no | atom exhaustion
| `FLOAT_EXT` | 99 | yes | 31 bytes float fixed length
| `FUN_EXT`              | 117 |       no | atom exhaustion
| `INTEGER_EXT` | 98 | yes | 1 byte fixed length
| `LARGE_BIG_EXT` | 111 | maybe | dynamic integer length (32bits)
| `LARGE_TUPLE_EXT` | 105 | maybe | dynamic tuple length (32bits)
| `LIST_EXT` | 108 | maybe | dynamic list length (32bits)
| `LOCAL_EXT` | 121 | yes | atom exhaustion
| `MAP_EXT` | 116 | maybe | dynamic pair length (32bits)
| `NEWER_REFERENCE_EXT` | 90 | no | memory exhaustion
| `NEW_FLOAT_EXT` | 70 | yes | 8 bytes fixed float
| `NEW_FUN_EXT`          | 112 |       no | atom exhaustion
| `NEW_PID_EXT`          |  88 |       no | atom exhaustion
| `NEW_PORT_EXT`         |  89 |       no | atom exhaustion
| `NEW_REFERENCE_EXT` | 114 | maybe | dynamic reference length (16bits)
| `NIL_EXT` | 106 | yes | fixed length
| `PID_EXT` | 103 | no | atom exhaustion
| `PORT_EXT` | 102 | no | atom exhaustion
| `REFERENCE_EXT` | 101 | no | atom exhaustion
| `SMALL_ATOM_EXT` | 115 | no | atom exhaustion
| `SMALL_ATOM_UTF8_EXT` | 119 | no | atom exhaustion
| `SMALL_BIG_EXT` | 110 | maybe | dynamic integer length (8bits)
| `SMALL_INTEGER_EXT` | 97 | yes | fixed size
| `SMALL_TUPLE_EXT` | 104 | maybe | dynamic tuple length (8bits)
| `STRING_EXT` | 107 | maybe | dynamic string length (16bits)
| `V4_PORT_EXT` | 120 | no | atom exhaustion
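
Based on that table, a decoder could for instance drop the
atom-carrying terms it never expects, reusing the `drop` option shown
in the usage section. Which of these option keys berty actually accepts
is an assumption in this sketch:

```erlang
% Hypothetical: drop the atom-carrying terms flagged "no" above.
% Whether berty accepts each of these keys is an assumption.
Payload = term_to_binary({ok, [1, 2, 3], <<"data">>}),
{ok, _Cleaned} = berty:decode(Payload, #{ atom_ext            => drop
                                        , atom_utf8_ext       => drop
                                        , small_atom_ext      => drop
                                        , small_atom_utf8_ext => drop
                                        }).
```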

# Resources

- [BERT-RPC Official](https://bert-rpc.org) [(archive)](https://web.archive.org/web/20160304092040/http://bert-rpc.org/)
- [BERT-RPC Google group](https://groups.google.com/g/bert-rpc)