https://github.com/tailhook/probor

A protocol on top of CBOR that provides protobuf-like functionality
https://github.com/tailhook/probor
Last synced: 6 months ago
JSON representation
A protocol on top of CBOR that provides protobuf-like functionality
Host: GitHub
URL: https://github.com/tailhook/probor
Owner: tailhook
Created: 2015-08-06T23:08:13.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2018-05-18T10:28:36.000Z (almost 8 years ago)
Last Synced: 2025-03-13T22:09:15.821Z (about 1 year ago)
Language: Rust
Size: 7.71 MB
Stars: 53
Watchers: 7
Forks: 2
Open Issues: 2
Metadata Files:
- Readme: README.rst
Awesome Lists containing this project

README

          ======

Probor

======

:Status: Proof of Concept

:Rust docs: http://tailhook.github.io/probor/

Probor is an extensible mechanism for serializing structured data on top of

CBOR_.

In additional to CBOR_ probor has the following:

1. A library to efficiently read data into language native structures

2. A schema definition language that serves as a documentation for

   interoperability between systems

3. A more compact protocol which omits field names for objects

4. Conventions to make schema backwards compatible

Why?

====

We like CBOR_ for the following:

1. It's IETF standard

2. It's self-descriptive

3. It's compact enough

4. It's extensive_ including mind-boggling_ things

5. It has implementations in most languages

.. _extensive: http://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml

.. _mind-boggling: https://github.com/paroga/cbor-js/issues/3

What we lack in CBOR_:

1. No schema definition, i.e. can't check interoperability between systems

2. Transmitting/storing lots of objects is expensive because keys are encoded

   for every object

3. No standard way to cast "object" (i.e. a map or a dict) into native typed

   object

Comparison

==========

This section roughly compares similar projects to see second our arguments in

"Why?" section. Individual arguments may not be very convincing by they are

reasonable enough in total.

Probor vs Protobuf

------------------

Protobuf_ can't parse data if no schema known. *Probor* is not always totally

readable, but at least you can unpack the data using generic *cbor* decoder and

look at raw values (presumably without key names).

And it's not only hard when schema is unknown, but when you have a schema but

no code generated to inspect it. For example if you have a java application,

but want to inspect some code in python. You need a pythonic code generator and

generate code before you can read anything with *protobuf*.

*Probor* also has debugging (non-compact) mode in which it may encode object

and enums by name so you can easily understand the values. You can also keep

key names for most objects except ones that are transmitted in large

quantities, because compact and non-compact formats are compatible. You are in

control.

The types that *Protobuf* generates are not native. So they are larger and

hard to work with. Because code is generated you usually can't add methods

to the object itself without subtle hacks. *Probor* tries to provide thin layer

around native objects.

Also working with a code generation is inconvenient. *Protobuf* has a code

generator written in C++ which you need to have installed. Moreover you often

need another version of protobuf code generator for every language. *Probor*

works without code generation for all currently supported languages

by providing simple to use macros and/or annotations to native types. We may

provide code generation facilities too for bootstrapping the code, but they

should be done purely in the language they generate.

On the upside of *Protobuf* it can deserialize lookup object and serialize

again without loosing any information (even fields that are not in his version

of a protocol). For *probor* it's not implemented in current libraries for

effiency reasons, but it can be done with apropriate libraries anyway.

.. _Protobuf: https://github.com/google/protobuf

Probor vs Avro

--------------

Avro_ needs a schema to be transported "in-band", i.e. as a prefix to a data

send. We find this redundant.

Also *Avro* types are somewhat historic from C era. We wanted modern algebraic

types like they are in Rust or Haskell.

Also *avro* file format is not in IETF spec and does not have such interesting

extensions like CBOR_ `has`__.

__ mind-boggling_

.. _avro: https://avro.apache.org/

Probor vs Thrift

----------------

Thrift doesn't have good description of the binary format (in fact it has two

both are not documented in any sensible way) unlike CBOR_ which is IETF

standard. Do the data is hard to read without having code generated in advance.

*Thrift* also has ugly union type from 1990x, comparing to nice algebraic types

which we want to use in 2015.

*Thrift* relies on code generation for parsing data which we don't like because

it makes programs hard to build and it's hard to integrate with native

types (i.e. add a method to generated type).

Also *thrift* bindings usually have some implementation of *services*

which usually is a cruft because there are too much ways for dealing with

network in each language to have all of them implemented by thrift authors.

Furthermore *thrift* has long history of generating code that can't be network

IO agnostic.

.. _thrift: http://thrift.apache.org/

Probor vs Capnproto

===================

*Capnproto* has ugly and complex serialization format which is useful for

mapping values directly into memory without decoding. But its more complex to

implement correctly than what we target for. We also wanted compact encoding

which *Capnproto* has but it's built on top of already hard to understand

encoding and complicates things even more.

*Capnproto* like other relies on code generation with ugly protocol objects

as result of decoding, but we wanted native types.

.. _capnproto: https://capnproto.org/

Look-a-Like

===========

For example, here is schema::

    struct SearchResults {

        total_results @0 :int

        results @1 :array Page

    }

    struct Page {

        url @0 :text,

        title @1 :text,

        snippet @2 :optional text,

    }

Note the following things:

* We use generic type names like int (integer), not fixed width (see FAQ)

* We give each field a number, they are similar to ones used in other

  IDL's (like protobuf, thrift or capnproto)

The structure serialized with probor will look like (displaying json for

clarity, in fact you will see exact this data if decode CBOR and encode with

JSON):

.. code-block:: json

   [1100, [

        ["http://example.com", "Example Com"],

        ["http://example.org", "Example Org", "Example organization"]]]

Obviously when unpacked, it looks more like (in javascript):

.. code-block:: javascript

   new SearchResults({"total_results": 1100,

                      "results": [new Page({"url": "http://example.com",

                                            "title": "Example Com"}),

                                  new Page({"url": "http://example.org",

                                            "title": "Example Org",

                                            "snippet": "Example organization"})]}

Actually the object can be serialized like this:

.. code-block:: json

   {"total_results": 1100,

    "results": [{"url": "http://example.com",

                 "title": "Example Com"},

                {"url": "http://example.org",

                 "title": "Example Org",

                 "snippet": "Example organization"}]}

And this would also be **totally valid** serialized representation. I.e. you

can store fields both by names and by numbers. This is occasionally useful for

ad-hoc requests or you may be willing to receive non-compact data from frontend,

then validate and push data in more compact format for storage.

In Python serialization looks like:

.. code-block:: python

    from probor import struct

    class Page(object):

        def __init__(self, url, title, snippet=None):

            # .. your constructor .. omitted for brevity

        probor_protocol = struct(

            required={(0, "url"): str, (1, "title"): str},

            optional={(2, "snippet"): str})

    class SearchResults(object):

        def __init__(self, total_resutls, results):

            # .. your constructor .. omitted for brevity

        probor_protocol = struct(

            required={(0, "total_results"): int, (1, "results"): Page})

TODO: isn't syntax ugly? Should it be more imperative? Is setstate/getstate

used?

.. note:: It's easy to build a more declarative layer on top of this protocol.

   I.e. for some ORM model, you might reuse field names and types. But the

   important property to keep in mind is that you should not rely on field

   order for numbering fields and **numbers must be explicit**, or otherwise

   removing a field might go unnoticed.

   Apart from that, integrating probor data types with model and/or validation

   code is encouraged. And that's actually a reason why we don't provide a

   nicer syntax for this low-level declarations.

Similarly in Rust it looks like:

.. code-block:: rust

    #[macro_use] extern crate probor;

    use probor::{Encoder, Encodable};

    use probor::{Decoder, Config, decode};

    use std::io::Cursor;

    probor_struct!(

    #[derive(PartialEq, Eq, Debug)]

    struct Page {

        url: String => (#0),

        title: String => (#1),

        snippet: Option => (#2 optional),

    });

    probor_struct!(

    #[derive(PartialEq, Eq, Debug)]

    struct SearchResults {

        total_results: u64 => (#0),

        results: Vec => (#1),

    });

    fn main() {

        let buf = Vec::new();

        let mut enc = Encoder::new(buf);

        SearchResults {

            total_results: 112,

            results: vec![Page {

                url: "http://url1.example.com".to_string(),

                title: "One example".to_string(),

                snippet: None,

            }, Page {

                url: "http://url2.example.com".to_string(),

                title: "Two example".to_string(),

                snippet: Some("Example Two".to_string()),

            }],

        }.encode(&mut enc).unwrap();

        let sr: SearchResults = decode(

            &mut Decoder::new(Config::default(), Cursor::new(enc.into_writer())))

            .unwrap();

        println!("Results {:?}", sr);

    }

The Rust example is a bit longer which is bearable for rust.  It's hugely based on

macros, which may seem as similar to code generation. Still, we find it better,

because you are in control of at least the following things:

1. The specific types used (e.g. u64 for int)

2. The structure definition (may use meta attributes including

   ``derive`` and ``repr`` and may use ``struct T(X, Y)``)

3. How objects are created (e.g. use ``VecDeque`` or ``BTreeMap`` instead of

   default ``Vec`` and ``HashMap``)

4. How missing fields are handled (e.g. you can provide defaults for missing

   fields instead of using ``Option``)

5. You can include application-specific validation code

.. note:: Leaving the parentheses empty will result in the field

   strings stored as part of the payload.

   This would undermine the goal of reducing byte count of data stored,

   and in such cases, one may as well use CBOR directly.

At the end of the day, writing a parser explicitly with few helper macros looks

like a much better idea than adding all the data as the meta information to the

schema file.

Type System

===========

Structures

----------

TBD

Algebraic Types

---------------

TBD

In Unsupported Languages

````````````````````````

In language which doesn't support algebraic types, they are implemented

by tying together few normal types. E.g. the following type in Rust:

.. code-block:: rust

    enum HtmlElement {

        Tag(String, Vec),

        Text(String),

    }

Is encoded like this in python:

.. code-block:: python

    from probor import enum

    class HtmlElement:

        """Base class"""

    class Tag(HtmlElement):

        def __init__(self, tag_name, children):

            # .. snip ..

        probor_protocol = ...

    class Text(HtmlElement):

        def __init__(self, text)

            self.text = text

        probor_protocol = ...

    HtmlElement.probor_protocol = enum({

        (0, 'Tag'): Tag,

        (1, 'Text'): Text,

    })

Then you can do pattern-matching-like things by using

``functools.singledispatch`` (in Python3.4) or just use ``isinstance``.

.. note:: The purescript compiles types similarly. It's unchecked, but

   I believe probor's searization into Javascript should be compatible with

   PureScript types.

Forward/Backward Compatibility

==============================

Comparing with protobuf, the probor serializer always considers all fields as

optional. The required fields are only in IDL, so if your future type is smart

enough to

Backwards compatibility is very similar to protobuf.

TBD: exact rules for backward compatibility

TBD: exact rules for forward compatibility

TBD: turning structure in algebraic type with compatibility

FAQ

===

Why Use Generic Types?

----------------------

Well, there are couple of reasons:

1. Different languages have different types, e.g. Python does have generic

   integer only, Java does not have unsigned integer types

2. Fixed width types are not good constaint anyway, valid values have often

   much smaller range than that of the type, so this is not a replacement for

   data validation anyway

Why No Default Values

---------------------

There are couple of reasons:

1. Default value is user-interface feature. And every service might want use

   it's own default value.

2. It's very application-specific if value that equals to default value may

   be omitted when serializing. And we want to use native structures for the

   language without any additional bookkeeping of whether the value is default

   or just equals to it.

.. _CBOR: http://cbor.io/
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/tailhook/probor

Awesome Lists containing this project

README