{"id":18603590,"url":"https://github.com/cogini/avro_schema","last_synced_at":"2025-04-10T19:31:54.728Z","repository":{"id":57479129,"uuid":"217956279","full_name":"cogini/avro_schema","owner":"cogini","description":"Elixir convenience library for handling Avro schemas, useful for Kafka","archived":false,"fork":false,"pushed_at":"2024-07-22T21:56:03.000Z","size":130,"stargazers_count":8,"open_issues_count":4,"forks_count":2,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-07T05:44:14.994Z","etag":null,"topics":["avro","elixir","kafka","schema-registry-client"],"latest_commit_sha":null,"homepage":"","language":"Elixir","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cogini.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-10-28T03:07:50.000Z","updated_at":"2024-12-13T06:11:45.000Z","dependencies_parsed_at":"2024-07-23T00:15:34.524Z","dependency_job_id":null,"html_url":"https://github.com/cogini/avro_schema","commit_stats":{"total_commits":60,"total_committers":3,"mean_commits":20.0,"dds":0.4,"last_synced_commit":"73dec0bdb3e65e878e1c2de0fc34a26b2f3dd85a"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cogini%2Favro_schema","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cogini%2Favro_schema/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cogini%2Favro_schema/releases","manifests_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/cogini%2Favro_schema/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cogini","download_url":"https://codeload.github.com/cogini/avro_schema/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248281413,"owners_count":21077423,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["avro","elixir","kafka","schema-registry-client"],"created_at":"2024-11-07T02:14:50.879Z","updated_at":"2025-04-10T19:31:52.457Z","avatar_url":"https://github.com/cogini.png","language":"Elixir","funding_links":[],"categories":[],"sub_categories":[],"readme":"![test workflow](https://github.com/cogini/avro_schema/actions/workflows/test.yml/badge.svg)\n[![Module Version](https://img.shields.io/hexpm/v/avro_schema.svg)](https://hex.pm/packages/avro_schema)\n[![Hex Docs](https://img.shields.io/badge/hex-docs-lightgreen.svg)](https://hexdocs.pm/avro_schema/)\n[![Total Download](https://img.shields.io/hexpm/dt/avro_schema.svg)](https://hex.pm/packages/avro_schema)\n[![License](https://img.shields.io/hexpm/l/avro_schema.svg)](https://github.com/cogini/avro_schema/blob/master/LICENSE.md)\n[![Last Updated](https://img.shields.io/github/last-commit/cogini/avro_schema.svg)](https://github.com/cogini/avro_schema/commits/master)\n[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](CODE_OF_CONDUCT.md)\n\n# AvroSchema\n\n---\n\nThis is a library for working with [Avro](https://avro.apache.org/)\nschemas and the [Confluent® Schema 
Registry](https://www.confluent.io/confluent-schema-registry),\nprimarily focused on working with [Kafka](https://kafka.apache.org/) streams.\nIt relies on [erlavro](https://github.com/klarna/erlavro) for encoding and\ndecoding data and [confluent_schema_registry](https://github.com/cogini/confluent_schema_registry)\nto look up schemas using the [Schema Registry API](https://docs.confluent.io/current/schema-registry/develop/api.html).\n\nIts primary value is that it caches schemas for performance and to allow\nprograms to work independently of the Schema Registry being available.\nIt also has a consistent set of functions to manage schema tags, look up\nschemas from the Schema Registry or files, and encode/decode data.\n\nMuch thanks to Klarna for [Avlizer](https://github.com/klarna/avlizer), which\nprovides similar functionality to this library in Erlang,\n[erlavro](https://github.com/klarna/erlavro) for Avro, and\n[brod](https://github.com/klarna/brod) for dealing with Kafka.\n\n## Installation\n\nAdd the package to your list of dependencies in `mix.exs`:\n\n```elixir\ndef deps do\n  [\n    {:avro_schema, \"~\u003e 0.1.0\"}\n  ]\nend\n```\n\nDocumentation is on [HexDocs](https://hexdocs.pm/avro_schema).\nTo generate a local copy, run `mix docs`.\n\n## Starting\n\nAdd the cache GenServer to your application's supervision tree:\n\n```elixir\ndef start(_type, _args) do\n  cache_dir = Application.get_env(:yourapp, :cache_dir, \"/tmp\")\n\n  children = [\n    {AvroSchema, [cache_dir: cache_dir]},\n  ]\n\n  opts = [strategy: :one_for_one, name: Example.Supervisor]\n  Supervisor.start_link(children, opts)\nend\n```\n\n## Overview\n\nWhen using Kafka, producers and consumers are separated, and schemas may evolve\nover time. It is common for producers to tag data indicating the schema that\nwas used to encode it. 
Consumers can then look up the corresponding schema\nversion and use it to decode the data.\n\nThis library supports two tagging formats, [Confluent wire format](https://docs.confluent.io/current/schema-registry/serializer-formatter.html#wire-format),\nand [Avro single object encoding](https://avro.apache.org/docs/1.8.2/spec.html#single_object_encoding_spec).\n\n### Confluent wire format\n\nWith the [Confluent wire format](https://docs.confluent.io/current/schema-registry/serializer-formatter.html#wire-format),\nAvro binary encoded objects are prefixed with a five-byte tag.\n\nThe first byte indicates the Confluent serialization format version number,\ncurrently always 0. The following four bytes encode the integer schema ID as\nreturned from the Schema Registry in network byte order.\n\n### Avro single-object encoding\n\nWhen used without a schema registry, it's common to prefix binary data with a\nhash of the schema that created it. In the past, that might be something like\nMD5.\n\nThe Avro \"Single-object encoding\" formalizes this, prefixing Avro binary data\nwith a two-byte marker, C3 01, to show that the message is Avro and uses this\nsingle-record format (version 1). That is followed by the 8-byte little-endian\n[CRC-64-AVRO](https://avro.apache.org/docs/1.8.2/spec.html#schema_fingerprints)\nfingerprint of the object's schema.\n\nThe CRC-64-AVRO algorithm is uncommon, but it is used because the fingerprint is\nrelatively short while still making collisions between schemas unlikely. The\nfingerprint function is implemented in `fingerprint_schema/1`.\n\n### Schema Registry\n\nIn a relatively static system, it's not too hard to exchange schema files\nbetween producers and consumers. When things are changing more frequently, it\ncan be difficult to keep files up to date. 
It's also easy for insignificant\ndifferences such as whitespace to result in different schema hashes.\n\nThe Schema Registry solves this by providing a centralized service which\nproducers and consumers can call to get a unique identifier for a schema\nversion. Producers register a schema with the service and get an id.\nConsumers look up the id to get the schema.\n\nThe Schema Registry also does validation on new schemas to ensure that they\nmeet a backwards compatibility policy for the organization.\nThis helps to [evolve schemas](https://docs.confluent.io/current/schema-registry/avro.html)\nover time and deploy them without breaking running applications.\n\nThe disadvantage of the Schema Registry is that it can be a single point\nof failure. Different schema registries will, in general, assign a different\nnumeric id to the same schema.\n\nThis library provides functions to register schemas with the Schema Registry\nand look them up by id. It caches the results in RAM (ETS) for performance,\nand optionally also on disk (DETS). This gives good performance and allows\nprograms to work without needing to communicate with the Schema Registry.\nOnce read, the numeric IDs never change, so it's safe to cache them indefinitely.\n\nThe library also has support for managing schemas from files. It can add files\nto the cache by fingerprint, registering the same schema under multiple\nfingerprints, i.e. the raw JSON, a version in\n[Parsing Canonical Form](https://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas)\nand with whitespace stripped out. You can also manually register aliases for the\nname and fingerprint to handle legacy data.\n\n## Kafka producer example\n\nA Kafka producer program needs to be able to encode the data with an Avro\nschema and tag it with the schema ID or fingerprint. 
It may store the\nschema in the code or read it from a file, or it may look it up from the Schema\nRegistry using the subject.\n\nThe subject is a registered name which identifies the type of data.\nThere are several [standard strategies](https://docs.confluent.io/current/schema-registry/serializer-formatter.html#subject-name-strategy)\nused by Confluent in their Kafka libraries.\n\n* `TopicNameStrategy`, the default, registers the schema based on the Kafka\n  topic name, implicitly requiring that all messages in the topic use the same schema.\n\n* `RecordNameStrategy` names the schema using the record type, allowing\n  a single topic to have multiple different types of data or multiple topics\n  to have the same type of data.\n\n  In an Avro schema, the \"[full name](https://avro.apache.org/docs/1.8.2/spec.html#names)\" is\n  a namespace-qualified name for the record, e.g. `com.example.X`. In the schema, it is\n  the `name` field.\n\n* `TopicRecordNameStrategy` names the schema using a combination of topic and record.\n\nWith the subject, the producer can call the Schema Registry to get the ID\nmatching the Avro schema:\n\n```elixir\niex\u003e schema_json = \"{\\\"name\\\":\\\"test\\\",\\\"type\\\":\\\"record\\\",\\\"fields\\\":[{\\\"name\\\":\\\"field1\\\",\\\"type\\\":\\\"string\\\"},{\\\"name\\\":\\\"field2\\\",\\\"type\\\":\\\"int\\\"}]}\"\niex\u003e subject = \"test\"\niex\u003e {:ok, ref} = AvroSchema.register_schema(subject, schema_json)\n{:ok, 21}\n```\n\nIf the schema has already been registered, then the Schema Registry will\nreturn the current id. 
If you are registering a new version of the schema, then\nthe Schema Registry will first check if it is compatible with the old one.\nDepending on the compatibility rules, it may reject the schema.\n\nThe producer next needs to get an encoder for the schema.\n\nThe encoder is a function that takes Avro key/value data and encodes\nit to a binary.\n\n```elixir\niex\u003e {:ok, encoder} = AvroSchema.make_encoder(schema_json)\n{:ok, #Function\u003c2.110795165/1 in :avro.make_simple_encoder/2\u003e}\n```\n\nNext we encode some data:\n\n```elixir\niex\u003e data = %{field1: \"hello\", field2: 100}\niex\u003e encoded = AvroSchema.encode!(data, encoder)\n[['\\n', \"hello\"], [200, 1]]\n```\n\nFinally, we tag the data:\n\n```elixir\niex\u003e tagged_confluent = AvroSchema.tag(encoded, 21)\n[\u003c\u003c0, 0, 0, 0, 21\u003e\u003e, [['\\n', \"hello\"], [200, 1]]]\n```\n\nIf you are using files, the process is similar. First\ncreate a fingerprint for the schema:\n\n```elixir\niex\u003e fp = AvroSchema.fingerprint_schema(schema_json)\n\u003c\u003c172, 194, 58, 14, 16, 237, 158, 12\u003e\u003e\n```\n\nNext tag the data:\n\n```elixir\niex\u003e tagged_avro = AvroSchema.tag(encoded, fp)\n[\n  \u003c\u003c195, 1\u003e\u003e,\n  \u003c\u003c172, 194, 58, 14, 16, 237, 158, 12\u003e\u003e,\n  [['\\n', \"hello\"], [200, 1]]\n]\n```\n\nNow you can send the data to Kafka.\n\n## Kafka consumer example\n\nThe process for a consumer is similar.\n\nReceive the data and get the registration id in Confluent format:\n\n```elixir\niex\u003e tagged_confluent = IO.iodata_to_binary(AvroSchema.tag(encoded, 21))\n\u003c\u003c0, 0, 0, 0, 21, 10, 104, 101, 108, 108, 111, 200, 1\u003e\u003e\n\niex\u003e {:ok, {{:confluent, regid}, bin}} = AvroSchema.untag(tagged_confluent)\n{:ok, {{:confluent, 21}, \u003c\u003c10, 104, 101, 108, 108, 111, 200, 1\u003e\u003e}}\n```\n\nGet the schema from the Schema Registry:\n\n```elixir\niex\u003e {:ok, schema} = AvroSchema.get_schema(regid)\n{:ok,\n {:avro_record_type, 
\"test\", \"\", \"\", [],\n  [\n    {:avro_record_field, \"field1\", \"\", {:avro_primitive_type, \"string\", []},\n     :undefined, :ascending, []},\n    {:avro_record_field, \"field2\", \"\", {:avro_primitive_type, \"int\", []},\n     :undefined, :ascending, []}\n  ], \"test\", []}}\n```\n\nCreate a decoder and decode the data:\n\n```elixir\niex\u003e {:ok, decoder} = AvroSchema.make_decoder(schema)\n{:ok, #Function\u003c4.110795165/1 in :avro.make_simple_decoder/2\u003e}\n\niex\u003e decoded = AvroSchema.decode!(bin, decoder)\n%{\"field1\" =\u003e \"hello\", \"field2\" =\u003e 100}\n```\n\nThe process is similar with a fingerprint.  In this case, we get the schema\nfrom files and register it in the cache using the schema name and fingerprints. There\nis more than one fingerprint because we register it with the raw schema from\nthe file and the normalized JSON for better interop.\n\n```elixir\niex\u003e {:ok, files} = AvroSchema.get_schema_files(\"test/schemas\")\n{:ok, [\"test/schemas/test.avsc\"]}\n\niex\u003e for file \u003c- files, do: AvroSchema.cache_schema_file(file)\n[\n  ok: [\n    {\"test\", \u003c\u003c172, 194, 58, 14, 16, 237, 158, 12\u003e\u003e},\n    {\"test\", \u003c\u003c194, 132, 80, 199, 36, 146, 103, 147\u003e\u003e}\n  ]\n]\n```\n\nTo decode, separate the fingerprint from the data:\n\n```elixir\niex\u003e tagged_avro = IO.iodata_to_binary(AvroSchema.tag(encoded, fp))\n\u003c\u003c195, 1, 172, 194, 58, 14, 16, 237, 158, 12, 10, 104, 101, 108, 108, 111, 200,\n  1\u003e\u003e\n\niex\u003e {:ok, {{:avro, fp}, bin}} = AvroSchema.untag(tagged_avro)\n{:ok, {{:avro, \u003c\u003c172, 194, 58, 14, 16, 237, 158, 12\u003e\u003e},\n  \u003c\u003c10, 104, 101, 108, 108, 111, 200, 1\u003e\u003e}}\n```\n\nGet the decoder and decode the data:\n\n```elixir\niex\u003e {:ok, schema} = AvroSchema.get_schema({\"test\", fp})\n{:ok,\n {:avro_record_type, \"test\", \"\", \"\", [],\n  [\n    {:avro_record_field, \"field1\", \"\", {:avro_primitive_type, \"string\", 
[]},\n     :undefined, :ascending, []},\n    {:avro_record_field, \"field2\", \"\", {:avro_primitive_type, \"int\", []},\n     :undefined, :ascending, []}\n  ], \"test\", []}}\n```\n\nDecoding works the same as with the Schema Registry:\n\n```elixir\niex\u003e {:ok, decoder} = AvroSchema.make_decoder(schema)\n{:ok, #Function\u003c4.110795165/1 in :avro.make_simple_decoder/2\u003e}\n\niex\u003e decoded = AvroSchema.decode!(bin, decoder)\n%{\"field1\" =\u003e \"hello\", \"field2\" =\u003e 100}\n```\n\n\n## Performance\n\nFor best performance, save the encoder or decoder in your process\nstate to avoid the overhead of looking it up for each message.\n\nAn in-memory ETS cache maps the integer registry ID or name + fingerprint\nto the corresponding schema and decoder. It also allows lookups using name and\nfingerprint as a key.\n\nThe fingerprint is CRC64 by default. You can also register a name with your own\nfingerprint.\n\nThis library also allows consumers to look up the schema on demand from the\nSchema Registry using the name + fingerprint as the registry subject name.\n\nThis library can optionally persist the cache data on disk using DETS,\nallowing programs to work without continuous access to the Schema Registry.\n\nPrograms which use Kafka may process high message volumes, so efficiency\nis important. They generally use multiple processes, typically one per\ntopic partition or more. On startup, each process may simultaneously attempt to\nlook up schemas.\n\nThe cache lookup runs in the caller's process, so it can run in parallel.\nIf there is a cache miss, then it calls the GenServer to update the cache.\nThis has the effect of serializing requests, ensuring that only one runs\nat a time. 
See https://www.cogini.com/blog/avoiding-genserver-bottlenecks/ for\ndiscussion.\n\nTo improve interoperability, the schema should be put into a standard form.\n\nA program might also call the Schema Registry directly to get the schema for a given subject:\n\n```elixir\niex\u003e ConfluentSchemaRegistry.get_schema(client, \"test\")\n{:ok,\n%{\n \"id\" =\u003e 21,\n \"schema\" =\u003e \"{\\\"type\\\":\\\"record\\\",\\\"name\\\":\\\"test\\\",\\\"fields\\\":[{\\\"name\\\":\\\"field1\\\",\\\"type\\\":\\\"string\\\"},{\\\"name\\\":\\\"field2\\\",\\\"type\\\":\\\"int\\\"}]}\",\n \"subject\" =\u003e \"test\",\n \"version\" =\u003e 13\n}}\n```\n\n### Timestamps\n\nAvro timestamps are in Unix format with microsecond precision:\n\n```elixir\niex\u003e datetime = DateTime.utc_now()\n~U[2019-11-08 09:09:01.055742Z]\n\niex\u003e timestamp = AvroSchema.to_timestamp(datetime)\n1573204141055742\n\niex\u003e datetime = AvroSchema.to_datetime(timestamp)\n~U[2019-11-08 09:09:01.055742Z]\n```\n\n# Contacts\n\nI am `jakemorrison` on the Elixir Slack and Discord, and `reachfh` on the Freenode\n`#elixir-lang` IRC channel. Happy to chat or help with your projects.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcogini%2Favro_schema","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcogini%2Favro_schema","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcogini%2Favro_schema/lists"}