{"id":15562165,"url":"https://github.com/mrkamel/kafka_sync","last_synced_at":"2025-03-29T04:44:07.605Z","repository":{"id":144996296,"uuid":"158843253","full_name":"mrkamel/kafka_sync","owner":"mrkamel","description":"Using Kafka to keep secondary datastores in sync with your primary datastore","archived":false,"fork":false,"pushed_at":"2018-12-10T09:33:28.000Z","size":101,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-02-03T14:36:54.542Z","etag":null,"topics":["kafka","ruby","sync"],"latest_commit_sha":null,"homepage":null,"language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mrkamel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-11-23T14:32:31.000Z","updated_at":"2018-12-10T09:33:30.000Z","dependencies_parsed_at":null,"dependency_job_id":"ff9868bb-5b25-4f8f-a7bd-03d656ff1122","html_url":"https://github.com/mrkamel/kafka_sync","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrkamel%2Fkafka_sync","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrkamel%2Fkafka_sync/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrkamel%2Fkafka_sync/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrkamel%2Fkafka_sync/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mrkamel","download_url":"https://cod
eload.github.com/mrkamel/kafka_sync/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246140542,"owners_count":20729797,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["kafka","ruby","sync"],"created_at":"2024-10-02T16:12:04.558Z","updated_at":"2025-03-29T04:44:07.581Z","avatar_url":"https://github.com/mrkamel.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# KafkaSync\n\n**Using Kafka to keep secondary datastores in sync with your primary datastore**\n\n[![Build Status](https://secure.travis-ci.org/mrkamel/kafka_sync.png?branch=master)](http://travis-ci.org/mrkamel/kafka_sync)\n\nSimply stated, the primary purpose of KafkaSync is to keep your secondary\ndatastores (ElasticSearch, cache stores, etc.) consistent with the models\nstored in your primary datastore (MySQL, Postgres, etc).\n\nGetting started is easy:\n\n```ruby\nclass MyModel \u003c ActiveRecord::Base\n  include KafkaSync::Model\n\n  kafka_sync\nend\n```\n\n`kafka_sync` installs model lifecycle callbacks, i.e. `after_save`,\n`after_touch`, `after_destroy` and, most importantly, `after_commit`. The\ncallbacks send messages to kafka, having a (customizable) payload:\n\n```ruby\ndef kafka_payload\n  { id: id }\nend\n```\n\nNow, background workers can fetch the messages in batches and update the\nsecondary datastores. However, `after_save`, `after_touch` and `after_destroy`\nonly send \"delay messages\" to kafka. These delay messages should not be fetched\nimmediately. Instead, they should be fetched after e.g. 
5 minutes. Only the\n`after_commit` callback sends messages to Kafka that can be fetched\nimmediately by background workers. The delay messages provide a safety net for\ncases when something crashes in between the database commit and the\n`after_commit` callback. In contrast, the purpose of messages sent to Kafka from\nwithin the `after_commit` callback is to keep the secondary datastore updated\nin near-realtime when everything is working without any issues. Due to the\ncombination of delay messages and instant messages, you won't have to do a\nfull re-index after server crashes anymore, because your secondary datastores\nwill be self-healing.\n\n## Why Kafka?\n\nKafka has unique properties which nicely fit the use case. Reading messages\nfrom a Kafka topic is done using an offset that must be specified. This makes\nit easy to implement bulk processing, which is e.g. very useful for performance\nwhen indexing data into ElasticSearch. Moreover, as we can manage committing\noffsets completely on our own, we are free to only commit an offset when all\nmessages up to this offset have successfully been processed. Next, Kafka has a\nconcept of in-sync replicas and you can configure Kafka to only return success\nto your message producers if at least N in-sync replicas are\navailable and if the message has been replicated to at least M in-sync\nreplicas. Thus, you can e.g. start with a three node Kafka setup, with\n`min.insync.replicas=2`, `default.replication.factor=3` and `required_acks=-1`,\nwhere -1 means that the message must have been replicated to all in-sync\nreplicas before Kafka returns success to your producers. This greatly improves\nreliability.\n\nHowever, there is now an alternative to Kafka, because Redis (\u003e= 5.0)\ncomes with a Streams datatype/feature. 
So, in case you\nprefer using Redis, you probably want to check out the\n[redstream](https://github.com/mrkamel/redstream) gem.\n\n## Installation\n\nAdd this line to your application's Gemfile:\n\n```ruby\ngem 'kafka_sync'\n```\n\nAnd then execute:\n\n    $ bundle\n\nOr install it yourself as:\n\n    $ gem install kafka_sync\n\nAfterwards, you need to specify how to connect to Kafka as well as ZooKeeper:\n\n```ruby\nKafkaSync.seed_brokers = [\"127.0.0.1:9092\"]\nKafkaSync.zk_hosts = \"127.0.0.1:1281\"\n```\n\n## Reference Docs\n\nThe reference docs can be found at\n[https://www.rubydoc.info/github/mrkamel/kafka_sync/master](https://www.rubydoc.info/github/mrkamel/kafka_sync/master).\n\n\n## Model\n\nThe `KafkaSync::Model` module installs model lifecycle methods.\n\n```ruby\nclass MyModel \u003c ActiveRecord::Base\n  include KafkaSync::Model\n\n  kafka_sync\nend\n```\n\n## Consumer\n\n```ruby\nDefaultLogger = Logger.new(STDOUT)\n\nKafkaSync::Consumer.new(topic: \"products\", partition: 0, name: \"consumer\", logger: DefaultLogger).run do |messages|\n  # ...\nend\n```\n\nYou should run a consumer per `(topic, partition, name)` tuple on multiple\nhosts for high availability. They will perform a leader election using\nZooKeeper, such that only one of them will be actively consuming\nmessages per tuple while the others are hot-standbys, i.e. if the leader dies,\nanother instance will take over leadership.\n\nPlease note: if you have multiple kinds of consumers for a single model/topic,\nthen you must use distinct names. 
Assume you have an indexer, which updates a\nsearch index for a model, and a cacher, which updates a cache store for a model:\n\n```ruby\nKafkaSync::Consumer.new(topic: MyModel.kafka_topic, partition: 0, name: \"indexer\", logger: DefaultLogger).run do |messages|\n  # ...\nend\n\nKafkaSync::Consumer.new(topic: MyModel.kafka_topic, partition: 0, name: \"cacher\", logger: DefaultLogger).run do |messages|\n  # ...\nend\n```\n\nPlease note that it's up to you to detect and handle deletions. More\nconcretely, `after_destroy` writes the same message to Kafka as `after_save`,\nso your consumer needs to fetch the records specified by the Kafka\nmessages, check which of them no longer exist and update your secondary\ndatastores accordingly. The code can be as simple as:\n\n```ruby\n  KafkaSync::Consumer.new(topic: MyModel.kafka_topic, partition: 0, name: \"my_consumer\").run do |messages|\n    ids = messages.map { |message| message.payload[\"id\"] }\n    records = MyModel.where(id: ids).index_by(\u0026:id)\n\n    ids.each do |id|\n      if object = records[id]\n        # update secondary data store\n      else\n        # delete from secondary data store\n      end\n    end\n  end\n```\n\nOf course, batching the updates/deletions usually improves the performance.\n\n## Delayer\n\nThe delayer fetches the delay messages, i.e. messages from the specified delay\ntopic. It then checks whether enough time has passed. If not, it will\nsleep until enough time has passed. 
Afterwards the delayer re-sends the messages\nto the desired topic where an indexer can fetch them and index them as usual.\n\n```ruby\nKafkaSync::Delayer.new(topic: MyModel.kafka_topic, partition: 0, delay: 300, logger: DefaultLogger).run\n```\n\nAgain, you should run a delayer per `(topic, partition)` tuple on multiple\nhosts for high availability.\n\nAs you might have noticed, KafkaSync sends 2 messages to Kafka for every update\nto your models.\n\n## Streamer\n\nThe `KafkaSync::Streamer` actually sends the delay as well as instant messages\nto Kafka and is required for cases where you're using `#update_all`,\n`#delete_all`, etc. As you might know, `#update_all`, etc. bypass any\nmodel lifecycle callbacks, such that you need to tell KafkaSync about those\nupdates.\n\nMore concretely, you need to change:\n\n```ruby\nProduct.where(on_stock: true).update_all(featured: true)\n```\n\nto the following:\n\n```ruby\nKafkaStreamer = KafkaSync::Streamer.new\n\nProduct.where(on_stock: true).find_in_batches do |products|\n  KafkaStreamer.bulk products do\n    Product.where(id: products.map(\u0026:id)).update_all(featured: true)\n  end\nend\n```\n\nThe `#bulk` method must ensure that the same set of records is used for the\ndelay messages and the instant messages. Thus, you should pass an\narray of records directly to `KafkaSync::Streamer#bulk`, as shown above. If you pass\nan `ActiveRecord::Relation`, the `#bulk` method will convert it to an array,\ni.e. 
load the whole result set into memory.\n\n## Contributing\n\nBug reports and pull requests are welcome on GitHub at https://github.com/mrkamel/kafka_sync.\n\n## License\n\nThe gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrkamel%2Fkafka_sync","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmrkamel%2Fkafka_sync","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrkamel%2Fkafka_sync/lists"}