{"id":13526875,"url":"https://github.com/soundcloud/roshi","last_synced_at":"2025-05-15T03:07:52.649Z","repository":{"id":13218724,"uuid":"15902954","full_name":"soundcloud/roshi","owner":"soundcloud","description":"Roshi is a large-scale CRDT set implementation for timestamped events.","archived":false,"fork":false,"pushed_at":"2023-04-24T10:36:00.000Z","size":740,"stargazers_count":3168,"open_issues_count":4,"forks_count":155,"subscribers_count":269,"default_branch":"master","last_synced_at":"2025-04-14T03:09:08.484Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/soundcloud.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2014-01-14T13:33:55.000Z","updated_at":"2025-03-25T20:35:17.000Z","dependencies_parsed_at":"2022-07-19T05:32:03.379Z","dependency_job_id":"94dd5521-a639-4090-a054-638aaf7db9a1","html_url":"https://github.com/soundcloud/roshi","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soundcloud%2Froshi","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soundcloud%2Froshi/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soundcloud%2Froshi/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soundcloud%2Froshi/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/soundcloud","download_url":"https://codeload.github.com/soundcloud/roshi/tar.gz/refs/heads/master","host"
:{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254264769,"owners_count":22041794,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T06:01:36.545Z","updated_at":"2025-05-15T03:07:47.630Z","avatar_url":"https://github.com/soundcloud.png","language":"Go","funding_links":[],"categories":["Go","Implementations"],"sub_categories":["Databases and Logs"],"readme":"# roshi [![Build Status](https://travis-ci.org/soundcloud/roshi.png)](https://travis-ci.org/soundcloud/roshi) [![GoDoc](https://godoc.org/github.com/soundcloud/roshi?status.svg)](http://godoc.org/github.com/soundcloud/roshi)\n\nRoshi implements a time-series event storage via a LWW-element-set CRDT with\nlimited inline garbage collection. Roshi is a stateless, distributed layer on\ntop of Redis and is implemented in Go. It is partition tolerant, highly\navailable and eventually consistent.\n\nAt a high level, Roshi maintains sets of values, with each set ordered\naccording to (external) timestamp, newest-first. Roshi provides the following\nAPI:\n\n* Insert(key, timestamp, value)\n* Delete(key, timestamp, value)\n* Select(key, offset, limit) []TimestampValue\n\nRoshi stores a sharded copy of your dataset in multiple independent Redis\ninstances, called a **cluster**. Roshi provides fault tolerance by duplicating\nclusters; multiple identical clusters, normally at least 3, form a **farm**.\nRoshi leverages CRDT semantics to ensure consistency without explicit\nconsensus.\n\n# Use cases\n\nRoshi is basically a high-performance **index** for timestamped data. 
It's\ndesigned to sit in the critical (request) path of your application or service.\nThe originating use case is the SoundCloud stream; see [this blog post][blog]\nfor details.\n\n[blog]: http://developers.soundcloud.com/blog/roshi-a-crdt-system-for-timestamped-events\n\n# Theory and system properties\n\nRoshi is a distributed system, for two reasons: it's made for datasets that\ndon't fit on one machine, and it's made to be tolerant against node failure.\n\nNext, we will explain the system design.\n\n## CRDT\n\nCRDTs (conflict-free replicated data types) are data types on which the same \nset of operations yields the same outcome, regardless of order of execution \nand duplication of operations. This allows data convergence without the need \nfor consensus between replicas. In turn, this allows for easier implementation \n(no consensus protocol implementation) as well as lower latency (no wait-time \nfor consensus).\n\nOperations on CRDTs need to adhere [to the following rules][mixu]:\n\n- Associativity (a+(b+c)=(a+b)+c), so that grouping doesn't matter.\n- Commutativity (a+b=b+a), so that order of application doesn't matter.\n- Idempotence (a+a=a), so that duplication doesn't matter.\n\nData types as well as operations have to be specifically crafted to meet these\nrules. CRDTs have known implementations for counters, registers, sets, graphs,\nand others. Roshi implements a set data type, specifically the Last Writer\nWins element set (LWW-element-set).\n\nThis is an intuitive description of the LWW-element-set:\n\n- An element is in the set, if its most-recent operation was an add.\n- An element is not in the set, if its most-recent operation was a remove.\n\nA more formal description of a LWW-element-set, as informed by\n[Shapiro][shapiro], is as follows: a set S is represented by two internal\nsets, the add set A and the remove set R. To add an element e to the set S,\nadd a tuple t with the element and the current timestamp t=(e, now()) to A. 
To\nremove an element from the set S, add a tuple t with the element and the\ncurrent timestamp t=(e, now()) to R. To check if an element e is in the set S,\ncheck if it is in the add set A and not in the remove set R with a higher\ntimestamp.\n\nRoshi implements the above definition, but extends it by applying a sort of\ninstant garbage collection.  When inserting an element E to the logical set S,\ncheck if E is already in the add set A or the remove set R. If so, check the\nexisting timestamp. If the existing timestamp is **lower** than the incoming\ntimestamp, the write succeeds: remove the existing (element, timestamp) tuple\nfrom whichever set it was found in, and add the incoming (element, timestamp)\ntuple to the add set A. If the existing timestamp is higher than the incoming\ntimestamp, the write is a no-op.\n\nBelow are all possible combinations of add and remove operations.\nA(elements...) is the state of the add set. R(elements...) is the state of\nthe remove set. An element is a tuple with (value, timestamp). add(element)\nand remove(element) are the operations.\n\nOriginal state | Operation   | Resulting state\n---------------|-------------|-----------------\nA(a,1) R()     | add(a,0)    | A(a,1) R()\nA(a,1) R()     | add(a,1)    | A(a,1) R()\nA(a,1) R()     | add(a,2)    | A(a,2) R()\nA(a,1) R()     | remove(a,0) | A(a,1) R()\nA(a,1) R()     | remove(a,1) | A(a,1) R()\nA(a,1) R()     | remove(a,2) | A() R(a,2)\nA() R(a,1)     | add(a,0)    | A() R(a,1)\nA() R(a,1)     | add(a,1)    | A() R(a,1)\nA() R(a,1)     | add(a,2)    | A(a,2) R()\nA() R(a,1)     | remove(a,0) | A() R(a,1)\nA() R(a,1)     | remove(a,1) | A() R(a,1)\nA() R(a,1)     | remove(a,2) | A() R(a,2)\n\nFor a Roshi LWW-element-set, an element will always be in either the add or\nthe remove set exclusively, but never in both and never more than once. This\nmeans that the logical set S is the same as the add set A.\n\nEvery key in Roshi represents a set. 
Each set is its own LWW-element-set.\n\nFor more information on CRDTs, the following resources might be helpful:\n\n- [The chapter on CRDTs][mixu] in \"Distributed Systems for Fun and Profit\" by Mixu\n- \"[A comprehensive study of Convergent and Commutative Replicated Data Types][shapiro]\" by Mark Shapiro et al. 2011\n\n[mixu]: http://book.mixu.net/distsys/eventual.html\n[shapiro]: http://hal.inria.fr/docs/00/55/55/88/PDF/techreport.pdf\n\n## Replication\n\nRoshi replicates data over several non-communicating clusters. A typical\nreplication factor is 3. Roshi has two methods of replicating data: during\nwrite, and during read-repair.\n\nA write (Insert or Delete) is sent to all clusters. The overall operation\nreturns success the moment a user-defined number of clusters return success.\nUnsuccessful clusters might either have been too slow (but still accepted the\nwrite) or failed (due to a network partition or an instance crash). In case of\nfailure, read-repair might be triggered on a later read.\n\nA read (Select) is dependent on the read strategy employed. If the strategy\nqueries several clusters, it might be able to spot disagreement in the\nreturned sets. If so, the unioned set is returned to the client, and in the\nbackground, a read-repair is triggered, which lazily converges the sets across\nall replicas.\n\n[Package farm][farm] explains replication, read strategies, and read-repair\nfurther.\n\n[farm]: http://github.com/soundcloud/roshi/tree/master/farm\n\n## Fault tolerance\n\nRoshi runs as a homogenous distributed system. Each Roshi instance can serve\nall requests (Insert, Delete, Select) for a client, and communicates with all\nRedis instances.\n\nA Roshi instance is effectively stateless, but holds transient state. If a\nRoshi instance crashes, two types of state are lost:\n\n1. Current client connections are lost. Clients can reconnect to another Roshi\n   instance and re-execute their operation.\n2. Unresolved read-repairs are lost. 
The read-repair might be triggered again\n   during another read.\n\nSince all operations are idempotent, neither failure mode impedes\nconvergence of the data.\n\nPersistence is delegated to [Redis][redis-persistence]. Data on a\ncrashed-but-recovered Redis instance might be lost between the time it\nlast committed to disk and the time it accepts connections again. The lost data gap\nmight be repaired via read-repair.\n\n[redis-persistence]: http://redis.io/topics/persistence\n\nIf a Redis instance is permanently lost and has to be replaced with a fresh\ninstance, there are two options:\n\n1. Replace it with an empty instance. Keys will be replicated to it via\n   read-repair. As more and more keys are replicated, the read-repair load will\n   decrease and the instance will work normally. This process might result in\n   data loss over the lifetime of a system: if the other replicas are also\n   lost, non-replicated keys (keys that have not been requested and thus did\n   not trigger a read-repair) are lost.\n2. Replace it with a cloned replica. There will be a gap between the time of\n   the last write respected by the replica and the first write respected by the\n   new instance. This gap might be fixed by subsequent read-repairs.\n\nBoth processes can be expedited via a [keyspace walker process][roshi-walker].\nNevertheless, these properties and procedures warrant careful consideration.\n\n## Responses to write operations\n\nWrite operations (insert or delete) return a boolean to indicate whether the\noperation was successfully applied to the data layer, respecting the\nconfigured write quorum. Clients should interpret a write response of false to\nmean they should re-submit their operation. 
A write response of true does\n**not** imply the operation mutated the state in a way that will be visible to\nreaders, merely that it was accepted and processed according to CRDT\nsemantics.\n\nAs an example, all of these write operations would return true.\n\nWrite operation         | Final state           | Operation description\n------------------------|-----------------------|---------------\nInsert(\"foo\", 3, \"bar\") | foo+ bar/3\u003cbr/\u003efoo- — | Initial write\nInsert(\"foo\", 3, \"bar\") | foo+ bar/3\u003cbr/\u003efoo- — | No-op: incoming score doesn't beat existing score\nDelete(\"foo\", 2, \"bar\") | foo+ bar/3\u003cbr/\u003efoo- — | No-op: incoming score doesn't beat existing score\nDelete(\"foo\", 4, \"bar\") | foo+ —\u003cbr/\u003efoo- bar/4 | \"bar\" moves from add set to remove set\nDelete(\"foo\", 5, \"bar\") | foo+ —\u003cbr/\u003efoo- bar/5 | score of \"bar\" in remove set is incremented\n\n## Considerations\n\n### Elasticity\n\nRoshi does not support elasticity. It is not possible to change the sharding\nconfiguration during operations. Roshi has static service discovery,\nconfigured during startup.\n\n### Data structure\n\nRoshi works with LWW-element-sets only. Clients might choose to model other\ndata types on top of the LWW-element-sets themselves.\n\n### Correct client timestamps\n\nClient timestamps are assumed to correctly represent the physical order of\nevents coming into the system. Incorrect client timestamps might lead to\nvalues of a client either never appearing or always overriding other values in\na set.\n\n### Data loss\n\nAssuming a replication factor of 3, and a write quorum of 2 nodes, Roshi makes\nthe following guarantees in the presence of failures of Redis instances that\nrepresent the same data shard:\n\nFailures | Data loss? 
| Reads                              | Writes\n---------|------------|------------------------------------|----------\n0        | No         | Succeed                            | Succeed\n1        | No         | Success dependent on read strategy | Succeed\n2        | No         | Success dependent on read strategy | Fail\n3        | Yes        | Fail                               | Fail\n\n[Package farm][farm] explains read strategies further.\n\nFailures of Redis instances over independent data shards don't affect\ninstantaneous data durability. However, over time, independent Redis instance\nfailures can lead to data loss, especially on keys which are not regularly\nread-repaired.\nIn practice, a number of strategies may be used to probabilistically mitigate\nthis concern. For example, walking modified keys after known outages, or the\nwhole keyspace at regular intervals, which will trigger read-repairs for\ninconsistent sets.\nHowever, **Roshi fundamentally does not guarantee perfect data durability**.\nTherefore, Roshi should not be used as a source of truth, but only as an\nintermediate store for performance critical data.\n\n### Authentication, authorization, validation\n\nIn case it's not obvious, Roshi performs no authentication, authorization, or\nany validation of input data. Clients must implement those things themselves.\n\n# Architecture\n\nRoshi has a layered architecture, with each layer performing a specific\njob with a relatively small surface area. From the bottom up...\n\n- **Redis**: Roshi is ultimately implemented on top of Redis instance(s),\n  utilizing the [sorted set][sorted-set] data type. For more details on how\n  the sorted sets are used, see package cluster, below.\n\n- **[Package pool][pool]** performs key-based sharding over one or more Redis\n  instances. It exposes basically a single method, taking a key and yielding a\n  connection to the Redis instance that should hold that key. 
All Redis\n  interactions go through package pool.\n\n- **[Package cluster][cluster]** implements an Insert/Select/Delete API on top\n  of package pool. To ensure idempotency and [commutativity][commutativity],\n  package cluster expects timestamps to arrive as float64s, and refuses writes\n  with smaller timestamps than what's already been persisted. To ensure\n  information isn't lost via deletes, package cluster maintains two physical\n  Redis sorted sets for every logical (user) key, and manages the transition of\n  key-timestamp-value tuples between those sets.\n\n- **[Package farm][farm]** implements a single Insert/Select/Delete API over\n  multiple underlying clusters. Writes (Inserts and Deletes) are sent to all\n  clusters, and a quorum is required for success. Reads (Selects) abide by one of\n  several read strategies. Some read strategies allow for the possibility of\n  read-repair.\n\n- **[roshi-server][roshi-server]** makes a Roshi farm accessible through a\n  REST-ish HTTP interface. It's effectively stateless, and [12-factor][twelve]\n  compliant.\n\n- **[roshi-walker][roshi-walker]** walks the keyspace in semirandom order at a\n  defined rate, making Select requests for each key in order to trigger\n  read-repairs.\n\n[sorted-set]: http://redis.io/commands#sorted_set\n[pool]: http://github.com/soundcloud/roshi/tree/master/pool\n[cluster]: http://github.com/soundcloud/roshi/tree/master/cluster\n[commutativity]: http://en.wikipedia.org/wiki/Commutative_property\n[farm]: http://github.com/soundcloud/roshi/tree/master/farm\n[roshi-server]: http://github.com/soundcloud/roshi/tree/master/roshi-server\n[twelve]: http://12factor.net\n[roshi-walker]: http://github.com/soundcloud/roshi/tree/master/roshi-walker\n\n## The big picture\n\n![Overview](http://i.imgur.com/SEeKquW.png)\n\n(Clusters need not have the same number of Redis instances.)\n\n# Development\n\nRoshi is written in [Go](http://golang.org). 
You'll need a recent version of\nGo installed on your computer to build Roshi. If you're on a Mac and use\n[homebrew](http://brew.sh), `brew install go` should work fine.\n\n## Build\n\n    go build ./...\n\n## Test\n\n    go test ./...\n\n# Running\n\nSee [roshi-server][roshi-server] and [roshi-walker][roshi-walker] for\ninformation about owning and operating your own Roshi.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoundcloud%2Froshi","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsoundcloud%2Froshi","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoundcloud%2Froshi/lists"}