{"id":17780724,"url":"https://github.com/amrdb/data-services","last_synced_at":"2025-03-15T22:31:18.702Z","repository":{"id":259222483,"uuid":"876714603","full_name":"amrdb/data-services","owner":"amrdb","description":"A high-performance, distributed data access layer implementing request coalescing and hash-based routing to reduce database load and prevent hot partitions. ","archived":false,"fork":false,"pushed_at":"2024-10-26T21:15:02.000Z","size":32,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-10-26T21:49:34.196Z","etag":null,"topics":["cassandra","docker","docker-compose","go","golang","grpc"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/amrdb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-22T12:44:25.000Z","updated_at":"2024-10-26T21:15:06.000Z","dependencies_parsed_at":"2024-10-26T21:50:23.400Z","dependency_job_id":"67f5bbc2-f9f4-4361-87db-502b615c47f3","html_url":"https://github.com/amrdb/data-services","commit_stats":null,"previous_names":["amr2812/data-services","amrdb/data-services"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amrdb%2Fdata-services","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amrdb%2Fdata-services/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amrdb%2Fdata-services/releases","manifests_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/amrdb%2Fdata-services/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/amrdb","download_url":"https://codeload.github.com/amrdb/data-services/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243801600,"owners_count":20350105,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cassandra","docker","docker-compose","go","golang","grpc"],"created_at":"2024-10-27T03:03:43.131Z","updated_at":"2025-03-15T22:31:13.693Z","avatar_url":"https://github.com/amrdb.png","language":"Go","funding_links":[],"categories":["Go"],"sub_categories":[],"readme":"# Data Services\n\nA high-performance, distributed data access layer implementing request coalescing and hash-based routing to reduce database load and prevent hot partitions. \n\n## Overview\n\nData Services is a middleware layer that sits between API servers and Cassandra clusters, providing request coalescing. It's designed to handle high-traffic scenarios efficiently by reducing duplicate database queries and preventing database overload.\nIt is inspired by Discord's architecture, as described in their blog post: [HOW DISCORD STORES TRILLIONS OF MESSAGES](https://discord.com/blog/how-discord-stores-trillions-of-messages).\n\nOne use case from Discord: when a big announcement is sent on a large server (a Discord group) and notifies @everyone, many users open the app and read the same message at once, sending a burst of traffic to the database. 
This is where request coalescing helps: it combines all concurrent requests for the same data into a single database query, reducing the load on the database and preventing hot partitions.\n\nA simpler way to think of it is as caching with a lifetime equal to the time spent running the query. No client has to be aware of the coalescing because the maximum staleness is the same as if each client had run the query themselves. It also doesn't require extra memory, because the query result falls out of scope as soon as it is sent to all waiters.\n\n### Key Features\n\n- **Request Coalescing**: Automatically combines duplicate requests for the same data into a single database query\n- **Consistent Hash-based Routing**: Routes related requests to the same service instance for optimal coalescing\n- **Distributed Architecture**: Multiple service instances working in parallel\n- **High Availability**: Data service nodes are stateless and can be scaled horizontally\n- **Monitoring**: Built-in metrics for tracking request and query counts\n\n## Setup\n\n```bash\n$ docker-compose up --build\n```\nWait for the services to start up. The Cassandra cluster will be initialized with the required keyspace. 
You will see something like this in the logs that shows that the data service instances are ready to accept requests:\n```\ndata-service1-1   | 2024/10/26 16:18:45 Connected to cassandra\ndata-service1-1   | 2024/10/26 16:18:45 Starting server on port 50051\ndata-service2-1   | 2024/10/26 16:18:45 Connected to cassandra\ndata-service2-1   | 2024/10/26 16:18:45 Starting server on port 50052\n```\n\nRun the client CLI to send test requests to the data service:\n```bash\n$ go run ./client -h\n  -channels int\n        Number of unique channels to distribute requests across (number of unique requests) (default 20)\n  -requests int\n        Total number of requests to send (default 10000)\n```\n\n\nExample usage:\n```\n$ go run ./client\n2024/10/26 17:08:11 Unique requests: 20, Total requests: 10000, Total queries executed: 184\n2024/10/26 17:08:11 Average queries per request: 0.0184\n2024/10/26 17:08:11 Saved queries by coalescing: 9816\n2024/10/26 17:08:11 Total time taken: 816.727364ms\n```\n\n## Architecture\n\n### Components\n\n1. **Data Service Nodes**: gRPC servers that handle incoming requests and manage database connections\n2. **Cassandra Cluster**: A 3-node Cassandra cluster for data storage\n3. 
**Client**: gRPC Test client for simulating high-traffic scenarios\n\n```mermaid\n%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '16px', 'fontFamily': 'arial', 'nodeTextSize': '16px', 'labelTextSize': '16px', 'titleTextSize': '20px' }}}%%\n\nflowchart TB\n    subgraph Users [\"Multiple Users\"]\n        U1[User 1]\n        U2[User 2]\n        U3[User 3]\n    end\n\n    subgraph API [\"Hash-based Routing API\"]\n        A1[API Server 1]\n        A2[API Server 2]\n    end\n\n    subgraph DS [\"Data Services Layer\"]\n        direction TB\n        subgraph DS1 [\"Data Service Instance 1\"]\n            direction TB\n            subgraph Coalescing1 [\"Request Coalescing\"]\n                R1[\"Request (Channel 1)\"]\n                R2[\"Request (Channel 1)\"]\n                R3[\"Request (Channel 1)\"]\n                RC1[Request Coalescer]\n                DQ1[\"Single DB Query\"]\n                R1 \u0026 R2 \u0026 R3 --\u003e RC1\n                RC1 --\u003e DQ1\n            end\n        end\n        subgraph DS2 [\"Data Service Instance 2\"]\n            direction TB\n            subgraph Coalescing2 [\"Request Coalescing\"]\n                R4[\"Request (Channel 2)\"]\n                R5[\"Request (Channel 2)\"]\n                R6[\"Request (Channel 2)\"]\n                RC2[Request Coalescer]\n                DQ2[\"Single DB Query\"]\n                R4 \u0026 R5 \u0026 R6 --\u003e RC2\n                RC2 --\u003e DQ2\n            end\n        end\n    end\n\n    subgraph DB [\"Cassandra Cluster\"]\n        C1[Node 1] \u003c--\u003e C2[Node 2] \u003c--\u003e C3[Node 3] \u003c--\u003e C1\n    end\n\n    %% Connect users to API\n    U1 --\u003e A1\n    U2 --\u003e API\n    U3 --\u003e A2\n\n    %% Connect API to Data Services\n    A1 --\u003e DS1\n    A1 --\u003e DS2\n    A2 --\u003e DS1\n    A2 --\u003e DS2\n\n    %% Connect Data Services to Cassandra\n    DS1 --\u003e DB\n    DS2 --\u003e DB\n\n    classDef users 
fill:#B3E5FC,stroke:#0277BD,color:#000000,font-size:16px\n    classDef api fill:#FFB74D,stroke:#E65100,color:#000000,font-size:16px\n    classDef dataservice fill:#CE93D8,stroke:#6A1B9A,color:#000000,font-size:16px\n    classDef cassandra fill:#81C784,stroke:#2E7D32,color:#000000,font-size:16px\n    classDef component fill:#E0E0E0,stroke:#424242,color:#000000,font-size:16px\n\n    class U1,U2,U3 users\n    class A1,A2 api\n    class DS1,DS2,R1,R2,R3,R4,R5,R6,RC1,RC2,DQ1,DQ2 dataservice\n    class C1,C2,C3 cassandra\n```\n\n## Technical Learnings\n\n### Go Concurrency Patterns\n1. **Channels**\n   - Used for async communication between goroutines\n   - Each request runs in its own goroutine and waits on a channel for the response from the query executor goroutine\n\n2. **Mutex Operations**\n   - Implemented thread-safe access to shared resources\n\n3. **Atomic Operations**\n   - Used lock-free atomic counters for metrics tracking\n\n4. **WaitGroups**\n   - Used for waiting on multiple goroutines to complete in the CLI client\n\n5. **Context Management**\n   - Used context for request timeouts and cancellation\n\n\n### gRPC Implementation\n- Defined service interfaces using Protocol Buffers\n- Managed timeout handling using context\n\n### Docker and Container Orchestration\n- Implemented health checks for service readiness\n- Managed container dependencies and startup order\n- Configured networking between services\n- Implemented volume management for data persistence\n\n## Future Improvements\n\n1. **Monitoring \u0026 Observability**\n   - Add distributed tracing\n\n2. **Scalability**\n   - Implement dynamic service discovery (e.g. Consul or etcd)\n\n3. 
**Resilience**\n   - Add circuit breakers\n   - Implement retry policies\n   - Add rate limiting\n\n## License\n\nThis project is licensed under the MIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famrdb%2Fdata-services","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famrdb%2Fdata-services","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famrdb%2Fdata-services/lists"}