{"id":24087958,"url":"https://github.com/tinybirdco/streaming_join_demo","last_synced_at":"2026-06-14T15:32:43.719Z","repository":{"id":192765285,"uuid":"687350586","full_name":"tinybirdco/streaming_join_demo","owner":"tinybirdco","description":null,"archived":false,"fork":false,"pushed_at":"2024-05-07T07:28:23.000Z","size":13,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-02-27T05:24:52.091Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tinybirdco.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-05T07:19:26.000Z","updated_at":"2024-07-19T00:20:15.000Z","dependencies_parsed_at":null,"dependency_job_id":"4afae508-a14a-44ad-964f-9638f24d1dbb","html_url":"https://github.com/tinybirdco/streaming_join_demo","commit_stats":null,"previous_names":["tinybirdco/streaming_join_demo"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/tinybirdco/streaming_join_demo","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tinybirdco%2Fstreaming_join_demo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tinybirdco%2Fstreaming_join_demo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tinybirdco%2Fstreaming_join_demo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tinybirdco%2Fstreaming_join_demo/manifests","owner_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/owners/tinybirdco","download_url":"https://codeload.github.com/tinybirdco/streaming_join_demo/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tinybirdco%2Fstreaming_join_demo/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":285636559,"owners_count":27205878,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-21T02:00:06.175Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-10T03:56:39.952Z","updated_at":"2025-11-21T15:03:55.264Z","avatar_url":"https://github.com/tinybirdco.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Streaming join demo\n\n## The issue\n\nLet's assume we have two streams of data that are sent independently —to different Kafka topics, or different Data Sources via Events API or whatever:\n\nGPS position:\n\n```json\n{\n  \"timestamp\": \"2022-10-27T11:43:02\",\n  \"vehicle_id\": \"8d1e1533-6071-4b10-9cda-b8429c1c7a67\",\n  \"latitude\": 40.4169866,\n  \"longitude\": -3.7034816\n}\n\n{\n  \"timestamp\": \"2022-10-27T11:44:03\",\n  \"vehicle_id\": \"8d1e1533-6071-4b10-9cda-b8429c1c7a67\",\n  \"latitude\": 40.4169867,\n  \"longitude\": -3.7034818\n}\n```\n\nand Vehicle data:\n\n```json\n{\n  \"timestamp\": \"2022-10-27T11:43:02\",\n  \"vehicle_id\": 
\"8d1e1533-6071-4b10-9cda-b8429c1c7a67\",\n  \"speed\": 91,\n  \"fuel_level_percentage\": 85\n}\n\n{\n  \"timestamp\": \"2022-10-27T11:44:03\",\n  \"vehicle_id\": \"8d1e1533-6071-4b10-9cda-b8429c1c7a67\",\n  \"speed\": 89,\n  \"fuel_level_percentage\": 84\n}\n```\n\nSometimes __joining them at query time__, that is, in the pipe whose output is an API Endpoint, is perfectly fine, and we recommend starting there and move to storing the joined data into a different Data Source only when neccessary.\n\nBut it is true that sometimes, due to performance needs, we want them joined using _timestamp_ and _vehicle_id_ in another Data Source:\n\n| timestamp | vehicle_id | latitude | longitude | speed | fuel_level_percentage |\n| :-| :- | -: | -: | -: | -: |\n| 2022-10-27T11:43:02 | 8d1e1533-6071-4b10-9cda-b8429c1c7a67 | 40.4169866 | -3.7034816 | 91 | 85 |\n| 2022-10-27T11:44:03 | 8d1e1533-6071-4b10-9cda-b8429c1c7a67 | 40.4169867 | -3.7034818 | 89 | 84 |\n\nSo, our first thought would be to create a [Materialized View](https://www.tinybird.co/docs/concepts/materialized-views.html) that joins both streams:\n\n```sql\nNODE mat_node\nSQL \u003e\n\n    SELECT timestamp, vehicle_id, latitude, longitude, speed, fuel_level_percentage\n    FROM gps_data g\n    JOIN\n        (\n            SELECT timestamp, vehicle_id, speed, fuel_level_percentage\n            FROM vehicle_data\n            WHERE (timestamp, vehicle_id) IN (SELECT timestamp, vehicle_id FROM gps_data)\n        ) v\n        ON g.vehicle_id = v.vehicle_id\n        AND g.timestamp = v.timestamp\n\nTYPE materialized\nDATASOURCE mat_node_mv\nENGINE \"MergeTree\"\nENGINE_PARTITION_KEY \"toYYYYMM(timestamp)\"\nENGINE_SORTING_KEY \"vehicle_id, timestamp\"\n```\n\nBut note that __we don't have control ove when data arrives to Tinybird and when it is ingested__. 
Probably someone would already have spotted the issue with the MV but let's simulate it to see what happens:\n\n```bash\ncd dataproject0_the_issue\n\ntb auth \n\ntb push\n\n. ../clean_and_ingest_rows.sh\n\ntb sql \"select * from vehicle_data\"\n#----------------------------------------------------------------------------------------------\n#| timestamp           | vehicle_id                           | speed | fuel_level_percentage |\n#----------------------------------------------------------------------------------------------\n#| 2022-10-27 11:44:03 | 8d1e1533-6071-4b10-9cda-b8429c1c7a67 |    91 |                    85 |\n#| 2022-10-27 11:43:02 | 8d1e1533-6071-4b10-9cda-b8429c1c7a67 |    91 |                    85 |\n#----------------------------------------------------------------------------------------------\n\ntb sql \"select * from gps_data\"\n#--------------------------------------------------------------------------------------\n#| timestamp           | vehicle_id                           | latitude | longitude  |\n#--------------------------------------------------------------------------------------\n#| 2022-10-27 11:43:02 | 8d1e1533-6071-4b10-9cda-b8429c1c7a67 | 40.41699 | -3.7034817 |\n#| 2022-10-27 11:44:03 | 8d1e1533-6071-4b10-9cda-b8429c1c7a67 | 40.41699 | -3.7034817 |\n#--------------------------------------------------------------------------------------\n\ntb sql \"select * from mv_joined_data\"\n#----------------------------------------------------------------------------------------------------------------------\n#| timestamp           | vehicle_id                           | latitude | longitude  | speed | fuel_level_percentage |\n#----------------------------------------------------------------------------------------------------------------------\n#| 2022-10-27 11:43:02 | 8d1e1533-6071-4b10-9cda-b8429c1c7a67 | 40.41699 | -3.7034817 |    91 |                    85 
|\n#----------------------------------------------------------------------------------------------------------------------\n```\n\nWhat happened here? Why are we missing one row in the _mat_gps_join_vehicle_ Data Source? From the [docs](https://www.tinybird.co/docs/guides/materialized-views.html#limitations:~:text=Materialized%20Views%20generated%20using%20JOIN%20clauses%20are%20tricky.%20The%20resulting%20Data%20Source%20will%20be%20only%20automatically%20updated%20if%20and%20when%20a%20new%20operation%20is%20performed%20over%20the%20Data%20Source%20in%20the%20FROM.):\n\n\u003e Materialized Views generated using JOIN clauses are tricky. The resulting Data Source will be only automatically updated if and when a new operation is performed over the Data Source in the FROM.\n\n## Different alternatives\n\nSo, to overcome this issue there are several alternatives, each one with its tradeoffs.\n\n### Two materializing pipes joining data and ending in the same Data Source\n\nThe easiest way would be to add another pipe that does the JOIN the other way:\n\n```sql\nNODE mat_node\nSQL \u003e\n\n    SELECT timestamp, vehicle_id, latitude, longitude, speed, fuel_level_percentage\n    FROM vehicle_data v\n    JOIN\n        (\n            SELECT timestamp, vehicle_id, latitude, longitude\n            FROM gps_data\n            WHERE (timestamp, vehicle_id) IN (SELECT timestamp, vehicle_id FROM vehicle_data)\n        ) g\n        ON g.vehicle_id = v.vehicle_id\n        AND g.timestamp = v.timestamp\n\nTYPE materialized\nDATASOURCE mat_node_mv\nENGINE \"MergeTree\"\nENGINE_PARTITION_KEY \"toYYYYMM(timestamp)\"\nENGINE_SORTING_KEY \"vehicle_id, timestamp\"\n```\n\nTesting it:\n\n```bash\ntb workspace clear --yes\n\ncd ../dataproject1_two_MVs_join\n\ncp ../dataproject0_the_issue/.tinyb ./\n\ntb push\n\n. 
../clean_and_ingest_rows.sh\n\ntb sql \"select * from vehicle_data\"\n\ntb sql \"select * from gps_data\"\n\ntb sql \"select * from mv_joined_data_from_2_pipes\"\n```\n\nNow we do have the expected 2 rows:\n\n```bash\n----------------------------------------------------------------------------------------------------------------------\n| timestamp           | vehicle_id                           | latitude | longitude  | speed | fuel_level_percentage |\n----------------------------------------------------------------------------------------------------------------------\n| 2022-10-27 11:44:03 | 8d1e1533-6071-4b10-9cda-b8429c1c7a67 | 40.41699 | -3.703482  |    89 |                    84 |\n| 2022-10-27 11:43:02 | 8d1e1533-6071-4b10-9cda-b8429c1c7a67 | 40.41699 | -3.7034817 |    91 |                    85 |\n----------------------------------------------------------------------------------------------------------------------\n```\n\nHowever, with high rates of ingest, there may be race conditions that lead to missing or duplicated rows, which may be solved by using a ReplacingMergeTree as the target Data Source.  \nAlso, depending on your scale and how well queries and sorting keys are defined, the JOIN approach can lead to memory errors.\n\nSo, if you face these errors or if you need a lot of accuracy, consider these other 2 options:\n\n### Two materializing pipes ending in a AggregatingMergeTree Data Source\n\nIt seems a bit strange going for a AggregatingMergeTree, but what we really want from here is its ability to materialize the streams into a DS independently and then the background process and the deduplication at query time would take care of joining them.\n\n```bash\ntb workspace clear --yes\n\ncd ../dataproject2_two_MVs_AggregatingMT\n\ncp ../dataproject0_the_issue/.tinyb ./\n\ntb push\n\n. 
../clean_and_ingest_rows.sh\n\ntb sql \"select * from vehicle_data\"\n\ntb sql \"select * from gps_data\"\n\ntb sql \"\nselect \n  timestamp, \n  vehicle_id, \n  argMaxMerge(latitude) latitude, \n  argMaxMerge(longitude) \n  longitude, argMaxMerge(speed) speed, \n  argMaxMerge(fuel_level_percentage) fuel_level_percentage \nfrom mv_combined_data_amt \ngroup by timestamp, vehicle_id\"\n```\n\n### Join with Copy Pipes instead of with MVs\n\n[Copy Pipes](https://www.tinybird.co/docs/publish/copy-pipes.html) can help us overcome some of the limitations of MVs for this use case, but some assumptions are needed.\n\n- We need to define a time window that we think is safe for our usecase. Otherwise we would be scanning the entire Data Source on every copy job and would make the solution prohibitive. In the example in this pipe we are taking 10 mins, so if some messages takes longer we may lose them in the joined target Data Source.\n\n```bash\ntb workspace clear --yes\n\ncd ../dataproject3_using_copy_pipes\n\ncp ../dataproject0_the_issue/.tinyb ./\n\ntb push\n\n. 
../clean_and_ingest_rows.sh\n\n\ntb sql \"select * from vehicle_data\"\n\ntb sql \"select * from gps_data\"\n\ntb pipe copy run copy_join --yes\n\n# You can check the copy status with `tb job details` and then query the data source once status is done.\n\ntb sql \"select * from ds_joined_data\"\n\n#----------------------------------------------------------------------------------------------------------------------\n#| timestamp           | vehicle_id                           | latitude | longitude  | speed | fuel_level_percentage |\n#----------------------------------------------------------------------------------------------------------------------\n#| 2022-10-27 11:43:02 | 8d1e1533-6071-4b10-9cda-b8429c1c7a67 | 40.41699 | -3.7034817 |    91 |                    85 |\n#| 2022-10-27 11:44:03 | 8d1e1533-6071-4b10-9cda-b8429c1c7a67 | 40.41699 | -3.703482  |    89 |                    84 |\n#----------------------------------------------------------------------------------------------------------------------\n\n```\n\n### Join with Copy Pipes and at query time (Kappa architecture)\n\nIf freshness is a hard requirement for your API Endpoint (and, in Tinybird, it usually is), this approach can be combined with joining at query time. This is far more performant than joining everything at query time, since you only have to join the data that was not processed in the latest copy batch. Also, with this [kappa](https://en.wikipedia.org/wiki/Lambda_architecture#Kappa_architecture) (batch join + realtime join) approach, we could relax the frequency of the scheduled copy operations.\n\nThe kappa pipe is equivalent* to _copy_join.pipe_, except that at the end we also retrieve the data from _ds_joined_data_.\n\n```sql\n\n--(... 
same as copy_join)\n\nNODE endpoint\nDESCRIPTION \u003e\n    and unioning them with the already processed\n\nSQL \u003e\n\n    SELECT * FROM ds_joined_data\n    UNION ALL\n    SELECT * FROM inner_join\n\n```\n\n*although applying some filters first if that matches your use case is highly recommended.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftinybirdco%2Fstreaming_join_demo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftinybirdco%2Fstreaming_join_demo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftinybirdco%2Fstreaming_join_demo/lists"}