{"id":29016522,"url":"https://github.com/memgraph/data-streams","last_synced_at":"2025-06-25T22:30:34.181Z","repository":{"id":37876036,"uuid":"422431890","full_name":"memgraph/data-streams","owner":"memgraph","description":"Publicly available real-time data sets on Kafka, Redpanda, RabbitMQ \u0026 Apache Pulsar","archived":false,"fork":false,"pushed_at":"2024-08-02T10:14:03.000Z","size":8186,"stargazers_count":30,"open_issues_count":3,"forks_count":3,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-08-02T11:45:40.908Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://github.com/g-despot/data-streams","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/memgraph.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-10-29T03:36:40.000Z","updated_at":"2024-06-27T06:01:50.000Z","dependencies_parsed_at":"2022-08-18T17:52:17.379Z","dependency_job_id":null,"html_url":"https://github.com/memgraph/data-streams","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/memgraph/data-streams","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/memgraph%2Fdata-streams","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/memgraph%2Fdata-streams/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/memgraph%2Fdata-streams/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/memgraph%2Fdata-streams/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/memgraph","download_url":"https://codeload.github.
com/memgraph/data-streams/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/memgraph%2Fdata-streams/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261962055,"owners_count":23236861,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-25T22:30:33.487Z","updated_at":"2025-06-25T22:30:34.133Z","avatar_url":"https://github.com/memgraph.png","language":"Python","readme":"\u003ch1 align=\"center\"\u003e :bar_chart: data-streams :bar_chart:\u003c/h1\u003e\n\u003cp align=\"center\"\u003e Publicly available real-time data sets on Kafka, Redpanda, RabbitMQ \u0026 Apache Pulsar\u003c/p\u003e\n\n## :speech_balloon: About\n\nThis project serves as a starting point for analyzing real-time streaming data.\nWe have prepared a few cool datasets which can be streamed via Kafka, Redpanda,\nRabbitMQ, and Apache Pulsar. 
Right now, you can clone/fork the repo and start\nthe service locally, but we will be adding publicly available clusters to which\nyou can just connect.\n\n## :open_file_folder: Datasets\n\nCurrently available datasets:\n\n- [Art Blocks](./datasets/art-blocks/data)\n- [GitHub](./datasets/github/data)\n- [MovieLens](./datasets/movielens/data)\n- [Amazon books](./datasets/amazon-books/data/)\n\n## :fast_forward: How to start the streams?\n\nFrom the repository root folder, run:\n\n```\npython3 start.py --platforms \u003cPLATFORMS\u003e --dataset \u003cDATASET\u003e\n```\n\nThe argument `\u003cPLATFORMS\u003e` can be:\n- `kafka`,\n- `redpanda`,\n- `rabbitmq` and/or\n- `pulsar`.\n\nThe argument `\u003cDATASET\u003e` can be:\n- `github`,\n- `art-blocks`,\n- `movielens` or\n- `amazon-books`.\n\nThe script starts the chosen streaming platforms in Docker containers, and you will see messages from the chosen dataset being consumed.\n\nYou can then connect with Memgraph and stream the data into the database by running:\n```\ndocker-compose up \u003cDATASET\u003e-memgraph\n```\n\nFor example, if you choose Kafka as a streaming platform and art-blocks for your dataset, you should run:\n```\npython3 start.py --platforms kafka --dataset art-blocks\n```\n\n\u003e If you are a Windows user and the command above doesn't work, try replacing `python3` with `python`.\n\nNext, in a new terminal window, run:\n```\ndocker-compose up art-blocks-memgraph\n```\n\n## :scroll: References\n\nThere's no documentation yet, but it's coming soon! Throw us a star to keep up with upcoming changes.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmemgraph%2Fdata-streams","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmemgraph%2Fdata-streams","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmemgraph%2Fdata-streams/lists"}