{"id":15664627,"url":"https://github.com/avriiil/stream-this-dataset","last_synced_at":"2025-05-06T19:13:34.278Z","repository":{"id":220316442,"uuid":"624014208","full_name":"avriiil/stream-this-dataset","owner":"avriiil","description":"Code to convert static datasets into simulated data streams","archived":false,"fork":false,"pushed_at":"2023-04-06T11:46:23.000Z","size":710,"stargazers_count":15,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-06T19:13:31.697Z","etag":null,"topics":["dataset-generation","streaming-data"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/avriiil.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-04-05T15:03:54.000Z","updated_at":"2024-10-04T04:19:57.000Z","dependencies_parsed_at":"2024-02-01T12:46:09.906Z","dependency_job_id":null,"html_url":"https://github.com/avriiil/stream-this-dataset","commit_stats":null,"previous_names":["avriiil/stream-this-dataset"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avriiil%2Fstream-this-dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avriiil%2Fstream-this-dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avriiil%2Fstream-this-dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/avriiil%2Fstream-this-dataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/owners/avriiil","download_url":"https://codeload.github.com/avriiil/stream-this-dataset/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252752060,"owners_count":21798723,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset-generation","streaming-data"],"created_at":"2024-10-03T13:43:36.273Z","updated_at":"2025-05-06T19:13:34.242Z","avatar_url":"https://github.com/avriiil.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# stream-this-dataset\nPublic streaming datasets can be difficult to find. This repo contains code to convert static (batch) datasets into simulated data streams. This way you can take an existing public dataset and convert it into a streaming use case. \n\nThis code is WIP and currently only supports reading CSV and Parquet datasets. There's much to be improved on here in terms of scalability, parametrization and performance. Feel free to submit PRs (: \n\n\n## Dependencies\nThe `stream-this-dataset.py` script uses `pandas` to access the dataset and send each row as a single streaming message to an Apache Kafka cluster. \n\nA second example script (`analyze-stream-realtime.py`) is included that shows how you could read the data stream from Kafka and analyse it; in this case compiling a list of the locations with the most generous tips. 
This analysis script uses [`pathway`](www.pathway.com) to perform real-time streaming analysis.\n\n\n## Sample Datasets\nSample CSV and Parquet files containing rideshare (Uber/Lyft) data from the New York City Taxi and Limousine Commission (TLC) Trip Record dataset are provided. These are subsampled from the dataset for January 2022. Licensing information for this dataset can be found at https://www.nyc.gov/home/terms-of-use.page.\n\nPass the dataset path as a command-line argument when running the script: `python stream-this-dataset.py path/to/dataset.parquet`\n\n\n## A Note on Infrastructure\nThe Apache Kafka cluster is provisioned through [Upstash](www.upstash.com), which gives you a free, single-replica Kafka cluster. You will need to create an account to run this script yourself. Make sure to set the credentials as environment variables.\n\nNote that the free-tier Kafka cluster from Upstash is limited to 10K messages per day, so this setup is suitable only for testing. To scale this script for unlimited use, connect to a Kafka cluster you have full control over.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Favriiil%2Fstream-this-dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Favriiil%2Fstream-this-dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Favriiil%2Fstream-this-dataset/lists"}