{"id":20325443,"url":"https://github.com/getyourguide/ddataflow","last_synced_at":"2025-10-27T13:33:31.482Z","repository":{"id":41210763,"uuid":"486560153","full_name":"getyourguide/DDataFlow","owner":"getyourguide","description":"A tool to help you to test and develop pyspark code with sampled and local data","archived":false,"fork":false,"pushed_at":"2023-12-07T10:08:06.000Z","size":390,"stargazers_count":10,"open_issues_count":1,"forks_count":0,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-04-10T02:05:35.093Z","etag":null,"topics":["machine-learning","python","spark"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/getyourguide.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-04-28T11:10:06.000Z","updated_at":"2024-04-15T13:56:41.081Z","dependencies_parsed_at":"2024-04-15T13:56:38.191Z","dependency_job_id":null,"html_url":"https://github.com/getyourguide/DDataFlow","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getyourguide%2FDDataFlow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getyourguide%2FDDataFlow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getyourguide%2FDDataFlow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getyourguide%2FDDataFlow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/getyourguide",
"download_url":"https://codeload.github.com/getyourguide/DDataFlow/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248468528,"owners_count":21108835,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","python","spark"],"created_at":"2024-11-14T19:39:50.966Z","updated_at":"2025-10-27T13:33:26.461Z","avatar_url":"https://github.com/getyourguide.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DDataFlow\n\nDDataFlow is an end2end tests and local development solution for machine learning and data pipelines using pyspark.\nCheck out this blogpost if you want to [understand deeper its design motivation](https://www.getyourguide.careers/posts/ddataflow-a-tool-for-data-end-to-end-tests-for-machine-learning-pipelines).\n\n![ddataflow overview](docs/ddataflow.png)\n\nYou can find our documentation under this [link](https://code.getyourguide.com/DDataFlow/).\n\n## Features\n\n- Read a subset of our data so to speed up the running of the pipelines during tests\n- Write to a test location our artifacts so you don't pollute production\n- Download data for enabling local machine development\n\nEnables to run on the pipelines in the CI\n\n## 1. Install DDataflow\n\n```sh\npip install ddataflow \n```\n\n`ddataflow --help` will give you an overview of the available commands.\n\n\n# Getting Started (\u003c5min Tutorial)\n\n\n## 1. Setup some synthetic data\n\nSee the [examples folder](examples/pipeline.py).\n\n## 2. 
Create a ddataflow_config.py file\n\nThe command `ddataflow setup_project` creates a file like this for you.\n\n```py\nfrom ddataflow import DDataflow\n\nconfig = {\n    # add your tables or paths here, with customized sampling logic\n    \"data_sources\": {\n        \"demo_tours\": {\n            \"source\": lambda spark: spark.table('demo_tours'),\n            \"filter\": lambda df: df.limit(500)\n        },\n        \"demo_locations\": {\n            \"source\": lambda spark: spark.table('demo_locations'),\n            \"default_sampling\": True,\n        }\n    },\n    \"project_folder_name\": \"ddataflow_demo\",\n}\n\n# initialize the application and validate the configuration\nddataflow = DDataflow(**config)\n```\n\n## 3. Use ddataflow in a pipeline\n\n```py\nfrom pyspark.sql import SparkSession\n\nfrom ddataflow_config import ddataflow\n\n# reuse the active Spark session (or create one) for the SQL query below\nspark = SparkSession.builder.getOrCreate()\n\n# replace spark.table with ddataflow.source, which returns a spark dataframe\nprint(ddataflow.source('demo_locations').count())\n# for sql queries, replace only the table name with the sampled data source name provided by ddataflow\nprint(spark.sql(f\"\"\" SELECT COUNT(1) from {ddataflow.name('demo_tours')}\"\"\").collect()[0]['count(1)'])\n```\n\nNow run it twice and observe the difference in the number of records:\n`python pipeline.py`\n\n`ENABLE_DDATAFLOW=True python pipeline.py`\n\nYou will see that the dataframes are sampled when ddataflow is enabled and full when the tool is disabled.\n\nYou completed the short demo!\n\n## How to develop\n\nThe recommended way to use ddataflow is in offline mode, which allows you to test your pipelines without the need for an active cluster. This is especially important for development and debugging, as it allows you to quickly test your pipelines and identify any issues with them.\n\nAlternatively, you can use Databricks Connect to test your pipelines on an active cluster. 
However, our experience with this approach has not been great: memory issues are common and there is a risk of overwriting production data, so we recommend using the offline mode instead.\n\nIf you have any questions or need any help, please don't hesitate to reach out. We are here to help you get the most out of ddataflow.\n\n\n## Support\n\nIf you have questions, feel free to reach out or create an issue.\n\nCheck out our [FAQ in case of problems](https://github.com/getyourguide/DDataFlow/blob/main/docs/FAQ.md)\n\n## Contributing\n\nWe welcome contributions to DDataFlow! If you would like to contribute, please follow these guidelines:\n\n1. Fork the repository and create a new branch for your contribution.\n2. Make your changes and ensure that the code passes all tests.\n3. Submit a pull request with a clear description of your changes and the problem they solve.\n\nPlease note that all contributions are subject to review and approval by the project maintainers. We appreciate your help in making DDataFlow even better!\n\nIf you have any questions or need any help, please don't hesitate to reach out. We are here to assist you throughout the contribution process.\n\n## License\nDDataFlow is licensed under the [MIT License](https://github.com/getyourguide/DDataFlow/blob/main/LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgetyourguide%2Fddataflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgetyourguide%2Fddataflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgetyourguide%2Fddataflow/lists"}