{"id":25862800,"url":"https://github.com/datachefhq/sparkle","last_synced_at":"2025-03-01T23:56:49.023Z","repository":{"id":250279559,"uuid":"833998729","full_name":"DataChefHQ/sparkle","owner":"DataChefHQ","description":"✨ A meta framework for Apache Spark, helping data engineers to focus on solving business problems with highest quality!","archived":false,"fork":false,"pushed_at":"2024-11-25T10:26:01.000Z","size":257,"stargazers_count":4,"open_issues_count":3,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-02-26T17:53:16.384Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DataChefHQ.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-26T07:57:51.000Z","updated_at":"2024-12-16T16:28:03.000Z","dependencies_parsed_at":"2024-07-26T09:26:09.131Z","dependency_job_id":"1f409d98-0951-470b-85fb-d1f5edcda114","html_url":"https://github.com/DataChefHQ/sparkle","commit_stats":null,"previous_names":["datachefhq/sparkle"],"tags_count":29,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataChefHQ%2Fsparkle","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataChefHQ%2Fsparkle/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataChefHQ%2Fsparkle/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataChefHQ%2Fsparkle/ma
nifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DataChefHQ","download_url":"https://codeload.github.com/DataChefHQ/sparkle/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241439767,"owners_count":19963100,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-01T23:56:48.262Z","updated_at":"2025-03-01T23:56:48.984Z","avatar_url":"https://github.com/DataChefHQ.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Sparkle ✨\n\n**Sparkle** is a meta-framework built on top of [Apache\nSpark](https://spark.apache.org/), designed to streamline data\nengineering workflows and accelerate the delivery of data\nproducts. Developed by [**DataChef**](https://datachef.co), Sparkle\nfocuses on three main areas:\n\n1. **Improving Developer Experience (DevEx) 🚀**\n2. **Reducing Time to Market ⏱️**\n3. **Easy Maintenance 🔧**\n\nWith these goals in mind, Sparkle has enabled DataChef to deliver\nfunctional data products from day one, allowing for seamless handovers\nto internal teams.\n\nRead more about Sparkle on [DataChef's blog!](https://blog.datachef.co/sparkle-accelerating-data-engineering-with-datachefs-meta-framework)\n\n## Key Features\n\n### 1. Improved Developer Experience 🚀\n\nSparkle enhances the developer experience by abstracting away\nnon-business-critical aspects of Spark application development. 
It\nachieves this through:\n\n- **Sophisticated Configuration Mechanism**: Simplifies the setup and\n  configuration of Spark applications, allowing developers to focus\n  solely on business logic.\n- **Automatic Functional Tests 🧪**: Generates tests for each\n  application automatically, based on predefined input and output\n  fixtures. This ensures that the application behaves as expected\n  without requiring extensive manual testing.\n\n### 2. Reduced Time to Market ⏱️\n\nSparkle significantly reduces the time to market by automating the\ndeployment and testing processes. This allows data engineers to\nconcentrate exclusively on developing the business logic, with all\nother aspects handled by Sparkle:\n\n- **Automated Testing ✅**: Ensures that all applications are robust\n  and ready for deployment without manual intervention.\n- **Seamless Deployment 🚢**: Automates the deployment pipeline,\n  reducing the time needed to bring new data products to market.\n\n### 3. Enhanced Maintenance 🔧\n\nSparkle simplifies maintenance through heavy testing and abstraction\nof non-business functional requirements. This provides a reliable and\ntrustworthy system that is easy to maintain:\n\n- **Abstraction of Non-Business Logic 📦**: By focusing on business\n  logic, Sparkle minimizes the complexity associated with maintaining\n  Spark applications.\n- **Heavily Tested Framework 🔍**: All non-business functionalities\n  are thoroughly tested, reducing the risk of bugs and ensuring a\n  stable environment for data applications.\n\n## How It Works 🛠️\n\nThe Sparkle framework operates on a principle similar to Function as a\nService (FaaS). Developers can instantiate a Sparkle application that\ntakes a list of input DataFrames and focuses solely on transforming\nthese DataFrames according to the business logic. 
The Sparkle\napplication then automatically writes the output of this\ntransformation to the desired destination.\n\nSparkle follows a streamlined approach, designed to reduce effort in\ndata transformation workflows. Here’s how it works:\n\n1. **Specify Input Locations and Types**: Easily set up input locations\nand types for your data. Sparkle’s configuration makes this effortless,\nremoving typical setup hurdles and letting you get started\nwith minimal overhead.\n\n    ```python\n    ...\n    config=Config(\n      ...,\n      kafka_input=KafkaReaderConfig(\n                        KafkaConfig(\n                            bootstrap_servers=\"localhost:9119\",\n                            credentials=Credentials(\"test\", \"test\"),\n                        ),\n                        kafka_topic=\"src_orders_v1\",\n                    )\n    ),\n    readers={\"orders\": KafkaReader},\n    ...\n    ```\n\n2. **Define Business Logic**: This is where developers spend most of their time.\nUsing Sparkle, you create transformations on input DataFrames, shaping data\naccording to your business needs.\n\n    ```python\n    # Override process function from parent class\n    def process(self) -\u003e DataFrame:\n            return self.input[\"orders\"].read().join(\n                self.input[\"users\"].read()\n            )\n    ```\n\n3. 
**Specify Output Locations**: Sparkle automatically writes transformed data to\nthe specified output location, streamlining the output step to make data\navailable wherever it’s needed.\n\n    ```python\n    ...\n    config=Config(\n      ...,\n      iceberg_output=IcebergConfig(\n                        database_name=\"all_products\",\n                        table_name=\"orders_v1\",\n                    ),\n    ),\n    writers=[IcebergWriter],\n    ...\n    ```\n\n\nThis structure lets developers concentrate on meaningful transformations while\nSparkle takes care of configurations, testing, and output management.\n\n## Connectors 🔌\n\nSparkle offers specialized connectors for common data sources and sinks,\nmaking data integration easier. These connectors are designed to\nenhance—not replace—the standard Spark I/O options,\nstreamlining development by automating complex setup requirements.\n\n### Readers\n\n1. **Iceberg Reader**: Simplifies reading from Iceberg tables,\nmaking integration with Spark workflows a breeze.\n\n2. **Kafka Reader (with Avro schema registry)**: Ingest streaming data\nfrom Kafka with seamless Avro schema registry integration, supporting\ndata consistency and schema evolution.\n\n### Writers\n\n1. **Iceberg Writer**: Easily write transformed data to Iceberg tables,\nideal for time-traveling, partitioned data storage.\n\n2. 
**Kafka Writer**: Publish data to Kafka topics with ease, supporting\nreal-time analytics and downstream consumers.\n\n## Getting Started 🚀\n\nSparkle is currently under heavy development, and we are continuously\nworking on improving and expanding its capabilities.\n\nTo stay updated on our progress and access the latest information,\nfollow us on [LinkedIn](https://nl.linkedin.com/company/datachefco)\nand [GitHub](https://github.com/DataChefHQ/Sparkle).\n\n## Example\n\nThe simplest example is an Orders pipeline that reads records\nfrom a Kafka topic and writes them to an Iceberg table:\n\n```python\nfrom sparkle.config import Config, IcebergConfig, KafkaReaderConfig\nfrom sparkle.config.kafka_config import KafkaConfig, Credentials\nfrom sparkle.writer.iceberg_writer import IcebergWriter\nfrom sparkle.application import Sparkle\nfrom sparkle.reader.kafka_reader import KafkaReader\n\nfrom pyspark.sql import DataFrame\n\n\nclass CustomerOrders(Sparkle):\n  def __init__(self):\n      super().__init__(\n          config=Config(\n              app_name=\"orders\",\n              app_id=\"orders-app\",\n              version=\"0.0.1\",\n              database_bucket=\"s3://test-bucket\",\n              checkpoints_bucket=\"s3://test-checkpoints\",\n              iceberg_output=IcebergConfig(\n                  database_name=\"all_products\",\n                  table_name=\"orders_v1\",\n              ),\n              kafka_input=KafkaReaderConfig(\n                  KafkaConfig(\n                      bootstrap_servers=\"localhost:9119\",\n                      credentials=Credentials(\"test\", \"test\"),\n                  ),\n                  kafka_topic=\"src_orders_v1\",\n              ),\n          ),\n          readers={\"orders\": KafkaReader},\n          writers=[IcebergWriter],\n      )\n\n  def process(self) -\u003e DataFrame:\n      return self.input[\"orders\"].read()\n```\n\n## Contributing 🤝\n\nWe welcome contributions from the community! 
If you're interested in\ncontributing to Sparkle, please check our [GitHub\nrepository](https://github.com/DataChefHQ/Sparkle) for more details on\nhow you can get involved.\n\n## License 📄\n\nSparkle is licensed under the Apache v2.0 License. See the\n[LICENSE](LICENSE) file for more details.\n\n## Contact 📬\n\nFor more information, questions, or feedback, feel free to reach out\nto us on [LinkedIn](https://nl.linkedin.com/company/datachefco) or\nopen an issue on our\n[GitHub](https://github.com/DataChefHQ/sparkle/issues) repository.\n\n---\n\nThank you for your interest in Sparkle! We're excited to have you join\nus on this journey to revolutionize data engineering with Apache\nSpark. 🎉\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatachefhq%2Fsparkle","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatachefhq%2Fsparkle","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatachefhq%2Fsparkle/lists"}