https://github.com/datachefhq/sparkle

✨ A meta framework for Apache Spark, helping data engineers focus on solving business problems with the highest quality!

Host: GitHub
URL: https://github.com/datachefhq/sparkle
Owner: DataChefHQ
License: apache-2.0
Created: 2024-07-26T07:57:51.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-11-25T10:26:01.000Z (over 1 year ago)
Last Synced: 2025-02-26T17:53:16.384Z (over 1 year ago)
Language: Python
Homepage:
Size: 251 KB
Stars: 4
Watchers: 4
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

# Sparkle ✨

**Sparkle** is a meta-framework built on top of
[Apache Spark](https://spark.apache.org/), designed to streamline data
engineering workflows and accelerate the delivery of data
products. Developed by [**DataChef**](https://datachef.co), Sparkle
focuses on three main areas:

1. **Improving Developer Experience (DevEx) 🚀**

2. **Reducing Time to Market ⏱️**

3. **Easy Maintenance 🔧**

With these goals in mind, Sparkle has enabled DataChef to deliver
functional data products from day one, allowing for seamless handovers
to internal teams.

Read more about Sparkle on [DataChef's blog!](https://blog.datachef.co/sparkle-accelerating-data-engineering-with-datachefs-meta-framework)

## Key Features

### 1. Improved Developer Experience 🚀

Sparkle enhances the developer experience by abstracting away
non-business-critical aspects of Spark application development. It
achieves this through:

- **Sophisticated Configuration Mechanism**: Simplifies the setup and
  configuration of Spark applications, allowing developers to focus
  solely on business logic.
- **Automatic Functional Tests 🧪**: Generates tests for each
  application automatically, based on predefined input and output
  fixtures. This ensures that the application behaves as expected
  without requiring extensive manual testing.
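
The fixture-driven testing idea can be sketched in plain Python. This is only an illustration of the principle, not Sparkle's actual test generator: `run_fixture_test`, `total_per_customer`, and the fixture layout are hypothetical names, and lists of dicts stand in for Spark DataFrames.

```python
from typing import Callable

Rows = list[dict]  # stand-in for a DataFrame in this sketch


def total_per_customer(inputs: dict[str, Rows]) -> Rows:
    """Hypothetical business logic: sum order amounts per customer."""
    totals: dict[str, float] = {}
    for row in inputs["orders"]:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    return [{"customer": c, "total": t} for c, t in sorted(totals.items())]


def run_fixture_test(
    transform: Callable[[dict[str, Rows]], Rows],
    input_fixture: dict[str, Rows],
    expected_output: Rows,
) -> bool:
    """Run the transformation on the input fixture and compare with the expected rows."""
    return transform(input_fixture) == expected_output


# Predefined input and output fixtures (assumed layout)
input_fixture = {
    "orders": [
        {"customer": "alice", "amount": 10},
        {"customer": "bob", "amount": 5},
        {"customer": "alice", "amount": 7},
    ]
}
expected_output = [
    {"customer": "alice", "total": 17},
    {"customer": "bob", "total": 5},
]

print(run_fixture_test(total_per_customer, input_fixture, expected_output))  # True
```

Because the test only needs the two fixtures and the transformation, a framework can derive it for every application without the developer writing test code by hand.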

### 2. Reduced Time to Market ⏱️

Sparkle significantly reduces the time to market by automating the
deployment and testing processes. This allows data engineers to
concentrate exclusively on developing the business logic, with all
other aspects handled by Sparkle:

- **Automated Testing ✅**: Ensures that all applications are robust
  and ready for deployment without manual intervention.
- **Seamless Deployment 🚢**: Automates the deployment pipeline,
  reducing the time needed to bring new data products to market.

### 3. Enhanced Maintenance 🔧

Sparkle simplifies maintenance through heavy testing and abstraction
of non-business functional requirements. This provides a reliable and
trustworthy system that is easy to maintain:

- **Abstraction of Non-Business Logic 📦**: By focusing on business
  logic, Sparkle minimizes the complexity associated with maintaining
  Spark applications.
- **Heavily Tested Framework 🔍**: All non-business functionalities
  are thoroughly tested, reducing the risk of bugs and ensuring a
  stable environment for data applications.

## How It Works 🛠️

The Sparkle framework operates on a principle similar to Function as a
Service (FaaS). Developers can instantiate a Sparkle application that
takes a list of input DataFrames and focuses solely on transforming
these DataFrames according to the business logic. The Sparkle
application then automatically writes the output of this
transformation to the desired destination.
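
The FaaS-like split between framework-owned I/O and developer-owned logic can be sketched as follows. This is a toy model under stated assumptions, not Sparkle's implementation: `MiniApp`, `InMemoryReader`, `InMemoryWriter`, and the `run()` method are hypothetical, and lists of dicts stand in for DataFrames. Only the `process()` override and the `self.input[...]` lookup mirror names used in this README.

```python
from abc import ABC, abstractmethod

Rows = list[dict]  # stand-in for a DataFrame in this sketch


class InMemoryReader:
    """Hypothetical reader that returns the rows it was given."""

    def __init__(self, rows: Rows) -> None:
        self._rows = rows

    def read(self) -> Rows:
        return list(self._rows)


class InMemoryWriter:
    """Hypothetical writer that collects rows in a list."""

    def __init__(self) -> None:
        self.written: Rows = []

    def write(self, rows: Rows) -> None:
        self.written.extend(rows)


class MiniApp(ABC):
    """Hypothetical base class: the framework owns I/O, the subclass owns logic."""

    def __init__(self, readers: dict[str, InMemoryReader], writers: list[InMemoryWriter]) -> None:
        self.input = readers
        self.writers = writers

    @abstractmethod
    def process(self) -> Rows:
        """Business logic goes here."""

    def run(self) -> None:
        # The framework calls the business logic, then fans the result out to every writer.
        result = self.process()
        for writer in self.writers:
            writer.write(result)


class PaidOrders(MiniApp):
    def process(self) -> Rows:
        # The only code the developer writes: keep paid orders.
        return [row for row in self.input["orders"].read() if row["paid"]]


sink = InMemoryWriter()
app = PaidOrders(
    readers={"orders": InMemoryReader([{"id": 1, "paid": True}, {"id": 2, "paid": False}])},
    writers=[sink],
)
app.run()
print(sink.written)  # [{'id': 1, 'paid': True}]
```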

Sparkle follows a streamlined approach, designed to reduce effort in
data transformation workflows. Here’s how it works:

1. **Specify Input Locations and Types**: Easily set up input locations
and types for your data. Sparkle’s configuration makes this effortless,
removing typical setup hurdles and letting you get started
with minimal overhead.

    ```python
    ...
    config=Config(
        ...,
        kafka_input=KafkaReaderConfig(
            KafkaConfig(
                bootstrap_servers="localhost:9119",
                credentials=Credentials("test", "test"),
            ),
            kafka_topic="src_orders_v1",
        ),
    ),
    readers={"orders": KafkaReader},
    ...
    ```

2. **Define Business Logic**: This is where developers spend most of their time.
Using Sparkle, you create transformations on input DataFrames, shaping data
according to your business needs.

    ```python
    # Override the process method of the parent class
    def process(self) -> DataFrame:
        # Illustrative only: a real join would normally pass a join condition via `on=`
        return self.input["orders"].read().join(
            self.input["users"].read()
        )
    ```

3. **Specify Output Locations**: Sparkle automatically writes transformed data to
the specified output location, streamlining the output step to make data
available wherever it’s needed.

    ```python
    ...
    config=Config(
        ...,
        iceberg_output=IcebergConfig(
            database_name="all_products",
            table_name="orders_v1",
        ),
    ),
    writers=[IcebergWriter],
    ...
    ```

This structure lets developers concentrate on meaningful transformations while
Sparkle takes care of configurations, testing, and output management.
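
To illustrate why nested, typed configuration objects like the ones in steps 1 and 3 keep setup declarative, here is a minimal sketch using standard-library dataclasses. It does not reproduce Sparkle's `Config` classes: `AppConfig`, `KafkaInput`, `IcebergOutput`, and the `full_name` property are hypothetical, and only the literal values are borrowed from the snippets above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KafkaInput:
    """Hypothetical input settings."""

    bootstrap_servers: str
    topic: str


@dataclass(frozen=True)
class IcebergOutput:
    """Hypothetical output settings."""

    database_name: str
    table_name: str

    @property
    def full_name(self) -> str:
        return f"{self.database_name}.{self.table_name}"


@dataclass(frozen=True)
class AppConfig:
    """Hypothetical top-level config; validated once at construction time."""

    app_name: str
    version: str
    kafka_input: Optional[KafkaInput] = None
    iceberg_output: Optional[IcebergOutput] = None

    def __post_init__(self) -> None:
        if not self.app_name:
            raise ValueError("app_name must not be empty")


config = AppConfig(
    app_name="orders",
    version="0.0.1",
    kafka_input=KafkaInput(bootstrap_servers="localhost:9119", topic="src_orders_v1"),
    iceberg_output=IcebergOutput(database_name="all_products", table_name="orders_v1"),
)
print(config.iceberg_output.full_name)  # all_products.orders_v1
```

Freezing the dataclasses and validating in `__post_init__` means a misconfigured application fails when it is constructed rather than midway through a job.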

## Connectors 🔌

Sparkle offers specialized connectors for common data sources and sinks,
making data integration easier. These connectors are designed to
enhance, not replace, the standard Spark I/O options,
streamlining development by automating complex setup requirements.

### Readers

1. **Iceberg Reader**: Simplifies reading from Iceberg tables,
making integration with Spark workflows a breeze.
2. **Kafka Reader (with Avro schema registry)**: Ingest streaming data
from Kafka with seamless Avro schema registry integration, supporting
data consistency and schema evolution.

### Writers

1. **Iceberg Writer**: Easily write transformed data to Iceberg tables,
ideal for time-traveling, partitioned data storage.
2. **Kafka Writer**: Publish data to Kafka topics with ease, supporting
real-time analytics and downstream consumers.
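
As a rough picture of what a pluggable connector interface can look like, the sketch below defines reader and writer protocols with `typing.Protocol` and two toy implementations. The `read()` method matches the calls shown elsewhere in this README; everything else (`Reader`, `Writer`, `write()`, `CsvTextReader`, `JsonLinesWriter`) is an assumption made for illustration and should not be read as Sparkle's connector API.

```python
import csv
import io
import json
from typing import Protocol, runtime_checkable

Rows = list[dict]  # stand-in for a DataFrame in this sketch


@runtime_checkable
class Reader(Protocol):
    def read(self) -> Rows: ...


@runtime_checkable
class Writer(Protocol):
    def write(self, rows: Rows) -> None: ...


class CsvTextReader:
    """Toy source: parses CSV text into rows."""

    def __init__(self, text: str) -> None:
        self._text = text

    def read(self) -> Rows:
        return list(csv.DictReader(io.StringIO(self._text)))


class JsonLinesWriter:
    """Toy sink: serializes each row as one JSON line."""

    def __init__(self) -> None:
        self.lines: list[str] = []

    def write(self, rows: Rows) -> None:
        self.lines.extend(json.dumps(row, sort_keys=True) for row in rows)


reader = CsvTextReader("id,status\n1,paid\n2,open\n")
writer = JsonLinesWriter()
writer.write(reader.read())

# Any class with matching methods satisfies the protocol, no inheritance needed.
print(isinstance(reader, Reader), isinstance(writer, Writer))  # True True
print(writer.lines)
```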

## Getting Started 🚀

Sparkle is currently under heavy development, and we are continuously
working on improving and expanding its capabilities.

To stay updated on our progress and access the latest information,
follow us on [LinkedIn](https://nl.linkedin.com/company/datachefco)
and [GitHub](https://github.com/DataChefHQ/Sparkle).

## Example

This is the simplest example of an Orders pipeline: it reads records
from a Kafka topic and writes them to an Iceberg table:

```python
from pyspark.sql import DataFrame

from sparkle.application import Sparkle
from sparkle.config import Config, IcebergConfig, KafkaReaderConfig
from sparkle.config.kafka_config import KafkaConfig, Credentials
from sparkle.reader.kafka_reader import KafkaReader
from sparkle.writer.iceberg_writer import IcebergWriter


class CustomerOrders(Sparkle):
    def __init__(self):
        super().__init__(
            config=Config(
                app_name="orders",
                app_id="orders-app",
                version="0.0.1",
                database_bucket="s3://test-bucket",
                checkpoints_bucket="s3://test-checkpoints",
                # Destination Iceberg table
                iceberg_output=IcebergConfig(
                    database_name="all_products",
                    table_name="orders_v1",
                ),
                # Source Kafka topic and connection details
                kafka_input=KafkaReaderConfig(
                    KafkaConfig(
                        bootstrap_servers="localhost:9119",
                        credentials=Credentials("test", "test"),
                    ),
                    kafka_topic="src_orders_v1",
                ),
            ),
            # Map each input name to the reader class that loads it
            readers={"orders": KafkaReader},
            # Writers that receive the DataFrame returned by process()
            writers=[IcebergWriter],
        )

    def process(self) -> DataFrame:
        # Pass the Kafka records through unchanged
        return self.input["orders"].read()
```

## Contributing 🤝

We welcome contributions from the community! If you're interested in
contributing to Sparkle, please check our
[GitHub repository](https://github.com/DataChefHQ/Sparkle) for more details on
how you can get involved.

## License 📄

Sparkle is licensed under the Apache License 2.0. See the
[LICENSE](LICENSE) file for more details.

## Contact 📬

For more information, questions, or feedback, feel free to reach out
to us on [LinkedIn](https://nl.linkedin.com/company/datachefco) or
open an issue on our
[GitHub](https://github.com/DataChefHQ/sparkle/issues) repository.

---

Thank you for your interest in Sparkle! We're excited to have you join
us on this journey to revolutionize data engineering with Apache
Spark. 🎉
