Sparkle: Accelerating Data Engineering with DataChef’s Meta-Framework

Sparkle is revolutionizing the way data engineers build, deploy, and maintain data products. Built on top of Apache Spark, Sparkle is designed by DataChef to streamline workflows and create a seamless experience from development to deployment. Our goal is simple: enable developers to focus on transforming data, without worrying about the complexities of setup, testing, and maintenance.

With three primary objectives, Sparkle aims to:

  1. Enhance Developer Experience (DevEx) 🚀

  2. Reduce Time to Market ⏱️

  3. Simplify Maintenance 🔧


Key Features of Sparkle

1. Developer Experience (DevEx) Like Never Before 🚀

Sparkle simplifies developer workflows with a configuration mechanism that abstracts away Spark-specific complexities, making it easier to focus on business logic.

  • Sophisticated Configuration Mechanism: Sparkle streamlines Spark app setup, so developers can focus on transforming data, not configuring settings.
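To make the contrast concrete, below is a rough sketch of the plain-PySpark session setup that a configuration layer like Sparkle's is meant to absorb. The settings shown are illustrative only, not Sparkle defaults.

# Purely for contrast: hand-rolled SparkSession setup that you would
# otherwise repeat in every application. Catalog and tuning values
# here are illustrative, not Sparkle defaults.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orders")
    .config("spark.sql.catalog.all_products", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.shuffle.partitions", "64")
    .getOrCreate()
)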

2. Faster Time to Market ⏱️

Sparkle’s emphasis on automation allows teams to release data products quickly, bypassing the need for extensive manual testing.

  • Automated Testing ✅: Testing is handled automatically, ensuring all applications are deployment-ready (see the sketch after this list).

  • Seamless Deployment 🚢: Sparkle’s automated deployment pipeline means data products reach the market faster, with fewer roadblocks.
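Because business logic is written against plain DataFrames, it can be exercised with an ordinary local SparkSession. The test below is a hand-written, illustrative sketch of the kind of check Sparkle automates; the table and column names are hypothetical.

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # A small local session is enough to exercise pure DataFrame logic.
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


def test_orders_join_users(spark):
    # Hypothetical inputs standing in for the "orders" and "users" sources.
    orders = spark.createDataFrame([(1, "user-a")], ["order_id", "user_id"])
    users = spark.createDataFrame([("user-a", "NL")], ["user_id", "country"])

    result = orders.join(users, on="user_id")

    assert result.count() == 1
    assert set(result.columns) == {"user_id", "order_id", "country"}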

3. Hassle-Free Maintenance 🔧

Sparkle reduces maintenance complexity by abstracting non-business logic and conducting rigorous framework testing, making it easy to keep applications stable over time.

  • Abstraction of Non-Business Logic 📦: Sparkle focuses on business transformations, offloading other concerns.

  • Heavily Tested Framework 🔍: All non-business functionalities are thoroughly tested, reducing the risk of bugs and ensuring a stable environment for data applications.


Connectors for Seamless Data Integration 🔌

Sparkle offers specialized connectors for common data sources and sinks, making data integration easier. These connectors are designed to enhance—not replace—the standard Spark I/O options, streamlining development by automating complex setup requirements.

Readers

  1. Iceberg Reader: Simplifies reading from Iceberg tables, making integration with Spark workflows a breeze.

  2. Kafka Reader (with Avro schema registry): Ingest streaming data from Kafka with seamless Avro schema registry integration, supporting data consistency and schema evolution.

Writers

  1. Iceberg Writer: Easily write transformed data to Iceberg tables, ideal for partitioned data storage with time-travel support.

  2. Kafka Writer: Publish data to Kafka topics with ease, supporting real-time analytics and downstream consumers.

Integration Tests with Docker for Kafka

To ensure reliable Kafka integration, Sparkle ships Docker-based integration tests that simulate real-world scenarios and validate both data ingestion and output, giving applications that depend on Kafka a stable, well-tested foundation.
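For readers who want to reproduce the idea in their own projects, here is a minimal sketch using the testcontainers-python package. It is not Sparkle's own test harness, just an illustration of spinning up a disposable Kafka broker in Docker for a test.

from testcontainers.kafka import KafkaContainer


def test_kafka_roundtrip():
    # Start a throwaway Kafka broker in Docker for the duration of the test.
    with KafkaContainer() as kafka:
        bootstrap = kafka.get_bootstrap_server()
        # Point a KafkaReader/KafkaWriter pair (or the whole pipeline) at
        # `bootstrap`, produce a few records, run it, and assert on the
        # rows that come out the other side.
        assert bootstrap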


How Sparkle Works 🛠️

Sparkle follows a streamlined approach, designed to reduce effort in data transformation workflows. Here’s how it works:

  1. Specify Input Locations and Types: Easily set up input locations and types for your data. Sparkle’s configuration makes this effortless, removing typical setup hurdles and letting you get started with minimal overhead.

     ...
     config=Config(
       ...,
       kafka_input=KafkaReaderConfig(
           KafkaConfig(
               bootstrap_servers="localhost:9119",
               credentials=Credentials("test", "test"),
           ),
           kafka_topic="src_orders_v1",
       ),
     ),
     readers={"orders": KafkaReader},
     ...
    
  2. Define Business Logic: This is where developers spend most of their time. Using Sparkle, you create transformations on input DataFrames, shaping data according to your business needs. (A fuller illustrative version appears after this walkthrough.)

     # Override process function from parent class
     def process(self) -> DataFrame:
         return self.input["orders"].read().join(
             self.input["users"].read()
         )
    
  3. Specify Output Locations: Sparkle automatically writes transformed data to the specified output location, streamlining the output step to make data available wherever it’s needed.

     ...
     config=Config(
       ...,
       iceberg_output=IcebergConfig(
           database_name="all_products",
           table_name="orders_v1",
       ),
     ),
     writers=[IcebergWriter],
     ...
    

This structure lets developers concentrate on meaningful transformations while Sparkle takes care of configurations, testing, and output management.
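Building on step 2, a slightly fuller process() might join the inputs on an explicit key and apply a filter. The column names below are hypothetical and only illustrate the shape of typical business logic.

def process(self) -> DataFrame:
    orders = self.input["orders"].read()
    users = self.input["users"].read()
    # Join on an explicit key and keep only completed orders
    # (column names are hypothetical).
    return orders.join(users, on="user_id").where(orders["status"] == "completed")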


Getting Started with Sparkle 🚀

To get started with Sparkle:

  1. Install Sparkle.

  2. Configure input DataFrames.

  3. Define transformation logic.

  4. Let Sparkle handle testing, deployment, and connector management.

Stay updated by following DataChef on LinkedIn, and check out Sparkle’s development on GitHub.


Basic Example

Here is the simplest example: an Orders pipeline that reads records from a Kafka topic and writes them to an Iceberg table:

from sparkle.config import Config, IcebergConfig, KafkaReaderConfig
from sparkle.config.kafka_config import KafkaConfig, Credentials
from sparkle.writer.iceberg_writer import IcebergWriter
from sparkle.application import Sparkle
from sparkle.reader.kafka_reader import KafkaReader

from pyspark.sql import DataFrame


class CustomerOrders(Sparkle):
    def __init__(self):
        super().__init__(
            config=Config(
                app_name="orders",
                app_id="orders-app",
                version="0.0.1",
                database_bucket="s3://test-bucket",
                checkpoints_bucket="s3://test-checkpoints",
                iceberg_output=IcebergConfig(
                    database_name="all_products",
                    table_name="orders_v1",
                ),
                kafka_input=KafkaReaderConfig(
                    KafkaConfig(
                        bootstrap_servers="localhost:9119",
                        credentials=Credentials("test", "test"),
                    ),
                    kafka_topic="src_orders_v1",
                ),
            ),
            readers={"orders": KafkaReader},
            writers=[IcebergWriter],
        )

    def process(self) -> DataFrame:
        return self.input["orders"].read()

Contributing to Sparkle 🤝

Sparkle is a community-driven project. If you’d like to contribute, visit our GitHub repository for contribution guidelines and more.


Future of Data Engineering with Sparkle ✨

Sparkle is here to simplify data engineering. By focusing on business-driven data transformations, abstracting complexities, and providing seamless integration with data systems, Sparkle empowers data engineers to build faster, deploy easier, and maintain efficiently.

Ready to sparkle up your data workflows? Get started with Sparkle today!