{"id":15044853,"url":"https://github.com/semyonsinchenko/tsumugi-spark","last_synced_at":"2025-10-24T14:32:19.293Z","repository":{"id":248225213,"uuid":"823113578","full_name":"mrpowers-io/tsumugi-spark","owner":"mrpowers-io","description":"SparkConnect Server plugin and protobuf messages for the Amazon Deequ Data Quality Engine.","archived":false,"fork":false,"pushed_at":"2024-11-27T19:17:01.000Z","size":1631,"stargazers_count":26,"open_issues_count":10,"forks_count":6,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-31T02:42:38.069Z","etag":null,"topics":["data-quality","deequ","pyspark","spark"],"latest_commit_sha":null,"homepage":"https://mrpowers-io.github.io/tsumugi-spark/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mrpowers-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-02T12:53:59.000Z","updated_at":"2025-01-06T20:53:18.000Z","dependencies_parsed_at":"2024-09-25T01:55:02.278Z","dependency_job_id":"6fc534e8-7e90-4570-83fb-4d8e02e6103d","html_url":"https://github.com/mrpowers-io/tsumugi-spark","commit_stats":null,"previous_names":["semyonsinchenko/tsumugi-spark","mrpowers-io/tsumugi-spark"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrpowers-io%2Ftsumugi-spark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrpowers-io%2Ftsumugi-spark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrpowers-io%2Ftsumugi-spark/rele
ases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrpowers-io%2Ftsumugi-spark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mrpowers-io","download_url":"https://codeload.github.com/mrpowers-io/tsumugi-spark/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":237990634,"owners_count":19398466,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-quality","deequ","pyspark","spark"],"created_at":"2024-09-24T20:51:08.391Z","updated_at":"2025-10-24T14:32:14.236Z","avatar_url":"https://github.com/mrpowers-io.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Tsumugi Spark\n\n**_UNDER ACTIVE DEVELOPMENT_**\n\n[![python-client](https://github.com/mrpowers-io/tsumugi-spark/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/mrpowers-io/tsumugi-spark/actions/workflows/ci.yml)\n\n[Documentation](https://mrpowers-io.github.io/tsumugi-spark/)\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/SemyonSinchenko/tsumugi-spark/main/static/tsumugi-spark-logo.png\" alt=\"tsumugi-shiraui\" width=\"600\" align=\"middle\"/\u003e\n\u003c/p\u003e\n\n**_NOTE:_** _Tsumugi Shiraui is a chimera: a hybrid of Human and Gauna. She combines the chaotic power of Gauna with human intelligence and empathy. 
Like the original character from the manga \"Knights of Sidonia\", this project aims to create a hybrid that combines the power of the Deequ Scala library, which is hard to learn and use, with the usability and simplicity of Spark Connect (PySpark Connect, Spark Connect Go, Spark Connect Rust, etc.)._\n\n\n## About\n\nThe project's goal is to create a modern, SparkConnect-native wrapper for the elegant and high-performance Data Quality library, [AWS Deequ](https://github.com/awslabs/deequ). The PySpark Connect API is expected to be the primary focus, but the PySpark Classic API will also be maintained. Additionally, other thin clients such as `connect-go` or `connect-rs` will be supported.\n\n## Why another wrapper?\n\nWhile Amazon Deequ itself is well-maintained, the existing PyDeequ wrapper faces three main challenges:\n\n1. It relies on direct `py4j` calls to the underlying Deequ Scala library. This approach presents two issues:\n   a) It cannot be used in any SparkConnect environment.\n   b) `py4j` is not well-suited for working with Scala code. For example, try creating an `Option[Long]` from Python using `py4j` to understand the complexity involved.\n2. It suffers from a lack of maintenance, likely due to the issues mentioned in point 1. This can be seen in [this GitHub issue](https://github.com/awslabs/python-deequ/issues/192).\n3. The current python-deequ implementation makes it impossible to retrieve row-level results because `py4j.reflection.PythonProxyHandler` is not serializable. 
This problem is documented in [this GitHub issue](https://github.com/awslabs/python-deequ/issues/234).\n\n### Goals of the project\n\n- Maintain `proto3` definitions of basic Deequ Scala structures (such as `Check`, `Analyzer`, `AnomalyDetectionStrategy`, `VerificationSuite`, `Constraint`, etc.);\n- Maintain a Scala SparkConnect Plugin that enables working with Deequ from any client;\n- Maintain a Python client that provides a user-friendly API on top of classes generated by `protoc`;\n- Provide utils to enhance Deequ's server-side functionality by adding more syntactic sugar, while ensuring their maintenance remains on the client-side.\n\n### Non-goals of the project\n\n- Creating a replacement for Deequ is not the goal of this project. Similarly, forking the entire project is not intended. Deequ's core is well-maintained, and there are no compelling reasons to create an aggressive fork of it.\n- Developing a low-code or zero-code Data Quality tool with YAML file configuration is not the project's objective. Currently, the focus is on providing a well-maintained and documented client API that can be used to create low-code tooling.\n\n### Architecture overview\n\nFrom a high-level perspective, Tsumugi implements three main components:\n\n1. Messages for Deequ Core's main data structures\n2. SparkConnect Plugin and utilities\n3. PySpark Connect and PySpark Classic thin client\n\nThe diagram below provides an overview of this concept:\n\n![](https://raw.githubusercontent.com/mrpowers-io/tsumugi-spark/refs/heads/main/static/diagram.png)\n\n## Project structure\n\n### Protobuf messages\n\nThe `tsumugi-server/src/main/protobuf/` directory contains messages that define the main structures of the Deequ Scala library:\n\n- `VerificationSuite`: This is the top-level Deequ object. For more details, refer to `suite.proto`.\n- `Analyzer`: This object is defined using `oneof` from a list of analyzers (including `CountDistinct`, `Size`, `Compliance`, etc.). 
For implementation details, see `analyzers.proto`.\n- `AnomalyDetection` and its associated strategies. For more information, consult `strategies.proto`.\n- `Check`: This is defined using `Constraint`, `CheckLevel`, and a description.\n- `Constraint`: This is defined as an Analyzer (which computes a metric), a reference value, and a comparison sign.\n\n### SparkConnect Plugin\n\nThe file `tsumugi-server/src/main/scala/org/apache/spark/sql/tsumugi/DeequConnectPlugin.scala` contains the plugin code itself. It is designed to be very simple, consisting of approximately 50 lines of code. The plugin's functionality is straightforward: it checks if the message is a `VerificationSuite`, passes it into `DeequSuiteBuilder`, and then packages the result back into a `Relation`.\n\n### Deequ Suite Builder\n\nThe file `tsumugi-server/src/main/scala/io/mrpowers/tsumugi/DeequSuiteBuilder.scala` contains code that creates Deequ objects from protobuf messages. It maps enums and constants to their corresponding Deequ counterparts, and generates `com.amazon.deequ` objects from the respective protobuf messages. The code ultimately returns a ready-to-use Deequ top-level structure.\n\n\n## Getting Started\n\nAt the moment, there are no packaged distributions of the server component and no pre-built PyPI packages for the clients. The only way to try the project is to build it from source.\n\n### Quick start\n\nThere is a simple Python script that performs the following tasks:\n\n1. Builds the server plugin;\n2. Downloads the required Spark version and all missing JAR files;\n3. Combines everything together;\n4. Runs the local Spark Connect Server with the Tsumugi plugin.\n\n```sh\npython dev/run-connect.py\n```\n\nBuilding the server component requires Maven and Java 11. You can find installation instructions for both in their official documentation: [Maven](https://maven.apache.org/install.html) and [Java 11](https://openjdk.org/install/). 
This script also requires Python 3.10 or higher. After installation, you can connect to the server and test it by creating a Python virtual environment. This process requires the `poetry` build tool. You can find instructions on how to install Poetry on its [official website](https://python-poetry.org/docs/#installation).\n\n```sh\ncd tsumugi-python\npoetry env use python3.10 # Python 3.10 or newer should work\npoetry install --with dev # installs tsumugi as well as Jupyter notebooks and pyspark[connect]\n```\n\nNow you can run Jupyter and try the example notebook (`tsumugi-python/examples/basic_example.ipynb`): [Notebook](https://github.com/mrpowers-io/tsumugi-spark/blob/main/docs/notebooks/basic_example.ipynb)\n\n### Server\n\nBuilding the server part requires Maven.\n\n```sh\ncd tsumugi-server\nmvn clean package\n```\n\n### Client\n\nInstalling the PySpark client requires `poetry`.\n\n```sh\ncd tsumugi-python\npoetry env use python3.10 # 3.10+\npoetry install\n```\n\n## References\n\nTsumugi is built on top of the Deequ Data Quality tool:\n\n- _Schelter, Sebastian, et al. \"Automating large-scale data quality verification.\" Proceedings of the VLDB Endowment 11.12 (2018): 1781-1794._, [link](https://www.amazon.science/publications/automating-large-scale-data-quality-verification?ref=https://githubhelp.com)\n- _Schelter, Sebastian, et al. \"Unit testing data with deequ.\" Proceedings of the 2019 International Conference on Management of Data. 2019._, [link](https://www.amazon.science/publications/unit-testing-data-with-deequ)\n- _Schelter, Sebastian, et al. 
\"Deequ-data quality validation for machine learning pipelines.\" (2018)._, [link](https://www.amazon.science/publications/deequ-data-quality-validation-for-machine-learning-pipelines)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsemyonsinchenko%2Ftsumugi-spark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsemyonsinchenko%2Ftsumugi-spark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsemyonsinchenko%2Ftsumugi-spark/lists"}