{"id":14982380,"url":"https://github.com/dataflint/spark","last_synced_at":"2026-05-10T10:56:56.831Z","repository":{"id":214068446,"uuid":"697667084","full_name":"dataflint/spark","owner":"dataflint","description":"Performance Observability for Apache Spark","archived":false,"fork":false,"pushed_at":"2025-03-23T07:31:22.000Z","size":19385,"stargazers_count":242,"open_issues_count":4,"forks_count":25,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-04T01:17:57.163Z","etag":null,"topics":["apache-spark","big-data","data-pipeline","data-pipelines","databricks","dataproc","emr","etl","observability","optimization","spark-operator"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dataflint.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-28T08:21:44.000Z","updated_at":"2025-04-01T17:55:51.000Z","dependencies_parsed_at":"2024-06-03T19:29:44.183Z","dependency_job_id":"1bf3e5bd-0401-4713-8902-ef6257b01aa1","html_url":"https://github.com/dataflint/spark","commit_stats":{"total_commits":359,"total_committers":5,"mean_commits":71.8,"dds":"0.19498607242339838","last_synced_commit":"fdcbc5edad21e3099c95465343abec7854cfcb66"},"previous_names":["dataflint/spark"],"tags_count":20,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataflint%2Fspark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataflint%2Fspark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataflint%2Fspark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataflint%2Fspark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dataflint","download_url":"https://codeload.github.com/dataflint/spark/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248633666,"owners_count":21136899,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","big-data","data-pipeline","data-pipelines","databricks","dataproc","emr","etl","observability","optimization","spark-operator"],"created_at":"2024-09-24T14:05:18.803Z","updated_at":"2026-05-10T10:56:56.824Z","avatar_url":"https://github.com/dataflint.png","language":"TypeScript","funding_links":[],"categories":["TypeScript"],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n\u003cimg alt=\"Logo\" src=\"documentation/resources/logo.png\" height=\"300\"\u003e\n\u003c/p\u003e\n\n\u003ch2 align=\"center\"\u003e\n Spark Performance Made Simple\n\u003c/h2\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n[![Maven Package](https://maven-badges.herokuapp.com/maven-central/io.dataflint/spark_2.12/badge.svg)](https://maven-badges.herokuapp.com/maven-central/io.dataflint/spark_2.12)\n[![Slack](https://img.shields.io/badge/Slack-Join%20Us-purple)](https://join.slack.com/t/dataflint/shared_invite/zt-28sr3r3pf-Td_mLx~0Ss6D1t0EJb8CNA)\n[![Test Status](https://github.com/dataflint/spark/actions/workflows/ci.yml/badge.svg)](https://github.com/your_username/your_repo/actions/workflows/tests.yml)\n[![Docs](https://img.shields.io/badge/Docs-Read%20the%20Docs-blue)](https://dataflint.gitbook.io/dataflint-for-spark/)\n![License](https://img.shields.io/badge/License-Apache%202.0-orange)\n\nIf you enjoy DataFlint OSS please give us a ⭐️ and join our [slack community](https://join.slack.com/t/dataflint/shared_invite/zt-28sr3r3pf-Td_mLx~0Ss6D1t0EJb8CNA) for feature requests, support and more!\n\n\u003c/div\u003e\n\n## What is DataFlint OSS?\n\nDataFlint OSS is a modern, user-friendly enhancement for Apache Spark that simplifies performance monitoring and debugging. It adds an intuitive tab to the existing Spark Web UI, transforming a powerful but often overwhelming interface into something easy to navigate and understand.\n\n**Looking for more?** Our full solution is a Production-aware AI copilot for Apache Spark. Learn more at [dataflint.io](https://www.dataflint.io/).\n\n## Why DataFlint OSS?\n\n- **Intuitive Design**: DataFlint OSS's tab in the Spark Web UI presents complex metrics in a clear, easy-to-understand format, making Spark performance accessible to everyone.\n- **Effortless Setup**: Install DataFlint OSS in minutes with just a few lines of code or configuration, without making any changes to your existing Spark environment.\n- **For All Skill Levels**: Whether you're a seasoned data engineer or just starting with Spark, DataFlint OSS provides valuable insights that help you work more effectively.\n\nWith DataFlint OSS, spend less time deciphering Spark Web UI and more time deriving value from your data. Make big data work better for you, regardless of your role or experience level with Spark.\n\n\n\n### Usage\n\nAfter installation, you will see a \"DataFlint OSS\" tab in the Spark Web UI. Click on it to start using DataFlint OSS.\n\n\u003cimg alt=\"Logo\" src=\"documentation/resources/usage.png\"\u003e\n\n## Demo ([Full YouTube Walkthrough](https://youtu.be/4d_jBCmodKQ?si=VGwFORzw6Wm4vkDo))\n\n![Demo](documentation/resources/demo.gif)\n\n## Features\n\n- 📈 Real-time query and cluster status\n- 📊 Query breakdown with performance heat map\n- 📋 Application Run Summary\n- ⚠️ Performance alerts and suggestions\n- 👀 Identify query failures\n- 🤖 Spark AI Assistant\n\nSee [Our Features](https://dataflint.gitbook.io/dataflint-for-spark/overview/our-features) for more information\n\n## Installation\n\n### Scala\n\nInstall DataFlint OSS via sbt:\nFor Spark 3.X:\n```sbt\nlibraryDependencies += \"io.dataflint\" %% \"spark\" % \"0.9.8\"\n```\n\nFor Spark 4.X:\n```sbt\nlibraryDependencies += \"io.dataflint\" %% \"dataflint-spark4\" % \"0.9.8\"\n```\n\n\nThen instruct spark to load the DataFlint OSS plugin:\n```scala\nval spark = SparkSession\n    .builder()\n    .config(\"spark.plugins\", \"io.dataflint.spark.SparkDataflintPlugin\")\n    ...\n    .getOrCreate()\n```\n\n### PySpark\nAdd these 2 configs to your pyspark session builder:\n\nFor Spark 3.X:\n```python\nbuilder = pyspark.sql.SparkSession.builder\n    ...\n    .config(\"spark.jars.packages\", \"io.dataflint:spark_2.12:0.9.8\") \\\n    .config(\"spark.plugins\", \"io.dataflint.spark.SparkDataflintPlugin\") \\\n    ...\n```\n\nFor Spark 4.X:\n```python\nbuilder = pyspark.sql.SparkSession.builder\n    ...\n    .config(\"spark.jars.packages\", \"io.dataflint:dataflint-spark4_2.13:0.9.8\") \\\n    .config(\"spark.plugins\", \"io.dataflint.spark.SparkDataflintPlugin\") \\\n    ...\n```\n\n### Spark Submit\n\nAlternatively, install DataFlint OSS with **no code change** as a spark ivy package by adding these 2 lines to your spark-submit command:\n\n```bash\nspark-submit\n--packages io.dataflint:spark_2.12:0.9.8 \\\n--conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \\\n...\n```\n\nFor Spark 4.X:\n```bash\nspark-submit\n--packages io.dataflint:dataflint-spark4_2.13:0.9.8 \\\n--conf spark.plugins=io.dataflint.spark.SparkDataflintPlugin \\\n...\n```\n\n### Additional installation options\n\n* There is also support for scala 2.13, if your spark cluster is using scala 2.13 change package name to io.dataflint:spark_**2.13**:0.9.8\n* For more installation options, including for **python** and **k8s spark-operator**, see [Install on Spark docs](https://dataflint.gitbook.io/dataflint-for-spark/getting-started/install-on-spark)\n* For installing DataFlint OSS in **spark history server** for observability on completed runs see [install on spark history server docs](https://dataflint.gitbook.io/dataflint-for-spark/getting-started/install-on-spark-history-server)\n* For installing DataFlint OSS on **DataBricks** see [install on databricks docs](https://dataflint.gitbook.io/dataflint-for-spark/getting-started/install-on-databricks). Databricks Runtime 17.3+ ships `javax.servlet` instead of `jakarta.servlet`, so use the dedicated shaded artifact `io.dataflint:dataflint-spark4-databricks_2.13` (same plugin class — only the jar coordinate differs).\n\n## How it Works\n\n![How it Works](documentation/resources/howitworks.png)\n\nDataFlint OSS is installed as a plugin on the spark driver and history server.\n\nThe plugin exposes an additional HTTP resoures for additional metrics not available in Spark UI, and a modern SPA web-app that fetches data from spark without the need to refresh the page.\n\nFor more information, see [how it works docs](https://dataflint.gitbook.io/dataflint-for-spark/overview/how-it-works)\n\n## Articles\n\n*  [AWS engineering blog post featuring DataFlint - Centralize Apache Spark observability on Amazon EMR on EKS with external Spark History Server](https://aws.amazon.com/blogs/big-data/centralize-apache-spark-observability-on-amazon-emr-on-eks-with-external-spark-history-server/)\n\n*  [Wix engineering blog post featuring DataFlint - How Wix Built the Ultimate Spark-as-a-Service Platform](https://www.wix.engineering/post/how-wix-built-the-ultimate-spark-as-a-service-platform-part1)\n\n*  [Cloudera Community - How to integrated DataFlint in CDP](https://community.cloudera.com/t5/Community-Articles/How-to-integrated-DataFlint-in-CDP/ta-p/383681)\n\n*  [Dataminded engineering blog post featuring DataFlint - Running thousands of Spark applications without losing your cool](https://medium.com/datamindedbe/running-thousands-of-spark-applications-without-losing-your-cool-969208a2d655)\n\n*  [Data Engineering Weekly #156 - Featuring DataFlint](https://www.dataengineeringweekly.com/p/data-engineering-weekly-156)\n\n*  [Medium Blog Post - Fixing small files performance issues in Apache Spark using DataFlint](https://medium.com/@menishmueli/fixing-small-files-performance-issues-in-apache-spark-using-dataflint-49ffe3eb755f)\n\n*  [Medium Blog Post - Are Long Filter Conditions in Apache Spark Leading to Performance Issues?](https://medium.com/@menishmueli/are-long-filter-conditions-in-apache-spark-leading-to-performance-issues-0b5bc6c0f94a)\n\n*  [Medium Blog Post - Optimizing update operations to Apache Iceberg tables using DataFlint](https://medium.com/dev-genius/optimizing-update-operations-to-apache-iceberg-tables-using-dataflint-e4e372e75b8a)\n\n*  [Medium Blog Post - Did you know that your Apache Spark logs might be leaking PIIs?](https://medium.com/system-weakness/did-you-know-that-your-apache-spark-logs-might-be-leaking-piis-06f2a0e8a82c)\n\n*  [Medium Blog Post - Cost vs Speed: measuring Apache Spark performance with DataFlint](https://medium.com/@menishmueli/cost-vs-speed-measuring-apache-spark-performance-with-dataflint-c5f909ebe229)\n\n\n## Compatibility Matrix\n\nDataFlint OSS require spark version 3.2 and up, and supports both scala versions 2.12 or 2.13. \n\n\n| Spark Platforms           | DataFlint OSS Realtime  | DataFlint OSS History server |\n|---------------------------|---------------------|--------------------------|\n| Local                     |       ✅            |           ✅             |\n| Standalone                |       ✅            |           ✅             |\n| Kubernetes Spark Operator |       ✅            |           ✅             |\n| EMR                       |       ✅            |           ✅             |\n| Dataproc                  |       ✅            |           ✅             |\n| HDInsights                |       ✅            |           ❌             |\n| Databricks                |       ✅            |           ❌             |\n\nFor more information, see [supported versions docs](https://dataflint.gitbook.io/dataflint-for-spark/overview/supported-versions)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataflint%2Fspark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdataflint%2Fspark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataflint%2Fspark/lists"}