{"id":24870898,"url":"https://github.com/datamole-ai/pysparkdt","last_synced_at":"2025-10-15T14:31:42.327Z","repository":{"id":271563906,"uuid":"766913156","full_name":"datamole-ai/pysparkdt","owner":"datamole-ai","description":"An open-source Python library for simplifying local testing of Databricks workflows that use PySpark and Delta tables.","archived":false,"fork":false,"pushed_at":"2025-06-11T07:46:59.000Z","size":50,"stargazers_count":38,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-21T16:09:08.261Z","etag":null,"topics":["databricks","delta","delta-tables","pipelines","pyspark","pytest","python","testing","workflows"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datamole-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-03-04T11:13:56.000Z","updated_at":"2025-08-19T01:02:26.000Z","dependencies_parsed_at":null,"dependency_job_id":"149e852a-b2a0-4bd1-b66f-4b1919425ad4","html_url":"https://github.com/datamole-ai/pysparkdt","commit_stats":null,"previous_names":["datamole-ai/pysparkdt"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/datamole-ai/pysparkdt","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datamole-ai%2Fpysparkdt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datamole-ai%2Fpysparkdt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/datamole-ai%2Fpysparkdt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datamole-ai%2Fpysparkdt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/datamole-ai","download_url":"https://codeload.github.com/datamole-ai/pysparkdt/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datamole-ai%2Fpysparkdt/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279085462,"owners_count":26100017,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-15T02:00:07.814Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["databricks","delta","delta-tables","pipelines","pyspark","pytest","python","testing","workflows"],"created_at":"2025-02-01T04:01:41.781Z","updated_at":"2025-10-15T14:31:42.322Z","avatar_url":"https://github.com/datamole-ai.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# pysparkdt (PySpark Delta Testing)\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://pypi.org/project/pysparkdt\"\u003e\n        \u003cimg src=\"https://img.shields.io/pypi/pyversions/pysparkdt.svg?color=%2334D058\"\n             alt=\"Supported Python versions\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://pypi.org/project/pysparkdt\" target=\"_blank\"\u003e\n        \u003cimg 
src=\"https://img.shields.io/pypi/v/pysparkdt?color=%2334D058\u0026label=pypi%20package\"\n             alt=\"Package version\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://pypi.org/project/pysparkdt\"\u003e\n        \u003cimg alt=\"PyPI - Downloads\"\n             src=\"https://img.shields.io/pypi/dm/pysparkdt.svg?label=PyPI%20downloads\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/astral-sh/ruff\"\u003e\n        \u003cimg alt=\"Ruff\"\n             src=\"https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json\"\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\n**An open-source Python library for simplifying local testing of Databricks \nworkflows using PySpark and Delta tables.**\n\nThis library enables seamless testing of PySpark processing logic outside \nDatabricks by **emulating Unity Catalog** behavior. It dynamically generates a \nlocal metastore to mimic Unity Catalog and supports simplified handling of \nDelta tables for both batch and streaming workloads.\n\n# Guideline\n\n## Table of Contents\n\n- [Overview](#overview)\n  - [Scope](#scope)\n  - [Prerequisites](#prerequisites)\n- [Setup](#setup)\n  1. [Installation](#1-installation)\n  2. [Testable Code](#2-testable-code)\n  3. [File Structure](#3-file-structure)\n  4. [Tests](#4-tests)\n- [Advanced](#advanced)\n  - [Testing Stream Processing](#testing-stream-processing)\n  - [Mocking Inside RDD and UDF Operations](#mocking-inside-rdd-and-udf-operations)\n- [Limitations](#limitations)\n  - [Map Key Type Must Be String](#map-key-type-must-be-string)\n\n## Overview\n\n### Scope\nThis guideline helps you test Databricks Python pipelines with a \nfocus on PySpark code. 
Basic familiarity with unit testing in pytest is \nhelpful, but it is not the central focus.\n\n### Key Points\n- **Standalone Testing:** The setup allows you to test code without Databricks \naccess, enabling easy CI integration.\n\n- **Local Metastore:** Mimic the Databricks Unity Catalog using a dynamically \ngenerated local metastore with local Delta tables.\n\n- **Code Testability:** Move core processing logic from notebooks to Python \nmodules. Notebooks then serve as entrypoints.\n\n## Setup\nThe following sections assume that you are creating tests for a \njob that reads one Delta table as input, produces one Delta table as output, \nand uses PySpark for its processing.\n\n### 1. Installation\n**Install pysparkdt** \n- Install the package from PyPI. It is only needed in your test environment.\n\n```bash\npip install pysparkdt\n```\n\n### 2. Testable code\n- **Modularization:** Move processing logic from notebooks to modules.\n\n- **Notebook Role:** Notebooks primarily handle initialization and triggering \nprocessing. They should contain all the code specific to Databricks \n(e.g. `dbutils` usage).\n\n\u003cdiv align=\"center\"\u003e\n\u003cstrong\u003eentrypoint.py (Databricks Notebook)\u003c/strong\u003e\n\u003c/div\u003e\n\n```python\n# Databricks notebook source\nimport sys\nfrom pathlib import Path\n\nMODULE_DIR = Path.cwd().parent\nsys.path.append(MODULE_DIR.as_posix())\n\n# COMMAND ----------\n\nfrom myjobpackage.processing import process_data\n\n# COMMAND ----------\n\ninput_table = dbutils.widgets.get('input_table')\noutput_table = dbutils.widgets.get('output_table')\n\n# COMMAND ----------\n\nprocess_data(\n    spark=spark,\n    input_table=input_table,\n    output_table=output_table,\n)\n```\n**myjobpackage.processing**\n- Contains the core logic to test\n- Our test focuses on the core function `myjobpackage.processing.process_data`\n\n### 3. 
File structure\n\n```\nmyjobpackage\n├── __init__.py\n├── entrypoint.py  # Databricks Notebook\n└── processing.py\ntests\n├── __init__.py\n├── test_processing.py\n└── data\n    └── tables\n        ├── example_input.ndjson\n        ├── expected_output.ndjson\n        └── schema\n            ├── example_input.json\n            └── expected_output.json\n```\n\n**Data Format**\n\n- **Test Data:** Newline-delimited JSON (`.ndjson`)\n- **Optional Schema:** JSON\n  - If present, full schema must be provided (all columns included).\n  - The format of the schema file is defined by [PySpark StructType JSON \n  representation](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/types.html#StructType.fromJson).\n\n\u003cdiv align=\"center\"\u003e\n\u003cstrong\u003eexample_input.ndjson\u003c/strong\u003e\n\u003c/div\u003e\n\n```json lines\n{\"id\": 0, \"time_utc\": \"2024-01-08T11:00:00\", \"name\": \"Jorge\", \"feature\": 0.5876}\n{\"id\": 1, \"time_utc\": \"2024-01-11T14:28:00\", \"name\": \"Ricardo\", \"feature\": 0.42}\n```\n\n\u003cdiv align=\"center\"\u003e\n\u003cstrong\u003eexample_input.json\u003c/strong\u003e\n\u003c/div\u003e\n\n```json\n{\n    \"type\": \"struct\",\n    \"fields\": \n    [\n        {\n            \"name\": \"id\",\n            \"type\": \"long\",\n            \"nullable\": false,\n            \"metadata\": {}\n        },\n        {\n            \"name\": \"time_utc\",\n            \"type\": \"timestamp\",\n            \"nullable\": false,\n            \"metadata\": {}\n        },\n        {\n            \"name\": \"name\",\n            \"type\": \"string\",\n            \"nullable\": true,\n            \"metadata\": {}\n        },\n        {\n            \"name\": \"feature\",\n            \"type\": \"double\",\n            \"nullable\": true,\n            \"metadata\": {}\n        }\n    ]\n}\n```\n\n**Tip:** A schema file for a loaded PySpark DataFrame df can be created using:\n\n```python\nwith(open('example_input.json', 'w')) 
as file:\n  file.write(json.dumps(df.schema.jsonValue(), indent=4))\n```\n\nThus, you can first load a table without a schema, then create a schema file \nfrom it and change the types as needed.\n\n### 4. Tests\n\n**Constants:** Define paths for test data and the temporary metastore.\n\n```python\nimport os\n\nDATA_DIR = f'{os.path.dirname(__file__)}/data'\nJSON_TABLES_DIR = f'{DATA_DIR}/tables'\nTMP_DIR = f'{DATA_DIR}/tmp'\nMETASTORE_DIR = f'{TMP_DIR}/metastore'\n```\n\n**Spark Fixture:** Define a fixture for the local Spark session using the \n`spark_base` function from the testing package. Specify the temporary metastore \nlocation.\n\n```python\nfrom pytest import fixture\nfrom pysparkdt import spark_base\n\n@fixture(scope='module')\ndef spark():\n    yield from spark_base(METASTORE_DIR)\n```\n\n**Metastore Initialization:** Use `reinit_local_metastore`\n\nAt the beginning of your test method, call the `reinit_local_metastore` function \nfrom the testing package to initialize the metastore with the tables from \nyour JSON folder (`JSON_TABLES_DIR`). You can also choose to enable or disable \ndeletion vectors for Delta tables (default: enabled). 
If the method is called \nwhile the metastore already exists, it will delete all the existing tables \nbefore initializing the new ones.\n\n*Alternatively, you can call this method only once per testing module, \nbut then individual testing methods might affect each other by modifying \nmetastore tables.*\n\n```python\nfrom myjobpackage.processing import process_data\nfrom pysparkdt import reinit_local_metastore\nfrom pyspark.sql import SparkSession\nfrom pyspark.testing import assertDataFrameEqual\n\ndef test_process_data(\n    spark: SparkSession,\n):\n    reinit_local_metastore(spark, JSON_TABLES_DIR, deletion_vectors=True)\n    \n    process_data(\n        spark=spark,\n        input_table='example_input',\n        output_table='output',\n    )\n    \n    output = spark.read.format('delta').table('output')\n    expected = spark.read.format('delta').table('expected_output')\n    \n    assertDataFrameEqual(\n        actual=output.select(sorted(output.columns)),\n        expected=expected.select(sorted(expected.columns)),\n    )\n```\n\nIn the example above, we use `assertDataFrameEqual` to compare PySpark \nDataFrames. We sort the columns on both sides so that the order of result \ncolumns does not matter. By default, the order of rows does not matter in \n`assertDataFrameEqual` (this can be adjusted using the `checkRowOrder` \nparameter).\n\n**ℹ️ For a complete example, see [example](https://github.com/datamole-ai/pysparkdt/blob/main/example).**\n\n\n**⚠️ Manual deletion of local metastore**\n\nDeleting the local metastore manually invalidates any Spark session configured \nfor that location. You would need to start a new Spark session because \nthe original session’s state is no longer valid. Avoid manual deletion; \nuse `reinit_local_metastore` for reinitialization instead.\n\n\n**⚠️ Note on running tests in parallel**\n\nWith the setup above, the metastore is shared at module scope. 
\nTherefore, if tests defined in the same module are run in parallel, \nrace conditions can occur if multiple test functions use the same tables.\n\nTo mitigate this, make sure each test in the module uses its own set of tables.\n\n## Advanced\n\n### Testing Stream Processing\n\nLet's now focus on a case where a job reads an input Delta table using \nPySpark streaming, performs some computation on the data, and saves it to \nan output Delta table.\n\nTo test the processing, we need to explicitly wait for \nits completion. The best way to do this is to **await the streaming function \nperforming the processing**.\n\nTo be able to await the streaming function, the **test function needs to have \naccess to it**. Thus, we need to make sure the streaming function (query in \nDatabricks terms) is accessible, for example by returning it from \nthe processing function.\n\n\u003cdiv align=\"center\"\u003e\n\u003cstrong\u003emyjobpackage/processing.py\u003c/strong\u003e\n\u003c/div\u003e\n\n```python\nimport pyspark.sql\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql.streaming import StreamingQuery\n\ndef process_data(\n    spark: SparkSession,\n    input_table: str,\n    output_table: str,\n    checkpoint_location: str,\n) -\u003e StreamingQuery:\n  load_query = spark.readStream.format('delta').table(input_table)\n    \n  def process_batch(df: pyspark.sql.DataFrame, _) -\u003e None:\n      ... 
process df ...\n      df.write.mode('append').format('delta').saveAsTable(output_table)\n\n  return (\n      load_query.writeStream.format('delta')\n      .foreachBatch(process_batch)\n      .trigger(availableNow=True)\n      .option('checkpointLocation', checkpoint_location)\n      .start()\n  )\n```\n\n\u003cdiv align=\"center\"\u003e\n\u003cstrong\u003etests/test_processing.py\u003c/strong\u003e\n\u003c/div\u003e\n\n```python\ndef test_process_data(spark: SparkSession):\n    ...\n    spark_processing = process_data(\n        spark=spark,\n        input_table='example_input',\n        output_table='output',\n        checkpoint_location=f'{TMP_DIR}/_checkpoint/output',\n    )\n    # Wait up to 60 seconds for the streaming query to finish\n    spark_processing.awaitTermination(60)\n    \n    output = spark.read.format('delta').table('output')\n    expected = spark.read.format('delta').table('expected_output')\n    ...\n```\n\n### Mocking Inside RDD and UDF Operations\n\nIf we are testing a whole job’s processing code and it contains functions \nexecuted through `rdd.mapPartitions`, `rdd.map`, or UDFs, we need to add \nspecial handling for mocking, as regular patching does not propagate to worker \nnodes.\n\n\u003cdiv align=\"center\"\u003e\n\u003cstrong\u003emyjobpackage/processing.py\u003c/strong\u003e\n\u003c/div\u003e\n\n```python\n# myjobpackage/processing.py\nfrom collections.abc import Iterable\nfrom typing import Any\n\nimport pyspark.sql\nfrom pyspark.sql import Row\n\ndef call_api(\n    data_df: pyspark.sql.DataFrame,\n) -\u003e pyspark.sql.DataFrame:\n    # Call API in parallel (session per partition)\n    result = data_df.rdd.mapPartitions(_partition_run).toDF()\n    return result\n\ndef _partition_run(\n    iterable: Iterable[Row],\n) -\u003e Iterable[dict[str, Any]]:\n    with ApiSessionClient() as client:\n        for row in iterable:\n            ...\n            output = client.post(prepared_data)\n            ...\n            yield output\n\ndef process_data(\n    data_df: pyspark.sql.DataFrame,\n) -\u003e pyspark.sql.DataFrame:\n    ...\n    ... 
= call_api(...)\n    ...\n```\n\nIn this example, `_partition_run` calls an external API. \nWe do not want to call the actual API in our test, so we mock \n`ApiSessionClient`. \n \n```python\nfrom pytest import fixture\n\nimport myjobpackage.processing\n\ndef _mocked_session_post(json_data: dict):\n    ...\n    return output\n\n\n# `mocker` is the fixture provided by the pytest-mock plugin\n@fixture\ndef api_session_client(mocker):\n    api_session_client_mock = mocker.patch.object(\n        myjobpackage.processing,\n        'ApiSessionClient',\n    )\n    api_session_client_mock.return_value = session_client = mocker.Mock()\n    session_client.__enter__ = mocker.Mock()\n    session_client.__enter__.return_value = session_client_ctx = mocker.Mock()\n    session_client.__exit__ = mocker.Mock()\n    session_client_ctx.post = mocker.Mock(side_effect=_mocked_session_post)\n    return session_client\n```\n\nBecause `ApiSessionClient` is created inside `rdd.mapPartitions`, where the \npatch does not apply, we also need to mock `call_api` so that \n`_partition_run` runs locally in the test process.\n\n```python\nimport pandas as pd\nimport pyspark.sql\nfrom pyspark.sql import SparkSession\n\nfrom myjobpackage.processing import _partition_run\n\ndef _mocked_call_api(\n    data_df: pyspark.sql.DataFrame,\n) -\u003e pyspark.sql.DataFrame:\n    # Run the partition function on the driver over the collected rows\n    results = list(_partition_run(data_df.collect()))\n    spark = SparkSession.builder.getOrCreate()\n    pandas_df = pd.DataFrame(results)\n    return spark.createDataFrame(pandas_df)\n\n\n@fixture\ndef call_api_mock(mocker, api_session_client):\n    mocker.patch.object(\n        myjobpackage.processing, 'call_api', _mocked_call_api\n    )\n```\n\nThen we can run the test with the mocked API.\n\n```python\ndef test_process_data(\n    spark: SparkSession,\n    call_api_mock,\n):\n    ...\n```\n\n## Limitations\n\n### Map Key Type Must Be String\n\nAlthough Spark supports non-string key types in map fields, the JSON format \nitself does not support non-string keys. In JSON, all keys are inherently \ninterpreted as strings, regardless of their declared type in the schema. 
\nThis discrepancy becomes problematic when testing with `.ndjson` files.\n\nSpecifically, if the schema defines a map key type as anything other than \n`string` (such as `long` or `integer`), the reinitialization of the metastore \nwill  result in `None` values for all fields in the Delta table when the data \nis loaded. This happens because the keys in the JSON data are read as strings, \nbut the schema expects another type, leading to a silent failure where no \nexception or warning is raised. This makes the issue difficult to detect \nand debug.\n\n## License\n\npysparkdt is licensed under the [MIT\nlicense](https://opensource.org/license/mit/). See the \n[LICENSE file](https://github.com/datamole-ai/pysparkdt/blob/main/LICENSE) for more details.\n\n## How to Contribute\n\nSee [CONTRIBUTING.md](https://github.com/datamole-ai/pysparkdt/blob/main/CONTRIBUTING.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatamole-ai%2Fpysparkdt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatamole-ai%2Fpysparkdt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatamole-ai%2Fpysparkdt/lists"}