{"id":15021505,"url":"https://github.com/asuiu/sparkorm","last_synced_at":"2025-05-07T07:33:30.797Z","repository":{"id":187958865,"uuid":"677719886","full_name":"asuiu/SparkORM","owner":"asuiu","description":"ORM for Apache Spark and DataFrames schema manager","archived":false,"fork":false,"pushed_at":"2024-06-24T14:02:12.000Z","size":494,"stargazers_count":14,"open_issues_count":1,"forks_count":3,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-10T22:41:48.350Z","etag":null,"topics":["orm","pyspark","pyspark-python","python","python3","spark","spark-orm","spark-sql","sparkql","sqlalchemy","sqlalchemy-orm"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/sparkorm/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/asuiu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-12T12:10:09.000Z","updated_at":"2025-02-01T19:00:30.000Z","dependencies_parsed_at":null,"dependency_job_id":"840ba159-9dea-474e-9942-b1930ff1500e","html_url":"https://github.com/asuiu/SparkORM","commit_stats":{"total_commits":42,"total_committers":2,"mean_commits":21.0,"dds":"0.33333333333333337","last_synced_commit":"05a1c803245a0ea172305404b2b0a5694aaf9b60"},"previous_names":["asuiu/sparkorm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asuiu%2FSparkORM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asuiu%2FSparkORM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/asuiu%2FSparkORM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asuiu%2FSparkORM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/asuiu","download_url":"https://codeload.github.com/asuiu/SparkORM/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252833914,"owners_count":21811275,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["orm","pyspark","pyspark-python","python","python3","spark","spark-orm","spark-sql","sparkql","sqlalchemy","sqlalchemy-orm"],"created_at":"2024-09-24T19:56:39.359Z","updated_at":"2025-05-07T07:33:30.776Z","avatar_url":"https://github.com/asuiu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SparkORM ✨\n\n[![PyPI version](https://badge.fury.io/py/SparkORM.svg)](https://badge.fury.io/py/SparkORM)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\nPython Spark SQL \u0026 DataFrame schema management and basic Object Relational Mapping.\n\n## Why use SparkORM\n\n`SparkORM` takes the pain out of working with DataFrame schemas in PySpark.\nIt makes schema definition more Pythonic. 
And it's\nparticularly useful when you're dealing with structured data.\n\nIn plain old PySpark, you might find that you write schemas\n[like this](https://github.com/asuiu/SparkORM/tree/master/examples/conferences_comparison/plain_schema.py):\n\n```python\nCITY_SCHEMA = StructType()\nCITY_NAME_FIELD = \"name\"\nCITY_SCHEMA.add(StructField(CITY_NAME_FIELD, StringType(), False))\nCITY_LAT_FIELD = \"latitude\"\nCITY_SCHEMA.add(StructField(CITY_LAT_FIELD, FloatType()))\nCITY_LONG_FIELD = \"longitude\"\nCITY_SCHEMA.add(StructField(CITY_LONG_FIELD, FloatType()))\n\nCONFERENCE_SCHEMA = StructType()\nCONF_NAME_FIELD = \"name\"\nCONFERENCE_SCHEMA.add(StructField(CONF_NAME_FIELD, StringType(), False))\nCONF_CITY_FIELD = \"city\"\nCONFERENCE_SCHEMA.add(StructField(CONF_CITY_FIELD, CITY_SCHEMA))\n```\n\nAnd then plain old PySpark makes you deal with nested fields like this:\n\n```python\ndframe.withColumn(\"city_name\", dframe[CONF_CITY_FIELD][CITY_NAME_FIELD])\n```\n\nInstead, with `SparkORM`, schemas become a lot\n[more literate](https://github.com/asuiu/SparkORM/tree/master/examples/conferences_comparison/sparkorm_schema.py):\n\n```python\nclass City(Struct):\n    name = String()\n    latitude = Float()\n    longitude = Float()\n    date_created = Date()\n\nclass Conference(TableModel):\n    class Meta:\n        name = \"conference_table\"\n    name = String(nullable=False)\n    city = City()\n\nclass LocalConferenceView(ViewModel):\n    class Meta:\n        name = \"city_table\"\n\nConference(spark).create()\n\nConference(spark).ensure_exists()  # Creates the table; if it already exists, validates the schema and throws an exception if it doesn't match\n\nLocalConferenceView(spark).create_or_replace(select_statement=f\"SELECT * FROM {Conference.get_name()}\")\n\nConference(spark).insert([(\"Bucharest\", 44.4268, 26.1025, date(2020, 1, 1))])\n\nConference(spark).drop()\n```\n\nAs does dealing with nested fields:\n\n```python\ndframe.withColumn(\"city_name\", 
Conference.city.name.COL)\n```\n\nHere's a summary of `SparkORM`'s features.\n\n- ORM-like class-based Spark schema definitions.\n- Automated field naming: The attribute name of a field as it appears\n  in its `Struct` is (by default) used as its field name. This name can\n  be optionally overridden.\n- Programmatically reference nested fields in your structs with the\n  `PATH` and `COL` special properties. Avoid hand-constructing strings\n  (or `Column`s) to reference your nested fields.\n- Validate that a DataFrame matches a `SparkORM` schema.\n- Reuse and build composite schemas with `inheritance`, `includes`, and\n  `implements`.\n- Get a human-readable Spark schema representation with `pretty_schema`.\n- Create an instance of a schema as a dictionary, with validation of\n  the input values.\n\nRead on for documentation on these features.\n\n## Defining a schema\n\nEach Spark atomic type has a counterpart `SparkORM` field:\n\n| PySpark type | `SparkORM` field |\n|---|---|\n| `ByteType` | `Byte` |\n| `IntegerType` | `Integer` |\n| `LongType` | `Long` |\n| `ShortType` | `Short` |\n| `DecimalType` | `Decimal` |\n| `DoubleType` | `Double` |\n| `FloatType` | `Float` |\n| `StringType` | `String` |\n| `BinaryType` | `Binary` |\n| `BooleanType` | `Boolean` |\n| `DateType` | `Date` |\n| `TimestampType` | `Timestamp` |\n\n`Array` (counterpart to `ArrayType` in PySpark) allows the definition\nof arrays of objects. 
By creating a subclass of `Struct`, we can\ndefine a custom class that will be converted to a `StructType`.\n\nFor\n[example](https://github.com/asuiu/SparkORM/tree/master/examples/arrays/arrays.py),\ngiven the `SparkORM` schema definition:\n\n```python\nfrom SparkORM import TableModel, String, Array\n\nclass Article(TableModel):\n    title = String(nullable=False)\n    tags = Array(String(), nullable=False)\n    comments = Array(String(nullable=False))\n```\n\nThen we can build the equivalent PySpark schema (a `StructType`)\nwith:\n\n```python\n\npyspark_struct = Article.get_schema()\n```\n\nPretty printing the schema with the expression\n`SparkORM.pretty_schema(pyspark_struct)` will give the following:\n\n```text\nStructType([\n    StructField('title', StringType(), False),\n    StructField('tags',\n        ArrayType(StringType(), True),\n        False),\n    StructField('comments',\n        ArrayType(StringType(), False),\n        True)])\n```\n\n## Features\n\nMany examples of how to use `SparkORM` can be found in\n[`examples`](https://github.com/asuiu/SparkORM/tree/master/examples).\n### ORM-like class-based schema definitions\nThe `SparkORM` table schema definition is based on classes. 
Each column is a class and accepts a number of arguments that will be used to generate the schema.\n\nThe following arguments are supported:\n- `nullable` - whether the column is nullable (default: `True`)\n- `name` - the name of the column (default: the name of the attribute)\n- `comment` - the comment of the column (default: `None`)\n- `auto_increment` - whether the column is auto-incremented (default: `False`); applicable only to `Long` columns\n- `sql_modifiers` - the SQL modifiers of the column (default: `None`)\n- `partitioned_by` - whether the table is partitioned by this column (default: `False`)\n\nExamples:\n```python\nclass City(TableModel):\n    name = String(nullable=False)\n    latitude = Long(auto_increment=True) # auto_increment is a special property that will generate a unique value for each row\n    longitude = Float(comment=\"Some comment\")\n    date_created = Date(sql_modifiers=\"GENERATED ALWAYS AS (CAST(birthDate AS DATE))\") # sql_modifiers will be added to the CREATE clause for the column\n    birthDate = Date(nullable=False, partitioned_by=True) # partitioned_by is a special property that will generate a partitioned_by clause for the column\n```\n\n### Automated field naming\n\nBy default, field names are inferred from the attribute name in the\nstruct in which they are declared.\n\nFor example, given the struct\n\n```python\nclass Geolocation(TableModel):\n    latitude = Float()\n    longitude = Float()\n```\n\nthe concrete name of the `Geolocation.latitude` field is `latitude`.\n\nNames can also be overridden by explicitly specifying the field name as an\nargument to the field\n\n```python\nclass Geolocation(TableModel):\n    latitude = Float(name=\"lat\")\n    longitude = Float(name=\"lon\")\n```\n\nwhich would mean the concrete name of the `Geolocation.latitude` field\nis `lat`.\n\n### Field paths and nested objects\n\nReferencing fields in nested data can be a chore. 
`SparkORM` simplifies this\nwith path referencing.\n\n[For example](https://github.com/asuiu/SparkORM/tree/master/examples/nested_objects/SparkORM_example.py), if we have a\nschema with nested objects:\n\n```python\nclass Address(Struct):\n    post_code = String()\n    city = String()\n\n\nclass User(Struct):\n    username = String(nullable=False)\n    address = Address()\n\n\nclass Comment(Struct):\n    message = String()\n    author = User(nullable=False)\n\n\nclass Article(TableModel):\n    title = String(nullable=False)\n    author = User(nullable=False)\n    comments = Array(Comment())\n```\n\nWe can use the special `PATH` property to turn a path into a\nSpark-understandable string:\n\n```python\nauthor_city_str = Article.author.address.city.PATH\n\"author.address.city\"\n```\n\n`COL` is a counterpart to `PATH` that returns a Spark `Column`\nobject for the path, allowing it to be used in all places where Spark\nrequires a column.\n\nFunction equivalents `path_str`, `path_col`, and `name` are also available.\nThis table demonstrates the equivalence of the property styles and the function\nstyles:\n\n| Property style | Function style | Result (both styles are equivalent) |\n| --- | --- | --- |\n| `Article.author.address.city.PATH` | `SparkORM.path_str(Article.author.address.city)` | `\"author.address.city\"` |\n| `Article.author.address.city.COL` | `SparkORM.path_col(Article.author.address.city)` | `Column` pointing to `author.address.city` |\n| `Article.author.address.city.NAME` | `SparkORM.name(Article.author.address.city)` | `\"city\"` |\n\nFor paths that include an array, two approaches are provided:\n\n```python\ncomment_usernames_str = Article.comments.e.author.username.PATH\n\"comments.author.username\"\n\ncomment_usernames_str = Article.comments.author.username.PATH\n\"comments.author.username\"\n```\n\nBoth give the same result. However, the former (`e`) is more\ntype-oriented. The `e` attribute corresponds to the array's element\nfield. 
Although this looks strange at first, it has the advantage of\nbeing inspectable by IDEs and other tools, allowing goodness such as\nIDE auto-completion, automated refactoring, and identifying errors\nbefore runtime.\n\n### Field metadata\n\nField [metadata](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.StructField.html) can be specified with the `metadata` argument to a field, which accepts a dictionary\nof key-value pairs.\n\n```python\nclass Article(TableModel):\n    title = String(nullable=False,\n                   metadata={\"description\": \"The title of the article\", \"max_length\": 100})\n```\n\nThe metadata can be accessed with the `METADATA` property of the field:\n\n```python\nArticle.title.METADATA\n{\"description\": \"The title of the article\", \"max_length\": 100}\n```\n\n### DataFrame validation\n\nThe struct method `validate_data_frame` verifies whether a given DataFrame's\nschema matches the Struct.\n[For example](https://github.com/asuiu/SparkORM/tree/master/examples/validation/test_validation.py),\nif we have our `Article`\nstruct and a DataFrame we want to ensure adheres to the `Article`\nschema:\n\n```python\ndframe = spark_session.createDataFrame([{\"title\": \"abc\"}])\n\nclass Article(TableModel):\n    title = String()\n    body = String()\n```\n\nThen we can validate with:\n\n```python\nvalidation_result = Article.validate_data_frame(dframe)\n```\n\n`validation_result.is_valid` indicates whether the DataFrame is valid\n(`False` in this case), and `validation_result.report` is a\nhuman-readable string describing the differences:\n\n```text\nStruct schema...\n\nStructType([\n    StructField('title', StringType(), True),\n    StructField('body', StringType(), True)])\n\nDataFrame schema...\n\nStructType([\n    StructField('title', StringType(), True)])\n\nDiff of struct -\u003e data frame...\n\n  StructType([\n-     StructField('title', StringType(), True)])\n+     StructField('title', 
StringType(), True),\n+     StructField('body', StringType(), True)])\n```\n\nFor convenience,\n\n```python\nArticle.validate_data_frame(dframe).raise_on_invalid()\n```\n\nwill raise an `InvalidDataFrameError` (see `SparkORM.exceptions`) if the\nDataFrame is not valid.\n\n### Creating an instance of a schema\n\n`SparkORM` simplifies the process of creating an instance of a struct.\nYou might need to do this, for example, when creating test data, or\nwhen creating an object (a dict or a row) to return from a UDF.\n\nUse `Struct.make_dict(...)` to instantiate a struct as a dictionary.\nThis has the advantage that the input values will be correctly\nvalidated, and it will convert schema property names into their\nunderlying field names.\n\nFor\n[example](https://github.com/asuiu/SparkORM/tree/master/examples/struct_instantiation/instantiate_as_dict.py),\ngiven some simple Structs:\n\n```python\nclass User(TableModel):\n    id = Integer(name=\"user_id\", nullable=False)\n    username = String()\n\nclass Article(TableModel):\n    id = Integer(name=\"article_id\", nullable=False)\n    title = String()\n    author = User()\n    text = String(name=\"body\")\n```\n\nHere are a few examples of creating dicts from `Article`:\n\n```python\nArticle.make_dict(\n    id=1001,\n    title=\"The article title\",\n    author=User.make_dict(\n        id=440,\n        username=\"user\"\n    ),\n    text=\"Lorem ipsum article text lorem ipsum.\"\n)\n\n# generates...\n{\n    \"article_id\": 1001,\n    \"author\": {\n        \"user_id\": 440,\n        \"username\": \"user\"},\n    \"body\": \"Lorem ipsum article text lorem ipsum.\",\n    \"title\": \"The article title\"\n}\n```\n\n```python\nArticle.make_dict(\n    id=1002\n)\n\n# generates...\n{\n    \"article_id\": 1002,\n    \"author\": None,\n    \"body\": None,\n    \"title\": None\n}\n```\n\nSee\n[this example](https://github.com/asuiu/SparkORM/tree/master/examples/conferences_extended/conferences.py)\nfor an extended example of using 
`make_dict`.\n\n### Composite schemas\n\nIt is sometimes useful to be able to re-use the fields of one struct\nin another struct. `SparkORM` provides a few features to enable this:\n\n- _inheritance_: A subclass inherits the fields of a base struct class.\n- _includes_: Incorporate fields from another struct.\n- _implements_: Enforce that a struct must implement the fields of\n  another struct.\n\nSee the following examples for a better explanation.\n\n#### Using inheritance\n\nFor [example](https://github.com/asuiu/SparkORM/tree/master/examples/composite_schemas/inheritance.py), the following:\n\n```python\nclass BaseEvent(TableModel):\n    correlation_id = String(nullable=False)\n    event_time = Timestamp(nullable=False)\n\nclass RegistrationEvent(BaseEvent):\n    user_id = String(nullable=False)\n```\n\nwill produce the following `RegistrationEvent` schema:\n\n```text\nStructType([\n    StructField('correlation_id', StringType(), False),\n    StructField('event_time', TimestampType(), False),\n    StructField('user_id', StringType(), False)])\n```\n\n#### Using an `includes` declaration\n\nFor [example](https://github.com/asuiu/SparkORM/tree/master/examples/composite_schemas/includes.py), the following:\n\n```python\nclass EventMetadata(Struct):\n    correlation_id = String(nullable=False)\n    event_time = Timestamp(nullable=False)\n\nclass RegistrationEvent(TableModel):\n    class Meta:\n        includes = [EventMetadata]\n    user_id = String(nullable=False)\n```\n\nwill produce the `RegistrationEvent` schema:\n\n```text\nStructType([\n    StructField('user_id', StringType(), False),\n    StructField('correlation_id', StringType(), False),\n    StructField('event_time', TimestampType(), False)])\n```\n\n#### Using an `implements` declaration\n\n`implements` is similar to `includes`, but does not automatically\nincorporate the fields of specified structs. 
Instead, it is up to\nthe implementor to ensure that the required fields are declared in\nthe struct.\n\nFailing to implement a field from an `implements` struct will result in\na `StructImplementationError`.\n\n[For example](https://github.com/asuiu/SparkORM/tree/master/examples/composite_schemas/implements.py):\n\n```python\nclass LogEntryMetadata(TableModel):\n    logged_at = Timestamp(nullable=False)\n\nclass PageViewLogEntry(TableModel):\n    class Meta:\n        implements = [LogEntryMetadata]\n    page_id = String(nullable=False)\n\n# the above class declaration will fail with the following StructImplementationError:\n#   Struct 'PageViewLogEntry' does not implement field 'logged_at' required by struct 'LogEntryMetadata'\n```\n\n\n### Prettified Spark schema strings\n\nSpark's stringified schema representation isn't very user-friendly, particularly for large schemas:\n\n\n```text\nStructType([StructField('name', StringType(), False), StructField('city', StructType([StructField('name', StringType(), False), StructField('latitude', FloatType(), True), StructField('longitude', FloatType(), True)]), True)])\n```\n\nThe function `pretty_schema` will return something more useful:\n\n```text\nStructType([\n    StructField('name', StringType(), False),\n    StructField('city',\n        StructType([\n            StructField('name', StringType(), False),\n            StructField('latitude', FloatType(), True),\n            StructField('longitude', FloatType(), True)]),\n        True)])\n```\n\n### Merge two Spark `StructType` types\n\nIt can be useful to build a composite schema from two `StructType`s. 
`SparkORM` provides a\n`merge_schemas` function to do this.\n\n[For example](https://github.com/asuiu/SparkORM/tree/master/examples/merge_struct_types/merge_struct_types.py):\n\n```python\nschema_a = StructType([\n    StructField(\"message\", StringType()),\n    StructField(\"author\", ArrayType(\n        StructType([\n            StructField(\"name\", StringType())\n        ])\n    ))\n])\n\nschema_b = StructType([\n    StructField(\"author\", ArrayType(\n        StructType([\n            StructField(\"address\", StringType())\n        ])\n    ))\n])\n\nmerged_schema = merge_schemas(schema_a, schema_b)\n```\n\nresults in a `merged_schema` that looks like:\n\n```text\nStructType([\n    StructField('message', StringType(), True),\n    StructField('author',\n        ArrayType(StructType([\n            StructField('name', StringType(), True),\n            StructField('address', StringType(), True)]), True),\n        True)])\n```\n\n## Contributing\n\nContributions are very welcome. Developers who'd like to contribute to\nthis project should refer to [CONTRIBUTING.md](./CONTRIBUTING.md).\n\n## References\nNote: this library is a fork of https://github.com/mattjw/sparkql\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasuiu%2Fsparkorm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fasuiu%2Fsparkorm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasuiu%2Fsparkorm/lists"}