{"id":20207929,"url":"https://github.com/sharek-dev/spark-tools","last_synced_at":"2026-05-10T08:33:04.407Z","repository":{"id":144126894,"uuid":"241980316","full_name":"sharek-dev/spark-tools","owner":"sharek-dev","description":null,"archived":false,"fork":false,"pushed_at":"2020-02-20T20:14:53.000Z","size":41,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-01-13T21:09:37.396Z","etag":null,"topics":["bigdata","spark"],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sharek-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-02-20T20:14:39.000Z","updated_at":"2021-12-09T23:28:39.000Z","dependencies_parsed_at":null,"dependency_job_id":"879909c0-9a75-4b12-a68e-9e93b38633a2","html_url":"https://github.com/sharek-dev/spark-tools","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sharek-dev%2Fspark-tools","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sharek-dev%2Fspark-tools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sharek-dev%2Fspark-tools/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sharek-dev%2Fspark-tools/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sharek-dev","download_url":"https://codeload.github.com/sharek-dev/spark-tools/tar.gz/refs
/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241644558,"owners_count":19996179,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigdata","spark"],"created_at":"2024-11-14T05:33:19.196Z","updated_at":"2026-05-10T08:33:04.376Z","avatar_url":"https://github.com/sharek-dev.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Maven Central](https://maven-badges.herokuapp.com/maven-central/org.charik/sparktools_2.3_2.11/badge.svg)](https://maven-badges.herokuapp.com/maven-central/org.charik/sparktools_2.3_2.11)\n\n# Spark-Tools\n\nThis is a collection of useful functions that extend the standard Spark\nlibrary.\n\n## Install\n\nAvailable via\n[Maven Central](https://mvnrepository.com/artifact/org.charik/sparktools).\n\n**sbt**\n\nAdd the latest release as a dependency to your project:\n\n| Spark | Scala |                SparkTools                |\n|:------|:-----:|-----------------------------------------:|\n| 2.2.x | 2.11  | `\"org.charik\" %% \"sparktools\" % \"2.2.1\"` |\n| 2.3.x | 2.11  | `\"org.charik\" %% \"sparktools\" % \"2.3.1\"` |\n| 2.4.x | 2.11 / 2.12  | `\"org.charik\" %% \"sparktools\" % \"2.4.1\"` |\n\n\n**Maven**\n\n```\n\u003cdependency\u003e\n    \u003cgroupId\u003eorg.charik\u003c/groupId\u003e\n    \u003cartifactId\u003esparktools_2.11\u003c/artifactId\u003e\n    \u003cversion\u003e2.4.1\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n## Additional functions\n* sparktools. [sql](docs/sql.md)\n* sparktools.sql. [functions](docs/functions.md)\n* sparktools.sql. 
[checks](docs/checks.md)\n\n## Examples:\n\n### Basic column utils\n\n**flattenSchema**\n\n```scala\nimport org.charik.sparktools.sql.functions._\nval flatDF = df.flattenSchema(\"_\")\n```\n\n**withColumnNested**\n\n```scala\nimport org.charik.sparktools.sql.functions._\nval nestedDF = df.withColumnNested(\"user.flag.active\", lit(1))\n```\n\n**withColumnsSuffixed**\n\n```scala\nimport org.charik.sparktools.sql.functions._\nval renamedAllColumns = df.withColumnsSuffixed(\"_suffix\")\nval renamedSomeColumns = df.withColumnsSuffixed(\"_suffix\", List(\"id\", \"sale_id\"))\n```\n\n**withColumnsPrefixed**\n\n```scala\nimport org.charik.sparktools.sql.functions._\nval renamedAllColumns = df.withColumnsPrefixed(\"prefix_\")\nval renamedSomeColumns = df.withColumnsPrefixed(\"prefix_\", List(\"id\", \"sale_id\"))\n```\n\n**dropColumns**\n\n```scala\nimport org.charik.sparktools.sql.functions._\nval lightDF = df.dropColumns(List(\"name\", \"password\", \"email\"))\n```\n\n**sqlAdvanced**\n\nExecutes multiple SQL requests in sequence and returns the result of the last\nrequest as a DataFrame. Supports comments starting with `#` or `--`.\n\n```scala\nimport org.charik.sparktools.sql._\nval df = spark.sqlAdvanced(\"\"\"  \n    CREATE TEMPORARY VIEW Table as (SELECT * FROM json.`src/test.json` );\n    # This is a comment\n    SELECT * FROM Table;\n\"\"\")\n```\n\n### Data Quality Utils\n\n**isPrimaryKey**\n\n```scala\nimport org.charik.sparktools.sql.checks._\ndf.isPrimaryKey(List(\"id\", \"sale_id\"))\ndf.isUnique(\"id\")\n```\n\n## More examples\n\nOur library contains much more functionality than what we showed in the\nbasic examples. 
We are in the process of adding more examples for its\nadvanced features.\n\n\n## RoadMap:\n\n* sql.functions:\n  + withColumnNestedRenamed(colName: String, newColName: String) :\n    DataFrame\n  + withColumnsConcatenated(colName: String, colNames: List[String],\n    sep: String=\"_\") : DataFrame\n  + orderColumns(dir: String = \"asc\") : DataFrame\n  + join(df: DataFrame, on, how: String=\"left\", lsuffix: String,\n    rsuffix: String)\n  + dropNestedColumn\n* sql.testing:\n  + compareSchema(df: DataFrame): Boolean\n  + compareAll(df: DataFrame): Boolean\n* sql.checks:\n  + isSchemaFlat: Boolean\n  + isComplete(colName: String): Boolean\n* sql.refined:\n  + isConstraintValid(colName: String, constraint: RefinedType): Boolean\n* sql.dates:\n  + addDays\n  + addYears\n  + addHours\n  + litToday\n\n## Contributing\n\nThe main mechanisms for contribution are:\n\n* Reporting issues and suggesting improved functionality on the GitHub issue\n  tracker\n* Suggesting new features in this discussion thread (see\n  [RoadMap](https://github.com/helkaroui/spark-tools/issues/1) for\n  details)\n* Submitting Pull Requests (PRs) to fix issues or improve functionality.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsharek-dev%2Fspark-tools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsharek-dev%2Fspark-tools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsharek-dev%2Fspark-tools/lists"}