{"id":27108621,"url":"https://github.com/yotpoltd/metorikku","last_synced_at":"2025-04-06T22:21:27.820Z","repository":{"id":27028534,"uuid":"106382269","full_name":"YotpoLtd/metorikku","owner":"YotpoLtd","description":"A simplified, lightweight ETL Framework based on Apache Spark","archived":false,"fork":false,"pushed_at":"2024-01-24T10:03:28.000Z","size":4405,"stargazers_count":585,"open_issues_count":65,"forks_count":155,"subscribers_count":53,"default_branch":"master","last_synced_at":"2024-11-25T13:38:09.438Z","etag":null,"topics":["big-data","distributed-computing","etl","etl-framework","etl-pipeline","scala","spark","sql"],"latest_commit_sha":null,"homepage":" https://yotpoltd.github.io/metorikku/","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/YotpoLtd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-10-10T07:19:06.000Z","updated_at":"2024-11-21T22:41:09.000Z","dependencies_parsed_at":"2024-11-18T13:12:28.618Z","dependency_job_id":null,"html_url":"https://github.com/YotpoLtd/metorikku","commit_stats":null,"previous_names":[],"tags_count":147,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YotpoLtd%2Fmetorikku","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YotpoLtd%2Fmetorikku/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YotpoLtd%2Fmetorikku/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YotpoLtd%2Fmetorikku/manifests
","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/YotpoLtd","download_url":"https://codeload.github.com/YotpoLtd/metorikku/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247558685,"owners_count":20958202,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","distributed-computing","etl","etl-framework","etl-pipeline","scala","spark","sql"],"created_at":"2025-04-06T22:21:27.077Z","updated_at":"2025-04-06T22:21:27.805Z","avatar_url":"https://github.com/YotpoLtd.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"![Metorikku Logo](https://raw.githubusercontent.com/wiki/yotpoltd/metorikku/metorikku.png)\n\n[![Build Status](https://travis-ci.org/YotpoLtd/metorikku.svg?branch=master)](https://travis-ci.org/YotpoLtd/metorikku)\n\n[![Gitter](https://badges.gitter.im/metorikku/Lobby.svg)](https://gitter.im/metorikku/Lobby?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge)\n\nMetorikku is a library that simplifies writing and executing ETLs on top of [Apache Spark](http://spark.apache.org/).\n\nIt is based on simple YAML configuration files and runs on any Spark cluster.\n\nThe platform also includes a simple way to write unit and E2E tests.\n\n### Getting started\nTo run Metorikku you must first define 2 files.\n\n#### Metric file\nA metric file defines the steps and queries of the ETL as well as where and what to output.\n\nFor example a simple configuration YAML (JSON is also supported) should be as follows:\n```yaml\nsteps:\n- 
dataFrameName: df1\n  checkpoint: true #This persists the dataframe to storage and truncates the execution plan. For more details, see https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-checkpointing.html\n  sql:\n    SELECT *\n    FROM input_1\n    WHERE id \u003e 100\n- dataFrameName: df2\n  sql:\n    SELECT *\n    FROM df1\n    WHERE id \u003c 1000\noutput:\n- dataFrameName: df2\n  outputType: Parquet\n  outputOptions:\n    saveMode: Overwrite\n    path: df2.parquet\n```\nYou can check out a full example file for all possible values in the [sample YAML configuration file](config/metric_config_sample.yaml).\n\nMake sure to also check out the full [Spark SQL Language manual](https://docs.databricks.com/spark/latest/spark-sql/index.html#sql-language-manual) for the possible queries.\n\n#### Job file\nThis file will include **input sources**, **output destinations** and the location of the **metric config** files.\n\nSo for example a simple YAML (JSON is also supported) should be as follows:\n```yaml\nmetrics:\n  - /full/path/to/your/metric/file.yaml\ninputs:\n  input_1: parquet/input_1.parquet\n  input_2: parquet/input_2.parquet\noutput:\n    file:\n        dir: /path/to/parquet/output\n```\nYou can check out a full example file for all possible values in the [sample YAML configuration file](config/job_config_sample.yaml).\n\nAlso make sure to check out all our [examples](examples).\n\n#### Supported input/output:\n\nCurrently Metorikku supports the following inputs:\n**CSV, JSON, parquet, JDBC, Kafka, Cassandra, Elasticsearch**\n\nAnd the following outputs:\n**CSV, JSON, parquet, Redshift, Cassandra, Segment, JDBC, Kafka, Elasticsearch**\u003cbr /\u003e\n\n### Running Metorikku\nThere are currently 3 options to run Metorikku.\n#### Run on a spark cluster\n*To run on a cluster Metorikku requires [Apache Spark](http://spark.apache.org/) v2.2+*\n* Download the [last released JAR](https://github.com/YotpoLtd/metorikku/releases/latest)\n* Run the 
following command:\n     `spark-submit --class com.yotpo.metorikku.Metorikku metorikku.jar -c config.yaml`\n\n*Running with remote job/metric files:*\n\nMetorikku supports using remote job/metric files.\n\nSimply write the full path to the job/metric. example: `s3://bucket/job.yaml`\n\nAnything supported by hadoop can be used (s3, hdfs etc.)\n\nTo help running both locally and remotely you can add the following env variable at runtime to add a prefix to all your configuration files paths:\n`CONFIG_FILES_PATH_PREFIX=s3://bucket/`\n\n#### Run locally\n*Metorikku is released with a JAR that includes a bundled spark.*\n* Download the [last released Standalone JAR](https://github.com/YotpoLtd/metorikku/releases/latest)\n* Metorikku is required to be running with `Java 1.8`\n* Run the following command:\n`java -D\"spark.master=local[*]\" -cp metorikku-standalone.jar com.yotpo.metorikku.Metorikku -c config.yaml`\n* Also job in a JSON format is supported, run following command:\n`java -D\"spark.master=local[*]\" -cp metorikku-standalone.jar com.yotpo.metorikku.Metorikku --job \"{*}\"`\n\n*Run locally in intellij:*\n\nGo to Run-\u003eEdit Configuration-\u003eadd application configuration\n\n* Main Class:\n`com.yotpo.metorikku.Metorikku`\n* Vm options:\n`-Dspark.master=local[*] -Dspark.executor.cores=1 -Dspark.driver.bindAddress=127.0.0.1 -Dspark.serializer=org.apache.spark.serializer.KryoSerializer`\n* program arguments:\n`-c examples/movies.yaml`\n* JRE: `1.8`\n\n*Run tester in intellij:*\n* Main class: `com.yotpo.metorikku.MetorikkuTester`\n* Program arguments: `--test-settings /{path to }/test_settings.yaml`\n\n\n#### Run as a library\n*It's also possible to use Metorikku inside your own software*\n\n*Metorikku library requires scala 2.11 (spark 2)/2.12 (spark 3)*\n\nTo use it add the following dependency to your build.sbt:\n`\"com.yotpo\" % \"metorikku\" % \"LATEST VERSION\"`\n\n### Metorikku Tester\nIn order to test and fully automate the deployment of metrics we added 
a method to run tests against a metric.\n\nA test consists of the following:\n#### Test settings\nThis defines what to test and where to get the mocked data.\n\n**All the paths must be relative to the directory of the test file.**\n\nFor example, a simple test YAML (JSON is also supported) will be:\n```yaml\nmetric: \"/path/to/metric\"\nmocks:\n- name: table_1\n  path: mocks/table_1.jsonl\ntests:\n  df2:\n  - id: 200\n    name: test\n  - id: 300\n    name: test2\nkeys:\n  df2:\n  - id\n  - name\n```\n\nAnd the corresponding `mocks/table_1.jsonl`:\n```jsonl\n{ \"id\": 200, \"name\": \"test\" }\n{ \"id\": 300, \"name\": \"test2\" }\n{ \"id\": 1, \"name\": \"test3\" }\n```\n\nThe Keys section allows the user to define the unique columns of every DataFrame's expected results -\nevery expected row result should have a unique combination for the values of the key columns.\nThis part is optional and can be used to define only part of the expected DataFrames -\nfor the DataFrames that don't have a key definition, all of the columns defined in the first row result\nwill be taken by default as the unique keys.\nDefining a shorter list of key columns will result in better performance and a more detailed error message in case of test failure.\n\nThe structure of the defined expected DataFrame's result must be identical for all rows, and the keys must be valid\n(defined as columns of the expected results of the same DataFrame).\n\n\n#### Running Metorikku Tester\nYou can run Metorikku tester in any of the above methods (just like a normal Metorikku).\n\nThe main class changes from `com.yotpo.metorikku.Metorikku` to `com.yotpo.metorikku.MetorikkuTester`.\n\n#### Testing streaming metrics\nIn Spark some behaviors are different when writing queries for streaming sources (for example Kafka).\n\nIn order to make sure the test behaves the same as the real life queries, you can configure a mock to behave like a streaming input by writing the following:\n```yaml\nmetric: 
\"/path/to/metric\"\nmocks:\n- name: table_1\n  path: mocks/table_1.jsonl\n  # default is false\n  streaming: true\n# default is append output mode\noutputMode: update\ntests:\n  df2:\n  - id: 200\n    name: test\n  - id: 300\n    name: test2\n```\n\n### Notes\n\n#### Variable interpolation\nAll configuration files support variable interpolation from environment variables and system properties using the following format:\n`${variable_name}`\n\n#### Using JDBC\nWhen using JDBC writer or input you must provide a path to the driver JAR.\n\nFor example to run with spark-submit with a mysql driver:\n`spark-submit --driver-class-path mysql-connector-java-5.1.45.jar --jars mysql-connector-java-5.1.45.jar --class com.yotpo.metorikku.Metorikku metorikku.jar -c config.yaml`\n\nIf you want to run this with the standalone JAR:\n`java -Dspark.master=local[*] -cp metorikku-standalone.jar:mysql-connector-java-5.1.45.jar com.yotpo.metorikku.Metorikku -c config.yaml`\n\n#### JDBC query\nJDBC query output allows running a query for each record in the dataframe.\n\n##### Mandatory parameters:\n* **query** - defines the SQL query.\nIn the query you can address the columns of the DataFrame by their position using the dollar sign ($) followed by the column index. For example:\n```sql\nINSERT INTO table_name (column1, column2, column3, ...) 
VALUES ($1, $2, $3, ...);\n```\n##### Optional Parameters:\n* **maxBatchSize** - The maximum number of queries to execute against the DB in one commit.\n* **minPartitions** - Minimum partitions in the DataFrame - may cause repartition.\n* **maxPartitions** - Maximum partitions in the DataFrame - may cause coalesce.\n\n#### Kafka output\nKafka output allows writing batch operations to Kafka.\n\nWe use spark-sql-kafka-0-10 as a provided jar - spark-submit command should look like so:\n\n```spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1 --class com.yotpo.metorikku.Metorikku metorikku.jar```\n\n##### Mandatory parameters:\n* **topic** - defines the Kafka topic which the data will be written to.\nCurrently only one topic is supported.\n\n* **valueColumn** - defines the values which will be written to the Kafka topic.\nUsually a JSON version of the data. For example:\n```sql\nSELECT keyColumn, to_json(struct(*)) AS valueColumn FROM table\n```\n##### Optional Parameters:\n* **keyColumn** - key that can be used to perform de-duplication when reading\n\n#### Periodic job\nPeriodic job configuration lets you schedule a batch job to run repeatedly at a configured interval.\nThis is an example of a periodic configuration:\n```yaml\nperiodic:\n  triggerDuration: 20 minutes\n```\n\n### Streaming Input\nUsing streaming input will convert your application into a streaming application built on top of Spark Structured Streaming.\n\nTo enable all other writers and also enable multiple outputs for a single streaming dataframe, add ```batchMode``` to your job configuration; this will enable the [foreachBatch](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch) mode (only available in spark \u003e= 2.4.0).\nCheck out all possible streaming configurations in the ```streaming``` section of the [sample job configuration file](config/job_config_sample.yaml).\n\nPlease note the following while using 
streaming applications:\n\n* Multiple streaming aggregations (i.e. a chain of aggregations on a streaming DF) are not yet supported on streaming Datasets.\n\n* Limit and take first N rows are not supported on streaming Datasets.\n* Distinct operations on streaming Datasets are not supported.\n\n* Sorting operations are supported on streaming Datasets only after an aggregation and in Complete Output Mode.\n\n* Make sure to add the relevant [Output Mode](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes) to your Metric as seen in the Examples\n\n* Make sure to add the relevant [Triggers](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers) to your Metric if needed as seen in the Examples\n\n* For more information please go to [Spark Structured Streaming WIKI](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)\n\n#### Kafka Input\nKafka input allows reading messages from topics\n```yaml\ninputs:\n  testStream:\n    kafka:\n      servers:\n        - 127.0.0.1:9092\n      topic: test\n      consumerGroup: testConsumerGroupID # optional\n      schemaRegistryUrl: https://schema-registry-url # optional\n      schemaSubject: subject # optional\n```\nWhen using kafka input, writing is only available to ```File``` and ```Kafka```, and only to a single output.\n* In order to measure your consumer lag you can use the ```consumerGroup``` parameter to track your application offsets against your kafka input.\nThis will commit the offsets to kafka, as a new dummy consumer group.\n\n* we use ABRiS as a provided jar In order to deserialize your kafka stream messages (https://github.com/AbsaOSS/ABRiS), add the  ```schemaRegistryUrl``` option to the kafka input config\nspark-submit command should look like so:\n\n```spark-submit --repositories http://packages.confluent.io/maven/ --jars https://repo1.maven.org/maven2/za/co/absa/abris_2.12/3.2.1/abris_2.12-3.2.1.jar --packages 
org.apache.spark:spark-avro_2.12:3.2.0,org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,io.confluent:kafka-schema-registry-client:5.3.0,io.confluent:kafka-avro-serializer:5.3.0 --class com.yotpo.metorikku.Metorikku metorikku.jar```\n\n* If your subject schema name is not ```\u003cTOPIC NAME\u003e-value``` (e.g. if the topic is a regex pattern) you can specify the schema subject in the ```schemaSubject``` section\n\n###### Topic Pattern\nKafka input also allows reading messages from multiple topics by using subscribe pattern:\n```yaml\ninputs:\n  testStream:\n    kafka:\n      servers:\n        - 127.0.0.1:9092\n      # topicPattern can be any Java regex string\n      topicPattern: my_topics_regex.*\n      consumerGroup: testConsumerGroupID # optional\n      schemaRegistryUrl: https://schema-registry-url # optional\n      schemaSubject: subject # optional\n```\n* While using topicPattern, consider using ```schemaRegistryUrl``` and ```schemaSubject``` in case your topics have different schemas.\n\n##### File Streaming Input\nMetorikku supports streaming over a file system as well.\nYou can use the Data stream reading by specifying ```isStream: true```,\nand a specific path in the job (must be a single path) as a streaming source, this will trigger jobs for new files added to the folder.\n```yaml\ninputs:\n  testStream:\n    file:\n      path: examples/file_input_stream/input\n      isStream: true\n      format: json\n      options:\n        timestampFormat: \"yyyy-MM-dd'T'HH:mm:ss'Z'\"\n```\n\n##### Watermark\nMetorikku supports Watermark method which helps a stream processing engine to deal with late data.\nYou can use watermarking by adding a new udf step in your metric:\n```yaml\n# This will become the new watermarked dataframe name.\n- dataFrameName: dataframe\n  classpath: com.yotpo.metorikku.code.steps.Watermark\n  params:\n    # Watermark table my_table\n    table: my_table\n    # The column representing the event time (needs to be a TIMESTAMP or DATE 
column)\n    eventTime: event\n    delayThreshold: 2 hours\n```\n\n##### ToAvro\nMetorikku supports the to_avro() method, which turns a dataframe into Avro records.\n\nThe method requires the following parameters:\n```\n- table\n- schema.registry.url\n- schema.registry.topic\n- schema.name\n- schema.namespace\n```\n\n```table``` is the input table and it should contain a ```value``` column, and can contain a ```key``` column.\n\nThe content of the ```value``` column in the input table will turn into avro in the ```value``` column of\nthe output table. The content of the ```key``` column in the input table will turn into avro in the ```key``` column of\nthe output table. ```key``` is not necessary in the input.\n\nThe ```schema.name``` and ```schema.namespace``` will be the schema name and namespace for both the value schema and the key schema, if a key exists.\n\nA subject will be created in the schema registry (if one doesn't already exist). The subject name will be `\u003cschema.registry.topic\u003e-value`\nfor the value schema and `\u003cschema.registry.topic\u003e-key` for the key schema.\n\u003cbr/\u003eYou can use ToAvro by adding a new udf step in your metric:\n```yaml\n- dataFrameName: dataframe\n  classpath: com.yotpo.metorikku.code.steps.ToAvro\n  params:\n    table: my_table\n    schema.registry.url: http://localhost:8081\n    schema.registry.topic: my_topic\n    schema.name: my_schema_name\n    schema.namespace: my_schema_namespace\n```\n\n#### Instrumentation\nOne of the most useful features in Metorikku is its instrumentation capabilities.\n\nInstrumentation metrics are written by default to what's configured in [spark-metrics](https://spark.apache.org/docs/latest/monitoring.html#metrics).\n\nOn top of what Spark already sends, Metorikku automatically sends the following:\n\n* Number of rows written to each output\n\n* Number of successful steps per metric\n\n* Number of failed steps per metric\n\n* In streaming: records per second\n\n* In streaming: 
number of processed records in batch\n\nYou can also send any information you like to the instrumentation output within a metric.\nby default the last column of the schema will be the field value.\nOther columns that are not value or time columns will be merged together as the name of the metric.\nIf writing directly to influxDB these will become tags.\n\nCheck out the [example](examples/movies_metric.yaml) for further details.\n\n##### using InfluxDB\n\nYou can also send metric directly to InfluxDB (gaining the ability to use tags and time field).\n\nCheck out the [example](examples/influxdb) and also the [InfluxDB E2E test](e2e/influxdb) for further details.\n\n##### Elasticsearch output\nElasticsearch output allows bulk writing to elasticsearch\nWe use elasticsearch-hadoop as a provided jar - spark-submit command should look like so:\n\n```spark-submit --packages org.elasticsearch:elasticsearch-hadoop:6.6.1 --class com.yotpo.metorikku.Metorikku metorikku.jar```\n\nCheck out the [example](examples/elasticsearch) and also the [Elasticsearch E2E test](e2e/elasticsearch) for further details.\n\n#### Docker\nMetorikku is provided with a [docker image](https://hub.docker.com/r/metorikku/metorikku).\n\nYou can use this docker to deploy metorikku in container based environments (we're using [Nomad by HashiCorp](https://www.nomadproject.io/)).\n\nCheck out this [docker-compose](docker/docker-compose.yml) for a full example of all the different parameters available and how to set up a cluster.\n\nCurrently the image only supports running metorikku in a spark cluster mode with the standalone scheduler.\n\nThe image can also be used to run E2E tests of a metorikku job.\nCheck out an example of running a kafka 2 kafka E2E with docker-compose [here](e2e/kafka/docker-compose.yml)\n\n#### UDF\nMetorikku supports adding custom code as a step.\nThis requires creating a JAR with the custom code.\nCheck out the [UDF examples directory](examples/udf) for a very simple example of 
such a JAR.\n\nThe only thing important in this JAR is that you have an object with the following method:\n```scala\nobject SomeObject {\n  def run(ss: org.apache.spark.sql.SparkSession, metricName: String, dataFrameName: String, params: Option[Map[String, String]]): Unit = {}\n}\n```\nInside the run function do whatever you feel like, in the example folder you'll see that we registered a new UDF.\nOnce you have a proper scala file and a ```build.sbt``` file you can run ```sbt package``` to create the JAR.\n\nWhen you have the newly created JAR (should be in the target folder), copy it to the spark cluster (you can of course also deploy it to your favorite repo).\n\nYou must now include this JAR in your spark-submit command by using the ```--jars``` flag, or if you're using java to run add it to the ```-cp``` flag.\n\nNow all that's left is to add it as a new step in your metric:\n```yaml\n- dataFrameName: dataframe\n  classpath: com.example.SomeObject\n  params:\n    param1: value1\n```\nThis will trigger your ```run``` method with the above dataFrameName.\nCheck out the built-in code steps [here](src/main/scala/com/yotpo/metorikku/code/steps).\n\n*NOTE: If you added some dependencies to your custom JAR build.sbt you have to either use [sbt-assembly](https://github.com/sbt/sbt-assembly) to add them to the JAR or you can use the ```--packages``` when running the spark-submit command*\n\n##### Custom Functions\nThere are some custom functions already implemented as part of the Metorikku JAR:\n\n- **SelectiveMerge:** Outer joins two tables according to keys.\n```yaml\n- dataFrameName: resultFrame\n  classpath: com.yotpo.metorikku.code.steps.SelectiveMerge\n  params:\n    df1: table1\n    df2: table2\n    joinKeys: column1,column2\n```\n- **RemoveDuplicates:** Remove duplicate rows based on index columns, or compare entire rows if not provided.\n```yaml\n- dataFrameName: resultFrame\n  classpath: com.yotpo.metorikku.code.steps.RemoveDuplicates\n  params:\n    table: 
tableName\n    columns: column1,column2\n```\n- **DropColumns:** Remove redundant columns by a given list.\n```yaml\n- dataFrameName: resultFrame\n  classpath: com.yotpo.metorikku.code.steps.DropColumns\n  params:\n    table: tableName\n    columns: column1,column2\n```\n- **CamelCaseColumnNames:** Converts snake_case column names to CamelCase\n```yaml\n- dataFrameName: resultFrame\n  classpath: com.yotpo.metorikku.code.steps.CamelCaseColumnNames\n  params:\n    table: tableName\n```\n- **AlignTables:** Converts source table schema to target schema, adding any missing column as null\n```yaml\n- dataFrameName: resultFrame\n  classpath: com.yotpo.metorikku.code.steps.AlignTables\n  params:\n    from: tableToConvert\n    to: tableSchemaToUse\n```\n- **LoadIfExists:** Loads table into a Dataframe, only if the table exists. If not - the result DataFrame will be empty\n```yaml\n- dataFrameName: resultFrame\n  classpath: com.yotpo.metorikku.code.steps.LoadIfExists\n  params:\n    dfName: dfToFill\n    tableName: tableToFillWith\n```\n- **ObfuscateColumns:** Obfuscates columns in the dataframe, supports md5, sha256, and a literal value.\n```yaml\n- dataFrameName: resultFrame\n  classpath: com.yotpo.metorikku.code.steps.obfuscate.ObfuscateColumns\n  params:\n    table: table\n    columns: 'col1,col2,col3'\n    delimiter: ','\n    value: sha256\n```\n\n```yaml\n- dataFrameName: resultFrame\n  classpath: com.yotpo.metorikku.code.steps.obfuscate.ObfuscateColumns\n  params:\n    table: table\n    columns: 'col1|col2|col3'\n    delimiter: '|'\n    value: '********'\n```\n\n#### Apache Hive metastore\nMetorikku supports reading and saving tables with Apache hive metastore.\nTo enable hive support via spark-submit (assuming you're using MySQL as Hive's DB but any backend can work) send the following configurations:\n```bash\nspark-submit \\\n--packages mysql:mysql-connector-java:5.1.75 \\\n--conf spark.sql.catalogImplementation=hive \\\n--conf 
spark.hadoop.javax.jdo.option.ConnectionURL=\"jdbc:mysql://localhost:3306/hive?useSSL=false\u0026createDatabaseIfNotExist=true\" \\\n--conf spark.hadoop.javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver \\\n--conf spark.hadoop.javax.jdo.option.ConnectionUserName=user \\\n--conf spark.hadoop.javax.jdo.option.ConnectionPassword=pass \\\n--conf spark.sql.warehouse.dir=/warehouse ...\n```\n\n*NOTE: If you're running via the standalone metorikku you can use system properties instead (```-Dspark.hadoop...```) and you must add the MySQL connector JAR to your class path via ```-cp```*\n\nThis will enable reading from the metastore.\n\nTo write an external table to the metastore you need to add **tableName** to your output configuration:\n```yaml\n...\noutput:\n- dataFrameName: moviesWithRatings\n  outputType: Parquet\n  outputOptions:\n    saveMode: Overwrite\n    path: moviesWithRatings.parquet\n    tableName: hiveTable\n    overwrite: true\n```\nOnly file formats are supported for table saves (**Parquet**, **CSV**, **JSON**).\n\nTo write a managed table (that will reside in the warehouse dir) simply omit the **path** in the output configuration.\n\nTo change the default database you can add the following to the job configuration:\n```yaml\n...\ncatalog:\n  database: some_database\n...\n\n```\n\n##### Hive table properties\n\nMetorikku can update a [table's properties](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=82706445#LanguageManualDDL-listTableProperties) in Hive.\nYou can use one of the following methods to do it.\nUsing static table properties:\n```yaml\noutput:\n- dataFrameName: dataFrame\n  outputType: Parquet\n  outputOptions:\n    saveMode: Overwrite\n    path: path.parquet\n    tableName: table\n    tableProperties:\n      property: value1\n      comment: comment1\n```\n\nOr by using dynamic properties with the Catalog writer (please note that the dataframe needs to contain exactly one row; all columns from this 
row will be written as properties in the hive table):\n```yaml\nsteps:\n- dataFrameName: tableProperties\n  sql: SELECT count(1) as number_of_rows FROM anotherTable\n...\n- dataFrameName: tableProperties\n  outputType: Catalog\n  outputOptions:\n    tableName: table\n```\n\nCheck out the [examples](examples/hive) and the [E2E test](e2e/hive).\n\n\n#### Apache Hudi\nMetorikku supports reading/writing with [Apache Hudi](https://github.com/apache/incubator-hudi).\n\nHudi is a very exciting project that basically allows upserts and deletes directly on top of partitioned parquet data.\n\nIn order to use Hudi with Metorikku you need to add to your classpath (via ```--jars``` or if running locally with ```-cp```)\nan external JAR from here: https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark-bundle_2.12/0.10.0/hudi-spark-bundle_2.12-0.10.0.jar\n\nTo run Hudi jobs you also have to make sure you have the following spark configuration (pass with ```--conf``` or ```-D```):\n```properties\nspark.serializer=org.apache.spark.serializer.KryoSerializer\n```\n\nAfter that you can start using the new Hudi writer like this:\n\n#### Job config\n```yaml\noutput:\n  hudi:\n    dir: /examples/output\n    # This controls the level of parallelism of hudi writing (should be similar to shuffle partitions)\n    parallelism: 1\n    # upsert/insert/bulkinsert\n    operation: upsert\n    # COPY_ON_WRITE/MERGE_ON_READ\n    storageType: COPY_ON_WRITE\n    # Maximum number of versions to retain\n    maxVersions: 1\n    # Hive database to use when writing\n    hiveDB: default\n    # Hive server URL (no longer needed in hudi 0.5.3+)\n    hiveJDBCURL: jdbc:hive2://hive:10000\n    hiveUserName: root\n    hivePassword: pass\n    # Delete inflight and compaction requested of unfinished commit\n    deletePendingCompactions: true\n```\n\n#### Metric config\n```yaml\n- dataFrameName: test\n  outputType: Hudi\n  outputOptions:\n    path: test.parquet\n    # The key to use for upserts\n    keyColumn: 
userkey\n    # This will be used to determine which row should prevail (newer timestamps will win)\n    timeColumn: ts\n    # Partition column - note that hudi supports a single column only, so if you require multiple levels of partitioning you need to add / to your column values\n    partitionBy: date\n    # Mapping of the above partitions to hive (for example if above is yyyy/MM/dd then the mapping should be year,month,day)\n    hivePartitions: year,month,day\n    # Hive table to save the results to\n    tableName: test_table\n    # Add missing columns according to previous schema, if exists\n    alignToPreviousSchema: true\n    # Remove completely null columns\n    removeNullColumns: true\n```\n\nTo delete rows, include a boolean column called ```_hoodie_delete``` in your dataframe; every row where it is true will be deleted.\n\nCheck out the [examples](e2e/hudi) and the [E2E test](e2e/hudi) for more details.\n\nAlso check the full list of configurations possible with hudi [here](http://hudi.incubator.apache.org/configurations.html).\n\n#### Apache Atlas\nMetorikku supports Data Lineage and Governance using [Apache Atlas](https://atlas.apache.org/) and the [Spark Atlas Connector](https://github.com/hortonworks-spark/spark-atlas-connector).\n\nAtlas is an open source Data Governance and Metadata framework for Hadoop which provides open metadata management and governance capabilities for organizations to build a catalog of their data assets, classify and govern these assets and provide collaboration capabilities around these data assets for data scientists, analysts and the data governance team.\n\nIn order to use the spark-atlas-connector with Metorikku you need to add to your classpath (via ```--jars``` or if running locally with ```-cp```)\nan external JAR from here: https://github.com/YotpoLtd/spark-atlas-connector/releases/download/latest/spark-atlas-connector-assembly.jar\n\nTo integrate the connector with Metorikku docker, you need to pass `USE_ATLAS=true` as an 
environment variable and the following config will be automatically added to `spark-defaults.conf`:\n```properties\nspark.extraListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker\nspark.sql.queryExecutionListeners=com.hortonworks.spark.atlas.SparkAtlasEventTracker\nspark.sql.streaming.streamingQueryListeners=com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker\n```\nFor a full example please refer to examples/docker-compose-atlas.yml\n\n#### Data Quality\nYou can also execute a series of verifications on your SQL steps by adding a `dq` block to your SQL step within the metric file.\nFor example:\n```yaml\nsteps:\n- dataFrameName: df1\n  sql:\n    SELECT col1, col2\n    FROM input_1\n    WHERE id \u003e 100\n  dq:\n    level: warn\n    checks:\n      - isComplete:\n          column: col1\n      - isComplete:\n          column: col2\n          level: error\n```\nCheck out the [readme](examples/dq/README.md) and [example](examples/dq) for further details.\n\n## License\nSee the [LICENSE](LICENSE.md) file for license rights and limitations (MIT).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyotpoltd%2Fmetorikku","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyotpoltd%2Fmetorikku","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyotpoltd%2Fmetorikku/lists"}