ETL on Apache Spark

Host: GitHub
URL: https://github.com/andreoss/etoile
Owner: andreoss
License: lgpl-3.0
Created: 2019-12-10T17:20:15.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2023-02-28T01:57:08.000Z (almost 2 years ago)
Last Synced: 2023-03-12T00:33:35.641Z (almost 2 years ago)
Topics: etl, spark
Language: Java
Homepage:
Size: 528 KB
Stars: 3
Watchers: 3
Forks: 3
Open Issues: 19
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Étoilé

[![Build Status](https://travis-ci.org/andreoss/etoile.svg?branch=master)](https://travis-ci.org/andreoss/etoile)

[![Maintainability](https://api.codeclimate.com/v1/badges/45765fb306089171912c/maintainability)](https://codeclimate.com/github/andreoss/etoile/maintainability)

[![Quality Gate Status](https://sonarcloud.io/api/project_badges/measure?project=andreoss_etoile&metric=alert_status)](https://sonarcloud.io/dashboard?id=andreoss_etoile)

[![Lines of Code](https://sonarcloud.io/api/project_badges/measure?project=andreoss_etoile&metric=ncloc)](https://sonarcloud.io/dashboard?id=andreoss_etoile)

[![Coverage](https://sonarcloud.io/api/project_badges/measure?project=andreoss_etoile&metric=coverage)](https://sonarcloud.io/dashboard?id=andreoss_etoile)

[![Hits-of-Code](https://hitsofcode.com/github/andreoss/etoile)](https://hitsofcode.com/view/github/andreoss/etoile)

[![Language grade: Java](https://img.shields.io/lgtm/grade/java/g/andreoss/etoile.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/andreoss/etoile/context:java)

[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2Fandreoss%2Fetoile.svg?type=shield)](https://app.fossa.com/projects/git%2Bgithub.com%2Fandreoss%2Fetoile?ref=badge_shield)

**Étoilé** is a multi-purpose Spark ETL application.

## Features

* Dump (`dump`)

  Dumps a dataset to a different format.

* Partition validation (`pv`)

  Validates the partitioning scheme against a provided formula.

  It uses Spark's `input_file_name()` function and compares each directory name with the result of the formula.

  More than one expression can be passed for nested partitions.

* Compare (`compare`)

  Compares two datasets by joining them on their primary keys.

## Build

```
./mvnw clean install -DskipTests
```

## Option format

Options are prefixed. After the prefix is stripped, they are primarily passed to `DataFrameReader` or `DataFrameWriter`, so any option those classes support can be specified (such as `delimiter` for the CSV format, or `dbtable` for the JDBC format).

The prefix can be `input`, `output`, `left`, or `right`, depending on the command.
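The prefix handling described above can be sketched in plain Java. This is a minimal illustration, not the project's actual code: the class `PrefixedOptionsSketch` and its methods `parse` and `withPrefix` are hypothetical names, and the `--key=value` argument shape is assumed from the examples in this README.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: groups "--prefix.key=value" arguments by prefix.
final class PrefixedOptionsSketch {

    /** Parses "--key=value" arguments into an ordered map; other arguments are ignored. */
    static Map<String, String> parse(String... args) {
        Map<String, String> all = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (arg.startsWith("--") && eq > 2) {
                all.put(arg.substring(2, eq), arg.substring(eq + 1));
            }
        }
        return all;
    }

    /** Keeps only the keys under "prefix." and strips that prefix from them. */
    static Map<String, String> withPrefix(Map<String, String> all, String prefix) {
        String dotted = prefix + ".";
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : all.entrySet()) {
            if (e.getKey().startsWith(dotted)) {
                result.put(e.getKey().substring(dotted.length()), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> all = parse(
            "--command=dump", "--input.format=csv",
            "--input.delimiter=;", "--output.format=avro");
        System.out.println(withPrefix(all, "input"));  // {format=csv, delimiter=;}
        System.out.println(withPrefix(all, "output")); // {format=avro}
    }
}
```

A map produced this way could then be handed to Spark through `DataFrameReader.options(Map)` or `DataFrameWriter.options(Map)`, which is presumably how reader and writer options such as `delimiter` reach Spark.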

## Run

Submit with `spark2-submit`:

```
spark2-submit --master local étoilé.jar --command=....
```

## Example 1: Dumping Hive table to local CSV files

```
--command=dump
--input.table=FOO.BAR
--output.format=csv
--output.path=file:///tmp/dump-result
```

## Example 2: Dumping Oracle JDBC table to Avro

The `hive-names` option indicates that Hive-incompatible symbols should be removed from column names.

```
--command=dump
--input.format=jdbc
--input.url=jdbc...
--input.dbtable=FOO.BAR
--output.format=avro
--output.hive-names=true
--output.path=file:///tmp/dump-result
```

## Example 3: Validate partitioning structure

With these two expressions, a valid directory structure looks like `id_part=1/cn_part=1`, and so on.

The value after each `=` must match the result of the corresponding expression.

```
--command=pv
--expression.1=pmod(id, 10) as id_part
--expression.2=pmod(cn, 10) as cn_part
--input.format=parquet
--input.path=/data/my/table
--input.dbtable=FOO.BAR
--output.format=avro
--output.hive-names=true
--output.path=file:///tmp/dump-result
```
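The check this example performs can be sketched per row in plain Java. This is a minimal sketch under stated assumptions, not the project's implementation: the class `PartitionCheckSketch` and its methods are hypothetical, the file path stands in for what `input_file_name()` would return, and `Math.floorMod` stands in for Spark's `pmod` (the two agree for a positive divisor such as 10).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: validates one row's file path against two partition formulas.
final class PartitionCheckSketch {

    /** Extracts "name=value" directory segments from a file path, in order. */
    static List<String> partitionSegments(String filePath) {
        List<String> segments = new ArrayList<>();
        String[] parts = filePath.split("/");
        // The last part is the file name itself, so it is skipped.
        for (int i = 0; i < parts.length - 1; i++) {
            if (parts[i].contains("=")) {
                segments.add(parts[i]);
            }
        }
        return segments;
    }

    /** Segments expected for "pmod(id, 10) as id_part" and "pmod(cn, 10) as cn_part". */
    static List<String> expectedSegments(long id, long cn) {
        return List.of(
            "id_part=" + Math.floorMod(id, 10L),
            "cn_part=" + Math.floorMod(cn, 10L));
    }

    /** True when the directories in the path match the values computed from the row. */
    static boolean isValid(String filePath, long id, long cn) {
        return partitionSegments(filePath).equals(expectedSegments(id, cn));
    }

    public static void main(String[] args) {
        String path = "/data/my/table/id_part=1/cn_part=3/part-00000.parquet";
        System.out.println(isValid(path, 21, 13)); // true
        System.out.println(isValid(path, 22, 13)); // false
    }
}
```

A row with `id = 22` stored under `id_part=1` would be reported as misplaced, because `pmod(22, 10)` is 2.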

## Example 4: Compare two datasets

The output contains the rows that do not match.

```
--command=compare
--keys=id
--left.format=parquet
--left.path=/data/my/table1
--right.format=parquet
--right.path=/data/my/table2
--output.format=json
--output.path=file:///tmp/-result
```
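The idea of comparing by key can be illustrated without Spark. The sketch below is an assumption-laden simplification, not the project's logic: each dataset is reduced to a map from key to a single row value, the class `CompareSketch` and the `" <> "` output format are hypothetical, and treating keys present on only one side as mismatches (full-outer-join behaviour) is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

// Hypothetical sketch: reports keys whose rows differ between two datasets.
final class CompareSketch {

    /** Returns each mismatching key mapped to a "left <> right" description. */
    static Map<String, String> mismatches(Map<String, String> left, Map<String, String> right) {
        // Union of keys from both sides, sorted for a stable report order.
        TreeSet<String> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet());
        Map<String, String> result = new LinkedHashMap<>();
        for (String key : keys) {
            String l = left.get(key);  // null when the key is missing on the left
            String r = right.get(key); // null when the key is missing on the right
            if (!Objects.equals(l, r)) {
                result.put(key, l + " <> " + r);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> left = Map.of("1", "a", "2", "b", "3", "c");
        Map<String, String> right = Map.of("1", "a", "2", "x", "4", "d");
        System.out.println(mismatches(left, right));
        // {2=b <> x, 3=c <> null, 4=null <> d}
    }
}
```

Key `1` is absent from the result because both sides hold the same value for it.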

## Misc options

* `--<prefix>.drop` excludes one or more columns.

* `--<prefix>.sort` reorders the dataset by a column or expression.

* `--<prefix>.cast` applies a cast to a column.

  Example: `--output.cast=id:string,date:timestamp`

* `--<prefix>.convert` works like `cast`, but converts every column of one type to another type, so no column names are given.

  Example: `--output.convert=timestamp:string,decimal:timestamp`

* `--<prefix>.hive-names` converts column names to Hive-compatible ones, i.e. by removing non-alphanumeric characters.

* `--<prefix>.rename` renames a column.

  Example: `--output.rename=id as iden`
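Two of the options above lend themselves to a small illustration: the `column:type` syntax of `cast` and the name clean-up done by `hive-names`. The following is a rough sketch only. The class `MiscOptionsSketch` and both methods are hypothetical, and keeping underscores in `hiveName` is an assumption, since the README only says non-alphanumeric characters are removed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of how cast specs and Hive-compatible names could be handled.
final class MiscOptionsSketch {

    /** Parses "id:string,date:timestamp" into an ordered column-to-type map. */
    static Map<String, String> parseCastSpec(String spec) {
        Map<String, String> casts = new LinkedHashMap<>();
        for (String pair : spec.split(",")) {
            String[] kv = pair.trim().split(":", 2);
            if (kv.length != 2 || kv[0].isEmpty() || kv[1].isEmpty()) {
                throw new IllegalArgumentException("Expected column:type, got: " + pair);
            }
            casts.put(kv[0], kv[1]);
        }
        return casts;
    }

    /** Removes every character that is not an ASCII letter, digit, or underscore. */
    static String hiveName(String column) {
        // Assumption: underscores are kept because Hive accepts them in column names.
        return column.replaceAll("[^A-Za-z0-9_]", "");
    }

    public static void main(String[] args) {
        System.out.println(parseCastSpec("id:string,date:timestamp")); // {id=string, date=timestamp}
        System.out.println(hiveName("ORDER#ID$"));                      // ORDERID
    }
}
```

Types that contain a comma themselves, such as `decimal(10,2)`, are not handled by this sketch.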

## License

[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2Fandreoss%2Fetoile.svg?type=large)](https://app.fossa.com/projects/git%2Bgithub.com%2Fandreoss%2Fetoile?ref=badge_large)