{"id":18026693,"url":"https://github.com/andreoss/etoile","last_synced_at":"2025-03-27T01:31:20.900Z","repository":{"id":37833273,"uuid":"227176324","full_name":"andreoss/etoile","owner":"andreoss","description":"ETL on Apache Spark","archived":false,"fork":false,"pushed_at":"2023-03-15T01:57:29.000Z","size":541,"stargazers_count":3,"open_issues_count":19,"forks_count":3,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-22T21:06:36.222Z","etag":null,"topics":["etl","spark"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/andreoss.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-12-10T17:20:15.000Z","updated_at":"2022-09-22T20:10:51.000Z","dependencies_parsed_at":"2024-10-30T08:28:29.237Z","dependency_job_id":null,"html_url":"https://github.com/andreoss/etoile","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andreoss%2Fetoile","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andreoss%2Fetoile/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andreoss%2Fetoile/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andreoss%2Fetoile/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/andreoss","download_url":"https://codeload.github.com/andreoss/etoile/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https:
//github.com","kind":"github","repositories_count":245764682,"owners_count":20668458,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["etl","spark"],"created_at":"2024-10-30T08:07:43.608Z","updated_at":"2025-03-27T01:31:19.209Z","avatar_url":"https://github.com/andreoss.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Étoilé\n[![Build Status](https://travis-ci.org/andreoss/etoile.svg?branch=master)](https://travis-ci.org/andreoss/etoile)\n[![Maintainability](https://api.codeclimate.com/v1/badges/45765fb306089171912c/maintainability)](https://codeclimate.com/github/andreoss/etoile/maintainability)\n[![Quality Gate Status](https://sonarcloud.io/api/project_badges/measure?project=andreoss_etoile\u0026metric=alert_status)](https://sonarcloud.io/dashboard?id=andreoss_etoile)\n[![Lines of Code](https://sonarcloud.io/api/project_badges/measure?project=andreoss_etoile\u0026metric=ncloc)](https://sonarcloud.io/dashboard?id=andreoss_etoile)\n[![Coverage](https://sonarcloud.io/api/project_badges/measure?project=andreoss_etoile\u0026metric=coverage)](https://sonarcloud.io/dashboard?id=andreoss_etoile)\n[![Hits-of-Code](https://hitsofcode.com/github/andreoss/etoile)](https://hitsofcode.com/view/github/andreoss/etoile)\n[![Language grade: Java](https://img.shields.io/lgtm/grade/java/g/andreoss/etoile.svg?logo=lgtm\u0026logoWidth=18)](https://lgtm.com/projects/g/andreoss/etoile/context:java)\n[![FOSSA 
Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2Fandreoss%2Fetoile.svg?type=shield)](https://app.fossa.com/projects/git%2Bgithub.com%2Fandreoss%2Fetoile?ref=badge_shield)\n\n**Étoilé** is a multi-purpose Spark ETL application.\n\n## Features\n\n* Dump (`dump`)\n  Dumps a dataset to a different format.\n\n* Partition validation (`pv`)\n  Validates the partitioning scheme against a provided formula.\n  It uses Spark's `input_file_name()` function and compares each directory name to the formula.\n  More than one expression can be passed in case of nested partitions.\n\n* Compare (`compare`)\n  Compares two datasets by joining them on primary keys.\n\n## Build\n\n```\n./mvnw clean install -DskipTests\n```\n\n## Option format\n\nOptions are prefixed. Once the prefix is stripped, they are passed primarily to `DataFrameReader` or `DataFrameWriter`, so any option those classes accept can be specified (such as `delimiter` for the CSV format, or `dbtable` for the JDBC format).\nThe prefix can be `input`, `output`, `left`, or `right`, depending on the command.\n\n\n## Run\n\nSubmit with `spark2-submit`:\n\n```\nspark2-submit --master local étoilé.jar --command=....\n```\n\n## Example 1: Dumping Hive table to local CSV files\n\n```\n--command=dump\n--input.table=FOO.BAR\n--output.format=csv\n--output.path=file://tmp/dump-result\n```\n## Example 2: Dumping Oracle JDBC table to Avro\nOption `hive-names` indicates that Hive-incompatible symbols should be removed from column names.\n\n```\n--command=dump\n--input.format=jdbc\n--input.url=jdbc...\n--input.dbtable=FOO.BAR\n--output.format=avro\n--output.hive-names=true\n--output.path=file://tmp/dump-result\n```\n\n## Example 3: Validate partitioning structure\n\nWith these two expressions the valid directory structure should be `id_part=1/cn_part=1` and so on.\nValues after `=` should match the results of the expressions.\n\n```\n--command=pv\n--expression.1=pmod(id, 10) as id_part\n--expression.2=pmod(cn, 10) as 
cn_part\n--input.format=parquet\n--input.path=/data/my/table\n--input.dbtable=FOO.BAR\n--output.format=avro\n--output.hive-names=true\n--output.path=file://tmp/dump-result\n```\n\n## Example 4: Compare two datasets\nThe output will contain the rows that do not match.\n```\n--command=compare\n--keys=id\n--left.format=parquet\n--left.path=/data/my/table1\n--right.format=parquet\n--right.path=/data/my/table2\n--output.format=json\n--output.path=file://tmp/-result\n```\n\n## Misc options\n\n* `--\u003cprefix\u003e.drop` can be used to exclude one or more columns.\n* `--\u003cprefix\u003e.sort` sorts the dataset by a column or expression.\n* `--\u003cprefix\u003e.cast` applies a cast to a column.\nExample: `--output.cast=id:string,date:timestamp`\n* `--\u003cprefix\u003e.convert` same as `cast`, but converts every column of one type to another type, without naming columns.\nExample: `--output.convert=timestamp:string,decimal:timestamp`\n* `--\u003cprefix\u003e.hive-names` converts names to Hive-compatible ones, i.e. by removing non-alphanumeric characters.\n* `--\u003cprefix\u003e.rename` renames a column.\nExample: `--output.rename=id as iden`\n\n\n## License\n[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2Fandreoss%2Fetoile.svg?type=large)](https://app.fossa.com/projects/git%2Bgithub.com%2Fandreoss%2Fetoile?ref=badge_large)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandreoss%2Fetoile","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandreoss%2Fetoile","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandreoss%2Fetoile/lists"}