= PTH_10: DPL to Apache Spark Translator

Translates Data Processing Language (DPL) commands to Apache Spark actions and transformations.
Uses ANTLR visitors to generate a list of step objects, which contain the actual implementations of the commands
using the Apache Spark API.

== Features

- Translates string-based DPL commands to Apache Spark actions and transformations, using the parse tree generated by
the https://github.com/teragrep/pth_03[PTH_03] ANTLR-based parser.
- Fetches data from a datasource provider (by default, the https://github.com/teragrep/pth_06[PTH_06] datasource provider) and
filters it with the filters specified in the DPL command.
- Applies various transformations and actions to the data with simple, easy-to-understand commands.
- Supports parallel and sequential modes, chosen by the kind of commands used. If a command requires batch-based
processing, sequential mode is used. Otherwise, processing remains in parallel mode, which allows stream processing.
- Spark API implementations are enclosed in so-called Step objects, which take a Dataset as input and return the
transformed Dataset, making these objects easy to reuse.
- ANTLR-based visitor functions only gather the parameters these objects need and contain no
implementation logic of the commands themselves.
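
The Step object idea and the mode selection described above can be illustrated with a minimal, self-contained sketch. It is not the PTH_10 implementation: `StepSketch`, `Step`, `FilterStep`, `SortStep`, `requiresBatch()` and `isSequential()` are hypothetical names, and a plain `List<String>` stands in for a Spark `Dataset` so that the example runs without Spark.

[,java]
----
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public final class StepSketch {

    /** Hypothetical step contract: dataset in, transformed dataset out. */
    public interface Step {
        List<String> apply(List<String> dataset);

        /** Assumption: a step reports whether it needs the whole batch at once. */
        default boolean requiresBatch() {
            return false;
        }
    }

    /** Row-wise filter; it does not need the full batch, so parallel mode is enough. */
    public static final class FilterStep implements Step {
        private final String term;

        public FilterStep(String term) {
            this.term = term;
        }

        @Override
        public List<String> apply(List<String> dataset) {
            return dataset.stream()
                    .filter(row -> row.contains(term))
                    .collect(Collectors.toList());
        }
    }

    /** Sorting needs every row, so this step forces sequential mode. */
    public static final class SortStep implements Step {
        @Override
        public List<String> apply(List<String> dataset) {
            List<String> sorted = new ArrayList<>(dataset);
            Collections.sort(sorted);
            return sorted;
        }

        @Override
        public boolean requiresBatch() {
            return true;
        }
    }

    /** Sequential mode is chosen as soon as one step requires batch processing. */
    public static boolean isSequential(List<Step> steps) {
        return steps.stream().anyMatch(Step::requiresBatch);
    }

    /** Feeds the output of each step into the next one. */
    public static List<String> run(List<Step> steps, List<String> input) {
        List<String> current = input;
        for (Step step : steps) {
            current = step.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<Step> steps = Arrays.asList(new FilterStep("error"), new SortStep());
        List<String> rows = Arrays.asList("error b", "info a", "error a");
        System.out.println(isSequential(steps)); // prints "true"
        System.out.println(run(steps, rows));    // prints "[error a, error b]"
    }
}
----

In this sketch each step only knows its own parameters (for example the filter term). That mirrors the split described above: the visitor collects parameters, and the step holds the implementation.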

== Documentation

See the official documentation on https://docs.teragrep.com[docs.teragrep.com].

== Limitations

Not all Data Processing Language commands are implemented yet.

== How to

Use:

- Create a new DPLParserCatalystContext. It requires a `SparkSession` object and a `com.typesafe.config.Config`. The
config is usually provided from the Zeppelin component.
[,java]
----
DPLParserCatalystContext catCtx = new DPLParserCatalystContext(sparkSession, config);
----
- Create a new DPLParserCatalystVisitor and pass the DPLParserCatalystContext to it.
[,java]
----
DPLParserCatalystVisitor catVisitor = new DPLParserCatalystVisitor(catCtx);
----
- Visit the parse tree generated by PTH_03 with the DPLParserCatalystVisitor.visit() function.
[,java]
----
CatalystNode n = (CatalystNode) catVisitor.visit(tree);
----
- The result of that function is a CatalystNode. It contains a DataStreamWriter; starting the writer starts the execution.
[,java]
----
n.getDataStreamWriter();
----
- Set the visitor's Consumer to a function of your choice to view the resulting Dataset or move it to the desired component.
[,java]
----
catVisitor.setConsumer((ds, id) -> {
    ds.show();
});
----

For a more concrete example, check out the https://github.com/teragrep/pth_07[PTH_07] Zeppelin DPL Interpreter project.

Compile:

[,sh]
----
mvn clean install -Pbuild
----

== Contributing

You can get involved in the project by https://github.com/teragrep/pth_10/issues/new/choose[opening an issue]
or submitting a pull request.

Contribution requirements:

. *All changes must be accompanied by a new or changed test.* If you think testing is not required in your pull request, include a sufficient explanation of why you think so.
. Security checks must pass.
. Pull requests must align with the principles and http://www.extremeprogramming.org/values.html[values] of extreme programming.
. Pull requests must follow the principles of Object Thinking and Elegant Objects (EO).

Read more in our https://github.com/teragrep/teragrep/blob/main/contributing.adoc[Contributing Guideline].

=== Contributor License Agreement

Contributors must sign the https://github.com/teragrep/teragrep/blob/main/cla.adoc[Teragrep Contributor License Agreement] before a pull request is accepted into the organization's repositories.

You need to submit the CLA only once. After submitting the CLA, you can contribute to all Teragrep repositories.