https://github.com/teragrep/pth_10
Data Processing Language (DPL) translator for Apache Spark
apache-spark data-processing-language dpl programming-language-translator teragrep
- Host: GitHub
- URL: https://github.com/teragrep/pth_10
- Owner: teragrep
- License: agpl-3.0
- Created: 2023-06-07T09:35:02.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2026-02-06T07:22:44.000Z (4 months ago)
- Last Synced: 2026-02-06T15:35:31.643Z (4 months ago)
- Topics: apache-spark, data-processing-language, dpl, programming-language-translator, teragrep
- Language: Java
- Homepage: https://teragrep.com
- Size: 1.98 MB
- Stars: 1
- Watchers: 2
- Forks: 9
- Open Issues: 427
Metadata Files:
- Readme: README.adoc
- License: LICENSE
= PTH_10: DPL to Apache Spark Translator
Translates Data Processing Language (DPL) commands to Apache Spark actions and transformations.
Uses ANTLR visitors to generate a list of step objects, which contain the actual implementations of the commands
using the Apache Spark API.
== Features
- Translates a string-based DPL command using the parse tree generated by the https://github.com/teragrep/pth_03[PTH_03]
ANTLR-based parser to Apache Spark actions and transformations.
- Fetches data from a datasource provider (by default, the https://github.com/teragrep/pth_06[PTH_06] datasource provider) and
filters it using the filters specified in the DPL command.
- Applies various transformations and actions to the data with simple, easy-to-understand commands.
- Supports parallel and sequential modes, depending on the commands used. If a command requires batch-based
processing, sequential mode is used. Otherwise, processing remains in parallel mode, which allows stream processing.
- Spark API implementations are enclosed in so-called Step objects, which take a Dataset as input and return the
transformed Dataset, making these objects easy to reuse.
- ANTLR-based visitor functions only gather the parameters these objects need; they contain
no implementation logic of the commands themselves.
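
The split between Step objects, visitor functions, and the two processing modes can be sketched in plain Java. This is a minimal illustration and not PTH_10's actual API: the `Step` interface, `FilterStep`, `SortStep`, `StepList`, and `visitPrefixFilter` are hypothetical names, `List<String>` stands in for a Spark Dataset, and treating sorting as a batch-only operation is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-in for a Step: takes data in, returns transformed data.
interface Step<T> {
    T apply(T input);

    // True if this step needs batch-based (sequential) processing.
    boolean requiresSequentialMode();
}

// A step that can run in parallel mode: keeps only rows matching a predicate.
final class FilterStep implements Step<List<String>> {
    private final Predicate<String> predicate;

    FilterStep(Predicate<String> predicate) {
        this.predicate = predicate;
    }

    @Override
    public List<String> apply(List<String> input) {
        return input.stream().filter(predicate).collect(Collectors.toList());
    }

    @Override
    public boolean requiresSequentialMode() {
        return false;
    }
}

// A step assumed to need the whole batch: sorts all rows.
final class SortStep implements Step<List<String>> {
    @Override
    public List<String> apply(List<String> input) {
        List<String> sorted = new ArrayList<>(input);
        Collections.sort(sorted);
        return sorted;
    }

    @Override
    public boolean requiresSequentialMode() {
        return true;
    }
}

// Holds the steps gathered from a command and runs them in order.
final class StepList {
    private final List<Step<List<String>>> steps;

    StepList(List<Step<List<String>>> steps) {
        this.steps = steps;
    }

    // Sequential mode is needed as soon as one step requires it.
    boolean sequential() {
        return steps.stream().anyMatch(Step::requiresSequentialMode);
    }

    List<String> execute(List<String> input) {
        List<String> current = input;
        for (Step<List<String>> step : steps) {
            current = step.apply(current);
        }
        return current;
    }
}

final class StepSketch {
    // Hypothetical "visitor function": gathers a parameter, builds a step,
    // and contains no transformation logic of its own.
    static Step<List<String>> visitPrefixFilter(String prefix) {
        return new FilterStep(row -> row.startsWith(prefix));
    }

    public static void main(String[] args) {
        StepList steps = new StepList(List.of(visitPrefixFilter("a"), new SortStep()));
        System.out.println(steps.sequential());
        System.out.println(steps.execute(List.of("b", "ab", "aa")));
    }
}
```

Because each step only maps input to output, the same step instance can be reused in several step lists, and the mode decision stays a property of the list rather than of the visitor.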
== Documentation
See the official documentation on https://docs.teragrep.com[docs.teragrep.com].
== Limitations
Not all Data Processing Language commands are implemented yet.
== How to
Use:
- Create a new `DPLParserCatalystContext`. It requires a `SparkSession` object and a `com.typesafe.config.Config`. The
config is usually provided by the Zeppelin component.
[,java]
----
DPLParserCatalystContext catCtx = new DPLParserCatalystContext(sparkSession, config);
----
- Create a new `DPLParserCatalystVisitor` and pass it the `DPLParserCatalystContext`.
[,java]
----
DPLParserCatalystVisitor visitor = new DPLParserCatalystVisitor(catCtx);
----
- Visit the parse tree generated by PTH_03 with the `DPLParserCatalystVisitor.visit()` function.
[,java]
----
CatalystNode n = (CatalystNode) visitor.visit(tree);
----
- The result of that function is a `CatalystNode`. It contains a `DataStreamWriter`; starting it begins the execution.
[,java]
----
n.getDataStreamWriter();
----
- Set the visitor's Consumer to a function of your liking to view or move the resulting Dataset to the desired component.
[,java]
----
visitor.setConsumer((ds, id) -> {
ds.show();
});
----
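
The consumer callback in the last step can be sketched without Spark. This is only an illustration of the pattern: `ResultEmitter` is a hypothetical stand-in for the visitor, `List<String>` replaces the Dataset, and the `String` type of the second argument is an assumption, since this README does not state the real type of `id`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical stand-in for a visitor that hands each result to a consumer.
final class ResultEmitter {
    // Default consumer discards the result.
    private BiConsumer<List<String>, String> consumer = (rows, id) -> { };

    void setConsumer(BiConsumer<List<String>, String> consumer) {
        this.consumer = consumer;
    }

    // Called once for every finished result.
    void emit(List<String> rows, String id) {
        consumer.accept(rows, id);
    }
}

final class ConsumerSketch {
    public static void main(String[] args) {
        // The "desired component" here is just an in-memory list.
        List<String> sink = new ArrayList<>();

        ResultEmitter emitter = new ResultEmitter();
        emitter.setConsumer((rows, id) -> sink.addAll(rows));

        emitter.emit(List.of("row1", "row2"), "batch-0");
        System.out.println(sink);
    }
}
```

Swapping the lambda is enough to route results elsewhere, for example to a display component instead of a list.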
For a more concrete example, check out the https://github.com/teragrep/pth_07[PTH_07] Zeppelin DPL Interpreter project.
Compile:
[,sh]
----
mvn clean install -Pbuild
----
== Contributing
You can get involved in the project by https://github.com/teragrep/pth_10/issues/new/choose[opening an issue]
or submitting a pull request.
Contribution requirements:
. *All changes must be accompanied by a new or changed test.* If you think testing is not required in your pull request, include a sufficient explanation of why you think so.
. Security checks must pass.
. Pull requests must align with the principles and http://www.extremeprogramming.org/values.html[values] of extreme programming.
. Pull requests must follow the principles of Object Thinking and Elegant Objects (EO).
Read more in our https://github.com/teragrep/teragrep/blob/main/contributing.adoc[Contributing Guideline].
=== Contributor License Agreement
Contributors must sign the https://github.com/teragrep/teragrep/blob/main/cla.adoc[Teragrep Contributor License Agreement] before a pull request is accepted into the organization's repositories.
You need to submit the CLA only once. After submitting the CLA, you can contribute to all of Teragrep's repositories.