https://github.com/teragrep/pth_10

Data Processing Language (DPL) translator for Apache Spark

Host: GitHub
URL: https://github.com/teragrep/pth_10
Owner: teragrep
License: agpl-3.0
Created: 2023-06-07T09:35:02.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2026-02-06T07:22:44.000Z (5 months ago)
Last Synced: 2026-02-06T15:35:31.643Z (5 months ago)
Topics: apache-spark, data-processing-language, dpl, programming-language-translator, teragrep
Language: Java
Homepage: https://teragrep.com
Size: 1.98 MB
Stars: 1
Watchers: 2
Forks: 9
Open Issues: 427
Metadata Files:
- Readme: README.adoc
- License: LICENSE

= PTH_10: DPL to Apache Spark Translator

Translates Data Processing Language (DPL) commands to Apache Spark actions and transformations.

Uses ANTLR visitors to generate a list of step objects, which contain the actual implementations of the commands using the Apache Spark API.

== Features

- Translates a string-based DPL command to Apache Spark actions and transformations, using the parse tree generated by the https://github.com/teragrep/pth_03[PTH_03] ANTLR-based parser.

- Fetches data from a datasource provider (by default, the https://github.com/teragrep/pth_06[PTH_06] datasource provider) and filters it with the filters specified in the DPL command.

- Applies various transformations and actions to the data with simple, easy-to-understand commands.

- Supports parallel and sequential modes, chosen by the kind of commands used. If a command requires batch-based processing, sequential mode is used. Otherwise, processing remains in parallel mode, which allows stream processing.

- Encloses Spark API implementations in so-called Step objects, which take a Dataset as input and return the transformed Dataset, making these objects easy to reuse.

- Keeps the ANTLR-based visitor functions free of command implementation logic; they only gather the parameters that the Step objects need.
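
The Step idea and the choice between parallel and sequential mode can be pictured with a small, self-contained sketch. Everything in it is hypothetical and written for illustration only: `StepPipelineSketch`, `Step`, `FilterStep`, `SortStep`, `requiresBatch()` and `mode()` are not PTH_10 classes or methods, and a plain `List<String>` stands in for a Spark Dataset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, simplified model of the Step idea; these are NOT PTH_10's real classes.
class StepPipelineSketch {

    /** A step takes a dataset (here a plain list of rows) and returns the transformed dataset. */
    interface Step {
        List<String> apply(List<String> dataset);

        /** True when the step must see a whole batch, which forces sequential mode in this sketch. */
        boolean requiresBatch();
    }

    /** Keeps rows containing the given term; works row by row, so it is stream-friendly. */
    static final class FilterStep implements Step {
        private final String term;

        FilterStep(String term) {
            this.term = term;
        }

        public List<String> apply(List<String> dataset) {
            return dataset.stream().filter(row -> row.contains(term)).collect(Collectors.toList());
        }

        public boolean requiresBatch() {
            return false;
        }
    }

    /** Sorts all rows; it needs the complete batch before it can produce output. */
    static final class SortStep implements Step {
        public List<String> apply(List<String> dataset) {
            List<String> sorted = new ArrayList<>(dataset);
            Collections.sort(sorted);
            return sorted;
        }

        public boolean requiresBatch() {
            return true;
        }
    }

    /** Sequential mode is chosen as soon as one step requires batch processing. */
    static String mode(List<Step> steps) {
        for (Step step : steps) {
            if (step.requiresBatch()) {
                return "sequential";
            }
        }
        return "parallel";
    }

    /** Feeds the output of each step into the next one. */
    static List<String> run(List<Step> steps, List<String> input) {
        List<String> current = input;
        for (Step step : steps) {
            current = step.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<Step> steps = Arrays.asList(new FilterStep("error"), new SortStep());
        List<String> rows = Arrays.asList("error b", "info c", "error a");
        System.out.println(mode(steps));      // prints "sequential"
        System.out.println(run(steps, rows)); // prints "[error a, error b]"
    }
}
```

In this sketch each step only knows its own parameters and its own transformation, so the code that builds the list (the visitor's role in the description above) needs no knowledge of how a command is executed.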

== Documentation

See the official documentation on https://docs.teragrep.com[docs.teragrep.com].

== Limitations

Not all commands in the Data Processing Language are yet implemented.

== How to

Use:

- Create a new DPLParserCatalystContext. It requires a `SparkSession` object and a `com.typesafe.config.Config`. The config is usually provided by the Zeppelin component.

[,java]
----
DPLParserCatalystContext catCtx = new DPLParserCatalystContext(sparkSession, config);
----

- Create a new DPLParserCatalystVisitor and pass it the DPLParserCatalystContext.

[,java]
----
DPLParserCatalystVisitor catVisitor = new DPLParserCatalystVisitor(catCtx);
----

- Visit the parse tree generated by PTH_03 with the DPLParserCatalystVisitor.visit() function.

[,java]
----
CatalystNode n = (CatalystNode) catVisitor.visit(tree);
----

- The result of that function is a CatalystNode. It contains a DataStreamWriter; starting the writer begins the execution.

[,java]
----
n.getDataStreamWriter();
----

- Set the visitor's Consumer to a function of your choice to view the resulting Dataset or move it to the desired component.

[,java]
----
catVisitor.setConsumer((ds, id) -> {
    ds.show();
});
----
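
The consumer hook in the last step follows a common callback pattern, which the following plain-Java sketch tries to illustrate. It is an assumption-laden stand-in, not the real visitor: `ConsumerHookSketch`, `ResultEmitter` and `emit()` are hypothetical names, a `List<String>` replaces the Dataset, and treating the second lambda argument as a numeric batch id is a guess based on the parameter name `id`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical stand-in for the consumer hook; NOT the real DPLParserCatalystVisitor.
class ConsumerHookSketch {

    /** Holds a callback that receives each result batch together with an id. */
    static final class ResultEmitter {
        // The default consumer does nothing, so emitting without a registered consumer is safe.
        private BiConsumer<List<String>, Long> consumer = (batch, id) -> { };

        void setConsumer(BiConsumer<List<String>, Long> consumer) {
            this.consumer = consumer;
        }

        /** Hands one batch to whichever consumer is currently registered. */
        void emit(List<String> batch, long batchId) {
            consumer.accept(batch, batchId);
        }
    }

    /** Registers a collecting consumer, emits two batches, and returns what was collected. */
    static List<String> collectAll() {
        List<String> collected = new ArrayList<>();
        ResultEmitter emitter = new ResultEmitter();
        emitter.setConsumer((batch, id) -> {
            for (String row : batch) {
                collected.add(id + ":" + row);
            }
        });
        emitter.emit(Arrays.asList("a", "b"), 0L);
        emitter.emit(Arrays.asList("c"), 1L);
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(collectAll()); // prints "[0:a, 0:b, 1:c]"
    }
}
```

In this sketch a batch emitted before `setConsumer` is called would reach only the no-op default and be lost, which is why the consumer is registered before the first `emit`.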

For a more concrete example, check out the https://github.com/teragrep/pth_07[PTH_07] Zeppelin DPL Interpreter project.

Compile:

[,sh]
----
mvn clean install -Pbuild
----

== Contributing

You can get involved in the project by https://github.com/teragrep/pth_10/issues/new/choose[opening an issue] or submitting a pull request.

Contribution requirements:

. *All changes must be accompanied by a new or changed test.* If you think testing is not required in your pull request, include a sufficient explanation of why you think so.

. Security checks must pass.

. Pull requests must align with the principles and http://www.extremeprogramming.org/values.html[values] of extreme programming.

. Pull requests must follow the principles of Object Thinking and Elegant Objects (EO).

Read more in our https://github.com/teragrep/teragrep/blob/main/contributing.adoc[Contributing Guideline].

=== Contributor License Agreement

Contributors must sign the https://github.com/teragrep/teragrep/blob/main/cla.adoc[Teragrep Contributor License Agreement] before a pull request is accepted into the organization's repositories.

You need to submit the CLA only once. After submitting it, you can contribute to all of Teragrep's repositories.
