https://github.com/aappddeevv/loader

ETL data into a database with an easy to use DSL.

##Purpose
An application that loads data into an RDBMS
using parallel loads and a simple DSL inspired by ETL tools
to specify the attribute mappings.

The DSL for schema definition and mapping transforms is
generic and can be used in many environments, including
Spark. You can define your schema using the DSL, then
define your transformation rules using the rules DSL and
apply them to a dataframe. The rules are automatically
translated into Spark-friendly code.

Using the DSL requires some knowledge of Scala, but not much.

##History

The project started out many years ago as a Java program
hosted on SourceForge, but I moved it over to GitHub and
updated it to use Scala about a year ago. I received
a large number of private updates from various versions
in between and was recently prompted privately to publish
them to GitHub.

It has been tested
in production environments and found "good enough" to work on
large loads. Use your ETL tool or the bulk loaders
specific to your RDBMS first, but otherwise you may find this
simple application useful.

##Mappings Development
Create a new sbt project then include this project
as a dependency.

You must first publish this project locally using
```sh
sbt publishLocal
```
then add the published artifact as a dependency of your
project.
```scala
libraryDependencies ++= Seq(
"org.im.loader" %% "csv" % "latest.version"
)
```
This automatically pulls in `org.im.loader.core`.

Once you have specified this project as a dependency,
you need to:
* Create your main program.
* Create your command line options. Some options are
already available through the `program.parser` value.
* Develop your mappings (see below).
* Call the `program.runloader(..)` function, providing
your command line parser (built in the second step), the
default configuration derived from `org.im.loader.Config`
with your list of mappings, and the command line arguments
from your main class.

That's it!

Tip: To create a command line parser from the one
provided in the `program` object, just do:
```scala
val yourparser = new scopt.OptionParser[Config]("loader") {
options ++= program.parser.stdargs // don't retype them
...

}
```

##Mapping Development
To create your mappings, derive from `mappings`
in `org.im.loader` and specify mappings using
the DSL.
```scala
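object table1mappings extends mappings("table1", "table1", Some("theschema")) {
  import sourcefirst._
  import org.im.loader.Implicits._
  import com.lucidchart.open.relate.interp.Parameter._

  // Source first: move the source attribute "cola" directly.
  string("cola").directMove
  // Source first: map the source attribute "colb" to the target "colbtarget".
  long("colb").to("colbtarget")
  ...
  // Target first: fill "colc" using a rule with priority 0.
  to[Long]("colc").rule(0) { ctx =>
    ctx.success(ctx.input.get("funkycolcsource"))
  }
}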
```
You can also define the schema in the mappings to help
with type conversions before your rules receive your data.
Subclassing `mappings` allows you to add your own
convenience combinator methods to it.
For example, you could add a `lookup` combinator or
a `.directMoveButOnlyUnderCertainConditions` combinator.
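
To make the idea of adding a combinator by subclassing concrete, here is a small self-contained sketch. It does not use the real `org.im.loader` classes: `Mappings`, `MyMappings`, `add`, `lookup` and `countryMappings` are hypothetical stand-ins invented for this illustration, and the real base class will have a different API.

```scala
// Hypothetical stand-ins; these are NOT the org.im.loader classes.
object CombinatorSketch {
  type Record = Map[String, String]

  // Minimal base class: collects (target, extraction function) pairs.
  class Mappings {
    private var entries = Vector.empty[(String, Record => Option[String])]

    protected def add(target: String)(f: Record => Option[String]): Unit =
      entries = entries :+ (target -> f)

    // Built-in combinator: copy the source attribute to a same-named target.
    def directMove(source: String): Unit = add(source)(_.get(source))

    // Apply every mapping; targets whose function yields None are omitted.
    def apply(input: Record): Record =
      entries.flatMap { case (t, f) => f(input).map(v => t -> v) }.toMap
  }

  // Subclass that adds a project-specific `lookup` combinator.
  class MyMappings(codes: Map[String, String]) extends Mappings {
    def lookup(source: String, target: String): Unit =
      add(target)(rec => rec.get(source).flatMap(codes.get))
  }

  // Mappings are declared in the object body, as in the DSL example above.
  object countryMappings extends MyMappings(Map("US" -> "United States")) {
    directMove("id")
    lookup("cc", "country")
  }
}
```

Calling `CombinatorSketch.countryMappings(Map("id" -> "7", "cc" -> "US"))` yields a record with `id` copied over and `country` resolved through the lookup table.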

"Source first" mappings are mappings that start with the
source such as `string("cola")`. That says that the mapping
should take its source value from the attribute `cola`
in the input record.

It's better to specify a "target first" mapping
such as `to[..](..)` and then specify processing rules. Rules
have a priority and are run in priority order. See the
dsltests.scala file in the test directory for examples
of mappings and how to specify the rules.
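
As a rough mental model of "target first" mappings with prioritized rules, the following self-contained sketch may help. It is not the library's implementation: `Rule`, `TargetMapping` and `evaluate` are hypothetical names, and two behaviours are assumptions on my part, namely that lower numbers run first and that the first rule to produce a value wins.

```scala
// Hypothetical model of priority-ordered rules; not the org.im.loader API.
object PriorityRulesSketch {
  type Record = Map[String, Any]

  // A rule either produces a value for the target or declines (None).
  final case class Rule[T](priority: Int, run: Record => Option[T])

  // A target-first mapping: the target attribute plus its candidate rules.
  final case class TargetMapping[T](target: String, rules: List[Rule[T]] = Nil) {
    def rule(priority: Int)(f: Record => Option[T]): TargetMapping[T] =
      copy(rules = Rule(priority, f) :: rules)

    // Assumption: try rules in ascending priority; the first success wins.
    def evaluate(input: Record): Option[T] =
      rules.sortBy(_.priority).iterator
        .map(_.run(input))
        .collectFirst { case Some(v) => v }
  }

  def to[T](target: String): TargetMapping[T] = TargetMapping[T](target)

  // Declared out of order on purpose: sorting decides the run order.
  val colc: TargetMapping[Long] =
    to[Long]("colc")
      .rule(10)(_ => Some(-1L)) // fallback default
      .rule(0)(_.get("funkycolcsource").collect { case n: Long => n })

  def main(args: Array[String]): Unit = {
    println(colc.evaluate(Map("funkycolcsource" -> 42L))) // Some(42)
    println(colc.evaluate(Map("other" -> "x")))           // Some(-1)
  }
}
```

Starting from the target makes it easy to see every way a column can be filled in one place, with the priorities documenting which source is preferred.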

##Mapping Testing
The typical development model is to leave your project open
in your editor, edit your mappings, then run the load from
the sbt command line for unit tests. Once the mappings
are complete, bundle up "your" project and deploy it. Since
this library is not published to a Maven repository, download it,
then generate your IDE's configuration using
```sh
sbt "eclipse with-source=true"
```
Develop and test your mappings. Then deploy the entire
application via a zip file.

Check out the `dsltest.scala` test file for examples of how
to specify your mappings.

You will want to drop your favorite JDBC driver into the `lib` directory
or include it in the dependencies inside `build.sbt`.
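
For the `build.sbt` route, a dependency line might look like the fragment below. PostgreSQL and the version number are only an example I picked; substitute the driver and version for your own RDBMS.

```scala
// build.sbt (example only: the PostgreSQL driver and its version are assumptions)
// JDBC drivers are plain Java artifacts, so use a single % before the artifact name.
libraryDependencies += "org.postgresql" % "postgresql" % "42.7.3"
```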

##Deploying

The application can be packaged by typing
```sh
sbt universal:packageBin
```
to obtain a zip file that can be installed. To make this work,
add the same plugins this library specifies to
your own project's `project/plugins.sbt`.
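
As a hedged illustration, a `project/plugins.sbt` could look like the fragment below. The `universal:packageBin` task comes from sbt-native-packager and the `eclipse` command from sbteclipse, but the coordinates and versions shown here are assumptions; copy the exact lines from this library's own `project/plugins.sbt` instead.

```scala
// project/plugins.sbt (illustrative; coordinates and versions are assumptions)
// Provides the universal:packageBin task used for the zip distribution.
addSbtPlugin("com.typesafe.sbt" % "sbt-native-packager" % "1.3.25")
// Provides the `eclipse` command used to generate the IDE configuration.
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.2.4")
```

Depending on the sbt-native-packager version, you may also need `enablePlugins(JavaAppPackaging)` in your `build.sbt`.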

##Spark Support
Spark support is in progress: the code will be refactored
so that the ETL-style approach expressed in the DSL
works well with Spark dataframes. This includes schema definition.