https://github.com/hyper63/etl-template


Host: GitHub
URL: https://github.com/hyper63/etl-template
Owner: hyper63
Created: 2021-01-08T15:22:44.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2021-01-08T15:25:29.000Z (about 5 years ago)
Last Synced: 2025-01-13T05:41:40.242Z (about 1 year ago)
Language: JavaScript
Size: 19.5 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# hyper63 ETL Job Template

This is a template project for creating an ETL Job.

You can create a new ETL job from this template by running the following command:

```
npx bam github:hyper63/etl-template [your-jobname]
```

This creates a new directory named after your job.

Then `cd` into that directory, open `app.arc` in your editor, and change the name of the app:

```
@app
[your-jobname]

@scheduled
etl
  rate 1 day
  src src
```

> NOTE: It is very important to change the name of your app. If your ETL job should not run daily, also change the `rate` of the scheduled job in this file.

## Developer Setup Requirements

> NOTE: Node.js and the AWS CLI are required; see https://nodejs.org and https://aws.amazon.com/cli/

```
cd src
npm install -g @architect/architect
npm install
```

## Setting up Environment Variables

This template is set up to read job-specific configuration from environment variables.

```
SOURCE_URL
SOURCE_TOKEN
TARGET_URL
TARGET_TOKEN
```

Depending on the source endpoint, you may need to specify more config params. You can set these variables in AWS using the `arc env` CLI command.

```
arc env production KEY value
```
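Because a job started with a missing variable tends to fail late and obscurely, it can help to validate the configuration up front. The following is a minimal sketch, not code from the template: `loadConfig` and `REQUIRED_VARS` are hypothetical names, and the URLs and tokens are placeholders.

```javascript
// Names match the variables listed above; the loader itself is a hypothetical helper.
const REQUIRED_VARS = ['SOURCE_URL', 'SOURCE_TOKEN', 'TARGET_URL', 'TARGET_TOKEN']

// Reads the job configuration from an env-like object (defaults to process.env)
// and fails fast with a readable message if anything is missing.
const loadConfig = (env = process.env) => {
  const missing = REQUIRED_VARS.filter((name) => !env[name])
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`)
  }
  return {
    source: { url: env.SOURCE_URL, token: env.SOURCE_TOKEN },
    target: { url: env.TARGET_URL, token: env.TARGET_TOKEN },
  }
}

// Usage with placeholder values instead of the real process.env
const config = loadConfig({
  SOURCE_URL: 'https://source.example.com',
  SOURCE_TOKEN: 'source-secret',
  TARGET_URL: 'https://target.example.com',
  TARGET_TOKEN: 'target-secret',
})
```

Passing the environment as an argument keeps the helper easy to exercise in tests without touching the real `process.env`.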

## Project Structure

```
- src
  - lib
    index.js - ETL Pipeline Details
    index_test.js - Pipeline Test
    utils.js - Async Utils
    get-data.js - example source/extract function
    put-stats.js - example target/load function
  index.js - main handler for scheduled events
  package.json - manifest file
  config.arc - architect aws lambda config
README.md
app.arc - architect aws app config
```

## About the ETL Job Code 

The ETL Job is broken out into three distinct functions: Extract, Transform and Load. In the `lib/index.js` file you can see each function defined with sample code.

### Extract Function

The Extract function is responsible for getting all of the data from the source or sources. This template includes `lib/get-data.js` as an example of how to get the data. The function takes an `object` and returns an `Async`, which is like a promise but lazy: the caller controls when the async call occurs. You can `map` and `chain` on the `Async` object. With `map`, the value you return is placed in the `Async`; with `chain`, you must return a new `Async` object.
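To make the laziness and the `map`/`chain` distinction concrete, here is a small self-contained sketch. The `Async` below is a tiny stand-in written for illustration, not the implementation the template ships in `lib/utils.js`, and `getData`, its URL, and its rows are hypothetical.

```javascript
// Minimal stand-in for a lazy Async type (illustration only).
// Nothing runs until fork() is called.
const Async = (run) => ({
  // map: transform the eventual value; the returned value is wrapped for you
  map: (fn) => Async((reject, resolve) => run(reject, (x) => resolve(fn(x)))),
  // chain: fn must itself return an Async, which is then run in sequence
  chain: (fn) => Async((reject, resolve) => run(reject, (x) => fn(x).fork(reject, resolve))),
  // fork: the caller decides when the work actually starts
  fork: (reject, resolve) => run(reject, resolve),
})
Async.of = (x) => Async((_reject, resolve) => resolve(x))

let calls = 0

// Hypothetical extract step: takes an object and returns an Async
const getData = ({ url }) => Async((_reject, resolve) => {
  calls += 1
  resolve([{ id: 1, source: url }, { id: 2, source: url }])
})

const pipeline = getData({ url: 'https://source.example.com/data' })
  .map((rows) => rows.map((row) => row.id)) // plain value goes back into the Async
  .chain((ids) => Async.of({ count: ids.length, ids })) // must return an Async

// Building the pipeline has not triggered the extract yet
const callsBeforeFork = calls

let result
pipeline.fork(
  (err) => console.error(err),
  (value) => { result = value }
)
```

In this sketch the extract only runs when `fork` is called, which is the property that lets the caller control when the async call occurs.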

### Transform Function

The transform function takes `data` and returns an `AsyncReader`, which wraps a value. The easiest way to work with the transform function is this pattern:

```
exports.transform = data => AsyncReader.of(
  ...do stuff..
)
```

Then you can map over the data and make any modifications you want.
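One way to keep the transform easy to test is to write the reshaping as a pure function and hand its result to `AsyncReader.of` in the pattern above. The sketch below shows only that pure part. The record fields (`first_name`, `signup_date`, and so on) and the function names are assumptions made up for illustration, not the shape of any real source.

```javascript
// Hypothetical source record shape: { id, first_name, last_name, signup_date }
const toWarehouseDoc = (record) => ({
  id: `customer-${record.id}`,
  type: 'customer',
  name: `${record.first_name} ${record.last_name}`.trim(),
  // normalize dates to ISO 8601 strings
  signupDate: new Date(record.signup_date).toISOString(),
})

// Pure function over the whole extracted data set; drops rows without an id
const transformRows = (rows) =>
  rows.filter((row) => row.id != null).map(toWarehouseDoc)

const docs = transformRows([
  { id: 7, first_name: 'Ada', last_name: 'Lovelace', signup_date: '2021-01-08T15:22:44Z' },
  { id: null, first_name: 'No', last_name: 'Id', signup_date: '2021-01-08T00:00:00Z' },
])
```

Because `transformRows` has no I/O, it can be unit tested on its own with plain arrays.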

### Load Function

The load function is very similar to the extract function: it receives data as its argument and puts each item into the data warehouse. If you are using `hyper63` as your data warehouse, you should generally not have to modify the load function. It should work as long as your data is ready to go and your target is properly set up.

## Testing

The easiest way to test in a development environment is with `fetch-mock`, which lets you simulate exactly what the API servers return so your job can react to it. This lets you focus on your code and your patterns.

This template has a test set up and ready to run in `lib/index_test.js`. The file sets up fetchMock for two endpoints, and the test routine should look very similar to the handler code that gets invoked. You can use this test script to verify that your code runs each step properly.
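The idea behind mocking two endpoints can be sketched without any dependencies. The code below is a hand-rolled stand-in for illustration, not the `fetch-mock` API and not the contents of `lib/index_test.js`. `makeFakeFetch`, `runJob`, and both URLs are hypothetical.

```javascript
// Hand-rolled fake for fetch; fetch-mock offers a richer version of this idea.
// routes maps "METHOD url" to the JSON body the fake server should return.
const makeFakeFetch = (routes) => async (url, options = {}) => {
  const key = `${options.method || 'GET'} ${url}`
  if (!(key in routes)) throw new Error(`Unexpected request: ${key}`)
  const body = routes[key]
  return { ok: true, status: 200, json: async () => body }
}

// Hypothetical job that receives fetch as an argument so tests can swap it out
const runJob = async (fetchFn) => {
  // extract
  const rows = await fetchFn('https://source.example.com/data').then((res) => res.json())
  // transform
  const stats = { count: rows.length }
  // load
  const res = await fetchFn('https://target.example.com/stats', {
    method: 'PUT',
    body: JSON.stringify(stats),
  })
  return { ok: res.ok, stats }
}

// One fake source endpoint and one fake target endpoint
const fakeFetch = makeFakeFetch({
  'GET https://source.example.com/data': [{ id: 1 }, { id: 2 }, { id: 3 }],
  'PUT https://target.example.com/stats': { ok: true },
})

const jobResult = runJob(fakeFetch)
```

Any request the fake does not know about throws, so a test fails loudly if the job starts calling an endpoint nobody expected.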

## Deployment

Once you have tested locally, you are ready to deploy. Make sure you have the right region and profile set.

```
export AWS_PROFILE=default
export AWS_REGION=us-east-1
```

Then you will want to make sure you are in the project root directory.

```
cat app.arc
```

> NOTE: If you do not see the `app.arc` file, you are not in the project root directory.

Then run

```
arc deploy --production
```

## Monitoring

Now that you are up and running, you may want to check on your job's progress or any errors that may be happening.

You can access the logs for your deployment:

```
arc logs production src
```

## Misc

You can make changes and deploy often; each deploy replaces the existing jobs.

## Destroy Job

```
arc destroy --production --name your-jobname
```

This command destroys the job in AWS and removes all traces.
