https://github.com/lapetitesouris/csvloader
Optimized CSV loader, which replaces a traditional ETL process to load huge CSV datasets into traditional databases
- Host: GitHub
- URL: https://github.com/lapetitesouris/csvloader
- Owner: LaPetiteSouris
- Created: 2020-07-19T08:14:02.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2020-07-22T12:49:37.000Z (about 5 years ago)
- Last Synced: 2025-02-09T00:41:22.287Z (8 months ago)
- Topics: dataengineering, etl-job, pattern, worker-pool
- Language: Go
- Homepage:
- Size: 6.84 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# What is this?
Just another CSV loader: it loads CSV files and dumps them into different databases.
The program uses a worker pub/sub pattern internally to speed up the data loading.
This can save a lot of time if your ETL process involves loading huge CSV files.
# How does it work?
1. Read the raw CSV input.
2. Distribute the workload across workers, each of which is an independent goroutine; this speeds up the data loading process (see the sketch after this list). For more [information](https://medium.com/life-of-a-senior-data-engineer/worker-pattern-in-golang-for-data-etl-ebf8a52da636).
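The sketch below illustrates the underlying fan-out idea: a fixed number of goroutines consume CSV rows from a shared channel. It is a minimal, self-contained illustration of the pattern, not the repository's actual code; `insertRow` and the sample rows are made up.
```golang
package main

import (
	"fmt"
	"sync"
)

// insertRow stands in for a real database write; illustrative only.
func insertRow(workerID int, row string) {
	fmt.Printf("worker %d inserting: %s\n", workerID, row)
}

func main() {
	rows := []string{"1,alice", "2,bob", "3,carol"}
	jobs := make(chan string)
	var wg sync.WaitGroup

	// Fan out: a fixed pool of goroutines consumes from one channel.
	const numWorkers = 3
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for row := range jobs {
				insertRow(id, row)
			}
		}(i)
	}

	// Producer: feed rows to the pool, then close the channel so the
	// workers' range loops terminate.
	for _, r := range rows {
		jobs <- r
	}
	close(jobs)
	wg.Wait()
}
```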
# Supported loading
As of now, only the Postgres interface is implemented, so you can load CSV data into a PostgreSQL database configured through the `POSTGRES_*` environment variables (see the run example below).
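As a rough sketch of how those environment variables might translate into a connection, assuming a standard `database/sql` setup (the `github.com/lib/pq` driver is an illustrative choice, not confirmed by the repository):
```golang
package main

import (
	"database/sql"
	"fmt"
	"log"
	"os"

	_ "github.com/lib/pq" // illustrative driver choice
)

// openFromEnv builds a Postgres connection string from the same
// POSTGRES_* variables used in the run example at the end of this README.
func openFromEnv() (*sql.DB, error) {
	dsn := fmt.Sprintf(
		"host=%s port=%s user=%s password=%s dbname=%s sslmode=disable",
		os.Getenv("POSTGRES_HOST"),
		os.Getenv("POSTGRES_PORT"),
		os.Getenv("POSTGRES_USER"),
		os.Getenv("POSTGRES_PASS"),
		os.Getenv("POSTGRES_DBNAME"),
	)
	return sql.Open("postgres", dsn)
}

func main() {
	db, err := openFromEnv()
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Ping verifies that the connection parameters actually work.
	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("connected")
}
```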
# Add more supported databases
Either create an issue or follow the guidelines below to implement it yourself.
# Guidelines
1. Create your own type of worker (refer to `workerpool/postgresworker.go`).
2. Your new worker must satisfy the interface:
```golang
// Worker is the workhorse: it processes a slice of records.
type Worker interface {
	ExecuteTask([]string, *sync.WaitGroup, ...interface{}) error
}
```
3. Using your own worker, initiate your own loader; refer to `loader.go`.
For example, you may create a `MongoDBWorker` struct, and your loader function might look like the following (a sketch of the worker itself appears after this example):
```golang
// LoadRecordToDatabase takes records and dumps them to the database.
func LoadRecordToDatabase(records []string, numberOfGoroutine int, args ...interface{}) error {
	var wg sync.WaitGroup

	// Initiate the worker pool, using the corresponding worker type.
	workerArray := make([]pool.Worker, 0)
	for i := 0; i < numberOfGoroutine; i++ {
		// Initiate one MongoDBWorker per goroutine.
		w := &pool.MongoDBWorker{ID: strconv.FormatInt(int64(i), 10)}
		workerArray = append(workerArray, w)
	}
	workerPool := &pool.WorkerPool{Wg: &wg, Pool: workerArray}
	workerPool.ExecuteJob(records, args...)
	wg.Wait()
	return nil
}
```
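To complete the picture, a hypothetical `MongoDBWorker` satisfying the `Worker` interface might look like the sketch below. This is an assumption layered on the interface above, not code from the repository: `insertToMongo` is a stand-in helper, and whether the worker or the pool calls `wg.Done()` is assumed here.
```golang
package pool

import "sync"

// MongoDBWorker is a hypothetical worker type, mirroring the shape of
// the Postgres worker in workerpool/postgresworker.go.
type MongoDBWorker struct {
	ID string
}

// insertToMongo is an assumed helper standing in for a real MongoDB
// write; illustrative only.
func insertToMongo(record string, args ...interface{}) error {
	_ = record
	_ = args
	return nil
}

// ExecuteTask satisfies the Worker interface: it processes one slice of
// records, then signals completion on the shared WaitGroup (assuming
// the pool expects each worker to call Done).
func (w *MongoDBWorker) ExecuteTask(records []string, wg *sync.WaitGroup, args ...interface{}) error {
	defer wg.Done()
	for _, r := range records {
		if err := insertToMongo(r, args...); err != nil {
			return err
		}
	}
	return nil
}
```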
### Build and Execution
Build with Docker:
```bash
docker build . -t csvloader
# run the image from the directory containing your CSV and mount it into the container
docker run -it --mount type=bind,src="$(pwd)",target=/csvloader csvloader /bin/bash
# inside your container
cd /csvloader
POSTGRES_HOST="localhost" POSTGRES_PORT=5432 POSTGRES_USER="postgres" POSTGRES_PASS="admin" POSTGRES_DBNAME="ronin" go run *.go -filePath=sample.csv -query="INSERT INTO samples VALUES (\$1, \$2) ON CONFLICT (id) DO UPDATE SET value = \$2 RETURNING id" -nbrgoroutines=5
```