{"id":17932755,"url":"https://github.com/lapetitesouris/csvloader","last_synced_at":"2025-04-03T11:17:24.438Z","repository":{"id":96858968,"uuid":"280821654","full_name":"LaPetiteSouris/csvloader","owner":"LaPetiteSouris","description":"Optimized CSV Loader, which replaces a traditional ETL process to load huge CSV dataset to traditional databases","archived":false,"fork":false,"pushed_at":"2020-07-22T12:49:37.000Z","size":7,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-09T00:41:22.287Z","etag":null,"topics":["dataengineering","etl-job","pattern","worker-pool"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LaPetiteSouris.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-07-19T08:14:02.000Z","updated_at":"2021-05-24T20:51:57.000Z","dependencies_parsed_at":null,"dependency_job_id":"5bef823c-6984-4c01-bfa2-16315e9fcc0f","html_url":"https://github.com/LaPetiteSouris/csvloader","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LaPetiteSouris%2Fcsvloader","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LaPetiteSouris%2Fcsvloader/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LaPetiteSouris%2Fcsvloader/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LaPetiteSouris%2Fcsvloader/manifest
s","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LaPetiteSouris","download_url":"https://codeload.github.com/LaPetiteSouris/csvloader/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246989752,"owners_count":20865331,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataengineering","etl-job","pattern","worker-pool"],"created_at":"2024-10-28T21:30:17.192Z","updated_at":"2025-04-03T11:17:24.423Z","avatar_url":"https://github.com/LaPetiteSouris.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# What is this?\n\nJust another CSV loader, which loads CSV files and dumps them into different databases.\n\nThe program uses a worker pub/sub pattern internally to speed up data loading.\n\nThis may save a lot of time if your ETL process involves loading huge CSV files.\n\n# How does it work?\n\n1. Read the raw CSV input.\n2. Distribute the workload across several workers. Each worker is an independent goroutine, which helps speed up the data loading process.\n\nFor more information, see [this article](https://medium.com/life-of-a-senior-data-engineer/worker-pattern-in-golang-for-data-etl-ebf8a52da636).\n\n# Supported loading\n\nAs of now, only the Postgres interface is implemented, so you can load CSV only into the Postgres instance at `POSTGRES_HOST`.\n\n# Add more supported databases\nEither create an issue or follow the guidelines below to implement it yourself.\n\n# Guidelines\n1. Create your own type of worker (refer to `workerpool\\postgresworker.go`).\n2. Your new worker must satisfy the interface:\n\n```golang\n\n// Worker is the work horse\ntype Worker interface {\n\tExecuteTask([]string, *sync.WaitGroup, ...interface{}) error\n}\n\n```\n\n3. Using your own worker, write your own loader (refer to `loader.go`).\nFor example, you may create a `MongoDBWorker` struct; your loader function may then look like this:\n\n```golang\n\n// LoadRecordToDatabase takes records and dumps them to the database\nfunc LoadRecordToDatabase(records []string, numberOfGoroutine int, args ...interface{}) error {\n\tvar wg sync.WaitGroup\n\n\t// Initiate worker pool\n\t// Use the corresponding worker type\n\tworkerArray := make([]pool.Worker, 0)\n\tfor i := 0; i \u003c numberOfGoroutine; i++ {\n\t\t// Initiate MongoDBWorker\n\t\tw := \u0026pool.MongoDBWorker{ID: strconv.FormatInt(int64(i), 10)}\n\t\tworkerArray = append(workerArray, w)\n\t}\n\tworkerPool := \u0026pool.WorkerPool{Wg: \u0026wg, Pool: workerArray}\n\tworkerPool.ExecuteJob(records, args...)\n\twg.Wait()\n\treturn nil\n}\n\n```\n# Build and Execution\n\nBuild with Docker\n\n```bash\ndocker build . -t csvloader\n# run the image from the directory that contains the CSV file and mount that directory into the container\ndocker run -it --mount type=bind,src=\"$(pwd)\",target=/csvloader csvloader /bin/bash\n# inside your container\ncd /csvloader\n\nPOSTGRES_HOST=\"localhost\" POSTGRES_PORT=5432 POSTGRES_USER=\"postgres\" POSTGRES_PASS=\"admin\" POSTGRES_DBNAME=\"ronin\" go run *.go -filePath=sample.csv -query=\"INSERT INTO samples VALUES (\\$1, \\$2) ON CONFLICT (id) DO UPDATE SET value = \\$2 RETURNING id\" -nbrgoroutines=5\n\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flapetitesouris%2Fcsvloader","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flapetitesouris%2Fcsvloader","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flapetitesouris%2Fcsvloader/lists"}