{"id":19055758,"url":"https://github.com/taskrabbit/empujar","last_synced_at":"2025-04-13T10:58:19.069Z","repository":{"id":55863721,"uuid":"41481729","full_name":"taskrabbit/empujar","owner":"taskrabbit","description":"When you need to push data around, you push it. A node.js ETL tool.","archived":false,"fork":false,"pushed_at":"2020-12-10T20:12:20.000Z","size":755,"stargazers_count":142,"open_issues_count":9,"forks_count":15,"subscribers_count":37,"default_branch":"master","last_synced_at":"2025-03-27T02:11:50.157Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/taskrabbit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-08-27T11:00:38.000Z","updated_at":"2025-01-13T01:51:49.000Z","dependencies_parsed_at":"2022-08-15T08:00:57.577Z","dependency_job_id":null,"html_url":"https://github.com/taskrabbit/empujar","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taskrabbit%2Fempujar","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taskrabbit%2Fempujar/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taskrabbit%2Fempujar/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taskrabbit%2Fempujar/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/taskrabbit","download_url":"https://codeload.github.com/taskrabbit/empujar/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","ki
nd":"github","repositories_count":248703196,"owners_count":21148117,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-08T23:46:53.674Z","updated_at":"2025-04-13T10:58:19.036Z","avatar_url":"https://github.com/taskrabbit.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Empujar. Empujarlo Bueno.\nWhen you need to push data around, you push it. Push it real good.  \nAn ETL and Operations tool.\n\n[![Build Status](https://travis-ci.org/taskrabbit/empujar.svg?branch=master)](https://travis-ci.org/taskrabbit/empujar)\n\n![https://raw.githubusercontent.com/taskrabbit/empujar/master/empujar.png](https://raw.githubusercontent.com/taskrabbit/empujar/master/empujar.png)\n\n## What\n\nEmpujar is a tool which moves stuff around.  It's built in node.js so you can do lots of stuff async-ly.  You can move data around (an ETL tool), files (a backup tool), and more!  \n\nEmpujar's top-level object is a \"book\", which contains \"chapters\" and then \"pages\".  
Chapters are executed one at a time, in order, and the pages within a chapter can run in parallel (up to a *threading* limit you specify).\n\nSee an [example project here](https://github.com/taskrabbit/empujar/tree/master/books/etl).\n\nFor example, a chapter to extract all data from a MySQL database would be:\n\n```javascript\nvar dateformat = require('dateformat');\n\nexports.chapterLoader = function(book){\n\n  // define\n  var chapter = book.addChapter(1, 'EXTRACT \u0026 LOAD', {threads: 5});\n\n  // helpers\n  var source       = book.connections.source.connection;\n  var destination  = book.connections.destination.connection;\n  var queryLimit   = 1000;\n  var tableMaxes   = {};\n\n  var extractTable = function(table, callback){\n    destination.getMax(table, 'updatedAt', function(error, max){\n      if(error){ return callback(error); }\n\n      var query = 'SELECT * FROM `' + table + '` ';\n      if(max){\n        query += ' WHERE `updatedAt` \u003e= \"' + dateformat(max, 'yyyy-mm-dd HH:MM:ss') + '\"';\n      }\n\n      source.getAll(query, queryLimit, function(error, rows, done){\n        destination.insertData(table, rows, function(error){\n          if(error){ return callback(error); }\n          done();\n        });\n      }, callback);\n    });\n  };\n\n  chapter.addLoader('determine extract queries', function(done){\n    source.tables.forEach(function(table){\n      chapter.addPage('extract table: ' + table, function(next){\n        extractTable(table, next);\n      });\n    });\n    done();\n  });\n\n};\n\n```\n\nEmpujar runs operations in series or parallel.  
These are defined by `books`, `chapters`, and `pages`.\n\n```javascript\n#!/usr/bin/env node\n\nprocess.chdir(__dirname);\n\nvar Empujar    = require('empujar');\nvar optimist   = require('optimist');\nvar options    = optimist.argv; // get command line opts, like `--logLevel debug` or `--chapters 100`\n\nvar book = new Empujar.book(options);\n\n// you can define custom error behavior when a page callback returns an error\nvar errorHandler = function(error, context){\n  console.log(\"OH NO! (but I handled the error) | \" + error);\n  setTimeout(process.exit, 5000);\n};\n\nbook.on('error', errorHandler);\n\nbook.connect(function(){\n\n  // the logger will output to the console and a log file\n  book.logger.log('I am a debug message', 'debug'); // log levels can be set on log lines, and toggled with the `--logLevel` flag\n\n  // define `book.data.stuff` to make it available to all phases of the book\n  book.data.stuff = 'something cool';\n\n  var chapter1 = book.addChapter(1, 'Do the first thing in parallel', {threads: 10});\n  var chapter2 = book.addChapter(2, 'Do that next thing in serial', {threads: 1});\n\n  // chapter 1\n  var i = 0;\n  while(i \u003c 100){\n    chapter1.addPage('sleepy thing: ' + i, function(next){\n      setTimeout(next, 100);\n    });\n    i++;\n  }\n\n  // chapter 2\n\n  // chapters can also have pre-loaders which run before all pages\n  chapter2.addLoader('do something before', function(next){\n    book.logger.log('I am the preloader');\n    next();\n  });\n\n  chapter2.addPage('the final step', function(next){\n    next();\n    // next(new Error('oh no!')); // if you end a page with an error, the errorHandler will be invoked, and the book stopped\n  });\n\n  // chapters can also be loaded from /chapters/name/chapter.js in the project\n  // book.loadChapters();\n\n  // you can also configure an optional logger (perhaps to a DB) for empujar's internal status\n  // book.on('state', function(data){\n  //   database.insertData('empujar', 
[data]);\n  // });\n\n  book.run(function(){\n    setTimeout(process.exit, 5000);\n  });\n});\n```\n\nThere is also a more formal example you can explore within this project.  Check out /books/etl to learn more.\n\nEmpujar connects to the connections you define in `book/config/connections/NAME.js`, and there should be a matching transport in `/lib/connections/TYPE.js`.\n\nWhen `book.run()` is complete, you probably want to `process.exit()`, or shut down more gracefully.\n\nYou can subscribe to `book.on('error')` and `book.on('state')` events.  A cool thing to do would be to record these state events in your data warehouse, if you are using empujar as an ETL tool:\n\n```javascript\nbook.on('state', function(data){  datawarehouse.insertData('empujar', [data]);  });\n```\n\n## Project Layout\n\nCreate your project so that it looks like this:\n```\n| -\\books\n| ---\\myBook\n| -----\\book.js\n| -----\\pids\\\n| -----\\logs\\\n| -----\\config\\\n| -----\\config\\connections\\\n| -----\\config\\connections\\myDatabase.js\n| -----\\chapters\\\n| -----\\chapters\\chapter1.js\n| -----\\chapters\\chapter2.js\n```\n\n## Launch Flags\n\nThe defaults for all launch flags are:\n\n```javascript\n{\n  chapterFiles: path.normalize( process.cwd() + '/chapters/**/*.js' ),\n  configPath:   path.normalize( process.cwd() + '/config' ),\n  logPath:      path.normalize( process.cwd() + '/log' ),\n  pidsPath:     path.normalize( process.cwd() + '/pids' ),\n  logFile:      'empujar.log',\n  tmpPath:      path.normalize( process.cwd() + '/tmp' ),\n  logStdout:    true,\n  logLevel:     'info',\n  chapters:     [],\n  getAllLimit:  Infinity,\n}\n```\n\n**Examples:**\n\n1. Run your book: `node yourBook.js`\n2. Run your book in verbose mode: `node yourBook.js --logLevel debug`\n3. Run only certain chapters in your book: `node yourBook.js --chapters 1,4` or a range: `node yourBook.js --chapters 100-300`\n4. 
Extract only a small subset of your data (great for testing): `node yourBook.js --getAllLimit 1000`\n  - This would make all invocations of `connection.getAll()` exit successfully after retrieving 1000 rows.\n\n## Connections\n\nWhile you can create your own connections, Empujar ships with the tools to work with a number of the most common ones:\n\n- [MySQL](#mysql)\n- [Amazon Redshift](#amazon-redshift)\n- [Elasticsearch](#elasticsearch)\n- [FTP](#ftp)\n- [S3](#s3)\n\n### MySQL\n\n```javascript\nvar connection = book.connections.mysql.connection;\n\nconnection.connect = function(callback)\n// Connection method; handled by book.connect();\n// callback is passed (error)\n\nconnection.showTables = function(callback)\n// list tables\n// callback is returned error, array of table names\n\nconnection.showColumns = function(table, callback)\n// list the columns + metadata for each column\n// callback is returned error, hash of columns + metadata\n\nconnection.query = function(query, data, callback)\n// query the table\n// data can be optional; used to fill in missing attributes/interpolate (?)\n// callback is returned error, rows (array of hashes col-value)\n\nconnection.getAll = function(queryBase, chunkSize, dataCallback, doneCallback)\n// fetch data from the cluster; normalized as an array of hashes.  
Data is already typecast.\n// queryBase -\u003e the base MySQL query (Limit and offset will be appended automatically)\n// chunkSize -\u003e number of results to return (IE: limit)\n// dataCallback -\u003e callback called with each collection of data\n//   -\u003e (error, data, next)\n//   -\u003e data is normalized\n//   -\u003e next() must be called to continue\n// doneCallback is passed (error, rowsFound)\n\nconnection.getMax = function(table, column, callback)\n// list the maximum value for a column in a table\n// callback is returned error, maximum value from the table or null\n\nconnection.queryStream = function(query, callback)\n// get a stream that returns results of a query\n// events listed here: https://github.com/felixge/node-mysql#streaming-query-rows\n// callback is returned error, stream\n\nconnection.insertData = function(table, data, callback, mergeOnDuplicates)\n// add data to a table; create the index if needed.  Data should be normalized (IE results from #getAll)\n// callback is passed (error)\n\nconnection.addColumn = function(table, column, rowData, callback)\n// add a column to a table.\n// RowData is an array of data to insert into the column which can be used to determine the column data type\n// callback is returned error\n\nconnection.alterColumn = function(table, column, definition, callback)\n// change the datatype of a column\n// definition is a MySQL statement\n// callback is returned error\n\nconnection.mergeTables = function(sourceTable, destinationTable, callback)\n// merge the data from sourceTable into destinationTable\n// destinationTable will be created if it doesn't exist\n// destinationTable will be erased and recreated from sourceTable if there is no primary key present\n// callback is returned error\n\nconnection.copyTableSchema = function(sourceTable, destinationTable, callback)\n// create a new table (destinationTable) with the same schema as (sourceTable)\n// callback is returned error\n\nconnection.dump = function(file, 
options, callback)\n// mysqlDump the DB to file\n// options:\n/*\n  if(!options.binary){   options.binary = 'mysqldump';             }\n  if(!options.database){ options.database = self.options.database; }\n  if(!options.password){ options.password = self.options.password; }\n  if(!options.host){     options.host = self.options.host;         }\n  if(!options.port){     options.port = self.options.port;         }\n  if(!options.user){     options.user = self.options.user;         }\n  if(!options.tables){   options.tables = [];                      }\n  if(!options.gzip){     options.gzip = false;                     }\n*/\n// callback is returned error\n\n```\n\n### Elasticsearch\n\n```javascript\nvar connection = book.connections.elasticsearch.connection;\n\nconnection.connect = function(callback)\n// Connection method; handled by book.connect();\n// callback is passed (error)\n\nconnection.showIndices = function(callback)\n// list the indices in the cluster\n// callback is passed (error, indices)\n//  -\u003e `indices` is a hash with index names and metadata\n\nconnection.insertData = function(index, data, callback)\n// add data to an index; create the index if needed.  Data should be normalized (IE results from #getAll)\n// callback is passed (error)\n\nconnection.getAll = function(index, query, fields, chunkSize, dataCallback, doneCallback)\n// fetch data from the cluster; normalized as an array of hashes.  
Data is already typecast.\n// index -\u003e string name of index\n// query -\u003e the elasticsearch query (as a hash)\n// fields -\u003e array of fields you want returned; '*' can be passed as an argument to request all fields\n// chunkSize -\u003e number of results to return (from each server)\n// dataCallback -\u003e callback called with each collection of data\n//   -\u003e (error, data, next)\n//   -\u003e data is normalized\n//   -\u003e next() must be called to continue\n// doneCallback is passed (error, rowsFound)\n\n```\n\n### S3\n``` javascript\nvar connection = book.connections.s3.connection;\n\nconnection.connect = function(callback)\n// Connection method; handled by book.connect();\n// callback is passed (error)\n\nconnection.listFolders = function(prefix, callback)\n// list all folders in this S3 bucket (starting with `prefix`)\n// prefix can be `*` or `''` to get all folders in the bucket\n// callback is passed (error, arrayOfFolderNames)\n\nconnection.listObjects = function(prefix, callback)\n// list all objects in this S3 bucket (starting with `prefix`)\n// prefix can be `*` or `''` to get all objects in the bucket\n// callback is passed (error, arrayOfObjectNames)\n\nconnection.deleteFolder = function(prefix, callback)\n// delete the folder starting with `prefix`, and all objects contained within\n// like `rm -rf prefix`\n// prefix can be `*` or `''` to delete all folders and files in the bucket\n// callback is passed (error)\n\nconnection.objectExists = function(filename, callback)\n// check if a file exists in this bucket\n// callback is passed (error, exists) where exists is a boolean\n\nconnection.delete = function(filename, callback)\n// delete a file from this bucket\n// callback is passed (error)\n\nconnection.streamingUpload = function(inputStream, filename, callback)\n// upload a file* to S3 with the filename `filename`\n// the file you are uploading should be a readableStream created with fs.createReadStream\n// callback is passed 
(error)\n```\n\n### FTP\n\n``` javascript\nvar connection = book.connections.ftp.connection;\n\nconnection.connect = function(callback)\n// Connection method; handled by book.connect();\n// callback is passed (error)\n\nconnection.get = function(file, callback)\n// download a file from the FTP server\n// callback is passed (error, stream)\n//  -\u003e `stream` which you can pipe to a file on disk or S3, etc\n\nconnection.listFiles = function(dir, callback)\n// list files from a remote directory\n// callback is passed (error, files)\n//  -\u003e `files` is an array of remote file names\n\n```\n\n### Amazon Redshift\n\n``` javascript\nvar connection = book.connections.redshift.connection;\n\nconnection.connect = function(callback)\n// Connection method; handled by book.connect();\n// callback is passed (error)\n\nconnection.showTables = function(callback)\n// list tables\n// callback is returned error, array of table names\n\nconnection.showColumns = function(table, callback)\n// list the columns + metadata for each column\n// callback is returned error, hash of columns + metadata\n\nconnection.query = function(query, callback)\n// query the table\n// callback is returned error, rows (array of hashes col-value)\n\nconnection.getAll = function(queryBase, chunkSize, dataCallback, doneCallback)\n// fetch data from the cluster; normalized as an array of hashes.  Data is already typecast.\n// queryBase -\u003e the base MySQL query (Limit and offset will be appended automatically)\n// chunkSize -\u003e number of results to return (IE: limit)\n// dataCallback -\u003e callback called with each collection of data\n//   -\u003e (error, data, next)\n//   -\u003e data is normalized\n//   -\u003e next() must be called to continue\n// doneCallback is passed (error, rowsFound)\n\nconnection.insertData = function(table, data, callback)\n// add data to a table; create the index if needed.  
Data should be normalized (IE results from #getAll)\n// callback is passed (error)\n\nconnection.mergeTables = function(sourceTable, destinationTable, callback)\n// merge the data from sourceTable into destinationTable\n// destinationTable will be created if it doesn't exist\n// destinationTable will be erased and recreated from sourceTable if there is no primary key present\n// callback is returned error\n\nconnection.addColumn = function(table, column, rowData, callback)\n// add a column to a table.\n// RowData is an array of data to insert into the column which can be used to determine the column data type\n// callback is returned error\n\nconnection.alterColumn = function(table, column, definition, callback)\n// change the datatype of a column\n// definition is a MySQL statement\n// callback is returned error\n\nconnection.copyTableSchema = function(sourceTable, destinationTable, callback)\n// create a new table (destinationTable) with the same schema as (sourceTable)\n// callback is returned error\n\nconnection.getMax = function(table, column, callback)\n// list the maximum value for a column in a table\n// callback is returned error, maximum value from the table or null\n\n```\n\n## Creating your own connections.\n\nIt's easy to add your own connections to empujar.  All you need is a `/connections` folder in your project, and to follow some conventions.  The basic building block of a connection looks like this:\n\n```javascript\nvar connection = function(name, type, options, book){\n  this.name       = name;\n  this.type       = type;\n  this.options    = options;\n  this.book       = book;\n  this.connection = null;\n};\n\nconnection.prototype.connect = function(callback){\n  var self = this;\n  // connection logic\n  callback();\n};\n\n/// Your Methods...\n\nexports.connection = connection;\n```\n... 
and then extend your connection model with more prototypes.\n\nFor example, here's a connection, `delighted.js`, which TaskRabbit uses to import NPS survey data from our partner [Delighted](https://delighted.com/).  We extend their library to match the `getAll` method of the built-in connections above.\n\n```javascript\nvar dateformat = require('dateformat');\nvar Delighted  = require('delighted');\n\nvar connection = function(name, type, options, book){\n  this.name       = name;\n  this.type       = type;\n  this.options    = options;\n  this.book       = book;\n  this.connection = null;\n};\n\nconnection.prototype.connect = function(callback){\n  var self = this;\n  self.connection = Delighted(self.options.apiKey);\n  callback();\n};\n\nconnection.prototype.getAll = function(since, dataCallback, doneCallback, page, rowsFound){\n  var self = this;\n  var data = [];\n  if(page === undefined || page === null){ page = 1; }\n  if(!rowsFound){ rowsFound = 0; }\n\n  var options = {\n    per_page : 100,\n    since    : since, // in unix timestamps (not JS timestamps)\n    page     : page,\n    expand   : 'person',\n  };\n\n  self.connection.surveyResponse.all(options).then(function(responses) {\n\n    if(responses.length === 0){\n      doneCallback(null, rowsFound);\n    }else{\n      rowsFound = rowsFound + responses.length;\n\n      responses.forEach(function(resp){\n        data.push({\n          id:            parseInt(resp.id),\n          person:        parseInt(resp.person.id),\n          score:         parseInt(resp.score),\n          comment:       resp.comment,\n          permalink:     resp.permalink,\n          created_at:    dateformat(resp.created_at * 1000, 'yyyy-mm-dd HH:MM:ss'),\n          updated_at:    dateformat(resp.updated_at * 1000, 'yyyy-mm-dd HH:MM:ss'),\n          customer_type: resp.customer_type,\n          email:         resp.person.email,\n          name:          resp.person.name,\n        });\n      });\n\n      dataCallback(null, data, 
function(){\n        if(self.book.options.getAllLimit \u003e rowsFound){\n          self.getAll(since, dataCallback, doneCallback, (page + 1), rowsFound);\n        }else{\n          doneCallback(null, rowsFound);\n        }\n      });\n\n    }\n  });\n};\n\nexports.connection = connection;\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftaskrabbit%2Fempujar","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftaskrabbit%2Fempujar","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftaskrabbit%2Fempujar/lists"}