{"id":13665711,"url":"https://github.com/taskrabbit/forklift","last_synced_at":"2025-04-05T23:10:40.517Z","repository":{"id":7923810,"uuid":"9310726","full_name":"taskrabbit/forklift","owner":"taskrabbit","description":"Forklift: Moving big databases around. A ruby ETL tool.","archived":false,"fork":false,"pushed_at":"2022-07-21T22:33:17.000Z","size":1474,"stargazers_count":137,"open_issues_count":8,"forks_count":12,"subscribers_count":44,"default_branch":"master","last_synced_at":"2025-03-29T22:08:15.373Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/taskrabbit.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-04-09T01:34:49.000Z","updated_at":"2024-10-12T06:14:54.000Z","dependencies_parsed_at":"2022-08-08T08:15:28.534Z","dependency_job_id":null,"html_url":"https://github.com/taskrabbit/forklift","commit_stats":null,"previous_names":[],"tags_count":55,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taskrabbit%2Fforklift","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taskrabbit%2Fforklift/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taskrabbit%2Fforklift/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/taskrabbit%2Fforklift/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/taskrabbit","download_url":"https://codeload.github.com/taskrabbit/forklift/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https:/
/github.com","kind":"github","repositories_count":247411235,"owners_count":20934653,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T06:00:48.176Z","updated_at":"2025-04-05T23:10:40.475Z","avatar_url":"https://github.com/taskrabbit.png","language":"Ruby","readme":"# Forklift ETL\n\nMoving heavy databases around. [![Gem Version](https://badge.fury.io/rb/forklift_etl.svg)](http://badge.fury.io/rb/forklift_etl)\n[![Build Status](https://secure.travis-ci.org/taskrabbit/forklift.png?branch=master)](http://travis-ci.org/taskrabbit/forklift)\n\n![picture](forklift_small.jpg)\n\n## What?\n\n[Forklift](https://github.com/taskrabbit/forklift) is a Ruby gem that makes it easy for you to move your data around.  Forklift can be an integral part of your data warehouse pipeline or serve as a backup tool.  Forklift can collect and collapse data from multiple sources or across a single source.  
Forklift's first version was a MySQL-only tool, but now you can create transports to handle the data stores of your choice.\n\n## Set up\n\nMake a new directory with a `Gemfile` like this:\n```ruby\nsource 'http://rubygems.org'\ngem 'forklift_etl'\n```\n\nThen run `bundle`.\n\nUse the generator by running `(bundle exec) forklift --generate`\n\nMake your `plan.rb` using the examples below.\n\nRun your plan with `forklift plan.rb`.\nYou can run specific parts of your plan like `forklift plan.rb step1 step5`.\n\n### Directory structure\nForklift expects your project to be arranged like:\n\n```bash\n├── config/\n|   ├── email.yml\n├── connections/\n|   ├── mysql/\n|       ├── (DB).yml\n|   ├── elasticsearch/\n|       ├── (DB).yml\n|   ├── csv/\n|       ├── (file).yml\n├── log/\n├── pid/\n├── template/\n├── patterns/\n├── transformations/\n├── Gemfile\n├── Gemfile.lock\n├── plan.rb\n```\n\nTo enable a forklift connection, all you need to do is place its yml config file within `/config/connections/(type)/(name).yml`.\nFiles you place within `/patterns/` or `connections/(type)/` will be loaded automatically.\n\n## Examples \n\n### Example Project\n\nVisit the [`/example`](https://github.com/taskrabbit/forklift/tree/master/example) directory to see a whole forklift project.\n\n### Simple extract and load (no transformations)\n\nIf you have multiple databases and want to consolidate into one, this plan\nshould suffice.\n\n```ruby\nplan = Forklift::Plan.new\n\nplan.do! 
do\n  # ==\u003e Connections\n  service1 = plan.connections[:mysql][:service1]\n  service2 = plan.connections[:mysql][:service2]\n  analytics_working = plan.connections[:mysql][:analytics_working]\n  analytics = plan.connections[:mysql][:analytics]\n\n  # ==\u003e Extract\n  # Load data from your services into your working database\n  # If you want every table: service1.tables.each do |table|\n  # Data will be extracted in 1000-row batches\n  %w(users organizations).each do |table|\n    service1.read(\"select * from `#{table}`\") { |data| analytics_working.write(data, table) }\n  end\n\n  %w(orders line_items).each do |table|\n    service2.read(\"select * from `#{table}`\") { |data| analytics_working.write(data, table) }\n  end\n\n  # ==\u003e Load\n  # Load data from the working database to the final database\n  analytics_working.tables.each do |table|\n    # This will attempt an incremental pipe, and will fall back to a full table copy\n    # By default, incremental updates key off the `updated_at` column, but you can modify this by setting `matcher` in the options\n    # If you want a full pipe instead of incremental, use `pipe` instead of `optimistic_pipe`\n    # The pipe pattern works within the same database.  To copy across databases, try the `mysql_optimistic_import` method\n    # This example shows the options with their default values.\n    Forklift::Patterns::Mysql.optimistic_pipe(analytics_working.current_database, table, analytics.current_database, table, matcher: 'updated_at', primary_key: 'id')\n  end\nend\n```\n\n### Simple MySQL ETL\n```ruby\nplan = Forklift::Plan.new\nplan.do! 
do\n  # Do some SQL transformations\n  # SQL transformations are run exactly as they are written\n  destination = plan.connections[:mysql][:destination]\n  destination.exec!(\"./transformations/combined_name.sql\")\n\n  # Do some Ruby transformations\n  # Ruby transformations expect `do!(connection, forklift)` to be defined\n  destination = plan.connections[:mysql][:destination]\n  destination.exec!(\"./transformations/email_suffix.rb\")\n\n  # MySQL-dump the destination\n  destination = plan.connections[:mysql][:destination]\n  destination.dump('/tmp/destination.sql.gz')\nend\n```\n\n### Elasticsearch to MySQL\n```ruby\nplan = Forklift::Plan.new\nplan.do! do\n  source = plan.connections[:elasticsearch][:source]\n  destination = plan.connections[:mysql][:destination]\n  table = 'es_import'\n  index = 'aaa'\n  query = { query: { match_all: {} } } # pagination will happen automatically\n  destination.truncate!(table) if destination.tables.include? table\n  source.read(index, query) {|data| destination.write(data, table) }\nend\n```\n\n### MySQL to Elasticsearch\n```ruby\nplan = Forklift::Plan.new\nplan.do! do\n  source = plan.connections[:mysql][:source]\n  destination = plan.connections[:elasticsearch][:destination]\n  table = 'users'\n  index = 'users'\n  query = \"select * from users\" # pagination will happen automatically\n  source.read(query) {|data| destination.write(data, table, true, 'user') }\nend\n```\n\n## Forklift Emails\n\n#### Setup\nPut this at the end of your plan inside the `do!` block.\n\n```ruby\n# ==\u003e Email\n# Let your team know the outcome. Attaches the log.\nemail_args = {\n  to: \"team@yourcompany.com\",\n  from: \"Forklift\",\n  subject: \"Forklift has moved your database @ #{Time.new}\",\n  body: \"So much data!\"\n}\nplan.mailer.send(email_args, plan.logger.messages)\n```\n\n#### ERB templates\nYou can get fancy by using an ERB template for your email and SQL variables:\n\n```ruby\n# ==\u003e Email\n# Let your team know the outcome. 
Attaches the log.\nemail_args = {\n  to: \"team@yourcompany.com\",\n  from: \"Forklift\",\n  subject: \"Forklift has moved your database @ #{Time.new}\"\n}\nemail_variables = {\n  total_users_count: service1.read('select count(1) as \"count\" from users')[0][:count]\n}\nemail_template = \"./template/email.erb\"\nplan.mailer.send_template(email_args, email_template, email_variables, plan.logger.messages)\n```\n\nThen in `template/email.erb`:\n\n```erb\n\u003ch1\u003eYour forklift email\u003c/h1\u003e\n\n\u003cul\u003e\n  \u003cli\u003e\u003cstrong\u003eTotal Users\u003c/strong\u003e: \u003c%= @total_users_count %\u003e\u003c/li\u003e\n\u003c/ul\u003e\n```\n\n#### Config\nWhen you run `forklift --generate`, we create `config/email.yml` for you:\n\n```yml\n# Configuration is passed to Pony (https://github.com/benprew/pony)\n\n# ==\u003e SMTP\n# If testing locally, mailcatcher (https://github.com/sj26/mailcatcher) is a helpful gem\nvia: smtp\nvia_options:\n  address: localhost\n  port: 1025\n  # user_name: user\n  # password: password\n  # authentication: :plain # :plain, :login, :cram_md5, no auth by default\n  # domain: \"localhost.localdomain\" # the HELO domain provided by the client to the server\n\n# ==\u003e Sendmail\n# via: sendmail\n# via_options:\n#   location: /usr/sbin/sendmail\n#   arguments: '-t -i'\n```\n\n## Workflow\n\n```ruby\n# do! is a wrapper around common setup methods (pidfile locking, setting up the logger, etc)\n# you don't need to use do! 
if you want finer control\ndef do!\n  # you can use `plan.logger.log` in your plan for logging\n  self.logger.log \"Starting forklift\"\n\n  # use a pidfile to ensure that only one instance of forklift runs at a time; store the file if OK\n  self.pid.safe_to_run?\n  self.pid.store!\n\n  # this will load all connections in /config/connections/#{type}/#{name}.yml into the plan.connections hash\n  # and build all the connection objects (and try to connect in some cases)\n  self.connect!\n\n  yield # your stuff here!\n\n  # remove the pidfile\n  self.logger.log \"Completed forklift\"\n  self.pid.delete!\nend\n\n```\n\n### Steps\n\nYou can optionally divide up your forklift plan into steps:\n\n```ruby\nplan = Forklift::Plan.new\nplan.do! do\n\n  plan.step('Mysql Import'){\n    source = plan.connections[:mysql][:source]\n    destination = plan.connections[:mysql][:destination]\n    source.tables.each do |table|\n      Forklift::Patterns::Mysql.optimistic_pipe(source, table, destination, table)\n    end\n  }\n\n  plan.step('Elasticsearch Import'){\n    source = plan.connections[:elasticsearch][:source]\n    destination = plan.connections[:mysql][:destination]\n    table = 'es_import'\n    index = 'aaa'\n    query = { query: { match_all: {} } } # pagination will happen automatically\n    destination.truncate!(table) if destination.tables.include? table\n    source.read(index, query) {|data| destination.write(data, table) }\n  }\n\nend\n```\n\nWhen you use steps, you can run your whole plan, or just part of it, with command line arguments.  For example, `forklift plan.rb \"Elasticsearch Import\"` would run just that single portion of the plan.  Note that any parts of your plan not within a step will be run each time. \n\n### Error Handling\n\nBy default, exceptions within your plan will raise and crash your application.  However, you can pass your step an optional `error_handler` lambda that defines how to handle the error.  
The `error_handler` will be passed (`step_name`, `exception`).  If you don't re-raise within your error handler, your plan will continue to execute.  For example:\n\n```ruby\n\nerror_handler = lambda { |name, exception|\n  if exception.class.to_s =~ /connection/i\n    # I can't connect, I should halt\n    raise exception\n  elsif exception.class.to_s =~ /SoftError/\n    # this type of error is OK\n  else\n    raise exception\n  end\n}\n\nplan.step('a_complex_step', error_handler){\n  # ...\n}\n\n```\n\n## Transports\n\nTransports are how you interact with your data.  Every transport defines `read` and `write` methods which handle arrays of data objects (plus whatever helper methods it requires).  \n\nEach transport should have a config file in `./config/connections/#{transport}/`. It will be loaded at boot.\n\nTransports can optionally define helper methods that shortcut copying data *within* a transport, like the mysql `pipe` methods (i.e. `insert into #{to_db}.#{to_table} select * from #{from_db}.#{from_table}`). A transport may also define other helpers (like how to create a MySQL dump).  
These should be defined in `/patterns/#{type}.rb` within the `Forklift::Patterns::#{type}` namespace.\n\n### Creating your own transport\n\nIn the `/connections` directory in your project, create a file that defines at least the following:\n\n```ruby\nmodule Forklift\n  module Connection\n    class Mixpanel \u003c Forklift::Base::Connection\n\n      def initialize(config, forklift)\n        @config = config\n        @forklift = forklift\n      end\n\n      def config\n        @config\n      end\n\n      def forklift\n        @forklift\n      end\n\n      def read(index, query, args)\n        # ...\n        data = [] # data is an array of hashes\n        # ...\n        if block_given?\n          yield data\n        else\n          return data\n        end\n      end\n\n      def write(data, table)\n        # data is an array of hashes\n        # \"table\" can be any argument(s) you need to know where/how to write\n        # ...\n      end\n\n      def pipe(from_table, from_db, to_table, to_db)\n        # ...\n      end\n\n      private\n\n      #/private\n\n    end\n  end\nend\n```\n\nExisting transports, and the patterns for them, are documented [here](http://www.rubydoc.info/gems/forklift_etl).\n\n### MySQL\n\n- [Transport](http://www.rubydoc.info/gems/forklift_etl/Forklift/Connection/Mysql)\n- [Patterns](http://www.rubydoc.info/gems/forklift_etl/Forklift/Patterns/Mysql)\n\n### Elasticsearch\n\n- [Transport](http://www.rubydoc.info/gems/forklift_etl/Forklift/Connection/Elasticsearch)\n- [Patterns](http://www.rubydoc.info/gems/forklift_etl/Forklift/Patterns/Elasticsearch)\n\n### Csv\n\n- [Transport](http://www.rubydoc.info/gems/forklift_etl/Forklift/Connection/Csv)\n\n## Transformations\n\nForklift allows you to create both Ruby transformations and script transformations.\n\n- It is up to the transport to define `exec_script`, and not all transports will support it.  Mysql can run `.sql` files, but there is no equivalent for elasticsearch. 
Mysql scripts are evaluated statement by statement. The delimiter (by default `;`) can be redefined using the `delimiter` command as described [here](http://dev.mysql.com/doc/refman/5.7/en/stored-programs-defining.html).\n- `.exec` runs and logs exceptions, while `.exec!` will raise on an error.  For example, `destination.exec(\"./transformations/cleanup.rb\")` will run cleanup.rb on the destination database.\n- Script files are run as-is, but Ruby transformations must define a `do!(connection, forklift, args)` method in their class.\n- `args` is optional, and can be passed in from your plan.\n\n```ruby\n# Example transformation to count users\n# count_users.rb\n\nclass CountUsers\n  def do!(connection, forklift, args)\n    forklift.logger.log \"counting users\"\n    count = connection.count('users')\n    forklift.logger.log \"[#{args[:name]}] found #{count} users\"\n  end\nend\n```\n\n```ruby\n# in your plan.rb\nplan = Forklift::Plan.new\nplan.do! do\n  destination = plan.connections[:mysql][:destination]\n  destination.exec!(\"./transformations/count_users.rb\", {name: 'user counter'})\nend\n```\n\n## Options \u0026 Notes\n- Thanks to [@rahilsondhi](https://github.com/rahilsondhi), [@rgarver](https://github.com/rgarver) and [Looksharp](https://www.looksharp.com/) for all their help\n- `email_args` is a hash consumed by the [Pony mail gem](https://github.com/benprew/pony)\n- Forklift's logger is [Lumberjack](https://github.com/bdurand/lumberjack) with a wrapper to also echo the log lines to stdout and save them to an array to be accessed later by the email system.\n- The mysql connections hash will be passed directly to a [mysql2](https://github.com/brianmario/mysql2) connection.\n- The elasticsearch connections hash will be passed directly to an [elasticsearch](https://github.com/elasticsearch/elasticsearch-ruby) connection.\n- Your databases must exist. 
Forklift will not create them for you.\n- Ensure your databases use the right encoding (e.g. utf8), or you will get errors like `#\u003cMysql2::Error: Incorrect string value: '\\xEF\\xBF\\xBDFal...' for column 'YOURCOLUMN' at row 1\u003e`\n- If testing locally, mailcatcher (https://github.com/sj26/mailcatcher) is a helpful gem to test your email sending\n\n## Contributing and Testing\nSee: [CONTRIBUTING](CONTRIBUTING.md)\n\n## Alternatives\nIf you want something similar for Node.js, try [Empujar](https://github.com/taskrabbit/empujar)\n","funding_links":[],"categories":["1. language"],"sub_categories":["1.1 ruby"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftaskrabbit%2Fforklift","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftaskrabbit%2Fforklift","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftaskrabbit%2Fforklift/lists"}