# Forklift ETL

Moving heavy databases around. [![Gem Version](https://badge.fury.io/rb/forklift_etl.svg)](http://badge.fury.io/rb/forklift_etl)
[![Build Status](https://secure.travis-ci.org/taskrabbit/forklift.png?branch=master)](http://travis-ci.org/taskrabbit/forklift)

![picture](forklift_small.jpg)

## What?

[Forklift](https://github.com/taskrabbit/forklift) is a Ruby gem that makes it easy to move your data around. Forklift can be an integral part of your data warehouse pipeline or a backup tool. It can collect and collapse data from multiple sources or across a single source. Forklift's first version was a MySQL-only tool, but you can now create transports for whatever data stores you need.

## Set up

Make a new directory with a `Gemfile` like this:
```ruby
source 'https://rubygems.org'
gem 'forklift_etl'
```

Then run `bundle`.

Use the generator by running `(bundle exec) forklift --generate`.

Make your `plan.rb` using the examples below.

Run your plan with `forklift plan.rb`.
You can run specific parts of your plan like `forklift plan.rb step1 step5`.

### Directory structure
Forklift expects your project to be arranged like:

```bash
├── config/
│   └── email.yml
├── connections/
│   ├── mysql/
│   │   └── (DB).yml
│   ├── elasticsearch/
│   │   └── (DB).yml
│   └── csv/
│       └── (file).yml
├── log/
├── pid/
├── template/
├── patterns/
├── transformations/
├── Gemfile
├── Gemfile.lock
└── plan.rb
```

To enable a forklift connection, place its yml config file at `config/connections/(type)/(name).yml`.
Files you place within `patterns/` or `connections/(type)/` will be loaded automatically.
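
For example, a minimal sketch of `config/connections/mysql/source.yml`; the keys are handed straight to the underlying mysql2 connection (see Options & Notes), and every value here is illustrative:

```yml
# config/connections/mysql/source.yml (illustrative values)
# this hash is passed directly to a mysql2 connection
encoding: utf8
host: 127.0.0.1
port: 3306
username: root
password: ""
database: source_db
```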

## Examples

### Example Project

Visit the [`/example`](https://github.com/taskrabbit/forklift/tree/master/example) directory to see a whole forklift project.

### Simple extract and load (no transformations)

If you have multiple databases and want to consolidate into one, this plan
should suffice.

```ruby
plan = Forklift::Plan.new

plan.do! do
  # ==> Connections
  service1 = plan.connections[:mysql][:service1]
  service2 = plan.connections[:mysql][:service2]
  analytics_working = plan.connections[:mysql][:analytics_working]
  analytics = plan.connections[:mysql][:analytics]

  # ==> Extract
  # Load data from your services into your working database
  # If you want every table: service1.tables.each do |table|
  # Data will be extracted in 1000-row collections
  %w(users organizations).each do |table|
    service1.read("select * from `#{table}`") { |data| analytics_working.write(data, table) }
  end

  %w(orders line_items).each do |table|
    service2.read("select * from `#{table}`") { |data| analytics_working.write(data, table) }
  end

  # ==> Load
  # Load data from the working database to the final database
  analytics_working.tables.each do |table|
    # Attempts an incremental pipe, falling back to a full table copy
    # By default, incremental updates key off the `updated_at` column, but you can change this via the `matcher` option
    # If you want a full pipe instead of an incremental one, use `pipe` instead of `optimistic_pipe`
    # The pipe pattern works within the same database; to copy across databases, try the `mysql_optimistic_import` method
    # This example shows the options with their default values
    Forklift::Patterns::Mysql.optimistic_pipe(analytics_working.current_database, table, analytics.current_database, table, matcher: 'updated_at', primary_key: 'id')
  end
end
```
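
The comments above mention `mysql_optimistic_import` for copying across databases. A hedged sketch of how it might be invoked, assuming it takes source and destination connections the way `optimistic_pipe` does in the Steps example below; check the pattern docs linked under Transports for the exact signature:

```ruby
# assumed signature: (source_connection, destination_connection)
Forklift::Patterns::Mysql.mysql_optimistic_import(analytics_working, analytics)
```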

### Simple MySQL ETL
```ruby
plan = Forklift::Plan.new
plan.do! do
  destination = plan.connections[:mysql][:destination]

  # Do some SQL transformations
  # SQL transformations are run exactly as they are written
  destination.exec!("./transformations/combined_name.sql")

  # Do some Ruby transformations
  # Ruby transformations expect `do!(connection, forklift)` to be defined
  destination.exec!("./transformations/email_suffix.rb")

  # MySQL-dump the destination
  destination.dump('/tmp/destination.sql.gz')
end
```
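
The `email_suffix.rb` referenced above is a Ruby transformation of the kind described under Transformations below. A minimal sketch of what such a file could look like (the class body is illustrative, not the shipped example):

```ruby
# transformations/email_suffix.rb (illustrative)
# Ruby transformations define a class with a `do!` method; by convention the
# class name matches the file name (email_suffix => EmailSuffix), as in the
# CountUsers example under Transformations
class EmailSuffix
  def do!(connection, forklift, args)
    suffixes = Hash.new(0)
    connection.read("select email from users") do |data|
      data.each { |row| suffixes[row[:email].to_s.split('@').last] += 1 }
    end
    forklift.logger.log "top email suffixes: #{suffixes.sort_by { |_, count| -count }.first(5)}"
  end
end
```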

### Elasticsearch to MySQL
```ruby
plan = Forklift::Plan.new
plan.do! do
  source = plan.connections[:elasticsearch][:source]
  destination = plan.connections[:mysql][:destination]
  table = 'es_import'
  index = 'aaa'
  query = { query: { match_all: {} } } # pagination will happen automatically
  destination.truncate!(table) if destination.tables.include?(table)
  source.read(index, query) { |data| destination.write(data, table) }
end
```

### MySQL to Elasticsearch
```ruby
plan = Forklift::Plan.new
plan.do! do
  source = plan.connections[:mysql][:source]
  destination = plan.connections[:elasticsearch][:destination]
  index = 'users'
  query = "select * from users" # pagination will happen automatically
  source.read(query) { |data| destination.write(data, index, true, 'user') }
end
```

## Forklift Emails

#### Setup
Put this at the end of your plan inside the `do!` block.

```ruby
# ==> Email
# Let your team know the outcome. Attaches the log.
email_args = {
  to: "you@example.com",
  from: "Forklift",
  subject: "Forklift has moved your database @ #{Time.new}",
  body: "So much data!"
}
plan.mailer.send(email_args, plan.logger.messages)
```

#### ERB templates
You can get fancy by using an ERB template for your email and SQL variables:

```ruby
# ==> Email
# Let your team know the outcome. Attaches the log.
email_args = {
  to: "you@example.com",
  from: "Forklift",
  subject: "Forklift has moved your database @ #{Time.new}"
}
email_variables = {
  total_users_count: service1.read('select count(1) as "count" from users')[0][:count]
}
email_template = "./template/email.erb"
plan.mailer.send_template(email_args, email_template, email_variables, plan.logger.messages)
```

Then in `template/email.erb`:

```erb
<h1>Your forklift email</h1>

<ul>
  <li>Total Users: <%= @total_users_count %></li>
</ul>
```

#### Config
When you run `forklift --generate`, we create `config/email.yml` for you:

```yml
# Configuration is passed to Pony (https://github.com/benprew/pony)

# ==> SMTP
# If testing locally, mailcatcher (https://github.com/sj26/mailcatcher) is a helpful gem
via: smtp
via_options:
  address: localhost
  port: 1025
  # user_name: user
  # password: password
  # authentication: :plain # :plain, :login, :cram_md5; no auth by default
  # domain: "localhost.localdomain" # the HELO domain provided by the client to the server

# ==> Sendmail
# via: sendmail
# via_options:
#   location: /usr/sbin/sendmail
#   arguments: '-t -i'
```

## Workflow

```ruby
# do! is a wrapper around common setup methods (pidfile locking, setting up the logger, etc.)
# you don't need to use do! if you want finer control
def do!
  # you can use `plan.logger.log` in your plan for logging
  self.logger.log "Starting forklift"

  # use a pidfile to ensure that only one instance of forklift runs at a time; store the pid if OK
  self.pid.safe_to_run?
  self.pid.store!

  # this will load all connections in /config/connections/#{type}/#{name}.yml into the plan.connections hash
  # and build all the connection objects (and try to connect in some cases)
  self.connect!

  yield # your stuff here!

  # remove the pidfile
  self.logger.log "Completed forklift"
  self.pid.delete!
end
```
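
If you skip `do!` for finer control, the same pieces can be called directly; a minimal sketch, assuming the `logger`, `pid`, and `connect!` methods shown above are public on the plan:

```ruby
plan = Forklift::Plan.new
plan.logger.log "Starting forklift (manual control)"

# pidfile locking, exactly as do! would perform it
plan.pid.safe_to_run?
plan.pid.store!

# load /config/connections/**/*.yml and build the connection objects
plan.connect!

# ... your extract/transform/load steps here ...

plan.logger.log "Completed forklift"
plan.pid.delete!
```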

### Steps

You can optionally divide up your forklift plan into steps:

```ruby
plan = Forklift::Plan.new
plan.do! do

  plan.step('Mysql Import') {
    source = plan.connections[:mysql][:source]
    destination = plan.connections[:mysql][:destination]
    source.tables.each do |table|
      Forklift::Patterns::Mysql.optimistic_pipe(source, table, destination, table)
    end
  }

  plan.step('Elasticsearch Import') {
    source = plan.connections[:elasticsearch][:source]
    destination = plan.connections[:mysql][:destination]
    table = 'es_import'
    index = 'aaa'
    query = { query: { match_all: {} } } # pagination will happen automatically
    destination.truncate!(table) if destination.tables.include?(table)
    source.read(index, query) { |data| destination.write(data, table) }
  }

end
```

When you use steps, you can run your whole plan or just part of it with command-line arguments. For example, `forklift plan.rb "Elasticsearch Import"` would run just that single portion of the plan. Note that any parts of your plan not within a step will run every time.

### Error Handling

By default, exceptions within your plan will raise and crash your application. However, you can pass an optional `error_handler` lambda to your step to decide how to handle errors. The `error_handler` is called with (`step_name`, `exception`). If you don't re-raise within your error handler, your plan will continue to execute. For example:

```ruby
error_handler = lambda { |name, exception|
  if exception.class.to_s =~ /connection/i
    # I can't connect; I should halt
    raise exception
  elsif exception.class.to_s =~ /SoftError/
    # this type of error is OK
  else
    raise exception
  end
}

plan.step('a_complex_step', error_handler) {
  # ...
}
```

## Transports

Transports are how you interact with your data. Every transport defines `read` and `write` methods that handle arrays of data objects, along with whatever helper methods it requires.

Each transport should have a config file in `./config/connections/#{transport}/`. It will be loaded at boot.

Transports may optionally define helper methods that shortcut copying data *within* a transport, like the MySQL `pipe` methods (e.g. `insert into #{to_db}.#{to_table} select * from #{from_db}.#{from_table}`). A transport may also define other helpers (like how to create a MySQL dump). These should be defined in `/patterns/#{type}.rb` within the `Forklift::Patterns::#{type}` namespace.
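
As a concrete illustration, a pattern file might look like the sketch below; it is built only on the `read`/`write` transport API documented in this README, and `row_copy` is a hypothetical name rather than a shipped pattern:

```ruby
# patterns/mysql.rb — illustrative; `row_copy` is not a shipped pattern
module Forklift
  module Patterns
    module Mysql
      # copy all rows between two tables reachable from one connection,
      # using only the transport's documented read/write methods
      def self.row_copy(connection, from_table, to_table)
        connection.read("select * from `#{from_table}`") do |data|
          connection.write(data, to_table)
        end
      end
    end
  end
end
```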

### Creating your own transport

In the `/connections` directory in your project, create a file that defines at least the following:

```ruby
module Forklift
  module Connection
    class Mixpanel < Forklift::Base::Connection

      def initialize(config, forklift)
        @config = config
        @forklift = forklift
      end

      def config
        @config
      end

      def forklift
        @forklift
      end

      def read(index, query, args)
        # ...
        data = [] # data is an array of hashes
        # ...
        if block_given?
          yield data
        else
          return data
        end
      end

      def write(data, table)
        # data is an array of hashes
        # "table" can be any argument(s) you need to know where/how to write
        # ...
      end

      def pipe(from_table, from_db, to_table, to_db)
        # ...
      end

      private

    end
  end
end
```
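
With a matching config file at `./config/connections/mixpanel/(name).yml`, the new transport should be reachable in a plan like any other connection. A sketch, where the `:mixpanel` key mirrors the class name and the connection name `events` is hypothetical:

```ruby
plan = Forklift::Plan.new
plan.do! do
  mixpanel = plan.connections[:mixpanel][:events] # from config/connections/mixpanel/events.yml
  destination = plan.connections[:mysql][:destination]

  # read(index, query, args) as defined by the transport above
  mixpanel.read('events', {}, {}) do |data|
    destination.write(data, 'mixpanel_events')
  end
end
```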

Existing transports, and the patterns built on them, are documented [here](http://www.rubydoc.info/gems/forklift_etl).

### MySQL

- [Transport](http://www.rubydoc.info/gems/forklift_etl/Forklift/Connection/Mysql)
- [Patterns](http://www.rubydoc.info/gems/forklift_etl/Forklift/Patterns/Mysql)

### Elasticsearch

- [Transport](http://www.rubydoc.info/gems/forklift_etl/Forklift/Connection/Elasticsearch)
- [Patterns](http://www.rubydoc.info/gems/forklift_etl/Forklift/Patterns/Elasticsearch)

### Csv

- [Transport](http://www.rubydoc.info/gems/forklift_etl/Forklift/Connection/Csv)

## Transformations

Forklift allows you to create both Ruby transformations and script transformations.

- It is up to the transport to define `exec_script`, and not all transports will support it. MySQL can run `.sql` files, but there is no equivalent for Elasticsearch. MySQL scripts are evaluated statement by statement. The delimiter (by default `;`) can be redefined using the `delimiter` command as described [here](http://dev.mysql.com/doc/refman/5.7/en/stored-programs-defining.html).
- `.exec` runs and logs exceptions, while `.exec!` will raise on an error. For example, `destination.exec("./transformations/cleanup.rb")` will run `cleanup.rb` against the destination database.
- Script files are run as-is, but Ruby transformations must define a class with a `do!(connection, forklift, args)` method.
- `args` is optional, and can be passed in from your plan, as in the example below.

```ruby
# Example transformation to count users
# count_users.rb

class CountUsers
  def do!(connection, forklift, args)
    forklift.logger.log "counting users"
    count = connection.count('users')
    forklift.logger.log "[#{args[:name]}] found #{count} users"
  end
end
```

```ruby
# in your plan.rb
plan = Forklift::Plan.new
plan.do! do
  destination = plan.connections[:mysql][:destination]
  destination.exec!("./transformations/count_users.rb", { name: 'user counter' })
end
```

## Options & Notes
- Thanks to [@rahilsondhi](https://github.com/rahilsondhi), [@rgarver](https://github.com/rgarver) and [Looksharp](https://www.looksharp.com/) for all their help
- email_options is a hash consumed by the [Pony mail gem](https://github.com/benprew/pony)
- Forklift's logger is [Lumberjack](https://github.com/bdurand/lumberjack) with a wrapper to also echo the log lines to stdout and save them to an array to be accessed later by the email system.
- The mysql connections hash will be passed directly to a [mysql2](https://github.com/brianmario/mysql2) connection.
- The elasticsearch connections hash will be passed directly to a [elasticsearch](https://github.com/elasticsearch/elasticsearch-ruby) connection.
- Your databases must exist. Forklift will not create them for you.
- Ensure your databases have the right encoding (eg utf8) or you will get errors like `#`
- If testing locally, mailcatcher (https://github.com/sj26/mailcatcher) is a helpful gem to test your email sending

## Contributing and Testing
See: [CONTRIBUTING](CONTRIBUTING.md)

## Alternatives
If you want something similar for Node.js, try [Empujar](https://github.com/taskrabbit/empujar).