https://github.com/seamusabshere/data_miner

Download, unpack from a ZIP/TAR/GZ/BZ2 archive, parse, correct, convert units and import Google Spreadsheets, XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models. Uses RemoteTable gem internally.
# data_miner

Download, pull out of a ZIP/TAR/GZ/BZ2 archive, parse, correct, and import XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models.

Tested in MRI 1.8.7+, MRI 1.9.2+, and JRuby 1.6.7+. Thread safe.

## Real-world usage

We use `data_miner` for [data science at Brighter Planet](http://brighterplanet.com/research) and in production at:

* [Brighter Planet's reference data web service](http://data.brighterplanet.com)
* [Brighter Planet's impact estimate web service](http://impact.brighterplanet.com)

The killer combination for us is:

1. [`active_record_inline_schema`](https://github.com/seamusabshere/active_record_inline_schema) - define table structure
2. [`remote_table`](https://github.com/seamusabshere/remote_table) - download data and parse it
3. [`errata`](https://github.com/seamusabshere/errata) - apply corrections in a transparent way
4. [`data_miner`](https://github.com/seamusabshere/data_miner) (this library!) - import data idempotently
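
To make "import data idempotently" concrete, here is a minimal plain-Ruby sketch of the four steps above. It uses none of the four gems and is not how they are implemented: a `Hash` stands in for the table, an inline string stands in for the download, and `RAW`, `CORRECTIONS`, and `import!` are hypothetical names. The sample rows, including the deliberate typo, are made up for illustration.

```ruby
# Plain-Ruby sketch of the pipeline; no gems required.
# RAW, CORRECTIONS and import! are hypothetical names, not library API.

# Step 2 stand-in: "downloaded" data as an inline string instead of a URL.
# The rows (and the typo in the first one) are made-up sample data.
RAW = <<~DATA
  GB; GBR; 826; Untied Kingdom
  FR; FRA; 250; France
DATA

# Step 3 stand-in: a visible list of known corrections.
CORRECTIONS = { 'Untied Kingdom' => 'United Kingdom' }.freeze

# Step 4 stand-in: upsert each row into a store keyed by primary key.
# Step 1 stand-in: the "table" is just a Hash of key => attributes.
def import!(table)
  RAW.each_line do |line|
    code, alpha3, numeric, name = line.chomp.split('; ')
    name = CORRECTIONS.fetch(name, name)
    # Find-or-create by key, then overwrite attributes, so reruns are harmless.
    row = (table[code] ||= {})
    row[:iso_3166_alpha_3_code] = alpha3
    row[:iso_3166_numeric_code] = Integer(numeric)
    row[:name] = name
  end
  table
end

countries = {}
import!(countries)
first_pass = Marshal.load(Marshal.dump(countries)) # deep copy for comparison
import!(countries)                                 # second run changes nothing

puts countries == first_pass
puts countries['GB'][:name]
```

Because rows are looked up by key and then overwritten rather than appended, running the import any number of times leaves the store in the same state.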

## Documentation

Check out the [extensive documentation](http://rdoc.info/github/seamusabshere/data_miner).

## Quick start

You define `data_miner` blocks in your ActiveRecord models. For example, in `app/models/country.rb`:

```ruby
class Country < ActiveRecord::Base
  self.primary_key = 'iso_3166_code'

  # the "col" class method is provided by a different library - active_record_inline_schema
  col :iso_3166_code                                # alpha-2 2-letter like GB
  col :iso_3166_numeric_code, :type => :integer     # numeric like 826; aka UN M49 code
  col :iso_3166_alpha_3_code                        # 3-letter like GBR
  col :name

  data_miner do
    # auto_upgrade! is provided by active_record_inline_schema
    process :auto_upgrade!

    import("OpenGeoCode.org's Country Codes to Country Names list",
           :url => 'http://opengeocode.org/download/countrynames.txt',
           :format => :delimited,
           :delimiter => '; ',
           :headers => false,
           :skip => 22) do
      key   :iso_3166_code,         :field_number => 0
      store :iso_3166_alpha_3_code, :field_number => 1
      store :iso_3166_numeric_code, :field_number => 2
      store :name,                  :field_number => 5
    end
  end
end
```
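
To show what the `:delimiter`, `:skip`, and `:field_number` options describe, here is a rough plain-Ruby sketch of the same mapping. `parse_delimited` is a hypothetical helper written for this example, not part of `data_miner` or `remote_table`, and the sample text is made up: it only imitates a file with a few skipped lines and values in columns 0, 1, 2, and 5.

```ruby
# Hypothetical helper mirroring the options in the import block above;
# this is not how data_miner or remote_table is implemented.
def parse_delimited(text, delimiter:, skip:, fields:)
  text.lines.drop(skip).map do |line|
    cells = line.chomp.split(delimiter)
    # Map each attribute name to the cell at its zero-based field number.
    fields.each_with_object({}) { |(attr, n), row| row[attr] = cells[n] }
  end
end

# Made-up sample; columns 3 and 4 are placeholders that the mapping ignores.
SAMPLE = <<~TXT
  # preamble line 1
  # preamble line 2
  GB; GBR; 826; x; y; United Kingdom
  FR; FRA; 250; x; y; France
TXT

rows = parse_delimited(
  SAMPLE,
  delimiter: '; ',
  skip: 2, # the Quick start skips 22 lines in the real file
  fields: { iso_3166_code: 0, iso_3166_alpha_3_code: 1,
            iso_3166_numeric_code: 2, name: 5 }
)

rows.each { |row| puts row.inspect }
```

In the real declaration, `key` names the column used to find or create each record and `store` names the columns copied onto it; this sketch treats both the same way and returns plain hashes.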

Now you can run:

```
>> Country.run_data_miner!
=> nil
```

## More advanced usage

The [`earth` library](https://github.com/brighterplanet/earth) has dozens of real-life examples showing how to download, pull out of a ZIP/TAR/BZ2 archive, parse, correct, and import CSVs, fixed-width files, ODS, XLS, XLSX, even HTML and XML:


| Model | Highlights | Reference |
| --- | --- | --- |
| Aircraft | parsing Microsoft Frontpage HTML (!) | `data_miner.rb` |
| Airports | forcing column names and use of `:select` block (Proc) | `data_miner.rb` |
| Automobile model variants | super advanced usage of "custom parser" and errata | `data_miner.rb` |
| Country | parsing CSV and a few other tricks | `data_miner.rb` |
| EGRID regions | parsing XLS | `data_miner.rb` |
| Flight segment (stage) | super advanced usage of POSTing form data | `data_miner.rb` |
| Zip codes | downloading a ZIP file and pulling an XLSX out of it | `data_miner.rb` |

And many more: look for the `data_miner.rb` file that corresponds to each model. Note that you would normally put the `data_miner` declaration right inside the ActiveRecord model file; `earth` keeps it in a separate file so that loading it is optional.
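
One way to get that kind of separation in plain Ruby is to reopen the model class from a second file that is only required when needed. The sketch below assumes this approach for illustration and does not claim to reproduce how `earth` loads its files. `MinerStub`, `miner_defined?`, and `load_miner` are hypothetical stand-ins so the example runs without any gems, and a lambda plays the role of the separate file.

```ruby
# Hypothetical stand-in for the class-level DSL so the sketch runs without gems.
module MinerStub
  # Remember the block instead of running any import.
  def data_miner(&block)
    @miner_steps = block
  end

  def miner_defined?
    !@miner_steps.nil?
  end
end

# "Model file": the class itself, with no import steps.
class Airport
  extend MinerStub
end

# "Separate data_miner.rb": reopens the class and adds the import steps.
# In a real project this would be its own file, required on demand.
load_miner = lambda do
  Airport.class_eval do
    data_miner do
      # import steps would go here
    end
  end
end

before = Airport.miner_defined? # import steps not loaded yet
load_miner.call
after = Airport.miner_defined?  # import steps now attached

puts before
puts after
```

The model stays usable on its own, and the import steps are attached only when the second file is loaded.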

## Authors

* Seamus Abshere
* Andy Rossmeissl
* Derek Kastner
* Ian Hough
* Tower He

## Wishlist

* Make the tests real unit tests
* sql steps shouldn't shell out if binaries are missing

## Copyright

Copyright (c) 2013 Seamus Abshere