Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/seamusabshere/data_miner
Download, unpack from a ZIP/TAR/GZ/BZ2 archive, parse, correct, convert units and import Google Spreadsheets, XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models. Uses RemoteTable gem internally.
- Host: GitHub
- URL: https://github.com/seamusabshere/data_miner
- Owner: seamusabshere
- License: mit
- Created: 2009-08-19T12:46:06.000Z (over 15 years ago)
- Default Branch: master
- Last Pushed: 2014-02-27T14:23:08.000Z (almost 11 years ago)
- Last Synced: 2024-10-20T07:46:03.045Z (about 2 months ago)
- Language: Ruby
- Homepage:
- Size: 1.38 MB
- Stars: 302
- Watchers: 14
- Forks: 18
- Open Issues: 8
Metadata Files:
- Readme: README.markdown
- Changelog: CHANGELOG
- License: LICENSE
Awesome Lists containing this project
- awesome-ruby - data_miner - Download, pull out of a ZIP/TAR/GZ/BZ2 archive, parse, correct, and import XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models. (ORM/ODM Extensions)
README
# data_miner
Download, pull out of a ZIP/TAR/GZ/BZ2 archive, parse, correct, and import XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models.
Tested in MRI 1.8.7+, MRI 1.9.2+, and JRuby 1.6.7+. Thread safe.
## Real-world usage
We use `data_miner` for [data science at Brighter Planet](http://brighterplanet.com/research) and in production at
* [Brighter Planet's reference data web service](http://data.brighterplanet.com)
* [Brighter Planet's impact estimate web service](http://impact.brighterplanet.com)

The killer combination for us is:
1. [`active_record_inline_schema`](https://github.com/seamusabshere/active_record_inline_schema) - define table structure
2. [`remote_table`](https://github.com/seamusabshere/remote_table) - download data and parse it
3. [`errata`](https://github.com/seamusabshere/errata) - apply corrections in a transparent way
4. [`data_miner`](https://github.com/seamusabshere/data_miner) (this library!) - import data idempotently

## Documentation
Check out the [extensive documentation](http://rdoc.info/github/seamusabshere/data_miner).
## Quick start
You define `data_miner` blocks in your ActiveRecord models. For example, in `app/models/country.rb`:

```ruby
class Country < ActiveRecord::Base
  self.primary_key = 'iso_3166_code'

  # the "col" class method is provided by a different library - active_record_inline_schema
  col :iso_3166_code                            # alpha-2 2-letter like GB
  col :iso_3166_numeric_code, :type => :integer # numeric like 826; aka UN M49 code
  col :iso_3166_alpha_3_code                    # 3-letter like GBR
  col :name

  data_miner do
    # auto_upgrade! is provided by active_record_inline_schema
    process :auto_upgrade!

    import("OpenGeoCode.org's Country Codes to Country Names list",
           :url => 'http://opengeocode.org/download/countrynames.txt',
           :format => :delimited,
           :delimiter => '; ',
           :headers => false,
           :skip => 22) do
      key   :iso_3166_code, :field_number => 0
      store :iso_3166_alpha_3_code, :field_number => 1
      store :iso_3166_numeric_code, :field_number => 2
      store :name, :field_number => 5
    end
  end
end
```

Now you can run:

```
>> Country.run_data_miner!
=> nil
```

## More advanced usage
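Before reading larger examples, it may help to have a rough mental model of what an `import` block with `key` and `store` does for each row. The following is only a conceptual sketch in plain Ruby, not how `data_miner` is implemented: the `import_rows` helper, the in-memory `table` hash, and the sample text are all hypothetical stand-ins for the real download, parsing, and ActiveRecord layers.

```ruby
# Conceptual sketch only: plain Ruby, no data_miner, remote_table or ActiveRecord.

# Hypothetical helper mirroring the quick-start options: skip leading lines,
# split each remaining line on a delimiter, then map field numbers to columns.
def import_rows(text, table, skip:, delimiter:, key:, store:)
  key_column, key_field = key
  text.lines.drop(skip).each do |line|
    fields = line.chomp.split(delimiter)
    id = fields[key_field]
    next if id.nil? || id.empty?

    # Find-or-create by key, then overwrite the stored columns. Running the
    # same import again leaves the same rows, which is what "idempotent" means here.
    record = (table[id] ||= { key_column => id })
    store.each { |column, field_number| record[column] = fields[field_number] }
  end
  table
end

# Hypothetical sample shaped like a '; '-delimited file with two lines to skip.
# Values stay strings here; a real typed column would cast them.
sample = <<~TEXT
  comment line 1
  comment line 2
  GB; GBR; 826; x; y; United Kingdom
  FR; FRA; 250; x; y; France
TEXT

mapping = {
  skip: 2,
  delimiter: '; ',
  key: [:iso_3166_code, 0],
  store: { iso_3166_alpha_3_code: 1, iso_3166_numeric_code: 2, name: 5 }
}

table = {} # stands in for the countries table, keyed by primary key
import_rows(sample, table, **mapping)
import_rows(sample, table, **mapping) # second run adds and changes nothing
```

After both calls, `table` still holds exactly two rows, keyed by `"GB"` and `"FR"`. Keying each row on a stable identifier is what lets an import be re-run safely when the source data is updated.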
The [`earth` library](https://github.com/brighterplanet/earth) has dozens of real-life examples showing how to download, pull out of a ZIP/TAR/BZ2 archive, parse, correct, and import CSVs, fixed-width files, ODS, XLS, XLSX, even HTML and XML:
| Model | Highlights | Reference |
| --- | --- | --- |
| Aircraft | parsing Microsoft Frontpage HTML (!) | `data_miner.rb` |
| Airports | forcing column names and use of `:select` block (`Proc`) | `data_miner.rb` |
| Automobile model variants | super advanced usage of "custom parser" and errata | `data_miner.rb` |
| Country | parsing CSV and a few other tricks | `data_miner.rb` |
| EGRID regions | parsing XLS | `data_miner.rb` |
| Flight segment (stage) | super advanced usage of POSTing form data | `data_miner.rb` |
| Zip codes | downloading a ZIP file and pulling an XLSX out of it | `data_miner.rb` |

And many more - look for the `data_miner.rb` file that corresponds to each model. Note that you would normally put the `data_miner` declaration right inside the ActiveRecord model file... it's kept separate in `earth` so that loading it is optional.
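
One way to keep a declaration in its own file is to reopen the model class there, so the import steps only exist once that file is loaded. The sketch below shows the idea in a single listing. It runs without ActiveRecord or `data_miner` installed, so `DataMinerStub` is a hypothetical stand-in for the real class-level DSL, and the file paths in the comments are illustrative assumptions rather than the layout `earth` necessarily uses.

```ruby
# Hypothetical stand-in for the real class-level DSL, so this sketch runs
# without the gem: it only remembers the block it was given.
module DataMinerStub
  def data_miner(&block)
    @data_miner_block = block if block
    @data_miner_block
  end
end

# --- app/models/country.rb (illustrative path) ---
# The model itself, with no import declaration.
class Country
  extend DataMinerStub
end

# --- country/data_miner.rb (illustrative path) ---
# Loaded only when the import definition is wanted. With the real gem, the
# block would hold process/import steps like those in the quick start.
Country.class_eval do
  data_miner do
    :import_steps_would_go_here
  end
end
```

Code that only needs the model can load the first file alone; code that runs imports loads the second one as well.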
## Authors
* Seamus Abshere
* Andy Rossmeissl
* Derek Kastner
* Ian Hough
* Tower He

## Wishlist
* Make the tests real unit tests
* sql steps shouldn't shell out if binaries are missing

## Copyright
Copyright (c) 2013 Seamus Abshere