{"id":13484335,"url":"https://github.com/seamusabshere/data_miner","last_synced_at":"2025-09-07T15:40:23.344Z","repository":{"id":640129,"uuid":"281943","full_name":"seamusabshere/data_miner","owner":"seamusabshere","description":"Download, unpack from a ZIP/TAR/GZ/BZ2 archive, parse, correct, convert units and import Google Spreadsheets, XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models. Uses RemoteTable gem internally.","archived":false,"fork":false,"pushed_at":"2014-02-27T14:23:08.000Z","size":1442,"stargazers_count":306,"open_issues_count":8,"forks_count":18,"subscribers_count":14,"default_branch":"master","last_synced_at":"2025-08-28T18:45:28.309Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/seamusabshere.png","metadata":{"files":{"readme":"README.markdown","changelog":"CHANGELOG","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2009-08-19T12:46:06.000Z","updated_at":"2025-03-05T08:16:02.000Z","dependencies_parsed_at":"2022-08-16T10:35:07.279Z","dependency_job_id":null,"html_url":"https://github.com/seamusabshere/data_miner","commit_stats":null,"previous_names":[],"tags_count":124,"template":false,"template_full_name":null,"purl":"pkg:github/seamusabshere/data_miner","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seamusabshere%2Fdata_miner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seamusabshere%2Fdata_miner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seamusabshere%2Fdata_miner/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seamusabshere%2Fdata_miner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/seamusabshere","download_url":"https://codeload.github.com/seamusabshere/data_miner/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seamusabshere%2Fdata_miner/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274058725,"owners_count":25215197,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-07T02:00:09.463Z","response_time":67,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T17:01:22.693Z","updated_at":"2025-09-07T15:40:23.260Z","avatar_url":"https://github.com/seamusabshere.png","language":"Ruby","funding_links":[],"categories":["ORM/ODM Extensions","Ruby","ActiveRecord"],"sub_categories":[],"readme":"# data_miner\n\nDownload, pull out of a ZIP/TAR/GZ/BZ2 archive, parse, correct, and import XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models.\n\nTested in MRI 1.8.7+, MRI 1.9.2+, and JRuby 1.6.7+. Thread safe.\n\n## Real-world usage\n\n\u003cp\u003e\u003ca href=\"http://brighterplanet.com\"\u003e\u003cimg src=\"https://s3.amazonaws.com/static.brighterplanet.com/assets/logos/flush-left/inline/green/rasterized/brighter_planet-160-transparent.png\" alt=\"Brighter Planet logo\"/\u003e\u003c/a\u003e\u003c/p\u003e\n\nWe use `data_miner` for [data science at Brighter Planet](http://brighterplanet.com/research) and in production at\n\n* [Brighter Planet's reference data web service](http://data.brighterplanet.com)\n* [Brighter Planet's impact estimate web service](http://impact.brighterplanet.com)\n\nThe killer combination for us is:\n\n1. [`active_record_inline_schema`](https://github.com/seamusabshere/active_record_inline_schema) - define table structure\n2. [`remote_table`](https://github.com/seamusabshere/remote_table) - download data and parse it\n3. [`errata`](https://github.com/seamusabshere/errata) - apply corrections in a transparent way\n4. [`data_miner`](https://github.com/seamusabshere/data_miner) (this library!) - import data idempotently\n\n## Documentation\n\nCheck out the [extensive documentation](http://rdoc.info/github/seamusabshere/data_miner).\n\n## Quick start\n\nYou define \u003ccode\u003edata_miner\u003c/code\u003e blocks in your ActiveRecord models. For example, in \u003ccode\u003eapp/models/country.rb\u003c/code\u003e:\n\n    class Country \u003c ActiveRecord::Base\n      self.primary_key = 'iso_3166_code'\n\n      # the \"col\" class method is provided by a different library - active_record_inline_schema\n      col :iso_3166_code                            # alpha-2 2-letter like GB\n      col :iso_3166_numeric_code, :type =\u003e :integer # numeric like 826; aka UN M49 code\n      col :iso_3166_alpha_3_code                    # 3-letter like GBR\n      col :name\n  \n      data_miner do\n        # auto_upgrade! is provided by active_record_inline_schema\n        process :auto_upgrade!\n\n        import(\"OpenGeoCode.org's Country Codes to Country Names list\",\n               :url =\u003e 'http://opengeocode.org/download/countrynames.txt',\n               :format =\u003e :delimited,\n               :delimiter =\u003e '; ',\n               :headers =\u003e false,\n               :skip =\u003e 22) do\n          key   :iso_3166_code, :field_number =\u003e 0\n          store :iso_3166_alpha_3_code, :field_number =\u003e 1\n          store :iso_3166_numeric_code, :field_number =\u003e 2\n          store :name, :field_number =\u003e 5\n        end\n      end\n    end\n\nNow you can run:\n\n    \u003e\u003e Country.run_data_miner!\n    =\u003e nil\n\n## More advanced usage\n\nThe [`earth` library](https://github.com/brighterplanet/earth) has dozens of real-life examples showing how to download, pull out of a ZIP/TAR/BZ2 archive, parse, correct, and import CSVs, fixed-width files, ODS, XLS, XLSX, even HTML and XML:\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003eModel\u003c/th\u003e\n    \u003cth\u003eHighlights\u003c/th\u003e\n    \u003cth\u003eReference\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003ca href=\"http://data.brighterplanet.com/aircraft\"\u003eAircraft\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003eparsing Microsoft Frontpage HTML (!)\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://github.com/brighterplanet/earth/blob/master/lib/earth/air/aircraft/data_miner.rb\"\u003edata_miner.rb\u003c/a\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003ca href=\"http://data.brighterplanet.com/airports\"\u003eAirports\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003eforcing column names and use of \u003ccode\u003e:select\u003c/code\u003e block (\u003ccode\u003eProc\u003c/code\u003e)\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://github.com/brighterplanet/earth/blob/master/lib/earth/air/airport/data_miner.rb\"\u003edata_miner.rb\u003c/a\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003ca href=\"http://data.brighterplanet.com/automobile_make_model_year_variants\"\u003eAutomobile model variants\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003esuper advanced usage of \"custom parser\" and errata\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://github.com/brighterplanet/earth/blob/master/lib/earth/automobile/automobile_make_model_year_variant/data_miner.rb\"\u003edata_miner.rb\u003c/a\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003ca href=\"http://data.brighterplanet.com/countries\"\u003eCountry\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003eparsing CSV and a few other tricks\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://github.com/brighterplanet/earth/blob/master/lib/earth/locality/country/data_miner.rb\"\u003edata_miner.rb\u003c/a\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003ca href=\"http://data.brighterplanet.com/egrid_regions\"\u003eEGRID regions\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003eparsing XLS\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://github.com/brighterplanet/earth/blob/master/lib/earth/locality/egrid_region/data_miner.rb\"\u003edata_miner.rb\u003c/a\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003ca href=\"http://data.brighterplanet.com/flight_segments\"\u003eFlight segment (stage)\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003esuper advanced usage of POSTing form data\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://github.com/brighterplanet/earth/blob/master/lib/earth/air/flight_segment/data_miner.rb\"\u003edata_miner.rb\u003c/a\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003ca href=\"http://data.brighterplanet.com/zip_codes\"\u003eZip codes\u003c/a\u003e\u003c/td\u003e\n    \u003ctd\u003edownloading a ZIP file and pulling an XLSX out of it\u003c/td\u003e\n    \u003ctd\u003e\u003ca href=\"https://github.com/brighterplanet/earth/blob/master/lib/earth/locality/zip_code.rb\"\u003edata_miner.rb\u003c/a\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\nAnd many more - look for the `data_miner.rb` file that corresponds to each model. Note that you would normally put the `data_miner` declaration right inside the ActiveRecord model file... it's kept separate in `earth` so that loading it is optional.\n\n## Authors\n\n* Seamus Abshere \u003cseamus@abshere.net\u003e\n* Andy Rossmeissl \u003candy@rossmeissl.net\u003e\n* Derek Kastner \u003cdkastner@gmail.com\u003e\n* Ian Hough \u003cijhough@gmail.com\u003e\n* Tower He \u003ctowerhe@gmail.com\u003e\n\n## Wishlist\n\n* Make the tests real unit tests\n* sql steps shouldn't shell out if binaries are missing\n\n## Copyright\n\nCopyright (c) 2013 Seamus Abshere\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseamusabshere%2Fdata_miner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fseamusabshere%2Fdata_miner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseamusabshere%2Fdata_miner/lists"}