https://github.com/sckott/elastic_data
Elasticsearch datasets ready for bulk loading
- Host: GitHub
- URL: https://github.com/sckott/elastic_data
- Owner: sckott
- Created: 2014-11-11T02:35:28.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2018-05-15T19:00:01.000Z (over 6 years ago)
- Last Synced: 2024-10-09T15:49:16.951Z (29 days ago)
- Topics: bulk-loading, dataset, elastic, elasticsearch
- Homepage:
- Size: 5.52 MB
- Stars: 43
- Watchers: 7
- Forks: 21
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
elastic datasets
================

This is a collection of smallish datasets to use for playing with Elasticsearch.
You can only fit so much data in an R package. The R client for Elasticsearch that we maintain,
[elastic](https://github.com/ropensci/elastic), comes with some data, but of course
it's nice to have more, so here it is.

See also [nodbi](https://github.com/ropensci/nodbi) for working with Elasticsearch from R.
## Datasets
* `plos_everything.json`
* `plos_introductions.json`
* `plos_data.json`
* `geonames_elastic_bulk.zip` - too big for GitHub, [at Dropbox](https://www.dropbox.com/s/8vcrt3g2d0pfw8l/geonames_elastic_bulk.zip?dl=0)
* `gbif_data.json`
* `gbif_geo.json`
* `gbif_geopoint.json`
* `gbif_geoshape.json`
* `gbif_geosmall.json`
* `shakespeare_data.json`
* `omdb.json`

## Loading into ES
These datasets are formatted to be ready for bulk loading into Elasticsearch
via the [bulk API](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html).
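For the single-file datasets above, loading is essentially a one-liner with the [elastic](https://github.com/ropensci/elastic) R client. Here is a minimal sketch, assuming Elasticsearch is running locally on the default port, this repo is your working directory, and an older elastic version where `connect()` sets up a global connection (newer versions pass a connection object to each function):

```r
library("elastic")
connect()  # defaults to localhost:9200; adjust host/port as needed

# docs_bulk() POSTs the file to the /_bulk endpoint; the target index
# and type come from the action lines inside the file itself
invisible(docs_bulk("shakespeare_data.json"))
```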
### geonames

`geonames_elastic_bulk.zip` is about 70 `.json` files in Elasticsearch bulk format. It was prepared from the [Geonames](http://www.geonames.org/) database at [http://download.geonames.org/export/dump/](http://download.geonames.org/export/dump/). The original data from Geonames was licensed under a Creative Commons Attribution 3.0 License, see [http://creativecommons.org/licenses/by/3.0/](http://creativecommons.org/licenses/by/3.0/).
To load the geonames data into Elasticsearch, use whatever tool you like; for example, in R you could do the following.

First, create the index and set the `geo_shape` mapping:
```r
# install and load the elastic R client if you haven't already
# devtools::install_github("ropensci/elastic")
library("elastic")

# index settings: map the `location` field as a geo_shape
body <- '{
  "mappings": {
    "record": {
      "properties": {
        "location": {"type": "geo_shape"}
      }
    }
  }
}'
index_create(index = "geonames", body = body)
```

should return:
```r
#> $acknowledged
#> [1] TRUE
```

Note: the index type is `record`, and the index name is `geonames`. The index and index type were set in the json files.
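If you want to confirm the mapping took, `mapping_get()` in the elastic client should show `location` typed as `geo_shape`; a sketch, with output shape varying by Elasticsearch and client version:

```r
# inspect the mapping for the geonames index; `location` should show
# up with type geo_shape under the `record` type
mapping_get("geonames")
```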
Then use a for loop to load in each file. AFAIK there is a limit on the file size you can load in one request (let me know if there's a way to get around it), so that's why there's a bunch of json files instead of one big file.
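If you haven't already fetched the archive, base R can download and unzip it. A sketch, assuming the Dropbox link above still resolves (note `dl=1` for a direct download) and with placeholder paths:

```r
# download the archive linked above and unzip it next to your working directory
zip_url <- "https://www.dropbox.com/s/8vcrt3g2d0pfw8l/geonames_elastic_bulk.zip?dl=1"
download.file(zip_url, destfile = "geonames_elastic_bulk.zip", mode = "wb")
unzip("geonames_elastic_bulk.zip", exdir = "path/to/unzipped/files")
```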
```r
devtools::install_github("ropensci/elastic")
library("elastic")

# full paths to the unzipped bulk files
files <- list.files("path/to/unzipped/files", full.names = TRUE)
for (i in seq_along(files)) {
  invisible(docs_bulk(files[i]))
}
```

The `docs_bulk()` function uses the `/_bulk` endpoint to `POST` data to an index called `geonames` in your ES server. The bulk load call prints a lot of information, which is why it's wrapped in `invisible()`, so you don't get thousands of lines printed.
Check that it worked:

```r
Search("geonames")$hits$total
#> [1] 6646030
```

You should have ~6.6 million records.
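An equivalent check goes through the count API via the elastic client's `count()`, assuming the same connection:

```r
count("geonames")  # should match the hits total above, ~6.6 million
```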