https://github.com/apanimesh061/yelpdatasetetl
A MongoDb to Elasticsearch ETL pipeline
elasticsearch etl-pipeline ingestion mongodb python-2
- Host: GitHub
- URL: https://github.com/apanimesh061/yelpdatasetetl
- Owner: apanimesh061
- Created: 2017-10-03T02:36:51.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-10-03T03:11:00.000Z (over 7 years ago)
- Last Synced: 2025-01-07T02:59:33.277Z (4 months ago)
- Topics: elasticsearch, etl-pipeline, ingestion, mongodb, python-2
- Language: Python
- Size: 15.6 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# YelpDatasetETL
A MongoDB to Elasticsearch ETL pipeline.

This project is a mini ETL pipeline that streams documents from a MongoDB collection to an Elasticsearch index after applying two layers of transformation.
- The first transformation converts the Mongo BSON document to JSON that conforms to the Elasticsearch schema defined [here](https://github.com/apanimesh061/YelpDatasetETL/blob/master/es_mappins/yelp_mapping.json).
- The second transformation is applied to selected collections that contain text fields, on which a VADER sentiment analyzer is run.

#### VADER Sentiment Analyzer
In this project, the analyzer can be applied in two ways:
- using the NLTK package tool `nltk.sentiment.vader.SentimentIntensityAnalyzer`
- creating an ingestion plugin that does the analysis
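
The first option can be sketched as follows. This is a minimal illustration written for Python 3, not the project's actual code: the function names, the `text` input field and the `sentiment` output key are assumptions. The analyzer is passed in as an argument, so any object exposing `polarity_scores(text)` works, which is the method NLTK's `SentimentIntensityAnalyzer` provides.

```python
def make_nltk_analyzer():
    """Build NLTK's VADER analyzer (needs the `vader_lexicon` resource)."""
    # Imported lazily so the rest of the module works without NLTK installed.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer()


def add_sentiment(doc, analyzer, field="text", target="sentiment"):
    """Return a copy of `doc` with VADER scores stored under `target`.

    `field` and `target` are hypothetical names chosen for this sketch.
    Documents without a non-empty string in `field` are returned unchanged.
    """
    text = doc.get(field)
    if not isinstance(text, str) or not text.strip():
        return dict(doc)
    enriched = dict(doc)
    # polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound'.
    enriched[target] = analyzer.polarity_scores(text)
    return enriched
```

With NLTK installed and the lexicon fetched via `nltk.download('vader_lexicon')`, a call such as `add_sentiment(review, make_nltk_analyzer())` would attach the four scores to a review document before it is indexed.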
The aim of the project was to create an ETL pipeline as well as learn about the [Ingestion pipeline](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/ingest-apis.html) introduced in Elasticsearch 5.x.

----
- Elasticsearch Version: 5.2.1
- Python: 2.7.13
- [VaderSentimentJava](https://github.com/apanimesh061/VaderSentimentJava): 1.0.1
- [elasticsearch-sentiment-plugin](https://github.com/apanimesh061/elasticsearch-sentiment-plugin): 1.0.1
----
Dataset used: [https://www.yelp.com/dataset](https://www.yelp.com/dataset)
- business: 77445 records
- photo_business: 200000 records
- checkin: 55569 records
- review: 2225213 records
- tip: 591864 records
- users: 552339 records

----
#### Usage

    $ python YelpEtlPipeline.py -c business,user,checkin,tip,photo,review -t -n 4
    Connected to MongoDB Client
    Connected to ElasticSearch Client

    Indexing business...
    Indexed 77445 / 77445 documents with 0 failures
    Time taken for business ingestion : 65.8003674392 seconds.

    Indexing user...
    Indexed 552339 / 552339 documents with 0 failures
    Time taken for user ingestion : 433.517403755 seconds.

    Indexing checkin...
    Indexed 55569 / 55569 documents with 0 failures
    Time taken for checkin ingestion : 58.5147706969 seconds.

    Indexing tip...
    Indexed 591864 / 591864 documents with 0 failures
    Time taken for tip ingestion : 410.815298934 seconds.

    Indexing photo...
    Indexed 200000 / 200000 documents with 0 failures
    Time taken for photo ingestion : 102.609901614 seconds.

    Indexing review...
    Indexed 2225213 / 2225213 documents with 0 failures
    Time taken for review ingestion : 10633.6480628 seconds.
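
The flow behind this output (read from a collection, transform each document, index in bulk, count successes and failures) could look roughly like the sketch below. It is an illustration under stated assumptions rather than the code in `YelpEtlPipeline.py`: the helper names, the `yelp` index name and the use of the collection name as the document type are all hypothetical, and the indexing step is injected as a callable so the sketch runs without a live cluster. It is written for Python 3.

```python
import datetime


def to_json_safe(value):
    """Recursively convert values JSON cannot encode (dates, ObjectId-like ids)."""
    if isinstance(value, dict):
        return {key: to_json_safe(val) for key, val in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(item) for item in value]
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    # Anything else (for example a BSON ObjectId) is stored as its string form.
    return str(value)


def to_bulk_action(doc, index, doc_type):
    """Wrap one MongoDB document as an Elasticsearch bulk action."""
    source = to_json_safe(doc)  # builds a new dict, so `doc` is not mutated
    doc_id = source.pop("_id", None)  # Mongo's _id becomes the ES document id
    action = {"_index": index, "_type": doc_type, "_source": source}
    if doc_id is not None:
        action["_id"] = doc_id
    return action


def run_collection(name, docs, bulk_index, index="yelp"):
    """Index every document in `docs` and return (indexed, failed).

    `bulk_index` takes an iterable of actions and yields (ok, info) pairs,
    the shape produced by elasticsearch.helpers.streaming_bulk.
    """
    actions = (to_bulk_action(doc, index, name) for doc in docs)
    indexed = failed = 0
    for ok, _info in bulk_index(actions):
        if ok:
            indexed += 1
        else:
            failed += 1
    return indexed, failed
```

Against a real cluster, `bulk_index` could be something like `lambda actions: helpers.streaming_bulk(es, actions, raise_on_error=False)` from the `elasticsearch` Python client, with `docs` coming from a pymongo cursor. Passing `raise_on_error=False` matters here, because otherwise the helper raises on the first failed document instead of reporting it.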