https://github.com/viveknaskar/cloud-dataflow-with-memorystore
Cloud Dataflow pipeline that reads a file from Cloud Storage, processes it, and writes the output to Memorystore.
cloud-dataflow dataflow-pipeline google-cloud-platform memorystore redis
- Host: GitHub
- URL: https://github.com/viveknaskar/cloud-dataflow-with-memorystore
- Owner: viveknaskar
- License: MIT
- Created: 2020-08-10T11:49:00.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2023-08-29T12:30:02.000Z (over 1 year ago)
- Last Synced: 2025-04-14T01:51:56.741Z (about 1 month ago)
- Topics: cloud-dataflow, dataflow-pipeline, google-cloud-platform, memorystore, redis
- Language: Java
- Homepage: https://thedeveloperstory.com/2020/08/30/exporting-data-from-storage-to-memorystore-using-cloud-dataflow/
- Size: 26.4 KB
- Stars: 4
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Cloud Dataflow with Memorystore
Code to create a Dataflow pipeline that reads file data from Cloud Storage, processes and transforms it, and writes the transformed data to Memorystore, Google's own in-memory data store and managed Redis implementation. The pipeline code is written in Java on Apache Beam's SDK.
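The repository's pipeline class (`com.viveknaskar.DataFlowPipelineForMemStore`, per the template command below) is not reproduced here, but a minimal sketch of such a pipeline, using Beam's TextIO and RedisIO connectors (the latter from the `beam-sdks-java-io-redis` module), could look like the following. The record layout and the `field:value` key pattern are assumptions chosen to match the verification commands later in this README:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.redis.RedisIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class DataFlowPipelineSketch {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline pipeline = Pipeline.create(options);

        pipeline
            .apply("ReadFromGCS", TextIO.read().from("gs://cloud-dataflow-input-bucket/*.txt"))
            // Assumed record layout: firstname,lastname,dob,postalcode,guid
            .apply("BuildFieldKeys", ParDo.of(new DoFn<String, KV<String, String>>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    String[] f = c.element().split(",");
                    String guid = f[4];
                    // One "field:value" Redis set per field; the GUID is the member.
                    c.output(KV.of("firstname:" + f[0], guid));
                    c.output(KV.of("lastname:" + f[1], guid));
                    c.output(KV.of("dob:" + f[2], guid));
                    c.output(KV.of("postalcode:" + f[3], guid));
                }
            }))
            .apply("WriteToMemorystore", RedisIO.write()
                .withEndpoint("127.0.0.1", 6379)   // --redisHost in the command below
                .withMethod(RedisIO.Write.Method.SADD));

        pipeline.run();
    }
}
```

Writing with SADD builds one Redis set of GUIDs per field value, which is what makes the SINTER lookup in the verification steps below possible.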
## About Cloud Dataflow
Dataflow is a fully managed service for executing pipelines within the Google Cloud Platform ecosystem, dedicated to transforming and enriching data in both stream (real-time) and batch (historical) modes. It is a serverless approach: users can focus on programming instead of managing server clusters, and it integrates with Stackdriver, which lets you monitor and troubleshoot pipelines while they are running. It also acts as a convenient integration point where TensorFlow machine learning models can be added to process data pipelines.

## About Memorystore
Memorystore for Redis provides a fully-managed service that is powered by the Redis in-memory data store to build application caches that provide sub-millisecond data access.
With Memorystore for Redis, you can easily achieve your latency and throughput targets by scaling up your Redis instances with minimal impact on your application's availability.
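As a hypothetical aside (not code from this repository), any Redis client can use a Memorystore instance as such a cache once it is reachable on the network; a minimal sketch with the Jedis client and a placeholder instance IP:

```java
import redis.clients.jedis.Jedis;

public class MemorystoreCacheExample {
    public static void main(String[] args) {
        // Placeholder IP and default Redis port; use your instance's actual address.
        try (Jedis jedis = new Jedis("10.0.0.3", 6379)) {
            jedis.setex("session:42", 3600, "cached-profile-json"); // cache entry with a 1-hour TTL
            String cached = jedis.get("session:42");                // sub-millisecond read
            System.out.println(cached);
        }
    }
}
```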
## Command to execute for creating the template:

```
mvn compile exec:java \
-Dexec.mainClass=com.viveknaskar.DataFlowPipelineForMemStore \
-Dexec.args="--project=your-project-id \
--jobName=dataflow-memstore-job \
--inputFile=gs://cloud-dataflow-input-bucket/*.txt \
--redisHost=127.0.0.1 \
--stagingLocation=gs://dataflow-pipeline-batch-bucket/staging/ \
--dataflowJobFile=gs://dataflow-pipeline-batch-bucket/templates/dataflow-custom-redis-template \
--gcpTempLocation=gs://dataflow-pipeline-batch-bucket/tmp/ \
--runner=DataflowRunner"
```

## Check the data inserted in Memorystore (Redis) datastore
To check whether the processed data is stored in the Redis instance after the Dataflow pipeline has executed successfully, you must first connect to the Redis instance from a Compute Engine VM instance located in the same project, region, and network as the Redis instance.

1) Create a VM instance and SSH into it
2) Install telnet via apt-get on the VM instance
```
sudo apt-get install telnet
```
3) From the VM instance, connect to the IP address of the Redis instance
```
telnet instance-ip-address 6379
```
4) Once you are in Redis, check the inserted keys
```
keys *
```
5) Check whether the data was inserted by using the set-intersection command to get the GUID (a Jedis sketch of the same lookups follows this list)
```
sinter firstname: lastname: dob: postalcode:
```
6) Check an individual entry using the command below to get the GUID
```
smembers firstname:
```
7) Command to clear the Redis data store
```
flushall
```
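The same checks can also be scripted from Java with the Jedis client instead of telnet. This is only a sketch under the assumptions made earlier (one Redis set of GUIDs per `field:value` key); the host and the field values in the keys are hypothetical placeholders:

```java
import java.util.Set;
import redis.clients.jedis.Jedis;

public class VerifyMemorystoreData {
    public static void main(String[] args) {
        // Placeholder host: your Memorystore instance's IP, reachable from the same VPC.
        try (Jedis jedis = new Jedis("instance-ip-address", 6379)) {
            // Equivalent of SINTER: GUIDs present in all four field sets (hypothetical values).
            Set<String> guids = jedis.sinter(
                    "firstname:john", "lastname:doe", "dob:1990-01-01", "postalcode:560001");
            System.out.println("Matching GUIDs: " + guids);

            // Equivalent of SMEMBERS: all GUIDs recorded for a single field value.
            Set<String> byFirstName = jedis.smembers("firstname:john");
            System.out.println("GUIDs for firstname john: " + byFirstName);
        }
    }
}
```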
### References

https://redis.io/topics/data-types-intro
https://beam.apache.org/documentation/programming-guide/
https://thedeveloperstory.com/2020/07/24/cloud-dataflow-a-unified-model-for-batch-and-streaming-data-processing/