https://github.com/viveknaskar/cloud-dataflow-with-memorystore
Cloud Dataflow pipeline that reads a file from Cloud Storage, processes it, and writes the output to Memorystore.
cloud-dataflow dataflow-pipeline google-cloud-platform memorystore redis
- Host: GitHub
- URL: https://github.com/viveknaskar/cloud-dataflow-with-memorystore
- Owner: viveknaskar
- License: MIT
- Created: 2020-08-10T11:49:00.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2023-08-29T12:30:02.000Z (over 1 year ago)
- Last Synced: 2025-04-14T01:51:56.741Z (about 1 month ago)
- Topics: cloud-dataflow, dataflow-pipeline, google-cloud-platform, memorystore, redis
- Language: Java
- Homepage: https://thedeveloperstory.com/2020/08/30/exporting-data-from-storage-to-memorystore-using-cloud-dataflow/
- Size: 26.4 KB
- Stars: 4
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Cloud Dataflow with Memorystore
Code to create a Dataflow pipeline that reads file data from Cloud Storage, processes and transforms it, and writes the transformed data to Memorystore, Google's own in-memory data store and managed Redis implementation. The pipeline code is written in Java on Apache Beam's SDK.
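The repository's pipeline class (`com.viveknaskar.DataFlowPipelineForMemStore`, per the template command below) is not reproduced here, but a minimal sketch of such a pipeline, using Beam's TextIO and RedisIO connectors (the latter from the `beam-sdks-java-io-redis` module), could look like the following. The record layout and the `field:value` key pattern are assumptions chosen to match the verification commands later in this README:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.redis.RedisIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class DataFlowPipelineSketch {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline pipeline = Pipeline.create(options);

        pipeline
            .apply("ReadFromGCS", TextIO.read().from("gs://cloud-dataflow-input-bucket/*.txt"))
            // Assumed record layout: firstname,lastname,dob,postalcode,guid
            .apply("BuildFieldKeys", ParDo.of(new DoFn<String, KV<String, String>>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    String[] f = c.element().split(",");
                    String guid = f[4];
                    // One "field:value" Redis set per field; the GUID is the member.
                    c.output(KV.of("firstname:" + f[0], guid));
                    c.output(KV.of("lastname:" + f[1], guid));
                    c.output(KV.of("dob:" + f[2], guid));
                    c.output(KV.of("postalcode:" + f[3], guid));
                }
            }))
            .apply("WriteToMemorystore", RedisIO.write()
                .withEndpoint("127.0.0.1", 6379)   // --redisHost in the command below
                .withMethod(RedisIO.Write.Method.SADD));

        pipeline.run();
    }
}
```

Writing with SADD builds one Redis set of GUIDs per field value, which is what makes the SINTER lookup in the verification steps below possible.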
## About Cloud Dataflow
Dataflow is a fully managed service for executing pipelines within the Google Cloud Platform ecosystem, dedicated to transforming and enriching data in both stream (real-time) and batch (historical) modes. It is a serverless approach: users can focus on programming instead of managing server clusters, and it integrates with Stackdriver, which lets you monitor and troubleshoot pipelines while they are running. It also acts as a convenient integration point where TensorFlow machine learning models can be added to process data pipelines.

## About Memorystore
Memorystore for Redis provides a fully-managed service that is powered by the Redis in-memory data store to build application caches that provide sub-millisecond data access.
With Memorystore for Redis, you can easily achieve your latency and throughput targets by scaling up your Redis instances with minimal impact on your application's availability.
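As a hypothetical aside (not code from this repository), any Redis client can use a Memorystore instance as such a cache once it is reachable on the network; a minimal sketch with the Jedis client and a placeholder instance IP:

```java
import redis.clients.jedis.Jedis;

public class MemorystoreCacheExample {
    public static void main(String[] args) {
        // Placeholder IP and default Redis port; use your instance's actual address.
        try (Jedis jedis = new Jedis("10.0.0.3", 6379)) {
            jedis.setex("session:42", 3600, "cached-profile-json"); // cache entry with a 1-hour TTL
            String cached = jedis.get("session:42");                // sub-millisecond read
            System.out.println(cached);
        }
    }
}
```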
## Command to execute for creating the template:

```
mvn compile exec:java \
-Dexec.mainClass=com.viveknaskar.DataFlowPipelineForMemStore \
-Dexec.args="--project=your-project-id \
--jobName=dataflow-memstore-job \
--inputFile=gs://cloud-dataflow-input-bucket/*.txt \
--redisHost=127.0.0.1 \
--stagingLocation=gs://dataflow-pipeline-batch-bucket/staging/ \
--dataflowJobFile=gs://dataflow-pipeline-batch-bucket/templates/dataflow-custom-redis-template \
--gcpTempLocation=gs://dataflow-pipeline-batch-bucket/tmp/ \
--runner=DataflowRunner"
```

## Check the data inserted in Memorystore (Redis) datastore
To check whether the processed data is stored in the Redis instance after the Dataflow pipeline has executed successfully, you must first connect to the Redis instance from a Compute Engine VM instance located in the same project, region, and network as the Redis instance.

1) Create a VM instance and SSH into it
2) Install telnet via apt-get on the VM instance
```
sudo apt-get install telnet
```
3) From the VM instance, connect to the IP address of the Redis instance
```
telnet instance-ip-address 6379
```
4) Once you are in Redis, check the inserted keys
```
keys *
```
5) Check whether the data was inserted by using the set-intersection command to get the GUID (a Jedis sketch of the same lookups follows this list)
```
sinter firstname: lastname: dob: postalcode:
```
6) Check an individual entry using the command below to get the GUID
```
smembers firstname:
```
7) Command to clear the Redis data store
```
flushall
```
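The same checks can also be scripted from Java with the Jedis client instead of telnet. This is only a sketch under the assumptions made earlier (one Redis set of GUIDs per `field:value` key); the host and the field values in the keys are hypothetical placeholders:

```java
import java.util.Set;
import redis.clients.jedis.Jedis;

public class VerifyMemorystoreData {
    public static void main(String[] args) {
        // Placeholder host: your Memorystore instance's IP, reachable from the same VPC.
        try (Jedis jedis = new Jedis("instance-ip-address", 6379)) {
            // Equivalent of SINTER: GUIDs present in all four field sets (hypothetical values).
            Set<String> guids = jedis.sinter(
                    "firstname:john", "lastname:doe", "dob:1990-01-01", "postalcode:560001");
            System.out.println("Matching GUIDs: " + guids);

            // Equivalent of SMEMBERS: all GUIDs recorded for a single field value.
            Set<String> byFirstName = jedis.smembers("firstname:john");
            System.out.println("GUIDs for firstname john: " + byFirstName);
        }
    }
}
```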
### References

https://redis.io/topics/data-types-intro
https://beam.apache.org/documentation/programming-guide/
https://thedeveloperstory.com/2020/07/24/cloud-dataflow-a-unified-model-for-batch-and-streaming-data-processing/