{"id":15222152,"url":"https://github.com/googlecloudplatform/df-ml-anomaly-detection","last_synced_at":"2026-04-03T23:33:16.943Z","repository":{"id":51690998,"uuid":"243983821","full_name":"GoogleCloudPlatform/df-ml-anomaly-detection","owner":"GoogleCloudPlatform","description":"Streaming Anomaly Detection Solution by using Pub/Sub, Dataflow, BQML \u0026 Cloud DLP","archived":false,"fork":false,"pushed_at":"2025-02-04T18:27:08.000Z","size":12895,"stargazers_count":181,"open_issues_count":6,"forks_count":49,"subscribers_count":60,"default_branch":"main","last_synced_at":"2025-03-30T15:42:39.911Z","etag":null,"topics":["anomaly-detection","bqml","cybersecurity","dataflow","dlp","kmeans-clustering","log","network","pubsub"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GoogleCloudPlatform.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-02-29T14:32:35.000Z","updated_at":"2025-03-22T19:45:13.000Z","dependencies_parsed_at":"2024-12-25T22:08:32.645Z","dependency_job_id":"ced1818d-c209-4389-96c1-6fef67f0c21a","html_url":"https://github.com/GoogleCloudPlatform/df-ml-anomaly-detection","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdf-ml-anomaly-detection","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdf-ml-anomaly-detection/tags","releases_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdf-ml-anomaly-detection/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudPlatform%2Fdf-ml-anomaly-detection/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GoogleCloudPlatform","download_url":"https://codeload.github.com/GoogleCloudPlatform/df-ml-anomaly-detection/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246837679,"owners_count":20841902,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anomaly-detection","bqml","cybersecurity","dataflow","dlp","kmeans-clustering","log","network","pubsub"],"created_at":"2024-09-28T15:10:50.647Z","updated_at":"2026-04-03T23:33:16.894Z","avatar_url":"https://github.com/GoogleCloudPlatform.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Realtime Anomaly Detection Using Google Cloud Stream Analytics and AI Services\nThis repo provides a reference implementation of a Cloud Dataflow streaming pipelines that integrates with BigQuery ML, Cloud AI Platform, and AutoML (coming soon!) to perform anomaly detection use case as part of real time AI pattern. It contains reference implementations for the following real time anomaly detection use cases:  \n\n1. Finding anomalous behaviour in netflow log to identify cyber security threat for a Telco use case.\n2. Finding anomalous transaction to identify fraudulent activities for a Financial Service use case.     
\n\n## Table of Contents  \n* [Anomaly detection in Netflow log](#anomaly-detection-in-netflow-log).  \n\t* [Reference Architecture](#anomaly-detection-reference-architecture-using-bqml).      \n\t* [Quick Start](#quick-start).  \n\t* [Learn More](#learn-more-about-this-solution). \n\t* [Mock Data Generator to Pub/Sub Using Dataflow](#test). \n\t* [Train \u0026 Normalize Data Using BQ ML](#create-a-k-means-model-using-bq-ml )\n\t* [Feature Extraction Using Dataflow](#feature-extraction-after-aggregation). \n\t* [Realtime outlier detection using Dataflow](#find-the-outliers). \n\t* [Sensitive data (IMSI) de-identification using Cloud DLP](#dlp-integration). \n\t* [Looker Integration](#looker-integration). \n\t\n* [Anomaly detection in Financial Transactions](#anomaly-detection-in-financial-transactions). \n\t* [Reference Architecture](#anomaly-detection-reference-architecture-using-cloud-ai). \n\t* [Build \u0026 Run](#build--run-1)\n\t* [Test](#test-1)\n\t\n\t  \t\t \n## Anomaly Detection in Netflow log\nThis section of the repo contains a reference implementation of an ML based Network Anomaly Detection solution by using Pub/Sub, Dataflow, BQML \u0026 Cloud DLP.  It uses the easy-to-use, built-in K-Means clustering model in BQML to train and normalize netflow log data.   A key part of the implementation uses Dataflow for feature extraction \u0026 real time outlier detection, which has been tested to process over 20TB of data (250k msg/sec). Finally, it also uses Cloud DLP to tokenize the IMSI (international mobile subscriber identity) number as the streaming Dataflow pipeline ingests millions of netflow logs from Pub/Sub.   \n\nSecuring the internal network from malware and security threats is critical for many customers. With the ever-changing malware landscape and the explosion of activity in IoT and M2M, existing signature-based solutions for malware detection are no longer sufficient. 
This PoC highlights an ML based network anomaly detection solution using PubSub, Dataflow, BQ ML and DLP to detect mobile malware on subscriber devices and suspicious behaviour in wireless networks.\n\nThis solution implements the reference architecture highlighted below. You will execute a \u003cb\u003edataflow streaming pipeline\u003c/b\u003e that processes netflow logs from GCS and/or PubSub and finds outliers in real time.  This solution also uses a built-in K-Means Clustering Model created by using \u003cb\u003eBQ-ML\u003c/b\u003e. For a step-by-step tutorial that walks you through implementing this solution, see [Building a secure anomaly detection solution using Dataflow, BigQuery ML, and Cloud Data Loss Prevention](https://cloud.google.com/solutions/building-anomaly-detection-dataflow-bigqueryml-dlp).   \n\nIn summary, you can use this solution to demo the following 3 use cases:\n\n1.  Streaming Analytics at Scale by using Dataflow/Beam (Feature Extraction \u0026 Online Prediction).  \n2. Making Machine Learning easy to do by creating a model by using BQ ML K-Means Clustering.  \n3. Protecting sensitive information, e.g. \"IMSI (international mobile subscriber identity)\", by using Cloud DLP crypto based tokenization.  
\n\n## Anomaly Detection Reference Architecture Using BQML\n\n\n![ref_arch](diagram/ref_arch.png)\n\n## Quick Start\n\n[![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.svg)](https://console.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https://github.com/GoogleCloudPlatform/df-ml-anomaly-detection.git)\n\n### Enable APIs\n\n```\ngcloud config set project \u003cproject_id\u003e\ngcloud services enable bigquery\ngcloud services enable storage_component\ngcloud services enable dataflow\ngcloud services enable cloudbuild.googleapis.com\n```\n### Access to Cloud Build Service Account \n\n```\nexport PROJECT_ID=$(gcloud config get-value project)\nexport PROJECT_NUMBER=$(gcloud projects list --filter=${PROJECT_ID} --format=\"value(PROJECT_NUMBER)\") \ngcloud projects add-iam-policy-binding ${PROJECT_ID} --member serviceAccount:$PROJECT_NUMBER@cloudbuild.gserviceaccount.com --role roles/editor\ngcloud projects add-iam-policy-binding ${PROJECT_ID} --member serviceAccount:$PROJECT_NUMBER@cloudbuild.gserviceaccount.com --role roles/storage.objectAdmin\n```\n\n#### Export Required Parameters \n```\nexport DATASET=\u003cvar\u003ebq-dataset-name\u003c/var\u003e\nexport SUBSCRIPTION_ID=\u003cvar\u003esubscription_id\u003c/var\u003e\nexport TOPIC_ID=\u003cvar\u003etopic_id\u003c/var\u003e\nexport DATA_STORAGE_BUCKET=${PROJECT_ID}-\u003cvar\u003edata-storage-bucket\u003c/var\u003e\n```\nYou can also export a DLP template and batch size to enable the DLP transformation in the pipeline.\n* Batch size is in bytes; the maximum allowed is less than 520KB per payload.\n```\nexport DEID_TEMPLATE=projects/{id}/deidentifyTemplates/{template_id}\nexport BATCH_SIZE=350000\n```\n#### Trigger Cloud Build Script\n\n```\ngcloud builds submit scripts/. 
--config scripts/cloud-build-demo.yaml  --substitutions \\\n_DATASET=$DATASET,\\\n_DATA_STORAGE_BUCKET=$DATA_STORAGE_BUCKET,\\\n_SUBSCRIPTION_ID=${SUBSCRIPTION_ID},\\\n_TOPIC_ID=${TOPIC_ID},\\\n_API_KEY=$(gcloud auth print-access-token)\n```\n#### (Optional) Trigger the pipelines using flex template\nIf all other resources (BigQuery tables, PubSub topic and subscription, GCS bucket) already exist, you can use the commands below to trigger the pipelines by using a public image. This may be helpful for running the pipeline in a live demo. \n\nGenerate 10k msg/sec of random net flow log data:\n```\ngcloud beta dataflow flex-template run data-generator --project=\u003cproject_id\u003e --region=\u003cregion\u003e --template-file-gcs-location=gs://df-ml-anomaly-detection-mock-data/dataflow-flex-template/dynamic_template_data_generator_template.json --parameters=autoscalingAlgorithm=\"NONE\",numWorkers=5,maxNumWorkers=5,workerMachineType=n1-standard-4,qps=10000,schemaLocation=gs://df-ml-anomaly-detection-mock-data/schema/next-demo-schema.json,eventType=net-flow-log,topic=projects/\u003cproject_id\u003e/topics/events\n```\nGenerate 1k msg/sec of random outlier data:\n```\ngcloud beta dataflow flex-template run data-generator --project=\u003cproject_id\u003e --region=\u003cregion\u003e --template-file-gcs-location=gs://df-ml-anomaly-detection-mock-data/dataflow-flex-template/dynamic_template_data_generator_template.json --parameters=autoscalingAlgorithm=\"NONE\",numWorkers=5,maxNumWorkers=5,workerMachineType=n1-standard-4,qps=1000,schemaLocation=gs://df-ml-anomaly-detection-mock-data/schema/next-demo-schema-outlier.json,eventType=net-flow-log,topic=projects/\u003cproject_id\u003e/topics/\u003ctopic_id\u003e\n```\n\nTrigger Anomaly Detection Pipeline:\n\n```\ngcloud beta dataflow flex-template run \"anomaly-detection\" --project=\u003cproject_id\u003e --region=us-central1 
--template-file-gcs-location=gs://df-ml-anomaly-detection-mock-data/dataflow-flex-template/dynamic_template_secure_log_aggr_template.json --parameters=autoscalingAlgorithm=\"NONE\",numWorkers=20,maxNumWorkers=20,workerMachineType=n1-highmem-4,subscriberId=projects/\u003cproject_id\u003e/subscriptions/events-sub,tableSpec=\u003cproject_id\u003e:\u003cdataset_name\u003e.cluster_model_data,batchFrequency=10,customGcsTempLocation=gs://df-tmp-data/temp,tempLocation=gs://df-temp-data/temp,clusterQuery=gs://\u003cpath\u003e/normalized_cluster_data.sql,outlierTableSpec=\u003cproject_id\u003e:\u003cdataset_name\u003e.outlier_data,inputFilePattern=gs://df-ml-anomaly-detection-mock-data/flow_log*.json,enableStreamingEngine=true,windowInterval=5,writeMethod=FILE_LOADS,streaming=true\n\n```\n### Generate some mock data (1k events/sec) in PubSub topic\n```\ngradle run -DmainClass=com.google.solutions.df.log.aggregations.StreamingBenchmark \\\n -Pargs=\"--streaming  --runner=DataflowRunner --project=${PROJECT_ID} --autoscalingAlgorithm=NONE --workerMachineType=n1-standard-4 --numWorkers=3 --maxNumWorkers=3 --qps=1000 --schemaLocation=gs://df-ml-anomaly-detection-mock-data/schema/netflow_log_json_schema.json  --eventType=netflow --topic=${TOPIC_ID} --region=us-central1\"\n```\n### Publish an outlier with unusual tx \u0026 rx bytes\n```\ngcloud pubsub topics publish events --message  \"{\\\"subscriberId\\\": \\\"00000000000000000\\\", \\\n \\\"srcIP\\\": \\\"12.0.9.4\\\", \\\n \\\"dstIP\\\": \\\"12.0.1.3\\\", \\\n \\\"srcPort\\\": 5000, \\\n \\\"dstPort\\\": 3000, \\\n \\\"txBytes\\\": 150000, \\\n \\\"rxBytes\\\": 40000, \\\n \\\"startTime\\\": 1570276550, \\\n \\\"endTime\\\": 1570276550, \\\n \\\"tcpFlag\\\": 0, \\\n \\\"protocolName\\\": \\\"tcp\\\", \\\n \\\"protocolNumber\\\": 0}\"\n```\n\n### Clean up\nPlease stop/cancel the dataflow pipeline manually from the UI. 
\n\n## Learn More About This Solution\n\n### Example input log data and output after aggregation\n\nSample Input Data\n\n```\n{\n \"subscriberId\": \"100\",\n \"srcIP\": \"12.0.9.4\",\n \"dstIP\": \"12.0.1.2\",\n \"srcPort\": 5000,\n \"dstPort\": 3000,\n \"txBytes\": 15,\n \"rxBytes\": 40,\n \"startTime\": 1570276550,\n \"endTime\": 1570276559,\n \"tcpFlag\": 0,\n \"protocolName\": \"tcp\",\n \"protocolNumber\": 0\n}, \n{\n \"subscriberId\": \"100\",\n \"srcIP\": \"12.0.9.4\",\n \"dstIP\": \"12.0.1.2\",\n \"srcPort\": 5000,\n \"dstPort\": 3000,\n \"txBytes\": 10,\n \"rxBytes\": 40,\n \"startTime\": 1570276650,\n \"endTime\": 1570276750,\n \"tcpFlag\": 0,\n \"protocolName\": \"tcp\",\n \"protocolNumber\": 0\n}\n```\n### Feature Extraction After Aggregation\n\n1. Added processing timestamp.\n2. Group by destination subnet and subscriberId\n3. Number of approximate unique IPs \n4. Number of approximate unique ports\n5. Number of unique records\n6. Max, min, avg txBytes\n7. Max, min, avg rxBytes\n8. 
Max, min, avg duration    \n\n```\n{\n  \"transaction_time\": \"2019-10-27 23:22:17.848000\",\n  \"subscriber_id\": \"100\",\n  \"dst_subnet\": \"12.0.1.2/22\",\n  \"number_of_unique_ips\": \"1\",\n  \"number_of_unique_ports\": \"1\",\n  \"number_of_records\": \"2\",\n  \"max_tx_bytes\": \"15\",\n  \"min_tx_bytes\": \"10\",\n  \"avg_tx_bytes\": \"12.5\",\n  \"max_rx_bytes\": \"40\",\n  \"min_rx_bytes\": \"40\",\n  \"avg_rx_bytes\": \"40.0\",\n  \"max_duration\": \"100\",\n  \"min_duration\": \"9\",\n  \"avg_duration\": \"54.5\"\n}\n```\n\n### Feature Extraction Using Beam Schema Inference \n```\n.apply(\"Group By SubId \u0026 DestSubNet\",\n   Group.\u003cRow\u003ebyFieldNames(\"subscriberId\", \"dstSubnet\")\n      .aggregateField(\n         \"srcIP\",\n            new ApproximateUnique.ApproximateUniqueCombineFn\u003cString\u003e(\n                 SAMPLE_SIZE, StringUtf8Coder.of()),\n                    \"number_of_unique_ips\")\n      .aggregateField(\n          \"srcPort\",\n              new ApproximateUnique.ApproximateUniqueCombineFn\u003cInteger\u003e(\n                  SAMPLE_SIZE, VarIntCoder.of()),\n                    \"number_of_unique_ports\")\n      .aggregateField(\"srcIP\", Count.combineFn(), \"number_of_records\")\n      .aggregateField(\"txBytes\", new AvgCombineFn(), \"avg_tx_bytes\")\n      .aggregateField(\"txBytes\", Max.ofIntegers(), \"max_tx_bytes\")\n      .aggregateField(\"txBytes\", Min.ofIntegers(), \"min_tx_bytes\")\n      .aggregateField(\"rxBytes\", new AvgCombineFn(), \"avg_rx_bytes\")\n      .aggregateField(\"rxBytes\", Max.ofIntegers(), \"max_rx_bytes\")\n      .aggregateField(\"rxBytes\", Min.ofIntegers(), \"min_rx_bytes\")\n      .aggregateField(\"duration\", new AvgCombineFn(), \"avg_duration\")\n      .aggregateField(\"duration\", Max.ofIntegers(), \"max_duration\")\n      .aggregateField(\"duration\", Min.ofIntegers(), \"min_duration\"));\n```\n\n### Create a K-Means model using BQ ML \n\nPlease use the json schema 
(aggr_log_table_schema.json) to create the table in BQ.\nThe cluster_model_data table is partitioned by 'ingestion timestamp' and clustered by dst_subnet and subscriber_id.\n\n```\n--\u003e train data select\nCREATE or REPLACE TABLE network_logs.train_data as (select * from {dataset_name}.cluster_model_data \nwhere _PARTITIONDATE between 'date_from' AND 'date_to');\n--\u003e create model\nCREATE OR REPLACE MODEL {dataset_name}.log_cluster  options(model_type='kmeans', num_clusters=4, standardize_features = true) \nAS select * except (transaction_time, subscriber_id, number_of_unique_ips, number_of_unique_ports, dst_subnet) \nfrom network_logs.train_data;\n```\n\n### Normalize Data using BQ Stored Procedure\n\n1. Predict on the train dataset to get the nearest distance from centroid for each record.\n2. Calculate the STD DEV for each point to normalize\n3. Store them in a table that the Dataflow pipeline can use as a side input  \n\n```\nCREATE or REPLACE table {dataset_name}.normalized_centroid_data AS(\nwith centroid_details AS (\nselect centroid_id,array_agg(struct(feature as name, round(numerical_value,1) as value) order by centroid_id) AS cluster\nfrom ML.CENTROIDS(model network_logs.log_cluster_2)\ngroup by centroid_id),\ncluster as (select centroid_details.centroid_id as centroid_id,\n(select value from unnest(cluster) where name = 'number_of_records') AS number_of_records,\n(select value from unnest(cluster) where name = 'max_tx_bytes') AS max_tx_bytes,\n(select value from unnest(cluster) where name = 'min_tx_bytes') AS min_tx_bytes,\n(select value from unnest(cluster) where name = 'avg_tx_bytes') AS avg_tx_bytes,\n(select value from unnest(cluster) where name = 'max_rx_bytes') AS max_rx_bytes,\n(select value from unnest(cluster) where name = 'min_rx_bytes') AS min_rx_bytes,\n(select value from unnest(cluster) where name = 'avg_rx_bytes') AS avg_rx_bytes,\n(select value from unnest(cluster) where name = 'max_duration') AS max_duration,\n(select value from unnest(cluster) where name = 
'min_duration') AS min_duration,\n(select value from unnest(cluster) where name = 'avg_duration') AS avg_duration\nfrom centroid_details order by centroid_id asc),\npredict as (select * from ML.PREDICT(model {dataset_name}.log_cluster_2, (select * from network_logs.train_data)))\nselect c.centroid_id as centroid_id, \n(stddev((p.number_of_records-c.number_of_records)\n+(p.max_tx_bytes-c.max_tx_bytes)\n+(p.min_tx_bytes-c.min_tx_bytes)\n+(p.avg_tx_bytes-c.avg_tx_bytes)\n+(p.max_rx_bytes-c.max_rx_bytes)\n+(p.min_rx_bytes-c.min_rx_bytes)\n+(p.avg_rx_bytes-c.avg_rx_bytes)\n+(p.max_duration-c.max_duration)\n+(p.min_duration-c.min_duration)\n+(p.avg_duration-c.avg_duration)))as normalized_dest, \nany_value(c.number_of_records) as number_of_records,\nany_value(c.max_tx_bytes) as max_tx_bytes,\nany_value(c.min_tx_bytes) as min_tx_bytes ,\nany_value(c.avg_tx_bytes) as avg_tx_bytes,\nany_value(c.max_rx_bytes) as max_rx_bytes, \nany_value(c.min_rx_bytes) as min_rx_bytes, \nany_value(c.avg_rx_bytes) as avg_rx_bytes,  \nany_value(c.avg_duration) as avg_duration,\nany_value(c.max_duration) as max_duration, \nany_value(c.min_duration) as min_duration\nfrom predict as p \ninner join cluster as c on c.centroid_id = p.centroid_id\ngroup by c.centroid_id);\n```\n\n\n### Find the Outliers\n\n1. Find the nearest distance from the centroid.  \n2. Calculate STD DEV between input and centroid vectors \n3. Find the Z Score (the difference between a value in the sample and the mean, divided by the standard deviation)\n4. A score of 2 (2 STD DEV above the mean) is an OUTLIER. 
\n\n![outlier](diagram/outlier.png)\n\n## Before You Start (Optional Step if Cloud Build Script is NOT used)\n\n````\ngcloud services enable dataflow\ngcloud services enable bigquery\ngcloud services enable storage_component\n````\n\n## Creating a BigQuery Dataset and Tables\n\nDataset\n\n```\nbq --location=US mk -d \\\n--description \"Network Logs Dataset\" \\\n\u003cdataset_name\u003e\n```\n\nAggregation Data Table \n\n```\nbq mk -t --schema aggr_log_table_schema.json  \\\n--time_partitioning_type=DAY \\\n--clustering_fields=dst_subnet,subscriber_id \\\n--description \"Network Log Partition Table\" \\\n--label myorg:prod \\\n\u003cproject\u003e:\u003cdataset_name\u003e.cluster_model_data \n```\n\nOutlier Table \n\n```\nbq mk -t --schema outlier_table_schema.json \\\n--label myorg:prod \\\n\u003cproject\u003e:\u003cdataset_name\u003e.outlier_data \n```\n\n## Build \u0026 Run\nTo Build \n\n```\ngradle spotlessApply -DmainClass=com.google.solutions.df.log.aggregations.SecureLogAggregationPipeline \ngradle build -DmainClass=com.google.solutions.df.log.aggregations.SecureLogAggregationPipeline \n\ngcloud auth configure-docker\ngradle jib --image=gcr.io/${PROJECT_ID}/df-ml-anomaly-detection:latest -DmainClass=com.google.solutions.df.log.aggregations.SecureLogAggregationPipeline\n\n```\n\nTo Run  \n\n```\ngcloud beta dataflow flex-template run \"anomaly-detection\" \\\n--project=${PROJECT_ID} \\\n--region=us-central1 \\\n--template-file-gcs-location=gs://${DF_TEMPLATE_CONFIG_BUCKET}/dynamic_template_secure_log_aggr_template.json 
\\\n--parameters=autoscalingAlgorithm=\"NONE\",\\\nnumWorkers=5,\\\nmaxNumWorkers=5,\\\nworkerMachineType=n1-highmem-4,\\\nsubscriberId=projects/${PROJECT_ID}/subscriptions/${SUBSCRIPTION_ID},\\\ntableSpec=${PROJECT_ID}:${DATASET_NAME}.cluster_model_data,\\\nbatchFrequency=2,\\\ncustomGcsTempLocation=gs://${DF_TEMPLATE_CONFIG_BUCKET}/temp,\\\ntempLocation=gs://${DF_TEMPLATE_CONFIG_BUCKET}/temp,\\\nclusterQuery=gs://${DF_TEMPLATE_CONFIG_BUCKET}/normalized_cluster_data.sql,\\\noutlierTableSpec=${PROJECT_ID}:${DATASET_NAME}.outlier_data,\\\ninputFilePattern=gs://df-ml-anomaly-detection-mock-data/flow_log*.json,\\\nworkerDiskType=compute.googleapis.com/projects/${PROJECT_ID}/zones/us-central1-b/diskTypes/pd-ssd,\\\ndiskSizeGb=5,\\\nwindowInterval=10,\\\nwriteMethod=FILE_LOADS,\\\nstreaming=true\n\n```\n\n## Test \n\nPublish mock log data at 250k msg/sec\n\nSchema used for load test: \n\n```\n{\n \"subscriberId\": \"{{long(1111111111,9999999999)}}\",\n \"srcIP\": \"{{ipv4()}}\",\n \"dstIP\": \"{{subnet()}}\",\n \"srcPort\": {{integer(1000,5000)}},\n \"dstPort\": {{integer(1000,5000)}},\n \"txBytes\": {{integer(10,1000)}},\n \"rxBytes\": {{integer(10,1000)}},\n \"startTime\": {{starttime()}},\n \"endTime\": {{endtime()}},\n \"tcpFlag\": {{integer(0,65)}},\n \"protocolName\": \"{{random(\"tcp\",\"udp\",\"http\")}}\",\n \"protocolNumber\": {{integer(0,1)}}\n}\n```\n\nTo Run: \n\n```\nexport PROJECT_ID=$(gcloud config get-value project)\nexport TOPIC_ID=\u003cvar\u003etopic-id\u003c/var\u003e\ngcloud pubsub topics create $TOPIC_ID\ngcloud builds submit . 
--machine-type=n1-highcpu-8 --config scripts/cloud-build-data-generator.yaml  --substitutions _TOPIC_ID=${TOPIC_ID}\n``` \n\nOutlier Test \n```\ngcloud pubsub topics publish \u003ctopic_id\u003e \\\n --message \"{\\\"subscriberId\\\": \\\"demo1\\\",\\\"srcIP\\\": \\\"12.0.9.4\\\",\\\"dstIP\\\": \\\"12.0.1.3\\\",\\\"srcPort\\\": 5000,\\\"dstPort\\\": 3000,\\\"txBytes\\\": 150000,\\\"rxBytes\\\": 40000,\\\"startTime\\\": 1570276550,\\\"endTime\\\": 1570276550,\\\"tcpFlag\\\": 0,\\\"protocolName\\\": \\\"tcp\\\",\\\"protocolNumber\\\": 0}\"\n \ngcloud pubsub topics publish \u003ctopic_id\u003e \\\n --message \"{\\\"subscriberId\\\": \\\"demo1\\\",\\\"srcIP\\\": \\\"12.0.9.4\\\",\\\"dstIP\\\": \\\"12.0.1.3\\\",\\\"srcPort\\\": 5000,\\\"dstPort\\\": 3000,\\\"txBytes\\\": 15000000,\\\"rxBytes\\\": 4000000,\\\"startTime\\\": 1570276550,\\\"endTime\\\": 1570276550,\\\"tcpFlag\\\": 0,\\\"protocolName\\\": \\\"tcp\\\",\\\"protocolNumber\\\": 0}\"\n```\n\nFeature Extraction Test\n\n```\ngcloud pubsub topics publish \u003ctopic_id\u003e \\\n--message \"{\\\"subscriberId\\\": \\\"100\\\",\\\"srcIP\\\": \\\"12.0.9.4\\\",\\\"dstIP\\\": \\\"12.0.1.2\\\",\\\"srcPort\\\": 5000,\\\"dstPort\\\": 3000,\\\"txBytes\\\": 10,\\\"rxBytes\\\": 40,\\\"startTime\\\": 1570276550,\\\"endTime\\\": 1570276559,\\\"tcpFlag\\\": 0,\\\"protocolName\\\": \\\"tcp\\\",\\\"protocolNumber\\\": 0}\"\ngcloud pubsub topics publish \u003ctopic_id\u003e \\\n--message \"{\\\"subscriberId\\\": \\\"100\\\",\\\"srcIP\\\": \\\"13.0.9.4\\\",\\\"dstIP\\\": \\\"12.0.1.2\\\",\\\"srcPort\\\": 5001,\\\"dstPort\\\": 3000,\\\"txBytes\\\": 15,\\\"rxBytes\\\": 40,\\\"startTime\\\": 1570276650,\\\"endTime\\\": 1570276750,\\\"tcpFlag\\\": 0,\\\"protocolName\\\": \\\"tcp\\\",\\\"protocolNumber\\\": 0}\"\nOUTPUT: INFO: row value Row:[2, 2, 2, 12.5, 15, 10, 50]\n```\n\n```\ngcloud pubsub topics publish \u003ctopic_id\u003e  \\\n--message \"{\\\"subscriberId\\\": \\\"100\\\",\\\"srcIP\\\": \\\"12.0.9.4\\\",\\\"dstIP\\\": 
\\\"12.0.1.2\\\",\\\"srcPort\\\": 5000,\\\"dstPort\\\": 3000,\\\"txBytes\\\": 10,\\\"rxBytes\\\": 40,\\\"startTime\\\": 1570276550,\\\"endTime\\\": 1570276550,\\\"tcpFlag\\\": 0,\\\"protocolName\\\": \\\"tcp\\\",\\\"protocolNumber\\\": 0}\"\n\ngcloud pubsub topics publish \u003ctopic_id\u003e \\\n--message \"{\\\"subscriberId\\\": \\\"100\\\",\\\"srcIP\\\": \\\"12.0.9.4\\\",\\\"dstIP\\\": \\\"12.0.1.2\\\",\\\"srcPort\\\": 5000,\\\"dstPort\\\": 3000,\\\"txBytes\\\": 15,\\\"rxBytes\\\": 40,\\\"startTime\\\": 1570276550,\\\"endTime\\\": 1570276550,\\\"tcpFlag\\\": 0,\\\"protocolName\\\": \\\"tcp\\\",\\\"protocolNumber\\\": 0}\"\nOUTPUT INFO: row value Row:[1, 1, 2, 12.5, 15, 10, 0]\n```\n## Pipeline Performance at 250k msg/sec\n\nPipeline DAG (ToDo: change with updated pipeline DAG)\n\n![dag](diagram/df_dag.png)\n\n\nMsg Rate\n\n![msg_rate_](diagram/msg-rate.png)\n\n![dag](diagram/load_dag.png)\n\nAck Message Rate\n\n![ack_msg](diagram/un-ack-msg.png)\n\nCPU Utilization\n\n![cpu](diagram/cpu.png)\n\nSystem Latency \n\n![latency](diagram/latency.png)\n\n### K-Means Clustering Using BQ-ML (Model Evaluation)\n\n![ref_arch](diagram/bq-ml-kmeans-1.png)\n\n![ref_arch](diagram/bq-ml-kmeans-2.png)\n\n![ref_arch](diagram/bq-ml-kmeans-3.png)\n\n![ref_arch](diagram/bq-ml-kmeans-4.png)\n\n \n### DLP Integration\n\nTo protect any sensitive data in the log,  you can use Cloud DLP to inspect, de-identify before data is stored in BigQuery. This is an optional integration in our reference architecture. To enable, please follow the steps below:\n\n*  Update the JSON file at scripts/deid_imei_number.json to add the de-identification transformation applicable for your use case. Screen shot below used a crypto based  deterministic transformation to de-identify IMSI number.  To understand how DLP transformation can be used, please refer to this [guide](https://cloud.google.com/solutions/de-identification-re-identification-pii-using-cloud-dlp).  
\n\n```\n{\n   \"deidentifyTemplate\":{\n      \"displayName\":\"Config to DeIdentify IMEI Number\",\n      \"description\":\"IMEI Number masking transformation\",\n      \"deidentifyConfig\":{\n         \"recordTransformations\":{\n            \"fieldTransformations\":[\n               {\n                  \"fields\":[\n                     {\n                        \"name\":\"subscriber_id\"\n                     }\n                  ],\n                  \"primitiveTransformation\":{\n                     \"cryptoDeterministicConfig\":{\n                        \"cryptoKey\":{\n                         \t\u003cadd _dlp_transformation_config\u003e\n                           }\n                        },\n                        \"surrogateInfoType\":{\n                           \"name\":\"IMSI_TOKEN\"\n                        }\n                     }\n                  }\n               }\n            ]\n         }\n      }\n   },\n   \"templateId\":\"dlp-deid-subid\"\n}\n```\n\n*  Run this script (deid_template.sh) to create a template in your project.  \n\n```\nset -x \nPROJECT_ID=$(gcloud config get-value project)\nDEID_CONFIG=\"@deid_imei_number.json\"\nDEID_TEMPLATE_OUTPUT=\"deid-template.json\"\nAPI_KEY=$(gcloud auth print-access-token)\nAPI_ROOT_URL=\"https://dlp.googleapis.com\"\nDEID_TEMPLATE_API=\"${API_ROOT_URL}/v2/projects/${PROJECT_ID}/deidentifyTemplates\"\ncurl -X POST -H \"Content-Type: application/json\" \\\n -H \"Authorization: Bearer ${API_KEY}\" \\\n \"${DEID_TEMPLATE_API}\" \\\n -d \"${DEID_CONFIG}\" \\\n -o \"${DEID_TEMPLATE_OUTPUT}\"  \n ```\n\n*  Pass the DLP template name to the Dataflow pipeline. 
\n\n```\nexport DEID_TEMPLATE_NAME=$(jq -r '.name' deid-template.json);echo $DEID_TEMPLATE_NAME\n--deidTemplateName=projects/\u003cproject_id\u003e/deidentifyTemplates/dlp-deid-subid\n```\n\nNote: Please check out this [repo](https://github.com/GoogleCloudPlatform/dlp-dataflow-deidentification) to learn more about an end-to-end data tokenization solution. \n\n![ref_arch](diagram/dlp_imsi.png)\n\nIf you click on DLP Transformation from the DAG, you will see the following sub transforms:\n\n![ref_arch](diagram/new_dlp_dag.png)\n\n\n\n### Looker Integration\nIn addition to the feature and outlier tables, you can also include raw netflow log data in a BigQuery table. This includes geo location data like country, city, latitude, longitude derived from the simulated IP.  The pipeline uses the [MaxMind-GeoIP2 Java API](https://maxmind.github.io/GeoIP2-java/). \n#### Create a table to store raw log netflow data \n\n```\nbq mk -t --schema src/main/resources/netflow_log_raw_datat.json \\\n--time_partitioning_type=DAY \\\n--clustering_fields=\"geoCountry,geoCity\" \\\n--description \"Raw Netflow Log Data\" \\\n${PROJECT_ID}:${DATASET_NAME}.netflow_log_data\n\n```\n#### Looker repo \nYou can leverage this [open source code](https://github.com/llooker/anomaly_detection_network_logs/tree/master) to import the Looker dashboard and underlying LookML for your use.\n\n####\nPass the parameter below to the pipeline. If it is not passed, the pipeline does not store raw log data. Feature and outlier data will be stored as expected.\n\n```\n--logTableSpec=\u003cproject_id\u003e:\u003cdataset\u003e.netflow_log_data\n```\n\n\n![log_data_dag](diagram/with_log_data_dag.png)\n\n![log_data_dag](diagram/raw_log_data_schema.png)\n\n## Anomaly Detection in Financial Transactions\nThis section of this repo contains a reference implementation of finding fraudulent transactions using Dataflow and Cloud AI. 
The first step is to create a TensorFlow boosted tree model by using a [Kaggle dataset](https://www.kaggle.com/ntnu-testimon/paysim1) and deploy the model in Cloud AI for online prediction. A Dataflow pipeline, highlighted in the reference architecture below, then micro-batches the requests for online prediction.  Finally, the transaction data is stored in a BigQuery table called transactions and outlier data is stored in a table called fraud_prediction. Data can be joined between the two tables by the transactionId column.  By default, the probability threshold is set to 0.99 (99%) to identify fraudulent transactions. \n\nFor a step-by-step tutorial that walks you through implementing this solution, see [Detecting anomalies in financial transactions by using AI Platform, Dataflow, and BigQuery](https://cloud.google.com/solutions/detecting-anomalies-in-financial-transactions).\n\n## Anomaly Detection Reference Architecture Using Cloud AI\n![ref_arch](diagram/df-cloud-ai-prediction.png)\n\n\n## Trigger the pipeline using Flex Template\nYou can use the command below to trigger the pipeline using a public image as a Dataflow flex template\n\n```\ngcloud beta dataflow flex-template run \"anomaly-detection-finserv\" --project=\u003cproject\u003e --region=\u003cregion\u003e --template-file-gcs-location=gs://df-ml-anomaly-detection-mock-data/dataflow-flex-template/dynamic_template_finserv_fraud_detection.json --parameters=autoscalingAlgorithm=\"NONE\",numWorkers=30,maxNumWorkers=30,workerMachineType=n1-highmem-8,subscriberId=projects/\u003cid\u003e/subscriptions/\u003cid\u003e,tableSpec=project:dataset.transactions,outlierTableSpec=project:dataset.fraud_prediction,tempLocation=gs://\u003cbucket\u003e/temp,inputFilePattern=gs://df-ml-anomaly-detection-mock-data/finserv_fraud_detection/fraud_data_kaggle.json,modelId=\u003cid\u003e,versionId=\u003cid\u003e,keyRange=1024,batchSize=500000\n```\n## Build \u0026 Run\n\n1. 
Please follow this [code lab](https://codelabs.developers.google.com/codelabs/fraud-detection-ai-explanations/#7) to build and deploy the [TensorFlow model](https://www.tensorflow.org/api_docs/python/tf/estimator/BoostedTreesClassifier) in Cloud AI.  Also, notice the change in the notebook to add a transactionId as part of the serving function. The updated notebook can be found [here](https://github.com/GoogleCloudPlatform/df-ml-anomaly-detection/blob/finserv-fraud-detection-tf-cloud-ai/fraud-detection-notebook/credit-card-fraud-detection-v1.ipynb).  \n\n2. Use the commands below to create a Dataset and two tables in BigQuery.  \n\n```\nbq --location=US mk -d \\\n--description \"FinServ Transaction Dataset\" \\\n\u003cdataset_name\u003e\n``` \n\n```\nbq mk -t --schema src/main/resources/transactions.json \\\n--label myorg:prod \\\n\u003cproject\u003e:\u003cdataset_name\u003e.transactions \n```\n\n```\nbq mk -t --schema src/main/resources/fraud_prediction.json \\\n--label myorg:prod \\\n\u003cproject\u003e:\u003cdataset_name\u003e.fraud_prediction\n```\n\n3. Trigger the Dataflow pipeline by passing all required parameters.  
The pipeline uses the state and timer APIs to micro-batch calls to the prediction API.\n\n```\ngradle run -DmainClass=com.google.solutions.df.log.aggregations.FraudDetectionFinServTranPipeline -Pargs=\"--runner=DataflowRunner --project=\u003cproject-id\u003e \\\n--inputFilePattern=gs://df-ml-anomaly-detection-mock-data/finserv_fraud_detection/fraud_data_kaggle.json \\\n--subscriberId=projects/\u003cproject-id\u003e/subscriptions/\u003csubs_id\u003e \\\n--modelId=\u003cmodel_id\u003e \\\n--versionId=\u003cversion_id\u003e \\\n--outlierTableSpec=\u003cproject_id\u003e:\u003cdataset\u003e.fraud_prediction \\\n--tableSpec=\u003cproject_id\u003e:\u003cdataset\u003e.transactions \\\n--keyRange=1024 \\\n--autoscalingAlgorithm=NONE \\\n--numWorkers=30 \\\n--maxNumWorkers=30 \\\n--workerMachineType=n1-highmem-8 \\\n--batchSize=500000 \\\n--region=us-central1 \\\n--gcpTempLocation=gs://df-temp-data/temp\"\n\n```\n## Test\n1. Read the full Kaggle dataset (6M+) from the GCS bucket and publish some random mock data to the PubSub topic\n![ingest_data](diagram/ingest-data-ft.png). \n\nFull DAG\n![ingest_data](diagram/full-dag-ft.png). \n\n2. Predict Transform with 500KB payload / request\n\n![predict_data](diagram/predict-transform-ft.png). \n\n![batch_data](diagram/batch-ft.png). \n\n3. BigQuery validation\n![predict_data](diagram/bq-validation-ft.png). \n\n![trans_schema](diagram/transactions-schema-ft.png). \n\n![predict_schema](diagram/predict-schema-ft.png). \n\n4. Prediction API in Cloud AI\n![predict_data](diagram/predict_1.png). \n\n![trans_schema](diagram/predict_2.png). \n\n![predict_schema](diagram/predict_3.png). \n\n5. Custom Counter in Dataflow UI\n\n![custom_counter](diagram/custom_counter.png). 
\n\n\n\n\n\n\n \n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooglecloudplatform%2Fdf-ml-anomaly-detection","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgooglecloudplatform%2Fdf-ml-anomaly-detection","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgooglecloudplatform%2Fdf-ml-anomaly-detection/lists"}