https://github.com/datasherlock/bqml-demo
BigQuery ML demo showcasing data prep, logistic regression training, evaluation, and explainable predictions for account fraud detection on Cloud Spanner data.
https://github.com/datasherlock/bqml-demo
bigdata bigquery bigquery-ml sql
Last synced: about 22 hours ago
JSON representation
BigQuery ML demo showcasing data prep, logistic regression training, evaluation, and explainable predictions for account fraud detection on Cloud Spanner data.
- Host: GitHub
- URL: https://github.com/datasherlock/bqml-demo
- Owner: datasherlock
- Created: 2025-04-10T04:59:03.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-11T06:08:48.000Z (about 1 year ago)
- Last Synced: 2025-04-13T15:16:52.829Z (about 1 year ago)
- Topics: bigdata, bigquery, bigquery-ml, sql
- Language: Python
- Homepage: https://www.jeromerajan.com
- Size: 352 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README

# BigQuery ML Account Fraud Detection Demo
This SQL script demonstrates a basic workflow for building, evaluating, and explaining a machine learning model for account fraud detection using BigQuery ML. The script uses data from an external Cloud Spanner table.
## Overview
The script performs the following steps:
1. **Data Cleansing and Preparation:** Creates a new table (`demo_dataset.account_activity_cleansed`) by querying raw account activity data from a Cloud Spanner table (`account_activity_raw`) via an external connection (`demo-spanner-conn`). During this process, it:
* Casts data types for timestamp, transaction amount, successful login, and unusual activity columns.
* Adds a `classifier` column with random values between 0 and 1 to facilitate splitting the data into training, evaluation, and prediction sets.
2. **Data Split Verification:** Includes a query to check the distribution of data across the training (80%), evaluation (10%), and prediction (10%) sets based on the `classifier` column.
3. **Model Training:** Creates or replaces a BigQuery ML Logistic Regression model (`demo_dataset.fraud_model`) using the cleansed data. Key options used:
* `model_type='LOGISTIC_REG'`: Specifies the model algorithm.
* `auto_class_weights=TRUE`: Helps handle class imbalance in the target variable (`unusual_activity`).
* `data_split_method='NO_SPLIT'`: Informs BQML that the data is already split manually.
* `input_label_cols=['unusual_activity']`: Defines the target variable.
* The training uses only the data where `classifier < 0.8`.
4. **Model Evaluation:** Evaluates the trained model's performance using `ML.EVALUATE` on the evaluation dataset (`classifier between 0.8 and 0.9`).
5. **Prediction and Explanation:** Uses `ML.EXPLAIN_PREDICT` to predict unusual activity on the prediction dataset (`classifier > 0.9`) and provides feature attributions (explanations) for the predictions. The results are filtered to show only instances where the prediction differs from the actual label (misclassifications).
6. **(Commented Out) Example Update:** Includes a commented-out example `UPDATE` statement showing how to modify data in the cleansed table.
## Prerequisites
* Access to a Google Cloud project with BigQuery and Cloud Spanner APIs enabled.
* A BigQuery dataset named `demo_dataset`.
* A BigQuery connection named `datasherlock.us-central1.demo-spanner-conn` configured to access a Cloud Spanner instance.
* A Cloud Spanner table named `account_activity_raw` within the connected Spanner database, containing the necessary columns (`transaction_id`, `account_id`, `timestamp`, `location`, `device_type`, `ip_address`, `transaction_amount`, `transaction_type`, `successful_login`, `unusual_activity`).
## How to Use
1. **Ensure Prerequisites:** Verify that all prerequisites listed above are met. Pay close attention to the dataset name (`demo_dataset`) and the connection name (`datasherlock.us-central1.demo-spanner-conn`) used in the script; adjust them if your environment uses different names.
2. **Execute the Script:** Run the SQL commands sequentially in the BigQuery console, using the `bq` command-line tool, or through a BigQuery client library.
## Expected Outputs
* A BigQuery table named `demo_dataset.account_activity_cleansed` containing the prepared data.
* Query results showing the distribution of data into training, evaluation, and prediction sets.
* A BigQuery ML model named `demo_dataset.fraud_model`.
* Query results from `ML.EVALUATE` showing model performance metrics (e.g., precision, recall, accuracy, f1-score, roc_auc).
* Query results from `ML.EXPLAIN_PREDICT` showing predictions, actual labels, and feature attributions for misclassified instances in the prediction set.
#### Disclaimer: This README was generated by Gemini Code Assist