https://github.com/datasherlock/bqml-demo

BigQuery ML demo showcasing data prep, logistic regression training, evaluation, and explainable predictions for account fraud detection on Cloud Spanner data.

bigdata bigquery bigquery-ml sql


Host: GitHub
URL: https://github.com/datasherlock/bqml-demo
Owner: datasherlock
Created: 2025-04-10T04:59:03.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-04-11T06:08:48.000Z (over 1 year ago)
Last Synced: 2025-04-13T15:16:52.829Z (over 1 year ago)
Topics: bigdata, bigquery, bigquery-ml, sql
Language: Python
Homepage: https://www.jeromerajan.com
Size: 352 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

![bqml-demo](https://github.com/user-attachments/assets/7d98da3b-29cc-48fc-90f2-a940cbec25df)

# BigQuery ML Account Fraud Detection Demo

This SQL script demonstrates a basic workflow for building, evaluating, and explaining a machine learning model for account fraud detection using BigQuery ML. The script uses data from an external Cloud Spanner table.

## Overview

The script performs the following steps:

1. **Data Cleansing and Preparation:** Creates a new table (`demo_dataset.account_activity_cleansed`) by querying raw account activity data from a Cloud Spanner table (`account_activity_raw`) via an external connection (`demo-spanner-conn`). During this process, it:
    * Casts data types for timestamp, transaction amount, successful login, and unusual activity columns.
    * Adds a `classifier` column with random values between 0 and 1 to facilitate splitting the data into training, evaluation, and prediction sets.
2. **Data Split Verification:** Includes a query to check the distribution of data across the training (80%), evaluation (10%), and prediction (10%) sets based on the `classifier` column.
3. **Model Training:** Creates or replaces a BigQuery ML Logistic Regression model (`demo_dataset.fraud_model`) using the cleansed data. Key options used:
    * `model_type='LOGISTIC_REG'`: Specifies the model algorithm.
    * `auto_class_weights=TRUE`: Helps handle class imbalance in the target variable (`unusual_activity`) by weighting each class inversely to its frequency.
    * `data_split_method='NO_SPLIT'`: Tells BQML to train on every input row instead of holding out its own evaluation set, because the data is already split manually.
    * `input_label_cols=['unusual_activity']`: Defines the target variable.
    * The training uses only the data where `classifier < 0.8`.
4. **Model Evaluation:** Evaluates the trained model's performance using `ML.EVALUATE` on the evaluation dataset (`classifier between 0.8 and 0.9`).
5. **Prediction and Explanation:** Uses `ML.EXPLAIN_PREDICT` to predict unusual activity on the prediction dataset (`classifier > 0.9`) and provides feature attributions (explanations) for the predictions. The results are filtered to show only instances where the prediction differs from the actual label (misclassifications).
6. **(Commented Out) Example Update:** Includes a commented-out example `UPDATE` statement showing how to modify data in the cleansed table.
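
Steps 1 to 3 above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: the `CAST` target types, the use of `RAND()` for the `classifier` column, and the choice to exclude `classifier` from the model features are all assumptions, since the README only names the columns and options involved.

```sql
-- Step 1 (sketch): copy raw rows out of Spanner and add a random split key.
-- ASSUMPTION: the target types below are guesses; the README only states
-- that these four columns are cast.
CREATE OR REPLACE TABLE demo_dataset.account_activity_cleansed AS
SELECT
  transaction_id,
  account_id,
  CAST(`timestamp` AS TIMESTAMP) AS `timestamp`,
  location,
  device_type,
  ip_address,
  CAST(transaction_amount AS FLOAT64) AS transaction_amount,
  transaction_type,
  CAST(successful_login AS BOOL) AS successful_login,
  CAST(unusual_activity AS BOOL) AS unusual_activity,
  RAND() AS classifier  -- ASSUMPTION: RAND() yields a value in [0, 1)
FROM EXTERNAL_QUERY(
  'datasherlock.us-central1.demo-spanner-conn',
  'SELECT * FROM account_activity_raw'
);

-- Step 2 (sketch): check that the split is close to 80/10/10.
SELECT
  CASE
    WHEN classifier < 0.8 THEN 'training'
    WHEN classifier <= 0.9 THEN 'evaluation'
    ELSE 'prediction'
  END AS split,
  COUNT(*) AS row_count,
  ROUND(100 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct
FROM demo_dataset.account_activity_cleansed
GROUP BY split
ORDER BY split;

-- Step 3 (sketch): train on the training slice only.
-- ASSUMPTION: the random split key is dropped from the feature set.
CREATE OR REPLACE MODEL demo_dataset.fraud_model
OPTIONS (
  model_type = 'LOGISTIC_REG',
  auto_class_weights = TRUE,
  data_split_method = 'NO_SPLIT',
  input_label_cols = ['unusual_activity']
) AS
SELECT * EXCEPT (classifier)
FROM demo_dataset.account_activity_cleansed
WHERE classifier < 0.8;
```

Because `classifier` is random, the percentages in the split check will only approximate 80/10/10, and they will change each time the cleansed table is rebuilt.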

## Prerequisites

* Access to a Google Cloud project with BigQuery and Cloud Spanner APIs enabled.
* A BigQuery dataset named `demo_dataset`.
* A BigQuery connection named `datasherlock.us-central1.demo-spanner-conn` configured to access a Cloud Spanner instance.
* A Cloud Spanner table named `account_activity_raw` within the connected Spanner database, containing the necessary columns (`transaction_id`, `account_id`, `timestamp`, `location`, `device_type`, `ip_address`, `transaction_amount`, `transaction_type`, `successful_login`, `unusual_activity`).
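
One way to confirm the connection and table prerequisites before running the full script is a small federated query. This check is an illustrative addition and is not part of the original script. It assumes the connection ID shown above and that your account has permission to use it.

```sql
-- Sanity check (hypothetical, not in the original script):
-- returns a handful of raw rows if the connection can reach Spanner
-- and the account_activity_raw table exists.
SELECT *
FROM EXTERNAL_QUERY(
  'datasherlock.us-central1.demo-spanner-conn',
  'SELECT * FROM account_activity_raw LIMIT 5'
);
```

If the query succeeds, the column names in the result can be compared against the list above.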

## How to Use

1. **Ensure Prerequisites:** Verify that all prerequisites listed above are met. Pay close attention to the dataset name (`demo_dataset`) and the connection name (`datasherlock.us-central1.demo-spanner-conn`) used in the script; adjust them if your environment uses different names.
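2. **Execute the Script:** Run the SQL commands sequentially in the BigQuery console, with the `bq` command-line tool, or through a BigQuery client library. Order matters because the later statements depend on the table and model created by the earlier ones.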

## Expected Outputs

* A BigQuery table named `demo_dataset.account_activity_cleansed` containing the prepared data.
* Query results showing the distribution of data into training, evaluation, and prediction sets.
* A BigQuery ML model named `demo_dataset.fraud_model`.
* Query results from `ML.EVALUATE` showing model performance metrics (e.g., `precision`, `recall`, `accuracy`, `f1_score`, `roc_auc`).
* Query results from `ML.EXPLAIN_PREDICT` showing predictions, actual labels, and feature attributions for misclassified instances in the prediction set.
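
The last two outputs could come from queries shaped roughly like the ones below. This is a hedged sketch rather than the script's exact text: the `top_k_features` value of 3 is an arbitrary illustrative choice, and the inputs are passed with `SELECT *` for brevity.

```sql
-- Evaluation (sketch): score the model on the held-out evaluation slice.
SELECT *
FROM ML.EVALUATE(
  MODEL demo_dataset.fraud_model,
  (
    SELECT *
    FROM demo_dataset.account_activity_cleansed
    WHERE classifier BETWEEN 0.8 AND 0.9
  )
);

-- Explained predictions (sketch): predict on the prediction slice and
-- keep only rows where the predicted label disagrees with the actual one.
SELECT *
FROM ML.EXPLAIN_PREDICT(
  MODEL demo_dataset.fraud_model,
  (
    SELECT *
    FROM demo_dataset.account_activity_cleansed
    WHERE classifier > 0.9
  ),
  STRUCT(3 AS top_k_features)  -- ASSUMPTION: 3 is an illustrative value
)
WHERE predicted_unusual_activity != unusual_activity;
```

`ML.EXPLAIN_PREDICT` names its prediction column `predicted_` followed by the label column name, which is why the filter compares `predicted_unusual_activity` with `unusual_activity`.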

#### Disclaimer: This README was generated by Gemini Code Assist
