{"id":27381960,"url":"https://github.com/datasherlock/bqml-demo","last_synced_at":"2026-06-21T18:31:39.325Z","repository":{"id":287135680,"uuid":"963704909","full_name":"datasherlock/bqml-demo","owner":"datasherlock","description":"BigQuery ML demo showcasing data prep, logistic regression training, evaluation, and explainable predictions for account fraud detection on Cloud Spanner data.","archived":false,"fork":false,"pushed_at":"2025-04-11T06:08:48.000Z","size":360,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-13T15:16:52.829Z","etag":null,"topics":["bigdata","bigquery","bigquery-ml","sql"],"latest_commit_sha":null,"homepage":"https://www.jeromerajan.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datasherlock.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-10T04:59:03.000Z","updated_at":"2025-04-11T06:08:52.000Z","dependencies_parsed_at":"2025-04-13T15:16:52.733Z","dependency_job_id":null,"html_url":"https://github.com/datasherlock/bqml-demo","commit_stats":null,"previous_names":["datasherlock/bqml-demo"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/datasherlock/bqml-demo","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasherlock%2Fbqml-demo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasherlock%2Fbqml-demo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasherlock%2Fbqml-demo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasherlock%2Fbqml-demo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/datasherlock","download_url":"https://codeload.github.com/datasherlock/bqml-demo/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasherlock%2Fbqml-demo/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34622271,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-21T02:00:05.568Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigdata","bigquery","bigquery-ml","sql"],"created_at":"2025-04-13T15:16:51.725Z","updated_at":"2026-06-21T18:31:39.306Z","avatar_url":"https://github.com/datasherlock.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n![bqml-demo](https://github.com/user-attachments/assets/7d98da3b-29cc-48fc-90f2-a940cbec25df)\n\n\n# BigQuery ML Account Fraud Detection Demo\n\nThis SQL script demonstrates a basic workflow for building, evaluating, and explaining a machine learning model for account fraud detection using BigQuery ML. The script uses data from an external Cloud Spanner table.\n\n## Overview\n\nThe script performs the following steps:\n\n1.  **Data Cleansing and Preparation:** Creates a new table (`demo_dataset.account_activity_cleansed`) by querying raw account activity data from a Cloud Spanner table (`account_activity_raw`) via an external connection (`demo-spanner-conn`). During this process, it:\n    *   Casts data types for timestamp, transaction amount, successful login, and unusual activity columns.\n    *   Adds a `classifier` column with random values between 0 and 1 to facilitate splitting the data into training, evaluation, and prediction sets.\n2.  **Data Split Verification:** Includes a query to check the distribution of data across the training (80%), evaluation (10%), and prediction (10%) sets based on the `classifier` column.\n3.  **Model Training:** Creates or replaces a BigQuery ML Logistic Regression model (`demo_dataset.fraud_model`) using the cleansed data. Key options used:\n    *   `model_type='LOGISTIC_REG'`: Specifies the model algorithm.\n    *   `auto_class_weights=TRUE`: Helps handle class imbalance in the target variable (`unusual_activity`).\n    *   `data_split_method='NO_SPLIT'`: Informs BQML that the data is already split manually.\n    *   `input_label_cols=['unusual_activity']`: Defines the target variable.\n    *   The training uses only the data where `classifier \u003c 0.8`.\n4.  **Model Evaluation:** Evaluates the trained model's performance using `ML.EVALUATE` on the evaluation dataset (`classifier between 0.8 and 0.9`).\n5.  **Prediction and Explanation:** Uses `ML.EXPLAIN_PREDICT` to predict unusual activity on the prediction dataset (`classifier \u003e 0.9`) and provides feature attributions (explanations) for the predictions. The results are filtered to show only instances where the prediction differs from the actual label (misclassifications).\n6.  **(Commented Out) Example Update:** Includes a commented-out example `UPDATE` statement showing how to modify data in the cleansed table.\n\n## Prerequisites\n\n*   Access to a Google Cloud project with BigQuery and Cloud Spanner APIs enabled.\n*   A BigQuery dataset named `demo_dataset`.\n*   A BigQuery connection named `datasherlock.us-central1.demo-spanner-conn` configured to access a Cloud Spanner instance.\n*   A Cloud Spanner table named `account_activity_raw` within the connected Spanner database, containing the necessary columns (`transaction_id`, `account_id`, `timestamp`, `location`, `device_type`, `ip_address`, `transaction_amount`, `transaction_type`, `successful_login`, `unusual_activity`).\n\n## How to Use\n\n1.  **Ensure Prerequisites:** Verify that all prerequisites listed above are met. Pay close attention to the dataset name (`demo_dataset`) and the connection name (`datasherlock.us-central1.demo-spanner-conn`) used in the script; adjust them if your environment uses different names.\n2.  **Execute the Script:** Run the SQL commands sequentially in the BigQuery console, using the `bq` command-line tool, or through a BigQuery client library.\n\n## Expected Outputs\n\n*   A BigQuery table named `demo_dataset.account_activity_cleansed` containing the prepared data.\n*   Query results showing the distribution of data into training, evaluation, and prediction sets.\n*   A BigQuery ML model named `demo_dataset.fraud_model`.\n*   Query results from `ML.EVALUATE` showing model performance metrics (e.g., precision, recall, accuracy, f1-score, roc_auc).\n*   Query results from `ML.EXPLAIN_PREDICT` showing predictions, actual labels, and feature attributions for misclassified instances in the prediction set.\n\n#### Disclaimer: This README was generated by Gemini Code Assist","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatasherlock%2Fbqml-demo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatasherlock%2Fbqml-demo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatasherlock%2Fbqml-demo/lists"}