{"id":19944373,"url":"https://github.com/dataform-co/dataform-bqml","last_synced_at":"2025-07-27T20:34:15.099Z","repository":{"id":230460113,"uuid":"779436952","full_name":"dataform-co/dataform-bqml","owner":"dataform-co","description":null,"archived":false,"fork":false,"pushed_at":"2024-09-16T17:56:04.000Z","size":25,"stargazers_count":6,"open_issues_count":3,"forks_count":4,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-07-06T00:36:15.563Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dataform-co.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-29T20:51:07.000Z","updated_at":"2025-06-13T17:14:38.000Z","dependencies_parsed_at":"2024-09-10T00:33:42.664Z","dependency_job_id":"ba52ca5b-6cf6-4c60-8b87-3151f2a36892","html_url":"https://github.com/dataform-co/dataform-bqml","commit_stats":null,"previous_names":["chtyim/dataform-bqml","dataform-co/dataform-bqml"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/dataform-co/dataform-bqml","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataform-co%2Fdataform-bqml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataform-co%2Fdataform-bqml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataform-co%2Fdataform-bqml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataform-co%2Fdataform-bqml/manifests","owner_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dataform-co","download_url":"https://codeload.github.com/dataform-co/dataform-bqml/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataform-co%2Fdataform-bqml/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267419073,"owners_count":24084098,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-27T02:00:11.917Z","response_time":82,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-13T00:20:23.482Z","updated_at":"2025-07-27T20:34:15.046Z","avatar_url":"https://github.com/dataform-co.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BigQuery Remote Inference Pipeline\nBigQuery supports remote models, such as Vertex AI LLMs, to perform remote inference operations \non both structured and unstructured data. 
When using remote inference, the user needs to pay\nattention to [quotas and limits](https://cloud.google.com/bigquery/quotas#cloud_ai_service_functions), \nwhich can cause retryable errors in a subset of rows that then require reprocessing.\n\nThe BQML Dataform library helps users create BQML pipelines that are resilient to \ntransient failures by automatically reprocessing failed rows and incrementally updating the output table.\n\n## Quick Start Guide\n### Installation\nAdd the bqml package to the package.json file in your Dataform project.\n\n```json\n{\n  \"dependencies\": {\n    \"bqml\": \"https://github.com/dataform-co/dataform-bqml/archive/[RELEASE_VERSION].tar.gz\"\n  }\n}\n```\nYou can find the most up-to-date package version on the [releases](https://github.com/dataform-co/dataform-bqml/releases) page.\n\n### Usage\nThe following example shows how to generate text from images using the\nVertex AI multimodal model.\n\n```javascript\n// Import the module\nconst bqml = require(\"bqml\");\n\n// Name of the multimodal model that has `gemini-pro-vision` as the endpoint\nlet model = \"multi-llm\";\n// Name of the object table that points to a set of images\nlet source_table = \"product_image\";\n// Name of the table for storing the result\nlet output_table = \"product_image_description\";\n\n// Optionally declare the model and source table as Dataform data sources \n// if they are not defined in other actions.\ndeclare({name: model});\ndeclare({name: source_table});\n\n// Execute the pipeline\nbqml.vision_generate_text(\n    source_table, output_table, model, \n    \"Describe the image in 20 words\", {\n        flatten_json_output: true\n    }\n);\n```\nThe next example shows how to generate text using a source query\nto form the prompt.\n\n```javascript\n// Import the module\nconst bqml = require(\"bqml\");\n\n// Name of the model\nlet model = \"llm\";\n// Name of the table that forms the source query\nlet source_table = \"product_comment\";\n// Name of the table for 
storing the result\nlet output_table = \"sentiment\";\n// Columns to uniquely identify a row in the source table\nlet keys = [\"comment_id\"];\n\n// Optionally declare the model and source table as Dataform data sources \n// if they are not defined in other actions.\ndeclare({name: model});\ndeclare({name: source_table});\n\n// Execute the pipeline\nbqml.generate_text(\n    output_table,\n    keys,\n    model,\n    (ctx) =\u003e `SELECT *, \"Classify the text into neutral, negative, or positive. Text: \" || comment AS prompt FROM ${ctx.ref(source_table)}`,\n    {flatten_json_output: true});\n```\n\n## Function Reference\n### Function generate_text\n#### Signature\n```javascript\nfunction generate_text(\n    output_table, unique_keys,\n    ml_model, source_query, ml_configs, options)\n```\n#### Description\nPerforms the ML.GENERATE_TEXT function on the given source table.\n\n**See**: [https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-text](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-text)\n\n| Param | Type | Description |\n| --- | --- | --- |\n| output_table | \u003ccode\u003eString\u003c/code\u003e | the name of the table to store the final result |\n| unique_keys | \u003ccode\u003eString\u003c/code\u003e \\| \u003ccode\u003eArray\u003c/code\u003e | column name(s) for identifying a unique row in the source table |\n| ml_model | \u003ccode\u003eResolvable\u003c/code\u003e | the remote model to use for the ML operation that uses one of the Vertex AI LLM endpoints |\n| source_query | \u003ccode\u003eString\u003c/code\u003e \\| \u003ccode\u003efunction\u003c/code\u003e | either a query string or a Contextable function that produces the query on the source data for the ML operation; the query must select the unique key columns in addition to other fields |\n| ml_configs | \u003ccode\u003eObject\u003c/code\u003e | configurations for the ML operation |\n| options | 
\u003ccode\u003eObject\u003c/code\u003e | the configuration object for the [table_ml](#table_ml) function |\n\n---\n### Function vision_generate_text\n#### Signature\n```javascript\nfunction vision_generate_text(\n    source_table, output_table, model, prompt, llm_config, options)\n```\n#### Description\nPerforms the ML.GENERATE_TEXT function on visual content in the given source table.\n\n**See**: [https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-text#gemini-pro-vision](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-text#gemini-pro-vision)\n\n| Param | Type | Description |\n| --- | --- | --- |\n| source_table | \u003ccode\u003eResolvable\u003c/code\u003e | represents the source object table |\n| output_table | \u003ccode\u003eString\u003c/code\u003e | name of the output table |\n| model | \u003ccode\u003eResolvable\u003c/code\u003e | name of the remote model with the `gemini-pro-vision` endpoint |\n| prompt | \u003ccode\u003eString\u003c/code\u003e | the prompt text for the LLM |\n| llm_config | \u003ccode\u003eObject\u003c/code\u003e | extra configurations for the LLM |\n| options | \u003ccode\u003eObject\u003c/code\u003e | the configuration object for the [obj_table_ml](#obj_table_ml) function |\n\n---\n### Function generate_embedding\n#### Signature\n```javascript\nfunction generate_embedding(\n    output_table, unique_keys,\n    ml_model, source_query, ml_configs, options)\n```\n#### Description\nPerforms the ML.GENERATE_EMBEDDING function on the given source table.\n\n**See**: [https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-embedding](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-embedding)\n\n| Param | Type | Description |\n| --- | --- | --- |\n| output_table | \u003ccode\u003eString\u003c/code\u003e | the name of the table to store the final result |\n| unique_keys | 
\u003ccode\u003eString\u003c/code\u003e \\| \u003ccode\u003eArray\u003c/code\u003e | column name(s) for identifying a unique row in the source table |\n| ml_model | \u003ccode\u003eResolvable\u003c/code\u003e | the remote model to use for the ML operation that uses one of the `textembedding-gecko*` Vertex AI LLMs as the endpoint |\n| source_query | \u003ccode\u003eString\u003c/code\u003e \\| \u003ccode\u003efunction\u003c/code\u003e | either a query string or a Contextable function that produces the query on the source data for the ML operation; the query must select the unique key columns in addition to other fields |\n| ml_configs | \u003ccode\u003eObject\u003c/code\u003e | configurations for the ML operation |\n| options | \u003ccode\u003eObject\u003c/code\u003e | the configuration object for the [table_ml](#table_ml) function |\n\n---\n### Function understand_text\n#### Signature\n```javascript\nfunction understand_text(\n    output_table, unique_keys,\n    ml_model, source_query, ml_configs, options)\n```\n#### Description\nPerforms the ML.UNDERSTAND_TEXT function on the given source table.\n\n**See**: [https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-understand-text](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-understand-text)\n\n| Param | Type | Description |\n| --- | --- | --- |\n| output_table | \u003ccode\u003eString\u003c/code\u003e | the name of the table to store the final result |\n| unique_keys | \u003ccode\u003eString\u003c/code\u003e \\| \u003ccode\u003eArray\u003c/code\u003e | column name(s) for identifying a unique row in the source table |\n| ml_model | \u003ccode\u003eResolvable\u003c/code\u003e | the remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_NATURAL_LANGUAGE_V1 |\n| source_query | \u003ccode\u003eString\u003c/code\u003e \\| \u003ccode\u003efunction\u003c/code\u003e | either a query string or a Contextable function that produces the query on the source data for the ML 
operation; the query must select the unique key columns in addition to other fields |\n| ml_configs | \u003ccode\u003eObject\u003c/code\u003e | configurations for the ML operation |\n| options | \u003ccode\u003eObject\u003c/code\u003e | the configuration object for the [table_ml](#table_ml) function |\n\n---\n### Function translate\n#### Signature\n```javascript\nfunction translate(\n    output_table, unique_keys,\n    ml_model, source_query, ml_configs, options)\n```\n#### Description\nPerforms the ML.TRANSLATE function on the given source table.\n\n**See**: [https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-translate](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-translate)\n\n| Param | Type | Description |\n| --- | --- | --- |\n| output_table | \u003ccode\u003eString\u003c/code\u003e | the name of the table to store the final result |\n| unique_keys | \u003ccode\u003eString\u003c/code\u003e \\| \u003ccode\u003eArray\u003c/code\u003e | column name(s) for identifying a unique row in the source table |\n| ml_model | \u003ccode\u003eResolvable\u003c/code\u003e | the remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_TRANSLATE_V3 |\n| source_query | \u003ccode\u003eString\u003c/code\u003e \\| \u003ccode\u003efunction\u003c/code\u003e | either a query string or a Contextable function that produces the query on the source data for the ML operation; the query must select the unique key columns in addition to other fields |\n| ml_configs | \u003ccode\u003eObject\u003c/code\u003e | configurations for the ML operation |\n| options | \u003ccode\u003eObject\u003c/code\u003e | the configuration object for the [table_ml](#table_ml) function |\n\n---\n### Function annotate_image\n#### Signature\n```javascript\nfunction annotate_image(\n    source_table, output_table, model, features, options)\n```\n#### Description\nPerforms the ML.ANNOTATE_IMAGE function on the given source table.\n\n**See**: 
[https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-annotate-image](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-annotate-image)\n\n| Param | Type | Description |\n| --- | --- | --- |\n| source_table | \u003ccode\u003eResolvable\u003c/code\u003e | represents the source object table |\n| output_table | \u003ccode\u003eString\u003c/code\u003e | name of the output table |\n| model | \u003ccode\u003eResolvable\u003c/code\u003e | the remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_VISION_V1 |\n| features | \u003ccode\u003eArray\u003c/code\u003e | specifies one or more feature names of supported Vision API features |\n| options | \u003ccode\u003eObject\u003c/code\u003e | the configuration object for the [obj_table_ml](#obj_table_ml) function |\n\n---\n### Function transcribe\n#### Signature\n```javascript\nfunction transcribe(\n    source_table, output_table, model, recognition_config, options)\n```\n#### Description\nPerforms the ML.TRANSCRIBE function on the given source table.\n\n**See**: [https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-transcribe](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-transcribe)\n\n| Param | Type | Description |\n| --- | --- | --- |\n| source_table | \u003ccode\u003eResolvable\u003c/code\u003e | represents the source object table |\n| output_table | \u003ccode\u003eString\u003c/code\u003e | name of the output table |\n| model | \u003ccode\u003eResolvable\u003c/code\u003e | the remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_SPEECH_TO_TEXT_V2 |\n| recognition_config | \u003ccode\u003eObject\u003c/code\u003e | the recognition configuration to override the default configuration of the specified recognizer |\n| options | \u003ccode\u003eObject\u003c/code\u003e | the configuration object for the [obj_table_ml](#obj_table_ml) function |\n\n---\n### Function process_document\n#### 
Signature\n```javascript\nfunction process_document(\n    source_table, output_table, model, options)\n```\n#### Description\nPerforms the ML.PROCESS_DOCUMENT function on the given source table.\n\n**See**: [https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-process-document](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-process-document)\n\n| Param | Type | Description |\n| --- | --- | --- |\n| source_table | \u003ccode\u003eResolvable\u003c/code\u003e | represents the source object table |\n| output_table | \u003ccode\u003eString\u003c/code\u003e | name of the output table |\n| model | \u003ccode\u003eResolvable\u003c/code\u003e | the remote model with a REMOTE_SERVICE_TYPE of CLOUD_AI_DOCUMENT_V1 |\n| options | \u003ccode\u003eObject\u003c/code\u003e | the configuration object for the [obj_table_ml](#obj_table_ml) function |\n\n---\n\u003ca name=\"table_ml\"\u003e\u003c/a\u003e\n### Function table_ml\n#### Signature\n```javascript\nfunction table_ml(\n    output_table, unique_keys, ml_function, ml_model,\n    source_query, accept_filter, ml_configs = {}, {\n      batch_size = 10000,\n      batch_duration_secs = 22 * 60 * 60\n} = {})\n```\n#### Description\nA generic structured table ML pipeline.\nIt incrementally performs an ML operation on rows from the source table\nand merges the results into the output table until all rows are processed or the pipeline has run longer\nthan the specified duration.\n\n| Param | Type | Description |\n| --- | --- | --- |\n| output_table | \u003ccode\u003eString\u003c/code\u003e | name of the output table |\n| unique_keys | \u003ccode\u003eString\u003c/code\u003e \\| \u003ccode\u003eArray\u003c/code\u003e | column name(s) for identifying a unique row in the source table |\n| ml_function | \u003ccode\u003eString\u003c/code\u003e | the name of the BQML function to call |\n| ml_model | \u003ccode\u003eResolvable\u003c/code\u003e | the remote model to use for the ML operation |\n| source_query | 
\u003ccode\u003eString\u003c/code\u003e \\| \u003ccode\u003efunction\u003c/code\u003e | either a query string or a Contextable function that produces the query on the source data for the ML operation; the query must select the unique key columns in addition to other fields |\n| accept_filter | \u003ccode\u003eString\u003c/code\u003e | a SQL boolean expression for accepting a row into the output table after the ML operation |\n| ml_configs | \u003ccode\u003eObject\u003c/code\u003e | configurations for the ML operation |\n| batch_size | \u003ccode\u003eNumber\u003c/code\u003e | number of rows to process in each SQL job. Rows in the object table will be processed in batches according to the batch size. Default batch size is 10000 |\n| batch_duration_secs | \u003ccode\u003eNumber\u003c/code\u003e | the number of seconds after which the batching loop stops if it has not yet finished. Default value is 22 hours |\n\n---\n\u003ca name=\"obj_table_ml\"\u003e\u003c/a\u003e\n### Function obj_table_ml\n#### Signature\n```javascript\nfunction obj_table_ml(\n    source_table, source, output_table, accept_filter, {\n      batch_size = 500,\n      unique_key = \"uri\",\n      updated_column = \"updated\",\n      batch_duration_secs = 22 * 60 * 60,\n} = {})\n```\n#### Description\nA generic object table ML pipeline.\nIt incrementally performs an ML operation on new rows from the source table\nand merges the results into the output table until no new row is detected or the pipeline has run longer\nthan the specified duration.\nA row from the source table is considered new if its `unique_key` (defaults to \"uri\")\nis absent from the output table, or if its `updated_column` (defaults to \"updated\")\nvalue is newer than the largest value in the output table.\n\n| Param | Type | Description |\n| --- | --- | --- |\n| source_table | \u003ccode\u003eResolvable\u003c/code\u003e | represents the source object table |\n| source | \u003ccode\u003eString\u003c/code\u003e \\| 
\u003ccode\u003efunction\u003c/code\u003e | either a query string or a Contextable function that produces the query on the source data |\n| output_table | \u003ccode\u003eString\u003c/code\u003e | the name of the table to store the final result |\n| accept_filter | \u003ccode\u003eString\u003c/code\u003e | a SQL expression for finding rows that contain retryable errors |\n| batch_size | \u003ccode\u003eNumber\u003c/code\u003e | number of rows to process in each SQL job. Rows in the object table will be processed in batches according to the batch size. Default batch size is 500 |\n| unique_key | \u003ccode\u003eString\u003c/code\u003e | the primary key in the output table for incremental update. Default value is \"uri\" |\n| updated_column | \u003ccode\u003eString\u003c/code\u003e | the column that carries the last updated timestamp of an object in the object table. Default value is \"updated\" |\n| batch_duration_secs | \u003ccode\u003eNumber\u003c/code\u003e | the number of seconds after which the batching loop stops if it has not yet finished. Default value is 22 hours |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataform-co%2Fdataform-bqml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdataform-co%2Fdataform-bqml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataform-co%2Fdataform-bqml/lists"}