{"id":22508961,"url":"https://github.com/viant/bqwt","last_synced_at":"2025-08-03T13:31:11.335Z","repository":{"id":57608615,"uuid":"162479458","full_name":"viant/bqwt","owner":"viant","description":"BigQuery Windowed Tables","archived":false,"fork":false,"pushed_at":"2024-06-24T21:58:32.000Z","size":114,"stargazers_count":4,"open_issues_count":0,"forks_count":2,"subscribers_count":10,"default_branch":"master","last_synced_at":"2024-06-25T21:03:59.710Z","etag":null,"topics":["bigquery","etl"],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/viant.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-12-19T19:04:32.000Z","updated_at":"2024-06-24T21:58:35.000Z","dependencies_parsed_at":"2024-06-24T20:43:47.959Z","dependency_job_id":"e7ce17c1-bf4c-49e5-906b-a3c3e1aaa007","html_url":"https://github.com/viant/bqwt","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viant%2Fbqwt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viant%2Fbqwt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viant%2Fbqwt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/viant%2Fbqwt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/viant","download_url":"https://codeload.github.com/viant/bqwt/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"h
ttps://github.com","kind":"github","repositories_count":228547517,"owners_count":17935093,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","etl"],"created_at":"2024-12-07T01:26:23.280Z","updated_at":"2024-12-07T01:26:23.784Z","avatar_url":"https://github.com/viant.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BigQuery Windowed Tables (bqwt)\n\nThis library is compatible with Go 1.11+\n\nPlease refer to [`CHANGELOG.md`](CHANGELOG.md) if you encounter breaking changes.\n\n- [Motivation](#Motivation)\n\n\n## Motivation\n\nThe ability to process incrementally incoming data in a way that is both duplication-free and cost-effective is of paramount importance, \nespecially when data is loaded or streamed to BigQuery in real time.\nWhen dealing with many tables at once, managing processing state adds yet another aspect that needs to be taken care of.\nThis library was developed to simplify time window processing across multiple tables.\nIt can be deployed as a stand-alone service or as a cloud function.\n\n## Introduction\n\nBig Query provides a mechanism for windowing data ingested within the last 7 days with [range decorators](https://cloud.google.com/bigquery/table-decorators).\n\nSyntax:\n\n```sql\nSELECT * FROM [PROJECT_ID:DATASET.TABLE@\u003ctimeFrom\u003e-\u003ctimeTo\u003e]\n```\n\n\nThis references table data added between \u003ctimeFrom\u003e and \u003ctimeTo\u003e, in milliseconds since the epoch.\n- \u003ctimeFrom\u003e and \u003ctimeTo\u003e must be within the last 7 days.\n\n\nOne important factor driving Big Query table layout design 
that needs to be taken into account is that the range decorators are only supported with Legacy SQL, \nmeaning that standardSQL supported partition and clustered tables can not be windowed with this method currently.\n\nIn the absence of partition and clustering the following table design layout should provide good flexibility:\n\n- DATASET.TABLE_[DATE_SUFFIX]\n- DATASET.TABLE_[PARTITION_SHARD]_[DATE_SUFFIX]\n\nIn both of the scenarios it is possible to use [table template](https://cloud.google.com/bigquery/streaming-data-into-bigquery) in case when data is streamed to Big Query.\n\n\nThis project uses a meta file to store time windowed table processing state.\n\n@metafile\n\n```json\n{\n  \"URL\": \"gs://mybucket/xmeta\",\n  \"DatasetID\": \"my-project:mydataset\",\n  \"Tables\": [\n    {\n      \"ID\": \"mydataset.my_table_10_20181227\",\n      \"ProjectID\": \"my-project\",\n      \"Name\": \"my_table_10_20181227\",\n      \"Dataset\": \"mydataset\",\n      \"Window\": {\n        \"From\": \"2018-12-27T16:00:37.802Z\",\n        \"To\": \"2018-12-27T17:00:15.832Z\"\n      },\n      \"LastChangedFlag\": \"2018-12-27T17:00:57.238680333Z\",\n      \"Changed\": true,\n      \"Expression\": \"[mydataset.my_table_10_20181227@1545926437802-1545930015832]\",\n      \"AbsoluteExpression\": \"[my-project:mydataset.my_table_10_20181227@1545926437802-1545930015832]\"\n    },\n      {\n          \"ID\": \"mydataset.my_table_10_20181226\",\n          \"ProjectID\": \"my-project\",\n          \"Name\": \"my_table_10_20181226\",\n          \"Dataset\": \"mydataset\",\n          \"Window\": {\n            \"From\": \"2018-12-26T16:00:37.802Z\",\n            \"To\": \"2018-12-26T17:00:15.832Z\"\n          },\n          \"LastChangedFlag\": \"2018-12-26T17:00:57.238680333Z\",\n          \"Changed\": false\n        }\n  ],\n  \"Expression\": \"[mydataset.my_table_10_20181227@1545926437802-1545930015832]\",\n  \"AbsoluteExpression\": 
\"[my-project:mydataset.my_table_10_20181227@1545926437802-1545930015832]\"\n}\n```\n\n\n## Model\n\n\n- [WindowTable](table.go) \n```go\ntype WindowedTable struct {\n\tID                 string\n\tProjectID          string\n\tName               string\n\tDataset            string\n\tWindow             *TimeWindow `description:\"recent change range: from, to timestamp\"`\n\tLastChanged        time.Time \n\tChanged            bool\n\tExpression         string `description:\"represents table ranged decorator expression\"`\n\tAbsoluteExpression string `description:\"represents absolute table path ranged decorator expression\"`\n}\n```\n\n- [Meta](meta.go)\n```go\ntype Meta struct {\n\tURL                 string\n\tDatasetID           string\n\tTables              []*WindowedTable \n\tExpression          string `description:\"represents recently changed tables ranged decorator relative expression (without project id)\"`\n\tAbsoluteExpression  string `description:\"represents recently changed tables ranged decorator absolute expression (with project id)\"`\n\n}\n```\n\n## Service Contract\n\nService accepts both POST and GET http method \n \n- POST method [request](contract.go)\n```go\ntype Request struct {\n\tMode                string   `description:\"operation mode: r - take snapshot, w - persist snapshot\"`\n\tMetaURL             string   `description:\"meta-file location, if relative path is used it adds gs:// protocol\"`\n\tLocation            string   `description:\"dataset location\"`\n\tDatasetID           string   `description:\"source dataset\"`\n\tMatchingTables      []string `description:\"matching table contain expression\"`\n\tPruneThresholdInSec int      `description:\"max allowed duration in sec for unchanged windowed tables before removing\"`\n\tLoopbackWindowInSec int      `description:\"dataset max loopback window for checking changed tables in supplied dataset\"`\n\tExpression          bool     `description:\"if expression flag is set it returns 
only relative expression (without project id)\"`\n\tAbsoluteExpression  bool     `description:\"if absolute expression flag is set it returns only absolute expression (with project id)\"`\n\tMethod              string   `description:\"data insert method: stream or load by default\"`\n    StorageRegion       string   `description:\"storageRegion for standard sql\"`\n}\n```\n\n\n##### GET method query string parameters request mapping\n \n - mode: Mode\n - meta: MetaURL\n - dataset: DatasetID\n - match: MatchingTables\n - location: Location\n - prune: PruneThresholdInSec (min 7 days)\n - loopback: LoopbackWindowInSec\n - expr: Expression\n - absExpr: AbsoluteExpression\n - method: Method\n - storage: StorageRegion, with or without project id\n\ne.g.: http://endpoint/WindowedTable?mode=r\u0026meta=mybucket/xmeta\u0026dataset=db1\u0026expr=true\n\nNote that changing table eviction time triggers table modification, thus the prune threshold can not be less than 7 days. \n\n\n\n## Window table snapshot\n\nThe Mode request attribute controls the table time window snapshot, where r: take a snapshot, w: persist snapshot.\n\n**Taking snapshot**\n     \n  - when the meta file does not exist, the service reads all matching table info and creates a temp meta file with a range decorator expression\n  - when the temp meta file exists, the service returns the range decorator expression from that file\n  - when the meta file exists, the service computes changes between the meta file and recently updated tables, then stores updated table info and the range decorator expression in a temp meta file\n\n**Persisting snapshot** \n\n - the temp meta file is persisted to the meta file.\n\n**Multi Read One Write scenario**\n\nThe following shows an example dataset windowing timeline:\n\n1) t0: data is streamed to Big Query\n2) t1: Process X reads dataset snapshot between t0 and t1 \n    -  WindowedTable?mode=r\u0026meta=bucket/x/meta.json\u0026dataset=project:dataset\u0026expr=true\n3) t2: more data is streamed\n4) t3: Process X completed t0 to t1 processing, 
flags t0-t1 completed \n    -   WindowedTable?mode=w\u0026meta=bucket/x/meta.json\u0026dataset=project:dataset\u0026expr=true\n5) t4: more data is streamed\n6) t5: Process X reads dataset snapshot between t2 and t4 \n    -   WindowedTable?mode=r\u0026meta=bucket/x/meta.json\u0026dataset=project:dataset\u0026expr=true\n7) t6: more data is streamed\n8) t7: Process X tries to process data but something goes wrong, thus no update\n9) t8: more data is streamed\n10) t9: Process X again reads dataset snapshot between t2 and t4 \n    -   WindowedTable?mode=r\u0026meta=bucket/x/meta.json\u0026dataset=project:dataset\u0026expr=true\n11) t10: more data is streamed\n12) t11: Process X completed t2 to t4 processing, flags t2-t4 completed\n    -   WindowedTable?mode=w\u0026meta=bucket/x/meta.json\u0026dataset=project:dataset\u0026expr=true\n\n\n\n## Usage\n\n##### Stand alone app\n\n\n```go\n\n\tsnapshot1, err := getHttpBody(\"http://myEndpoint/WindowedTable?mode=r\u0026meta=myBucket/meta\u0026dataset=myDataset\")\n\tif err != nil {\n\t\tlog.Fatal(err)\n\t}\n\n\tif hasData := len(snapshot1) \u003e 0; hasData {\n\t\tSQL := \"SELECT * FROM \" + string(snapshot1)\n\t\tfmt.Printf(\"%v\\n\", SQL)\n\n\t\t//Process query ....\n\n\t\t//Persist snapshot only if there was no processing error\n\t\t_, err = getHttpBody(\"http://myEndpoint/WindowedTable?mode=w\u0026meta=myBucket/meta\u0026dataset=myDataset\")\n\t\tif err != nil {\n\t\t\tlog.Fatal(err)\n\t\t}\n\t} \n```      \n\n\n### Apache beam \n\n**SQL Provider Class**\n\n```java\nimport com.google.common.base.Strings;\nimport org.apache.beam.sdk.options.ValueProvider;\n\nimport java.io.Serializable;\n\npublic class SQLProvider implements ValueProvider\u003cString\u003e, Serializable {\n\n    private final String baseSQL;\n    private final String windowedTableURL;\n    private final String emptyDatasetSQL;\n\n    public SQLProvider(String baseSQL, String windowedTableURL, String emptyDatasetSQL) {\n        this.baseSQL = baseSQL;\n 
       this.emptyDatasetSQL = emptyDatasetSQL;\n        this.windowedTableURL = windowedTableURL;\n    }\n\n\n    @Override\n    public String get() {\n        String from = Helper.getHttpBody(windowedTableURL);\n        if(Strings.isNullOrEmpty(from)) {\n            from = emptyDatasetSQL;\n        }\n        return baseSQL.replace(\"$SOURCE\", from);\n    }\n\n    @Override\n    public boolean isAccessible() {\n        return true;\n    }\n}\n```\n\n**Pipeline integration**\n\n```java\npublic class Main {\n    \n        public static final String EMPTY_QUERY = \"SELECT * FROM (SELECT INTEGER(NULL) AS field1,  STRING(NULL) AS fieldN) WHERE 1 = 0\";\n        public static final String WINDOWED_TABLE_URL = \"http://myEndpoint/WindowedTable?mode=r\u0026meta=myBucket/meta\u0026dataset=myDataset\";\n        public static final String SQL = \"SELECT * FROM $SOURCE\";\n        public static final String TABLE = \"myTable\";\n        \n       public static void main(String[] args)  {\n            \n           ValueProvider\u003cString\u003e sqlProvider = new SQLProvider(SQL, WINDOWED_TABLE_URL, EMPTY_QUERY);\n           Pipeline pipeline = Pipeline.create(options);\n           PCollection\u003cTableRow\u003e collection = pipeline.apply(\"read data\", BigQueryIO.readTableRows().fromQuery(sqlProvider).withTemplateCompatibility().withoutValidation());\n           \n           //add more processing collection transforms here ....\n           \n            WriteResult eventOutput = collection.apply(\"write data\",\n                           BigQueryIO.writeTableRows()\n                                   .to(schema.getTempTable(false))\n                                   .withSchema(Helper.getTableSchema())\n                                   .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)\n                                   .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));\n\n\n            //Persist window table snapshot after writing to a 
table\n            collection.apply(Wait.on(eventOutput.getFailedInserts()))\n                                   // Transforms each row inserted to an Integer of value 1\n                                   .apply(ParDo.of(countRows()))\n                                   .apply(Sum.integersGlobally())\n                                   .apply(ParDo.of(new SnapshotUpdater()));\n       }\n\n       \n       \n       public static class SnapshotUpdater extends DoFn\u003cInteger, Void\u003e implements Serializable {\n                private final String notificationURL = \"http://myEndpoint/WindowedTable?mode=w\u0026meta=myBucket/meta\u0026dataset=myDataset\";\n    \n               @ProcessElement\n               public void processElement(ProcessContext c) {\n                   Helper.getHttpBody(notificationURL);\n               }\n      }\n       \n}    \n```\n\n\n## Deployment\n\n#### Stand alone service\n\n```bash\ngit clone https://github.com/viant/bqwt.git\ncd bqwt/server\ngo build bqwt.go\n./bqwt -port 8080\n```\n\n#### Docker service\n\n\n```bash\ngit clone https://github.com/viant/bqwt.git\ncd bqwt\ndocker build --no-cache -t viant/bqwt:1.0 .\ncd docker/\ndocker-compose up -d\n```\n\n#### Google cloud function deployment\n\n\n- gcloud auth login\n- gcloud config set project PROJECT_ID\n- gcloud functions deploy WindowedTable --entry-point Handle --runtime go111 --trigger-http\n\n\n## Known limitations\n\n- **Non partitioning/clustering**\nWindowing a table with a range decorator is only supported with legacy SQL, thus only non-partitioned, non-clustered tables queried in legacy mode are supported at the moment.\n\n- **Substantial data delay with streaming insert method**\nWhen the streaming insert method is used, data first arrives in the streaming buffer, which retains recently inserted rows. 
\nWhile the query engine can read records directly from the streaming buffer, these records are not considered for copy jobs, extract jobs or range decorators.\nWith this in mind this API uses StreamingBuffer.OldestEntryTime - 1 as the table time window upper bound.\n\nIn practice it may take a while (up to ~90 minutes) before data is finally extracted from the streaming buffer to a table.\nFind out more about [streaming lifecycle](https://cloud.google.com/blog/products/gcp/life-of-a-bigquery-streaming-insert)  \n\n\n## Running e2e tests\n\nCreate a 'test' Big Query project and a service account with admin permission.\nEnable ssh on the test host and create a [localhost secret](https://github.com/viant/endly/tree/master/doc/secrets#ssh-credentials)\n \nCreate a test [BQ service account secrets](https://github.com/viant/endly/tree/master/doc/secrets#google-cloud-credentials), save it as ~/.secret/viant-e2e.json\n\nInstall [e2e test runner](https://github.com/viant/endly/releases)\n\n\n```bash\ngit clone https://github.com/viant/bqwt.git\ncd bqwt/e2e\nendly \n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fviant%2Fbqwt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fviant%2Fbqwt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fviant%2Fbqwt/lists"}