{"id":16895934,"url":"https://github.com/stevenacoffman/psqltobq","last_synced_at":"2026-04-15T06:31:27.440Z","repository":{"id":252294426,"uuid":"627104742","full_name":"StevenACoffman/psqltobq","owner":"StevenACoffman","description":"Incremental export from PostgreSQL to BigQuery","archived":false,"fork":false,"pushed_at":"2024-08-08T19:16:23.000Z","size":45,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-03-25T21:00:23.553Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/StevenACoffman.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-12T19:45:26.000Z","updated_at":"2024-08-08T19:16:27.000Z","dependencies_parsed_at":"2024-08-08T22:07:43.516Z","dependency_job_id":null,"html_url":"https://github.com/StevenACoffman/psqltobq","commit_stats":null,"previous_names":["stevenacoffman/psqltobq"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/StevenACoffman/psqltobq","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StevenACoffman%2Fpsqltobq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StevenACoffman%2Fpsqltobq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StevenACoffman%2Fpsqltobq/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StevenACoffman%2Fpsqltobq/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/own
ers/StevenACoffman","download_url":"https://codeload.github.com/StevenACoffman/psqltobq/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StevenACoffman%2Fpsqltobq/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31829746,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-14T18:05:02.291Z","status":"online","status_checked_at":"2026-04-15T02:00:06.175Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-13T17:27:06.585Z","updated_at":"2026-04-15T06:31:27.413Z","avatar_url":"https://github.com/StevenACoffman.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# psqltobq - incrementally export AlloyDB (PostgreSQL) tables to BigQuery\n\n### Wait, what?\n\nThis is a CronJob that incrementally exports AlloyDB tables to BigQuery.\n\n### Job Requirements/algorithm:\n1. For each of the following tables, get the `max(last_updated)` from the destination BigQuery table:\n  ```\n  agg_teacher_time_daily\n  district_class_course\n  student_skill_levels\n  window_roster\n  course_skill\n  agg_student_sat_time_daily\n  ```\n2. Select all rows from the above tables where `last_updated` is \u003e `max(last_updated)` from Step 1. Be sure to convert:\n    + all UUID fields to text\n    + DateTimes that are \u003c 0001-01-01 to 0001-01-01\n    + DateTimes greater than 9999-12-31 (i.e. 
window_roster uses ‘infinity’, a special value) to 9999-12-31\n3. Insert all these rows into an empty temporary table (truncate the table before starting)\n4. Finally, merge the temporary table (the `{source_table}` below) into the destination table (`{target_table}`), joining on the primary key columns (`join_cols`):\n```\n        MERGE INTO `{target_table}` T\n        USING `{source_table}` S\n        ON ({join_cols})\n        WHEN MATCHED THEN\n            UPDATE SET {update_cols}\n        WHEN NOT MATCHED THEN\n            INSERT ROW\n```\n\n### Implementation plan\n1. Write the above job\n2. Create duplicate copies of the tables `student_skill_levels`, `agg_teacher_time_daily`, and `window_roster` (initially empty, but with the same column names, types, order, etc.)\n3. Run the job against those test tables on 2 consecutive weekdays (i.e. not Saturday or Sunday), testing that:\n    + The job runs in a reasonable time (the original sql_export_incremental takes 47min daily).\n      **Goal**: ideally, this job will take no more than 2hrs\n    + The job does not have any memory issues or other unexpected failures\n    + The job correctly merges the `temp_` table into the destination BigQuery export table\n\n### *Notes:*\n1. I modified the original `MERGE INTO` SQL to use `INSERT (...) VALUES (...)` instead of `INSERT ROW`, to account for column order differences between the CSV export, the PostgreSQL/AlloyDB table, and the BigQuery table. Otherwise, it was prone to causing failures (or worse, _succeeding_ at putting the wrong data in the wrong field). 
For example:\n```\nMERGE INTO `khanacademy.org:deductive-jet-827`.reports_postgres_exported_temp.agg_student_sat_time_daily T USING `khanacademy.org:deductive-jet-827`.reports_postgres_exported_temp.temp_agg_student_sat_time_daily_20230412_1306 S ON (\n  T.as_of_date = S.as_of_date \n  AND T.student_kaid = S.student_kaid \n  AND T.coach_kaid = S.coach_kaid \n  AND T.class_id = S.class_id\n) WHEN MATCHED \nAND S.last_updated \u003e= '2023-04-12 13:06:02' THEN \nUPDATE \nSET \n  T.all_ms = S.all_ms, \n  T.district_id = S.district_id, \n  T.school_id = S.school_id, \n  T.last_updated = S.last_updated, \n  T.math_ms = S.math_ms, \n  T.rw_ms = S.rw_ms, \n  T.grade = S.grade WHEN NOT MATCHED THEN INSERT (\n    all_ms, district_id, school_id, as_of_date, \n    coach_kaid, class_id, student_kaid, \n    last_updated, math_ms, rw_ms, grade\n  ) \nVALUES \n  (\n    S.all_ms, S.district_id, S.school_id, \n    S.as_of_date, S.coach_kaid, S.class_id, \n    S.student_kaid, S.last_updated, \n    S.math_ms, S.rw_ms, S.grade\n  );\n```\n2. In testing from my local laptop, this new job takes 32 minutes or less to complete. I expect this to take less time when run in GCP.\n3. To export the PostgreSQL / AlloyDB tables I am using `COPY TO` as it is the most efficient method. For example:\n```\nCOPY (\n  select * from \n    agg_student_sat_time_daily \n  WHERE \n    last_updated \u003e= '2023-04-12 13:06:02'\n) TO STDOUT DELIMITER '^' CSV HEADER\n```\n4. To protect against minor schema evolution disparities between BigQuery and PostgreSQL / AlloyDB, if a column data type difference exists, a `CAST AS` is attempted. This works for similar data types (e.g. timestamp vs datetime).  
If the number of columns is wrong or there are major data type differences, the whole job will fail without writing anything to the table.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstevenacoffman%2Fpsqltobq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstevenacoffman%2Fpsqltobq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstevenacoffman%2Fpsqltobq/lists"}