{"id":24319691,"url":"https://github.com/dforsber/glue-table-cache","last_synced_at":"2025-09-27T04:31:27.243Z","repository":{"id":272731061,"uuid":"917518515","full_name":"dforsber/glue-table-cache","owner":"dforsber","description":"Query AWS Glue Tables efficiently with DuckDB","archived":false,"fork":false,"pushed_at":"2025-01-16T09:48:07.000Z","size":119,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-16T10:10:15.958Z","etag":null,"topics":["ast","aws-glue","duckdb","sql"],"latest_commit_sha":null,"homepage":"https://www.boilinginsights.com/","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dforsber.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-16T06:20:51.000Z","updated_at":"2025-01-16T09:48:08.000Z","dependencies_parsed_at":"2025-01-16T10:12:36.955Z","dependency_job_id":"bd3933ee-cfcb-487f-95d6-f4a6ac1d63df","html_url":"https://github.com/dforsber/glue-table-cache","commit_stats":null,"previous_names":["dforsber/glue-table-cache"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dforsber%2Fglue-table-cache","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dforsber%2Fglue-table-cache/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dforsber%2Fglue-table-cache/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dforsber%2Fglue-table-cache/manifests"
,"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dforsber","download_url":"https://codeload.github.com/dforsber/glue-table-cache/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234375149,"owners_count":18822153,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ast","aws-glue","duckdb","sql"],"created_at":"2025-01-17T15:36:23.087Z","updated_at":"2025-09-27T04:31:27.238Z","avatar_url":"https://github.com/dforsber.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Query AWS Glue Tables efficiently with DuckDB\n\nYou can use this module to efficiently query AWS Glue Tables from DuckDB while caching all Glue metadata (tables, partitions) and S3 listings.\n\nBoth Hive partitioning and partition projection based Glue Tables are supported. This module converts a SQL query to use e.g. the `parquet_scan()` function with explicit, partition-pruned S3 listings instead of glob patterns, so that DuckDB does not need to list the files (objects) on S3 itself, which can be very slow.\n\nDuckDB SQL query AST manipulation is used instead of SQL string matching. Both standard Hive-style partitioned tables and AWS Glue partition projection patterns are supported (except injected projection for now). 
Glue Tables are assumed to be Parquet based; support for JSON and CSV based Glue Tables is planned.\n\n\u003e NOTE: This module uses DuckDB itself to do partition pruning by filtering S3 listings stored in an in-memory DuckDB table.\n\n```sql\n-- Original unsupported DuckDB SQL query\nSELECT * FROM glue.db.tbl;\n-- Converts Glue Table to direct S3 read with partition pruned\n--   S3 file listing stored in a DuckDB variable\nSELECT * FROM parquet_scan(getvariable('glue_db_tbl_files'));\n```\n\n## Features\n\n- 🚀 Convert SQL query reading Glue Table to direct S3 read query with partition pruning\n  - [x] Parquet Glue Tables\n  - [ ] JSON/CSV Glue Tables\n- 🔍 SQL-based partition filtering using DuckDB\n  - 📊 Support for Hive-style partitioned tables\n  - 🎯 Support for AWS Glue partition projection patterns tables:\n    - [x] Date-based projections\n    - [x] Integer range projections\n    - [x] Enum value projections\n    - [ ] Injected projection from the query\n- 🚀 LRU (Least Recently Used) caching mechanism for Glue metadata and S3 listings\n  - ⏰ Configurable TTL for cache entries\n  - 🔄 Automatic cache invalidation and refresh\n- [x] Allow setting local HTTP proxy block cache for accessing S3 files, so that the s3 URLs are converted to e.g. `http://localhost:3203/BUCKET/PREFIX`\n- 🔒 Type-safe TypeScript implementation\n- NOTE: DuckDB `json_serialize_sql()` does not support e.g. 
COPY statements\n\n## Installation\n\n```bash\nyarn add glue-table-cache\n```\n\n## Usage\n\n### Converting Glue Table SQL Queries\n\n```typescript\nimport { GlueTableCache } from \"glue-table-cache\";\nimport { DuckDBInstance } from \"@duckdb/node-api\";\n\n// Example: Convert a complex Glue Table query into DuckDB SQL statements\nconst query = `\n  WITH monthly_stats AS (\n    SELECT year, month, \n           COUNT(*) as events,\n           SUM(amount) as total_amount\n    FROM glue.mydatabase.mytable\n    WHERE year = '2024' \n      AND month IN ('01', '02', '03')\n    GROUP BY year, month\n  )\n  SELECT year, \n         SUM(events) as total_events,\n         AVG(total_amount) as avg_amount\n  FROM monthly_stats\n  GROUP BY year\n  ORDER BY year DESC\n`;\n\n// Get the complete SQL setup statements\nconst cache = new GlueTableCache({\n  region: \"eu-west-1\", // AWS region\n  maxEntries: 100, // Maximum number of tables / listings per cache\n  glueTableMetadataTtlMs: 3600000, // Cache TTL: 1 hour\n  s3ListingRefreshMs: 3600000, // S3 listing cache TTL: 1 hour\n  proxyAddress: \"http://localhost:3203/\", // Optional: Use S3 HTTP proxy cache: s3://... 
=\u003e http://localhost:3203/...\n});\n\n// The query above gets converted to use parquet_scan, for each Glue Table reference.\n// The returned transformed query includes all SQL statements for creating S3 listing\n// table and partition pruned SQL VARIABLE that is then used in the parquet scan.\nconst convertedQuery = await cache.convertGlueTableQuery(query);\nconsole.log(convertedQuery);\n/* Output:\n\n  -- The S3 listing is cached\n  CREATE OR REPLACE TABLE \"mydatabase.mytable_s3_files\" AS \n    SELECT path FROM (VALUES ('s3://...'),('s3://...'),..,('s3://...')) t(path);\n\n  CREATE OR REPLACE TABLE \"mydatabase.mytable_s3_listing\" AS \n    SELECT path, regexp_extract(path, 'year=([^/]+)', 1) as year \n    FROM \"mydatabase.mytable_s3_files\";\n\n  CREATE INDEX IF NOT EXISTS idx_year ON \"mydatabase.mytable_s3_listing\" (year);\n\n  -- This is always query specific because we want to partition prune the files\n  SET VARIABLE mydatabase_mytable_files = (\n    SELECT list(path) FROM \"mydatabase.mytable_s3_listing\" \n    WHERE year \u003e= '2023' AND month IN ('01', '02', '03')\n  );\n\n  -- This is not query specific\n  SET VARIABLE mydatabase_mytable_gview_files = (\n    SELECT list(path) FROM \"mydatabase.mytable_s3_listing\"\n  );\n\n  -- There is a view as well, if you happen to check SHOW TABLES, \n  --  but it is query specific!\n  CREATE OR REPLACE VIEW GLUE__mydatabase_mytable AS \n    SELECT * FROM parquet_scan(getvariable('mydatabase_mytable_gview_files'));\n\n  WITH monthly_stats AS (\n    SELECT year, month,\n           COUNT(*) as flights,\n           AVG(delay) as avg_delay\n    FROM parquet_scan(getvariable('mydatabase_mytable_files'))\n    WHERE year \u003e= '2023'\n      AND month IN ('01', '02', '03')\n    GROUP BY year, month\n  )\n  SELECT year,\n         
SUM(flights) as total_flights,\n         AVG(avg_delay) as yearly_avg_delay\n  FROM monthly_stats\n  GROUP BY year\n  ORDER BY year DESC;\n*/\n\nconst db = await (await DuckDBInstance.create(\":memory:\")).connect();\nconst results = await db.runAndReadAll(convertedQuery);\n```\n\n### Cache Management\n\n```typescript\ncache.clearCache(); // Clear entire cache\ncache.invalidateTable(\"mydatabase\", \"mytable\"); // Invalidate specific table\nawait cache.close(); // Clean up resources\n```\n\n## API Reference\n\n### Constructor\n\n```typescript\nconstructor(region: string, config?: CacheConfig)\n```\n\n- `region`: AWS region for Glue and S3 clients\n- `config`: Optional configuration object\n  - `ttlMs`: Metadata cache TTL in milliseconds (default: 1 hour)\n  - `maxEntries`: Maximum cache entries (default: 100)\n  - `s3ListingRefreshMs`: S3 listing cache TTL in milliseconds (default: 5 minutes)\n  - `proxyAddress`: (optional) Converts s3://BUCK/PREF -\u003e \u003cproxyAddress\u003eBUCK/PREF\n\n### Key Methods\n\n#### Map SQL\n\n```typescript\nconvertGlueTableQuery(query: string): Promise\u003cstring\u003e\n```\n\n- Converts Glue table references to DuckDB parquet_scan operations\n\n```typescript\ngetGlueTableViewSetupSql(query: string): Promise\u003cstring[]\u003e\n```\n\n- Generates complete SQL setup for creating a DuckDB view over a Glue table\n- Returns array of SQL statements that:\n  1. Create table for S3 file paths\n  2. Create table for partition listings with extractors\n  3. Create indexes on partition columns\n  4. Set variable with file list\n  5. 
Create the final view\n\n#### Metadata Operations\n\n```typescript\ngetTableMetadata(database: string, tableName: string): Promise\u003cCachedTableMetadata\u003e\n```\n\n- Retrieves Glue Table metadata with caching\n\n```typescript\nclearCache(): void\n```\n\n- Clears all cached metadata\n\n```typescript\ninvalidateTable(database: string, tableName: string): void\n```\n\n- Invalidates the cache for a specific table\n\n## Performance Features\n\n- ⚡️ LRU caching with configurable TTL reduces AWS API calls\n- 📈 Partition value extraction from S3 paths\n- 📊 In-memory DuckDB for efficient SQL operations\n- 🔍 Automatic index creation for partition columns\n- 🔄 Automatic cache invalidation on errors\n- 🚀 SQL conversion rewrites the DuckDB AST instead of using slow regexp matching\n- 🚀 Uses the new DuckDB Node.js (Neo) API module\n\n## Requirements\n\n- Node.js \u003e= 16.0.0\n- AWS credentials with Glue and S3 permissions\n- DuckDB-compatible system architecture\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdforsber%2Fglue-table-cache","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdforsber%2Fglue-table-cache","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdforsber%2Fglue-table-cache/lists"}