{"id":19294565,"url":"https://github.com/web3-storage/migrate-block-index-infra","last_synced_at":"2025-04-22T08:30:33.484Z","repository":{"id":192545156,"uuid":"686959937","full_name":"web3-storage/migrate-block-index-infra","owner":"web3-storage","description":"Infra to migrate legacy `blocks` index DynamoDB Table","archived":true,"fork":false,"pushed_at":"2023-11-03T09:36:58.000Z","size":229,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-02-24T00:28:04.586Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/web3-storage.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-04T10:06:46.000Z","updated_at":"2024-04-18T16:35:45.000Z","dependencies_parsed_at":null,"dependency_job_id":"ce16bff4-72e6-49a7-a6df-93425f02104f","html_url":"https://github.com/web3-storage/migrate-block-index-infra","commit_stats":null,"previous_names":["web3-storage/migrate-block-index-infra"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/web3-storage%2Fmigrate-block-index-infra","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/web3-storage%2Fmigrate-block-index-infra/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/web3-storage%2Fmigrate-block-index-infra/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/web3-storage%2F
migrate-block-index-infra/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/web3-storage","download_url":"https://codeload.github.com/web3-storage/migrate-block-index-infra/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250205971,"owners_count":21392157,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-09T22:38:43.984Z","updated_at":"2025-04-22T08:30:33.156Z","avatar_url":"https://github.com/web3-storage.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Migrate Block Index\n\nInfra to migrate legacy `blocks` index DynamoDB Table.\n\nThe `Scanner` lambda does a full table scan of the legacy index table, and sends batches of records to an SQS Queue. Ported from https://github.com/alexdebrie/serverless-dynamodb-scanner\n\nThe `Consumer` lambda subscribes to the queue. 
It transforms the legacy records into the current format, checks if they exist in the destination table, and writes the missing ones in batches to that table.\n\n## Getting started\n\nThe repo contains the infra deployment code for the table migration process.\n\n```\n├── packages       - lambda implementations\n└── stacks         - sst and aws cdk code to deploy the lambdas\n```\n\nTo work on this codebase **you need**:\n\n- Node.js \u003e= v18 (prod env is node v18)\n- Install the deps with `npm i`\n\n## Scanner\n\nLambda to scan an entire DynamoDB table, sending batches of records to an SQS Queue.\n\nInvoke it directly:\n\n```bash\naws lambda invoke --function-name arn-or-name \\\n  --invocation-type Event \\\n  --cli-binary-format raw-in-base64-out \\\n  --payload '{\"TotalSegments\":1,\"Segment\":0}' ./migrate-out.log\n```\n\n- `--function-name` can be the ARN or the function name.\n- `--invocation-type Event` causes the function to be invoked async, so we don't wait for it to complete.\n- `--cli-binary-format` is required to make passing in the payload work! See [aws cli issue](https://github.com/awsdocs/aws-lambda-developer-guide/issues/180#issuecomment-1166923381).\n- `--payload` lets you set the table scan partition parameters as a JSON string, see more details below.\n- `./migrate-out.log` the last arg is the `outfile`, which is required but unused when invoking async!\n\nThe lambda stores its progress in SSM Parameter Store so it can resume.\n\nIt invokes itself again when its remaining execution time is less than `MIN_REMAINING_TIME_MS`, as we only get 15 minutes max lambda execution time.\n\nTo abort all currently running invocations for a stage, create an SSM parameter named `/migrate-block-index/${Config.STAGE}/stop`, e.g. `/migrate-block-index/prod/stop`, and set its value to any string, e.g. `STOP`. 
The existence of a value for that key is the signal to stop.\n\n### Scan partition\n\nWith ~858 million records to scan, 1 worker would take ~30 days.\n\n```\n@ ~1.5s processing time per 500 records.\n(858,000,000 recs / 500 batch size) * 1.5s per batch = 2,574,000s = ~30 days.\n```\n\nWe can use [table scan partition](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html#Scan.ParallelScan) parameters to split the scan into 10 segments and complete it in ~3 days, by invoking the lambda 10 times in parallel.\n\nPass `{ TotalSegments: number, Segment: number }` to control the scan partition parameters.\n- `TotalSegments` is how many partitions to divide the full set into, e.g. `10`.\n- `Segment` is the scan partition index that this worker should operate on, e.g. `0` for the first.\n\n## Consumer\n\nQueue consumer lambda to transform items to the current index format, check if they exist in the destination table, and write those records that are missing.\n\nEach record is a stringified array of up to 500 BlocksIndex objects from the Scanner.\n\nAny failed writes from the DynamoDB BatchWriteCommand are sent to the `unprocessedWritesQueue` for debugging and re-driving.\n\nOn any other error, the message is released back to the queue. If that batch fails 3 times, it is sent to the `batchDeadLetterQueue`.\n\n### Scaling\n\nLambda SQS consumer scaling is managed by AWS, but we can control how the batching is managed.\n\n- `MaximumBatchingWindowInSeconds` - _default: 0 for SQS_. Time, in seconds, that Lambda spends gathering records before invoking the function. Any value from 0 seconds to 300 seconds in increments of seconds.\n- `BatchSize` - _default: 10 for SQS_. The maximum number of records in each batch that Lambda pulls from your stream or queue and sends to your function. Lambda passes all of the records in the batch to the function in a single call, up to the payload limit for synchronous invocation (6 MB). 
When you set BatchSize to a value greater than 10, you must set MaximumBatchingWindowInSeconds to at least 1.\n\nsource: https://docs.aws.amazon.com/lambda/latest/dg/API_EventSourceMappingConfiguration.html\n\nSetting the batch size to `20` will send 10k records to the consumer per invocation. `(20 * 500 = 10,000)`\n\n\u003e When a Lambda function subscribes to an SQS queue, Lambda polls the queue as it waits for messages to arrive. Lambda consumes messages in batches, starting at five concurrent batches with five functions at a time.\n\u003e\n\u003e If there are more messages in the queue, Lambda adds up to 60 functions per minute, up to 1,000 functions, to consume those messages. This means that Lambda can scale up to 1,000 concurrent Lambda functions processing messages from the SQS queue.\n\u003e\n\u003e This scaling behavior is managed by AWS and cannot be modified.\n\u003e\n\u003e By default, Lambda batches up to 10 messages in a queue to process them during a single Lambda execution.\n- https://aws.amazon.com/blogs/compute/understanding-how-aws-lambda-scales-when-subscribed-to-amazon-sqs-queues/\n\n## Index formats\n\nWe need to convert the legacy format to the current format during the migration.\n\n### Legacy\n\n`blocks` table format.\n\n| multihash | cars    | createdAt | data  | type    |\n|-----------|---------|-----------|-------|---------|\n| `zQm...` | `[ { \"M\" : { \"offset\" : { \"N\" : \"3193520\" }, \"length\" : { \"N\" : \"100615\" }, \"car\" : { \"S\" : \"region/bucket/raw/bafy.../315.../ciq...car\" } } } ]` | 2022-05-30T17:06:12.864Z | `{}` | raw |\n\n### Current\n\n`blocks-cars-position` table format.\n\n| blockmultihash | carPath | length | offset |\n|----------------|---------|--------|--------|\n| `z2D...` | region/bucket/raw/QmX.../315.../ciq...car | 2765 | 3317501 |\n\n## Costs\n\nThis will be used to migrate indexes from the `blocks` table to the `blocks-cars-position` table.\n\n### Source table scan cost\n\n`blocks` table stats\n\n| Item count    | 
Table size | Average item size\n|---------------|------------|-----------------\n| 858 million   | 273 GB     | 319 bytes\n\n- 4 KB / 319 bytes = 12 items per 1 RCU _(eventually consistent, cheap read, not transactional)_\n- 858 million / 12 = 71 million RCUs\n- 71 * $0.25 per million = **~$17 for a full table scan**\n\n### Destination table write cost\n\n`blocks-cars-position` table stats\n\n| Item count    | Table size | Average item size\n|---------------|------------|-----------------\n| 43 billion    | 11 TB      | 255 bytes\n\nAssuming we have to write 1 new record for every source record:\n\n- 1 KB / 255 bytes = 3 items per WCU\n- 858 million / 3 items per WCU = 286 million WCUs\n- 286 * $1.25 per million = **$357.5 total write cost**\n\nBut initial explorations suggest we will actually only need to write 1% of the source table to the destination table. So the write cost will likely be very cheap as long as we check for existence before attempting writes. Alas, conditional writes cost as much as a write even if no write occurs.\n\n### Lambda costs\n\nOn the consumer side, with `BatchSize: 20` we will process 10,000 records per invocation.\n- 859 million / 10,000 = 85,900 invocations.\n- Estimate 10s to 30s per 10k processing time\n- 1 GB RAM\n- = ~$30\n\nOn the scanner side\n- @ ~1.5s processing time per 500 records.\n- running for 11 mins per invocation\n- ((11 * 60) / 1.5) * 500 = 220,000 records per invocation\n- 859 million / 220,000 recs per invocation = ~3,900 invocations\n- = ~$9\n\nper https://calculator.aws/#/addService/Lambda\n\n### SQS costs\n\n- 859,000,000 / 500 = 1,718,000 puts to the queue\n- 3,436,000 queue ops\n- ~$1\n\nper https://calculator.aws/#/addService/SQS\n\n### References\n\n\u003e Write operation costs $1.25 per million requests.\n\u003e Read operation costs $0.25 per million requests.\n– https://dashbird.io/knowledge-base/dynamodb/dynamodb-pricing/\n\n\u003e Read consumption: Measured in 4KB increments (rounded up!) for each read operation. 
This is the amount of data that is read from the table or index... if you read a single 10KB item in DynamoDB, you will consume 3 RCUs (10 / 4 == 2.5, rounded up to 3).\n\u003e\n\u003e Write consumption: Measured in 1KB increments (also rounded up!) for each write operation. This is the size of the item you're writing / updating / deleting during a write operation... if you write a single 7.5KB item in DynamoDB, you will consume 8 WCUs (7.5 / 1 == 7.5, rounded up to 8).\n– https://www.alexdebrie.com/posts/dynamodb-costs/\n\n\u003e If a ConditionExpression evaluates to false during a conditional write, DynamoDB still consumes write capacity from the table. The amount consumed is dependent on the size of the item (whether it’s an existing item in the table or a new one you are attempting to create or update). For example, if an existing item is 300kb and the new item you are trying to create or update is 310kb, the write capacity units consumed will be the 310kb item.\n– https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.ConditionalUpdate\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fweb3-storage%2Fmigrate-block-index-infra","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fweb3-storage%2Fmigrate-block-index-infra","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fweb3-storage%2Fmigrate-block-index-infra/lists"}