{"id":19522209,"url":"https://github.com/captaincodeman/datastore-mapper-test","last_synced_at":"2025-07-01T11:38:27.331Z","repository":{"id":66357872,"uuid":"82600515","full_name":"CaptainCodeman/datastore-mapper-test","owner":"CaptainCodeman","description":"Test project for Datastore Mapper","archived":false,"fork":false,"pushed_at":"2017-03-24T18:10:43.000Z","size":10,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-05-29T05:07:09.170Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CaptainCodeman.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-02-20T20:41:13.000Z","updated_at":"2017-02-20T20:44:14.000Z","dependencies_parsed_at":"2023-02-23T04:15:58.520Z","dependency_job_id":null,"html_url":"https://github.com/CaptainCodeman/datastore-mapper-test","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/CaptainCodeman/datastore-mapper-test","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CaptainCodeman%2Fdatastore-mapper-test","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CaptainCodeman%2Fdatastore-mapper-test/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CaptainCodeman%2Fdatastore-mapper-test/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CaptainCodeman%2Fdatastore-mapper-test/manifes
ts","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CaptainCodeman","download_url":"https://codeload.github.com/CaptainCodeman/datastore-mapper-test/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CaptainCodeman%2Fdatastore-mapper-test/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259808468,"owners_count":22914652,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T00:37:46.707Z","updated_at":"2025-06-14T11:32:51.645Z","avatar_url":"https://github.com/CaptainCodeman.png","language":"Go","readme":"# datastore-mapper-test\n\nA simple project used for testing the performance and correctness of\n[appengine datastore-mapper](https://github.com/captaincodeman/datastore-mapper)\n\n## Generate test data\n\nTest data is generated using seed values so the 'random' entries are repeatable.\nThis allows the same data to be generated locally to sum up values and check them\nagainst the totals ultimately reported by BigQuery.\n\nThe test data consists of simple Order entities, each with one or more Order Items.\n\nPOST to /_ah/start?start=0\u0026finish=10000000\n\nUse start and finish to set the range of data to generate. Setting new_only=true makes\nthe system query for existing data and write entries only for any gaps found.\n\nEach entity records the seed ID it was created with. The datastore key is based on\na hashid of this integer, which distributes the keys so that inserts are\nnot all sequential.\n\n## Run mapper jobs\n\nThere are two mapper jobs defined. 
When running them it's important to understand\nhow sharding and concurrent request limits affect processing.\n\nThe shard count determines how many slices the datastore dataset is split into. Only\none request per shard will ever be running at a time, so this controls how much work\ncan be done in parallel.\n\nThe queue.yaml max_concurrent_requests setting controls how many shard requests\nwill be executed concurrently.\n\nThe app.yaml max_concurrent_requests setting controls how many shard requests can\nbe processed by a single instance.\n\nTogether, these settings control the 'scale out' of the mapper job: how much work is\nexecuted concurrently and how many instances will be spun up to do it.\n\nOf course, more and bigger instances will be faster ... but also more expensive.\n\n### Export to JSON\n\nThis job exports the datastore entities to a JSON file. The bucket to write to is\nrequired and should be owned by the project:\n\nPOST /_ah/mapper/start?name=main.ExportJson\u0026bucket=mapper-perf.appspot.com\u0026shards=16\n\nGCS writing is done in buffered chunks so there is some overhead for each shard that\nis processed concurrently by any one instance. I found that an F2 instance could run\n4 requests concurrently and used around 200Mb of RAM.\n\nFor 50 million entities and no effective limit to the number of concurrent shards:\n\n 16 shards took around 80 minutes to complete.\n 32 shards took around 40 minutes to complete.\n 64 shards took around 20 minutes to complete.\n\nOn average each 10-minute shard request processed around 400,000 - 450,000 entities,\nwhich works out to roughly 700 - 750 entities per second per shard.\n\nThe operation consumed:\n\n    17.94 Instance Hours\n    50.03 Million Datastore Read Ops\n\nCosts were around $30, with the majority being datastore read operations. 
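\n\nThe timings above follow a simple linear model: with shards running fully in parallel,\ntotal duration is just entities / (shards × per-shard rate). A minimal sketch of that\narithmetic, taking ~725 entities/sec/shard (the midpoint of the measured 700 - 750 range,\nnot an exact constant):\n\n```go
package main

import "fmt"

// exportMinutes estimates job duration assuming all shards run in
// parallel and each shard sustains a steady per-second throughput.
func exportMinutes(entities, shards int, perShardPerSec float64) float64 {
	return float64(entities) / (float64(shards) * perShardPerSec) / 60
}

func main() {
	// Reproduces the measured scaling for 50 million entities:
	// 16 shards ~72 min, 32 shards ~36 min, 64 shards ~18 min.
	for _, shards := range []int{16, 32, 64} {
		fmt.Printf("%2d shards: ~%.0f minutes\n", shards, exportMinutes(50_000_000, shards, 725))
	}
}
```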
The read cost could\nbe reduced by doing a 'KeysOnly' iteration of the entity keys and loading the entities\nfrom memcache using qedus/nds, although its effectiveness would depend on the proportion\nof data that is typically cached. For frequent batch updates of new data it could be an\neffective strategy.\n\nIn all, this produced a ~30Gb JSON file, and importing it into BigQuery took approximately\n2 minutes. The schema.json file is the BigQuery schema definition used to define the table\nformat.\n\nF4 instance (512Mb RAM): 800,000 - 900,000 entities per shard request\n\n    10 concurrent ~360Mb\n    16 concurrent ~\n\nSuggested max concurrent requests per instance type for JSON exporting:\n\n    Class  Memory  Max\n    F1      128Mb    2\n    F2      256Mb    6\n    F4      512Mb   16\n    F4_1G  1024Mb   32\n\nTODO: Explain the balance between memory use, performance and cost\n\n### Export to BigQuery\n\nAs an alternative, I also created a job to export directly to BigQuery using streaming\ninserts. These are much slower and the chance of duplicate data being inserted is\nhigher.\n\nHowever, even though some task operations were restarted at some point, the InsertID\nfeature of the streaming inserts did its job and the final table had exactly 50,000,000\nentries, exactly the same as with the JSON export / ingestion approach.\n\nThe memory overhead was considerably lower and an instance could easily handle 8 - 16\nconcurrent shards at once.\n\nEven though I batched the BigQuery writes, the streaming insert approach was considerably\nslower, with the overall job taking days (although that was with fewer shards).\n\nThis approach makes more sense for live updates of data or smaller repeating\nbatches (e.g. hourly or daily). This would require a cron task to create the query for\nthe previous period range and an appropriate index on the datastore (e.g. a date field\nwith no time information so an equality filter could be used in the query). 
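\n\nSuch a date field would hold the day with the time-of-day stripped, so a single stored\nvalue matches an equality filter. A minimal sketch (the Order kind and Date property\nnames here are hypothetical, not taken from this repo):\n\n```go
package main

import (
	"fmt"
	"time"
)

// dayOf strips the time-of-day, leaving a value suitable for an indexed
// date property matched with an equality filter, e.g. (classic
// appengine/datastore API, hypothetical names):
//
//	datastore.NewQuery("Order").Filter("Date =", dayOf(now))
func dayOf(t time.Time) time.Time {
	return time.Date(t.Year(), t.Month(), t.Day(), 0, 0, 0, 0, time.UTC)
}

func main() {
	created := time.Date(2017, 3, 24, 18, 10, 43, 0, time.UTC)
	fmt.Println(dayOf(created)) // 2017-03-24 00:00:00 +0000 UTC
}
```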
Also, a\ncomposite index would be needed on the date + `__scatter__` special property used for\nsplitting the query range into shards.\n\nThe table for the BigQuery export is created automatically in the example.\n\n## Mapper Options\n\nTODO: Effect of other mapper options such as request timeout","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcaptaincodeman%2Fdatastore-mapper-test","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcaptaincodeman%2Fdatastore-mapper-test","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcaptaincodeman%2Fdatastore-mapper-test/lists"}