{"id":19522187,"url":"https://github.com/captaincodeman/datastore-mapper","last_synced_at":"2025-04-26T09:32:06.909Z","repository":{"id":66357887,"uuid":"62970321","full_name":"CaptainCodeman/datastore-mapper","owner":"CaptainCodeman","description":"Appengine Datastore Mapper in Go","archived":false,"fork":false,"pushed_at":"2017-03-20T14:53:16.000Z","size":122,"stargazers_count":21,"open_issues_count":13,"forks_count":1,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-04T10:33:59.083Z","etag":null,"topics":["appengine","bigquery","cloud-storage","datastore","datastore-entities","datastore-mapper","go","map-reduce","shards"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CaptainCodeman.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-07-09T21:42:01.000Z","updated_at":"2022-12-07T19:55:41.000Z","dependencies_parsed_at":"2023-02-22T12:46:30.718Z","dependency_job_id":null,"html_url":"https://github.com/CaptainCodeman/datastore-mapper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CaptainCodeman%2Fdatastore-mapper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CaptainCodeman%2Fdatastore-mapper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CaptainCodeman%2Fdatastore-mapper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cap
tainCodeman%2Fdatastore-mapper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CaptainCodeman","download_url":"https://codeload.github.com/CaptainCodeman/datastore-mapper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250967238,"owners_count":21515564,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["appengine","bigquery","cloud-storage","datastore","datastore-entities","datastore-mapper","go","map-reduce","shards"],"created_at":"2024-11-11T00:37:35.895Z","updated_at":"2025-04-26T09:32:06.560Z","avatar_url":"https://github.com/CaptainCodeman.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Datastore Mapper\nThis is an implementation of the appengine datastore mapper functionality of the\n[appengine map-reduce framework](https://github.com/GoogleCloudPlatform/appengine-mapreduce)\ndeveloped in [Go](https://golang.org/).\n\n## Status\nIt is in alpha status while I develop the API and should not be used in production yet but\nit's been successfully used to export datastore entities to JSON for import into BigQuery,\nstreaming inserts directly into BigQuery and for schema migrations and lightweight aggregation\nreporting.\n\nMore work is needed to finish the query sharding when using ancestor queries and to make\nthe cloud storage file output writing more robust.\n\n## Usage\nThe mapper server is a regular Go `http.Handler` so it can be hosted at any path. Just make\nsure that the path used is also passed in to the constructor. 
The default path provided\nis `/_ah/mapper/`.\n\nAdding with default options:\n\n    func init() {\n        mapperServer, _ := mapper.NewServer(mapper.DefaultPath)\n        http.Handle(mapper.DefaultPath, mapperServer)\n    }\n\nThe default configuration can be overridden by passing additional options, for example:\n\n    func init() {\n        mapperServer, _ := mapper.NewServer(mapper.DefaultPath,\n            mapper.DatastorePrefix(\"map_\"),\n            mapper.Oversampling(16),\n            mapper.Retries(8),\n            mapper.LogVerbose,\n        )\n        http.Handle(mapper.DefaultPath, mapperServer)\n    }\n\nSee the `/config.go` file for all configuration options.\n\n## Mapper Jobs\nMapper Jobs are defined by creating a Go struct that implements the `JobSpec` interface and\nregistering it with the mapper:\n\n    func init() {\n        mapper.RegisterJob(\u0026mymapper{})\n    }\n\nThe struct defines the `Query` function used to parse request parameters and create the datastore\nquery spec:\n\n    func (m *mymapper) Query(r *http.Request) (*Query, error)\n\nIt also defines the `Next` function that the mapper calls as the datastore is iterated:\n\n    func (m *mymapper) Next(c context.Context, counters Counters, key *datastore.Key) error\n\nThe mapper will use a 'keys only' query by default but this can be changed to fetching the full\nentities by implementing the `JobEntity` interface to return the entity to load into (which\nshould be stored as a named field of the job):\n\n    func (m *mymapper) Make() interface{}\n\nTo make it easy to output to Cloud Storage a job can implement `JobOutput` which provides\nan `io.Writer` for the slice being processed. 
The job can create whatever encoder it needs\nto write its output (JSON works great for BigQuery):\n\n    func (m *mymapper) Output(w io.Writer)\n \nThere are additional interfaces that can be implemented to receive notification of various\nlifecycle events:\n\n* Job Started / Completed\n* Namespace Started / Completed\n* Shard Started / Completed\n* Slice Started / Completed\n\nSee the [/example/](/example/) folder for some examples of the various job types:\n\n* example1: simple keys-only iteration and aggregation using counters\n* example2: simple eager iteration and aggregation using counters\n* example3: parse request parameters to create query or use defaults for CRON job\n* example4: lifecycle notifications\n* example5: export custom JSON to Cloud Storage (for batch import into BigQuery)\n* example6: streaming inserts into BigQuery\n\n### Local Development\nThe example application will run locally but currently needs a `service-account.json` credentials\nfile in order to use the cloud storage output.\n\nKicking off a job is done by POSTing to the `http://localhost:8080/_ah/mapper/start` endpoint\npassing the name of the job spec and optionally:\n\n* **shards**\n  The number of shards to use\n* **queue**\n  The taskqueue to schedule tasks on\n* **bucket**\n  The GCS bucket to write output to\n\nExample:\n\n    http://localhost:8080/_ah/mapper/start?name=main.example1\u0026shards=8\u0026bucket=staging.[my-app].appspot.com\n\n## Why only the mapper?\nTechnology moves on and Google's cloud platform now provides other services that handle\nmany of the things that the Map-Reduce framework was once the solution to. 
If you need \nany Analytics and Reporting functionality, for instance, you would now likely look at \nusing [BigQuery](https://cloud.google.com/bigquery/) instead of trying to create your\nown via Map-Reduce.\n\nBut the mapper is still needed: it's used to write backups of your Datastore data to \nCloud Storage (which is also the way you export your Datastore data into BigQuery). Many\ntimes it's necessary to iterate over all the data in your system when you need to update\nthe schema or perform some other operational processing and the mapper provides a great\napproach for doing that by enabling easy splitting and distribution of the workload.\n\nDatastore is great as operational app storage but is weak when it comes to reporting so\na common approach is to export data to a queryable store and there is a datastore admin\ntool that can provide a backup or export of the entities.\n\nBut using the datastore backup isn't always ideal - sometimes you want more control over \nthe range and the format of the data being exported. For instance, you may not want or\nneed the full content of each entity to be exported to BigQuery in the same format as it's\nused within the application.\n\n## Why a Go version?\nSo why not just continue to use the existing Python or Java implementations?\n\nThey both continue to work as well as they always have but there were a few specific\nreasons that I wanted a solution using Go:\n\n### Performance\nMy experience of the Python version is that it tends to be rather slow and can only handle\nentities in small batches without the instance memory going over the limit and causing errors\nand retries. 
The Java runtime is faster but also requires larger instance sizes just\nto execute before you even start processing anything and, well, it's Java and it's now 2016.\n\nI like that a Go version can run (and run _fast_) even on the smallest F1/B1 micro-instance\nconsuming just a few MB of RAM but able to process thousands of entities per second.\n\n### Runtime Cost\nThe slow performance and larger instance size requirements for Python and Java both \nresult in higher operational costs which I'd prefer to avoid. Each platform also tends\nto have its own memcache implementation to improve Datastore performance and bypassing\nany already-populated application cache can also add to the cost due to unnecessary datastore\noperations. If I'm exporting a day's worth of recent 'hot' data for instance, there is a good\nchance that a sizeable proportion of that will already be in memcache and if I access it\nwith the same system I can save some costs. \n\n### Development Cost\nI already have my datastore models developed as part of my Go app and don't want to have\nto re-develop and maintain a Python or Java version as well (and sometimes it's not \nstraightforward to switch between them with different serialization methods etc.).\nThe cost of including a whole other language in your solution goes beyond just the code,\nit's also the development machine maintenance and all the upgrading, IDE tooling and skills\nmaintenance that go along with it before you get to things like the mental fatigue of context\nswitching.\n\n## Performance\nInitial testing shows it can process over 100k datastore entities in about a minute\nusing a single F1 micro instance. 
This works out to about 1.6k entities per second.\n\nOf course the work can often be divided into smaller pieces so it's very likely that\nincreasing the number of shards so that requests are run on multiple instances will\nresult in even greater performance and faster throughput.\n\n## Implementation\nI originally started with the idea of strictly following the existing implementations\ndown to the format of the entities for controlling the tasks so that the UI could\nbe re-used and the Go version could be a drop-in replacement but this would have also\nneeded the pipeline library to be implemented and while I got quite a long way with it,\nI ultimately decided to abandon the idea once I realized I would only ever be using the\nmapper part.\n\nThis then meant I was free to re-imagine how some of the mapping could work with the \nsame fundamental goals and many of the same ideas but some different design choices\nto suit my own app requirements and issues I'd had with the existing framework.\n\nThe biggest change is conceptually how the mapper splits up the datastore work to be\ndistributed between shards. The Python and Java versions both use the `__scatter__`\nproperty (added to about 0.78% or 1 in 128 entities) to get a random distribution of\nthe data that could be split for processing between shards. 
This splitting can only\nbe done within a single namespace though and if the number of namespaces used in an\napp was above a certain limit (default 10), they switched instead to sharding on the\nnamespaces with each shard iterating over a range of namespaces and data within them.\n\nIf you are using the namespacing for multi-tenancy and each tenant has very different\nvolumes of data it can easily result in one shard becoming completely overloaded which\ncompletely destroys the benefit of sharding and distribution which is why the mapper\nframework is being used in the first place.\n\nThis framework instead creates a producer task to iterate over the namespaces and do\nthe shard splitting on each of them, assigning separate shards within each namespace\nto ensure the work can be split evenly but with some heuristics so that the number of\nshards used for a namespace that doesn't contain much data is automatically reduced.\n\nSo, instead of the number of shards being fixed, it becomes more of a 'minimum target'\nfor splitting the data as long as it makes sense to do so. Ultimately, we want the\nwork split into sensible but meaningful chunks so that it can be parallelized and it\ndoesn't matter if there are 8 shards with 10 slices each or 800 shards of 1 slice, the\ntask queue processing rate and concurrency limits are what manage how fast it will\nall be scheduled and executed.\n\nWithin each namespace-shard we still use the same slice approach to process the range\nof data allocated to it and allow for the app-engine request timeout limitations and\nretry / restart ability with some simplified locking / leasing to prevent duplicate\ntask execution. \n\n## Workflow\nHere is the basic execution workflow:\n\nDefine a Job (can be a simple struct) that implements the basic Job interface used to\ndefine the Query, and the Next method for processing the entities. 
A number of additional\njob lifecycle interfaces can be implemented if required to receive notifications when a job,\nnamespace, shard or slice starts and ends. Each job must be registered with the mapper\nframework.\n\nInitiate Job execution by POSTing to the jobs endpoint. This can be done manually or \ncould be set up to be kicked off from a cron task. Your job Query function will\nbe passed the request to extract any parameters it requires to create the query object\nfor filtering the datastore entities to process. For example, you could pass the date range of\nitems to process, or a cron task could default to processing the previous day's entries.\n\nJob execution will begin with the scheduling of a namespace iterator that will iterate\nover all the namespaces specified in the query. Usually, this will be 'all' if none\nhave been defined but if only certain namespaces should be included they can be set.\n\nFor each namespace, the system will attempt to use the `__scatter__` index to split the\nrange of keys between shards, falling back to an ordinal splitting of property values if\nthe index does not exist. Depending on the number of shards configured and an estimate\nof the minimum dataset size (based on the number of random keys returned) the system\nwill choose an appropriate number of shards and split the key range between them to\nprovide potential for parallelization.\n\nEach namespace-shard will be scheduled to execute and will iterate over its range of\nkeys in slices to limit the impact of any failure and to cope with the request time\nlimits of the appengine runtime. 
Each slice will continue where the previous left off\nby using a datastore cursor, thus eventually iterating over the entire dataset assigned\nto that shard and, when all the shards complete, the entire dataset for the namespace.\n\nYour job functions will be called at the appropriate points to perform whatever work\nthey need to.\n\n## Output\nFor simple aggregation operations the inbuilt counter object can be used but be aware\nthat it is serialised and stored in the datastore entities of the mapper so the number\nof entries should be limited. The counters are rolled up from slice to shard to namespace\nand eventually to the job.\n\nAlthough I have no plans to build in any shuffle or reduce steps, I wanted to provide\nan easy way to write data to cloud storage (a primary use-case will be exporting data\nin JSON format for BigQuery). Cloud Storage provides the ability to write each file in\nchunks and then combine them which simplifies the file-writing operation substantially\n(probably not available when the original libraries were written). Each slice can write\nto its own file, overwriting on retry, and those slices can then be quickly rolled up\ninto a single shard file and then into a single namespace file (eventually to a single job\nfile). This is working but needs to be optimized and made more robust (e.g. to cope with\nthe limits on how many files can be combined).\n\nAnother option is to stream inserts directly into BigQuery which saves the intermediate\noutput writing and subsequent import from cloud storage (once streamed inserting into\npartitioned tables is an option, this will make this a better option). The lifecycle\nfunctions can be used to check for and create the BigQuery datasets and tables as part\nof the operation.\n\n### Example\nHere's an example of a mapper job outputting to cloud storage. 
The files written by each\nslice are rolled up into a single shard file and are fairly evenly distributed (there\nare around 55-60,000 JSON entries in each file):\n\n![Shard Files Example](https://cloud.githubusercontent.com/assets/304910/16933919/1185e24c-4d0f-11e6-84dc-c6e10e07be46.png)\n\nThe shard files are then rolled up into a single file for the namespace ready for\nimporting into BigQuery (which could be automated on namespace or job completion):\n\n![Namespace File Example](https://cloud.githubusercontent.com/assets/304910/16934010/b3658efa-4d0f-11e6-88bf-5f9463ad7f81.png)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcaptaincodeman%2Fdatastore-mapper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcaptaincodeman%2Fdatastore-mapper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcaptaincodeman%2Fdatastore-mapper/lists"}