{"id":13558544,"url":"https://github.com/openspending/audit-for-migration-2020","last_synced_at":"2026-05-18T13:38:24.503Z","repository":{"id":137668997,"uuid":"276983689","full_name":"openspending/audit-for-migration-2020","owner":"openspending","description":"Audit of OpenSpending (DBs, APIs etc) for 2020 migration to Datopian stewardship","archived":false,"fork":false,"pushed_at":"2020-12-25T18:39:09.000Z","size":2324,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-11-04T09:37:25.533Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/openspending.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-07-03T20:39:32.000Z","updated_at":"2021-04-03T22:22:46.000Z","dependencies_parsed_at":"2024-01-17T06:11:39.096Z","dependency_job_id":"0fc641b3-895c-4689-853e-76e12603dc3a","html_url":"https://github.com/openspending/audit-for-migration-2020","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openspending%2Faudit-for-migration-2020","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openspending%2Faudit-for-migration-2020/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openspending%2Faudit-for-migration-2020/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openspending%2Faudit-for-migration-2020/manifests","owner_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/openspending","download_url":"https://codeload.github.com/openspending/audit-for-migration-2020/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247009621,"owners_count":20868578,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T12:05:00.987Z","updated_at":"2025-12-27T10:32:02.232Z","avatar_url":"https://github.com/openspending.png","language":"Shell","funding_links":[],"categories":["Shell","others"],"sub_categories":[],"readme":"# Table of contents\n\n  * [More on the OpenSpending database](#more-on-the-openspending-database)\n  * [OpenSpending database - Basic statistics](#openspending-database---basic-statistics)\n  * [Database samples](#database-samples)\n  * [How we filter what to keep when auditing](#how-we-filter-what-to-keep-when-auditing)\n  * [How we retrieve the metadata from Elasticsearch](#how-we-retrieve-the-metadata-from-elasticsearch)\n    * [Prerequisites](#prerequisites)\n    * [Guide](#guide)\n* [How we retrieve data from S3 and store it in Google Cloud Storage with a Giftless server](#how-we-retrieve-data-from-s3-and-store-it-in-google-cloud-storage-with-a-giftless-server)\n## More on the OpenSpending database\n\nYou can find more information about how we are auditing the database in this public spreadsheet:\n\n- https://docs.google.com/spreadsheets/d/1Cno0jUkfl8ozf0qjaNhuCmJEmMaL2U65sNONqVTEEnM/edit?usp=sharing\n\n## OpenSpending database - Basic statistics\n\n| Item                                         
                           | Value       |\n| ----------------------------------------------------------------------- | ----------- |\n| Total number of tables                                                  | 8,233       |\n| Total number of rows across all tables                                  | 153,884,300 |\n| Number of tables with at least 100 rows                                 | 3,361       |\n| Number of tables with number of rows between 10 and 99 (both inclusive) | 1,796       |\n| Number of empty tables reported                                         | 298         |\n\n## Database samples\n\nYou can find relevant samples for tables with high read/write access under the [samples/](samples) directory with an accompanying `datapackage.json` for each sample, which is a useful descriptor file containing metadata about a data package, a format extensively used as part of the Frictionless Data tooling as a convenient way to represent tabular data. Get to know more about this practical open-source toolkit at [frictionlessdata.io](https://frictionlessdata.io).\n\n---\n\n## How we filter what to keep when auditing\n\nWe have a list of filters to determine what to keep/discard from the database (see https://github.com/openspending/openspending/issues/1482 for details). Those filters are applied through formulae in the [spreadsheet](https://docs.google.com/spreadsheets/d/1Cno0jUkfl8ozf0qjaNhuCmJEmMaL2U65sNONqVTEEnM/edit):\n\n- Column `D`, titled `Keep?`, contains the formula with the filters. 
It outputs either `yes` or `no` to answer the question \"should we keep it\" when inquiring about a table.\n- The same formula is applied to all the rows in column `D`, once for each table in the database.\n\nThis technique is used to quickly list what's relevant and hide what is not by filtering column `D` to select only the rows where `yes` appears so that we can export a shorter list of tables containing only useful data.\n\n---\n\n## How we retrieve the metadata from Elasticsearch\n\n_The following guide was written by [Victor Nițu](https://github.com/nightsh)._\n\n### Prerequisites\n\n- Access to the OKFN k8s cluster / `production` namespace.\n- `kubectl` binary.\n- `elasticsearch-dump` or Docker. See\n  https://github.com/elasticsearch-dump/elasticsearch-dump\n- Internet :slightly_smiling_face:.\n\n\u003e _NOTE: The steps described below were only tested in a GNU/Linux environment. While they might work in other OSes, some adjustments might be necessary._\n\n### Guide\n\n1. Make sure you're using the right k8s context (check with `kubectl config get-contexts`).\n2. Port forward the ES coordinating service to your local machine:\n    `kubectl port-forward -n production service/openspending-production-elasticsearch-coordinating-only 9200:9200`\n3. (optional) Install `elasticsearch-dump`:\n    `npm i -g elasticsearch-dump`\n4. Check indices and their names:\n    `curl -X GET \"localhost:9200/_aliases\" -H 'Content-Type: application/json'`\n5. If you installed via `npm`, run the binary, else you can just spin up a temporary Docker container. Here's how to output to `STDOUT`:\n\n        $ elasticsearch-dump \\\n        --input=http://localhost:9200/packages/ \\\n        --output=$ \\\n        --type=data\n\n    OR\n\n        $ docker run --rm -ti elasticdump/elasticsearch-dump \\\n        --input=http://localhost:9200/packages/ \\\n        --output=$ \\\n        --type=data\n\n    If you see a lot of output, it works. 
Now, to create a more useful dump, we will dump the analyzer, mapping, and data into separate files.\n\n6. Let's run:\n\n        $ docker run --net=host -v /tmp:/data --rm -ti \\\n        elasticdump/elasticsearch-dump \\\n        --input=http://localhost:9200/packages/ \\\n        --output=/data/packages_analyzer.json \\\n        --type=analyzer\n\n        $ docker run --net=host -v /tmp:/data --rm -ti \\\n        elasticdump/elasticsearch-dump \\\n        --input=http://localhost:9200/packages/ \\\n        --output=/data/packages_mapping.json \\\n        --type=mapping\n\n        $ docker run --net=host -v /tmp:/data --rm -ti \\\n        elasticdump/elasticsearch-dump \\\n        --input=http://localhost:9200/packages/ \\\n        --output=/data/packages_data.json \\\n        --type=data\n\n7. All your dumps are now stored in `/tmp/packages_{analyzer,mapping,data}.json`.\n8. To import, use the reverse command (input is a file, output is an ES endpoint).\n\nMore usage information is available [here](https://github.com/elasticsearch-dump/elasticsearch-dump#use).\n\n---\n\n# How we retrieve data from S3 and store it in Google Cloud Storage with a Giftless server\n\nThe whole process is described in [this short guide](./s3_to_google_cloud_storage/README.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenspending%2Faudit-for-migration-2020","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenspending%2Faudit-for-migration-2020","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenspending%2Faudit-for-migration-2020/lists"}