{"id":29639169,"url":"https://github.com/neotomadb/clean_backup","last_synced_at":"2025-10-14T01:32:05.114Z","repository":{"id":284449153,"uuid":"870351564","full_name":"NeotomaDB/clean_backup","owner":"NeotomaDB","description":"A repository to generate automated backups for the Neotoma PostgreSQL database using AWS infrastructure","archived":false,"fork":false,"pushed_at":"2025-08-26T17:26:25.000Z","size":157,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-09-05T15:35:02.707Z","etag":null,"topics":["aws-batch","aws-cli","aws-ecr","aws-fargate","bash","dockerfile","neotoma","neotoma-database","nsf","paleoecology","postgresql"],"latest_commit_sha":null,"homepage":"https://neotomadb.org","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NeotomaDB.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":"code_of_conduct.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-10-09T22:08:36.000Z","updated_at":"2025-07-28T19:40:21.000Z","dependencies_parsed_at":"2025-03-26T00:30:45.940Z","dependency_job_id":"c67509cd-3962-4d8b-80e0-e4737c214a65","html_url":"https://github.com/NeotomaDB/clean_backup","commit_stats":null,"previous_names":["neotomadb/clean_backup"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/NeotomaDB/clean_backup","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2Fclean_backup","tags_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2Fclean_backup/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2Fclean_backup/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2Fclean_backup/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NeotomaDB","download_url":"https://codeload.github.com/NeotomaDB/clean_backup/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NeotomaDB%2Fclean_backup/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279017490,"owners_count":26086084,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-13T02:00:06.723Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws-batch","aws-cli","aws-ecr","aws-fargate","bash","dockerfile","neotoma","neotoma-database","nsf","paleoecology","postgresql"],"created_at":"2025-07-21T20:08:18.678Z","updated_at":"2025-10-14T01:32:05.102Z","avatar_url":"https://github.com/NeotomaDB.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![NSF-1948926](https://img.shields.io/badge/NSF-1948926-blue.svg)](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1948926)\n[![NSF-2410961](https://img.shields.io/badge/NSF-2410961-blue.svg)](https://www.nsf.gov/awardsearch
/showAward?AWD_ID=2410961)\n\n[![lifecycle](https://img.shields.io/badge/lifecycle-stable-green.svg)](https://www.tidyverse.org/lifecycle/#stable)\n\n\n# Neotoma Anonymized Backups\n\nThis repository generates a container service for Neotoma that copies the [Neotoma Paleoecology Database](https://neotomadb.org) into a Docker container and overwrites sensitive data using a random `md5` hash. The bash script running in the container then uploads the data to a Neotoma AWS S3 bucket where the snapshot is made publicly available through a URL that is shared on the Neotoma website.\n\nThe compressed file (`neotoma_clean_{DATETIME}.tar.gz`) includes a [bash script](archives/regenbash.sh) that will rebuild the database in a user's local Postgres instance. Currently the bash script runs only on Mac and Linux. There is an experimental [Windows batch script](archives/experimental_win_restore.bat) that can be used with caution.\n\nWe welcome user contributions; see the [contributors guide](CONTRIBUTING.md).\n\n## Restoring the Database\n\nThe most recent snapshot of the Neotoma Database will always be tagged as `neotoma_clean_latest` in the compressed file, but the actual SQL file used to restore the database will be named with the date the snapshot was taken. Snapshots are generally taken every month. If you need a more recent snapshot, please contact the database administrators to request one.\n\n### Postgres Extensions Used\n\nThe Docker container uses Postgres 15, and the current RDS database version is PostgreSQL v15.14. 
The local database requires the following extensions to be installed before you can restore Neotoma locally:\n\n* [pg_trgm](https://www.postgresql.org/docs/current/pgtrgm.html): Helps with full-text searching of publications.\n* [intarray](https://www.postgresql.org/docs/9.1/intarray.html)\n* [unaccent](https://www.postgresql.org/docs/current/unaccent.html): Helps with searches for terms that may include accents (site names, contact names).\n* External: [postgis](https://postgis.net/): Helps manage spatial data.\n\nThese extensions are used to improve functionality within the Neotoma Database. The `pg_trgm`, `intarray`, and `unaccent` extensions are included with PostgreSQL. External tools such as `postgis` must be installed on the system before the extension can be created within the Postgres server.\n\nThe [regenbash.sh](archives/regenbash.sh) script automates some of the creation of the extensions within the restored database.\n\n### Restoring from the Cloud\n\nThe *most recent* version of the clean database is always uploaded as a `.tar.gz` file to Neotoma S3 cloud storage. You can download it directly by clicking the badge below. Note that this download is over 2 GB in size.\n\n[![Download Snapshot](https://img.shields.io/badge/Download-Neotoma--Snapshot-orange.svg)](https://neotoma-remote-store.s3.us-east-2.amazonaws.com/neotoma_clean_latest.tar.gz)\n\nOnce the file is downloaded, you can extract it locally. The file archive contains the following files (the date in the SQL file name may differ):\n\n* dbsetup.sql\n* experimental_win_restore.bat\n* regenbash.sh\n* neotoma_clean_2025-07-01.sql\n\nOnce you execute `regenbash.sh` (Mac/Linux) or `experimental_win_restore.bat` (Windows), the database is restored from the SQL file into a local database named `neotoma`, which you can then use from whichever database management system you prefer.\n\n## AWS Infrastructure\n\nThe backup itself is generated through AWS. 
There are two steps: the first packages the Docker image and sends it to ECR; the second initiates the Batch job, which runs the scripts in the Docker container.\n\n![AWS Configuration](/assets/AWS_scrub_database_infrastructure.svg)\n\nAll files (with the exception of files that directly expose secrets) are available in this repository. All secrets are contained in a `parameters.yaml` file in the `./infrastructure` folder. We provide a [`parameters-template.yaml`](./infrastructure/parameters-template.json) file for convenience, so that users can see which key-value pairs are needed for full implementation of the workflow.\n\n### Docker Configuration\n\nThe Docker [configuration file](batch.Dockerfile) sets up a container with PostgreSQL 15 and PostGIS. The Docker container sets up the system, creates a connection to a containerized Postgres database, and then uses `pg_dump` to create a plaintext SQL dump of the remote Neotoma database that is restored within the container. To sanitize the database of sensitive data, we execute the script [`app/scrubbed_database.sh`](app/scrubbed_database.sh). The SQL statements overwrite rows in the Data Stewards tables as well as the Contacts tables.\n\nThe Docker container is built and deployed to the AWS ECR using the script [`build-and-push.sh`](build-and-push.sh). 
For this script to work, the user must have the AWS CLI installed, and have permissions to access Neotoma AWS services.\n\n### AWS Infrastructure Builder\n\nThe scripts [`deploy.sh`](deploy.sh) and [`update.sh`](update.sh) are used to deploy the [Batch Infrastructure](infrastructure/batch-infrastructure.yaml) configuration to CloudFormation, which will then be used to define the AWS Batch run when a job is submitted.\n\nWithin the infrastructure file there is a defined `ScheduleRule`, which uses the EventBridge [`cron()`](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-scheduled-rule-pattern.html) scheduler to execute the backup snapshot at 2am on the first day of each month.  Single instances of the job can also be executed using [`test_job.sh`](test_job.sh).\n\n## Final Overview\n\nWith this repository, we implement a monthly backup system using AWS infrastructure to provide Neotoma users with a sanitized version of the database for local use on their personal systems.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneotomadb%2Fclean_backup","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fneotomadb%2Fclean_backup","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneotomadb%2Fclean_backup/lists"}