{"id":19983174,"url":"https://github.com/lsst-dm/qserv-ingest","last_synced_at":"2025-05-04T05:33:07.470Z","repository":{"id":37467514,"uuid":"258486595","full_name":"lsst-dm/qserv-ingest","owner":"lsst-dm","description":"Tools for loading extremely large datasets inside Qserv","archived":false,"fork":false,"pushed_at":"2024-10-29T07:25:19.000Z","size":74477,"stargazers_count":6,"open_issues_count":2,"forks_count":1,"subscribers_count":11,"default_branch":"main","last_synced_at":"2024-10-29T08:23:11.212Z","etag":null,"topics":["bigdata","cosmology","database","kubernetes","parallel"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lsst-dm.png","metadata":{"files":{"readme":"README.dp02_dc2_catalogs","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-04-24T10:56:58.000Z","updated_at":"2024-10-29T07:25:24.000Z","dependencies_parsed_at":"2024-08-01T22:33:35.131Z","dependency_job_id":"3f0b0a04-9b88-460c-b106-16340bc356e0","html_url":"https://github.com/lsst-dm/qserv-ingest","commit_stats":null,"previous_names":[],"tags_count":40,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lsst-dm%2Fqserv-ingest","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lsst-dm%2Fqserv-ingest/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lsst-dm%2Fqserv-ingest/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lsst-dm%2Fqserv-ingest/manifests","owner_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/owners/lsst-dm","download_url":"https://codeload.github.com/lsst-dm/qserv-ingest/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224385342,"owners_count":17302468,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigdata","cosmology","database","kubernetes","parallel"],"created_at":"2024-11-13T04:14:19.359Z","updated_at":"2024-11-13T04:14:19.797Z","avatar_url":"https://github.com/lsst-dm.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Scope\n----------------------------------------------------------------------------------\nThis folder has a collection of the configuration files for the Ingest system and\nthe partitioned CSV files of the `dp02_dc2_catalogs` that are ready to be ingested\ninto Qserv.\n\nThe configuration file for the catalog 'dp02_dc2_catalogs'\n------------------------------------------------------------------------------------------------------------------\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config/dp02_dc2_catalogs.json\n\n\nThe configuration files for the 
tables\n---------------------------------------------------------------------------------------------------------\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config/CcdVisit.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config/DiaObject.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config/DiaSource.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config/ForcedSource.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config/ForcedSourceOnDiaObject.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config/MatchesTruth.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config/Object.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config/Source.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config/Visit.json\n\n\nThe configuration files for creating table-level indexes at workers for the 
tables\n--------------------------------------------------------------------------------------------------------------------------------\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_CcdVisit_ccdVisitId.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_CcdVisit_visitId.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_DiaObject_diaObjectId.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_DiaObject_subChunkId.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_DiaSource_diaObjectId.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_DiaSource_diaSourceId.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_ForcedSource_objectId.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_ForcedSourceOnDiaObject_ccdVisitId.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_ForcedSourceOnDiaObject_diaObjectId.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_ForcedSourceOnDiaObject_forcedSourceOnDiaObjectId.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_MatchesTruth_id.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_MatchesTruth_index.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_MatchesTruth_match_objectId.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/id
x_Object_coord_dec.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_Object_coord_ra.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_Object_objectId.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_Object_subChunkId.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_Source_ccdVisitId.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_Source_sourceId.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_Source_subChunkId.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_Source_visit.json\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/config_indexes/idx_Visit_visit.json\n\n\nCollections of links to the CSV files (contributions) for each 
table\n-----------------------------------------------------------------------------------------------------------------------------\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/csv/dp02_dc2_catalogs_CcdVisit.https.url\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/csv/dp02_dc2_catalogs_DiaObject.https.url\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/csv/dp02_dc2_catalogs_DiaSource.https.url\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/csv/dp02_dc2_catalogs_ForcedSource.https.url\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/csv/dp02_dc2_catalogs_ForcedSourceOnDiaObject.https.url\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/csv/dp02_dc2_catalogs_MatchesTruth.https.url\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/csv/dp02_dc2_catalogs_Object.https.url\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/csv/dp02_dc2_catalogs_Source.https.url\nhttps://storage.googleapis.com/qserv-us-central1-argo-artifact/dp02/PREOPS-905/in2p3/csv/dp02_dc2_catalogs_Visit.https.url\n\n\nImportant notes and additional instructions on the ingest\n---------------------------------------------------------\n\n\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -\nThe configuration files and links to the CSV files for the partitioned table 'TruthSummary'\nare not presently available due to an issue with duplicate rows found in the table.\nThe issue is being investigated.\n\n\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -\n'Source' is the most problematic table for the Ingest system as it has over 1.6 million\nindividual file contributions. 
Based on prior experience of ingesting the table\ninto Qserv at IDF (5 worker nodes), it may take 48 hours or longer to ingest the table.\nThe large number of files may also cause the Qserv worker ingest service to run out\nof memory and be terminated by Kubernetes due to a known problem in the memory management\nmodel of the Ingest service. The current implementation of the service does not\nrelease the memory allocated for each file contribution. This memory is used for maintaining\nthe status of the contributions and it was (originally) meant to speed up status inquiries\nfor the previously submitted contribution requests. While serving its purpose, the\nmodel also results in steady memory growth of the process 'qserv-replica-worker'.\nIn the 5-worker configuration each worker process may grow up to 128 GB by the end of the\ningest. A possible solution to the problem is to split the collection of the 'Source'\ncontributions into smaller subsets and ingest each subset in a separate set of\nsuper-transactions. It's important to restart the worker ingest service 'qserv-replica-worker'\nbefore ingesting each such subset. In IDF (5 workers, 64 GB RAM per worker) the collection\nhad to be split into 2 subsets. 
In Qserv instances that have a larger number of worker\nnodes (at least 10) with the same (or larger) amount of memory per node the split may\nnot be necessary.\n\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -\nIt's recommended to ingest each table in its own set of super-transactions.\n\nIt's recommended to start at least 10 super-transactions for ingesting the large tables\nObject, Source, ForcedSource, DiaObject, DiaSource and ForcedSourceOnDiaObject.\n\nOne super-transaction is sufficient for ingesting the small \"regular\" (fully-replicated)\ntables: Visit, CcdVisit and MatchesTruth.\n\n\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -\nDue to the large amount of data in the catalog (over 36 TB), the catalog publishing stage\nwill take many hours (12 hours or longer).\n\n\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -\nFor the very same reason building the table-level indexes will also take many hours.\n\n\n- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -\nAfter publishing the catalog, it's recommended to build the row counter statistics that\nare used for optimizing unconstrained queries like:\n\n  SELECT COUNT(*) FROM dp02_dc2_catalogs.Object\n\nOtherwise Qserv will resort to using the shared scan to count rows in the tables. 
Given\nthe large scale of the catalog, the query may take a while (it takes over 2 hours in IDF\nfor the table 'ForcedSource').\nThe following will build the desired statistics and deploy it in the Qserv Czar database:\n\n  mkdir logs;\n  for table in Object Source ForcedSource DiaObject DiaSource ForcedSourceOnDiaObject Visit CcdVisit MatchesTruth; do\n    curl 'http://localhost:8080/ingest/table-stats' \\\n      -X POST \\\n      -H 'Content-Type: application/json' \\\n      -d'{\"auth_key\":\"\",\"database\":\"dp02_dc2_catalogs\",\"table\":\"'${table}'\",\"row_counters_state_update_policy\":\"ENABLED\",\"row_counters_deploy_at_qserv\":1,\"force_rescan\":1}' \\\n      -ologs/table-stats.${table}.json \\\n      \u003e\u0026 logs/table-stats.${table}.log;\n  done;\n\nDue to the large amount of data in the catalog (over 36 TB), this operation will take many hours\nas it requires scanning each table.\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flsst-dm%2Fqserv-ingest","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flsst-dm%2Fqserv-ingest","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flsst-dm%2Fqserv-ingest/lists"}