{"id":19309968,"url":"https://github.com/cedadev/ceda-fbs","last_synced_at":"2025-10-06T01:07:46.649Z","repository":{"id":51058303,"uuid":"50506142","full_name":"cedadev/ceda-fbs","owner":"cedadev","description":"Repository for the fbs project.","archived":false,"fork":false,"pushed_at":"2021-09-22T14:24:17.000Z","size":2512,"stargazers_count":1,"open_issues_count":8,"forks_count":1,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-02-24T03:26:32.381Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cedadev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-01-27T12:35:03.000Z","updated_at":"2021-09-22T14:24:20.000Z","dependencies_parsed_at":"2022-08-19T21:11:19.052Z","dependency_job_id":null,"html_url":"https://github.com/cedadev/ceda-fbs","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/cedadev/ceda-fbs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedadev%2Fceda-fbs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedadev%2Fceda-fbs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedadev%2Fceda-fbs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedadev%2Fceda-fbs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cedadev","download_url":"https://codeload.github.com/cedadev/ceda-fbs/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedadev%2Fceda-fbs
/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278543101,"owners_count":26004131,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-05T02:00:06.059Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T00:22:02.109Z","updated_at":"2025-10-06T01:07:46.622Z","avatar_url":"https://github.com/cedadev.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Recipe for running CEDA FBS on the whole archive\n\n## Login to jasmin-sci2 server and locate yourself\n\n```\n$ ssh ${USER}@jasmin-sci2.ceda.ac.uk\n$ cd /group_workspaces/jasmin4/cedaproc/${USER}/\n$ mkdir fbs\n$ export BASEDIR=$PWD\n$ cd fbs/\n```\n\n## Get and install ceda-fbs code from Git (with install script)\n\n```\n$ wget https://raw.githubusercontent.com/cedadev/ceda-fbs/master/install-ceda-fbs.sh\n$ .  ./install-ceda-fbs.sh\n```\n\nThis will build you a `virtualenv` locally so your environment should look like:\n\n```\n$ ls\nceda-fbs  install-ceda-fbs.sh  venv-ceda-fbs\n```\n\n## Create a little setup script\n\n```\n$ cat setup_env.sh\nexport BASEDIR=/group_workspaces/jasmin4/cedaproc/$USER/fbs\nexport PYTHONPATH=$BASEDIR/ceda-fbs/python:$BASEDIR/ceda-fbs/python/src/fbs:$PYTHONPATH\nexport PATH=$PATH:$BASEDIR/ceda-fbs/python/src/fbs/cmdline\n. 
venv-ceda-fbs/bin/activate\n```\n\n## Configure servers and Elasticsearch index\n\nYou need to tell `ceda-fbs` some key things in the config file (ceda_fbs.ini) at:\n\n`$BASEDIR/ceda-fbs/python/config/ceda_fbs.ini`\n\nYou will need to edit the following sections:\n\n```\n log-path = /group_workspaces/jasmin/cedaproc/__INSERT_USERID_HERE__/fbs/logs-level-2\n es-host = jasmin-es1.ceda.ac.uk\t\t\t\t\t\t\n es-index = ceda-archive-level-2\t\t\t\t\t\t\t\n es-index-settings = /group_workspaces/jasmin/cedaproc/__INSERT_USERID_HERE__/fbs/ceda-fbs/elasticsearch/mapping/index_mapping.json\t\n num-files = 10000\t\t\n num-processes = 128\t\t\n```\n\nNOTE: change `__INSERT_USERID_HERE__` to your userid.\n\n## Check that your userid has access to the required groups to read the archive\n\nThe CEDA archive is made up of numerous datasets that are managed through Unix group permissions. You will need access to the following in order to successfully read files across the archive:\n\n* byacl\n* open\n* badcint\n* gws_specs\n* cmip5_research\n* esacat1\n* ecmwf\n* ukmo\n* eurosat\n* ukmo_wx\n* ukmo_clim\n\n## 1. Scan the file system for a list of all CEDA datasets\n\n```\n$ ceda-fbs/python/src/fbs/cmdline/create_datasets_ini_file.sh\nWrote datasets file to: ceda_all_datasets.ini\n```\n\nYou should now have an INI file that maps identifiers to dataset paths, i.e.:\n\n```\n$ head -3 ceda_all_datasets.ini\nbadc__abacus=/badc/abacus/data\nbadc__accacia=/badc/accacia/data\nbadc__accmip=/badc/accmip/data\n```\n\n## 2. 
Create file lists for every dataset (ready for the actual scanning)\n\nMake directories ready for file lists and log files:\n\n```\n$ mkdir logs datasets lotus_errors\n```\n\n*WARNING:* file lists can be *many Gbytes in size*, so don't do this in your home directory.\n\nNow run the first LOTUS jobs to generate lists of all files in each dataset.\n\n```\n$ make_file_lists.py -f ceda_all_datasets.ini -m $BASEDIR/datasets --host lotus\n```\n\nThis will submit lots of jobs to LOTUS.\n\n*NOTE:* To run a subset of these jobs locally you might do:\n\n```\n$ head -4 ceda_all_datasets.ini \u003e redo_datasets.ini\n$ make_file_lists.py -f redo_datasets.ini -m $BASEDIR/datasets --host localhost\n```\n\n### Running a test scan\n\nYou can run a test scan at this point. This will scan a single dataset on the local host and post the content to Elasticsearch:\n\n```\n$ scan_dataset.py -f ceda_all_datasets.ini -d badc__ukmo-nimrod --make-list $BASEDIR/datasets/badc__ukmo-nimrod.txt\n```\n\nAt this stage you might want to examine which datasets were not scanned, and why. The above command gives you a way to run the scan for individual datasets.\n\n## 3. Create a set of commands to run the full scan\n\nCreate a set of commands ready to send to LOTUS that will scan the entire archive. They will use the file lists from (2) as their inputs.\n\n```\n$ scan_archive.py --file-paths-dir $BASEDIR/datasets --num-files 10000 --level 2 --host lotus\n```\n\nThis generates a file inside the current directory called: `lotus_commands.txt`. Each command specifies a list of up to 10,000 data files that are to be scanned when the job runs on LOTUS. (The `lotus_commands.txt` file will contain about 25,000 lines/commands.)\n\n## 4. 
Execute the scan commands on LOTUS\n\nBefore you do this, create `~/.forward` (containing just your email address) so that LOTUS messages will be mailed to you.\n\nNext, run the `run_commands_in_lotus.py` script to work its way through the list of commands inside the `lotus_commands.txt` file by submitting up to 128 at any one time.\n\nOn `jasmin-sci[12].ceda.ac.uk`, run:\n\n```\n$ run_commands_in_lotus.py -f lotus_commands.txt\n```\n\nYou can then view your job queue on LOTUS with:\n\n```\n$ squeue -u $USER\n```\n\n## 5. Watch the file count building\n\nYou can see how things are progressing in the web interface:\n\n https://kibana.ceda.ac.uk\n \n## 6. Make some optimisations to the Elasticsearch settings\n\nMake these settings using `curl`, `wget` or in Kibana.\n\nSet the index to NOT use replica shards by calling the following:\n\n```\nPUT ceda-archive-level-2/_settings\n{\n    \"number_of_replicas\": 0\n}\n```\n\nSet the number of shards for each host to 1 by calling the following:\n\n```\nPUT /ceda-archive-level-2/_settings\n{\n    \"index.routing.allocation.total_shards_per_node\": 1\n\n}\n```\n\n## Analyse the log files to see where there were failures\n\nThere is a script to help us work out which files we could not scan:\n\n```\n$ scan_logfiles.py log-levels-2\n```\n\nIt shows a table of details like:\n\n```\nDataset                                  Indexed              Total files          Properties errors    Database errors      Status\n---------------------------------------------------------------------------------------------------------------------------------------------------\nbadc__abacus                             6                    6                    0                    0                    ok\nbadc__accacia                            1398                 1408                 10                   0                    ok\nbadc__accmip                             93322                283218               0                    0                
    errors\nbadc__acid-deposition                    20                   20                   0                    0                    ok\nbadc__acsoe                              2106                 2106                 0                    0                    ok\nbadc__active                             92                   92                   0                    0                    ok\nbadc__adriex                             4222                 4222                 0                    0                    ok\nbadc__amazonica                          5                    5                    0                    0                    ok\nbadc__amma                               177118               189322               0                    0                    errors\nbadc__amps_antarctic                     7953                 7953                 0                    0                    ok\n```\n\nNOTE: The FBS package will *ignore* files that are symbolic links. These will show up as \"Properties errors\".\n\n\n# Querying the results\n\nHere are some example queries you might use:\n\n## Query everything\n\n```\nPOST _search\n{\n  \"size\": 20, \n   \"query\": {\n      \"match_all\": {}\n   }\n}\n```\n\nNote the \"size\" tag is the number of results to return (the default is 10).\n\n## Query a specific filename\n\nLet’s look for the file: `SCI_NL__1PWDPA20100911_194148_000060212092_00472_44614_2837.N1.gz`\n\n```\nPOST _search\n{\n   \"query\": {\n      \"query_string\": {\n         \"query\": \"SCI_NL__1PWDPA20100911_194148_000060212092_00472_44614_2837.N1.gz\",\n         \"fields\": [\"info.name\"]\n      }\n   }\n}\n```\n\nThe important thing here is that we are searching for \"info.name\".\n\n## Filter specific file type (extension)\n\n```\nPOST _search\n{\n   \"query\": {\n      \"constant_score\": {\n         \"filter\": {\n            \"term\": {\n               \"info.type\": \"nc\"\n            }\n         }\n      }\n   }\n}\n```\n\n## List 
the contents of a directory\n\n```\nPOST _search\n{\n   \"query\": {\n      \"constant_score\": {\n         \"filter\": {\n            \"term\": {\n               \"info.directory\": \"/neodc/sciamachy/data/l1b/v7-04/2010/09/11\"\n            }\n         }\n      }\n   }\n}\n```\n\nSome useful (internal to CEDA) links on querying ES:\n\nhttp://team.ceda.ac.uk/trac/ceda/wiki/FBS/PhenomenonSearch\n\nhttp://team.ceda.ac.uk/trac/ceda/wiki/ElasticSearch/BasicQueries\n\nhttp://team.ceda.ac.uk/trac/ceda/ticket/23247\n\n## Command Line Scripts\n\n|           Script name            |                  Description               |\n| -------------------------------- | ------------------------------------------ |\n| check_incompete_spots.py         |                                            |\n| create_datasets_ini_from_spot.sh |                                            |\n| display_es_stats.py              |                                            |\n| fbs_api.py                       |                                            |\n| fbs_live.py                      |                                            |\n| get_es_stats.py                  |                                            |\n| make_file_lists.py               |                                            |\n| run_commands_in_lotus.py         |                                            |\n| scan_archive.py                  |                                            |\n| scan_dataset.py                  |                                            |\n| scan_logfiles.py                 |                                            |","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcedadev%2Fceda-fbs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcedadev%2Fceda-fbs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcedadev%2Fceda-fbs/lists"}