{"id":25279917,"url":"https://github.com/christophgil/jit-fileprovider","last_synced_at":"2025-10-06T09:29:43.025Z","repository":{"id":276329215,"uuid":"887799177","full_name":"christophgil/jit-fileprovider","owner":"christophgil","description":"Just-in-time file provider","archived":false,"fork":false,"pushed_at":"2025-02-14T17:50:52.000Z","size":25,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-14T18:33:43.491Z","etag":null,"topics":["cluster","computing","file-system","file-transfer","filesystem","high-performance-computing","hpc","linux-cluster","rsync","scp","zip","zip-files"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/christophgil.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-13T10:00:25.000Z","updated_at":"2025-02-14T17:50:56.000Z","dependencies_parsed_at":"2025-02-07T15:37:37.592Z","dependency_job_id":"146e5906-c211-47da-8749-83f57b77aefe","html_url":"https://github.com/christophgil/jit-fileprovider","commit_stats":null,"previous_names":["christophgil/jit-fileprovider"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/christophgil%2Fjit-fileprovider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/christophgil%2Fjit-fileprovider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/christophgil%2Fjit-fileprovider/releases","manifests_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/christophgil%2Fjit-fileprovider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/christophgil","download_url":"https://codeload.github.com/christophgil/jit-fileprovider/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247479440,"owners_count":20945460,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cluster","computing","file-system","file-transfer","filesystem","high-performance-computing","hpc","linux-cluster","rsync","scp","zip","zip-files"],"created_at":"2025-02-12T18:05:50.548Z","updated_at":"2025-10-06T09:29:37.978Z","avatar_url":"https://github.com/christophgil.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"Just-in-time file provider\n\nStatus:   Testing\n\n\nDESCRIPTION\n===========\n\nJust-in-time-file-provider provides files for long running computations from archives or remote sources.  In\nmass-spectrometry analysis, for example, thousands of mass-spectrometry files are processed one after the\nother by the analysis software. The files might be located in a NAS storage. In our case they are organized as ZIP archives.\n\n\nThe program logic is implemented as a pre-loaded shared library.  At the time the analysis software\nis started, the required files are usually not accessible yet.  
JIT-file-provider\nobserves all file requests by the analysis software and provides the requested files\njust-in-time.\n\n\nThe program logic for obtaining the files is implemented in the shell scripts *hook.sh*\nand *hook_configuration.sh* and can easily be customized.\n\nOptionally, a file list can be provided such that for each file the successor file is known.  This\nallows the next file to be loaded while the previous file is still being processed.\n\nMotivation\n==========\n\nWe are processing large volumes of mass-spectrometry data on a high-performance Linux cluster.\n\nThe data is located in ZIP archives on WORM storage.\n\n\u003c!-- Fuse file systems are not supported on the cluster.  Anyway, mass-spectrometry file loading by the application would not work well with files --\u003e\n\u003c!-- on remote or fuse file systems.  This is because file reading is saltatoric and random rather than sequential. --\u003e\n\nThe conventional approach is to copy all required files to the cluster prior to starting the\ncomputation. However, copying that many files takes hours to days and requires large disk space\non the cluster, which is often not available.\n\n\nImplementation\n==============\n\nCalls to C library functions are intercepted in order to find out which files are going to be loaded by the application.\nJIT-file-provider is a pre-loaded shared library which calls a Bash script *hook.sh* to load the required file or files.\n\n\nThe user defines rules that specify which files are obtained by which method.   In our case\na file may be loaded by running */usr/bin/unzip*:\n\n    sshpass -e ssh   user@hostname  nocache unzip -p zip-file.zip  zip-entry \u003e file\n\nSubsequently, the CRC32 checksum of the file is compared to the checksum in the ZIP file.\nAlternatively, files are copied with */usr/bin/scp* or ZIP archives are mounted with */usr/bin/fuse-zip*.\n\nThe last-access time is used for clean-up.  
Files that have not been used for a given number of minutes are automatically removed\nto free disk space on the cluster.  The last-access time is updated explicitly in case the mount option *noatime* is active.\n\n\nInstallation\n=============\n\n\n\nJIT-file-provider is installed on the target machine, for example the HPC cluster, by running the installer script *libjit_file_provider.compile.sh*.\nThe C compiler gcc or clang is sufficient.\n\nThree files are generated:\n\n - ~/.jit_file_provider/libjit_file_provider.so\n - ~/.jit_file_provider/hook.sh\n - ~/.jit_file_provider/hook_configuration.sh\n\nRequired Linux packages: fuse-zip nocache unzip sshpass openssh\n\nTesting\n=======\n\nSSH needs to be set up to work unattended without entering a password. This may be done for the current or a different user ID.\nA simple and secure approach is to create a user ID with read access to the data and to set the variable *SSHPASS* to the password of this user ID:\n\n    export SSHPASS='the secret password'\n\nCheck with\n\n     sshpass -e ssh the-user-id@localhost date\n\nThis command should display the date and time without asking for the password.\nRun the script\n\n    testing/testing_JIT_file_provider.sh\n\nThis script creates a ZIP file repository, which simulates the file repository from which the files need to be extracted, in\n\n    ~/test_JIT_file_provider\n\n\n\u003c!-- This folder name serves as a pattern in the configuration files *hook_configuration.sh* and *jit_file_provider_configuration.c*. --\u003e\n\u003c!-- JIT-file-provider accesses the ZIP entries using one of the methods --\u003e\n\nThe program menu lets you specify a user ID. Then you can choose one of the following methods.\n\n  - fuse-zip.  No user ID and password required.\n  - ssh unzip. In this example it will be applied to files ending with .txt\n  - scp.       
In this example it will be applied to files ending with .zip\n\n\nThe test script simulates an application which expects files in\n\n    ~/.jit_file_provider/files\n\nLook for green success reports in the output and observe how files appear in this folder.\n\nConfiguration\n=============\n\nFor configuration, it is recommended to install and test JIT-file-provider on a local Linux PC before moving to a high-performance cluster.\n\nIn the configuration files, the rules for obtaining files are specified.\nThen a test command like the following may be used for validation:\n\n    LD_PRELOAD=~/.jit_file_provider/libjit_file_provider.so head ~/.jit_file_provider/files/file-path | strings\n\nThe respective files appear in *~/.jit_file_provider/files* as soon as they are loaded by the command, here */usr/bin/head*.\n\nVerbose output can be activated with environment variables:\n\n    export VERBOSE_HOOK=1\n    export VERBOSE_SO=1\n\n\nConfigurable files:\n\n - jit_file_provider_symbols_configuration.c\n   This file lists the C functions to be observed by JIT-file-provider.\n   Usually, this file does not need to be modified.\n   However, problems may occur when C functions are implemented by other library functions.\n   For example, JIT-file-provider worked fine on our development machine, but failed on the HPC cluster because\n   the function *stat()* was implemented with *statx()* in the standard C library.\n   To identify problems like this, JIT-file-provider reports all caught functions once as *Calling hook ...*; however, *stat()* did not appear.\n   With the tool */usr/bin/strace* we found that *statx()* and not *stat()* is reported. 
Adding it to the list in jit_file_provider_symbols_configuration.c solved the problem.\n   Please report cases like this.\n\n - jit_file_provider_configuration.c:\n   When libjit_file_provider.so catches calls to functions like fopen(), the paths are evaluated by the function *configuration_filelist()*, which returns\n   a NULL-terminated list of files needed along with the given path.\n   This list may be empty for files not to be managed by JIT-file-provider.\n   In our example a path with the ending \".d\" returns *path/analysis.tdf* and *path/analysis.tdf_bin*.\n   In other cases the list might contain only the file path itself.\n\n - ~/.jit_file_provider/hook_configuration.sh\n   This script describes how files are obtained from the file source.\n   This can be scp, ssh unzip or fuse-zip.\n   It also contains the rules for clean-up, i.e. the removal of files that have not been used for some time.\n\n\n\n\nUsage\n=====\n\nThe JIT-file-provider shared library is pre-loaded when the software is run:\n\n     LD_PRELOAD=~/.jit_file_provider/libjit_file_provider.so    the-command  the arguments\n\n\n\nAhead of time\n=============\n\nComputation time on the cluster is valuable, and network loading and computation can run simultaneously.\n\nWhen a list of files is provided in the environment variable *FILELIST*, it is known which file will come next. 
These files can be prefetched in anticipation of them being needed soon.\n\nThe shell script itself can serve as this list since only those strings that look like an absolute path are considered.\n\nFuse-zip\n========\n\nIf the ZIP archives are accessible through the file system, the software can also mount ZIP files such that the analysis software can load ZIP entries as files.\n\nJIT-file-provider can unmount ZIP files that have not been used for a number of minutes.\nThis avoids a large number of simultaneous mounts, which may cause problems.\n\n\nParallel access to conventional spinning hard-disks\n===================================================\n\nSeveral parallel HPC jobs may lead to increased seeks and movements of the head of the HD hosting the file archive.\nThis may degrade performance, put strain on the HD and shorten its life span.\n\nWe are currently experimenting with increasing the read-ahead and the unzip buffer:\n\n    echo 2048 \u003e /sys/block/sda/queue/read_ahead_kb\n    echo 24 \u003e /sys/block/sdd/queue/iosched/fifo_batch\n\n\nFurthermore, we are experimenting with the INBUFSIZ of unzip:\n\n    apt-get source unzip\n\nAt the top of unzip.c:\n\n    #define INBUFSIZ 0x400000\n\nCompile with\n\n    make -f unix/Makefile unzip\n\n\nApply JIT-file-provider for High-Performance-Clusters computations\n==================================================================\n\nIt is recommended to set up the configuration on the user's Linux machine and not on the HPC.\nOnce the configuration works, it can be tested on the HPC.\n\nThe script that is run on the user's machine to start the HPC jobs should:\n\n    - rsync the source files to the HPC.\n    - run an HPC job which calls ~libjit_file_provider.compile.sh\n    - check for the existence on the HPC of\n        + ~/.jit_file_provider/libjit_file_provider.so\n        + ~/.jit_file_provider/crc32\n\n\nHistory\n=======\n\nOriginally, this library was developed to run Diann computations with thousands of files.  
Since the\nmass-spectrometry files were archived as ZIP files, all required files were mounted and then Diann\nwas started.  Unfortunately, the Linux computer became slow with that many simultaneous mounts.\n\nThe idea came up to mount only the ZIP file that is currently required by Diann and to unmount\nit when the next file gets loaded.\n\nUnfortunately, this did not work at the time, when Diann was still a Windows program running in the Wine environment on a Linux PC.\n\nAs a workaround, the ZIPsFS fuse file system was developed.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchristophgil%2Fjit-fileprovider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchristophgil%2Fjit-fileprovider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchristophgil%2Fjit-fileprovider/lists"}