{"id":26788340,"url":"https://github.com/vertebrateresequencing/muxfys","last_synced_at":"2025-04-19T20:51:12.000Z","repository":{"id":57491125,"uuid":"90880091","full_name":"VertebrateResequencing/muxfys","owner":"VertebrateResequencing","description":"High performance multiplexed user fuse mounting","archived":false,"fork":false,"pushed_at":"2023-03-07T02:10:20.000Z","size":355,"stargazers_count":19,"open_issues_count":8,"forks_count":2,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-06-21T02:19:34.127Z","etag":null,"topics":["fuse-filesystem","golang","multiplexer","s3"],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VertebrateResequencing.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-05-10T15:30:48.000Z","updated_at":"2023-09-08T17:24:51.000Z","dependencies_parsed_at":"2024-06-19T00:07:46.440Z","dependency_job_id":null,"html_url":"https://github.com/VertebrateResequencing/muxfys","commit_stats":null,"previous_names":[],"tags_count":21,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VertebrateResequencing%2Fmuxfys","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VertebrateResequencing%2Fmuxfys/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VertebrateResequencing%2Fmuxfys/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VertebrateResequencing%2Fmuxfys/manifests","own
er_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VertebrateResequencing","download_url":"https://codeload.github.com/VertebrateResequencing/muxfys/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246187203,"owners_count":20737463,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fuse-filesystem","golang","multiplexer","s3"],"created_at":"2025-03-29T13:17:22.788Z","updated_at":"2025-03-29T13:17:23.458Z","avatar_url":"https://github.com/VertebrateResequencing.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# muxfys\n\n[![GoDoc](https://godoc.org/github.com/VertebrateResequencing/muxfys?status.svg)](https://godoc.org/github.com/VertebrateResequencing/muxfys)\n[![Go Report Card](https://goreportcard.com/badge/github.com/VertebrateResequencing/muxfys)](https://goreportcard.com/report/github.com/VertebrateResequencing/muxfys)\n[![Build Status](https://travis-ci.org/VertebrateResequencing/muxfys.svg?branch=master)](https://travis-ci.org/VertebrateResequencing/muxfys)\n[![Coverage Status](https://coveralls.io/repos/github/VertebrateResequencing/muxfys/badge.svg?branch=master)](https://coveralls.io/github/VertebrateResequencing/muxfys?branch=master)\n\n    go get github.com/VertebrateResequencing/muxfys\n\n\nmuxfys is a pure Go library for temporarily in-process mounting multiple\ndifferent remote file systems or object stores on to the same mount point as a\n\"filey\" system. 
Currently only support for S3-like systems has been implemented.\n\nIt has high performance, and is easy to use with nothing else to install, and no\nroot permissions needed (except to initially install/configure fuse: on old\nLinux you may need to install fuse-utils, and for macOS you'll need to install\nosxfuse; for both you must ensure that 'user_allow_other' is set in\n/etc/fuse.conf or equivalent).\n\nIt has good S3 compatibility, working with AWS Signature Version 4 (Amazon S3,\nMinio, et al.) and AWS Signature Version 2 (Google Cloud Storage, OpenStack\nSwift, Ceph Object Gateway, Riak CS, et al.).\n\nIt allows \"multiplexing\": you can mount multiple different S3 buckets (or sub\ndirectories of the same bucket) on the same local directory. This makes commands\nyou want to run against the files in your buckets much simpler, eg. instead of\nmounting s3://publicbucket, s3://myinputbucket and s3://myoutputbucket to\nseparate mount points and running:\n\n    $ myexe -ref /mnt/publicbucket/refs/human/ref.fa -i /mnt/myinputbucket/xyz/123/\n      input.file \u003e /mnt/myoutputbucket/xyz/123/output.file\n\nYou could multiplex the 3 buckets (at the desired paths) on to the directory you\nwill work from and just run:\n\n    $ myexe -ref ref.fa -i input.file \u003e output.file\n\nIt is a \"filey\" system ('fys' instead of 'fs') in that it cares about\nperformance and efficiency first, and POSIX second. It is designed around a\nparticular use-case:\n\nNon-interactively read a small handful of files whose paths you already know,\nprobably a few times for small files and only once for large files, then upload\na few large files. Eg. we want to mount S3 buckets that contain thousands of\nunchanging cache files, and a few big input files that we process using those\ncache files, and finally generate some results.\n\nIn particular this means we hold on to directory and file attributes forever and\nassume they don't change externally. 
Permissions are ignored and only you get\nread/write access.\n\nWhen using muxfys, you 1) mount, 2) do something that needs the files in your S3\nbucket(s), 3) unmount. Then repeat 1-3 for other things that need data in your\nS3 buckets.\n\n# Performance\n\nTo get a basic sense of performance, a 1GB file in a Ceph Object Gateway S3\nbucket was read, twice in a row for tools with caching, using the methods that\nworked for me (I had to hack minfs to get it to work); units are seconds\n(average of 3 attempts) needed to read the whole file:\n\n    | method         | fresh | cached |\n    |----------------|-------|--------|\n    | s3cmd          | 5.9   | n/a    |\n    | mc             | 7.9   | n/a    |\n    | minfs          | 40    | n/a    |\n    | s3fs           | 12.1  | n/a    |\n    | s3fs caching   | 12.2  | 1.0    |\n    | muxfys         | 5.7   | n/a    |\n    | muxfys caching | 5.8   | 0.7    |\n\nIe. minfs is very slow, and muxfys is about 2x faster than s3fs, with no\nnoticeable performance penalty for fuse mounting vs simply downloading the files\nyou need to local disk. (You also get the benefit of being able to seek and read\nonly small parts of the remote file, without having to download the whole\nthing.)\n\nThe same story holds true when performing the above test 100 times\n~simultaneously; while some reads take much longer due to Ceph/network overload,\nmuxfys remains on average twice as fast as s3fs. The only significant change is\nthat s3cmd starts to fail.\n\nFor a real-world test, some data processing and analysis was done with samtools,\na tool that can end up reading small parts of very large files.\nwww.htslib.org/workflow was partially followed to map fastqs with 441 read pairs\n(extracted from an old human chr20 mapping). Mapping, sorting and calling was\ncarried out, in addition to creating and viewing a cram. 
The different caching\nstrategies used were: cup == reference-related files cached, fastq files\nuncached, working in a normal POSIX directory; cuf == as cup, but working in a\nfuse mounted writable directory; uuf == as cuf, but with no caching for the\nreference-related files. The local(mc) method involved downloading all files\nwith mc first, with the cached result being the maximum possible performance:\nthat of running bwa and samtools when all required files are accessed from the\nlocal POSIX filesystem. Units are seconds (average of 3 attempts):\n\n    | method     | fresh | cached |\n    |------------|-------|--------|\n    | local(mc)  | 157   | 40     |\n    | s3fs.cup   | 175   | 50     |\n    | muxfys.cup | 80    | 45     |\n    | muxfys.cuf | 79    | 44     |\n    | muxfys.uuf | 88    | n/a    |\n\nIe. muxfys is about 2x faster than just downloading all required files manually,\nand over 2x faster than using s3fs. There isn't much performance loss when the\ndata is cached vs maximum possible performance. There's no noticeable penalty\n(indeed it's a little faster) for working directly in a muxfys-mounted\ndirectory.\n\nFinally, to compare to a highly optimised tool written in C that has built-in\nsupport (via libcurl) for reading from S3, samtools was once again used, this\ntime to read 100bp (the equivalent of a few lines) from an 8GB indexed cram\nfile. The builtin(mc) method involved downloading the single required cram cache\nfile from S3 first using mc, then relying on samtools' built-in S3 support by\ngiving it the s3:// path to the cram file; the cached result involves samtools\nreading this cache file and the cram's index files from the local POSIX\nfilesystem, but it still reads cram data itself from the remote S3 system. The\nother methods used samtools normally, giving it paths within the fuse mount(s)\ncreated. 
The different caching strategies used were: cu == reference-related\nfiles cached, cram-related files uncached; cc == everything cached; uu ==\nnothing cached. Units are seconds (average of 3 attempts):\n\n    | method      | fresh | cached |\n    |-------------|-------|--------|\n    | builtin(mc) | 1.3   | 0.5    |\n    | s3fs.cu     | 4.3   | 1.7    |\n    | s3fs.cc     | 4.4   | 0.5    |\n    | s3fs.uu     | 4.4   | 2.2    |\n    | muxfys.cu   | 0.3   | 0.1    |\n    | muxfys.cc   | 0.3   | 0.06   |\n    | muxfys.uu   | 0.3   | 0.1    |\n\nIe. muxfys is much faster than s3fs (more than 2x faster, probably due to much\nfaster and more efficient stating of files), and using it also gives a\nsignificant benefit over using a tool's built-in support for S3.\n\n# Status \u0026 Limitations\n\nThe only `RemoteAccessor` implemented so far is for S3-like object stores.\n\nIn cached mode, random reads and writes have been implemented.\n\nIn non-cached mode, random reads and serial writes have been implemented.\n(It is unlikely that random uncached writes will be implemented.)\n\nNon-POSIX behaviours:\n\n* does not store file mode/owner/group\n* does not support hardlinks\n* symlinks are only supported temporarily in a cached writeable mount: they\n  can be created and used, but do not get uploaded\n* `atime` (and typically `ctime`) is always the same as `mtime`\n* `mtime` of files is not stored remotely (remote file mtimes are of their\n  upload time, and muxfys only guarantees that files are uploaded in the order\n  of their mtimes)\n* does not upload empty directories, can't rename remote directories\n* `fsync` is ignored, files are only flushed on `close`\n\n# Guidance\n\n`CacheData: true` will usually give you the best performance. 
Not setting an\nexplicit CacheDir will also give the best performance, because if you read a small\npart of a large file, only the part you read will be downloaded and cached in\nthe unique CacheDir.\n\nOnly turn on `Write` mode if you have to write.\n\nUse `CacheData: false` if you will read more data than can be stored on local\ndisk.\n\nIf you know that you will definitely end up reading the same data multiple times\n(either during a mount, or from different mounts) on the same machine, and have\nsufficient local disk space, use `CacheData: true` and set an explicit CacheDir\n(with a constant absolute path, eg. starting in /tmp). Doing this results in any\nfile read downloading the whole remote file to cache it, which can be wasteful\nif you only need to read a small part of a large file. (But this is the only way\nthat muxfys can coordinate the cache amongst independent processes.)\n\n# Usage\n\n```go\nimport (\n    \"log\"\n    \"os\"\n\n    \"github.com/VertebrateResequencing/muxfys\"\n)\n\n// fully manual S3 configuration\naccessorConfig := \u0026muxfys.S3Config{\n    Target:    \"https://s3.amazonaws.com/mybucket/subdir\",\n    Region:    \"us-east-1\",\n    AccessKey: os.Getenv(\"AWS_ACCESS_KEY_ID\"),\n    SecretKey: os.Getenv(\"AWS_SECRET_ACCESS_KEY\"),\n}\naccessor, err := muxfys.NewS3Accessor(accessorConfig)\nif err != nil {\n    log.Fatal(err)\n}\nremoteConfig1 := \u0026muxfys.RemoteConfig{\n    Accessor: accessor,\n    CacheDir: \"/tmp/muxfys/cache\",\n    Write:    true,\n}\n\n// or read configuration from standard AWS S3 config files and environment\n// variables\naccessorConfig, err = muxfys.S3ConfigFromEnvironment(\"default\",\n    \"myotherbucket/another/subdir\")\nif err != nil {\n    log.Fatalf(\"could not read config from environment: %s\\n\", err)\n}\naccessor, err = muxfys.NewS3Accessor(accessorConfig)\nif err != nil {\n    log.Fatal(err)\n}\nremoteConfig2 := \u0026muxfys.RemoteConfig{\n    Accessor:  accessor,\n    CacheData: true,\n}\n\ncfg := \u0026muxfys.Config{\n    Mount:     
\"/tmp/muxfys/mount\",\n    CacheBase: \"/tmp\",\n    Retries:   3,\n    Verbose:   true,\n}\n\nfs, err := muxfys.New(cfg)\nif err != nil {\n    log.Fatalf(\"bad configuration: %s\\n\", err)\n}\n\nerr = fs.Mount(remoteConfig1, remoteConfig2)\nif err != nil {\n    log.Fatalf(\"could not mount: %s\\n\", err)\n}\nfs.UnmountOnDeath()\n\n// read from \u0026 write to files in /tmp/muxfys/mount, which contains the\n// contents of mybucket/subdir and myotherbucket/another/subdir; writes will\n// get uploaded to mybucket/subdir when you Unmount()\n\nerr = fs.Unmount()\nif err != nil {\n    log.Fatalf(\"could not unmount: %s\\n\", err)\n}\n\nlogs := fs.Logs()\n```\n\n# Provenance\n\nThere are many ways of accessing data in S3 buckets. Common tools include s3cmd\nfor direct up/download of particular files, and s3fs for fuse-mounting a bucket.\nBut these are not written in Go.\n\nAmazon provide aws-sdk-go for interacting with S3, but this does not work with\n(my) Ceph Object Gateway and possibly other implementations of S3.\n\nminio-go is an alternative Go library that provides good compatibility with a\nwide variety of S3-like systems.\n\nThere are at least 3 Go libraries for creating fuse-mounted file-systems.\ngithub.com/jacobsa/fuse was based on bazil.org/fuse, claiming higher\nperformance. Also claiming high performance is github.com/hanwen/go-fuse.\n\nThere are at least 2 projects that implement fuse-mounting of S3 buckets:\n\n* github.com/minio/minfs is implemented using minio-go and bazil, but in my\n  hands was very slow. It is designed to be run as root, requiring file-based\n  configuration.\n* github.com/kahing/goofys is implemented using aws-sdk-go and jacobsa/fuse,\n  making it incompatible with (my) Ceph Object Gateway.\n\nBoth are designed to be run as daemons as opposed to being used in-process.\n\nmuxfys is implemented using minio-go for compatibility, and hanwen/go-fuse for\nspeed. 
(In my testing, hanwen/go-fuse and jacobsa/fuse did not have noticeably\ndifferent performance characteristics, but go-fuse was easier to write for.)\nHowever, some of its read code is inspired by goofys. Thanks to minimising\nremote calls to the remote S3 system, and only implementing what S3 is generally\ncapable of, it shares and adds to goofys' non-POSIX behaviours.\n\n## Versioning\n\nThis project adheres to [Semantic Versioning](http://semver.org/). See\nCHANGELOG.md for a description of changes.\n\nIf you want to rely on a stable API, specify the desired version when getting\nthis module, eg:\n\n    $ go get github.com/VertebrateResequencing/muxfys/v4\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvertebrateresequencing%2Fmuxfys","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvertebrateresequencing%2Fmuxfys","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvertebrateresequencing%2Fmuxfys/lists"}