{"id":28280525,"url":"https://github.com/fsspec/alluxiofs","last_synced_at":"2025-07-24T12:15:51.493Z","repository":{"id":203833796,"uuid":"704269146","full_name":"fsspec/alluxiofs","owner":"fsspec","description":"Speed up fsspec data access with Alluxio distributed caching.","archived":false,"fork":false,"pushed_at":"2025-06-27T06:50:59.000Z","size":3287,"stargazers_count":14,"open_issues_count":3,"forks_count":10,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-07-11T06:52:31.764Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fsspec.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-10-12T22:57:14.000Z","updated_at":"2025-06-27T06:51:02.000Z","dependencies_parsed_at":null,"dependency_job_id":"d3d5e0fa-9396-45c9-991d-36e33a6c810f","html_url":"https://github.com/fsspec/alluxiofs","commit_stats":null,"previous_names":["luqqiu/alluxiofs"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/fsspec/alluxiofs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fsspec%2Falluxiofs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fsspec%2Falluxiofs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fsspec%2Falluxiofs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fsspec%2Falluxiofs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fs
spec","download_url":"https://codeload.github.com/fsspec/alluxiofs/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fsspec%2Falluxiofs/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264951699,"owners_count":23687982,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-21T10:17:12.104Z","updated_at":"2025-07-15T04:33:44.066Z","avatar_url":"https://github.com/fsspec.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Alluxio FileSystem\n\nThis quickstart shows how you can use the FSSpec interface to connect to [Alluxio](https://github.com/Alluxio/alluxio).\nFor more information on what to expect, please read the blog [Accelerate data loading in large scale ML training with Ray and Alluxio](https://www.alluxio.io/blog/accelerating-data-loading-in-large-scale-ml-training-with-ray-and-alluxio/).\n\n## Dependencies\n\n### A running Alluxio server with ETCD membership service\n\nAlluxio version \u003e= 309\n\nLaunch Alluxio clusters with the example configuration\n```config\n# only one master, one worker are running in this example\nalluxio.master.hostname=localhost\nalluxio.worker.hostname=localhost\n\n# Critical properties for this example\n# UFS address (e.g., the src of data to cache), change it to your bucket\nalluxio.dora.client.ufs.root=s3://example_bucket/datasets/\n# storage dir\nalluxio.worker.page.store.dirs=/tmp/page_ufs\n# size of storage dir\nalluxio.worker.page.store.sizes=10GB\n# use etcd to keep consistent hashing 
ring\nalluxio.worker.membership.manager.type=ETCD\n# default etcd endpoint\nalluxio.etcd.endpoints=http://localhost:2379\n# number of vnodes per worker on the ring\nalluxio.user.consistent.hash.virtual.node.count.per.worker=5\n\n# Other optional settings, good to have\nalluxio.job.batch.size=200\nalluxio.master.journal.type=NOOP\nalluxio.master.scheduler.initial.wait.time=10s\nalluxio.network.netty.heartbeat.timeout=5min\nalluxio.underfs.io.threads=50\n```\n\n### Python Dependencies\n\n- Python 3.8, 3.9, or 3.10\n- ray \u003e= 2.8.2\n- fsspec released after 2023.6\n\n#### Install fsspec implementation for underlying data storage\n\nAlluxio fsspec acts as a cache on top of an existing underlying data lake storage connection.\nThe fsspec implementation corresponding to the underlying data lake storage needs to be installed.\nIn the Alluxio configuration example above, Amazon S3 is the data lake storage where the dataset is read from.\n\nTo connect to an existing underlying storage, there are two requirements:\n- Install the underlying storage fsspec\n  - For all [built-in storage fsspec](https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations), no extra Python libraries need to be installed.\n  - For all [third-party storage fsspec](https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations), the corresponding third-party fsspec Python library needs to be installed.\n- Set credentials for the underlying data lake storage\n\nExample: Deploy S3 as the underlying data lake storage\n[Install third-party S3 fsspec](https://s3fs.readthedocs.io/en/latest/)\n\n```commandline\npip install s3fs\n```\n\n#### Install alluxiofs\n\nDirectly install the latest published alluxiofs:\n```\npip install alluxiofs\n```\n\n[Optional] Install from the source code:\n```shell\ngit clone git@github.com:fsspec/alluxiofs.git\ncd alluxiofs \u0026\u0026 python3 setup.py bdist_wheel \u0026\u0026 \\\n     pip3 install 
dist/alluxiofs-\u003calluxiofs_version\u003e-py3-none-any.whl\n```\n\n## Running a Hello World Example\n\n### Load the dataset\n\n#### Load dataset using Alluxio CLI load command\n\n````commandline\nbin/alluxio job load --path s3://example_bucket/datasets/ --submit\n````\nThis triggers a load job asynchronously, identified by a job ID. You can wait until the load finishes or check its progress using the following command:\n\n````commandline\nbin/alluxio job load --path s3://example_bucket/datasets/ --progress\n````\n\n### Create an AlluxioFS (backed by S3)\n\nCreate the Alluxio filesystem with data backed by S3:\n\n```\nimport fsspec\nfrom alluxiofs import AlluxioFileSystem\n\n# Register Alluxio to fsspec\nfsspec.register_implementation(\"alluxiofs\", AlluxioFileSystem, clobber=True)\n\n# Create Alluxio filesystem\nalluxio_fs = fsspec.filesystem(\"alluxiofs\", etcd_hosts=\"localhost\", etcd_port=2379, target_protocol=\"s3\")\n```\n\n### Run Alluxio FileSystem operations\n\nUsage is similar to the [fsspec examples](https://filesystem-spec.readthedocs.io/en/latest/usage.html#use-a-file-system) and the [alluxiofs examples](https://github.com/fsspec/alluxiofs/blob/main/tests/test_alluxio_fsspec.py).\nNote that read operations succeed only if the parent folder has been loaded into Alluxio.\n```\n# List files\ncontents = alluxio_fs.ls(\"s3://apc999/datasets/nyc-taxi-csv/green-tripdata/\", detail=True)\n\n# Read files\nwith alluxio_fs.open(\"s3://apc999/datasets/nyc-taxi-csv/green-tripdata/green_tripdata_2021-01.csv\", \"rb\") as f:\n    data = f.read()\n```\n\n### Running an example with Ray\n\n```\nimport fsspec\nimport ray\nfrom alluxiofs import AlluxioFileSystem\n\n# Register the Alluxio fsspec implementation\nfsspec.register_implementation(\"alluxiofs\", AlluxioFileSystem, clobber=True)\nalluxio_fs = fsspec.filesystem(\n  \"alluxiofs\", etcd_hosts=\"localhost\", target_protocol=\"s3\"\n)\n\n# Pass the initialized Alluxio filesystem to Ray and 
read the NYC taxi ride data set\nds = ray.data.read_csv(\"s3://example_bucket/datasets/example.csv\", filesystem=alluxio_fs)\n\n# Get a count of the number of records in the single CSV file\nds.count()\n\n# Display the schema derived from the CSV file header record\nds.schema()\n\n# Display the first record\nds.take(1)\n\n# Display the first two records\nds.take(2)\n\n# Read multiple CSV files:\nds2 = ray.data.read_csv(\"s3://apc999/datasets/csv_dir/\", filesystem=alluxio_fs)\n\n# Get a count of the number of records in the twelve CSV files\nds2.count()\n\n# End of Python example\n```\n\n#### Enable alluxiocommon enhancement module\n\nThe alluxiocommon package is a native enhancement module for alluxiofs, built on PyO3 Rust bindings.\nCurrently it speeds up large reads (multi-page reads from Alluxio) by issuing multi-threaded requests to Alluxio.\n\nTo enable it, first install the alluxiocommon package:\n```\npip install alluxiocommon\n```\nThen, when creating the Alluxio fsspec instance, pass an additional option flag:\n```\nalluxio_options = {\"alluxio.common.extension.enable\" : \"True\"}\nalluxio_fs = fsspec.filesystem(\n  \"alluxiofs\", etcd_hosts=\"localhost\", target_protocol=\"s3\",\n  options=alluxio_options\n)\n```\n\n### Running examples with PyArrow\n\n```\nimport fsspec\nfrom alluxiofs import AlluxioFileSystem\n\n# Register the Alluxio fsspec implementation\nfsspec.register_implementation(\"alluxiofs\", AlluxioFileSystem, clobber=True)\nalluxio_fs = fsspec.filesystem(\n  \"alluxiofs\", etcd_hosts=\"localhost\", target_protocol=\"s3\"\n)\n\n# Example 1\n# Pass the initialized Alluxio filesystem to PyArrow and read the data set from the example Parquet file\nimport pyarrow.dataset as ds\ndataset = ds.dataset(\"s3://example_bucket/datasets/example.parquet\", filesystem=alluxio_fs)\n\n# Get a count of the number of records in the Parquet file\ndataset.count_rows()\n\n# Display the schema of the Parquet file\ndataset.schema\n\n# Display the first 
record\ndataset.take([0])\n\n# Example 2\n# Create a Python-based PyArrow filesystem using FSSpecHandler\nfrom pyarrow.fs import PyFileSystem, FSSpecHandler\npy_fs = PyFileSystem(FSSpecHandler(alluxio_fs))\n\n# Read the data by using the PyArrow filesystem interface\nwith py_fs.open_input_file(\"s3://example_bucket/datasets/example.parquet\") as f:\n    alluxio_file_data = f.read()\n\n# End of Python example\n```\n\n## Benchmark\nIf you want to benchmark the Python SDK against FUSE, you can run the following command:\n```bash\n/bin/bash benchmark_launch.sh\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffsspec%2Falluxiofs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffsspec%2Falluxiofs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffsspec%2Falluxiofs/lists"}