https://github.com/dentiny/duck-read-cache-fs
This repository is made as read-only filesystem for remote access.
- Host: GitHub
- URL: https://github.com/dentiny/duck-read-cache-fs
- Owner: dentiny
- License: MIT
- Created: 2025-01-25T19:58:17.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-04-05T07:23:19.000Z (7 months ago)
- Last Synced: 2025-04-05T07:24:21.243Z (7 months ago)
- Language: C++
- Size: 379 KB
- Stars: 37
- Watchers: 3
- Forks: 1
- Open Issues: 17
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
- awesome-duckdb - `cache_httpfs` - Adds a read caching layer to duckdb filesystem to improve query performance and reduce egress cost. (Extensions / [Community Extensions](https://duckdb.org/community_extensions/))
README
# duck-read-cache-fs
A DuckDB extension that provides a caching layer for remote filesystem access.
## Loading cache httpfs
Since DuckDB v1.0.0, cache httpfs can be loaded as a community extension without requiring the `unsigned` flag. From any DuckDB instance, the following two commands will allow you to install and load the extension:
```sql
INSTALL cache_httpfs FROM community;
-- Or upgrade to the latest version with `FORCE INSTALL cache_httpfs FROM community;`
LOAD cache_httpfs;
```
See the [cache httpfs community extension page](https://community-extensions.duckdb.org/extensions/cache_httpfs.html) for more information.
## Introduction
This extension implements a read-only filesystem for remote access, which serves as a caching layer on top of DuckDB [httpfs](https://github.com/duckdb/duckdb-httpfs).
Key features:
- Data caching, which improves IO performance for remote file access and reduces egress cost; several cache types are supported
+ in-memory: fetched file content is cached in blocks, and an LRU cache evicts stale blocks
+ on-disk (default): read blocks are stored on the local filesystem and evicted, based on their access timestamp, when disk space runs low
+ no cache: caching can be disabled entirely, falling back to httpfs without any side effects
- Parallel reads: read operations are split into size-tunable chunks to increase the cache hit rate and improve performance
- Apart from data blocks, the extension also caches file handles, file metadata, and glob results
+ Caching for these entities is enabled by default
- Profiling, to help understand the system better; key metrics measured include cache access stats and IO operation latency. We plan to support multiple ways to access profile results; as of now there are three profiling modes
+ temp: all access stats are stored in memory and can be retrieved via `SELECT cache_httpfs_get_profile();`
+ duckdb (under work): stats are stored in DuckDB so its rich features can be leveraged for analysis (e.g. a histogram to understand the latency distribution)
+ disabled: profiling is off by default
- 100% compatibility with DuckDB `httpfs`
+ The extension is built on top of the `httpfs` extension and loads it automatically beforehand, so it is fully compatible; the option `SET cache_httpfs_type='noop';` falls back to, and behaves exactly like, `httpfs`.
- Able to wrap **ALL** DuckDB-compatible filesystems with one simple SQL call, `SELECT cache_httpfs_wrap_cache_filesystem()`, and get all the benefits of caching, parallel reads, IO performance stats, you name it.
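The cache-type, profiling, and wrapping features above can all be exercised directly from SQL. A minimal sketch using the settings and functions named in this README (it assumes the extension is already installed and loaded; the `'on_disk'` value is an assumption based on the on-disk default described above):

```sql
-- Choose a cache type: on-disk is the default; 'in_mem' keeps blocks in
-- memory; 'noop' disables caching and falls back to plain httpfs.
SET cache_httpfs_type = 'in_mem';

-- Retrieve in-memory (temp) profiling stats.
-- Note: profiling is disabled by default.
SELECT cache_httpfs_get_profile();

-- Wrap all DuckDB-compatible filesystems with the cache filesystem.
SELECT cache_httpfs_wrap_cache_filesystem();
```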
Caveats:
- The extension is designed for object storage, where workloads are expected to be read-heavy and (mostly) immutable, so it only supports read caching (at the moment); the cache is not invalidated by a write to the same object.
+ As a workaround for overwrites, users can call `cache_httpfs_clear_cache` to delete all cached content, or `cache_httpfs_clear_cache_for_file` for a specific object.
+ All cache types provide an eventual consistency guarantee: entries are evicted after a tunable timeout.
- Filesystem requests are split into multiple block-aligned sub-requests for parallel IO and cache efficiency, so small requests (e.g. reading 1 byte) can suffer read amplification.
A workaround is to tune down the block size via `cache_httpfs_cache_block_size`, or to fall back to native httpfs.
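The overwrite and read-amplification workarounds above can be sketched as follows (function and setting names as given in this README; the object path and block-size value are purely illustrative):

```sql
-- Drop all cached content, e.g. after remote objects have been overwritten.
SELECT cache_httpfs_clear_cache();

-- Or clear cached content for a single object only (hypothetical path).
SELECT cache_httpfs_clear_cache_for_file('s3://my-bucket/t.parquet');

-- Tune down the block size to reduce read amplification for small reads
-- (64 KiB here is an illustrative value, not the documented default).
SET cache_httpfs_cache_block_size = 65536;
```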
## Example usage
```sql
-- No need to load httpfs.
D LOAD cache_httpfs;
-- Create S3 secret to access objects.
D CREATE SECRET my_secret ( TYPE S3, KEY_ID '', SECRET '', REGION 'us-east-1', ENDPOINT 's3express-use1-az6.us-east-1.amazonaws.com');
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ true │
└─────────┘
-- Set cache type to in-memory.
D SET cache_httpfs_type='in_mem';
-- Access remote file.
D SELECT * FROM 's3://s3-bucket-user-2skzy8zuigonczyfiofztl0zbug--use1-az6--x-s3/t.parquet';
┌───────┬───────┐
│ i │ j │
│ int64 │ int64 │
├───────┼───────┤
│ 0 │ 1 │
│ 1 │ 2 │
│ 2 │ 3 │
│ 3 │ 4 │
│ 4 │ 5 │
├───────┴───────┤
│ 5 rows │
└───────────────┘
```
For more examples, check out the [example usage](/doc/example_usage.md) doc.
## [More About Benchmark](/benchmark/README.md)


## Platform support
At the moment macOS and Linux are supported; shoot us a [feature request](https://github.com/dentiny/duck-read-cache-fs/issues/new?template=feature_request.md) if you would like to run the extension on other platforms.
## Development
Development requires [CMake](https://cmake.org) and a `C++14`-compliant compiler. Run `make` in the root directory to compile the sources, or `make debug` to build a non-optimized debug version. Run `make unit` to execute the unit tests.
Please also refer to our [Contribution Guide](https://github.com/dentiny/duck-read-cache-fs/blob/main/CONTRIBUTING.md).