https://github.com/pwwang/pipen-gcs
A plugin for pipen to handle files in Google Cloud Storage
- Host: GitHub
- URL: https://github.com/pwwang/pipen-gcs
- Owner: pwwang
- License: apache-2.0
- Created: 2024-07-23T07:04:35.000Z (9 months ago)
- Default Branch: master
- Last Pushed: 2025-03-08T01:42:32.000Z (about 2 months ago)
- Last Synced: 2025-03-26T20:49:36.249Z (about 1 month ago)
- Topics: google-cloud, google-cloud-storage, pipeline, pipeline-framework
- Language: Python
- Homepage:
- Size: 562 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# pipen-gcs
A plugin for [pipen][1] to handle files in Google Cloud Storage.
> [!NOTE]
> Since v0.16.0, pipen has supported cloud files natively. See [here](https://pwwang.github.io/pipen/cloud/) for more information.
> However, when the pipeline working directory is a local path but the input/output files are in the cloud, the cloud files have to be handled manually in the job script.
> To avoid that, this plugin can download the input files and upload the output files automatically.

> [!NOTE]
> This plugin does not synchronize the meta files to the cloud storage; those are already handled by pipen when needed. The plugin only handles the input/output files when the working directory is a local path. When the pipeline output directory is a cloud path, the output files are uploaded to the cloud storage automatically.
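For context, handling a `gs://` path manually in a job script starts with splitting the URI into a bucket name and an object name before calling a storage client. A minimal sketch of that parsing step (the helper name is hypothetical, not part of pipen-gcs):

```python
def split_gs_uri(uri: str) -> tuple[str, str]:
    """Split a GCS URI into (bucket, object) names.

    "gs://bucket/path/to/file" -> ("bucket", "path/to/file")
    """
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a GCS URI: {uri}")
    # Everything up to the first "/" is the bucket; the rest is the object
    bucket, _, blob = uri[len(prefix):].partition("/")
    return bucket, blob
```

The plugin spares the job script from this bookkeeping entirely, since paths are presented as local files.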
## Installation
```bash
pip install -U pipen-gcs
```

## Usage
```python
from pipen import Proc, Pipen
import pipen_gcs  # Import and enable the plugin


class MyProc(Proc):
    input = "infile:file"
    input_data = ["gs://bucket/path/to/file"]
    output = "outfile:file:{{in.infile.name}}.out"
    # We can deal with the files as if they are local
    script = "cat {{in.infile}} > {{out.outfile}}"


class MyPipen(Pipen):
    starts = MyProc
    # Input files/directories will be downloaded to /tmp;
    # output files/directories will be generated in /tmp and then
    # uploaded to the cloud storage
    plugin_opts = {"gcs_cache": "/tmp"}


if __name__ == "__main__":
    # The working directory is a local path.
    # The output directory can be a local path, but if it is a cloud path,
    # the output files will be uploaded to the cloud storage automatically.
    MyPipen(workdir="./.pipen", outdir="./myoutput").run()
```

> [!NOTE]
> When checking the meta information of the jobs (for example, whether a job is cached), the plugin makes `pipen` use the cloud files.

## Configuration
- `gcs_cache`: The directory to save the cloud storage files.
- `gcs_loglevel`: The log level for the plugin. Default is `INFO`.
- `gcs_logmax`: The maximum number of files to log while syncing. Default is `5`.

[1]: https://github.com/pwwang/pipen
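The three options can be combined in a pipeline's `plugin_opts`. A sketch with illustrative values (the cache path and the overridden levels are examples, not recommendations):

```python
# Plugin options for pipen-gcs; values are illustrative.
plugin_opts = {
    # Local directory where cloud input files are downloaded and
    # output files are staged before upload
    "gcs_cache": "/tmp/pipen-gcs-cache",
    # Log level for the plugin's own messages (default: INFO)
    "gcs_loglevel": "DEBUG",
    # Maximum number of files listed in the log while syncing (default: 5)
    "gcs_logmax": 10,
}
```

These can be set on the `Pipen` subclass as shown in the usage example above, or passed per-process.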