Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/trisongz/hfsync
Sync Huggingface Transformer Models to Cloud Storage (GCS/S3 + more WIP)
- Host: GitHub
- URL: https://github.com/trisongz/hfsync
- Owner: trisongz
- License: mit
- Created: 2021-05-05T05:57:16.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-05-05T06:58:38.000Z (over 3 years ago)
- Last Synced: 2024-10-11T09:19:15.502Z (3 months ago)
- Language: Python
- Size: 12.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
README
# hfsync
Sync Huggingface Transformer Models to Cloud Storage (GCS/S3 + more soon)

## Quickstart
```python
# Install the latest version from GitHub
!pip install --upgrade git+https://github.com/trisongz/hfsync.git
# Or install from PyPI
!pip install --upgrade hfsync
```
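If the install succeeded, the package should import cleanly; a quick sanity check in the same notebook (`pip show` is a generic pip command, not part of hfsync):

```python
# Confirm the package is importable and inspect the installed version
import hfsync
!pip show hfsync
```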
## Usage
```python
from hfsync import GCSAuth, S3Auth  # AZAuth (not yet supported)
from hfsync import Sync

# Note: only use one auth client.
# To have auth picked up from env vars / implicitly:
auth_client = GCSAuth()
auth_client = S3Auth()

# To set auth directly / explicitly:
auth_client = GCSAuth(service_account='service_account', token='gcs_token')
auth_client = S3Auth(access_key='access_key', secret_key='secret_key', session_token='token')

# Local and cloud paths
local_path = '/content/model'
cloud_path = 'gs://bucket/model/experiment'  # or 's3://bucket/model/experiment'
sync_client = Sync(local_path=local_path, cloud_path=cloud_path, auth_client=auth_client)

# Or implicitly, without an auth_client, letting it figure things out from your cloud path
sync_client = Sync(local_path=local_path, cloud_path=cloud_path)

# After your training loop, sync your pretrained model to both local and cloud.
# You don't need to call model.save_pretrained(path) explicitly; this function does that automatically.
results = sync_client.save_pretrained(model, tokenizer)
# results = {
#     '/content/model/pytorch_model.bin': 'gs://bucket/model/experiment/pytorch_model.bin',
#     ...
# }

# Pull down from your bucket to local
results = sync_client.sync_to_local(overwrite=False)
# results = {
#     'gs://bucket/model/experiment/pytorch_model.bin': '/content/model/pytorch_model.bin',
#     ...
# }

# Or pass explicit paths if you are changing paths, for example, different dirs for each checkpoint
new_local = '/content/model2'
results = sync_client.sync_to_local(local_path=new_local, cloud_path=cloud_path, overwrite=False)
# results = {
#     'gs://bucket/model/experiment/pytorch_model.bin': '/content/model2/pytorch_model.bin',
#     ...
# }

# Or set new paths to use from here on
new_cloud = 'gs://bucket/model/experiment2'
sync_client.set_paths(local_path=new_local, cloud_path=new_cloud)
results = sync_client.save_pretrained(model, tokenizer)
# results = {
#     '/content/model2/pytorch_model.bin': 'gs://bucket/model/experiment2/pytorch_model.bin',
#     ...
# }

# You can also use the underlying filesystem to copy a file directly,
# either implicitly or explicitly.
filename = '/content/model2/pytorch_model.bin'
sync_client.copy(filename)  # copies to the set cloud_path -> 'gs://bucket/model/experiment2/pytorch_model.bin'
sync_client.copy(filename, dest='gs://bucket/model/experiment3')  # copies to dest -> 'gs://bucket/model/experiment3/pytorch_model.bin'
sync_client.copy_file(filename, dest='gs://bucket/model/experiment3/model.bin')  # copies to dest -> 'gs://bucket/model/experiment3/model.bin'

filename = 'gs://bucket/model/experiment2/pytorch_model.bin'
sync_client.copy(filename)  # copies to the set local_path -> '/content/model2/pytorch_model.bin'
sync_client.copy(filename, dest='/content/model3')  # copies to dest -> '/content/model3/pytorch_model.bin'
sync_client.copy_file(filename, dest='/content/model3/model.bin')  # copies to dest -> '/content/model3/model.bin'

# Copy explicitly
src_file = '/content/mydataset.pb'
dest_file = 's3://bucket/data/dataset.pb'
sync_client.copy_file(src_file, dest_file)
```
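For context, here is a minimal sketch of how the client could slot into a typical `transformers` fine-tuning loop. The model name, epoch count, and per-checkpoint directory layout are illustrative assumptions; only `Sync`, `set_paths`, and `save_pretrained` come from the API shown above.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from hfsync import Sync

# Placeholder model/tokenizer; any Huggingface pretrained pair works here
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Implicit auth, inferred from the gs:// cloud path as described above
sync_client = Sync(local_path='/content/model',
                   cloud_path='gs://bucket/model/experiment')

for epoch in range(3):  # stand-in for a real training loop
    # ... train for one epoch here ...

    # Give each checkpoint its own local and cloud directory (assumed layout)
    sync_client.set_paths(
        local_path=f'/content/model/checkpoint-{epoch}',
        cloud_path=f'gs://bucket/model/experiment/checkpoint-{epoch}',
    )
    # Saves the model/tokenizer locally and uploads them in one call
    sync_client.save_pretrained(model, tokenizer)
```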
### Environment Variables Used

Google Cloud Storage
- `GOOGLE_API_TOKEN`
- `GOOGLE_APPLICATION_CREDENTIALS`

AWS S3
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `AWS_SESSION_TOKEN`
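As an illustration of the implicit flow, setting these variables before constructing a no-argument auth client should let it pick up credentials from the environment. The values below are hypothetical placeholders; in practice you would export them in your shell or notebook rather than hard-coding them:

```python
import os

# Hypothetical placeholder credentials; never hard-code real secrets
os.environ['AWS_ACCESS_KEY_ID'] = 'AKIA...'
os.environ['AWS_SECRET_ACCESS_KEY'] = '...'
os.environ['AWS_SESSION_TOKEN'] = '...'  # only needed for temporary credentials

from hfsync import S3Auth, Sync

auth_client = S3Auth()  # picks credentials up from the environment
sync_client = Sync(local_path='/content/model',
                   cloud_path='s3://bucket/model/experiment',
                   auth_client=auth_client)
```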
### Limitations
Where `overwrite=False` is available, the library checks for existing files before overwriting them, but this check is not yet supported uniformly across all cloud backends. Support for other cloud filesystems is WIP.
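Until that check is uniform, one defensive pattern (a suggestion, not something the library provides) is to test for the file yourself before pulling:

```python
import os

# Skip the pull entirely if the checkpoint already exists locally,
# rather than relying on overwrite=False for every backend
local_file = '/content/model/pytorch_model.bin'
if not os.path.exists(local_file):
    sync_client.sync_to_local(overwrite=False)
```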
### Credits
- [file-io](https://github.com/trisongz/file_io)
- [smart-open](https://github.com/RaRe-Technologies/smart_open)