Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/danielfrg/s3contents
Jupyter Notebooks in S3 - Jupyter Contents Manager implementation
https://github.com/danielfrg/s3contents
aws aws-s3 jupyter-notebook python s3
Last synced: 5 days ago
JSON representation
Jupyter Notebooks in S3 - Jupyter Contents Manager implementation
- Host: GitHub
- URL: https://github.com/danielfrg/s3contents
- Owner: danielfrg
- License: apache-2.0
- Created: 2016-10-22T00:03:09.000Z (over 8 years ago)
- Default Branch: main
- Last Pushed: 2024-08-09T18:47:04.000Z (6 months ago)
- Last Synced: 2024-10-29T17:21:48.355Z (3 months ago)
- Topics: aws, aws-s3, jupyter-notebook, python, s3
- Language: Python
- Homepage:
- Size: 541 KB
- Stars: 248
- Watchers: 12
- Forks: 88
- Open Issues: 17
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.txt
Awesome Lists containing this project
README
# S3Contents - Jupyter Notebooks in S3
A transparent, drop-in replacement for Jupyter standard filesystem-backed storage system.
With this implementation of a
[Jupyter Contents Manager](https://jupyter-server.readthedocs.io/en/latest/developers/contents.html)
you can save all your notebooks, files and directory structure directly to a
S3/GCS bucket on AWS/GCP or a self hosted S3 API compatible like [MinIO](http://minio.io).## Installation
```shell
pip install s3contents
```Install with GCS dependencies:
```shell
pip install s3contents[gcs]
```## s3contents vs X
While there are some implementations of an S3 Jupyter Content Manager such as
[s3nb](https://github.com/monetate/s3nb) or [s3drive](https://github.com/stitchfix/s3drive)
s3contents is the only one tested against new versions of Jupyter.
It also supports more authentication methods and Google Cloud Storage.This aims to be a fully tested implementation and it's based on [PGContents](https://github.com/quantopian/pgcontents).
## Configuration
Create a `jupyter_notebook_config.py` file in one of the
[Jupyter config directories](https://jupyter.readthedocs.io/en/latest/use/jupyter-directories.html#id1)
for example: `~/.jupyter/jupyter_notebook_config.py`.**Jupyter Notebook Classic**: If you plan to use the Classic Jupyter Notebook
interface you need to change `ServerApp` to `NotebookApp` for all the examples on this page.## AWS S3
```python
from s3contents import S3ContentsManagerc = get_config()
# Tell Jupyter to use S3ContentsManager
c.ServerApp.contents_manager_class = S3ContentsManager
c.S3ContentsManager.bucket = ""# Fix JupyterLab dialog issues
c.ServerApp.root_dir = ""
```### Authentication
Additionally you can configure multiple authentication methods:
Access and secret keys:
```python
c.S3ContentsManager.access_key_id = ""
c.S3ContentsManager.secret_access_key = ""
```Session token:
```python
c.S3ContentsManager.session_token = ""
```### AWS EC2 role auth setup
It also possible to use IAM Role-based access to the S3 bucket from an Amazon EC2 instance or AWS resource.
To do that just leave any authentication options (`access_key_id`, `secret_access_key`) to their default of `None`
and ensure that the EC2 instance has an IAM role which provides sufficient permissions (read and write) for the bucket.### Optional settings
```python
# A prefix in the S3 buckets to use as the root of the Jupyter file system
c.S3ContentsManager.prefix = "this/is/a/prefix/on/the/s3/bucket"# Server-Side Encryption
c.S3ContentsManager.sse = "AES256"# Authentication signature version
c.S3ContentsManager.signature_version = "s3v4"# See AWS key refresh
c.S3ContentsManager.init_s3_hook = init_function
```### AWS key refresh
The optional `init_s3_hook` configuration can be used to enable AWS key rotation (described [here](https://dev.to/li_chastina/auto-refresh-aws-tokens-using-iam-role-and-boto3-2cjf) and [here](https://www.owenrumney.co.uk/2019/01/15/implementing-refreshingawscredentials-python/)) as follows:
```python
from aiobotocore.credentials import AioRefreshableCredentials
from aiobotocore.session import get_session
from configparser import ConfigParserfrom s3contents import S3ContentsManager
def refresh_external_credentials():
config = ConfigParser()
config.read('/home/jovyan/.aws/credentials')
return {
"access_key": config['default']['aws_access_key_id'],
"secret_key": config['default']['aws_secret_access_key'],
"token": config['default']['aws_session_token'],
"expiry_time": config['default']['aws_expiration']
}async def async_refresh_credentials():
return refresh_external_credentials()def make_key_refresh_boto3(this_s3contents_instance):
session_credentials = AioRefreshableCredentials.create_from_metadata(
metadata = refresh_external_credentials(),
refresh_using = async_refresh_credentials,
method = 'custom-refreshing-key-file-reader'
)
refresh_session = get_session() # from aibotocore.session
refresh_session._credentials = session_credentials
this_s3contents_instance.boto3_session = refresh_session# Tell Jupyter to use S3ContentsManager
c.ServerApp.contents_manager_class = S3ContentsManagerc.S3ContentsManager.init_s3_hook = make_key_refresh_boto3
```### MinIO playground example
You can test this using the [`play.minio.io:9000`](https://play.minio.io:9000) playground:
Just be sure to create the bucket first.
```python
from s3contents import S3ContentsManagerc = get_config()
# Tell Jupyter to use S3ContentsManager
c.ServerApp.contents_manager_class = S3ContentsManager
c.S3ContentsManager.access_key_id = "Q3AM3UQ867SPQQA43P2F"
c.S3ContentsManager.secret_access_key = "zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG"
c.S3ContentsManager.endpoint_url = "https://play.minio.io:9000"
c.S3ContentsManager.bucket = "s3contents-demo"
c.S3ContentsManager.prefix = "notebooks/test"
```## Access local files
To access local file as well as remote files in S3 you can use [hybridcontents](https://github.com/viaduct-ai/hybridcontents).
Install it:
```shell
pip install hybridcontents
```Use a configuration similar to this:
```python
from s3contents import S3ContentsManager
from hybridcontents import HybridContentsManager
from notebook.services.contents.largefilemanager import LargeFileManagerc = get_config()
c.ServerApp.contents_manager_class = HybridContentsManager
c.HybridContentsManager.manager_classes = {
# Associate the root directory with an S3ContentsManager.
# This manager will receive all requests that don"t fall under any of the
# other managers.
"": S3ContentsManager,
# Associate /local_directory with a LargeFileManager.
"local_directory": LargeFileManager,
}c.HybridContentsManager.manager_kwargs = {
# Args for root S3ContentsManager.
"": {
"access_key_id": "",
"secret_access_key": "",
"bucket": "",
},
# Args for the LargeFileManager mapped to /local_directory
"local_directory": {
"root_dir": "/Users/danielfrg/Downloads",
},
}
```## GCP - Google Cloud Storage
Install the extra dependencies with:
```shell
pip install s3contents[gcs]
``````python
from s3contents.gcs import GCSContentsManagerc = get_config(
c.ServerApp.contents_manager_class = GCSContentsManager
c.GCSContentsManager.project = ""
c.GCSContentsManager.token = "~/.config/gcloud/application_default_credentials.json"
c.GCSContentsManager.bucket = ""
```Note that the file `~/.config/gcloud/application_default_credentials.json` assumes
a POSIX system when you did `gcloud init`.## Other configuration
### File Save Hooks
If you want to use pre/post file save hooks here are some examples.
A `pre_save_hook` is written in the exact same way as normal, operating on the
file in local storage before committing it to the object store.```python
def scrub_output_pre_save(model, **kwargs):
"""
Scrub output before saving notebooks
"""# only run on notebooks
if model["type"] != "notebook":
return# only run on nbformat v4
if model["content"]["nbformat"] != 4:
returnfor cell in model["content"]["cells"]:
if cell["cell_type"] != "code":
continue
cell["outputs"] = []
cell["execution_count"] = Nonec.S3ContentsManager.pre_save_hook = scrub_output_pre_save
```A `post_save_hook` instead operates on the file in object storage,
because of this it is useful to use the file methods on the `contents_manager`
for data manipulation.
In addition, one must use the following function signature (unique to `s3contents`):```python
def make_html_post_save(model, s3_path, contents_manager, **kwargs):
"""
Convert notebooks to HTML after saving via nbconvert
"""
from nbconvert import HTMLExporterif model["type"] != "notebook":
returncontent, _format = contents_manager.fs.read(s3_path, format="text")
my_notebook = nbformat.reads(content, as_version=4)html_exporter = HTMLExporter()
html_exporter.template_name = "classic"(body, resources) = html_exporter.from_notebook_node(my_notebook)
base, ext = os.path.splitext(s3_path)
contents_manager.fs.write(path=(base + ".html"), content=body, format=_format)c.S3ContentsManager.post_save_hook = make_html_post_save
```