https://github.com/googleclouddataproc/jupyterhub-dataprocspawner
https://github.com/googleclouddataproc/jupyterhub-dataprocspawner
google-cloud-dataproc
Last synced: 6 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/googleclouddataproc/jupyterhub-dataprocspawner
- Owner: GoogleCloudDataproc
- License: apache-2.0
- Created: 2019-08-08T17:53:48.000Z (about 6 years ago)
- Default Branch: main
- Last Pushed: 2022-05-27T01:04:56.000Z (over 3 years ago)
- Last Synced: 2025-03-21T06:33:26.010Z (7 months ago)
- Topics: google-cloud-dataproc
- Language: Python
- Size: 400 KB
- Stars: 14
- Watchers: 30
- Forks: 11
- Open Issues: 7
-
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
README
# DataprocSpawner
DataprocSpawner enables [JupyterHub][jupyterhub] to spawn single-user [jupyter_notebooks][Jupyter notebooks] that run on [Dataproc][dataproc] clusters. This provides users with ephemeral clusters for data science without the pain of managing them.
- [Product Documentation][dataproc]
- DISCLAIMER: DataprocSpawner only supports zonal DNS names. If your project uses global DNS names, click [this][dns] for instructions on how to migrate.[jupyterhub]: https://jupyterhub.readthedocs.io/en/stable/
[jupyter_notebooks]: https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html
[dataproc]: https://cloud.google.com/dataproc
[dns]: https://cloud.google.com/compute/docs/internal-dns#migrating-to-zonalSupported Python Versions: Python >= 3.6
## Before you begin
In order to use this library, you first need to go through the following steps:
1. [Select or create a Cloud Platform project][create_project]
2. [Enable billing for your project][enable_billing]
3. [Enable the Google Cloud Dataproc API][enable_api]
4. [Setup Authentication][authentication][create_project]: https://console.cloud.google.com/project
[enable_billing]: https://cloud.google.com/billing/docs/how-to/modify-project#enable_billing_for_a_project
[enable_api]: https://cloud.google.com/dataproc
[authentication]: https://cloud.google.com/docs/authentication/getting-started#auth-cloud-implicit-python## Installation example
### Locally
To try is locally for development purposes. From the root folder:
```sh
chmod +x deploy_local_example.sh
./deploy_local_example.sh
```The script will start a local container image and authenticate it using your local credentials.
Note: Although you can try the Dataproc Spawner image locally, you might run into networking communication problems.
### Google Compute Engine
To try it out in the Cloud, the quickest way is to to use a test Compute Engine instance. The following takes you through the process.
1. Set your working project
```bash
PROJECT_ID=
VM_NAME=vm-spawner
```1. Run the example script which:
a. Creates a Dockerfile
b. Creates a jupyter_config.py example file that uses a dummy authenticator.
c. Deploy a Docker image of the JupyterHub spawner in [Google Container Registry][gcr]
d. Create a container-based Compute Engine
e. Returns the IP of the instance that runs JupyterHub.```bash
bash deploy_gce_example.sh ${PROJECT_ID} ${VM_NAME}
```1. After the script finishes, you should see an IP displayed. You can use that IP to access your setup at `:8000`. You might have to wait for a few minutes until the container is deployed on the instance.
[gcr]: https://cloud.google.com/container-registry/
## Troubleshooting
To troubleshoot
1. ssh into the VM:
```bash
gcloud compute ssh ${VM_NAME}
```1. From the VM console, install some useful tools:
```bash
apt-get update
apt-get install vim
```1. From the VM console, you can:
- List the running containers with `docker ps`
- Display container logs `docker logs -f `
- Execute code in the container `docker exec -it /bin/bash`
- Restart the container for changes to take effect `docker restart `## Notes
- DataprocSpawner defaults to port 12345, the port can be set within `jupyterhub_config.py`. More info in JupyterHub's [jupyterhub_documentation].
c.Spawner.port = {port number}
- The region default is ``us-central1`` for Dataproc clusters. The zone default is ``us-central1-a``. Using ``global`` is currently unsupported. To change region, pick a region and zone from this [list][locations_list] and include the following lines in ``jupyterhub_config.py``:
.. code-block:: console
c.DataprocSpawner.region = '{region}'
c.DataprocSpawner.zone = '{zone that is within the chosen region}'[jupyterhub_documentation]: https://jupyterhub.readthedocs.io/en/stable/api/spawner.html#jupyterhub.spawner.Spawner.port
[locations_list]: https://cloud.google.com/compute/docs/regions-zones/## Next
- For an example on how to run the Dataproc spawner in production, refer to the [ai-notebook-extended Github repository](https://github.com/GoogleCloudPlatform/ai-notebooks-extended).
- For a Google-supported version of the Dataproc Spawner, refer to the official [Dataproc Hub documentation](https://cloud.google.com/dataproc/docs/tutorials/dataproc-hub-admins).## Disclaimer
[This is not an official Google product.](https://opensource.google.com/docs/releasing/publishing/#disclaimer)