https://github.com/susumuota/kaggleenv
GCP + Kaggle Docker + VSCode
- Host: GitHub
- URL: https://github.com/susumuota/kaggleenv
- Owner: susumuota
- Created: 2021-04-01T17:24:50.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-02-28T04:31:30.000Z (almost 3 years ago)
- Last Synced: 2024-07-30T18:53:06.134Z (5 months ago)
- Topics: docker, docker-compose, google-cloud, google-cloud-platform, jupyter, jupyter-notebook, jupyterlab, kaggle, python, python3, visual-studio-code, vscode
- Language: Dockerfile
- Homepage:
- Size: 156 KB
- Stars: 14
- Watchers: 3
- Forks: 5
- Open Issues: 0
# GCP (or local machine) + Kaggle Docker + VSCode
![vscode_jupyter](https://user-images.githubusercontent.com/1632335/113431667-0d1b8c80-9417-11eb-8183-e89084670f39.png)
This document describes how to set up the [Kaggle Python docker image](https://github.com/Kaggle/docker-python) environment on [Google Cloud Platform (GCP)](https://cloud.google.com/) or your local machine with [Docker](https://www.docker.com/), and how to set up [Visual Studio Code (VSCode)](https://code.visualstudio.com/) to connect to that environment.
The primary information source is [Kaggle's docker-python repository](https://github.com/Kaggle/docker-python). There is also a [guide](https://medium.com/kaggleteam/how-to-get-started-with-data-science-in-containers-6ed48cb08266), but it was written in 2016 and is somewhat outdated.
> **_Note:_** This method may take 20-30 minutes and over 18.5 GB of disk space for data downloads.
> **_Note:_** If you do not use VSCode, you do not need to read this document. See [here](https://www.kaggle.com/product-feedback/159602).
All files in this document are available on [my repository](https://github.com/susumuota/kaggleenv).
There are two options: GCP or your local machine. If you are going to set up the environment on your local machine, skip to the `[Option 2] Setup the environment on your local machine` section.
## [Option 1] Setup the environment on GCP
On GCP, ["Vertex AI Workbench"](https://cloud.google.com/vertex-ai/docs/workbench) is an easier way than ["Compute Engine"](https://cloud.google.com/compute/docs/) (GCE) to set up the [Kaggle Python docker image](https://github.com/Kaggle/docker-python).
### Create a Vertex AI Workbench
- Access https://console.cloud.google.com/vertex-ai/workbench
- Select a project e.g. `kaggle-myproject-1` (You must [create a project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#creating_a_project) beforehand)
- Click `USER-MANAGED NOTEBOOK`
- Click `NEW NOTEBOOK`
- Choose `Customize...`
- Instance name: e.g. `kaggle-test-1`
- Environment: `Kaggle Python [BETA]` (This option will automatically prepare the [Kaggle Python docker image](https://github.com/Kaggle/docker-python) when the VM instance starts up)
- GPU type: e.g. `NVIDIA Tesla T4` (You must [increase GPU quota](https://cloud.google.com/compute/quotas#requesting_additional_quota) beforehand)
- Mark the checkbox `Install NVIDIA GPU driver automatically for me`
![gcp_notebook_1](https://user-images.githubusercontent.com/1632335/115653028-636e5200-a369-11eb-9bda-8c34036591f4.png)
- Open `Networking` section
- Mark the radio button `Networks in this project`
- Clear the checkbox `Allow proxy access when it's available` (Clearing it avoids loading an unnecessary proxy Docker container)
- Click `CREATE`
![gcp_notebook_2](https://user-images.githubusercontent.com/1632335/115653237-c7911600-a369-11eb-88b1-db382a1f997e.png)
- Wait around 20-30 minutes for the VM instance to start up. I guess this is because of `docker pull`. If you choose GPU type `None`, it takes only a few minutes. Check the console logs [here](https://console.cloud.google.com/logs/).
### Connect to the VM instance
- [Install Cloud SDK](https://cloud.google.com/sdk/docs/quickstart). If you are using macOS and Homebrew, `brew install --cask google-cloud-sdk` may be convenient.
After that, `gcloud` command should be available on your terminal.
- SSH to the VM instance with port forwarding
```
% gcloud compute --project "kaggle-myproject-1" ssh --zone "us-west1-b" "kaggle-test-1" -- -L 8080:localhost:8080
```
> **_Note:_** You must wait for the VM instance to start up. Check the console logs [here](https://console.cloud.google.com/logs/).
> **_Note:_** I recommend limiting the source IP ranges for the SSH and RDP ports. See [here](https://cloud.google.com/vpc/docs/using-firewalls#creating_firewall_rules).
- Open a web browser and try to access `http://localhost:8080`
> **_Note:_** At this point the URL has no `token=...` parameter.
If you do not use VSCode, that's all. You do not have to do anything below.
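If you want to script the connection step, the port-forwarding command above can be assembled from its parts. This is only a sketch: the helper name `gcloud_ssh_command` is hypothetical, and the project, zone and instance values are the example values used in this document.

```python
import shlex


def gcloud_ssh_command(project, zone, instance, port=8080):
    """Build the argument list for `gcloud compute ssh` with local port forwarding."""
    return [
        "gcloud", "compute", "--project", project,
        "ssh", "--zone", zone, instance,
        # Everything after "--" is passed through to ssh itself.
        "--", "-L", f"{port}:localhost:{port}",
    ]


cmd = gcloud_ssh_command("kaggle-myproject-1", "us-west1-b", "kaggle-test-1")
# Print a shell-safe version of the command for review before running it.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would start the SSH session; printing it first makes it easy to confirm that the forwarded port matches the one you open in the browser.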
### Stop pre-installed Docker container
If you use VSCode to connect to the GCP Notebook, you must tweak the Docker container. At the moment, VSCode can only access remote Jupyter servers that have the `token` option enabled, but the pre-installed Docker container disables it with `c.NotebookApp.token = ''`. You must stop the pre-installed Docker container and run a new one with the `token` option enabled instead.
- Stop pre-installed Docker container
Stop the pre-installed Docker container and turn off its automatic restart policy. See details [here](https://docs.docker.com/config/containers/start-containers-automatically/).
```
% docker ps -a
% docker inspect -f "{{.Name}} {{.HostConfig.RestartPolicy.Name}}" $(docker ps -aq)
% docker update --restart no payload-container
% docker inspect -f "{{.Name}} {{.HostConfig.RestartPolicy.Name}}" $(docker ps -aq)
% docker stop payload-container
% docker ps -a
```
- Install `docker-compose`
`docker-compose` is convenient for running containers, even a single one. See details [here](https://docs.docker.com/compose/install/).
```
% sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
% sudo chmod +x /usr/local/bin/docker-compose
```
Skip to `Run Docker container` section.
## [Option 2] Setup the environment on your local machine
If you set up the environment on your local machine, [install and set up Docker](https://docs.docker.com/get-docker/).
After that, `docker` and `docker-compose` commands should be available on your terminal.
```sh
% docker -v
Docker version 20.10.8, build 3967b7d
% docker-compose -v
docker-compose version 1.29.2, build 5becea4c
```
## Run Docker container (both GCP and local machine)
I prepared a [sample repository](https://github.com/susumuota/kaggleenv) of the `Dockerfile`, etc. If you do not care about details, execute these commands and skip to `Open Notebook by web browser` section.
```
% git clone https://github.com/susumuota/kaggleenv.git
% cd kaggleenv
% docker-compose build
% docker-compose up -d
% docker-compose logs
# Find and copy http://localhost:8080/?token=...
```
Otherwise, follow the instructions below.
### Create `Dockerfile`
Create a directory (e.g. `kaggleenv`) and go there. If you cloned the sample repository, just `cd kaggleenv`.
Create `Dockerfile` like the following. See details [here](https://docs.docker.com/engine/reference/builder/#format). If you use CPU instead of GPU, edit the `FROM` lines.
```Dockerfile
# for CPU
# FROM gcr.io/kaggle-images/python:v109
# for GPU
FROM gcr.io/kaggle-gpu-images/python:v109
# apply patch to enable token and change notebook directory to /kaggle/working
# see jupyter_notebook_config.py.patch
COPY jupyter_notebook_config.py.patch /opt/jupyter/.jupyter/
RUN (cd /opt/jupyter/.jupyter/ && patch < jupyter_notebook_config.py.patch)
# add extra modules here
# RUN pip install -U pip
```
You can specify a tag (e.g. `v109` or `latest`) to keep using the same environment. You can find tags from [GCR page](https://gcr.io/kaggle-images/python).
### Create `jupyter_notebook_config.py.patch`
This Docker image will run Jupyter Lab with startup script `/run_jupyter.sh` and config `/opt/jupyter/.jupyter/jupyter_notebook_config.py`. It needs to be tweaked like the following.
- Enable token (so that VSCode can connect properly)
- Change notebook directory to `/kaggle/working`
Create `jupyter_notebook_config.py.patch` like the following.
```patch
--- jupyter_notebook_config.py.orig 2021-12-19 07:04:25.000000000 +0000
+++ jupyter_notebook_config.py 2022-01-29 18:19:29.016821460 +0000
@@ -11 +11 @@
-c.ServerApp.token = ""
+# c.ServerApp.token = ""
@@ -17 +17,2 @@
-c.ServerApp.notebook_dir = "/home/jupyter"
+# c.ServerApp.notebook_dir = "/home/jupyter"
+c.ServerApp.notebook_dir = "/kaggle/working"
```
> **_Note:_** This patch may not work with future versions of the [Kaggle Python docker image](https://github.com/Kaggle/docker-python). In that case, create a new patch with `diff -u original new > patch`. I have confirmed that this patch works with the `v109` tag.
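When you move to a newer image tag, it may help to check whether the config still disables the token before writing a new patch. The sketch below is only an illustration: `token_is_disabled` is a hypothetical helper, and it assumes the setting is still spelled `c.ServerApp.token` as in the patch above.

```python
def token_is_disabled(config_text):
    """Return True if an uncommented line sets c.ServerApp.token to an empty string."""
    for line in config_text.splitlines():
        stripped = line.strip()
        # Skip comments and lines that are not assignments.
        if stripped.startswith("#") or "=" not in stripped:
            continue
        name, value = (part.strip() for part in stripped.split("=", 1))
        if name == "c.ServerApp.token" and value in ('""', "''"):
            return True
    return False


# The two relevant lines before and after applying the patch above.
original = 'c.ServerApp.token = ""\nc.ServerApp.notebook_dir = "/home/jupyter"\n'
patched = '# c.ServerApp.token = ""\nc.ServerApp.notebook_dir = "/kaggle/working"\n'

print(token_is_disabled(original))
print(token_is_disabled(patched))
```

If the check returns `True` for the config inside a new image, the token is still being disabled there and VSCode will not be able to connect until it is patched.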
### Create `docker-compose.yml`
Create `docker-compose.yml` like the following. See details [here](https://docs.docker.com/compose/). This setting mounts the current directory on your local machine to `/kaggle/working` in the container. If you use CPU instead of GPU, comment out `runtime: nvidia`.
```yaml
version: "3"
services:
  jupyter:
    build: .
    volumes:
      - $PWD:/kaggle/working
    working_dir: /kaggle/working
    ports:
      - "8080:8080"
    hostname: localhost
    restart: always
    # for GPU
    runtime: nvidia
```
### Create `.dockerignore`
Create `.dockerignore` like the following. See details [here](https://docs.docker.com/engine/reference/builder/#dockerignore-file). This setting specifies subdirectories and files that should be ignored when building Docker images. You will **mount** the current directory, so you do not need to **include** its subdirectories and files in the image. In particular, the `input` directory should be ignored because it may contain large files that make the build process slow.
```
README.md
input
output
.git
.gitignore
.vscode
.ipynb_checkpoints
```
### Run `docker-compose build`
Run `docker-compose build` to build the Docker image. See details [here](https://docs.docker.com/compose/reference/build/).
> **_Note:_** This process may take 20-30 minutes and over 18.5 GB of disk space for data downloads on your local machine.
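Since the build needs that much disk, a quick free-space check before starting can save a failed 20-30 minute run. This is a rough sketch with hypothetical helper names. It checks the filesystem that holds the given path, which is an assumption: Docker may keep its images somewhere else (for example inside the Docker Desktop disk image), so treat the result as a hint only.

```python
import shutil

REQUIRED_GB = 18.5  # image size quoted in this document


def enough_space(free_bytes, required_gb=REQUIRED_GB):
    """Return True if free_bytes covers required_gb (decimal gigabytes)."""
    return free_bytes >= required_gb * 1e9


def check_path(path="."):
    """Check the filesystem holding `path`; Docker's own storage may live elsewhere."""
    return enough_space(shutil.disk_usage(path).free)


print("enough free space:", check_path("."))
```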
```sh
% docker-compose build
```
Confirm the image by `docker images`.
```sh
% docker images
REPOSITORY          TAG       IMAGE ID       CREATED          SIZE
kaggleenv_jupyter   latest    ............   28 minutes ago   18.5GB
```
### Run `docker-compose up -d`
Run `docker-compose up -d` to start the Docker container in the background. In addition, the container will start automatically whenever the VM instance or local machine starts up. See details [here](https://docs.docker.com/compose/reference/up/) and [here](https://docs.docker.com/config/containers/start-containers-automatically/).
```sh
% docker-compose up -d
% docker ps -a
% docker inspect -f "{{.Name}} {{.HostConfig.RestartPolicy.Name}}" $(docker ps -aq)
```
Find the Notebook URL on the log and copy it.
```
% docker-compose logs
http://localhost:8080/?token=...
```
### Open Notebook by web browser
- Open a web browser and enter the Notebook URL (`http://localhost:8080/?token=...`).
- Create a `Python 3` Notebook.
- Create code cells and execute `!pwd`, `!ls` and `!pip list` to confirm the Python environment.
![jupyter_kaggle](https://user-images.githubusercontent.com/1632335/113484058-5afcc700-94e1-11eb-9f2e-a6fd01a0121a.png)
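If you prefer not to scan the log by eye, the Notebook URL can be picked out of the `docker-compose logs` text with a small helper. This is a sketch under assumptions: `find_notebook_url` is a hypothetical name, the sample log lines are made up for illustration, and the pattern assumes the token is a hexadecimal string on a `localhost:8080` URL.

```python
import re

# Assumption: the log contains a URL of the form http://localhost:8080/...?token=<hex>
TOKEN_URL = re.compile(r"http://localhost:8080/\S*\?token=[0-9a-f]+")


def find_notebook_url(log_text):
    """Return the first Notebook URL with a token found in log_text, or None."""
    match = TOKEN_URL.search(log_text)
    return match.group(0) if match else None


# Made-up log excerpt, for illustration only.
sample_log = (
    "jupyter_1  | Jupyter Server is running at:\n"
    "jupyter_1  |     http://localhost:8080/?token=0123abcd\n"
)
print(find_notebook_url(sample_log))
```

In practice the log text could come from `subprocess.run(["docker-compose", "logs"], capture_output=True, text=True).stdout`.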
### Setup Kaggle API
[Setup Kaggle API credentials](https://github.com/Kaggle/kaggle-api#api-credentials).
After that, `~/.kaggle/kaggle.json` file should be on your local machine.
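Before copying the file around, it can be worth confirming that it has the two expected fields (`username` and `key`) and that it is readable only by you, as in the `-rw-------` listing further down. The helpers below are a minimal sketch and their names are hypothetical.

```python
import json
import os
import stat


def find_problems(data, mode):
    """Return a list of problems for parsed kaggle.json data and a file mode."""
    problems = []
    for field in ("username", "key"):
        if not data.get(field):
            problems.append(f"missing field: {field}")
    # Any group or other permission bit means the key is exposed to other users.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"permissions too open: {oct(mode)}")
    return problems


def check_kaggle_json(path):
    """Load kaggle.json from path and report problems (empty list means it looks fine)."""
    with open(path) as f:
        data = json.load(f)
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return find_problems(data, mode)
```

For example, `check_kaggle_json(os.path.expanduser("~/.kaggle/kaggle.json"))` should return an empty list for a file created with mode `600`.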
- Copy `~/.kaggle/kaggle.json` to current directory **on your local machine** (so that it can be accessed from the container at `/kaggle/working/kaggle.json`)
```sh
% cp -p ~/.kaggle/kaggle.json .
```
- Create a code cell on the Notebook and confirm `/kaggle/working/kaggle.json` on the container.
```sh
!ls -l /kaggle/working/kaggle.json
-rw------- 1 root root 65 Mar 22 07:59 /kaggle/working/kaggle.json
```
- Copy it to `~/.kaggle` directory on the container.
```sh
!cp -p /kaggle/working/kaggle.json ~/.kaggle/
```
- Remove `kaggle.json` on the current directory **on your local machine**.
```sh
% rm -i kaggle.json
```
- Try `kaggle` command on the Notebook.
```sh
!kaggle competitions list
```
### Shutdown the Vertex AI Workbench (GCP)
After you finish your work, stop the VM instance.
- Access https://console.cloud.google.com/vertex-ai/workbench/list/instances
- Check the VM instance on the list
- Click `STOP` or `DELETE`
If you `DELETE` the VM instance, you will not be charged anything (as far as I know).
However, if you `STOP` the VM instance, you will be charged for resources (e.g. persistent disk) until you `DELETE` it. You should `DELETE` if you do not use it for a long time (though you must setup the environment again). See details [here](https://cloud.google.com/compute/docs/instances/stop-start-instance#billing).
### Run `docker-compose down` (local machine)
After you finish your work, run `docker-compose down` to stop the Docker container. See details [here](https://docs.docker.com/compose/reference/down/).
```sh
% docker-compose down
```
## Setup VSCode to open remote Notebooks
If you are using [Visual Studio Code (VSCode)](https://code.visualstudio.com/), you can set up VSCode to connect to the remote Notebook.
### [Optional] Install the latest Notebook extension
There is a revamped version of Notebook extension. See details [here](https://devblogs.microsoft.com/python/notebooks-are-getting-revamped/). I recommend installing it because this new version can handle custom extensions (e.g. key bindings) properly inside code cells, etc.
![vscode_jupyter](https://user-images.githubusercontent.com/1632335/113431667-0d1b8c80-9417-11eb-8183-e89084670f39.png)
### Connect to the remote Notebook
Connect to the remote Notebook. See details [here](https://code.visualstudio.com/docs/python/jupyter-support#_connect-to-a-remote-jupyter-server).
- Open `Command Palette...`
- Type `Jupyter: Specify local or remote Jupyter server for connections`
![vscode_palette](https://user-images.githubusercontent.com/1632335/113466765-3bca4f00-9479-11eb-914e-7d90ac073daf.png)
- Choose `Existing: Specify the URI of an existing server`
![vscode_existing](https://user-images.githubusercontent.com/1632335/113467276-01fb4780-947d-11eb-93f6-a4f5a974d323.png)
- Specify the Notebook URL (`http://localhost:8080/?token=...`)
> **_Note:_** `token` must be specified.
![vscode_uri](https://user-images.githubusercontent.com/1632335/113467238-c2ccf680-947c-11eb-9388-1ecd2297eb6b.png)
- Press `Reload` button
![vscode_reload](https://user-images.githubusercontent.com/1632335/113467629-31ab4f00-947f-11eb-9062-1bbc5566ab86.png)
- Open `Command Palette...`
- Type `Jupyter: Create New Blank Notebook`
![vscode_create](https://user-images.githubusercontent.com/1632335/113467560-9f0ab000-947e-11eb-865e-62beeed43f12.png)
- Create code cells and execute `!pwd`, `!ls` and `!pip list` to confirm Python environment.
![vscode_new_notebook](https://user-images.githubusercontent.com/1632335/113467525-75518900-947e-11eb-86e1-e9e79d84e610.png)
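Because the `token` must be present in the URI you give to VSCode, a quick check of the URL before pasting it can avoid a confusing connection failure. This is a small sketch and `has_token` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlparse


def has_token(url):
    """Return True if the URL carries a non-empty `token` query parameter."""
    query = parse_qs(urlparse(url).query)
    # parse_qs drops blank values by default, so "?token=" counts as missing.
    return bool(query.get("token", [""])[0])


print(has_token("http://localhost:8080/?token=0123abcd"))
print(has_token("http://localhost:8080/"))
```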
## Increase Docker resources (local machine)
Sometimes containers need more resources (e.g. memory or disk) than the defaults allow. You can increase the amount of resources from the Docker preferences.
- Click Docker icon
- Choose `Preferences...`
- Click `Resources`
- Click `ADVANCED`
- Increase `Memory`, e.g. `8GB`
- Increase `Disk image size`, e.g. `128GB`
- Click `Apply & Restart`
![docker_preferences](https://user-images.githubusercontent.com/1632335/137115613-88386ee7-807e-4252-920f-73ef04d9a18a.png)
## Maintain Docker containers, images and cache
Basically `docker-compose up -d` and `docker-compose down` work well, but sometimes you may need to use these commands to maintain Docker containers, images and cache.
- How to remove containers. See details [here](https://docs.docker.com/engine/reference/commandline/rm/).
```sh
% docker ps -a # confirm container ids to remove
% docker rm CONTAINER # remove container by id
% docker rm $(docker ps --filter status=exited -q) # remove all containers that have exited
```
- How to remove images. See details [here](https://docs.docker.com/engine/reference/commandline/rmi/).
```sh
% docker images # confirm image ids to remove
% docker rmi IMAGE # remove image by id
```
- How to remove cache. See details [here](https://docs.docker.com/engine/reference/commandline/builder_prune/) and [here](https://docs.docker.com/engine/reference/commandline/volume_prune/).
```sh
% docker system df # confirm how much disk used by cache
% docker builder prune
% docker volume prune
```
## TODO
- Workflow to submit local Notebook to Kaggle
## Links
- https://github.com/Kaggle/docker-python
- https://medium.com/kaggleteam/how-to-get-started-with-data-science-in-containers-6ed48cb08266
- https://github.com/susumuota/kaggleenv
- https://cloud.google.com/vertex-ai/docs/workbench
- https://cloud.google.com/sdk/docs/quickstart
- https://code.visualstudio.com/docs/python/jupyter-support#_connect-to-a-remote-jupyter-server
- https://devblogs.microsoft.com/python/notebooks-are-getting-revamped/
- https://www.kaggle.com/product-feedback/159602
- https://amalog.hateblo.jp/entry/data-analysis-docker (Japanese)
## Author
Susumu OTA