https://github.com/susumuota/kaggleenv

GCP + Kaggle Docker + VSCode
Host: GitHub
URL: https://github.com/susumuota/kaggleenv
Owner: susumuota
Created: 2021-04-01T17:24:50.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2022-02-28T04:31:30.000Z (about 3 years ago)
Last Synced: 2025-04-22T15:08:55.339Z (29 days ago)
Topics: docker, docker-compose, google-cloud, google-cloud-platform, jupyter, jupyter-notebook, jupyterlab, kaggle, python, python3, visual-studio-code, vscode
Language: Dockerfile
Homepage:
Size: 156 KB
Stars: 15
Watchers: 2
Forks: 5
Open Issues: 1
Metadata Files:
- Readme: README.md

# GCP (or local machine) + Kaggle Docker + VSCode

![vscode_jupyter](https://user-images.githubusercontent.com/1632335/113431667-0d1b8c80-9417-11eb-8183-e89084670f39.png)

This document describes how to set up the [Kaggle Python docker image](https://github.com/Kaggle/docker-python) environment on [Google Cloud Platform (GCP)](https://cloud.google.com/) or on your local machine with [Docker](https://www.docker.com/), and how to set up [Visual Studio Code (VSCode)](https://code.visualstudio.com/) to connect to that environment.

The primary source of information is [Kaggle's docker-python repository](https://github.com/Kaggle/docker-python). There is also a [guide](https://medium.com/kaggleteam/how-to-get-started-with-data-science-in-containers-6ed48cb08266), but it was written in 2016 and is somewhat outdated.

> **_Note:_** This method may take 20-30 minutes and needs over 18.5GB of disk space for data downloads.

> **_Note:_** If you do not use VSCode, you do not need to read this document. See [here](https://www.kaggle.com/product-feedback/159602).

All files in this document are available on [my repository](https://github.com/susumuota/kaggleenv).

There are 2 options: GCP or your local machine. If you are going to set up the environment on your local machine, skip to the `[Option 2] Setup the environment on your local machine` section.

## [Option 1] Setup the environment on GCP

On GCP, ["Vertex AI Workbench"](https://cloud.google.com/vertex-ai/docs/workbench) is easier than ["Compute Engine"](https://cloud.google.com/compute/docs/) (GCE) for setting up the [Kaggle Python docker image](https://github.com/Kaggle/docker-python).

### Create a Vertex AI Workbench

- Access https://console.cloud.google.com/vertex-ai/workbench

- Select a project e.g. `kaggle-myproject-1` (You must [create a project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#creating_a_project) beforehand)

- Click `USER-MANAGED NOTEBOOK`

- Click `NEW NOTEBOOK`

- Choose `Customize...`

- Instance name: e.g. `kaggle-test-1`

- Environment: `Kaggle Python [BETA]` (This option automatically prepares the [Kaggle Python docker image](https://github.com/Kaggle/docker-python) when the VM instance starts up)

- GPU type: e.g. `NVIDIA Tesla T4` (You must [increase GPU quota](https://cloud.google.com/compute/quotas#requesting_additional_quota) beforehand)

  - Mark the checkbox `Install NVIDIA GPU driver automatically for me`

![gcp_notebook_1](https://user-images.githubusercontent.com/1632335/115653028-636e5200-a369-11eb-9bda-8c34036591f4.png)

- Open `Networking` section

  - Mark the radio button `Networks in this project`

  - Clear the checkbox `Allow proxy access when it's available` (This avoids loading an unnecessary proxy Docker container)

- Click `CREATE`

![gcp_notebook_2](https://user-images.githubusercontent.com/1632335/115653237-c7911600-a369-11eb-88b1-db382a1f997e.png)

- Wait around 20-30 minutes for the VM instance to start up. I guess most of that time is spent on `docker pull`. If you choose GPU type: `None`, it takes only a few minutes. Check the console logs [here](https://console.cloud.google.com/logs/).

### Connect to the VM instance

- [Install Cloud SDK](https://cloud.google.com/sdk/docs/quickstart). If you are using macOS and Homebrew, `brew install --cask google-cloud-sdk` may be convenient.

After that, the `gcloud` command should be available in your terminal.

- SSH to the VM instance with port forwarding

```
% gcloud compute --project "kaggle-myproject-1" ssh --zone "us-west1-b" "kaggle-test-1" -- -L 8080:localhost:8080
```
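If you reconnect often, the command above can be assembled from its parts. The following is a minimal Python sketch and is not part of the repository: the function name `gcloud_ssh_command` is hypothetical, and the project, zone, and instance values are the example values used above.

```python
import shlex


def gcloud_ssh_command(project, zone, instance, port=8080):
    """Build the argument list for the SSH command with port forwarding."""
    return [
        "gcloud", "compute", "--project", project,
        "ssh", "--zone", zone, instance,
        "--",  # everything after this is passed to ssh itself
        "-L", f"{port}:localhost:{port}",
    ]


# Example values from this document; replace them with your own.
cmd = gcloud_ssh_command("kaggle-myproject-1", "us-west1-b", "kaggle-test-1")
print(shlex.join(cmd))
# To actually connect, pass the list to subprocess.run(cmd, check=True).
```

Building a list instead of a single string avoids shell quoting problems when a project or instance name contains unusual characters.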

> **_Note:_** You must wait until the VM instance has started up. Check the console logs [here](https://console.cloud.google.com/logs/).

> **_Note:_** I recommend limiting the source IP ranges for the SSH and RDP ports. See [here](https://cloud.google.com/vpc/docs/using-firewalls#creating_firewall_rules).

- Open a web browser and access `http://localhost:8080`

> **_Note:_** This URL has no `token=...` parameter.

If you do not use VSCode, that's all. You do not have to do anything below.

### Stop pre-installed Docker container

If you use VSCode to connect to the GCP Notebook, you must tweak the Docker container. At the moment, VSCode can only access remote Jupyter servers that have the `token` option enabled, but the pre-installed Docker container disables it with `c.NotebookApp.token = ''`. You must stop the pre-installed Docker container and run a new one with the `token` option enabled instead.

- Stop the pre-installed Docker container

Stop the pre-installed Docker container and turn off its automatic restart. See details [here](https://docs.docker.com/config/containers/start-containers-automatically/).

```
% docker ps -a
% docker inspect -f "{{.Name}} {{.HostConfig.RestartPolicy.Name}}" $(docker ps -aq)
% docker update --restart no payload-container
% docker inspect -f "{{.Name}} {{.HostConfig.RestartPolicy.Name}}" $(docker ps -aq)
% docker stop payload-container
% docker ps -a
```
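The `docker inspect` lines above print one `name policy` pair per container, which is easy to miss by eye when several containers exist. As a rough sketch (the function names and the sample text are made up for illustration, and the parsing assumes the exact `-f` format string used above), the output could be checked like this:

```python
def restart_policies(inspect_output):
    """Map container name -> restart policy from the `docker inspect -f` output."""
    policies = {}
    for line in inspect_output.splitlines():
        parts = line.split()
        if not parts:
            continue
        name = parts[0].lstrip("/")  # Docker prints names with a leading slash
        # Treat a missing policy field as "no" (assumption for older Docker output).
        policies[name] = parts[1] if len(parts) > 1 else "no"
    return policies


def auto_restarting(inspect_output):
    """Names of containers that Docker would still restart automatically."""
    return [n for n, p in restart_policies(inspect_output).items() if p != "no"]


# Made-up sample output; "other-container" is a hypothetical name.
sample = "/payload-container always\n/other-container no\n"
print(auto_restarting(sample))
```

After `docker update --restart no payload-container`, the list for the real output should no longer contain `payload-container`.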

- Install `docker-compose`

`docker-compose` is convenient for running containers, even a single one. See details [here](https://docs.docker.com/compose/install/).

```
% sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
% sudo chmod +x /usr/local/bin/docker-compose
```
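To see what the `curl` URL above expands to on a given machine, the same string can be built in Python. This is only an illustrative sketch: `compose_download_url` is a hypothetical helper, and it assumes that `platform.system()` and `platform.machine()` return the same values as `uname -s` and `uname -m`, which holds on typical Linux hosts.

```python
import platform


def compose_download_url(version="1.29.2", system=None, machine=None):
    """Build the docker-compose release URL used in the curl command above."""
    system = system or platform.system()     # e.g. "Linux", like `uname -s`
    machine = machine or platform.machine()  # e.g. "x86_64", like `uname -m`
    return (
        "https://github.com/docker/compose/releases/download/"
        f"{version}/docker-compose-{system}-{machine}"
    )


print(compose_download_url(system="Linux", machine="x86_64"))
```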

Skip to `Run Docker container` section.

## [Option 2] Setup the environment on your local machine

If you set up the environment on your local machine, [install and set up Docker](https://docs.docker.com/get-docker/).

After that, the `docker` and `docker-compose` commands should be available in your terminal.

```sh
% docker -v
Docker version 20.10.8, build 3967b7d
% docker-compose -v
docker-compose version 1.29.2, build 5becea4c
```
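If a setup script needs to compare these versions, the numbers can be pulled out of the output lines. The sketch below assumes the output format shown above; `parse_version` is a hypothetical helper name.

```python
import re


def parse_version(output):
    """Extract (major, minor, patch) from a `docker -v` style output line."""
    match = re.search(r"version\s+v?(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version found in: {output!r}")
    return tuple(int(part) for part in match.groups())


docker_version = parse_version("Docker version 20.10.8, build 3967b7d")
compose_version = parse_version("docker-compose version 1.29.2, build 5becea4c")
print(docker_version, compose_version)
```

Tuples compare element by element, so a check such as `compose_version >= (1, 29, 0)` works as expected.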

## Run Docker container (both GCP and local machine)

I prepared a [sample repository](https://github.com/susumuota/kaggleenv) containing the `Dockerfile` and the other files. If you do not care about the details, execute these commands and skip to the `Open Notebook by web browser` section.

```
% git clone https://github.com/susumuota/kaggleenv.git
% cd kaggleenv
% docker-compose build
% docker-compose up -d
% docker-compose logs
# Find and copy http://localhost:8080/?token=...
```

Otherwise, follow the instructions below.

### Create `Dockerfile`

Create a directory (e.g. `kaggleenv`) and go there. If you cloned the sample repository, just `cd kaggleenv`.

Create a `Dockerfile` like the following. See details [here](https://docs.docker.com/engine/reference/builder/#format). If you use CPU instead of GPU, edit the `FROM` lines.

```Dockerfile
# for CPU
# FROM gcr.io/kaggle-images/python:v109

# for GPU
FROM gcr.io/kaggle-gpu-images/python:v109

# apply patch to enable token and change notebook directory to /kaggle/working
# see jupyter_notebook_config.py.patch
COPY jupyter_notebook_config.py.patch /opt/jupyter/.jupyter/
RUN (cd /opt/jupyter/.jupyter/ && patch < jupyter_notebook_config.py.patch)

# add extra modules here
# RUN pip install -U pip
```

You can pin a specific tag (e.g. `v109`) to keep using the same environment, or use `latest` to follow the newest image. You can find the available tags on the [GCR page](https://gcr.io/kaggle-images/python).

### Create `jupyter_notebook_config.py.patch`

This Docker image runs Jupyter Lab with the startup script `/run_jupyter.sh` and the config file `/opt/jupyter/.jupyter/jupyter_notebook_config.py`. The config needs to be tweaked as follows.

- Enable token (so that VSCode can connect properly)

- Change notebook directory to `/kaggle/working`

Create `jupyter_notebook_config.py.patch` like the following.

```patch
--- jupyter_notebook_config.py.orig	2021-12-19 07:04:25.000000000 +0000
+++ jupyter_notebook_config.py	2022-01-29 18:19:29.016821460 +0000
@@ -11 +11 @@
-c.ServerApp.token = ""
+# c.ServerApp.token = ""
@@ -17 +17,2 @@
-c.ServerApp.notebook_dir = "/home/jupyter"
+# c.ServerApp.notebook_dir = "/home/jupyter"
+c.ServerApp.notebook_dir = "/kaggle/working"
```

> **_Note:_** This patch may not work with future versions of the [Kaggle Python docker image](https://github.com/Kaggle/docker-python). In that case, create a new patch with `diff -u original new > patch`. I have confirmed that this patch works at least on the `v109` tag.
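If `diff` is not at hand, a unified diff can also be produced with Python's standard `difflib` module. The following is a sketch under stated assumptions: `make_patch` is a hypothetical helper, and the two strings contain only the two settings touched by the patch above, not the full config file from the image.

```python
import difflib


def make_patch(original_text, new_text, name="jupyter_notebook_config.py"):
    """Return a unified diff between two versions of a config file."""
    return "".join(
        difflib.unified_diff(
            original_text.splitlines(keepends=True),
            new_text.splitlines(keepends=True),
            fromfile=name + ".orig",
            tofile=name,
        )
    )


# Stand-ins for the original and edited config (only the relevant lines).
original = (
    'c.ServerApp.token = ""\n'
    'c.ServerApp.notebook_dir = "/home/jupyter"\n'
)
edited = (
    '# c.ServerApp.token = ""\n'
    '# c.ServerApp.notebook_dir = "/home/jupyter"\n'
    'c.ServerApp.notebook_dir = "/kaggle/working"\n'
)
patch_text = make_patch(original, edited)
print(patch_text)
```

In practice you would read both versions of the real file and write `patch_text` to `jupyter_notebook_config.py.patch`.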

### Create `docker-compose.yml`

Create `docker-compose.yml` like the following. See details [here](https://docs.docker.com/compose/). This setting mounts the current directory on your local machine to `/kaggle/working` in the container. If you use CPU instead of GPU, comment out `runtime: nvidia`.

```yaml
version: "3"
services:
  jupyter:
    build: .
    volumes:
      - $PWD:/kaggle/working
    working_dir: /kaggle/working
    ports:
      - "8080:8080"
    hostname: localhost
    restart: always
    # for GPU
    runtime: nvidia
```

### Create `.dockerignore`

Create `.dockerignore` like the following. See details [here](https://docs.docker.com/engine/reference/builder/#dockerignore-file). This file lists the subdirectories and files that should be ignored when building the Docker image. You will **mount** the current directory, so you do not need to **include** its subdirectories and files in the image. In particular, the `input` directory should be ignored because it may contain large files that make the build process take a long time.

```
README.md
input
output
.git
.gitignore
.vscode
.ipynb_checkpoints
```
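To decide which entries are worth listing in `.dockerignore`, it can help to see how large each top-level entry of the project directory is. The sketch below only measures sizes; it does not read `.dockerignore` or reproduce Docker's matching rules. The helper names and the demo file names are made up for illustration.

```python
import os
import tempfile


def tree_size(path):
    """Total size in bytes of one file, or of all regular files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):  # skip symlinks inside the tree
                total += os.path.getsize(full)
    return total


def top_level_sizes(directory="."):
    """Map each top-level entry of `directory` to its size in bytes."""
    return {
        entry: tree_size(os.path.join(directory, entry))
        for entry in sorted(os.listdir(directory))
    }


# Small self-contained demo in a temporary directory (file names are made up).
with tempfile.TemporaryDirectory() as demo:
    os.mkdir(os.path.join(demo, "input"))
    with open(os.path.join(demo, "input", "train.csv"), "w") as f:
        f.write("x" * 1000)
    with open(os.path.join(demo, "Dockerfile"), "w") as f:
        f.write("FROM base")
    demo_sizes = top_level_sizes(demo)

print(demo_sizes)
```

Running `top_level_sizes(".")` in the project directory shows which entries dominate and would slow down the build if they were not ignored.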

### Run `docker-compose build`

Run `docker-compose build` to build the Docker image. See details [here](https://docs.docker.com/compose/reference/build/).

> **_Note:_** This process may take 20-30 minutes and needs over 18.5GB of disk space for data downloads on your local machine.

```sh
% docker-compose build
```

Confirm the image with `docker images`.

```sh
% docker images
REPOSITORY          TAG       IMAGE ID       CREATED          SIZE
kaggleenv_jupyter   latest    ............   28 minutes ago   18.5GB
```
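Since the image is about 18.5GB, it may be worth checking free disk space before a rebuild. The following is a rough sketch with hypothetical helper names. It assumes decimal units as printed by `docker images`, and it checks the file system of the given path, which may differ from where Docker actually stores images (for example inside a Docker Desktop disk image).

```python
import re
import shutil

# Decimal units, as in the SIZE column above.
UNITS = {"B": 1, "kB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4}


def parse_size(text):
    """Convert a size string such as '18.5GB' to a number of bytes."""
    match = re.fullmatch(r"([0-9.]+)\s*(B|kB|MB|GB|TB)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def has_room(required, path="."):
    """True if the file system containing `path` has at least `required` free."""
    return shutil.disk_usage(path).free >= parse_size(required)


print(has_room("18.5GB"))
```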

### Run `docker-compose up -d`

Run `docker-compose up -d` to start the Docker container in the background. Because of `restart: always`, the container will also start automatically whenever the VM instance or local machine starts up. See details [here](https://docs.docker.com/compose/reference/up/) and [here](https://docs.docker.com/config/containers/start-containers-automatically/).

```sh
% docker-compose up -d
% docker ps -a
% docker inspect -f "{{.Name}} {{.HostConfig.RestartPolicy.Name}}" $(docker ps -aq)
```

Find the Notebook URL in the log and copy it.

```
% docker-compose logs
http://localhost:8080/?token=...
```
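The log can be long, so the URL is easy to overlook. As a small sketch, a regular expression can pick it out of the log text. The helper name `find_token_url` and the sample log lines are made up, and the real log format may differ between image versions.

```python
import re

# Matches an http(s) URL that carries a token query parameter.
TOKEN_URL = re.compile(r"https?://\S+?\?token=\w+")


def find_token_url(log_text):
    """Return the first Notebook URL with a token found in the log, or None."""
    match = TOKEN_URL.search(log_text)
    return match.group(0) if match else None


# Made-up sample resembling `docker-compose logs` output.
sample_log = (
    "jupyter_1  | Jupyter Server is running at:\n"
    "jupyter_1  |     http://localhost:8080/?token=0123abcd\n"
)
print(find_token_url(sample_log))
```

With real output, the text could come from `subprocess.run(["docker-compose", "logs"], capture_output=True, text=True).stdout`.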

### Open Notebook by web browser

- Open a web browser and enter the Notebook URL (`http://localhost:8080/?token=...`).

- Create a `Python 3` Notebook.

- Create code cells and execute `!pwd`, `!ls` and `!pip list` to confirm the Python environment.

![jupyter_kaggle](https://user-images.githubusercontent.com/1632335/113484058-5afcc700-94e1-11eb-9f2e-a6fd01a0121a.png)
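The same checks can also be done in Python inside a code cell, which is handy when you want the result as data rather than as printed text. This is only a sketch: `environment_summary` is a hypothetical helper, and the default package names are examples, not a list taken from the image.

```python
import os
import sys
from importlib import metadata


def environment_summary(packages=("numpy", "pandas")):
    """Collect the working directory, Python version and some package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed
    return {
        "cwd": os.getcwd(),
        "python": sys.version.split()[0],
        "packages": versions,
    }


summary = environment_summary()
print(summary)
```

Inside the container, `summary["cwd"]` is expected to be `/kaggle/working` if the patch and `docker-compose.yml` above were applied.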

### Setup Kaggle API

[Setup Kaggle API credentials](https://github.com/Kaggle/kaggle-api#api-credentials).

After that, the `~/.kaggle/kaggle.json` file should be on your local machine.

- Copy `~/.kaggle/kaggle.json` to the current directory **on your local machine** (so that it can be accessed from the container at `/kaggle/working/kaggle.json`)

```sh
% cp -p ~/.kaggle/kaggle.json .
```

- Create a code cell in the Notebook and confirm that `/kaggle/working/kaggle.json` exists in the container.

```sh
!ls -l /kaggle/working/kaggle.json
-rw------- 1 root root 65 Mar 22 07:59 /kaggle/working/kaggle.json
```

- Copy it to the `~/.kaggle` directory in the container.

```sh
!cp -p /kaggle/working/kaggle.json ~/.kaggle/
```

- Remove `kaggle.json` from the current directory **on your local machine**.

```sh
% rm -i kaggle.json
```

- Try the `kaggle` command in the Notebook.

```sh
!kaggle competitions list
```
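If the `kaggle` command complains about credentials, a quick check of the copied file can narrow down the cause. The sketch below uses a hypothetical helper, `check_kaggle_credentials`, and assumes that the file is JSON with `username` and `key` fields and should be readable only by its owner, as in the `-rw-------` listing above. The demo writes a fake file to a temporary directory on a POSIX system.

```python
import json
import os
import stat
import tempfile


def check_kaggle_credentials(path):
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    if not os.path.isfile(path):
        return [f"{path} does not exist"]
    try:
        with open(path) as f:
            data = json.load(f)
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(data, dict):
        return [f"{path} does not contain a JSON object"]
    problems = [f"missing field: {k}" for k in ("username", "key") if not data.get(k)]
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:  # any permission bit set for group or others
        problems.append(f"permissions are {oct(mode)}; expected 0o600")
    return problems


# Demo with a fake credentials file (values are placeholders, not real keys).
with tempfile.TemporaryDirectory() as demo:
    demo_path = os.path.join(demo, "kaggle.json")
    with open(demo_path, "w") as f:
        json.dump({"username": "someone", "key": "not-a-real-key"}, f)
    os.chmod(demo_path, 0o644)
    before = check_kaggle_credentials(demo_path)
    os.chmod(demo_path, 0o600)
    after = check_kaggle_credentials(demo_path)

print(before, after)
```

In the container you would call `check_kaggle_credentials(os.path.expanduser("~/.kaggle/kaggle.json"))`.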

### Shutdown the Vertex AI Workbench (GCP)

After you finish your work, stop the VM instance.

- Access https://console.cloud.google.com/vertex-ai/workbench/list/instances

- Check the VM instance on the list

- Click `STOP` or `DELETE`

If you `DELETE` the VM instance, you will not be charged anything (as far as I know).

However, if you `STOP` the VM instance, you will still be charged for resources (e.g. persistent disk) until you `DELETE` it. You should `DELETE` it if you will not use it for a long time (though you must then set up the environment again). See details [here](https://cloud.google.com/compute/docs/instances/stop-start-instance#billing).

### Run `docker-compose down` (local machine)

After you finish your work, run `docker-compose down` to stop the Docker container. See details [here](https://docs.docker.com/compose/reference/down/).

```sh
% docker-compose down
```

## Setup VSCode to open remote Notebooks

If you are using [Visual Studio Code (VSCode)](https://code.visualstudio.com/), you can set it up to connect to the remote Notebook.

### [Optional] Install the latest Notebook extension

There is a revamped version of the Notebook extension. See details [here](https://devblogs.microsoft.com/python/notebooks-are-getting-revamped/). I recommend installing it because the new version properly handles custom extensions (e.g. key bindings) inside code cells.

![vscode_jupyter](https://user-images.githubusercontent.com/1632335/113431667-0d1b8c80-9417-11eb-8183-e89084670f39.png)

### Connect to the remote Notebook

Connect to the remote Notebook. See details [here](https://code.visualstudio.com/docs/python/jupyter-support#_connect-to-a-remote-jupyter-server).

- Open `Command Palette...`

- Type `Jupyter: Specify local or remote Jupyter server for connections`

![vscode_palette](https://user-images.githubusercontent.com/1632335/113466765-3bca4f00-9479-11eb-914e-7d90ac073daf.png)

- Choose `Existing: Specify the URI of an existing server`

![vscode_existing](https://user-images.githubusercontent.com/1632335/113467276-01fb4780-947d-11eb-93f6-a4f5a974d323.png)

- Specify the Notebook URL (`http://localhost:8080/?token=...`)

> **_Note:_** `token` must be specified.

![vscode_uri](https://user-images.githubusercontent.com/1632335/113467238-c2ccf680-947c-11eb-9388-1ecd2297eb6b.png)

- Press `Reload` button

![vscode_reload](https://user-images.githubusercontent.com/1632335/113467629-31ab4f00-947f-11eb-9062-1bbc5566ab86.png)

- Open `Command Palette...`

- Type `Jupyter: Create New Blank Notebook`

![vscode_create](https://user-images.githubusercontent.com/1632335/113467560-9f0ab000-947e-11eb-865e-62beeed43f12.png)

- Create code cells and execute `!pwd`, `!ls` and `!pip list` to confirm the Python environment.

![vscode_new_notebook](https://user-images.githubusercontent.com/1632335/113467525-75518900-947e-11eb-86e1-e9e79d84e610.png)
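If VSCode fails to connect, it can help to confirm outside VSCode that the server accepts the token. The sketch below builds a request to the Jupyter server's `/api/status` endpoint with the token sent in an `Authorization` header. The helper name `status_request` and the token value are made up, and the request is only constructed here, not sent.

```python
from urllib.parse import parse_qs, urlsplit
from urllib.request import Request


def status_request(notebook_url):
    """Build a request for /api/status using the token from the Notebook URL."""
    parts = urlsplit(notebook_url)
    tokens = parse_qs(parts.query).get("token")
    if not tokens:
        raise ValueError("the Notebook URL has no token parameter")
    return Request(
        f"{parts.scheme}://{parts.netloc}/api/status",
        headers={"Authorization": f"token {tokens[0]}"},
    )


req = status_request("http://localhost:8080/?token=abc123")  # placeholder token
print(req.full_url, req.get_header("Authorization"))
# To send it: urllib.request.urlopen(req) should return JSON if the token is accepted.
```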

## Increase Docker resources (local machine)

Sometimes containers need a lot of resources (e.g. memory or disk). You can increase the amount of resources in the Docker preferences.

- Click Docker icon

- Choose `Preferences...`

- Click `Resources`

- Click `ADVANCED`

- Increase `Memory`, e.g. `8GB`

- Increase `Disk image size`, e.g. `128GB`

- Click `Apply & Restart`

![docker_preferences](https://user-images.githubusercontent.com/1632335/137115613-88386ee7-807e-4252-920f-73ef04d9a18a.png)

## Maintain Docker containers, images and cache

Basically `docker-compose up -d` and `docker-compose down` work well, but sometimes you may need to use these commands to maintain Docker containers, images and cache.

- How to remove containers. See details [here](https://docs.docker.com/engine/reference/commandline/rm/).

```sh
% docker ps -a  # confirm container ids to remove
% docker rm CONTAINER  # remove container by id
% docker rm $(docker ps --filter status=exited -q)  # remove all containers that have exited
```

- How to remove images. See details [here](https://docs.docker.com/engine/reference/commandline/rmi/).

```sh
% docker images  # confirm image ids to remove
% docker rmi IMAGE  # remove image by id
```

- How to remove cache. See details [here](https://docs.docker.com/engine/reference/commandline/builder_prune/) and [here](https://docs.docker.com/engine/reference/commandline/volume_prune/).

```sh
% docker system df  # check how much disk space the cache uses
% docker builder prune  # remove build cache
% docker volume prune  # remove unused volumes
```

## TODO

- Workflow to submit local Notebook to Kaggle

## Links

- https://github.com/Kaggle/docker-python

- https://medium.com/kaggleteam/how-to-get-started-with-data-science-in-containers-6ed48cb08266

- https://github.com/susumuota/kaggleenv

- https://cloud.google.com/vertex-ai/docs/workbench

- https://cloud.google.com/sdk/docs/quickstart

- https://code.visualstudio.com/docs/python/jupyter-support#_connect-to-a-remote-jupyter-server

- https://devblogs.microsoft.com/python/notebooks-are-getting-revamped/

- https://www.kaggle.com/product-feedback/159602

- https://amalog.hateblo.jp/entry/data-analysis-docker  (Japanese)

## Author

Susumu OTA