https://github.com/apache/incubator-datalab

Apache DataLab (incubating)

Host: GitHub
URL: https://github.com/apache/incubator-datalab
Owner: apache
License: apache-2.0
Archived: true
Created: 2019-01-11T18:39:39.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2023-10-03T08:33:27.000Z (over 2 years ago)
Last Synced: 2025-10-06T00:36:42.325Z (8 months ago)
Topics: datalab
Language: Python
Homepage: https://datalab.apache.org/
Size: 243 MB
Stars: 152
Watchers: 15
Forks: 57
Open Issues: 51
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

DataLab Overview
=============

-------
CONTENTS
-------

[What is DataLab?](#What_is_DataLab)

[How to Contribute](CONTRIBUTING.md)

[Logical architecture](#Logical_architecture)

[Physical architecture](#Physical_architecture)

[DataLab Deployment](#DataLab_Deployment)

[Structure of main DataLab directory](#DataLab_directory)

[Structure of log directory](#log_directory)

[Preparing environment for DataLab deployment](#Env_for_DataLab)

[Keycloak server](#Keycloak_server)

[Self-Service Node](#Self_Service_Node)

[Endpoint Node](#Endpoint_node)

[Edge Node](#Edge_Node)

[Notebook node](#Notebook_node)

[Dataengine-service cluster](#Dataengine-service_cluster)

[Dataengine cluster](#Dataengine_cluster)

[Configuration files](#Configuration_files)

[Starting/Stopping services](#Starting_Stopping_services)

[Billing report](#Billing_Report)

[Backup and Restore](#Backup_and_Restore)

[GitLab server](#GitLab_server)

[Troubleshooting](#Troubleshooting)

[Development](#Development)

[Folder structure](#Folder_structure)

[Pre-requisites](#Pre-requisites)

[Java back-end services](#Java_back_end_services)

[Front-end](#Front_end)

[How to setup local development environment](#setup_local_environment)

[How to run locally](#run_locally)

[Infrastructure provisioning](#Infrastructure_provisioning)

[LDAP Authentication](#LDAP_Authentication)

[Azure OAuth2 Authentication](#Azure_OAuth2_Authentication)

---------------
# What is DataLab?

DataLab is an essential toolset for analytics. It is a self-service Web Console used to create and manage exploratory
environments. It allows teams to spin up analytical environments with best-of-breed open-source tools with a
single click of the mouse. Once established, an environment can be managed by the analytical team itself through a simple,
easy-to-use web interface.

See more at datalab.incubator.apache.org.

----------------------------
# Logical architecture

The following diagram demonstrates the high-level logical architecture.

![Logical architecture](doc/logical_architecture.png)

The diagram shows the main components of DataLab, a self-service tool for deploying infrastructure and interacting
with it. The purpose of each component is described below.

## Self-Service

Self-Service is a service that provides a RESTful user API and a Web User Interface for data scientists. It interacts
closely with the Provisioning Service and the Database. Self-Service delegates all user requests to the Provisioning Service.
After executing a request from Self-Service, the Provisioning Service returns a response describing the action
performed on the particular resource. Self-Service then saves this response to the Database. Each time Self-Service
receives a request about the status of provisioned infrastructure resources, it loads the status from the Database and propagates it to the Web UI.

## Billing

Billing is a module that loads the billing report for the environment into the database. It can
run as part of Self-Service or as a separate process.

## Provisioning Service

The Provisioning Service is a RESTful service that provides APIs for provisioning the user’s infrastructure.
The Provisioning Service receives a request from Self-Service, then forms and sends a command to Docker
to execute the requested action. Docker executes the command and generates a response.json file. The Provisioning Service
analyzes response.json and responds to the initial request from Self-Service, providing status-related information about the instance.

## Security service

Security Service is a RESTful service that provides an authorization API for Self-Service and the Provisioning Service via LDAP.

## Docker

Docker is an infrastructure-provisioning module based on Docker service, which provides low-level actions for infrastructure management.

## Database

The Database stores the description of user infrastructure, user settings and service information.
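The interaction between Self-Service, the Provisioning Service and the Database described above can be pictured with a small in-memory model. This is only an illustrative sketch: every class name, method name and response field below is hypothetical and does not mirror DataLab's actual Java services or the real layout of response.json.

```python
"""Toy model of the Self-Service / Provisioning Service / Database flow.

All names and response fields are hypothetical; they only mirror the roles
described in this section, not DataLab's real implementation.
"""
from dataclasses import dataclass, field


@dataclass
class Database:
    """Stands in for the database that stores resource descriptions."""
    statuses: dict = field(default_factory=dict)

    def save(self, resource: str, response: dict) -> None:
        self.statuses[resource] = response

    def load(self, resource: str) -> dict:
        return self.statuses.get(resource, {"status": "unknown"})


class ProvisioningService:
    """Pretends to run a Docker command and read back its response file."""

    def execute(self, action: str, resource: str) -> dict:
        # In DataLab the real work happens inside a Docker container;
        # here a response is simply made up for illustration.
        return {"resource": resource, "action": action, "status": "ok"}


class SelfService:
    """Delegates user requests and stores the outcome in the database."""

    def __init__(self, provisioning: ProvisioningService, db: Database):
        self.provisioning = provisioning
        self.db = db

    def request(self, action: str, resource: str) -> None:
        response = self.provisioning.execute(action, resource)
        self.db.save(resource, response)

    def status(self, resource: str) -> dict:
        # Status queries are answered from the database, not by re-provisioning.
        return self.db.load(resource)


db = Database()
service = SelfService(ProvisioningService(), db)
service.request("create", "notebook-1")
print(service.status("notebook-1"))
```

The point of the sketch is the direction of the calls: user requests go through Self-Service to the Provisioning Service, while status reads are served from the stored responses.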

-----------------------------
# Physical architecture

The following diagrams demonstrate high-level physical architecture of DataLab in AWS, GCP and Azure.

### DataLab high level Architecture on AWS:

![Physical architecture](doc/datalab_aws.png)

### DataLab high level Architecture on GCP:

![Physical architecture](doc/datalab_gcp.png)

### DataLab high level Architecture on Azure:

![Physical architecture](doc/datalab_azure.png)

## Main components

- Self-service node (SSN)
- Endpoint node
- Edge node
- Notebook node (Jupyter, RStudio, etc.)
- Data engine cluster
- Data engine cluster as a service provided with Cloud

## Self-service node (SSN)

Creating the self-service node is the first step in deploying DataLab. The SSN is the main server, with the following pre-installed services:

- DataLab Web UI – the web user interface for managing/deploying all components of DataLab. It is accessible at the
following URL: http[s]://SSN\_Public\_IP\_or\_Public\_DNS
- MongoDB – a database that contains part of DataLab’s configuration, descriptions of users' exploratory environments
and user preferences.
- Docker – used for building DataLab Docker containers, which are used for provisioning other components.

An Elastic (static) IP address is assigned to the SSN node, so you are free to stop and start it and the SSN node's IP address
won’t change.

## Endpoint

This node serves as a provisioning endpoint for DataLab resources. The Endpoint machine is deployed separately from the DataLab
installation and can even be deployed on a different cloud.

## Edge node

This node is used as a reverse-proxy server for the user. Through the Edge node, users can access Notebooks via HTTPS.
The Edge node has an Nginx reverse proxy pre-installed.

## Notebook node

The next step is setting up a Notebook node (or a Notebook server). It is a server with pre-installed applications and
libraries for data processing, data cleaning and transformations, numerical simulations, statistical modeling, machine
learning, etc. The following analytical tools are currently supported in DataLab and can be installed on a Notebook node:

- Jupyter
- JupyterLab
- RStudio
- Apache Zeppelin
- TensorFlow + Jupyter
- TensorFlow + RStudio
- Deep Learning + Jupyter

Apache Spark is also installed for each of the analytical tools above.

**Note:** The terms 'Apache Zeppelin' and 'Apache Spark' may hereinafter be referred to as 'Zeppelin' and 'Spark'
respectively, or by their full names.

## Data engine cluster

After deploying a Notebook node, the user can create one of the following clusters for it:

- Data engine - Spark standalone cluster
- Data engine service - cloud-managed cluster platform (EMR for AWS or Dataproc for GCP)

These simplify running big data frameworks, such as Apache Hadoop and Apache Spark, to process and analyze vast amounts
of data. Adding a cluster is not mandatory and is only needed when additional computational resources are required for
job execution.

----------------------
# DataLab Deployment

### Structure of main DataLab directory

DataLab’s SSN node main directory structure is as follows:

    /opt
     └───datalab
         ├───conf
         ├───sources
         ├───template
         ├───tmp
         │   └───result
         └───webapp

- conf – contains configuration for DataLab Web UI and back-end services;
- sources – contains all Docker/Python scripts, templates and files for provisioning;
- template – Docker templates;
- tmp – temporary directory of DataLab;
- tmp/result – temporary directory for Docker’s response files;
- webapp – contains all .jar files for DataLab Web UI and back-end services.

### Structure of log directory

The structure of the log directory on the SSN node is as follows:

    /var
     └───opt
         └───datalab
             └───log
                 ├───dataengine
                 ├───dataengine-service
                 ├───edge
                 ├───notebook
                 ├───project
                 └───ssn

These directories contain the log files for each template and for DataLab back-end services.

- ssn – contains logs of back-end services;
    - provisioning.log – Provisioning Service log file;
    - security.log – Security Service log file;
    - selfservice.log – Self-Service log file;
- edge, notebook, dataengine, dataengine-service – contain logs of Python scripts.
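When looking for a particular log during troubleshooting, the layout above maps to paths in a predictable way. The helper below is a minimal sketch of that mapping; the function names are made up for illustration, and it assumes the three service log files sit directly inside the ssn directory.

```python
from pathlib import PurePosixPath

# Root of the log tree on the SSN node, as shown in the directory structure.
LOG_ROOT = PurePosixPath("/var/opt/datalab/log")

# Back-end service log files (assumed to live directly under ssn/).
SERVICE_LOGS = {
    "provisioning": "provisioning.log",
    "security": "security.log",
    "selfservice": "selfservice.log",
}

# Directories that hold logs produced by the Python provisioning scripts.
TEMPLATE_DIRS = ("edge", "notebook", "dataengine", "dataengine-service")


def service_log_path(service: str) -> PurePosixPath:
    """Return the log file path for one of the back-end services."""
    return LOG_ROOT / "ssn" / SERVICE_LOGS[service]


def template_log_dir(template: str) -> PurePosixPath:
    """Return the log directory for one of the provisioning templates."""
    if template not in TEMPLATE_DIRS:
        raise ValueError(f"unknown template: {template}")
    return LOG_ROOT / template


print(service_log_path("security"))
print(template_log_dir("edge"))
```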

## Keycloak server

**Keycloak** is used to manage user authentication instead of the application. To use an existing server, the following
parameters must be specified either when running the *DataLab* deployment script or in the
*/opt/datalab/conf/self-service.yml* and */opt/datalab/conf/provisioning.yml* files on the SSN node.

| Parameter | Description/Value |
|--------------------------|---------------------------------|
| keycloak_realm_name |Keycloak Realm name |
| keycloak_auth_server_url |Keycloak auth server URL |
| keycloak_client_secret |Keycloak client secret (optional)|
| keycloak_user |Keycloak user |
| keycloak_user_password |Keycloak user password |

### Preparing environment for Keycloak deployment
Keycloak can be deployed with Nginx proxy on instance using *deploy_keycloak.py* script. Currently it only works with HTTP.

Preparation steps for deployment:

- Create a VM instance with the following settings:
    - The instance should have access to the Internet in order to install required prerequisites
    - Boot disk OS image - Ubuntu 18.04
- Put the private key that is used to connect to the instance where Keycloak will be deployed somewhere on the instance
where the deployment script will be executed.
- Install Git and clone the DataLab repository

### Executing deployment script
To build the Keycloak node, execute the following steps:
- Connect to the instance via SSH and run the following commands:
```
sudo su
apt-get update
apt-get install -y python3-pip
pip3 install fabric
```
- Go to *datalab* directory
- Run *infrastructure-provisioning/scripts/deploy_keycloak/deploy_keycloak.py* deployment script:

```
/usr/bin/python3 infrastructure-provisioning/scripts/deploy_keycloak/deploy_keycloak.py --os_user ubuntu --keyfile ~/.ssh/key.pem --keycloak_realm_name test_realm_name --keycloak_user admin --keycloak_user_password admin_password --public_ip_address XXX.XXX.XXX.XXX
```

List of parameters for Keycloak node deployment:

| Parameter | Description/Value |
|---------------------------|-----------------------------------------------------------------------------------------|
| os_user | username, used to connect to the instance |
| keyfile | /path_to_key/private_key.pem, used to connect to instance |
| keycloak_realm_name | Keycloak realm name that will be created |
| keycloak_user | initial keycloak admin username |
| keycloak_user_password | password for initial keycloak admin user |
| public_ip_address | Public IP address of the instance (if not specified, Keycloak will be deployed on localhost). *(On AWS, try specifying the Public DNS (IPv4) instead of the IPv4 address if unable to connect.)* |
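If the deployment is driven from another script, the command line can be assembled from a dictionary of the parameters in the table above. The following is a minimal sketch, not part of DataLab: the helper name is hypothetical, the values are the placeholders used in this section, and the only assumption about the script is that each parameter is passed as `--name value`, as in the example command.

```python
import shlex

SCRIPT = "infrastructure-provisioning/scripts/deploy_keycloak/deploy_keycloak.py"


def keycloak_deploy_command(params: dict) -> list:
    """Turn a parameter dict into an argv list for the deployment script."""
    argv = ["/usr/bin/python3", SCRIPT]
    for name, value in params.items():
        if value is None:
            continue  # optional parameter left out, e.g. public_ip_address
        argv += [f"--{name}", str(value)]
    return argv


params = {
    "os_user": "ubuntu",
    "keyfile": "/path_to_key/private_key.pem",
    "keycloak_realm_name": "test_realm_name",
    "keycloak_user": "admin",
    "keycloak_user_password": "admin_password",
    "public_ip_address": None,  # omitted: Keycloak is deployed on localhost
}
command = keycloak_deploy_command(params)
print(shlex.join(command))
```

Keeping the arguments as a list (rather than one string) makes it straightforward to hand them to `subprocess.run` without extra quoting.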

## Self-Service Node

### Preparing environment for DataLab deployment

Deployment of DataLab starts with creating the Self-Service (SSN) node. DataLab can be deployed in AWS, Azure and Google Cloud.

For each cloud provider, prerequisites are different.

#### In Amazon cloud

Prerequisites:

DataLab can be deployed using the following two methods:
- IAM user: DataLab deployment script is executed on local machine and uses IAM user permissions to create resources in AWS.
- EC2 instance: DataLab deployment script is executed on EC2 instance prepared in advance and with attached IAM role.
Deployment script uses the attached IAM role to create resources in AWS.

**'IAM user' method prerequisites:**

- IAM user with an AWS access key ID and secret access key created. These keys are provided as arguments to the
deployment script and are used to create resources in AWS.
- Amazon EC2 Key Pair. This is the system key used for configuring DataLab instances.
- The following IAM [policy](#AWS_SSN_policy) should be attached to the IAM user in order to deploy DataLab.

**'EC2 instance' method prerequisites:**

- Amazon EC2 Key Pair. This is the system key used for configuring DataLab instances.
- EC2 instance where the DataLab deployment script is executed.
- IAM role with the following IAM [policy](#AWS_SSN_policy) attached to the EC2 instance.

**Optional prerequisites for both methods:**

- VPC ID. If VPC where DataLab should be deployed is already in place, then "VPC ID" should be provided for deployment
script. DataLab instances are deployed in this VPC.
- Subnet ID. If Subnet where DataLab should be deployed is already in place, then "Subnet ID" should be provided for
deployment script. DataLab SSN node and users' Edge nodes are deployed in this Subnet.

DataLab IAM Policy

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "iam:CreatePolicy",
                "iam:AttachRolePolicy",
                "iam:DetachRolePolicy",
                "iam:DeletePolicy",
                "iam:DeleteRolePolicy",
                "iam:GetRolePolicy",
                "iam:GetPolicy",
                "iam:GetUser",
                "iam:ListUsers",
                "iam:ListAccessKeys",
                "iam:ListUserPolicies",
                "iam:ListAttachedRolePolicies",
                "iam:ListPolicies",
                "iam:ListRolePolicies",
                "iam:ListRoles",
                "iam:CreateRole",
                "iam:CreateInstanceProfile",
                "iam:PutRolePolicy",
                "iam:AddRoleToInstanceProfile",
                "iam:PassRole",
                "iam:GetInstanceProfile",
                "iam:ListInstanceProfilesForRole",
                "iam:RemoveRoleFromInstanceProfile",
                "iam:DeleteInstanceProfile",
                "iam:ListInstanceProfiles",
                "iam:DeleteRole",
                "iam:GetRole"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": [
                "ec2:AuthorizeSecurityGroupEgress",
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:DeleteRouteTable",
                "ec2:DeleteSubnet",
                "ec2:DeleteTags",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcs",
                "ec2:DescribeInstanceStatus",
                "ec2:ModifyInstanceAttribute",
                "ec2:RevokeSecurityGroupIngress",
                "ec2:DescribeImages",
                "ec2:CreateTags",
                "ec2:DescribeRouteTables",
                "ec2:CreateRouteTable",
                "ec2:AssociateRouteTable",
                "ec2:DescribeVpcEndpoints",
                "ec2:CreateVpcEndpoint",
                "ec2:ModifyVpcEndpoint",
                "ec2:DescribeInstances",
                "ec2:RunInstances",
                "ec2:DescribeAddresses",
                "ec2:AllocateAddress",
                "ec2:AssociateAddress",
                "ec2:DisassociateAddress",
                "ec2:ReleaseAddress",
                "ec2:TerminateInstances",
                "ec2:DescribeSecurityGroups",
                "ec2:CreateSecurityGroup",
                "ec2:DeleteSecurityGroup",
                "ec2:RevokeSecurityGroupEgress"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": [
                "s3:GetBucketLocation",
                "s3:PutBucketPolicy",
                "s3:GetBucketPolicy",
                "s3:DeleteBucket",
                "s3:DeleteObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutEncryptionConfiguration",
                "s3:ListAllMyBuckets",
                "s3:CreateBucket",
                "s3:PutBucketTagging",
                "s3:GetBucketTagging"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}
```
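When adapting the policy, it can be useful to see which AWS services it touches and whether any action is listed more than once. The sketch below does this with the standard `json` module on a shortened sample that has the same shape as the policy above; the sample content and the function names are illustrative only.

```python
import json
from collections import Counter

# Shortened sample with the same structure as the DataLab IAM policy.
POLICY = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Action": ["iam:CreateRole", "iam:PassRole"],
     "Effect": "Allow", "Resource": "*"},
    {"Action": ["ec2:RunInstances", "ec2:CreateTags", "ec2:RunInstances"],
     "Effect": "Allow", "Resource": "*"},
    {"Action": ["s3:CreateBucket"],
     "Effect": "Allow", "Resource": "*"}
  ]
}
""")


def actions_by_service(policy: dict) -> dict:
    """Group the unique actions by their service prefix (iam, ec2, s3, ...)."""
    grouped = {}
    for statement in policy["Statement"]:
        for action in statement["Action"]:
            service, _, name = action.partition(":")
            grouped.setdefault(service, set()).add(name)
    return grouped


def duplicate_actions(policy: dict) -> list:
    """Return actions that are listed more than once in the policy."""
    counts = Counter(a for s in policy["Statement"] for a in s["Action"])
    return sorted(action for action, n in counts.items() if n > 1)


print(sorted(actions_by_service(POLICY)))
print(duplicate_actions(POLICY))
```

Duplicate entries in an `Action` list are harmless, so the second check is purely a tidiness aid.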

Preparation steps for deployment:

- Create an EC2 instance with the following settings:
    - The instance should have access to the Internet in order to install required prerequisites
    - The instance should have access to further DataLab installation
    - AMI - Ubuntu 20.04
    - IAM role with [policy](#AWS_SSN_policy) should be assigned to the instance
- Put the SSH key file created through the Amazon Console on the instance with the same name
- Install Git and clone the DataLab repository

#### In Azure cloud

Prerequisites:

- IAM user with Contributor permissions.
- Service principal and JSON based auth file with clientId, clientSecret and tenantId.

**Note:** The following permissions should be assigned to the service principal:

- Windows Azure Active Directory
- Microsoft Graph
- Windows Azure Service Management API
- Storage Blob Data Contributor Role

**Preparation steps for deployment:**

- Create a VM instance with the following settings:
    - The instance should have access to the Internet in order to install required prerequisites
    - Image - Ubuntu 20.04
- Generate an SSH key pair and rename the private key with a .pem extension
- Put the JSON auth file in the user's home directory

#### In Google cloud (GCP)

Prerequisites:

- IAM user
- Service account and JSON auth file for it. In order to get JSON auth file, Key should be created for service account
through Google cloud console.
- Google Cloud Storage JSON API should be enabled

Preparation steps for deployment:

- Create a VM instance with the following settings:
    - The instance should have access to the Internet in order to install required prerequisites
    - Boot disk OS image - Ubuntu 20.04
- Generate an SSH key pair and rename the private key with a .pem extension
- Put the JSON auth file created through the Google Cloud console in the user's home directory
- Install Git and clone the DataLab repository

### Executing deployment script

To build the SSN node, execute the following steps:

- Connect to the instance via SSH and run the following commands:

```
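# Switch to root; the commands below install system packages
sudo su
apt-get update
apt-get install -y git
git clone https://github.com/apache/incubator-datalab.git -b develop
# Add the Docker apt repository and its signing key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
apt-get update
# Check which docker.io versions are available, then install the pinned one
apt-cache policy docker.io
apt-get install -y docker.io=20.10.7-0ubuntu1~20.04.1
# Replace *username* with the user that will run Docker
usermod -a -G docker *username*
apt-get install -y python3-pip
pip3 install fabric
cd incubator-datalab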
```
- Go to *datalab* directory
- Run *infrastructure-provisioning/scripts/deploy_datalab.py* deployment script:

This Python script builds the front-end and back-end parts of DataLab, creates the SSN Docker image and runs a Docker container
that creates the SSN node.

#### In Amazon cloud

**Note:** cloud provider argument should be specified before arguments related to the cloud.

```
/usr/bin/python3 infrastructure-provisioning/scripts/deploy_datalab.py \
--conf_service_base_name datalab-test \
--conf_os_family debian \
--key_path /path/to/key/ \
--conf_key_name key_name \
--conf_tag_resource_id datalab \
--action create \
aws \
--aws_access_key XXXXXXX \
--aws_secret_access_key XXXXXXXXXX \
--aws_region xx-xxxxx-x \
--aws_vpc_id vpc-xxxxx \
--aws_subnet_id subnet-xxxxx \
--aws_security_groups_ids sg-xxxxx,sg-xxxx \
--aws_account_id xxxxxxxx \
--aws_billing_bucket billing_bucket \
--aws_report_path /billing/directory/
```
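The note about argument order is typical of command-line tools built on `argparse` sub-commands: options given after the sub-command name are handled by that sub-command's own parser. The sketch below illustrates the behaviour with a heavily reduced, hypothetical parser. It assumes, without having verified it against the source, that deploy_datalab.py is structured in a similar way; only a handful of the flags from the example are mirrored.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical, reduced stand-in for the deployment script's CLI.
    parser = argparse.ArgumentParser(prog="deploy_datalab.py")
    parser.add_argument("--conf_service_base_name", required=True)
    parser.add_argument("--action", required=True,
                        choices=["create", "terminate"])
    # The cloud provider is modelled as a sub-command with its own options.
    providers = parser.add_subparsers(dest="conf_cloud_provider", required=True)
    aws = providers.add_parser("aws")
    aws.add_argument("--aws_region")
    aws.add_argument("--aws_vpc_id")
    return parser


parser = build_parser()
args = parser.parse_args([
    "--conf_service_base_name", "datalab-test",
    "--action", "create",
    "aws",                         # provider first ...
    "--aws_region", "xx-xxxxx-x",  # ... then provider-specific options
])
print(args.conf_cloud_provider, args.aws_region)
```

In this sketch, moving `--aws_region` in front of `aws` makes `argparse` exit with a usage error, because the top-level parser does not know that option.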

List of parameters for SSN node deployment:

| Parameter                 | Description/Value                                                                       |
|---------------------------|-----------------------------------------------------------------------------------------|
| conf\_service\_base\_name | Any infrastructure value (should be unique if multiple SSN’s have been deployed before) |
| conf\_os\_family          | Name of the Linux distributive family, which is supported by DataLab (Debian/RedHat)    |
| conf\_duo\_vpc\_enable    | "true" - for installing DataLab into two Virtual Private Clouds (VPCs) or "false" - for installing DataLab into one VPC. This parameter isn't required when deploying DataLab in one VPC |
| key\_path                 | Path to admin key (without key name)                                                    |
| conf\_key\_name           | Name of the uploaded SSH key file (without “.pem” extension)                            |
| conf\_tag\_resource\_id   | The name of tag for billing reports                                                     |
| action                    | In case of SSN node creation, this parameter should be set to “create”                  |
| workspace\_path           | Path to DataLab sources root                                                            |
| conf\_image\_enabled      | Enable or Disable creating image at first time                                          |
| conf\_cloud\_provider     | Name of the cloud provider, which is supported by DataLab (AWS)                         |
| aws\_access\_key          | AWS user access key                                                                     |
| aws\_secret\_access\_key  | AWS user secret access key                                                              |
| aws\_region               | AWS region                                                                              |
| aws\_vpc\_id              | ID of the VPC (optional)                                                                |
| aws\_subnet\_id           | ID of the public subnet (optional)                                                      |
| aws\_security\_groups\_ids| One or more IDs of AWS Security Groups, which will be assigned to SSN node (optional)   |
| aws\_account\_id          | The ID of Amazon account                                                                |
| aws\_billing\_bucket      | The name of S3 bucket where billing reports will be placed                              |
| aws\_report\_path         | The path to billing reports directory in S3 bucket. This parameter isn't required when billing reports are placed in the root of S3 bucket. |

**Note:** If the following parameters are not specified, they will be created automatically:
- aws\_vpc\_id
- aws\_subnet\_id
- aws\_security\_groups\_ids

**Note:** If billing won't be used, the following parameters are not required:
- aws\_account\_id
- aws\_billing\_bucket
- aws\_report\_path

**SSN deployment creates following AWS resources:**

- SSN EC2 instance
- Elastic IP for SSN instance
- IAM role and EC2 Instance Profile for SSN
- Security Group for SSN node (if it was specified, script will attach the provided one)
- VPC, Subnet (if they have not been specified) for SSN and EDGE nodes
- S3 bucket – its name will be \-ssn-bucket. This bucket will contain necessary dependencies and configuration files for Notebook nodes (such as .jar files, YARN configuration, etc.)
- S3 bucket for collaboration between DataLab users. Its name will be \-\-shared-bucket

#### In Azure cloud

**Note:** cloud provider argument should be specified before arguments related to the cloud.

```
/usr/bin/python3 infrastructure-provisioning/scripts/deploy_datalab.py \
--conf_service_base_name datalab_test \
--conf_os_family debian \
--key_path /root/ \
--conf_key_name Test \
--azure_auth_path /dir/file.json \
--action create \
azure \
--azure_vpc_name vpc-test \
--azure_subnet_name subnet-test \
--azure_security_group_name sg-test1,sg-test2 \
--azure_region westus2
```

List of parameters for SSN node deployment:

| Parameter | Description/Value |
|-----------------------------------|-----------------------------------------------------------------------------------------|
| conf\_service\_base\_name | Any infrastructure value (should be unique if multiple SSN’s have been deployed before) |
| conf\_os\_family | Name of the Linux distributive family, which is supported by DataLab (Debian/RedHat) |
| key\_path | Path to admin key (without key name) |
| conf\_key\_name | Name of the uploaded SSH key file (without “.pem” extension) |
| conf\_image\_enabled | Enable or Disable creating image at first time |
| action | In case of SSN node creation, this parameter should be set to “create” |
| conf\_cloud\_provider | Name of the cloud provider, which is supported by DataLab (Azure) |
| azure\_vpc\_name | Name of the Virtual Network (VN) (optional) |
| azure\_subnet\_name | Name of the Azure subnet (optional) |
| azure\_security\_groups\_name | One or more names of Azure Security Groups, which will be assigned to SSN node (optional) |
| azure\_ssn\_instance\_size | Instance size of SSN instance in Azure |
| azure\_resource\_group\_name | Resource group name (can be the same as service base name) |
| azure\_region | Azure region |
| azure\_auth\_path | Full path to auth json file |
| azure\_offer\_number | Azure offer id number |
| azure\_currency | Currency that is used for billing information(e.g. USD) |
| azure\_locale | Locale that is used for billing information(e.g. en-US) |
| azure\_region\_info | Region info that is used for billing information(e.g. US) |
| azure\_datalake\_enable | Support of Azure Data Lake (true/false) |
| azure\_oauth2\_enabled | Defines if Azure OAuth2 authentication mechanisms is enabled(true/false) |
| azure\_validate\_permission\_scope| Defines if DataLab verifies user's permission to the configured resource (scope) during login with OAuth2 (true/false). If Data Lake is enabled, the default scope is the Data Lake Store Account; otherwise the default scope is the Resource Group where DataLab is deployed. A user who does not have any role in the scope is forbidden to log in |
| azure\_application\_id | Azure application ID that is used to log in users in DataLab |
| azure\_ad\_group\_id | ID of group in Active directory whose members have full access to shared folder in Azure Data Lake Store |

**Note:** If the following parameters are not specified, they will be created automatically:

- azure\_vpc\_name
- azure\_subnet\_name
- azure\_security\_groups\_name

**Note:** Billing configuration:

To find azure\_offer\_number, open [Azure Portal](https://portal.azure.com), go to Subscriptions and open yours, then
click Overview; the value is shown under the Offer ID property:

![Azure offer number](doc/azure_offer_number.png)

Please see [RateCard API](https://msdn.microsoft.com/en-us/library/mt219004.aspx) to get more details about
azure\_offer\_number, azure\_currency, azure\_locale, azure\_region_info. These DataLab deploy properties correspond to
RateCard API request parameters.

To have working billing functionality please review Billing configuration note and use proper parameters for SSN node
deployment.

To use Data Lake Store please review Azure Data Lake usage pre-requisites note and use proper parameters for SSN node
deployment.

**Note:** Azure Data Lake usage pre-requisites:

1. Configure an application in the Azure portal and grant proper permissions to it.
    - Open the *Azure Active Directory* tab, then *App registrations*, and click *New application registration*
    - Fill in the UI form with the following parameters: *Name* - the name of the new application, *Application type* - select
    Native, *Sign-on URL* - any valid URL, as it will be updated later
    - Grant proper permissions to the application. Select the application you just created on the *App registration* view, then
    click *Required permissions*, then *Add->Select an API-> In search field type MicrosoftAzureQueryService* and press
    *Select*, then check the box *Have full access to the Azure Data Lake service* and save the changes. Repeat the same
    actions for the *Windows Azure Active Directory* API (available on *Required permissions->Add->Select an API*) and the
    box *Sign in and read user profile*
    - Get the *Application ID* from the application properties; it will be used as azure_application_id for the deploy_datalab.py script
2. Usage of the Data Lake resource presumes a shared folder where all users can write or read any data. To manage access to
this folder, please create or use an existing group in Active Directory. All users from this group will have RW access to
the shared folder. Pass the ID (in Active Directory) of the group as the *azure_ad_group_id* parameter to the deploy_datalab.py script
3. After execution of the deploy_datalab.py script, go to the application created in step 1 and change the *Redirect URIs* value to
https://SSN_HOSTNAME/, where SSN_HOSTNAME is the SSN node hostname

After SSN node deployment, the following Azure resources will be created:

- Resource group where all DataLab resources will be provisioned
- SSN Virtual machine
- Static public IP address for SSN virtual machine
- Network interface for SSN node
- Security Group for SSN node (if it was specified, script will attach the provided one)
- Virtual network and Subnet (if they have not been specified) for SSN and EDGE nodes
- Storage account and blob container for necessary further dependencies and configuration files for Notebook nodes (such as .jar files, YARN configuration, etc.)
- Storage account and blob container for collaboration between DataLab users
- If support of Data Lake is enabled: Data Lake and shared directory will be created

#### In Google cloud (GCP)

**Note:** cloud provider argument should be specified before arguments related to the cloud.

```
/usr/bin/python3 infrastructure-provisioning/scripts/deploy_datalab.py \
--conf_service_base_name datalab-test \
--conf_os_family debian \
--key_path /path/to/key/ \
--conf_key_name key_name \
--action create \
gcp \
--gcp_ssn_instance_size n1-standard-1 \
--gcp_project_id project_id \
--gcp_service_account_path /path/to/auth/file.json \
--gcp_region xx-xxxxx \
--gcp_zone xxx-xxxxx-x
```

List of parameters for SSN node deployment:

| Parameter | Description/Value |
|------------------------------|---------------------------------------------------------------------------------------|
| conf\_service\_base\_name | Any infrastructure value (should be unique if multiple SSN’s have been deployed before)|
| conf\_os\_family | Name of the Linux distributive family, which is supported by DataLab (Debian/RedHat) |
| key\_path | Path to admin key (without key name) |
| conf\_key\_name | Name of the uploaded SSH key file (without “.pem” extension) |
| action | In case of SSN node creation, this parameter should be set to “create” |
| conf\_image\_enabled | Enable or Disable creating image at first time |
| conf\_cloud\_provider | Name of the cloud provider, which is supported by DataLab (GCP) |
| gcp\_service\_account\_path | Full path to auth json file |
| gcp\_ssn\_instance\_size | Instance size of SSN instance in GCP |
| gcp\_project\_id | ID of GCP project |
| gcp\_region | GCP region |
| gcp\_zone | GCP zone |
| gcp\_vpc\_name | Name of the Virtual Network (VN) (optional) |
| gcp\_subnet\_name | Name of the GCP subnet (optional) |
| gcp\_firewall\_name | One or more names of GCP Security Groups, which will be assigned to SSN node (optional)|
| billing\_dataset\_name | Name of GCP dataset (BigQuery service) |

**Note:** If you are going to use a Dataproc cluster, be aware that Dataproc has limited availability in GCP regions.
[Cloud Dataproc availability by Region in GCP](https://cloud.google.com/about/locations/)

After SSN node deployment, the following GCP resources will be created:

- SSN VM instance
- External IP address for SSN instance
- IAM role and Service account for SSN
- Security Groups for SSN node (if it was specified, script will attach the provided one)
- VPC, Subnet (if they have not been specified) for SSN and EDGE nodes
- Bucket for collaboration between DataLab users. Its name will be
\-\-shared-bucket

**Note:** Optionally, a Nexus repository can be used to store DataLab jars and Docker images.

List of parameters for repository usage during SSN deployment:

| Parameter | Description/Value |
|------------------------------|---------------------------------------------------------------------------------------|
| conf\_repository\_user | Username used to access repository |
| conf\_repository\_pass | Password used to access repository |
| conf\_repository\_address | URI/IP address used to access repository |
| conf\_repository\_port | Port of docker repository |
| conf\_download\_docker\_images | true to download docker images from repository (previous parameters are required) |
| conf\_download\_jars | true to download jars from repository (previous parameters are required) |
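The two download flags depend on the repository connection settings listed before them. A wrapper script could check this before starting a deployment; the sketch below shows one way to do it. It reads "previous parameters are required" as meaning all four connection settings, which is an assumption, and the function name is hypothetical.

```python
# Connection settings from the table (assumed to be required together).
REPOSITORY_SETTINGS = (
    "conf_repository_user",
    "conf_repository_pass",
    "conf_repository_address",
    "conf_repository_port",
)
# Flags that switch on downloading from the repository.
DOWNLOAD_FLAGS = ("conf_download_docker_images", "conf_download_jars")


def missing_repository_settings(params: dict) -> list:
    """List repository settings that are required but absent or empty."""
    wants_download = any(
        str(params.get(flag, "false")).lower() == "true"
        for flag in DOWNLOAD_FLAGS
    )
    if not wants_download:
        return []  # nothing is downloaded, so no settings are needed
    return [name for name in REPOSITORY_SETTINGS if not params.get(name)]


example = {"conf_download_jars": "true", "conf_repository_user": "user"}
print(missing_repository_settings(example))
```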

### Terminating Self-Service Node

Terminating the SSN node also removes all nodes and components related to it. In other words, terminating the Self-Service node
terminates the entire DataLab infrastructure.

Example of a command for terminating a DataLab environment:

#### In Amazon

```
/usr/bin/python3 infrastructure-provisioning/scripts/deploy_datalab.py --conf_service_base_name datalab-test --aws_access_key XXXXXXX --aws_secret_access_key XXXXXXXX --aws_region xx-xxxxx-x --key_path /path/to/key/ --conf_key_name key_name --conf_os_family debian --conf_cloud_provider aws --action terminate
```
List of parameters for SSN node termination:

| Parameter | Description/Value |
|----------------------------|------------------------------------------------------------------------------------|
| conf\_service\_base\_name | Unique infrastructure value |
| aws\_access\_key | AWS user access key |
| aws\_secret\_access\_key | AWS user secret access key |
| aws\_region | AWS region |
| key\_path | Path to admin key (without key name) |
| conf\_key\_name | Name of the uploaded SSH key file (without “.pem” extension) |
| conf\_os\_family | Name of the Linux distributive family, which is supported by DataLab (Debian/RedHat) |
| conf\_cloud\_provider | Name of the cloud provider, which is supported by DataLab (AWS) |
| action | terminate |

In Azure (click to expand)

```
/usr/bin/python3 infrastructure-provisioning/scripts/deploy_datalab.py --conf_service_base_name datalab-test --azure_vpc_name vpc-test --azure_resource_group_name resource-group-test --azure_region westus2 --key_path /root/ --conf_key_name Test --conf_os_family debian --conf_cloud_provider azure --azure_auth_path /dir/file.json --action terminate
```
List of parameters for SSN node termination:

| Parameter | Description/Value |
|----------------------------|------------------------------------------------------------------------------------|
| conf\_service\_base\_name | Unique infrastructure value |
| azure\_region | Azure region |
| conf\_os\_family | Name of the Linux distribution family, which is supported by DataLab (Debian/RedHat) |
| conf\_cloud\_provider | Name of the cloud provider, which is supported by DataLab (Azure) |
| azure\_vpc\_name | Name of the Virtual Network (VN) |
| key\_path | Path to admin key (without key name) |
| conf\_key\_name | Name of the uploaded SSH key file (without “.pem” extension) |
| azure\_auth\_path | Full path to auth json file |
| action | terminate |

In Google cloud (click to expand)

```
/usr/bin/python3 infrastructure-provisioning/scripts/deploy_datalab.py --gcp_project_id project_id --conf_service_base_name datalab-test --gcp_region xx-xxxxx --gcp_zone xx-xxxxx-x --key_path /path/to/key/ --conf_key_name key_name --conf_os_family debian --conf_cloud_provider gcp --gcp_service_account_path /path/to/auth/file.json --action terminate
```
List of parameters for SSN node termination:

| Parameter | Description/Value |
|------------------------------|---------------------------------------------------------------------------------------|
| conf\_service\_base\_name | Any infrastructure value (should be unique if multiple SSN’s have been deployed before)|
| gcp\_region | GCP region |
| gcp\_zone | GCP zone |
| conf\_os\_family | Name of the Linux distribution family, which is supported by DataLab (Debian/RedHat) |
| conf\_cloud\_provider | Name of the cloud provider, which is supported by DataLab (GCP) |
| gcp\_vpc\_name | Name of the Virtual Network (VN) (optional) |
| gcp\_subnet\_name | Name of the GCP subnet (optional) |
| key\_path | Path to admin key (without key name) |
| conf\_key\_name | Name of the uploaded SSH key file (without “.pem” extension) |
| gcp\_service\_account\_path | Full path to auth json file |
| gcp\_project\_id | ID of GCP project |
| action | In case of SSN node termination, this parameter should be set to “terminate” |

Note: It is required to enter gcp_vpc_name and gcp_subnet_name parameters if Self-Service Node was deployed in
pre-defined VPC and Subnet.
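
As a minimal sketch, the termination command can be assembled programmatically so that a missing parameter is caught before anything destructive runs. The helper `build_terminate_command` and the `REQUIRED` mapping are hypothetical and not part of DataLab; the parameter names are copied from the tables above, and the sketch only builds the argument list without executing it.

```python
import shlex

SCRIPT = "infrastructure-provisioning/scripts/deploy_datalab.py"

# Required parameter names per provider, taken from the termination tables above.
# conf_cloud_provider and action are appended by the helper itself.
REQUIRED = {
    "aws": ["conf_service_base_name", "aws_access_key", "aws_secret_access_key",
            "aws_region", "key_path", "conf_key_name", "conf_os_family"],
    "azure": ["conf_service_base_name", "azure_region", "azure_vpc_name",
              "key_path", "conf_key_name", "conf_os_family", "azure_auth_path"],
    "gcp": ["conf_service_base_name", "gcp_project_id", "gcp_region", "gcp_zone",
            "key_path", "conf_key_name", "conf_os_family", "gcp_service_account_path"],
}


def build_terminate_command(provider, params):
    """Return the argv list for terminating an SSN (illustrative helper only)."""
    if provider not in REQUIRED:
        raise ValueError(f"unsupported provider: {provider}")
    missing = [name for name in REQUIRED[provider] if name not in params]
    if missing:
        raise ValueError(f"missing parameters for {provider}: {', '.join(missing)}")
    argv = ["/usr/bin/python3", SCRIPT]
    for name, value in params.items():
        argv += [f"--{name}", str(value)]
    argv += ["--conf_cloud_provider", provider, "--action", "terminate"]
    return argv


# Placeholder values mirror the GCP example command above.
gcp_params = {
    "gcp_project_id": "project_id",
    "conf_service_base_name": "datalab-test",
    "gcp_region": "xx-xxxxx",
    "gcp_zone": "xx-xxxxx-x",
    "key_path": "/path/to/key/",
    "conf_key_name": "key_name",
    "conf_os_family": "debian",
    "gcp_service_account_path": "/path/to/auth/file.json",
}
command = build_terminate_command("gcp", gcp_params)
print(shlex.join(command))
```

Optional parameters such as `gcp_vpc_name` and `gcp_subnet_name` can simply be added to the dictionary; the helper passes every key through unchanged.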

## Endpoint node

This node allows you to create Edge nodes and Notebooks on other cloud providers (AWS, Microsoft Azure or GCP).
The exception is when Edge nodes and Notebooks are created on the same cloud provider as the Self-Service Node,
in which case an endpoint is already provided locally.

### Executing deployment script

In Amazon (click to expand)

```
source /venv/bin/activate
/venv/bin/python3 infrastructure-provisioning/terraform/bin/datalab.py deploy aws endpoint \
--access_key_id access_key \
--secret_access_key secret_access_key \
--key_name datalab-key \
--pkey /path/to/private/key.pem \
--path_to_pub_key /path/key.pub \
--service_base_name datalab-test \
--endpoint_id awstest \
--region xx-xxx \
--zone xxxx \
--cloud_provider aws \
--vpc_id xxxxxxx \
--subnet_id xxxxxxx \
--ssn_ui_host 0.0.0.0 \
--billing_enable true/false \
--billing_bucket xxxxxxx \
--report_path xxxxxxxxx \
--billing_aws_account_id xxxxxxxxx \
--billing_tag xxxxxxx \
--mongo_password xxxxxxxxxx
deactivate
```

The following AWS resources will be created:
- Endpoint EC2 instance
- Elastic IP address for Endpoint EC2 instance
- S3 bucket
- Security Group for Endpoint instance
- IAM Roles and Instance Profiles for Endpoint instance
- User private subnet. All further nodes (Notebooks, EMR clusters) will be provisioned in different subnet than SSN.

List of parameters for Endpoint deployment:

| Parameter | Description/Value |
|----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| service\_base\_name | Any infrastructure value (should be unique if multiple SSN’s have been deployed before) |
| pkey | Path to private key |
| path\_to\_pub\_key | Path to public key |
| key\_name | Name of the uploaded SSH key file (without “.pem” extension) |
| endpoint\_id | ID of the endpoint |
| cloud\_provider | Name of the cloud provider, which is supported by DataLab (AWS) |
| access\_key\_id | AWS user access key |
| secret\_access\_key | AWS user secret access key |
| region | AWS region |
| vpc\_id | ID of the VPC (optional) |
| ssn\_ui\_host | IP address of SSN host on the cloud provider |
| subnet\_id | ID of the public subnet (optional) |
| billing\_enable | Enabling or disabling billing |
| billing\_bucket | The name of S3 bucket where billing reports will be placed |
| report\_path | The path to billing reports directory in S3 bucket. This parameter isn't required when billing reports are placed in the root of S3 bucket. |
| mongo\_password | Mongo database password |

In Azure (click to expand)

```
source /venv/bin/activate
/venv/bin/python3 infrastructure-provisioning/terraform/bin/datalab.py deploy azure endpoint \
--auth_file_path /path/to/auth.json \
--key_name datalab-key \
--pkey /path/to/private/key.pem \
--path_to_pub_key /path/key.pub \
--service_base_name datalab-test \
--resource_group_name datalab-test-group \
--endpoint_id azuretest \
--region xx-xxx \
--zone xxxx \
--cloud_provider azure \
--vpc_id xxxxxxx \
--subnet_id xxxxxxx \
--ssn_ui_host 0.0.0.0 \
--billing_enable true/false \
--billing_bucket xxxxxxx \
--report_path xxxxxxxxx \
--billing_aws_account_id xxxxxxxxx \
--billing_tag xxxxxxx \
--mongo_password xxxxxxxxxx \
--offer_number xxxxxxxx \
--currency "USD" \
--locale "en-US" \
--region_info "US"
deactivate
```

The following Azure resources will be created:
- Endpoint virtual machine
- Static public IP address for Endpoint virtual machine
- Network interface for Endpoint node
- Security Group for user's Endpoint instance
- Endpoint's private subnet.

List of parameters for Endpoint deployment:

| Parameter | Description/Value |
|-----------------------|-----------------------------------------------------------------------------------------|
| service\_base\_name | Any infrastructure value (should be unique if multiple SSN’s have been deployed before) |
| pkey | Path to private key |
| path\_to\_pub\_key | Path to public key |
| key\_name | Name of the uploaded SSH key file (without “.pem” extension) |
| endpoint\_id | ID of the endpoint |
| cloud\_provider | Name of the cloud provider, which is supported by DataLab (Azure) |
| vpc\_id | Name of the Virtual Network (VN) (optional) |
| subnet\_id | Name of the Azure subnet (optional) |
| ssn\_ui\_host | IP address of SSN host on the cloud provider |
| resource\_group\_name | Resource group name (can be the same as service base name) |
| region | Azure region |
| auth\_file\_path | Full path to auth json file |
| offer\_number | Azure offer id number |
| currency | Currency that is used for billing information (e.g. USD) |
| locale | Locale that is used for billing information (e.g. en-US) |
| region\_info | Region info that is used for billing information (e.g. US) |
| billing\_enable | Enabling or disabling billing |
| mongo\_password | Mongo database password |

In Google cloud (click to expand)

```
source /venv/bin/activate
/venv/bin/python3 infrastructure-provisioning/terraform/bin/datalab.py deploy gcp endpoint \
--gcp_project_id xxx-xxxx-xxxxxx \
--creds_file /path/to/auth.json \
--key_name datalab-key \
--pkey /path/to/private/key.pem \
--service_base_name datalab-test \
--path_to_pub_key /path/key.pub \
--endpoint_id gcptest \
--region xx-xxx \
--zone xxxx \
--cloud_provider gcp \
--vpc_id xxxxxxx \
--subnet_id xxxxxxx \
--ssn_ui_host 0.0.0.0 \
--mongo_password xxxxxxxxxx \
--billing_dataset_name xxxxxxxx \
--billing_enable true/false
deactivate
```

The following GCP resources will be created:
- Endpoint VM instance
- External static IP address for Endpoint VM instance
- Security Group for Endpoint instance
- Endpoint's private subnet.

List of parameters for Endpoint deployment:

| Parameter | Description/Value |
|------------------------|-----------------------------------------------------------------------------------------|
| service\_base\_name | Any infrastructure value (should be unique if multiple SSN’s have been deployed before) |
| pkey | Path to private key |
| path\_to\_pub\_key | Path to public key |
| key\_name | Name of the uploaded SSH key file (without “.pem” extension) |
| endpoint\_id | ID of the endpoint |
| cloud\_provider | Name of the cloud provider, which is supported by DataLab (GCP) |
| creds\_file | Full path to auth json file |
| gcp\_project\_id | ID of GCP project |
| region | GCP region |
| zone | GCP zone |
| vpc\_id | Name of the Virtual Network (VN) (optional) |
| subnet\_id | Name of the GCP subnet (optional) |
| billing\_dataset\_name | Name of GCP dataset (BigQuery service) |
| ssn\_ui\_host | IP address of SSN host on the cloud provider |
| billing\_enable | Enabling or disabling billing |
| mongo\_password | Mongo database password |

### Terminating Endpoint

In Amazon (click to expand)

```
source /venv/bin/activate
/venv/bin/python3 infrastructure-provisioning/terraform/bin/datalab.py destroy aws endpoint \
--access_key_id access_key \
--secret_access_key secret_access_key \
--key_name datalab-key \
--pkey /path/to/private/key.pem \
--path_to_pub_key /path/key.pub \
--service_base_name datalab-test \
--endpoint_id awstest \
--region xx-xxx \
--zone xxxx \
--cloud_provider aws \
--vpc_id xxxxxxx \
--subnet_id xxxxxxx \
--ssn_ui_host 0.0.0.0 \
--billing_enable true/false \
--billing_bucket xxxxxxx \
--report_path xxxxxxxxx \
--billing_aws_account_id xxxxxxxxx \
--billing_tag xxxxxxx \
--mongo_password xxxxxxxxxx
deactivate
```

List of parameters for Endpoint termination:

In Azure (click to expand)

```
source /venv/bin/activate
/venv/bin/python3 infrastructure-provisioning/terraform/bin/datalab.py destroy azure endpoint \
--auth_file_path /path/to/auth.json \
--key_name datalab-key \
--pkey /path/to/private/key.pem \
--path_to_pub_key /path/key.pub \
--service_base_name datalab-test \
--resource_group_name datalab-test-group \
--endpoint_id azuretest \
--region xx-xxx \
--zone xxxx \
--cloud_provider azure \
--vpc_id xxxxxxx \
--subnet_id xxxxxxx \
--ssn_ui_host 0.0.0.0 \
--billing_enable true/false \
--billing_bucket xxxxxxx \
--report_path xxxxxxxxx \
--billing_aws_account_id xxxxxxxxx \
--billing_tag xxxxxxx \
--mongo_password xxxxxxxxxx \
--offer_number xxxxxxxx \
--currency "USD" \
--locale "en-US" \
--region_info "US"
deactivate
```

List of parameters for Endpoint termination:

In Google cloud (click to expand)

```
source /venv/bin/activate
/venv/bin/python3 infrastructure-provisioning/terraform/bin/datalab.py destroy gcp endpoint \
--gcp_project_id xxx-xxxx-xxxxxx \
--creds_file /path/to/auth.json \
--key_name datalab-key \
--pkey /path/to/private/key.pem \
--service_base_name datalab-test \
--path_to_pub_key /path/key.pub \
--endpoint_id gcptest \
--region xx-xxx \
--zone xxxx \
--cloud_provider gcp \
--vpc_id xxxxxxx \
--subnet_id xxxxxxx \
--ssn_ui_host 0.0.0.0 \
--mongo_password xxxxxxxxxx \
--billing_dataset_name xxxxxxxx \
--billing_enable true/false
deactivate
```
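In the examples above, the `deploy` and `destroy` invocations differ only in the action word, so a single parameter set can serve both. The following is a minimal sketch under that assumption; `endpoint_command` is a hypothetical helper, not part of DataLab, and it only builds the argument list without running it.

```python
ACTIONS = ("deploy", "destroy")
PROVIDERS = ("aws", "azure", "gcp")
DATALAB_CLI = "infrastructure-provisioning/terraform/bin/datalab.py"


def endpoint_command(action, provider, params):
    """Return the argv list for an endpoint deploy or destroy call (illustrative only)."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    if provider not in PROVIDERS:
        raise ValueError(f"unsupported provider: {provider}")
    argv = ["/venv/bin/python3", DATALAB_CLI, action, provider, "endpoint"]
    for name, value in params.items():
        argv += [f"--{name}", str(value)]
    return argv


# Placeholder values mirror the GCP example above; only a few parameters are shown.
params = {
    "service_base_name": "datalab-test",
    "endpoint_id": "gcptest",
    "cloud_provider": "gcp",
}
deploy = endpoint_command("deploy", "gcp", params)
destroy = endpoint_command("destroy", "gcp", params)
print(" ".join(destroy))
```

Keeping the parameters in one place reduces the chance that a destroy call targets a different `service_base_name` or `endpoint_id` than the one that was deployed.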

List of parameters for Endpoint termination:

After creating the endpoint, you must register it in the DataLab UI. To do this, go to the
"Resources" - "Endpoints" tab. In the Name field, specify the name of the endpoint; in the URL field,
specify the network address of the endpoint (for example, https://0.0.0.0:8084/). In the Account field,
specify the "endpoint_id" value that was used when creating the endpoint (for example, awstest, azuretest or gcptest).
In the Endpoint tag field, optionally specify the endpoint tag.
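
As a small illustration, the values for the Endpoints form can be prepared ahead of time. This is only a sketch: `endpoint_url` and `endpoint_form` are hypothetical helpers, the dictionary keys merely mirror the field labels mentioned above, and port 8084 is taken from the example address and may differ in your deployment.

```python
def endpoint_url(host, port=8084):
    """Format the endpoint address like the example above (https://0.0.0.0:8084/)."""
    return f"https://{host}:{port}/"


def endpoint_form(name, host, endpoint_id, tag=""):
    """Collect the values to enter in the form; keys are illustrative labels only."""
    return {
        "Name": name,
        "URL": endpoint_url(host),
        "Account": endpoint_id,   # must match --endpoint_id used at deployment
        "Endpoint tag": tag,      # optional
    }


form = endpoint_form("aws-endpoint", "0.0.0.0", "awstest")
print(form["URL"])  # https://0.0.0.0:8084/
```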

## Edge Node

A Gateway node (or Edge node) is an instance (virtual machine) provisioned in a public subnet. It serves as the entry
point for accessing a user’s personal analytical environment. It is created by an end user, whose public key is
uploaded there. A DataLab user can access application resources such as notebook servers and dataengine
clusters only via the Edge node. The Edge node is also used to set up a SOCKS proxy for accessing notebook servers
via Web UI and SSH. An Elastic (Static) IP address is assigned to the Edge node.

### Create

To create an Edge node using the DataLab Web UI, log in and click the “Upload” button (depending on the authorization
provider that was chosen at the deployment stage, the user may be taken from [LDAP](#LDAP_Authentication) or from
[Azure AD (Oauth2)](#Azure_OAuth2_Authentication)). Choose the user’s SSH public key and then click the
“Create” button. The Edge node will be deployed and the corresponding instance (virtual machine) will be started.

In Amazon (click to expand)

The following AWS resources will be created:
- Edge EC2 instance
- Elastic IP address for Edge EC2 instance
- User's S3 bucket
- Security Group for user's Edge instance
- Security Group for all further user's Notebook instances
- Security Groups for all further user's master nodes of data engine cluster
- Security Groups for all further user's slave nodes of data engine cluster
- IAM Roles and Instance Profiles for user's Edge instance
- IAM Roles and Instance Profiles for all further user's Notebook instances
- User private subnet. All further nodes (Notebooks, EMR clusters) will be provisioned in a different subnet than the SSN.

List of parameters for Edge node creation:

| Parameter | Description/Value |
|--------------------------------|-----------------------------------------------------------------------------------|
| conf\_resource | edge |
| conf\_os\_family | Name of the Linux distribution family, which is supported by DataLab (debian/redhat) |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| conf\_key\_name | Name of the uploaded SSH key file (without ".pem") |
| edge\_user\_name | Name of the user |
| aws\_vpc\_id | ID of AWS VPC where infrastructure is being deployed |
| aws\_region | AWS region where infrastructure was deployed |
| aws\_security\_groups\_ids | One or more id’s of the SSN instance security group |
| aws\_subnet\_id | ID of the AWS public subnet where Edge will be deployed |
| aws\_private\_subnet\_prefix | Prefix of the private subnet |
| conf\_tag\_resource\_id | The name of tag for billing reports |
| action | create |

In Azure (click to expand)

The following Azure resources will be created:
- Edge virtual machine
- Static public IP address for Edge virtual machine
- Network interface for Edge node
- Security Group for user's Edge instance
- Security Group for all further user's Notebook instances
- Security Groups for all further user's master nodes of data engine cluster
- Security Groups for all further user's slave nodes of data engine cluster
- User's private subnet. All further nodes (Notebooks, data engine clusters) will be provisioned in different subnet
than SSN.
- User's storage account and blob container

List of parameters for Edge node creation:

| Parameter | Description/Value |
|--------------------------------|-----------------------------------------------------------------------------------|
| conf\_resource | edge |
| conf\_os\_family | Name of the Linux distribution family, which is supported by DataLab (debian/redhat) |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| conf\_key\_name | Name of the uploaded SSH key file (without ".pem") |
| edge\_user\_name | Name of the user |
| azure\_resource\_group\_name | Name of the resource group where all DataLab resources are being provisioned |
| azure\_region | Azure region where infrastructure was deployed |
| azure\_vpc\_name | Name of Azure Virtual network where all infrastructure is being deployed |
| azure\_subnet\_name | Name of the Azure public subnet where Edge will be deployed |
| action | create |

In Google cloud (click to expand)

The following GCP resources will be created:
- Edge VM instance
- External static IP address for Edge VM instance
- Security Group for user's Edge instance
- Security Group for all further user's Notebook instances
- Security Groups for all further user's master nodes of data engine cluster
- Security Groups for all further user's slave nodes of data engine cluster
- User's private subnet. All further nodes (Notebooks, data engine clusters) will be provisioned in different subnet
than SSN.
- User's bucket

List of parameters for Edge node creation:

| Parameter | Description/Value |
|--------------------------------|-----------------------------------------------------------------------------------|
| conf\_resource | edge |
| conf\_os\_family | Name of the Linux distribution family, which is supported by DataLab (debian/redhat) |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| conf\_key\_name | Name of the uploaded SSH key file (without ".pem") |
| edge\_user\_name | Name of the user |
| gcp\_region | GCP region where infrastructure was deployed |
| gcp\_zone | GCP zone where infrastructure was deployed |
| gcp\_vpc\_name | Name of the GCP Virtual Network where all infrastructure is being deployed |
| gcp\_subnet\_name | Name of the GCP public subnet where Edge will be deployed |
| gcp\_project\_id | ID of GCP project |
| action | create |

### Start/Stop

To start or stop the Edge node, click the button that looks like a cycle in the top right corner, then click the button
located in the “Action” field and choose the appropriate action from the drop-down menu.

In Amazon (click to expand)

List of parameters for Edge node starting/stopping:

| Parameter | Description/Value |
|---------------------------|--------------------------------------------------------------|
| conf\_resource | edge |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| edge\_user\_name | Name of the user |
| aws\_region | AWS region where infrastructure was deployed |
| action | start/stop |

In Azure (click to expand)

List of parameters for Edge node starting:

| Parameter | Description/Value |
|------------------------------|---------------------------------------------------------------------------|
| conf\_resource | edge |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| edge\_user\_name | Name of the user |
| azure\_resource\_group\_name | Name of the resource group where all DataLab resources are being provisioned |
| azure\_region | Azure region where infrastructure was deployed |
| action | start |

List of parameters for Edge node stopping:

In Google cloud (click to expand)

List of parameters for Edge node starting/stopping:

| Parameter | Description/Value |
|--------------------------------|-----------------------------------------------------------------------------------|
| conf\_resource | edge |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| edge\_user\_name | Name of the user |
| gcp\_region | GCP region where infrastructure was deployed |
| gcp\_zone | GCP zone where infrastructure was deployed |
| gcp\_project\_id | ID of GCP project |
| action | start/stop |

## Notebook node

A Notebook node is an instance (virtual machine) with preinstalled analytical software, the required dependencies and
pre-configured kernels and interpreters. It is the main part of the personal analytical environment, which is set up by a
data scientist. It can be Created, Stopped and Terminated. To support a variety of analytical needs, a Notebook node can
be provisioned on any instance shape that the cloud supports in your particular region. From the analytical software
pre-installed on a notebook node, end users can access (read/write) data stored in buckets/containers.

### Create

To create a Notebook node, click the “Create new” button. Then, in the drop-down menu, choose the template type
(jupyter/rstudio/zeppelin/tensor/etc.), enter the notebook name and choose the instance shape. After you click the
“Create” button, the notebook node will be deployed and started.

List of parameters for Notebook node creation:

In Amazon (click to expand)

| Parameter | Description/Value |
|-------------------------------|-----------------------------------------------------------------------------------|
| conf\_resource | notebook |
| conf\_os\_family | Name of the Linux distribution family, which is supported by DataLab (debian/redhat) |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| conf\_key\_name | Name of the uploaded SSH key file (without ".pem") |
| edge\_user\_name | Value that previously was used when Edge being provisioned |
| aws\_notebook\_instance\_type | Value of the Notebook EC2 instance shape |
| aws\_region | AWS region where infrastructure was deployed |
| aws\_security\_groups\_ids | ID of the SSN instance's security group |
| application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
| conf\_tag\_resource\_id | The name of tag for billing reports |
| git\_creds | User git credentials in JSON format |
| action | Create |

**Note:** For the format of git_creds, see "Manage git credentials" below.

In Azure (click to expand)

| Parameter | Description/Value |
|---------------------------------|-----------------------------------------------------------------------------------|
| conf\_resource | notebook |
| conf\_os\_family | Name of the Linux distribution family, which is supported by DataLab (debian/redhat) |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| conf\_key\_name | Name of the uploaded SSH key file (without ".pem") |
| edge\_user\_name | Value that previously was used when Edge being provisioned |
| azure\_notebook\_instance\_size | Value of the Notebook virtual machine shape |
| azure\_region | Azure region where infrastructure was deployed |
| azure\_vpc\_name | Name of Azure Virtual network where all infrastructure is being deployed |
| azure\_resource\_group\_name | Name of the resource group where all DataLab resources are being provisioned |
| application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
| git\_creds | User git credentials in JSON format |
| action | Create |

In Google cloud (click to expand)

| Parameter | Description/Value |
|-------------------------------|-----------------------------------------------------------------------------------|
| conf\_resource | notebook |
| conf\_os\_family | Name of the Linux distribution family, which is supported by DataLab (debian/redhat) |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| conf\_key\_name | Name of the uploaded SSH key file (without ".pem") |
| edge\_user\_name | Value that previously was used when Edge being provisioned |
| gcp\_vpc\_name | Name of the GCP Virtual Network where all infrastructure is being deployed |
| gcp\_project\_id | ID of GCP project |
| gcp\_notebook\_instance\_size | Value of the Notebook VM instance size |
| gcp\_region | GCP region where infrastructure was deployed |
| gcp\_zone | GCP zone where infrastructure was deployed |
| application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
| git\_creds | User git credentials in JSON format |
| action | Create |

### Stop

In order to stop Notebook node, click on the “gear” button in Actions column. From the drop-down menu click on “Stop”
action.

List of parameters for Notebook node stopping:

In Amazon (click to expand)

| Parameter | Description/Value |
|---------------------------|--------------------------------------------------------------|
| conf\_resource | notebook |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| conf\_key\_name | Name of the uploaded SSH key file (without ".pem") |
| edge\_user\_name | Value that previously was used when Edge being provisioned |
| notebook\_instance\_name | Name of the Notebook instance to stop |
| aws\_region | AWS region where infrastructure was deployed |
| action | Stop |

In Azure (click to expand)

| Parameter | Description/Value |
|---------------------------------|-----------------------------------------------------------------------------------|
| conf\_resource | notebook |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| conf\_key\_name | Name of the uploaded SSH key file (without ".pem") |
| edge\_user\_name | Value that previously was used when Edge being provisioned |
| notebook\_instance\_name | Name of the Notebook instance to stop |
| azure\_resource\_group\_name | Name of the resource group where all DataLab resources are being provisioned |
| action | Stop |

In Google cloud (click to expand)

| Parameter | Description/Value |
|---------------------------|--------------------------------------------------------------|
| conf\_resource | notebook |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| conf\_key\_name | Name of the uploaded SSH key file (without ".pem") |
| edge\_user\_name | Value that previously was used when Edge being provisioned |
| notebook\_instance\_name | Name of the Notebook instance to stop |
| gcp\_region | GCP region where infrastructure was deployed |
| gcp\_zone | GCP zone where infrastructure was deployed |
| gcp\_project\_id | ID of GCP project |
| action | Stop |

### Start

In order to start Notebook node, click on the button, which looks like gear in “Action” field. Then in drop-down menu
choose “Start” action.

List of parameters for Notebook node start:

In Amazon (click to expand)

**Note:** For the format of git_creds, see "Manage git credentials" below.

In Azure (click to expand)

| Parameter | Description/Value |
|---------------------------------|-----------------------------------------------------------------------------------|
| conf\_resource | notebook |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| conf\_key\_name | Name of the uploaded SSH key file (without ".pem") |
| edge\_user\_name | Value that previously was used when Edge being provisioned |
| notebook\_instance\_name | Name of the Notebook instance to start |
| azure\_resource\_group\_name | Name of the resource group where all DataLab resources are being provisioned |
| azure\_region | Azure region where infrastructure was deployed |
| git\_creds | User git credentials in JSON format |
| action | start |

In Google cloud (click to expand)

| Parameter | Description/Value |
|---------------------------|--------------------------------------------------------------|
| conf\_resource | notebook |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| conf\_key\_name | Name of the uploaded SSH key file (without ".pem") |
| edge\_user\_name | Value that previously was used when Edge being provisioned |
| notebook\_instance\_name | Name of the Notebook instance to start |
| gcp\_region | GCP region where infrastructure was deployed |
| gcp\_zone | GCP zone where infrastructure was deployed |
| gcp\_project\_id | ID of GCP project |
| git\_creds | User git credentials in JSON format |
| action | start |

### Terminate

In order to terminate Notebook node, click on the button, which looks like gear in “Action” field. Then in drop-down
menu choose “Terminate” action.

List of parameters for Notebook node termination:

In Amazon (click to expand)

**Note:** If terminate action is called, all connected data engine clusters will be removed.

In Azure (click to expand)

| Parameter | Description/Value |
|---------------------------------|-----------------------------------------------------------------------------------|
| conf\_resource | notebook |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| edge\_user\_name | Value that previously was used when Edge being provisioned |
| notebook\_instance\_name | Name of the Notebook instance to terminate |
| azure\_resource\_group\_name | Name of the resource group where all DataLab resources are being provisioned |
| action | terminate |

In Google cloud (click to expand)

| Parameter | Description/Value |
|---------------------------|--------------------------------------------------------------|
| conf\_resource | notebook |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| edge\_user\_name | Value that previously was used when Edge being provisioned |
| notebook\_instance\_name | Name of the Notebook instance to terminate |
| gcp\_region | GCP region where infrastructure was deployed |
| gcp\_zone | GCP zone where infrastructure was deployed |
| gcp\_project\_id | ID of GCP project |
| git\_creds | User git credentials in JSON format |
| action | terminate |

### List/Install additional libraries

In order to list available libraries (OS/Python2/Python3/R/Others) on Notebook node, click on the button, which looks
like gear in “Action” field. Then in drop-down menu choose “Manage libraries” action.

In Amazon (click to expand)

List of parameters for Notebook node to **get list** of available libraries:

| Parameter | Description/Value |
|-------------------------------|-----------------------------------------------------------------------------------|
| conf\_resource | notebook |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| conf\_key\_name | Name of the uploaded SSH key file (without ".pem") |
| edge\_user\_name | Value that previously was used when Edge being provisioned |
| notebook\_instance\_name | Name of the Notebook instance |
| aws\_region | AWS region where infrastructure was deployed |
| application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
| action | lib_list |

**Note:** This operation will return a file with response **[edge_user_name]\_[application]\_[request_id]\_all\_pkgs.json**

**Example** of available libraries in response (type->library->version):

```
{
"os_pkg": {"htop": "2.0.1-1ubuntu1", "python-mysqldb": "1.3.7-1build2"},
"pip3": {"configparser": "N/A"},
"r_pkg": {"rmarkdown": "1.5"},
"others": {"Keras": "N/A"}
}
```
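To show how such a response might be consumed, here is a minimal sketch that builds the response file name from the pattern in the note above and flattens the type->library->version structure into rows. The function names `response_filename` and `flatten_libraries` are hypothetical, and the sample text is the example response shown above.

```python
import json

# The example response from above, kept as text to mimic reading the file.
SAMPLE_RESPONSE = """
{
"os_pkg": {"htop": "2.0.1-1ubuntu1", "python-mysqldb": "1.3.7-1build2"},
"pip3": {"configparser": "N/A"},
"r_pkg": {"rmarkdown": "1.5"},
"others": {"Keras": "N/A"}
}
"""


def response_filename(edge_user_name, application, request_id):
    """Build the name following [edge_user_name]_[application]_[request_id]_all_pkgs.json."""
    return f"{edge_user_name}_{application}_{request_id}_all_pkgs.json"


def flatten_libraries(response_text):
    """Turn the nested type -> library -> version mapping into (type, library, version) rows."""
    data = json.loads(response_text)
    return [(group, name, version)
            for group, libs in data.items()
            for name, version in libs.items()]


rows = flatten_libraries(SAMPLE_RESPONSE)
for group, name, version in rows:
    print(f"{group:8} {name:16} {version}")
```

A version of "N/A" is kept as-is, since the example response uses it where no version is reported.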

List of parameters for Notebook node to **install** additional libraries:

| Parameter | Description/Value |
|-------------------------------|--------------------------------------------------------------------------------------|
| conf\_resource | notebook |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| conf\_key\_name | Name of the uploaded SSH key file (without ".pem") |
| edge\_user\_name | Value that previously was used when Edge being provisioned |
| notebook\_instance\_name | Name of the Notebook instance |
| aws\_region | AWS region where infrastructure was deployed |
| application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
| libs | List of additional libraries in JSON format with type (os_pkg/pip3/r_pkg/others)|
| action | lib_install |

**Example** of the libs parameter:

```
{
...
"libs": [
{"group": "os_pkg", "name": "nmap"},
{"group": "os_pkg", "name": "htop"},
{"group": "pip3", "name": "configparser"},
{"group": "r_pkg", "name": "rmarkdown"},
{"group": "others", "name": "Keras"}
]
...
}
```
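A list in this shape can be generated from a simple mapping of library type to names. The sketch below is illustrative only: `build_libs` is a hypothetical helper, and the allowed types are the ones listed in the parameter table above (os_pkg/pip3/r_pkg/others).

```python
import json

# Library types named in the parameter table above.
ALLOWED_GROUPS = ("os_pkg", "pip3", "r_pkg", "others")


def build_libs(requested):
    """Convert {type: [names]} into the list of {"group", "name"} entries."""
    libs = []
    for group, names in requested.items():
        if group not in ALLOWED_GROUPS:
            raise ValueError(f"unknown library group: {group}")
        for name in names:
            libs.append({"group": group, "name": name})
    return libs


libs = build_libs({"os_pkg": ["nmap", "htop"], "pip3": ["configparser"]})
print(json.dumps({"libs": libs}, indent=2))
```

Rejecting unknown types early avoids sending a request that names a package manager the notebook does not handle.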

In Azure (click to expand)

List of parameters for Notebook node to **get list** of available libraries:

| Parameter | Description/Value |
|-------------------------------|-----------------------------------------------------------------------------------|
| conf\_resource | notebook |
| conf\_service\_base\_name | Unique infrastructure value, specified during SSN deployment |
| conf\_key\_name | Name of the uploaded SSH key file (without ".pem") |
| edge\_user\_name | Value that previously was used when Edge being provisioned |
| notebook\_instance\_name | Name of the Notebook instance |
| azure\_resource\_group\_name | Name of the resource group where all DataLab resources are being provisioned |
| application | Type of the notebook template (jupyter/rstudio/zeppelin/tensor/deeplearning) |
| action | lib_list |

List of parameters for Notebook node to **install** additional libraries:
