https://github.com/analyticalmonk/pyspark_nlp_workshop
Instructions and code for the workshop "From Big Data to NLP Insights: Unlocking the Power of PySpark and Spark NLP"
- Host: GitHub
- URL: https://github.com/analyticalmonk/pyspark_nlp_workshop
- Owner: analyticalmonk
- Created: 2023-04-03T09:44:44.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-05-09T16:09:14.000Z (about 3 years ago)
- Last Synced: 2025-04-15T09:53:09.003Z (about 1 year ago)
- Topics: databricks, databricks-notebooks, distributed-computing, nlp, pyspark, spark, spark-nlp, workshop
- Language: Jupyter Notebook
- Homepage: https://odsc.com/speakers/from-big-data-to-nlp-insights-getting-started-with-pyspark-and-spark-nlp/
- Size: 622 KB
- Stars: 13
- Watchers: 2
- Forks: 2
- Open Issues: 0
# PySpark NLP virtual workshop
Instructions and code for the workshop "From Big Data to NLP Insights: Unlocking the Power of PySpark and Spark NLP"
## Setup
### Databricks community edition
We will run the training code on Databricks Community Edition. Create your account by following the [instructions provided in the official documentation](https://docs.databricks.com/getting-started/community-edition.html). Please complete this step before moving forward.
### Databricks workspace
You can now import the required Jupyter notebooks into your Databricks workspace [using this link](https://storage.googleapis.com/odsc-23/ODSC%20PySpark%20NLP%20session.dbc).
In the left-hand navbar, click `Workspace`, open the dropdown, click `Import`, choose the `URL` option, enter the link, and click `Import`.

### Compute cluster
#### Cluster creation
We will now create a compute cluster that we will run our code on.
- Click the "Compute" tab in the navbar, then click the "Create Compute" button. You will be taken to the "New Cluster" configuration view.
- Give the cluster a name. From the "Databricks runtime version" dropdown, choose "Runtime: 12.2 LTS (Scala 2.12, Spark 3.3.2)".
- Click the "Spark" tab and add the following lines to the "Spark config" field:
```
spark.kryoserializer.buffer.max 2000M
spark.serializer org.apache.spark.serializer.KryoSerializer
```
- Click on "Create Cluster". It may take a few minutes before the cluster gets created.
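
The two Spark config lines above are written as `key value` pairs, one per line. As a rough illustration (not part of the workshop notebooks), the sketch below parses those lines into a dictionary and, where PySpark is installed locally, applies them to a `SparkSession` builder. The helper names `parse_spark_config` and `build_local_session` are hypothetical; on Databricks the cluster UI applies these settings for you, so this is only relevant if you want to reproduce the setup elsewhere.

```python
# Hypothetical helpers for illustration; not part of the workshop material.
SPARK_CONFIG = """
spark.kryoserializer.buffer.max 2000M
spark.serializer org.apache.spark.serializer.KryoSerializer
"""


def parse_spark_config(text):
    """Turn 'key value' lines into a dict, skipping blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first run of whitespace: key first, value after.
        key, value = line.split(None, 1)
        settings[key] = value.strip()
    return settings


def build_local_session(settings, app_name="pyspark-nlp-workshop"):
    """Apply the settings to a local SparkSession (requires pyspark)."""
    # Imported lazily so the parsing step works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


settings = parse_spark_config(SPARK_CONFIG)
print(settings)
```

In a running notebook, `spark.sparkContext.getConf().get("spark.serializer")` is one way to check that the cluster picked up the serializer setting.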

At this point, you can successfully run the code in module 1's notebook. For the next 2 modules, we need to install the Spark NLP library in our cluster.
#### Spark NLP installation
In the Libraries tab of your cluster, follow these steps:
- Install New -> PyPI -> spark-nlp -> Install
- Install New -> Maven -> Coordinates -> com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.1 -> Install

Voilà! You're all set to start.
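
Once both libraries show as installed, a quick check from a notebook cell can confirm that the Python package is importable and which version was picked up. This is a minimal sketch, not part of the workshop notebooks: `installed_spark_nlp_version` is a hypothetical helper, and the comparison against `4.4.1` assumes you installed the Maven coordinate listed above.

```python
EXPECTED_VERSION = "4.4.1"  # version from the Maven coordinate in the setup steps


def installed_spark_nlp_version():
    """Return the installed spark-nlp Python package version, or None if missing."""
    try:
        import sparknlp
    except ImportError:
        return None
    return sparknlp.version()


version = installed_spark_nlp_version()
if version is None:
    print("spark-nlp is not installed in this environment")
elif version != EXPECTED_VERSION:
    print(f"spark-nlp {version} found; the Maven JAR above is {EXPECTED_VERSION}")
else:
    print(f"spark-nlp {version} matches the Maven JAR")
```

If the two versions differ, pinning the PyPI package (for example, entering `spark-nlp==4.4.1` in the PyPI field) is one option to keep the Python package and the JAR aligned.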
## Code
The workshop code is distributed across 3 Jupyter notebooks, each corresponding to a workshop module:
- Module 1: Basics of PySpark and the DataFrame API
- Module 2: PySpark for NLP
- Module 3: Advanced NLP with Spark NLP

The notebooks should be in your workspace if you have completed the setup steps. They are also available in this repository if you want to go through them after the workshop.

_Note: A conceptual introduction to Jupyter notebooks can be found [here](https://www.databricks.com/glossary/jupyter-notebook)._