{"id":18916504,"url":"https://github.com/analyticalmonk/pyspark_nlp_workshop","last_synced_at":"2025-10-05T12:35:03.237Z","repository":{"id":162935318,"uuid":"622901439","full_name":"analyticalmonk/pyspark_nlp_workshop","owner":"analyticalmonk","description":"Instructions and code for the workshop \"From Big Data to NLP Insights: Unlocking the Power of PySpark and Spark NLP\"","archived":false,"fork":false,"pushed_at":"2023-05-09T16:09:14.000Z","size":637,"stargazers_count":13,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-15T09:53:09.003Z","etag":null,"topics":["databricks","databricks-notebooks","distributed-computing","nlp","pyspark","spark","spark-nlp","workshop"],"latest_commit_sha":null,"homepage":"https://odsc.com/speakers/from-big-data-to-nlp-insights-getting-started-with-pyspark-and-spark-nlp/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/analyticalmonk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-04-03T09:44:44.000Z","updated_at":"2025-01-10T04:30:55.000Z","dependencies_parsed_at":null,"dependency_job_id":"25186157-a23b-4d5d-835f-2ade64c33ce6","html_url":"https://github.com/analyticalmonk/pyspark_nlp_workshop","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/analyticalmonk/pyspark_nlp_workshop","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/analyticalmonk%2Fpyspark_nlp_workshop","tags_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/repositories/analyticalmonk%2Fpyspark_nlp_workshop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/analyticalmonk%2Fpyspark_nlp_workshop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/analyticalmonk%2Fpyspark_nlp_workshop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/analyticalmonk","download_url":"https://codeload.github.com/analyticalmonk/pyspark_nlp_workshop/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/analyticalmonk%2Fpyspark_nlp_workshop/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278457432,"owners_count":25989952,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-05T02:00:06.059Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["databricks","databricks-notebooks","distributed-computing","nlp","pyspark","spark","spark-nlp","workshop"],"created_at":"2024-11-08T10:19:49.721Z","updated_at":"2025-10-05T12:35:03.227Z","avatar_url":"https://github.com/analyticalmonk.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PySpark NLP virtual workshop\n\nInstructions and code for the workshop \"From Big Data to NLP Insights: Unlocking the Power of PySpark and Spark NLP\"\n\n## Setup\n\n### Databricks 
community edition\n\nWe will run the training code on Databricks Community Edition. Create your account by following the [instructions provided in the official documentation](https://docs.databricks.com/getting-started/community-edition.html). Please complete this step before moving forward.\n\n### Databricks workspace\n\nYou can now create a Databricks workspace with the required Jupyter notebooks [using this link](https://storage.googleapis.com/odsc-23/ODSC%20PySpark%20NLP%20session.dbc). The steps are shown in the GIF below.  \n\nFrom the left-hand navbar, click on `Workspace` \u003e click on the dropdown \u003e click on `Import` \u003e choose the `URL` option and enter the link \u003e click on `Import`.\n\n![](databricks_workspace_import_steps.gif)\n\n### Compute cluster\n\n#### Cluster creation\n\nWe will now create a compute cluster to run our code on.\n\n- Click on the Compute tab on the navbar. Then click on the \"Create Compute\" button. You will be taken to the \"New Cluster\" configuration view.\n- Assign the cluster a name. From the \"Databricks runtime version\" dropdown, choose \"Runtime: 12.2 LTS (Scala 2.12, Spark 3.3.2)\".\n- Click on the \"Spark\" tab. Add the following lines to the \"Spark config\" field.\n```\nspark.kryoserializer.buffer.max 2000M\nspark.serializer org.apache.spark.serializer.KryoSerializer\n```\n- Click on \"Create Cluster\". It may take a few minutes before the cluster gets created.\n\n![databricks_cluster_creation](https://user-images.githubusercontent.com/4419448/237058641-e67762bc-e459-4586-857c-0851f611a218.gif)\n\nAt this point, you can successfully run the code in module 1's notebook. 
For the next two modules, we need to install the Spark NLP library on our cluster.\n\n#### Spark NLP installation\n\nIn the Libraries tab of your cluster, follow these steps:\n\n- Install New -\u003e PyPI -\u003e spark-nlp -\u003e Install\n- Install New -\u003e Maven -\u003e Coordinates -\u003e com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.1 -\u003e Install\n\nVoila! You're all set to start now.\n\n## Code\n\nThe workshop code is distributed across 3 Jupyter notebooks. Each corresponds to a workshop module. They are:\n\n- Module 1: Basics of PySpark and the DataFrame API\n- Module 2: PySpark for NLP\n- Module 3: Advanced NLP with Spark NLP\n\nThey should be in your workspace if you have successfully completed the setup steps. They are also available in this repository if you want to go through them after the workshop.\n\n_Note: A conceptual introduction to Jupyter notebooks can be found [here](https://www.databricks.com/glossary/jupyter-notebook)._\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanalyticalmonk%2Fpyspark_nlp_workshop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fanalyticalmonk%2Fpyspark_nlp_workshop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanalyticalmonk%2Fpyspark_nlp_workshop/lists"}