{"id":15990835,"url":"https://github.com/prrao87/topic-modelling","last_synced_at":"2025-06-20T14:33:38.945Z","repository":{"id":131349039,"uuid":"234853636","full_name":"prrao87/topic-modelling","owner":"prrao87","description":"Comparing the scalability and quality of topic models in Gensim and PySpark ","archived":false,"fork":false,"pushed_at":"2024-05-03T19:46:32.000Z","size":10543,"stargazers_count":6,"open_issues_count":3,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-19T22:49:52.397Z","etag":null,"topics":["data-mining","gensim","lda","natural-language-processing","nlp","pyspark","python","topic-modeling","topic-models"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/prrao87.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-19T06:43:21.000Z","updated_at":"2024-12-14T19:42:42.000Z","dependencies_parsed_at":null,"dependency_job_id":"f5284efe-4ed1-4703-a5f3-32e1b11facb6","html_url":"https://github.com/prrao87/topic-modelling","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/prrao87/topic-modelling","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prrao87%2Ftopic-modelling","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prrao87%2Ftopic-modelling/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prrao87%2Ftopic-modelling/releases","manifests_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/prrao87%2Ftopic-modelling/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/prrao87","download_url":"https://codeload.github.com/prrao87/topic-modelling/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prrao87%2Ftopic-modelling/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260961811,"owners_count":23089297,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-mining","gensim","lda","natural-language-processing","nlp","pyspark","python","topic-modeling","topic-models"],"created_at":"2024-10-08T05:40:31.566Z","updated_at":"2025-06-20T14:33:33.932Z","avatar_url":"https://github.com/prrao87.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Building a Scalable Topic Model Workflow\n\n## Latent Dirichlet Allocation\nTopic modelling is an unsupervised machine learning technique to discover the main ‘topics’, or themes, in a collection of unstructured documents. A ‘topic’ here refers to a cluster of words that represents a larger concept from the real world. Each document in a corpus can be imagined as consisting of multiple topics in different proportions all at once; for example, in an article about a major airline procuring new aircraft, it is reasonable to expect many words related to finance, geopolitics, travel policy, as well as passenger trends and market events that led to the deal taking place. 
A document can thus be composed of several topics, each consisting of specific words (that may or may not overlap between topics).\n\nTopic modelling encapsulates these ideas into a mathematical framework that discovers clusters of word distributions representing overall themes within the corpus, making it a useful technique to analyze very large datasets for their content. The mathematical goal of topic modelling is to fit a model’s parameters to the given data using heuristic rules, such that the likelihood that the data arose from the model is maximized. Such methods are known as parametric methods, among which Latent Dirichlet Allocation (LDA) is by far the most popular.\n\nSeveral Python-based implementations of LDA exist; the focus in this repo is to study the topic modelling results of two specific implementations: __Gensim__ and __PySpark__. Broadly speaking, both methods perform similar steps, but it is entirely up to the user to preprocess the text as required beforehand. Specifically for news articles and communication research corpora, the below sequence of preprocessing steps (as per [[1]](#references)) is found to provide good topic model results downstream:\n\n1. Sentence detection\n1. Tokenization\n1. Lowercasing\n1. Normalizing (cleaning unwanted symbols and artifacts)\n1. Stopword removal\n1. Lemmatization\n1. Topic model training\n\n### Stopword removal\nBecause a large proportion of text in any corpus contains repetitive and meaningless words (with respect to interpreting topics) such as \"a\", \"and\", or \"the\", as well as a host of other common verbs/nouns, stopword removal is a very important step in text preprocessing. A stopword list is **highly domain-specific**, so careful thought should be put in beforehand to curate a reasonable list of stopwords. 
Topic modelling is typically an iterative process, where multiple training runs of the model are done to identify more stopwords that hinder topic interpretation.\n\n### Why perform lemmatization?\nAlthough lemmatization is not strictly necessary, Maier et al. [[1]](#references) state in their review paper of topic modelling literature that following a clear sequence of steps during text preprocessing can yield better and more interpretable topics. Since lemmatization reduces words to their root form, it results in large-scale feature reduction by combining words that are very similar to each other (such as *kill*, *killed* and *killing*).\n\n\n## Set up Python Environment\n\nFirst, set up a virtual environment and install the required libraries from requirements.txt:\n\n```\npython3 -m venv venv\nsource venv/bin/activate\npip3 install -r requirements.txt\n```\n\nFor further development, simply activate the existing virtual environment.\n\n```\nsource venv/bin/activate\n```\n\n#### Language model for spaCy\n\nFor NLP-related tasks such as lemmatization and stopword lists, we use the spaCy library's \"small\" English language model (the \"large\" one can be used as well, but this can take longer to load and generate results). 
Download the spaCy language model as follows:\n\n```\npython3 -m spacy download en_core_web_sm\n```\n\n## Set up PySpark Environment\n\nLDA is done in PySpark using Spark's ML pipeline as well as the external [__SparkNLP__](https://nlp.johnsnowlabs.com/) library (for lemmatization).\n\nTo set up PySpark, first [install Java 8+ (using OpenJDK)](https://openjdk.java.net/install/).\n\n#### Install PySpark and SparkNLP 2.4+\n```\npip install pyspark==2.4.4 spark-nlp==2.4.5\n```\nSpecify the PySpark location as an environment variable:\n\n```\nexport SPARK_HOME=/path/to/spark/\nexport PYSPARK_PYTHON=python3\nexport PYSPARK_DRIVER_PYTHON=python3\n```\n\n## Obtain Raw Data\nAs an example, the New York Times dataset from Kaggle is used (download instructions are in the folder `data/`). This is an English-language news dataset of 8,800 articles from the New York Times over a few months in 2016. The structure of the dataset, once preprocessed, is as follows:\n\n| date | url | content |\n|:------:| :-----: | :-------: |\n| 2016-06-30 |  http://www.nytimes.com/2016/06/30/sports/baseb..   | WASHINGTON — Stellar pitching kept the Mets af...  |\n| 2016-06-30 |  http://www.nytimes.com/2016/06/30/nyregion/may..   | Mayor Bill de Blasio’s counsel and chief legal...  |\n| ...|  ... |  ... |\n\nThe dataset contains article content from a raw HTML dump, so it is full of unnecessary symbols and artifacts.\n\n## Create initial stopword list\nTo train a topic model, a hand-curated, domain-specific stopword list is necessary. 
Run the script `topic_model/stopwords/create_stopword_list.py`.\n```\ncd topic_model/stopwords\npython3 create_stopword_list.py\n```\nThis script pulls the default spaCy stopword list and adds news article-specific vocabulary to it (obtained after some trial and error and inspecting initial model results).\n\n## Train topic model\nSee the [src](https://github.com/prrao87/topic-modelling/tree/master/src) directory.\n\n---\n\n## References\n\n[1] Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., ... Adam, S. (2018). Applying LDA topic modeling in communication research: Toward a valid and reliable methodology. *Communication Methods and Measures*, 12(2–3), 93–118. doi:10.1080/19312458.2018.1430754 [Taylor \u0026 Francis Online](https://www.tandfonline.com/servlet/linkout?suffix=CIT0040\u0026dbid=20\u0026doi=10.1080%2F19312458.2018.1458084\u0026key=10.1080%2F19312458.2018.1430754\u0026tollfreelink=2_18_091d52e2c25fb605f624551cc29e5f412ee28f10d2308cd98d03acb52762af29).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprrao87%2Ftopic-modelling","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprrao87%2Ftopic-modelling","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprrao87%2Ftopic-modelling/lists"}