{"id":27112433,"url":"https://github.com/kingflow-23/wikipedia-topic-clustering","last_synced_at":"2026-04-15T12:37:27.748Z","repository":{"id":281233471,"uuid":"944534232","full_name":"Kingflow-23/wikipedia-topic-clustering","owner":"Kingflow-23","description":"This project scrapes Wikipedia pages on various topics, processes the text using TF-IDF vectorization, and clusters the topics using KMeans. The results are visualized in a 2D plot using UMAP, providing insights into the relationships and groupings of different Wikipedia topics based on their content.","archived":false,"fork":false,"pushed_at":"2025-03-07T17:59:48.000Z","size":1,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-07T18:39:09.626Z","etag":null,"topics":["beautifulsoup4","cleaning-data","clustering","jupyrt-notebook","python","scraping","umap","vectorization","wikipedia-api"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Kingflow-23.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-07T14:14:41.000Z","updated_at":"2025-03-07T18:03:36.000Z","dependencies_parsed_at":"2025-03-07T18:49:22.490Z","dependency_job_id":null,"html_url":"https://github.com/Kingflow-23/wikipedia-topic-clustering","commit_stats":null,"previous_names":["kingflow-23/wikipedia-topic-clustering"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kingflow-23%2Fwik
ipedia-topic-clustering","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kingflow-23%2Fwikipedia-topic-clustering/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kingflow-23%2Fwikipedia-topic-clustering/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kingflow-23%2Fwikipedia-topic-clustering/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Kingflow-23","download_url":"https://codeload.github.com/Kingflow-23/wikipedia-topic-clustering/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247578575,"owners_count":20961270,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup4","cleaning-data","clustering","jupyrt-notebook","python","scraping","umap","vectorization","wikipedia-api"],"created_at":"2025-04-07T01:59:14.595Z","updated_at":"2025-10-18T20:28:23.675Z","avatar_url":"https://github.com/Kingflow-23.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Wikipedia Topic Clustering and Visualization\n\nThis project scrapes text data from Wikipedia pages, processes the data using TF-IDF vectorization, and visualizes the clustering of topics using UMAP for dimensionality reduction and KMeans clustering.\n\n## Project Overview\n\nThe goal of this project is to fetch content from various Wikipedia pages on different topics, clean and preprocess the text data, vectorize the content using the Term Frequency-Inverse Document Frequency (TF-IDF) 
method, and visualize the relationships between these topics using dimensionality reduction techniques. The topics are clustered using KMeans and visualized with UMAP projections.\n\n### Primary Steps in the Workflow:\n\n---\n\n### 1. Scraping Wikipedia Pages\n\n**Objective:**  \nThe first step is to gather content from Wikipedia pages on various topics of interest. \n\n- **How It Works:**\n  - The script uses the `requests` library to send HTTP requests to the Wikipedia API (or to scrape HTML content directly from the pages).\n  - The `beautifulsoup4` library is used to parse the HTML structure and extract only the relevant text data (the content section of each article).\n  - The notebooks try two approaches: scraping through `wikipedia-api` and scraping the HTML directly with `beautifulsoup4`.\n\n**Example Topics:**  \n- Artificial Intelligence\n- Computer Science\n- Machine Learning\n- Data Science\n\nOnce the data is scraped, it is stored in a structured format (e.g., a Pandas DataFrame) with columns for the topic title and the corresponding text content.\n\n---\n\n### 2. Text Preprocessing\n\n**Objective:**  \nOnce the content is scraped, the next step is to preprocess and clean the text data to make it suitable for analysis.\n\n- **How It Works:**\n  - The text is cleaned by removing unwanted characters like punctuation and numbers, stripping unnecessary whitespace, and converting the text to lowercase for uniformity.\n  - Regular expressions (`re` module) are used to filter out non-alphabetic characters, leaving only the relevant words.\n  - Additional text normalization techniques may include tokenization (splitting text into individual words) and removing stopwords (common words like “the,” “and,” etc., that don’t add meaningful context).\n  \n**Result:**  \nAt the end of preprocessing, you’ll have a clean list of text data that is ready for vectorization.\n\n---\n\n### 3. 
TF-IDF Vectorization\n\n**Objective:**  \nThe next step is to convert the text data into numerical format, which will enable the machine learning algorithm (KMeans) to work with it.\n\n- **How It Works:**\n  - The TF-IDF (Term Frequency-Inverse Document Frequency) method is applied to the preprocessed text.\n  - TF-IDF assigns a weight to each word based on how frequently it appears in a document relative to its frequency across all documents. This helps highlight the most important terms for each topic.\n  - The `TfidfVectorizer` from `scikit-learn` is used to convert the preprocessed text into a sparse matrix of numerical values.\n  \n**Result:**  \nA TF-IDF matrix where each row corresponds to a document (Wikipedia page), and each column corresponds to a word (term) with its corresponding weight (importance).\n\n---\n\n### 4. Clustering Using KMeans\n\n**Objective:**  \nWith the TF-IDF matrix in hand, we now want to group similar Wikipedia topics into clusters based on their content.\n\n- **How It Works:**\n  - KMeans is an unsupervised machine learning algorithm used for clustering.\n  - The algorithm divides the dataset into a predefined number of clusters (K). It does this by iteratively assigning data points (in this case, Wikipedia topics) to clusters based on the closest centroids.\n  - The KMeans algorithm in `scikit-learn` is used, with the number of clusters (K) defined based on prior knowledge or experimentation.\n  \n**Result:**  \nEach Wikipedia topic is assigned to one of the K clusters, grouping similar topics together.\n\n---\n\n### 5. Dimensionality Reduction Using UMAP\n\n**Objective:**  \nAfter clustering the topics, we need to visualize the results. 
Since the TF-IDF matrix is high-dimensional, we need to reduce its dimensionality to plot the data on a 2D graph.\n\n- **How It Works:**\n  - UMAP (Uniform Manifold Approximation and Projection) is used for dimensionality reduction; it aims to preserve local neighborhood structure while retaining much of the global structure of the data.\n  - UMAP projects the high-dimensional TF-IDF matrix into a 2D space, which makes it easier to visualize and interpret the clusters.\n  \n**Result:**  \nA 2D projection of the clustered topics, where each point on the plot represents a Wikipedia topic, and the color or shape of the points indicates which cluster they belong to.\nA 3D, interactive version (built with `plotly`) can be found in the `api_project_3d.ipynb` notebook.\n\n---\n\n### 6. Saving and Displaying the Plot\n\n**Objective:**  \nFinally, we want to visualize the clustering results in a meaningful way and save the visualization for future reference.\n\n- **How It Works:**\n  - The `matplotlib` library is used to create a scatter plot.\n  - Each point in the plot represents a Wikipedia topic, and the points are color-coded according to the cluster they belong to.\n  - A timestamp is added to the filename of the plot to make each generated plot unique.\n  \n**Result:**  \n- A PNG file of the 2D visualization is saved to disk with a filename in the format `scraping_wikipedia_result_YYYY-MM-DD_HH-MM-SS.png`.\n- The plot is displayed on the screen so you can visually inspect how the topics are clustered.\n\n---\n\n### Output:\n\n- **Saved Plot:** A PNG file of the 2D scatter plot showing the clustering of Wikipedia topics.\n- **Visualization:** A 2D plot where similar topics are grouped together, making it easy to see which topics are related based on their textual content.\n\nExample of a saved file name:  \n`scraping_wikipedia_result_YYYY-MM-DD_HH-MM-SS.png`\n\n---\n\n## Installation\n\nTo get started, clone the repository and install the required dependencies:\n\n```bash\ngit clone 
https://github.com/Kingflow-23/wikipedia-topic-clustering.git\n```\n\n```bash\npip install requests beautifulsoup4 pandas seaborn matplotlib scikit-learn umap-learn wikipedia-api plotly\n```\n\nor \n\n```bash\npip install -r requirements.txt\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkingflow-23%2Fwikipedia-topic-clustering","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkingflow-23%2Fwikipedia-topic-clustering","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkingflow-23%2Fwikipedia-topic-clustering/lists"}