https://github.com/steveee27/two-stage-bert-for-sports-news-classification-using-llm
This project scrapes sports news articles and classifies them using a Two-Stage BERT model with Large Language Models (LLM). The first stage distinguishes between football and non-football news, while the second classifies football articles into specific leagues such as Liga Inggris and Liga Indonesia.
- Host: GitHub
- URL: https://github.com/steveee27/two-stage-bert-for-sports-news-classification-using-llm
- Owner: steveee27
- License: mit
- Created: 2025-02-13T16:32:03.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-02-13T16:51:10.000Z (8 months ago)
- Last Synced: 2025-02-13T17:41:38.590Z (8 months ago)
- Topics: bert, llm, naturallanguageprocessing, sportsnews, textclassification, webscraping
- Language: Jupyter Notebook
- Homepage:
- Size: 3.01 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Two-Stage BERT for Sports News Classification Using LLM
## Table of Contents
1. [Project Description](#project-description)
2. [Data Collection and Scraping](#data-collection-and-scraping)
3. [Methodology](#methodology)
4. [Model Architecture](#model-architecture)
5. [Training and Evaluation](#training-and-evaluation)
6. [Results](#results)
7. [Conclusion](#conclusion)
8. [License](#license)

## Project Description
This project classifies sports news articles into five categories, **Liga Inggris**, **Liga Indonesia**, **Liga Spanyol**, **Liga Italia**, and **Olahraga Non Sepak Bola**, using two approaches. The first (**LLM 1-Stage**) classifies all five categories with a single model. The second (**LLM 2-Stage**) works in two stages: the first stage distinguishes between football and non-football news, and the second stage classifies football news into specific leagues.

Data was scraped from the news websites **Liputan6.com**, **Detik.com**, and **Antaranews.com**. Both approaches use **BERT**, which the project refers to as its **Large Language Model (LLM)**, for text classification.
## Data Collection and Scraping
The data for this project was scraped from the following news websites:
- **Liputan6.com**
- **Detik.com**
- **Antaranews.com**

A web scraper was used to collect articles in the following categories:
- **Liga Inggris** (English Premier League)
- **Liga Indonesia** (Indonesian League)
- **Liga Spanyol** (Spanish League)
- **Liga Italia** (Italian League)
- **Olahraga Non Sepak Bola** (Non-football Sports)

The data was then processed and structured into a format suitable for training the models.
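
The two-stage approach needs two label sets derived from the same five-category data: a binary football/non-football set for stage 1 and a league-only set for stage 2. The source does not show how this was done, so the following is only a minimal sketch. The function name `build_stage_datasets` and the `(text, category)` record layout are assumptions for illustration.

```python
# Category names come from the list above; everything else is hypothetical.
FOOTBALL_LEAGUES = ("Liga Inggris", "Liga Indonesia", "Liga Spanyol", "Liga Italia")
NON_FOOTBALL = "Olahraga Non Sepak Bola"


def build_stage_datasets(records):
    """Derive stage-1 and stage-2 training sets from five-category records.

    records: iterable of (text, category) pairs.
    Returns (stage1, stage2):
      stage1 holds every article labeled "Sepak Bola" or "Non Sepak Bola";
      stage2 holds only football articles, labeled with their league.
    """
    stage1, stage2 = [], []
    for text, category in records:
        if category in FOOTBALL_LEAGUES:
            stage1.append((text, "Sepak Bola"))
            stage2.append((text, category))
        elif category == NON_FOOTBALL:
            stage1.append((text, "Non Sepak Bola"))
        else:
            raise ValueError(f"unknown category: {category!r}")
    return stage1, stage2
```

With this layout, the stage-2 model never sees non-football articles during training, which matches the description of stage 2 as a league-only classifier.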
## Methodology
1. **Web Scraping**: Data was collected from the specified news sources using Python libraries like `BeautifulSoup`.
2. **Data Preprocessing**: The text data was cleaned, tokenized, and lemmatized to remove unnecessary characters and ensure consistency.
3. **Modeling**:
   - **LLM 1-Stage**: This model directly classifies articles into five categories: **Liga Inggris**, **Liga Indonesia**, **Liga Spanyol**, **Liga Italia**, and **Olahraga Non Sepak Bola**.
   - **LLM 2-Stage**:
     - **Stage 1**: This stage classifies the articles into two groups: **Sepak Bola** (Football) and **Non Sepak Bola** (Non-football Sports).
     - **Stage 2**: For articles classified as **Sepak Bola**, this stage further classifies them into one of four football leagues: **Liga Inggris**, **Liga Indonesia**, **Liga Spanyol**, or **Liga Italia**.

## Model Architecture
The project uses **BERT (Bidirectional Encoder Representations from Transformers)**, a pretrained transformer language model, for every classifier:
- **Stage 1 (LLM 2-Stage)**: The BERT model classifies articles as **Sepak Bola** (Football) or **Non Sepak Bola** (Non-football).
- **Stage 2 (LLM 2-Stage)**: For football-related articles, a second BERT model classifies them into one of the football leagues (Liga Inggris, Liga Indonesia, Liga Spanyol, Liga Italia).
- **LLM 1-Stage**: A single BERT model classifies all articles into one of the five categories (including both football and non-football news) in one step.

**Input**: News article text.

**Output**: The predicted category: **football vs non-football** for Stage 1, one of the four leagues for Stage 2, or one of the five categories for the LLM 1-Stage model.

## Training and Evaluation
The dataset was split into **Training (70%)**, **Validation (15%)**, and **Test (15%)** sets. The models were trained using the **Hugging Face `transformers`** library, and performance was evaluated using the following metrics:
- **Accuracy**
- **Precision**
- **Recall**
- **F1 Score**

The confusion matrices for both models show the classification performance across all stages.
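
The 70/15/15 split described above can be sketched in plain Python. The source does not say whether the split was stratified by category or which random seed was used, so both are assumptions here, as is the helper name `stratified_split`.

```python
import random
from collections import defaultdict


def stratified_split(records, train=0.70, val=0.15, seed=42):
    """Split (text, label) records into train/val/test sets, label by label.

    The test share is whatever remains after the train and val shares.
    Stratification and the seed are assumptions, not taken from the project.
    """
    by_label = defaultdict(list)
    for record in records:
        by_label[record[1]].append(record)

    rng = random.Random(seed)
    train_set, val_set, test_set = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * train)
        n_val = round(len(group) * val)
        train_set += group[:n_train]
        val_set += group[n_train:n_train + n_val]
        test_set += group[n_train + n_val:]
    return train_set, val_set, test_set
```

Splitting per label keeps the category proportions roughly equal across the three sets, which matters when some leagues have far fewer articles than others.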
## Results
### **1-Stage BERT Model (Classifying All Labels)**
The **LLM 1-Stage model** classifies all 5 categories in one model.

**Confusion Matrix for LLM 1-Stage Model:**
- **Accuracy**: 92.41% (Test)
- **Precision**: 93.08% (Test)
- **Recall**: 92.41% (Test)
- **F1 Score**: 92.57% (Test)
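
The reported test recall equals the reported test accuracy (both 92.41%). That is what support-weighted averaging always produces, so the figures are consistent with weighted averages, although the source does not state which averaging was used. The sketch below computes the four metrics under that assumption; `weighted_metrics` is a hypothetical helper, not code from the project.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy plus support-weighted precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)      # true articles per label
    predicted = Counter(y_pred)    # predicted articles per label
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    precision = recall = f1 = 0.0
    for label in labels:
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / support[label] if support[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support[label] / total  # share of true articles with this label
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Because each label's recall is weighted by its share of the true labels, the weighted recall reduces to correct predictions divided by total articles, which is the accuracy.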
### **2-Stage BERT Model**
The **LLM 2-Stage model** performs classification in two stages.

**Confusion Matrix for LLM 2-Stage Model (Stage 1 - Football vs Non-football):**
- **Accuracy**: 100% (Test)

**Confusion Matrix for LLM 2-Stage Model (Stage 2 - Football Category Classification):**
- **Liga Inggris**: 94.08% (Test)
- **Liga Indonesia**: 96.83% (Test)
- **Liga Spanyol**: 90.94% (Test)
- **Liga Italia**: 91.32% (Test)
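
The two-stage results are reported per stage, while the 1-stage model is scored on all five categories at once. One way to put both on the same footing is to route each article through both stages and score the final five-category label. This is a sketch only: `stage1` and `stage2` are assumed to be callables that wrap the two fine-tuned classifiers and return label strings, and both function names are hypothetical.

```python
NON_FOOTBALL = "Olahraga Non Sepak Bola"


def classify_two_stage(text, stage1, stage2):
    """Route one article through both stages and return a five-category label.

    stage1(text) is assumed to return "Sepak Bola" or "Non Sepak Bola";
    stage2(text) is assumed to return one of the four league names.
    """
    if stage1(text) == "Sepak Bola":
        return stage2(text)
    return NON_FOOTBALL


def end_to_end_accuracy(texts, gold_labels, stage1, stage2):
    """Fraction of articles whose final two-stage label matches the gold label."""
    predictions = [classify_two_stage(t, stage1, stage2) for t in texts]
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)
```

In this design stage 2 only runs on articles that stage 1 accepts as football, so a stage-1 mistake cannot be corrected later.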
## Conclusion
This project demonstrates the effectiveness of using a **Two-Stage BERT model** for sports news classification. The **LLM 1-Stage model** classifies all five categories in a single step and reaches 92.41% test accuracy. The **LLM 2-Stage model** first separates football from non-football news, with 100% test accuracy, and then assigns football articles to a league, with per-league test scores between 90.94% and 96.83%.

Future improvements could include:
- Expanding the dataset with more diverse sources of sports news.
- Fine-tuning the BERT models with domain-specific data.
- Exploring other techniques like multi-task learning to improve performance across both stages.

## License
This project is licensed under the [MIT License](LICENSE). You are free to use, modify, and distribute this project as long as proper attribution is given to the original author.