Multi-Class Text Classification for News Headlines
- Host: GitHub
- URL: https://github.com/pedasoft-consult/multi_class_text_classification
- Owner: Pedasoft-Consult
- Created: 2025-03-05T08:19:27.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-03-05T08:34:08.000Z (11 months ago)
- Last Synced: 2025-03-05T09:32:41.431Z (11 months ago)
- Topics: keras, matplotlib-pyplot, nltk, numpy, pandas, seaborn, sklearn, tensorflow
- Language: Python
- Homepage:
- Size: 27.1 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# News Headlines Multi-Class Classification - Solution
## Problem Statement
This assignment involves classifying news article headlines into four categories:
- Business (b)
- Entertainment (e)
- Health (m)
- Science and Technology (t)
## Implementation Approach
### 1. Data Preprocessing
#### Loading the Dataset
The dataset contains news headlines from various sources. We focus on:
- TITLE: The headline text
- CATEGORY: The target variable with labels b, e, m, t
```python
import pandas as pd

# Load the dataset
df = pd.read_csv('data/news-aggregator-dataset.csv', encoding='latin1')
```
#### Handling Missing Values
We remove any rows with missing values in the TITLE or CATEGORY columns:
```python
df.dropna(subset=['TITLE', 'CATEGORY'], inplace=True)
```
#### Text Cleaning
We clean the text by:
- Converting to lowercase
- Removing special characters, punctuation, and numbers
- Removing stop words
- Performing lemmatization
```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires the NLTK resources 'punkt', 'stopwords', and 'wordnet'
# (download once, e.g. nltk.download(['punkt', 'stopwords', 'wordnet']))

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters, punctuation, and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Join the tokens back into a string
    cleaned_text = ' '.join(tokens)
    return cleaned_text

# Apply the cleaning function to every headline
df['cleaned_title'] = df['TITLE'].apply(clean_text)
```
#### Category Encoding
We encode the categories using LabelEncoder:
```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['encoded_category'] = label_encoder.fit_transform(df['CATEGORY'])
```
### 2. Feature Extraction
#### Dataset Splitting
We first split the dataset into 80% training and 20% testing, stratified by category:
```python
from sklearn.model_selection import train_test_split

X = df['cleaned_title']
y = df['encoded_category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
```
#### TF-IDF Vectorization
We then convert the text into numerical features using TF-IDF with unigrams and bigrams, fitting the vectorizer on the training split only:
```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
```
### 3. Model Training
We trained three different classification models:
#### Logistic Regression
```python
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(C=1.0, solver='liblinear', max_iter=200, random_state=42)
lr_model.fit(X_train_tfidf, y_train)
```
#### Random Forest
```python
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)
rf_model.fit(X_train_tfidf, y_train)
```
#### MLP Neural Network
```python
from sklearn.neural_network import MLPClassifier

mlp_model = MLPClassifier(hidden_layer_sizes=(100,), activation='relu', max_iter=200, random_state=42)
mlp_model.fit(X_train_tfidf, y_train)
```
### 4. Model Evaluation
We evaluated each model on the held-out test set using the following metrics (a minimal evaluation sketch appears after the list):
- Accuracy
- Precision, Recall, and F1-score for each class
- Confusion Matrix
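The evaluation code itself is not reproduced here; a minimal sketch of how these metrics are typically computed with scikit-learn and plotted with seaborn (using the fitted `lr_model`, the TF-IDF test features, and the `label_encoder` from above) is:
```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Predict on the held-out test set
y_pred = lr_model.predict(X_test_tfidf)

# Overall accuracy and per-class precision, recall, F1
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))

# Confusion matrix as a heatmap
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d',
            xticklabels=label_encoder.classes_,
            yticklabels=label_encoder.classes_)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
```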
#### Logistic Regression Results
```
Accuracy: 0.9243
Classification Report:
                       precision    recall  f1-score   support

Business (b)                0.92      0.93      0.93     20000
Entertainment (e)           0.93      0.94      0.94     20000
Health (m)                  0.91      0.89      0.90     20000
Science and Tech (t)        0.94      0.94      0.94     20000

accuracy                                        0.92     80000
macro avg                   0.93      0.92      0.93     80000
weighted avg                0.93      0.92      0.93     80000
```
#### Random Forest Results
```
Accuracy: 0.9342
Classification Report:
                       precision    recall  f1-score   support

Business (b)                0.94      0.94      0.94     20000
Entertainment (e)           0.94      0.95      0.95     20000
Health (m)                  0.92      0.89      0.91     20000
Science and Tech (t)        0.94      0.96      0.95     20000

accuracy                                        0.93     80000
macro avg                   0.94      0.93      0.94     80000
weighted avg                0.94      0.93      0.94     80000
```
#### MLP Neural Network Results
```
Accuracy: 0.9304
Classification Report:
                       precision    recall  f1-score   support

Business (b)                0.93      0.94      0.93     20000
Entertainment (e)           0.94      0.94      0.94     20000
Health (m)                  0.92      0.89      0.90     20000
Science and Tech (t)        0.94      0.95      0.94     20000

accuracy                                        0.93     80000
macro avg                   0.93      0.93      0.93     80000
weighted avg                0.93      0.93      0.93     80000
```
### 5. Model Improvements
For advanced modeling, we implemented an LSTM deep learning model using word embeddings. The LSTM architecture includes:
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

lstm_model = Sequential([
    Embedding(input_dim=5000, output_dim=64, input_length=max_len),
    LSTM(64, dropout=0.2, recurrent_dropout=0.2),
    Dense(32, activation='relu'),
    Dropout(0.2),
    Dense(len(label_encoder.classes_), activation='softmax')
])
```
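The README stops at the architecture; the tokenization, padding, and training steps are not shown. A minimal sketch of that glue, assuming Keras' `Tokenizer` and `pad_sequences` and an illustrative `max_len` of 20 tokens (in practice `max_len` would be chosen from the headline lengths and defined before the model above is built), might look like:
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Turn the cleaned headlines into padded integer sequences
# (vocabulary capped at 5000 words to match the Embedding layer)
max_len = 20  # assumed maximum headline length in tokens
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
tokenizer.fit_on_texts(X_train)
X_train_seq = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=max_len)
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=max_len)

# Compile and train; sparse categorical loss matches the integer-encoded labels
lstm_model.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
lstm_model.fit(X_train_seq, y_train.values,
               validation_split=0.1, epochs=5, batch_size=128)
```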
The LSTM model achieved an accuracy of 0.9385, slightly outperforming the traditional models.
### 6. Sample Predictions
Using the best-performing model, we predicted categories for the sample headlines (a prediction sketch follows the list):
1. "Tech giants invest heavily in AI research."
- Predicted: Science and Technology (t)
- Expected: Science and Technology (t)
2. "A breakthrough in cancer treatment raises hopes worldwide."
- Predicted: Health (m)
- Expected: Health (m)
3. "The global economy shows signs of recovery."
- Predicted: Business (b)
- Expected: Business (b)
4. "New blockbuster movie breaks box office records."
- Predicted: Entertainment (e)
- Expected: Entertainment (e)
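The prediction code is not shown in the README; as a hedged sketch using the TF-IDF pipeline with one of the fitted classifiers (the LSTM path would use the tokenizer and padded sequences instead), the predictions could be produced like this:
```python
# Readable names for the single-letter labels, per the problem statement
category_names = {'b': 'Business', 'e': 'Entertainment',
                  'm': 'Health', 't': 'Science and Technology'}

sample_headlines = [
    "Tech giants invest heavily in AI research.",
    "A breakthrough in cancer treatment raises hopes worldwide.",
    "The global economy shows signs of recovery.",
    "New blockbuster movie breaks box office records.",
]

# Reuse the same cleaning and vectorization applied to the training data
cleaned = [clean_text(headline) for headline in sample_headlines]
features = tfidf_vectorizer.transform(cleaned)
predicted = label_encoder.inverse_transform(rf_model.predict(features))

for headline, label in zip(sample_headlines, predicted):
    print(f"{headline} -> {category_names[label]} ({label})")
```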
## Deliverables
### Preprocessed Dataset
The data preprocessing pipeline includes:
- Text cleaning (lowercase conversion, special character removal, stopword removal, lemmatization)
- Category encoding using LabelEncoder
### Feature-Engineered Dataset
Features were extracted using:
- TF-IDF vectorization with unigrams and bigrams
- Maximum of 5000 features
### Trained Models
Models with their configurations (a persistence sketch follows this list):
1. Logistic Regression:
- C=1.0
- solver='liblinear'
- max_iter=200
2. Random Forest:
- n_estimators=100
- max_depth=None
3. MLP Neural Network:
- hidden_layer_sizes=(100,)
- activation='relu'
- max_iter=200
4. LSTM Model:
- Embedding layer with 64 dimensions
- LSTM layer with 64 units and dropout
- Dense layers for classification
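Persisting these artifacts is not covered in the README; as a sketch under the assumption that joblib and the native Keras format are acceptable (the `models/` paths are illustrative), the deliverables could be serialized as follows:
```python
import os

import joblib

# Illustrative output directory for the saved artifacts
os.makedirs('models', exist_ok=True)

# Persist the scikit-learn pipeline pieces
joblib.dump(tfidf_vectorizer, 'models/tfidf_vectorizer.joblib')
joblib.dump(label_encoder, 'models/label_encoder.joblib')
joblib.dump(rf_model, 'models/random_forest.joblib')

# Persist the Keras LSTM model in its native format
lstm_model.save('models/lstm_model.keras')
```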
### Model Evaluation
Performance metrics for all models:
| Model | Accuracy | Macro Avg Precision | Macro Avg Recall | Macro Avg F1 |
|----------------------|----------|---------------------|------------------|--------------|
| Logistic Regression | 0.9243 | 0.93 | 0.92 | 0.93 |
| Random Forest | 0.9342 | 0.94 | 0.93 | 0.94 |
| MLP Neural Network | 0.9304 | 0.93 | 0.93 | 0.93 |
| LSTM | 0.9385 | 0.94 | 0.94 | 0.94 |
### Model Comparison
The LSTM model performed the best with an accuracy of 0.9385, followed by the Random Forest model with an accuracy of 0.9342. All models performed well, with accuracies above 0.92.
The Random Forest model showed the best balance between complexity and performance among the traditional models, while the LSTM model demonstrated the benefit of using word embeddings and sequential information.
## Conclusion
All models performed well in classifying news headlines into the four categories. The LSTM model with word embeddings showed slightly better performance, indicating that capturing sequential information in text can improve classification results.
For practical implementation, the Random Forest model provides a good balance between accuracy and computational efficiency, making it suitable for real-world applications where deep learning models might be too resource-intensive.