Multi-Class Text Classification for News Headlines
- Host: GitHub
- URL: https://github.com/pedasoft-consult/multi_class_text_classification
- Owner: Pedasoft-Consult
- Created: 2025-03-05T08:19:27.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-03-05T08:34:08.000Z (11 months ago)
- Last Synced: 2025-03-05T09:32:41.431Z (11 months ago)
- Topics: keras, matplotlib-pyplot, nltk, numpy, pandas, seaborn, sklearn, tensorflow
- Language: Python
- Homepage:
- Size: 27.1 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# News Headlines Multi-Class Classification - Solution
## Problem Statement
This assignment involves classifying news article headlines into four categories:
- Business (b)
- Entertainment (e)
- Health (m)
- Science and Technology (t)
## Implementation Approach
### 1. Data Preprocessing
#### Loading the Dataset
The dataset contains news headlines from various sources. We focus on:
- TITLE: The headline text
- CATEGORY: The target variable with labels b, e, m, t
```python
import pandas as pd

# Load the dataset
df = pd.read_csv('data/news-aggregator-dataset.csv', encoding='latin1')
```
#### Handling Missing Values
We remove any rows with missing values in the TITLE or CATEGORY columns:
```python
df.dropna(subset=['TITLE', 'CATEGORY'], inplace=True)
```
#### Text Cleaning
We clean the text by:
- Converting to lowercase
- Removing special characters, punctuation, and numbers
- Removing stop words
- Performing lemmatization
```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires the NLTK resources 'punkt', 'stopwords', and 'wordnet'
# (download once, e.g. nltk.download(['punkt', 'stopwords', 'wordnet']))

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters, punctuation, and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Join the tokens back into a string
    cleaned_text = ' '.join(tokens)
    return cleaned_text

# Apply the cleaning function to every headline
df['cleaned_title'] = df['TITLE'].apply(clean_text)
```
#### Category Encoding
We encode the categories using LabelEncoder:
```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['encoded_category'] = label_encoder.fit_transform(df['CATEGORY'])
```
### 2. Feature Extraction
#### Dataset Splitting
We first split the dataset into 80% training and 20% testing, stratified by category:
```python
from sklearn.model_selection import train_test_split

X = df['cleaned_title']
y = df['encoded_category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
```
#### TF-IDF Vectorization
We then convert the text into numerical features using TF-IDF with unigrams and bigrams, fitting the vectorizer on the training split only:
```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
```
### 3. Model Training
We trained three different classification models:
#### Logistic Regression
```python
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(C=1.0, solver='liblinear', max_iter=200, random_state=42)
lr_model.fit(X_train_tfidf, y_train)
```
#### Random Forest
```python
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)
rf_model.fit(X_train_tfidf, y_train)
```
#### MLP Neural Network
```python
from sklearn.neural_network import MLPClassifier

mlp_model = MLPClassifier(hidden_layer_sizes=(100,), activation='relu', max_iter=200, random_state=42)
mlp_model.fit(X_train_tfidf, y_train)
```
### 4. Model Evaluation
We evaluated each model on the held-out test set using the following metrics (a minimal evaluation sketch appears after the list):
- Accuracy
- Precision, Recall, and F1-score for each class
- Confusion Matrix
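The evaluation code itself is not reproduced here; a minimal sketch of how these metrics are typically computed with scikit-learn and plotted with seaborn (using the fitted `lr_model`, the TF-IDF test features, and the `label_encoder` from above) is:
```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Predict on the held-out test set
y_pred = lr_model.predict(X_test_tfidf)

# Overall accuracy and per-class precision, recall, F1
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))

# Confusion matrix as a heatmap
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d',
            xticklabels=label_encoder.classes_,
            yticklabels=label_encoder.classes_)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
```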
#### Logistic Regression Results
```
Accuracy: 0.9243
Classification Report:
                       precision    recall  f1-score   support

Business (b)                0.92      0.93      0.93     20000
Entertainment (e)           0.93      0.94      0.94     20000
Health (m)                  0.91      0.89      0.90     20000
Science and Tech (t)        0.94      0.94      0.94     20000

accuracy                                        0.92     80000
macro avg                   0.93      0.92      0.93     80000
weighted avg                0.93      0.92      0.93     80000
```
#### Random Forest Results
```
Accuracy: 0.9342
Classification Report:
                       precision    recall  f1-score   support

Business (b)                0.94      0.94      0.94     20000
Entertainment (e)           0.94      0.95      0.95     20000
Health (m)                  0.92      0.89      0.91     20000
Science and Tech (t)        0.94      0.96      0.95     20000

accuracy                                        0.93     80000
macro avg                   0.94      0.93      0.94     80000
weighted avg                0.94      0.93      0.94     80000
```
#### MLP Neural Network Results
```
Accuracy: 0.9304
Classification Report:
                       precision    recall  f1-score   support

Business (b)                0.93      0.94      0.93     20000
Entertainment (e)           0.94      0.94      0.94     20000
Health (m)                  0.92      0.89      0.90     20000
Science and Tech (t)        0.94      0.95      0.94     20000

accuracy                                        0.93     80000
macro avg                   0.93      0.93      0.93     80000
weighted avg                0.93      0.93      0.93     80000
```
### 5. Model Improvements
For advanced modeling, we implemented an LSTM deep learning model using word embeddings. The LSTM architecture includes:
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

lstm_model = Sequential([
    Embedding(input_dim=5000, output_dim=64, input_length=max_len),
    LSTM(64, dropout=0.2, recurrent_dropout=0.2),
    Dense(32, activation='relu'),
    Dropout(0.2),
    Dense(len(label_encoder.classes_), activation='softmax')
])
```
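The README stops at the architecture; the tokenization, padding, and training steps are not shown. A minimal sketch of that glue, assuming Keras' `Tokenizer` and `pad_sequences` and an illustrative `max_len` of 20 tokens (in practice `max_len` would be chosen from the headline lengths and defined before the model above is built), might look like:
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Turn the cleaned headlines into padded integer sequences
# (vocabulary capped at 5000 words to match the Embedding layer)
max_len = 20  # assumed maximum headline length in tokens
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
tokenizer.fit_on_texts(X_train)
X_train_seq = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=max_len)
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=max_len)

# Compile and train; sparse categorical loss matches the integer-encoded labels
lstm_model.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
lstm_model.fit(X_train_seq, y_train.values,
               validation_split=0.1, epochs=5, batch_size=128)
```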
The LSTM model achieved an accuracy of 0.9385, slightly outperforming the traditional models.
### 6. Sample Predictions
Using the best-performing model, we predicted categories for the sample headlines (a prediction sketch follows the list):
1. "Tech giants invest heavily in AI research."
- Predicted: Science and Technology (t)
- Expected: Science and Technology (t)
2. "A breakthrough in cancer treatment raises hopes worldwide."
- Predicted: Health (m)
- Expected: Health (m)
3. "The global economy shows signs of recovery."
- Predicted: Business (b)
- Expected: Business (b)
4. "New blockbuster movie breaks box office records."
- Predicted: Entertainment (e)
- Expected: Entertainment (e)
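The prediction code is not shown in the README; as a hedged sketch using the TF-IDF pipeline with one of the fitted classifiers (the LSTM path would use the tokenizer and padded sequences instead), the predictions could be produced like this:
```python
# Readable names for the single-letter labels, per the problem statement
category_names = {'b': 'Business', 'e': 'Entertainment',
                  'm': 'Health', 't': 'Science and Technology'}

sample_headlines = [
    "Tech giants invest heavily in AI research.",
    "A breakthrough in cancer treatment raises hopes worldwide.",
    "The global economy shows signs of recovery.",
    "New blockbuster movie breaks box office records.",
]

# Reuse the same cleaning and vectorization applied to the training data
cleaned = [clean_text(headline) for headline in sample_headlines]
features = tfidf_vectorizer.transform(cleaned)
predicted = label_encoder.inverse_transform(rf_model.predict(features))

for headline, label in zip(sample_headlines, predicted):
    print(f"{headline} -> {category_names[label]} ({label})")
```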
## Deliverables
### Preprocessed Dataset
The data preprocessing pipeline includes:
- Text cleaning (lowercase conversion, special character removal, stopword removal, lemmatization)
- Category encoding using LabelEncoder
### Feature-Engineered Dataset
Features were extracted using:
- TF-IDF vectorization with unigrams and bigrams
- Maximum of 5000 features
### Trained Models
Models with their configurations (a persistence sketch follows this list):
1. Logistic Regression:
- C=1.0
- solver='liblinear'
- max_iter=200
2. Random Forest:
- n_estimators=100
- max_depth=None
3. MLP Neural Network:
- hidden_layer_sizes=(100,)
- activation='relu'
- max_iter=200
4. LSTM Model:
- Embedding layer with 64 dimensions
- LSTM layer with 64 units and dropout
- Dense layers for classification
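Persisting these artifacts is not covered in the README; as a sketch under the assumption that joblib and the native Keras format are acceptable (the `models/` paths are illustrative), the deliverables could be serialized as follows:
```python
import os

import joblib

# Illustrative output directory for the saved artifacts
os.makedirs('models', exist_ok=True)

# Persist the scikit-learn pipeline pieces
joblib.dump(tfidf_vectorizer, 'models/tfidf_vectorizer.joblib')
joblib.dump(label_encoder, 'models/label_encoder.joblib')
joblib.dump(rf_model, 'models/random_forest.joblib')

# Persist the Keras LSTM model in its native format
lstm_model.save('models/lstm_model.keras')
```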
### Model Evaluation
Performance metrics for all models:
| Model | Accuracy | Macro Avg Precision | Macro Avg Recall | Macro Avg F1 |
|----------------------|----------|---------------------|------------------|--------------|
| Logistic Regression | 0.9243 | 0.93 | 0.92 | 0.93 |
| Random Forest | 0.9342 | 0.94 | 0.93 | 0.94 |
| MLP Neural Network | 0.9304 | 0.93 | 0.93 | 0.93 |
| LSTM | 0.9385 | 0.94 | 0.94 | 0.94 |
### Model Comparison
The LSTM model performed the best with an accuracy of 0.9385, followed by the Random Forest model with an accuracy of 0.9342. All models performed well, with accuracies above 0.92.
The Random Forest model showed the best balance between complexity and performance among the traditional models, while the LSTM model demonstrated the benefit of using word embeddings and sequential information.
## Conclusion
All models performed well in classifying news headlines into the four categories. The LSTM model with word embeddings showed slightly better performance, indicating that capturing sequential information in text can improve classification results.
For practical implementation, the Random Forest model provides a good balance between accuracy and computational efficiency, making it suitable for real-world applications where deep learning models might be too resource-intensive.