https://github.com/fyt3rp4til/newsclassify-v2-spacy-wordembeddings
- Host: GitHub
- URL: https://github.com/fyt3rp4til/newsclassify-v2-spacy-wordembeddings
- Owner: FYT3RP4TIL
- Created: 2024-09-06T15:55:05.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-09-06T18:19:30.000Z (8 months ago)
- Last Synced: 2025-01-14T07:13:46.950Z (4 months ago)
- Topics: decision-trees, gradient-boosting-classifier, knn-classifier, multinomial-naive-bayes, random-forest, spacy-word-embeddings, word2vec
- Language: Jupyter Notebook
- Homepage:
- Size: 942 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Dataset: news_dataset.json
# 📰 NewsClassify-V2-Spacy-WordEmbeddings
## 📊 Project Overview
This project implements a News Category Classifier using various machine learning algorithms. The goal is to automatically categorize news articles into predefined categories based on their content.
### 🔍 Dataset
The dataset consists of news articles with two main columns:
- **Text**: Description of a particular topic
- **Category**: The class to which the text belongs

Categories include:
- BUSINESS
- SPORTS
- CRIME
- SCIENCE

Distribution of categories:
```python
df['category'].value_counts()

# Output:
# BUSINESS 4254
# SPORTS 4167
# CRIME 2893
# SCIENCE 1381
# Name: count, dtype: int64
```

## 🛠 Methodology
### 1. Data Preprocessing
The preprocessing step is crucial for preparing the text data for machine learning models. Here's a detailed look at the preprocessing function:
```python
import spacy
nlp = spacy.load("en_core_web_lg")

def preprocess(text):
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    return ' '.join(filtered_tokens)
```

This function does the following:
1. Uses spaCy to tokenize the text
2. Removes stop words (common words like "the", "is", "at", etc.)
3. Removes punctuation
4. Applies lemmatization (reducing words to their base form)

Example of preprocessed text:
```python
original_text = "Watching Schrödinger's Cat Die University of California"
preprocessed_text = "watch Schrödinger Cat Die University California"
```

### 2. Word Embeddings
Word embeddings are dense vector representations of words that capture semantic meanings. This project uses spaCy's pre-trained word vectors to create document embeddings.
```python
df['vector'] = df['preprocessed_text'].apply(lambda text: nlp(text).vector)
```

This creates a new column `vector` containing the vector representation of each preprocessed text. Each vector is a 300-dimensional array of floats.
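Under the hood, spaCy computes a `Doc.vector` as the element-wise average of the document's token vectors. A minimal NumPy sketch of that averaging (the token vectors here are random stand-ins, not real spaCy output):

```python
import numpy as np

# Hypothetical 300-dimensional vectors for a 3-token document.
rng = np.random.default_rng(0)
token_vectors = rng.standard_normal((3, 300))

# spaCy's Doc.vector is the element-wise mean of the token vectors.
doc_vector = token_vectors.mean(axis=0)

print(doc_vector.shape)  # (300,)
```

This is why the document vector has the same dimensionality (300) as the individual token vectors, regardless of document length.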
Example of data structure after vector creation:
```python
print(df.head())

# Output:
# text category label_num preprocessed_text vector
# 0 Watching Schrödinger's Cat Die University of C... SCIENCE NaN watch Schrödinger Cat Die Univ... [-0.85190785, 1.0438694, ...]
# 1 WATCH: Freaky Vortex Opens Up In Flooded Lake SCIENCE NaN watch freaky Vortex open Flood... [0.60747343, 1.9251899, ...]
# 2 Entrepreneurs Today Don't Need a Big Budget to... BUSINESS 2.0 entrepreneur today need Big B... [0.088981755, 0.5882564, ...]
# ...
```

### 3. Data Preparation for Model Training
The vector data needs to be reshaped for use in scikit-learn models:
```python
import numpy as np

# X_train and X_test come from a train_test_split on the 'vector' column
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

print("Shape of X_train after reshaping: ", X_train_2d.shape)
print("Shape of X_test after reshaping: ", X_test_2d.shape)

# Output:
# Shape of X_train after reshaping: (6789, 300)
# Shape of X_test after reshaping: (2263, 300)
```

This reshapes the data into a 2D NumPy array where each row represents a document and each column represents a dimension of the word embedding.
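Put together, the preparation steps can be sketched end to end with synthetic data (the random 300-dimensional vectors stand in for the spaCy embeddings, and the column names follow the dataframe shown earlier; the split sizes and random state here are illustrative, not the project's):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the real dataframe: 20 documents, each a 300-dim embedding.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'vector': [rng.standard_normal(300) for _ in range(20)],
    'label_num': rng.integers(0, 3, size=20),
})

X_train, X_test, y_train, y_test = train_test_split(
    df['vector'], df['label_num'], test_size=0.25, random_state=2022
)

# np.stack turns a Series of 1-D arrays into a single 2-D array.
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

print(X_train_2d.shape, X_test_2d.shape)  # (15, 300) (5, 300)
```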
## 🤖 Models and Results
We experimented with several machine learning models to classify the news articles. Here are the results for each model:
### 1. Decision Tree Classifier
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

clf = DecisionTreeClassifier()
clf.fit(X_train_2d, y_train)
y_pred = clf.predict(X_test_2d)

print(classification_report(y_test, y_pred))
```

Results:
```
              precision    recall  f1-score   support

         0.0       0.71      0.65      0.68       579
         1.0       0.74      0.75      0.75       833
         2.0       0.74      0.76      0.75       851

    accuracy                           0.73      2263
   macro avg       0.73      0.72      0.73      2263
weighted avg       0.73      0.73      0.73      2263
```

### 2. Multinomial Naive Bayes
MultinomialNB requires non-negative features, but spaCy embeddings contain negative values, so the vectors are first scaled to the [0, 1] range with MinMaxScaler.
```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)

clf = MultinomialNB()
clf.fit(scaled_train_embed, y_train)
y_pred = clf.predict(scaled_test_embed)

print(classification_report(y_test, y_pred))
```

Results:
```
              precision    recall  f1-score   support

         0.0       0.94      0.55      0.69       579
         1.0       0.72      0.85      0.78       833
         2.0       0.76      0.84      0.80       851

    accuracy                           0.77      2263
   macro avg       0.81      0.75      0.76      2263
weighted avg       0.79      0.77      0.76      2263
```

### 3. K-Nearest Neighbors
```python
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
clf.fit(X_train_2d, y_train)
y_pred = clf.predict(X_test_2d)

print(classification_report(y_test, y_pred))
```

Results:
```
              precision    recall  f1-score   support

         0.0       0.81      0.88      0.84       579
         1.0       0.88      0.87      0.87       833
         2.0       0.90      0.86      0.88       851

    accuracy                           0.87      2263
   macro avg       0.86      0.87      0.87      2263
weighted avg       0.87      0.87      0.87      2263
```

### 4. Random Forest Classifier
```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train_2d, y_train)
y_pred = clf.predict(X_test_2d)

print(classification_report(y_test, y_pred))
```

Results:
```
              precision    recall  f1-score   support

         0.0       0.87      0.82      0.85       579
         1.0       0.86      0.89      0.88       833
         2.0       0.88      0.88      0.88       851

    accuracy                           0.87      2263
   macro avg       0.87      0.86      0.87      2263
weighted avg       0.87      0.87      0.87      2263
```

### 5. Gradient Boosting Classifier
```python
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier()
clf.fit(X_train_2d, y_train)
y_pred = clf.predict(X_test_2d)

print(classification_report(y_test, y_pred))
```

Results:
```
              precision    recall  f1-score   support

         0.0       0.89      0.87      0.88       579
         1.0       0.89      0.90      0.89       833
         2.0       0.90      0.90      0.90       851

    accuracy                           0.89      2263
   macro avg       0.89      0.89      0.89      2263
weighted avg       0.89      0.89      0.89      2263
```

## 🥇 Best Performing Model
The **Gradient Boosting Classifier** achieved the highest overall performance, with an accuracy of 89% and balanced precision, recall, and F1-scores across all categories.
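With a trained classifier in hand, categorizing a new article is a single fit-then-predict step. A self-contained sketch with synthetic embeddings (the random vectors stand in for real `nlp(preprocess(text)).vector` outputs, and `n_estimators` is lowered here only to keep the sketch fast):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-ins for 300-dim spaCy document vectors and 3 class labels.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((60, 300))
y_train = rng.integers(0, 3, size=60)

clf = GradientBoostingClassifier(n_estimators=10)
clf.fit(X_train, y_train)

# Predict the category of one new (here, random) document vector.
new_doc_vector = rng.standard_normal((1, 300))
print(clf.predict(new_doc_vector))
```

In real usage, the new article's text would be run through the same `preprocess` function and embedded with the same spaCy model before prediction.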
## 🚀 Future Improvements
1. Experiment with hyperparameter tuning
- Use GridSearchCV or RandomizedSearchCV to find optimal parameters
2. Try ensemble methods combining multiple models
- Voting Classifier or Stacking could potentially improve results
3. Explore deep learning approaches (e.g., LSTM, BERT)
- These models can capture more complex relationships in text data
4. Collect more data to improve model generalization
- More diverse examples can help the model learn better
5. Feature engineering
   - Create additional features like text length, sentiment scores, etc.

## 📚 Dependencies
- Python 3.x
- spaCy
- scikit-learn
- NumPy
- Pandas

To install dependencies:
```
pip install spacy scikit-learn numpy pandas
python -m spacy download en_core_web_lg
```

## Acknowledgments
Dataset source: [Kaggle - News Category Classifier](https://www.kaggle.com/code/hengzheng/news-category-classifier-val-acc-0-65)
## 📖 Additional Resources
- [spaCy Documentation](https://spacy.io/usage/linguistic-features)
- [Scikit-learn Documentation](https://scikit-learn.org/stable/supervised_learning.html)
- [Introduction to Word Embeddings](https://www.tensorflow.org/text/guide/word_embeddings)