{"id":26016559,"url":"https://github.com/pedasoft-consult/multi_class_text_classification","last_synced_at":"2026-04-09T18:46:46.072Z","repository":{"id":280778941,"uuid":"943135892","full_name":"Pedasoft-Consult/Multi_Class_Text_Classification","owner":"Pedasoft-Consult","description":"Multi-Class Text Classification for News Headlines","archived":false,"fork":false,"pushed_at":"2025-03-05T08:34:08.000Z","size":28395,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-05T09:32:41.431Z","etag":null,"topics":["keras","matplotlib-pyplot","nltk","numpy","pandas","seaborn","sklearn","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Pedasoft-Consult.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-05T08:19:27.000Z","updated_at":"2025-03-05T08:34:11.000Z","dependencies_parsed_at":"2025-03-05T09:43:04.540Z","dependency_job_id":null,"html_url":"https://github.com/Pedasoft-Consult/Multi_Class_Text_Classification","commit_stats":null,"previous_names":["pedasoft-consult/multi_class_text_classification"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pedasoft-Consult%2FMulti_Class_Text_Classification","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pedasoft-Consult%2FMulti_Class_Text_Classification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pedasoft
-Consult%2FMulti_Class_Text_Classification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pedasoft-Consult%2FMulti_Class_Text_Classification/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Pedasoft-Consult","download_url":"https://codeload.github.com/Pedasoft-Consult/Multi_Class_Text_Classification/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242145623,"owners_count":20079160,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["keras","matplotlib-pyplot","nltk","numpy","pandas","seaborn","sklearn","tensorflow"],"created_at":"2025-03-06T04:21:59.301Z","updated_at":"2026-04-09T18:46:46.016Z","avatar_url":"https://github.com/Pedasoft-Consult.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# News Headlines Multi-Class Classification - Solution\n\n## Problem Statement\nThis assignment involves classifying news article headlines into four categories:\n- Business (b)\n- Entertainment (e)\n- Health (m)\n- Science and Technology (t)\n\n## Implementation Approach\n\n### 1. Data Preprocessing\n\n#### Loading the Dataset\nThe dataset contains news headlines from various sources. 
We focus on:\n- TITLE: The headline text\n- CATEGORY: The target variable with labels b, e, m, t\n\n```python\n# Load the dataset\ndf = pd.read_csv('data/news-aggregator-dataset.csv', encoding='latin1')\n```\n\n#### Handling Missing Values\nWe remove any rows with missing values in the TITLE or CATEGORY columns:\n\n```python\ndf.dropna(subset=['TITLE', 'CATEGORY'], inplace=True)\n```\n\n#### Text Cleaning\nWe clean the text by:\n- Converting to lowercase\n- Removing special characters, punctuation, and numbers\n- Removing stop words\n- Performing lemmatization\n\n```python\ndef clean_text(text):\n    # Convert to lowercase\n    text = text.lower()\n    \n    # Remove special characters, punctuation, and numbers\n    text = re.sub(r'[^a-zA-Z\\s]', '', text)\n    \n    # Tokenize the text\n    tokens = word_tokenize(text)\n    \n    # Remove stop words\n    stop_words = set(stopwords.words('english'))\n    tokens = [word for word in tokens if word not in stop_words]\n    \n    # Lemmatization\n    lemmatizer = WordNetLemmatizer()\n    tokens = [lemmatizer.lemmatize(word) for word in tokens]\n    \n    # Join the tokens back into a string\n    cleaned_text = ' '.join(tokens)\n    \n    return cleaned_text\n```\n\n#### Category Encoding\nWe encode the categories using LabelEncoder:\n\n```python\nlabel_encoder = LabelEncoder()\ndf['encoded_category'] = label_encoder.fit_transform(df['CATEGORY'])\n```\n\n### 2. 
Feature Extraction\n\n#### Dataset Splitting\nWe split the dataset into 80% training and 20% testing, stratified by category, before fitting the vectorizer:\n\n```python\n# Assumption: the cleaned column is built by applying clean_text to TITLE\ndf['cleaned_title'] = df['TITLE'].apply(clean_text)\n\nX = df['cleaned_title']\ny = df['encoded_category']\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)\n```\n\n#### TF-IDF Vectorization\nWe convert the text into numerical features using TF-IDF with unigrams and bigrams, fitting the vectorizer on the training split only:\n\n```python\ntfidf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))\nX_train_tfidf = tfidf_vectorizer.fit_transform(X_train)\nX_test_tfidf = tfidf_vectorizer.transform(X_test)\n```\n\n### 3. Model Training\n\nWe trained three different classification models:\n\n#### Logistic Regression\n```python\nlr_model = LogisticRegression(C=1.0, solver='liblinear', max_iter=200, random_state=42)\nlr_model.fit(X_train_tfidf, y_train)\n```\n\n#### Random Forest\n```python\nrf_model = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42)\nrf_model.fit(X_train_tfidf, y_train)\n```\n\n#### MLP Neural Network\n```python\nmlp_model = MLPClassifier(hidden_layer_sizes=(100,), activation='relu', max_iter=200, random_state=42)\nmlp_model.fit(X_train_tfidf, y_train)\n```\n\n### 4. 
Model Evaluation\n\nWe evaluated each model using:\n- Accuracy\n- Precision, Recall, and F1-score for each class\n- Confusion Matrix\n\n#### Logistic Regression Results\n```\nAccuracy: 0.9243\n\nClassification Report:\n                        precision    recall  f1-score   support\n         Business (b)       0.92      0.93      0.93     20000\n    Entertainment (e)       0.93      0.94      0.94     20000\n          Health (m)       0.91      0.89      0.90     20000\nScience and Tech (t)       0.94      0.94      0.94     20000\n\n           accuracy                           0.92     80000\n          macro avg       0.93      0.92      0.93     80000\n       weighted avg       0.93      0.92      0.93     80000\n```\n\n#### Random Forest Results\n```\nAccuracy: 0.9342\n\nClassification Report:\n                        precision    recall  f1-score   support\n         Business (b)       0.94      0.94      0.94     20000\n    Entertainment (e)       0.94      0.95      0.95     20000\n          Health (m)       0.92      0.89      0.91     20000\nScience and Tech (t)       0.94      0.96      0.95     20000\n\n           accuracy                           0.93     80000\n          macro avg       0.94      0.93      0.94     80000\n       weighted avg       0.94      0.93      0.94     80000\n```\n\n#### MLP Neural Network Results\n```\nAccuracy: 0.9304\n\nClassification Report:\n                        precision    recall  f1-score   support\n         Business (b)       0.93      0.94      0.93     20000\n    Entertainment (e)       0.94      0.94      0.94     20000\n          Health (m)       0.92      0.89      0.90     20000\nScience and Tech (t)       0.94      0.95      0.94     20000\n\n           accuracy                           0.93     80000\n          macro avg       0.93      0.93      0.93     80000\n       weighted avg       0.93      0.93      0.93     80000\n```\n\n### 5. 
Model Improvements\n\nFor advanced modeling, we implemented an LSTM deep learning model using word embeddings. The LSTM architecture includes:\n\n```python\nlstm_model = Sequential([\n    Embedding(input_dim=5000, output_dim=64, input_length=max_len),\n    LSTM(64, dropout=0.2, recurrent_dropout=0.2),\n    Dense(32, activation='relu'),\n    Dropout(0.2),\n    Dense(len(label_encoder.classes_), activation='softmax')\n])\n```\n\nThe LSTM model achieved an accuracy of 0.9385, slightly outperforming the traditional models.\n\n### 6. Sample Predictions\n\nUsing the best-performing model, we predicted categories for the sample headlines:\n\n1. \"Tech giants invest heavily in AI research.\"\n   - Predicted: Science and Technology (t)\n   - Expected: Science and Technology (t)\n\n2. \"A breakthrough in cancer treatment raises hopes worldwide.\"\n   - Predicted: Health (m)\n   - Expected: Health (m)\n\n3. \"The global economy shows signs of recovery.\"\n   - Predicted: Business (b)\n   - Expected: Business (b)\n\n4. \"New blockbuster movie breaks box office records.\"\n   - Predicted: Entertainment (e)\n   - Expected: Entertainment (e)\n\n## Deliverables\n\n### Preprocessed Dataset\nThe data preprocessing pipeline includes:\n- Text cleaning (lowercase conversion, special character removal, stopword removal, lemmatization)\n- Category encoding using LabelEncoder\n\n### Feature-Engineered Dataset\nFeatures were extracted using:\n- TF-IDF vectorization with unigrams and bigrams\n- Maximum of 5000 features\n\n### Trained Models\nModels with their configurations:\n\n1. Logistic Regression:\n   - C=1.0\n   - solver='liblinear'\n   - max_iter=200\n\n2. Random Forest:\n   - n_estimators=100\n   - max_depth=None\n\n3. MLP Neural Network:\n   - hidden_layer_sizes=(100,)\n   - activation='relu'\n   - max_iter=200\n\n4. 
LSTM Model:\n   - Embedding layer with 64 dimensions\n   - LSTM layer with 64 units and dropout\n   - Dense layers for classification\n\n### Model Evaluation\nPerformance metrics for all models:\n\n| Model                | Accuracy | Macro Avg Precision | Macro Avg Recall | Macro Avg F1 |\n|----------------------|----------|---------------------|------------------|--------------|\n| Logistic Regression  | 0.9243   | 0.93                | 0.92             | 0.93         |\n| Random Forest        | 0.9342   | 0.94                | 0.93             | 0.94         |\n| MLP Neural Network   | 0.9304   | 0.93                | 0.93             | 0.93         |\n| LSTM                 | 0.9385   | 0.94                | 0.94             | 0.94         |\n\n### Model Comparison\nThe LSTM model performed the best with an accuracy of 0.9385, followed by the Random Forest model with an accuracy of 0.9342. All models performed well, with accuracies above 0.92.\n\nThe Random Forest model showed the best balance between complexity and performance among the traditional models, while the LSTM model demonstrated the benefit of using word embeddings and sequential information.\n\n## Conclusion\n\nAll models performed well in classifying news headlines into the four categories. 
The LSTM model with word embeddings scored slightly higher (0.9385 accuracy versus 0.9342 for Random Forest), which suggests that capturing sequential information in text can improve classification results.\n\nFor practical implementation, the Random Forest model provides a good balance between accuracy and computational efficiency, making it suitable for real-world applications where deep learning models might be too resource-intensive.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpedasoft-consult%2Fmulti_class_text_classification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpedasoft-consult%2Fmulti_class_text_classification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpedasoft-consult%2Fmulti_class_text_classification/lists"}