{"id":24399665,"url":"https://github.com/audrbar/ml-sports","last_synced_at":"2026-05-24T17:04:47.874Z","repository":{"id":272334862,"uuid":"916245910","full_name":"audrbar/ml-sports","owner":"audrbar","description":"A Supervised Machine Learning Project for Text Classification","archived":false,"fork":false,"pushed_at":"2025-02-15T15:11:19.000Z","size":59882,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-15T15:34:10.467Z","etag":null,"topics":["classification-algorithm","mashine-learning","neural-network","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/audrbar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-13T18:19:07.000Z","updated_at":"2025-02-15T15:11:23.000Z","dependencies_parsed_at":"2025-01-13T19:34:46.490Z","dependency_job_id":"fea0c3e6-cf80-4fc3-8b3f-239ef0e6e5a6","html_url":"https://github.com/audrbar/ml-sports","commit_stats":null,"previous_names":["audrbar/ml-sports"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/audrbar%2Fml-sports","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/audrbar%2Fml-sports/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/audrbar%2Fml-sports/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/audrbar%2Fml-sports/manifests","owner_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/owners/audrbar","download_url":"https://codeload.github.com/audrbar/ml-sports/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243353816,"owners_count":20277284,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification-algorithm","mashine-learning","neural-network","python"],"created_at":"2025-01-19T23:50:51.623Z","updated_at":"2025-12-25T17:58:16.719Z","avatar_url":"https://github.com/audrbar.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## A Supervised Machine Learning Project for Text Classification \n### Introduction\nThe aim of this project is to develop a supervised machine learning pipeline capable of classifying texts into \npredefined categories. The workflow includes data scraping, text preprocessing, vectorization, and the implementation \nof multiple machine learning algorithms. 
The final outcome is a fully functional pipeline that preprocesses data, \ntrains at least three different models, evaluates their performance, and determines the best-performing approach \nfor text classification.\n### Utilities\nUtility functions are used throughout the project to adjust pandas display settings, write data to CSV files \nand binary data to pickle files, load the data back from these files, split the datasets into training and test sets \nor into training, validation and test sets, and find and save unique values.\n### Data Analyzed\nThe project uses a dataset of sports articles sourced from Sports Illustrated, each labeled with \nits category (e.g., basketball, soccer, football). Articles are scraped using the Python libraries \n`requests` and `BeautifulSoup`, extracting relevant text (title, content) and metadata (category). The structured \ndataset is saved in CSV format, containing:\n- `text`: the article content;\n- `category`: the predefined category assigned to the article.\n\nA data sample file is available in the `data_sample` folder.\n\u003e :memo: **Info:** 2740 articles were scraped and saved for processing.\n\nThe main dataset and project features that matter when picking classifiers are:\\\n`General text classification` `Predefined labels` `Small dataset` `Multi-class classification` \n`Structured text features` `Low compute`\n### Data Preprocessing\nThe scraped data are prepared for model training in four main steps:\n- missing values are handled, duplicates removed, the category column extracted, and the category distribution balanced \nwith the `Pandas` library;\n- the text is cleaned with the `Natural Language Toolkit (NLTK)` (HTML tags, punctuation, stopwords, etc. are removed);\n- the text is normalized: converted to lowercase, lemmatized with `NLTK's WordNetLemmatizer` (words are converted to their \nbase form), and stemmed with `NLTK's PorterStemmer` (words are reduced to their root form);\n- the text is tokenized and vectorized using 
**3** methods: `TF-IDF`, `Word2Vec`, and `FastText` (embeddings); \n`NGRAM (2, 3)` data preparation is also applied.\n\u003e :memo: **Info:** 2370 records remained after handling missing values and removing duplicates.\n\n![Data Plot](./img/class_distribution.png)\n\u003e :memo: **Info:** 1300 records remained after filtering, sampling, and dropping some categories.\n\n## Picking Classifiers for Text Classification\nText classification is a fundamental Natural Language Processing (NLP) task that involves assigning predefined \nlabels to textual data. Below is a breakdown of different classifiers used for text classification, categorized \nby type and use case. This includes (1) Traditional Machine Learning Classifiers, (2) Deep Learning-Based Classifiers \n(NNs), and (3) Pre-Trained Transformer Models.\n### 1️⃣ Traditional Machine Learning Classifiers\nTraditional ML-based classifiers require **feature engineering** (e.g., TF-IDF, word embeddings) before classification.\n\n| Classifier                            | Best For                                | Pros                                   | Cons                                    | Model Provider        |\n|---------------------------------------|-----------------------------------------|----------------------------------------|-----------------------------------------|-----------------------|\n| Logistic Regression                   | Binary \u0026 multi-class classification     | Simple, efficient, interpretable       | Limited to linear decision boundaries   | `scikit-learn`        |\n| Support Vector Machine (SVM)          | Spam detection, sentiment analysis      | Works well for small datasets          | Computationally expensive on large data | `scikit-learn`        |\n| Naive Bayes (NB)                      | Email filtering, topic categorization   | Fast, handles small datasets well      | Assumes feature independence            | `scikit-learn`        |\n| Random Forest                         | General text 
classification             | Handles high-dimensional data well     | Slower for large datasets               | `scikit-learn`        |\n| Gradient Boosting (XGBoost, LightGBM) | Large-scale classification              | High accuracy, handles imbalanced data | Requires careful tuning                 | `XGBoost`, `LightGBM` |\n| k-Nearest Neighbors (k-NN)            | Small datasets, language classification | Simple, non-parametric                 | Slow for large datasets                 | `scikit-learn`        |\n\n**Best For:** Small-to-medium datasets with structured text features (TF-IDF, word embeddings).  \n**Libraries:** `scikit-learn`, `XGBoost`, `LightGBM`  \n\n### 2️⃣ Deep Learning-Based Classifiers (Neural Networks)\nDeep learning models **learn text representations automatically**, requiring **less feature engineering**.\n\n| Classifier                           | Best For                                      | Pros                               | Cons                                 | Model Provider                   |\n|--------------------------------------|-----------------------------------------------|------------------------------------|--------------------------------------|----------------------------------|\n| Multilayer Perceptron (MLP)          | General text classification                   | Works well with dense embeddings   | Requires feature engineering         | `TensorFlow`, `Keras`, `PyTorch` |\n| Convolutional Neural Networks (CNNs) | Short text classification, sentiment analysis | Captures local patterns in text    | Less effective for long documents    | `TensorFlow`, `Keras`, `PyTorch` |\n| Recurrent Neural Networks (RNNs)     | Sequential text classification                | Handles sequential dependencies    | Slower training, vanishing gradients | `TensorFlow`, `Keras`, `PyTorch` |\n| LSTMs (Long Short-Term Memory)       | Long text classification, sentiment analysis  | Preserves long-range dependencies  | Computationally 
expensive            | `TensorFlow`, `Keras`, `PyTorch` |\n| GRUs (Gated Recurrent Units)         | Faster alternative to LSTMs                   | Memory efficient                   | Still slower than CNNs               | `TensorFlow`, `Keras`, `PyTorch` |\n| Transformers (BERT, RoBERTa, T5)     | Large-scale classification, contextual text   | Best for complex NLP tasks         | Requires GPUs, expensive training    | `Hugging Face Transformers`      |\n\n**Best For:** **Large-scale text classification** with deep contextual understanding.  \n**Libraries:** `TensorFlow`, `PyTorch`, `Keras`, `transformers`  \n\n### 3️⃣ Pre-Trained Transformer Models (State-of-the-Art)\nPre-trained **transformer models** have revolutionized NLP, offering state-of-the-art accuracy for text classification.\n\n| Model                       | Provider     | Best For                                             | Pros                                                    | Cons                      |\n|-----------------------------|--------------|------------------------------------------------------|---------------------------------------------------------|---------------------------|\n| BERT                        | Google AI    | Sentiment analysis, topic classification             | Strong contextual understanding, bidirectional learning | Slow inference            |\n| DistilBERT                  | Hugging Face | Fast classification                                  | Lighter than BERT, optimized for speed                  | Slight accuracy trade-off |\n| RoBERTa                     | Meta AI      | Text classification, fake news detection             | More robust than BERT                                   | Requires fine-tuning      |\n| GPT-4                       | OpenAI       | Zero-shot classification                             | No training required, API-based                         | Requires API access       |\n| Text-to-Text Transformer T5 | Google AI    | Multi-task 
learning (classification + summarization) | Flexible for various NLP tasks                          | Large model size          |\n| XLNet                       | Google AI    | Long-form classification                             | Handles dependencies better than BERT                   | Computationally expensive |\n| Longformer                  | Allen AI     | Classification of long articles                      | Optimized for processing long documents                 | Requires large datasets   |\n\n**Best For:** **Large datasets \u0026 complex classification tasks**  \n**Libraries:** `transformers`, `PyTorch`, `TensorFlow`\n\n### 🏆 Selected Classifiers\nWhether a given classification algorithm outperforms the others on a particular dataset depends on the dataset's structure, shape, \ndensity and noise. The classifiers selected for evaluation in the project:\n\n| Traditional Classifiers    | Deep Learning-Based Classifiers (NNs)     | Pre-Trained Transformer |\n|----------------------------|-------------------------------------------|-------------------------|\n| Logistic Regression        | Simple Neural Network (SNN)               | DistilBERT              |\n| Random Forest              | Sequential Recurrent Neural Network (RNN) |                         |\n| Decision Tree              |                                           |                         |\n| k-Nearest Neighbors (k-NN) |                                           |                         |\n\n## Traditional Classifiers Comparison\n![Data Plot](./img/traditional_img.png)\n#### Traditional Classifiers Performance evaluation metrics:\n![Data Plot](./img/traditional_table.png)\n#### KNeighbors Classifier fine-tuning GridSearchCV params:\n- `metric` euclidean, manhattan, minkowski;\n- `n_neighbors` 3, 5, 7, 10;\n- `weights` uniform, distance.\n#### KNeighbors Classifier Best Params Across Different Vectorization Methods:\n| Dataset  |     Metric | n_neighbors |     weights | Mean Cross-Validation Accuracy 
|\n|----------|-----------:|------------:|------------:|-------------------------------:|\n| TF-IDF   |  euclidean |           5 |    distance |                         0.9488 |\n| Word2Vec |  euclidean |          10 |    distance |                         0.8164 |\n| N-Gram   |  euclidean |           5 |    distance |                         0.9344 |\n| FastText |  euclidean |           3 |    distance |                         0.8947 |\n#### KNeighbors Classifier Performance Comparison Across Different Vectorization Methods:\n![Data Plot](./img/knn_performance.png)\n![Data Plot](./img/knn_table.png)\n#### KNeighbors Classifier Confusion Matrix on TF-IDF Vectorization:\n![Data Plot](./img/knn_matrix.png)\n#### Logistic Regression Classifier fine-tuning GridSearchCV params:\n- `threshold` 0.01, 0.1, 1, 10;\n- `solver` lbfgs, liblinear, saga;\n- `max_iter` 100, 200, 300, 600.\n#### Logistic Regression Classifier Best Params Across Different Vectorization Methods:\n| Dataset  | Threshold | Max_Iter |    Solver | Accuracy |\n|----------|----------:|---------:|----------:|---------:|\n| TF-IDF   |        10 |      100 |     lbfgs |   0.9768 |\n| Word2Vec |        50 |      200 | liblinear |   0.9189 |\n| N-Gram   |        10 |      100 |     lbfgs |   0.9730 |\n| FastText |        50 |      100 | liblinear |   0.9151 |\n#### Logistic Regression Classifier Performance Comparison Across Different Vectorization Methods:\n![Data Plot](./img/lr_performance.png)\n![Data Plot](./img/lr_table.png)\n#### Logistic Regression Classifier Confusion Matrix on TF-IDF Vectorization:\n![Data Plot](./img/lr_matrix.png)\n## Neural Network Models\n### Simple Neural Network (SNN): TensorFlow Keras Sequential Model\nThis is a simple (basic) feed-forward neural network model. The input layer accepts vectorized text. The hidden layers are fully \nconnected with activation functions (e.g., ReLU). 
The output layer uses softmax for multi-class classification.\n![Data Plot](./img/snn_model.png)\n#### Data Shape used for SNN tuning:\n![Data Plot](./img/snn_data_shape.png)\n#### Hyperparameters used for tuning:\n- `neurons_list` [512, 256], [1024, 512], [256, 128];\n- `dropout_rates_list` [0.3, 0.3], [0.5, 0.5], [0.2, 0.2];\n- `batch_sizes` 32, 64;\n- `epochs` 100.\n![Data Plot](./img/snn_tuning_results.png)\n![Data Plot](./img/snn_best_params.png)\n![Data Plot](./img/snn_test_accuracy.png)\n### Recurrent Neural Network (RNN): Long Short-Term Memory (LSTM)\nLSTM and GRU are RNN variants designed for handling sequential text data. \n![Data Plot](./img/rnn_model.png)\n#### FastText dataset results\n![Data Plot](./img/rnn_best_model.png)\n### Transformer-Based Models\nTransformer-based models use BERT, RoBERTa, or similar pre-trained models. This project uses a DistilBERT model loaded \nfrom the pre-trained `distilbert-base-uncased` checkpoint.\n![Data Plot](./img/bert_dataset.png)\n![Data Plot](./img/bert_results.png)\n## 🏆 Winning Models in Classifying Articles\n### The winning models in classifying articles are _Logistic Regression_ and _Simple Neural Network (SNN)_.\nWhile deep learning models like DistilBERT offer contextual understanding, traditional ML models (Logistic Regression) \nwith TF-IDF delivered higher accuracy at a significantly lower computational cost. 
This suggests that for structured \ntext datasets, feature engineering remains a powerful tool and deep learning models should be carefully fine-tuned \nto justify their resource requirements.\n\n| Classifier                              |          Vectorizer | Model Size | Accuracy |\n|-----------------------------------------|--------------------:|-----------:|---------:|\n| Logistic Regression                     |              TF-IDF |       4 KB |   0.9768 |\n| k-Nearest Neighbors                     |              TF-IDF |     423 KB |   0.9488 |\n| TensorFlow Keras Sequential Model (SNN) |              TF-IDF |    32.4 MB |   0.9768 |\n| DistilBERT Base Uncased Model           | DistilBertTokenizer |   267.6 MB |   0.8745 |\n\nThe saved winning models and label encoders are in the `/models` directory for further training or usage.\n## Conclusions\nThis project aimed to classify text data using various machine learning and deep learning models, leveraging different \nvectorization techniques. We evaluated models ranging from traditional ML classifiers (Logistic Regression, k-NN) \nto deep learning architectures (SNN and the transformer-based DistilBERT). The key finding was that Logistic Regression with \nTF-IDF and SNN with TF-IDF performed best, both achieving 97.68% accuracy, demonstrating that traditional models \ncan be competitive with deep learning when using appropriate feature engineering. However, DistilBERT showed \nlower accuracy (87.45%), suggesting that further fine-tuning on this dataset might improve performance.\n## Usage\n1. Clone the Repository\n    ```\n    git clone https://github.com/audrbar/ml-sports.git\n    cd ml-sports\n    ```\n2. Create and Activate a Virtual Environment\n    ```\n    python3 -m venv venv\n    source venv/bin/activate\n    ```\n    On Windows use\n    ```\n    venv\\Scripts\\activate\n    ```\n3. Install Dependencies\n    ```\n    pip install -r requirements.txt\n    ```\n4. Run the scripts:\n    ```\n    main\n    ```\n5. 
Load models from the `/models` directory\n   ```\n   model = joblib.load(model_path)\n   ```\n6. Load encoders from the `/models` directory\n   ```\n   with open(label_encoder_path, 'rb') as f:\n       label_encoder = pickle.load(f)\n   ```\n### Future Scope\n1. **Fine-Tuning Deep Learning Models:** The performance of DistilBERT was lower than expected. Future work \ncould involve fine-tuning on domain-specific text data and experimenting with larger transformer models like BERT, \nRoBERTa, or T5 for improved contextual understanding.\n2. **Hybrid Model Approaches:** A possible improvement could be combining traditional ML models with deep learning \nembeddings, such as using DistilBERT embeddings as input features for Logistic Regression or k-NN. This could \nprovide a balance between accuracy and computational efficiency.\n3. **Feature Engineering Enhancements:** Further improvements could be explored by experimenting with different \nvectorization techniques such as fastText, GloVe, or Word2Vec embeddings to see if they provide better representations \nfor classification.\n4. **Real-Time Classification Pipeline:** Deploying the model in a real-time system with streaming data processing \nusing frameworks like Apache Kafka or FastAPI would enhance its practical applicability. Optimizing for low-latency \npredictions will be crucial for scalability.\n5. **Multi-Label Classification:** The current approach assumes a single category per text entry. A future extension \ncould involve multi-label classification using techniques like sigmoid activation functions or attention mechanisms \nto handle overlapping categories.\n6. **Vector Storage in Databases:** Storing text embeddings (vectors) in vector databases like FAISS, Pinecone, Weaviate, \nor Milvus will allow for efficient similarity search and retrieval. This would be useful for semantic search, nearest \nneighbor classification, and recommendation systems. 
A structured database like PostgreSQL with the pgvector extension \ncan also be explored for efficient indexing and querying of embeddings.\n7. **Bias and Fairness Analysis:** Analyzing model predictions for bias and fairness across different text categories \ncan help ensure the system provides balanced and unbiased predictions, particularly if deployed in sensitive \napplications.\nBy addressing these areas, the project can be expanded for better accuracy, efficiency, and real-world deployment, \nensuring adaptability across various text classification tasks.\n### Resources\n[Performance comparison of multi class classification algorithms](https://gursev-pirge.medium.com/performance-comparison-of-multi-class-classification-algorithms-606e8ba4e0ee)\\\n[Multiclass Classification](https://builtin.com/machine-learning/multiclass-classification)\\\n[Towards Data Science](https://towardsdatascience.com/)\\\n[KDnuggets](https://www.kdnuggets.com/)\\\n[Analytics Vidhya](https://www.analyticsvidhya.com/)\\\n[Data Science Central](https://www.datasciencecentral.com/)\\\n[Medium](https://medium.com/)\\\n[The batch](https://read.deeplearning.ai/the-batch/)\\\n[Data Mahadev](https://datamahadev.com/category/analytics/)\\\n[Paper with code](https://paperswithcode.com/)\\\n[Random Forest Algorithm](https://builtin.com/data-science/random-forest-algorithm)\\\n[Machine Learning in Science](https://mindthegraph.com/blog/lt/machine-learning-in-science/)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faudrbar%2Fml-sports","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faudrbar%2Fml-sports","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faudrbar%2Fml-sports/lists"}