https://github.com/syed007hassan/mlpipeline-randomforest
Training a Random Forest Model for GitHub Issue Classification
- Host: GitHub
- URL: https://github.com/syed007hassan/mlpipeline-randomforest
- Owner: Syed007Hassan
- Created: 2025-02-02T23:02:12.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-02-02T23:14:44.000Z (4 months ago)
- Last Synced: 2025-02-03T00:20:07.098Z (4 months ago)
- Topics: nltk, rainforest, tf-idf
- Language: Python
- Homepage:
- Size: 23.9 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Random Forest Model Pipeline
## 1. Model Pipeline Components
The pipeline consists of three main components:
### a) Text Preprocessor
- Custom transformer class compatible with sklearn
- Implements the preprocessing steps from Exercise 02:
1. Noise removal (URLs, code blocks, special characters)
2. Text normalization (lowercase, whitespace)
3. Tokenization
4. Stop-word removal
5. Lemmatization
- Maintains preprocessing consistency between training and prediction

### b) TF-IDF Vectorizer
- Converts preprocessed text into numerical features
- Parameters:
- max_features=5000 (limits vocabulary size)
- Benefits:
- Captures word importance in documents
- Handles varying document lengths
- Reduces impact of common words

### c) Random Forest Classifier
- Final classification model
- Parameters:
- n_estimators=100 (number of trees)
- max_depth=None (allows full tree growth)
- min_samples_split=2 (minimum samples for split)
- random_state=42 (reproducibility)
- Benefits:
- Handles high-dimensional data well
- Resistant to overfitting
- Provides feature importance

## 2. Model Modularization for Integration
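Taken together, the three components above can be assembled into a single scikit-learn `Pipeline`. A minimal sketch follows; the parameter values match those listed above, but the `TextPreprocessor` body is a simplified assumption (the repo's version also tokenizes, removes stop words, and lemmatizes via NLTK):

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


class TextPreprocessor(BaseEstimator, TransformerMixin):
    """Simplified stand-in for the repo's preprocessor:
    noise removal and normalization only."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        cleaned = []
        for text in X:
            text = re.sub(r'https?://\S+', ' ', text)   # remove URLs
            text = re.sub(r'`[^`]*`', ' ', text)        # remove inline code
            text = re.sub(r'[^A-Za-z\s]', ' ', text)    # remove special characters
            cleaned.append(re.sub(r'\s+', ' ', text).strip().lower())
        return cleaned


pipeline = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('classifier', RandomForestClassifier(
        n_estimators=100, max_depth=None,
        min_samples_split=2, random_state=42)),
])
```

Because preprocessing lives inside the pipeline, `pipeline.predict(['raw issue text'])` applies the exact same cleaning at inference time as during training.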
### Modularization Approach:
1. **Separate Preprocessing Class**
- TextPreprocessor class is self-contained
- Can be imported and used independently
- Maintains consistent preprocessing across applications
2. **Serialized Model Pipeline**
- Complete pipeline saved using joblib
- Includes preprocessor, vectorizer, and classifier
- Can be loaded and used in a Flask application
3. **Verification System**
- Preprocessing examples stored in JSON
- Allows verification of preprocessing consistency
- Useful for testing and debugging
4. **Future Integration**
- Flask app can import TextPreprocessor class
- Load saved pipeline using joblib
- Use for real-time predictions

## 3. Random Forest and Concept Drift
### Drawbacks of Random Forest
1. **Static Model Structure**
- Trees are fixed after training
- Cannot adapt to new patterns without retraining
- May become outdated as issue patterns change
2. **Memory Intensive**
- Stores many decision trees
- Difficult to update incrementally
- Requires full retraining for updates
3. **Feature Space Limitations**
- Fixed vocabulary from training data
- Cannot handle new terms or patterns
- May miss emerging topics

### Alternative Model: Online Learning with SGDClassifier
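A minimal sketch of such an online learner, assuming a hashing vectorizer so the feature space stays open to new terms (the example texts and labels below are illustrative, not from this repo):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer has no fixed vocabulary, so terms that appear only
# in future issues still hash into the same feature space.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
model = SGDClassifier(random_state=42)

# All classes must be declared on the first partial_fit call.
classes = ['bug', 'enhancement', 'question']

# Initial batch
X0 = vectorizer.transform(['app crashes on startup', 'please add dark mode'])
model.partial_fit(X0, ['bug', 'enhancement'], classes=classes)

# Later, as newly labeled issues arrive, update incrementally
# instead of retraining from scratch:
X1 = vectorizer.transform(['how do I configure the linter?'])
model.partial_fit(X1, ['question'])
```

Each `partial_fit` call performs a few gradient steps on the new batch, which is what lets the model track drifting issue patterns over time.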
#### Benefits for Concept Drift:
1. **Incremental Learning**
- Can update with new data points
- Adapts to changing patterns
- Supports the `partial_fit` method
2. **Memory Efficient**
- Doesn't store training data
- Lighter memory footprint
- Easier to deploy and update
3. **Adaptive Learning Rate**
- Adjusts to data changes
- Balances old and new knowledge
- Handles concept drift better
4. **Implementation Strategy**
- Regular model updates with new data
- Monitoring of prediction confidence
- Sliding window for recent patterns

After the execution of `app.py`:
### 1. Preprocessing Verification
First, the program verified the preprocessing pipeline using the examples saved from Exercise 02:

- Example 1: GitHub issue about entities and fields
  - Original: technical issue about `__tileSrcRect` fields
  - Processed: cleaned, tokenized, and lemmatized version without URLs and special characters
- Example 2: Bug report about a blog link
  - Original: Markdown-formatted bug report about updating website links
  - Processed: clean text with key terms preserved but formatting removed
- Example 3: Technical discussion about expressions
  - Original: code example with Markdown formatting
  - Processed: plain text with code-related terms preserved

### 2. Model Training and Evaluation
The program then trained and evaluated the Random Forest model:

Model Evaluation Results:
- Bug issues:
  - Precision: 0.76 (76% of predicted bugs were actual bugs)
  - Recall: 0.79 (79% of actual bugs were correctly identified)
  - F1-score: 0.77 (harmonic mean of precision and recall)
- Enhancement issues:
  - Precision: 0.70
  - Recall: 0.80
  - F1-score: 0.75
- Question issues:
  - Precision: 0.61
  - Recall: 0.09 (very low; the model struggles with questions)
  - F1-score: 0.16 (poor performance on questions)

Overall:
- Accuracy: 0.73 (73% of all predictions were correct)
- The model performs well on bugs and enhancements
- It struggles with questions, likely due to class imbalance

### 3. Model Serialization
The trained model pipeline is saved as `model_pipeline.joblib`.
This file contains:
- The complete trained pipeline including:
1. TextPreprocessor
2. TF-IDF Vectorizer
3. Random Forest Classifier
- Can be loaded later using:

```python
import joblib

# TextPreprocessor must be importable wherever the pipeline is loaded,
# or joblib will fail to unpickle it.
loaded_model = joblib.load('model_pipeline.joblib')
prediction = loaded_model.predict(['new issue text'])
```

### Key Observations:
1. Preprocessing works consistently across different types of issues
2. Model performs well on majority classes (bugs and enhancements)
3. Poor performance on minority class (questions) suggests need for:
- Class balancing techniques
- More training data for questions
- Possibly a different model architecture for better minority-class handling

The saved model can now be used in a Flask application for real-time predictions on new GitHub issues.