https://github.com/abdelrahman-amen/active_learning_using_imbalanced_dataset
This project explores active learning techniques, focusing on query strategies to optimize informative data selection for model training. It aims to reduce labeled data while improving model performance, especially with imbalanced datasets where certain classes are underrepresented.
- Host: GitHub
- URL: https://github.com/abdelrahman-amen/active_learning_using_imbalanced_dataset
- Owner: Abdelrahman-Amen
- Created: 2025-01-23T04:58:20.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-01-23T05:13:06.000Z (3 months ago)
- Last Synced: 2025-02-11T12:42:57.466Z (3 months ago)
- Topics: activelearning, bmi, entropy, imbalanced-data, margin, python, querybycommittee, smote, uncertainty
- Language: Jupyter Notebook
- Size: 809 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Active_Learning using Imbalanced Dataset 🏛
# Introduction 🔍
Active learning is a machine learning approach where the model selects the most informative, unlabeled data points to be labeled by an oracle (usually a human). This method is valuable when labeled data is scarce or expensive, allowing the model to perform well with fewer labeled examples.
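To make the loop concrete, here is a minimal pool-based sketch in Python — an illustration under assumed data and model choices, not the repository's exact code:

```python
# Minimal pool-based active learning loop (illustrative sketch, not the repo's code).
# Assumptions: scikit-learn toy data, a logistic-regression learner, and a small
# labeling budget; the "oracle" is simulated by reading the true label y[query].
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
# Seed set: five labeled examples from each class.
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(20):  # labeling budget: 20 queries
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])
    # Query the pool point the model is least confident about.
    query = pool[int(np.argmin(proba.max(axis=1)))]
    labeled.append(query)  # the oracle supplies y[query]
    pool.remove(query)
```

Each iteration retrains on the labeled set, scores the remaining pool, and sends the single least-confident point to the oracle.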
# Benefits of Active Learning 🎯
• Less Labeled Data: Active learning can significantly reduce the number of labeled data points required to train an effective model, saving time and cost.
• Better Performance: By focusing on the most informative examples, active learning often leads to better performance than using random sampling.
• Reduces Overfitting: The model can generalize better, as it learns from more diverse, informative examples rather than just random ones.
# Applications 🧠
• Natural Language Processing (NLP): In tasks like text classification, sentiment analysis, and named entity recognition, active learning can help select the most relevant documents to label.
• Computer Vision: Active learning is used for object detection, image classification, and semantic segmentation by selecting images that maximize model uncertainty.
• Speech Recognition: Active learning can be used to select the most uncertain or informative speech samples, improving recognition models with fewer annotated audio clips.
# Objectives of Active Learning 🎯
• Reduce Data Labeling Costs: By minimizing the number of labeled samples needed, active learning cuts down on the overall cost and time required for annotation.
• Improve Model Efficiency: It helps the model learn more from fewer data points, making training more efficient.
• Improve Accuracy: Active learning helps increase the performance of models in specific domains with limited data.
• Enable Learning from Small Datasets: Particularly useful in situations where gathering labeled data is challenging or expensive, like medical imaging or rare event prediction.
# Active Learning Scenarios 📊
• Pool-Based Active Learning: A large pool of unlabeled data is available, and the model selects the most informative samples from it for labeling.
• Membership Query Synthesis: The model synthesizes new data points from the input space and asks the oracle to label them, rather than picking from existing data.
• Stream-Based Sampling: In real-time systems, data arrives continuously, and the model decides for each incoming point whether it is informative enough to query; a minimal sketch follows this list.
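For the stream-based case, the selection rule can be as simple as a confidence threshold. A minimal sketch — the model, oracle, and threshold value are assumptions for illustration:

```python
# Stream-based selective sampling (illustrative sketch). Each arriving point is
# either queried (if the current model is unsure about it) or skipped.
def process_stream(model, stream, oracle, threshold=0.6):
    queried_X, queried_y = [], []
    for x in stream:  # data arrives one point at a time
        confidence = model.predict_proba([x]).max()
        if confidence < threshold:  # uncertain -> worth asking the oracle
            queried_X.append(x)
            queried_y.append(oracle(x))
    return queried_X, queried_y  # new labeled points for the next retraining
```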
# Strategies for Active Learning 💡
• Low Confidence Sampling: The model queries labels for samples where it has the least confidence in its prediction. These are the data points that the model is most uncertain about.
• Margin-Based Sampling: Active learning chooses the samples that lie closest to the decision boundary. These points are often the hardest for the model to classify correctly, making them valuable for improving the model.
• Entropy-Based Sampling: The model selects the data points with the highest predictive entropy, i.e. where its prediction probabilities are most evenly spread across classes. All three strategies are sketched in the code below.
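Each of these strategies reduces to a score over the model's predicted class probabilities, where a higher score means "query this point first". A minimal sketch, with function names of my own choosing:

```python
# Uncertainty scores for the three query strategies, computed from a
# (n_samples, n_classes) array of predicted probabilities.
import numpy as np

def least_confidence(proba):
    # 1 - P(most likely class): large when the top prediction is weak.
    return 1.0 - proba.max(axis=1)

def margin_score(proba):
    # Negative gap between the top two class probabilities: samples near the
    # decision boundary have a small gap, hence a high score.
    part = np.sort(proba, axis=1)
    return -(part[:, -1] - part[:, -2])

def entropy_score(proba):
    # Shannon entropy of the predicted distribution: highest when
    # probability mass is spread evenly across classes.
    return -np.sum(proba * np.log(proba + 1e-12), axis=1)
```

With any of these, the next query is the pool sample with the maximal score, e.g. `np.argmax(entropy_score(proba))`.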
# Imbalanced Dataset
Enhancement approach for the imbalanced dataset: applying an over-sampling technique to rebalance the training data improved the model's performance considerably.
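The README doesn't show the resampling code itself; given the repository's smote topic, a plausible sketch uses imbalanced-learn's SMOTE (the toy data and parameters here are assumptions):

```python
# Over-sampling with SMOTE (illustrative; the repo's "smote" topic suggests the
# technique, but the exact data and parameters here are assumptions).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# A toy imbalanced dataset: roughly 90% majority class, 10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

# SMOTE creates synthetic minority samples by interpolating between a minority
# point and its nearest minority-class neighbors. Resample only the training
# split in practice, never the test set.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))  # class counts are now balanced
```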
# Conclusion 📝
Active learning is a powerful technique for improving model performance with less training data. It is particularly useful in applications where labeled data is expensive or difficult to obtain.
# Want More Details? 📄
If you're interested in a deeper dive, check out my full report here!
[Click here to view the report](https://github.com/Abdelrahman-Amen/Active_Learning_using_Imbalanced_Dataset/blob/main/Technical%20report.pdf)