https://github.com/ahmedkhaled404/ner-with-spacy
Named entity recognition using traditional NLP methods
- Host: GitHub
- URL: https://github.com/ahmedkhaled404/ner-with-spacy
- Owner: ahmedkhaled404
- Created: 2024-07-21T15:54:42.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-07-21T16:01:29.000Z (4 months ago)
- Last Synced: 2024-10-31T04:06:09.856Z (6 days ago)
- Topics: machine-learning, matplotlib, ner, nlp, nlp-machine-learning, python, spacy
- Language: Jupyter Notebook
- Homepage:
- Size: 3.96 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Named Entity Recognition with spaCy
**1. Introduction**
**Problem Statement:** Named Entity Recognition (NER) is a subtask of information extraction that identifies named entities in text and classifies them into predefined categories such as persons, organizations, and locations. This project develops a custom NER pipeline using spaCy, a popular NLP library. The objective is to build a model that accurately identifies and classifies entities in text data, and to optimize the pipeline for efficiency.
**Objectives:**
- Develop a pipeline for NER using spaCy.
- Identify and classify named entities.
- Optimize the pipeline for efficient entity recognition.

**2. Data Description**
The dataset for this project is sourced from Kaggle and can be accessed [here](https://www.kaggle.com/datasets/naseralqaydeh/named-entity-recognition-ner-corpus/). It consists of:
- **Sentence**: The complete sentence in text format.
- **POS**: List of Part-of-Speech (POS) tags for each word in the sentence.
- **Tag**: List of Named Entity Recognition (NER) tags for each word in the sentence.

**Example Data**
- **Sentence**: "Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country."
- **POS**: ['NNS', 'IN', 'NNS', 'VBP', 'VBN', 'IN', 'NNP', 'TO', 'VB', 'DT', 'NN', 'IN', 'NNP', 'CC', 'VB', 'DT', 'NN', 'IN', 'JJ', 'NNS', 'IN', 'DT', 'NN', '.']
- **Tag**: ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']

**3. Baseline Experiments**
**Goal:** Establish a baseline performance for the NER model using spaCy's pre-trained models and evaluate its effectiveness on the custom dataset.
**Methodology:**
- Data Preprocessing
- Loaded the pre-trained English model en_core_web_sm from spaCy.
- Created and trained an NER component using the custom dataset.
- Custom NER model training, evaluation, and visualization of results.

**Results & Conclusion:**
- The baseline model was evaluated, and initial results indicated that spaCy's pre-trained models are effective on the custom dataset. However, the measured performance might be biased by data imbalance.
- The model needs to generalize better.
- These problems were not addressed due to time limitations.

**4. Methodology breakdown**
1. **Data Import and Inspection:**
- Loaded data from CSV file.
- Checked for null values and data types.
2. **Data Exploration (EDA):**
- Verified the shape and distribution of data.
- Checked for duplicates and removed them.
- Checked distribution of labels.

![image](https://github.com/user-attachments/assets/8d957b50-ff81-49c6-b270-612e785723d3)
(Figure: distribution of labels in the data.)
- Label distribution analysis across datasets: analyzed the distribution of labels across the training, validation, and test datasets.

![image](https://github.com/user-attachments/assets/f43690a3-d414-405f-b40f-65d9ce7d6f88)
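
The label distribution check described above can be sketched with a plain counter over the per-sentence tag lists. This is a minimal illustration, not the notebook's actual code: the function names `label_counts` and `label_shares` and the toy split below are hypothetical, and the project itself may have used Pandas and Seaborn for this step.

```python
from collections import Counter
from typing import Dict, Iterable, List


def label_counts(tag_sequences: Iterable[List[str]]) -> Counter:
    """Count how often each NER tag occurs across all sentences."""
    counts: Counter = Counter()
    for tags in tag_sequences:
        counts.update(tags)
    return counts


def label_shares(counts: Counter) -> Dict[str, float]:
    """Convert raw counts into fractions of all tokens."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Hypothetical toy split; real splits would hold the dataset's Tag lists.
train_tags = [["O", "O", "B-geo"], ["B-per", "I-per", "O"]]
train_counts = label_counts(train_tags)
train_shares = label_shares(train_counts)
```

Comparing the shares from each split side by side shows whether rare labels are represented in the validation and test sets at all.
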
1. **Data Preprocessing:**
- Converted sentences and annotations into spaCy's format.
- Split the data into training, validation, and test sets.
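
spaCy's entity annotations are expressed as character offsets, so the per-token BIO tags in the dataset have to be turned into `(start_char, end_char, label)` spans. The sketch below shows one way to do that in plain Python. It assumes tokens are joined with single spaces, and the helper name `bio_to_spans` is hypothetical; the notebook's own conversion may differ.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def bio_to_spans(tokens: List[str], tags: List[str]) -> Tuple[str, List[Span]]:
    """Join tokens with single spaces and turn BIO tags into character spans."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    text = " ".join(tokens)
    spans: List[Span] = []
    offset = 0
    current = None  # [start, end, label] of the entity being built
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        offset = end + 1  # skip the joining space
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current is not None and current[2] == label:
            current[1] = end  # extend the open entity
            continue
        if current is not None:
            spans.append((current[0], current[1], current[2]))
            current = None
        if prefix in ("B", "I"):
            # A stray I- tag without a preceding B- also opens an entity.
            current = [start, end, label]
    if current is not None:
        spans.append((current[0], current[1], current[2]))
    return text, spans


tokens = "Thousands of demonstrators have marched through London".split()
tags = ["O", "O", "O", "O", "O", "O", "B-geo"]
text, spans = bio_to_spans(tokens, tags)
```

The resulting spans can then be passed to spaCy as an `entities` list. Note that spaCy's tokenizer may split text differently from whitespace splitting, so some spans may need alignment checks before training.
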
2. **Model Training:**
- Fine-tuned the NER model using spaCy’s training pipeline.
- Trained for 20 iterations (epochs).
3. **Model Evaluation:**
- Evaluated the model on validation and test sets using classification metrics.
4. **Results:**
- **Validation Classification Report:**

| Label | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| B-eve | 1.00 | 0.00 | 0.00 | 1 |
| B-org | 1.00 | 0.00 | 0.00 | 3 |
| B-per | 1.00 | 0.00 | 0.00 | 1 |
| B-tim | 1.00 | 0.50 | 0.67 | 8 |
| I-art | 0.00 | 0.00 | 1.00 | 2 |
| I-eve | 0.50 | 1.00 | 0.67 | 1 |
| I-geo | 0.00 | 0.00 | 1.00 | 7 |
| I-org | 0.62 | 0.40 | 0.49 | 25 |
| I-per | 0.67 | 0.33 | 0.44 | 6 |
| I-tim | 0.87 | 0.85 | 0.86 | 65 |
| O | 1.00 | 1.00 | 1.00 | 16451 |

| Accuracy (%) | Macro Avg | Weighted Avg |
|--------------|-----------|--------------|
| 99.00 | 0.70 | 1.00 |

- **Test Classification Report:**

| Label | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| B-eve | 1.00 | 0.00 | 0.00 | 1 |
| B-geo | 1.00 | 0.00 | 0.00 | 1 |
| B-org | 1.00 | 0.33 | 0.50 | 3 |
| B-tim | 0.60 | 0.43 | 0.50 | 7 |
| I-art | 1.00 | 0.00 | 0.00 | 1 |
| I-eve | 0.75 | 1.00 | 0.86 | 3 |
| I-geo | 0.40 | 0.40 | 0.40 | 5 |
| I-org | 0.45 | 0.36 | 0.40 | 14 |
| I-per | 1.00 | 0.14 | 0.25 | 7 |
| I-tim | 0.91 | 0.92 | 0.91 | 75 |
| O | 1.00 | 1.00 | 1.00 | 16446 |

| Accuracy (%) | Macro Avg | Weighted Avg |
|--------------|-----------|--------------|
| 99.00 | 0.83 | 1.00 |

- **Visualization of Entities:**
Provided visual examples showing how the model's predictions compare to the true annotations.
![image](https://github.com/user-attachments/assets/b93b6c2b-b25e-46e3-b50f-8b1736e84866)
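
Both reports show accuracy near 99% alongside much lower recall on the rare labels, which is the pattern expected when the `O` tag dominates the data. The sketch below recomputes per-label precision, recall, and F1 in plain Python to make that effect visible. It is an illustration under stated assumptions, not the project's evaluation code (which used scikit-learn): the function names and the toy data are hypothetical. The `zero_division` parameter mirrors the idea of assigning a fixed value when a label is never predicted, which is one possible explanation for rows that read 1.00 precision with 0.00 recall.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, Tuple[float, float, float]]  # label -> (precision, recall, f1)


def per_label_scores(y_true: List[str], y_pred: List[str],
                     zero_division: float = 0.0) -> Scores:
    """Compute token-level precision, recall, and F1 for every label."""
    scores: Scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # zero_division is used when the denominator is empty.
        precision = tp / (tp + fp) if tp + fp else zero_division
        recall = tp / (tp + fn) if tp + fn else zero_division
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of tokens whose predicted tag matches the true tag."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(scores: Scores) -> float:
    """Unweighted mean of per-label F1, so rare labels count equally."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical toy data: a model that predicts "O" for every token.
y_true = ["O"] * 98 + ["B-geo", "B-per"]
y_pred = ["O"] * 100
toy_scores = per_label_scores(y_true, y_pred, zero_division=1.0)
toy_accuracy = accuracy(y_true, y_pred)
toy_macro_f1 = macro_f1(toy_scores)
```

In this toy case the accuracy is 0.98 even though no entity is ever found, while the macro-averaged F1 falls to roughly one third, so the macro average is the more informative figure for an imbalanced tag set.
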
**5. Overall Conclusion**
The project developed and fine-tuned a Named Entity Recognition pipeline using spaCy. The custom model demonstrated improved performance on the custom dataset compared to the baseline pre-trained model. Advanced experiments revealed the effectiveness of deep learning approaches for NER tasks and provided valuable insights into entity classification and pipeline optimization.
**Key Findings:**
- Fine-tuning a pre-trained model can significantly improve performance on domain-specific datasets.
- Label distribution analysis and visualization of entities are essential for understanding model performance and accuracy.

**Additional Information**
**1. Libraries and Tools Used**
| Library/Tool | Description | Link |
|--------------|----------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|
| spaCy | A popular NLP library used for developing and training NER models. | [spaCy Documentation](https://spacy.io/) |
| Pandas | A data manipulation library used for loading and processing the dataset. | [Pandas Documentation](https://pandas.pydata.org/) |
| Matplotlib | A plotting library used for visualizing label distributions and model results. | [Matplotlib Documentation](https://matplotlib.org/) |
| Seaborn | A data visualization library based on Matplotlib, used for creating attractive and informative statistical graphics. | [Seaborn Documentation](https://seaborn.pydata.org/) |
| Scikit-learn | A machine learning library used for generating classification reports and calculating performance metrics. | [Scikit-learn Documentation](https://scikit-learn.org/) |

**2. External Resources**
| Resource | Description | Link |
|-------------------|-------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| Dataset | Named Entity Recognition Corpus, including sentences, POS tags, and NER tags. | [Kaggle Dataset](https://www.kaggle.com/datasets/naseralqaydeh/named-entity-recognition-ner-corpus/) |
| Pre-trained Model | spaCy's en_core_web_sm model, used for initial NER training and evaluation. | [Model Details](https://spacy.io/models/en) |