Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/aydan-moon/news_headlines_ner
Named Entity Recognition (NER) model for analyzing entities in news headlines using spaCy and trained on the CoNLL-2003 dataset.
https://github.com/aydan-moon/news_headlines_ner
conll-2003 ner nlp python spacy
Last synced: 24 days ago
JSON representation
Named Entity Recognition (NER) model for analyzing entities in news headlines using spaCy and trained on the CoNLL-2003 dataset.
- Host: GitHub
- URL: https://github.com/aydan-moon/news_headlines_ner
- Owner: Aydan-moon
- Created: 2024-11-06T11:15:28.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-11-20T10:48:12.000Z (3 months ago)
- Last Synced: 2025-01-21T18:17:17.235Z (24 days ago)
- Topics: conll-2003, ner, nlp, python, spacy
- Language: Jupyter Notebook
- Homepage:
- Size: 12.8 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: Readme.md
Awesome Lists containing this project
README
# Named Entity Recognition (NER) for News Headlines
### Project Overview
This project implements a Named Entity Recognition (NER) system designed to identify and classify named entities (e.g., persons, organizations, locations) in news headlines. The system is built using **spaCy** and trained on the **CoNLL-2003 dataset**.### Project Objectives
- Use the CoNLL-2003 dataset to build and train an NER model.
- Implement data preprocessing and exploratory analysis.
- Train a simple NER model using spaCy's small English model.
- Evaluate the model using precision, recall, and F1-score metrics.
- Create a function to perform NER on new, unseen headlines.---
## Dataset Description
**Dataset**: [CoNLL-2003](https://huggingface.co/datasets/conll2003), a well-known benchmark for NER tasks, containing:
- `PER` (Person)
- `ORG` (Organization)
- `LOC` (Location)
- `MISC` (Miscellaneous)**Structure**: Each word in a sentence is tagged with an entity type or labeled as `O` if it doesn't belong to any named entity. The dataset is split into training, validation, and test sets to ensure fair model evaluation.
---
## Project Workflow
### 1. Data Preprocessing
- **Data Loading**: The dataset is loaded using the Hugging Face `datasets` library.
- **Data Exploration**: Initial exploration provides insights into the distribution of entity labels and helps prepare data for modeling.
- **Data Preparation**: The data is converted into a format compatible with spaCy for training.### 2. Model Training
- **Model Selection**: We use spaCy’s small English model, focusing solely on the NER component.
- **Training Process**: The NER model is trained over multiple iterations, adjusting parameters to improve performance.
- **Model Saving**: The trained model is saved for later use and evaluation.### 3. Model Evaluation
- **Metrics**: The model’s performance is evaluated using:
- **Precision**
- **Recall**
- **F1-score**The model achieves high accuracy, especially in identifying common entity types. Performance insights suggest potential for improvement in detecting less frequent entity types.
### 4. Performing NER on New Headlines
A function is created to perform NER on new text data. Given a headline, the function identifies and classifies entities, providing quick insights into the headline's key terms.---
## Usage
### Prerequisites
- Python 3.7+
- Install required libraries:
```bash
pip install spacy datasets### Running the Project
- **Load the Dataset**
```bash
from datasets import load_dataset
dataset = load_dataset("conll2003")- **Prepare Data for spaCy**
Convert the CoNLL-2003 dataset into spaCy’s training format.
- **Train the Model**
```bash
import spacy
nlp = spacy.load("en_core_web_sm")
#Disable other pipelines and train NER model- **Save and Evaluate the Model**
```bash
nlp.to_disk("/path/to/save/model")- **Run NER on New Headlines**
```bash
def perform_ner(text, nlp):
doc = nlp(text)
return [(ent.text, ent.label_) for ent in doc.ents]## **Project Structure**
- **NLP_01.ipynb**: Jupyter notebook with the code and step-by-step process.
- **README.md**: Project documentation.
- **trained_model**: Directory where the trained spaCy model is saved.## **Results and Insights**
- **Overall Accuracy**: The model shows strong performance, especially in identifying the most common entity classes.
- **Limitations**: The model can be improved to detect less frequent and complex entities.
## **Future Improvements**
- Experiment with different NER models or architectures to improve classification of less common entities.
- Fine-tune hyperparameters and consider using data augmentation for better generalization.### **Contact**
Feel free to reach out for questions or collaboration:
📧 [email protected]