https://github.com/bhawnamehbubani/headline_generation_and_summarization_of_news_articles_with_t5_and_xsum_dataset
This project implements a headline summarization pipeline using the T5 transformer model fine-tuned on the XSum dataset. It provides a user-friendly interface using Gradio to generate concise summaries for long headlines or paragraphs. The project showcases end-to-end machine learning capabilities, from preprocessing and fine-tuning to deployment
https://github.com/bhawnamehbubani/headline_generation_and_summarization_of_news_articles_with_t5_and_xsum_dataset
gradio huggingface-transformers matplotlib pytorch rouge seaborn t5-model xsum
Last synced: 6 months ago
JSON representation
This project implements a headline summarization pipeline using the T5 transformer model fine-tuned on the XSum dataset. It provides a user-friendly interface using Gradio to generate concise summaries for long headlines or paragraphs. The project showcases end-to-end machine learning capabilities, from preprocessing and fine-tuning to deployment
- Host: GitHub
- URL: https://github.com/bhawnamehbubani/headline_generation_and_summarization_of_news_articles_with_t5_and_xsum_dataset
- Owner: BhawnaMehbubani
- Created: 2025-01-23T19:06:29.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-01-27T03:04:38.000Z (8 months ago)
- Last Synced: 2025-02-06T03:52:18.717Z (8 months ago)
- Topics: gradio, huggingface-transformers, matplotlib, pytorch, rouge, seaborn, t5-model, xsum
- Language: Jupyter Notebook
- Homepage:
- Size: 458 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Headline Generation and Summarization of News Articles with T5 and XSum Dataset
## Overview
This project implements a headline summarization pipeline using the T5 transformer model fine-tuned on the XSum dataset. It provides a user-friendly interface using Gradio to generate concise summaries for long headlines or paragraphs. The project showcases end-to-end machine learning capabilities, from preprocessing and fine-tuning to deployment with a modern UI.
## Features
- Fine-tuned T5 transformer model for high-quality text summarization.
- Preprocessing pipeline for efficient dataset preparation.
- Integrated Gradio interface for real-time summarization.
- Modular code structure for ease of customization and scalability.
- Detailed logging for debugging and monitoring.
## Project workflow
```plaintext
1. Dataset and Data Preparation
┌─────────────────────────────────────────────────────────┐
│ Dataset: XSum (Extreme Summarization Dataset) │
│ - Source: Hugging Face Datasets Library │
│ - Purpose: News article summarization and headline gen │
└─────────────────────────────────────────────────────────┘
|
v
┌─────────────────────────────────────────────────────────┐
│ Data Loading │
│ - Download the XSum dataset │
│ - Automatically splits data into Train, Validation, │
│ and Test sets using Hugging Face's `load_dataset` API │
│ - Ensures compatibility for fine-tuning T5 │
└─────────────────────────────────────────────────────────┘
|
v
┌─────────────────────────────────────────────────────────┐
│ Tokenization and Input Preprocessing │
│ - Hugging Face T5 Tokenizer (`T5Tokenizer`) │
│ - Converts text input (articles) into model-readable │
│ token sequences │
│ - Includes: │
│ - Truncation to max sequence length │
│ - Padding to ensure uniform sequence length │
│ - Prepares summary text as target labels for training │
└─────────────────────────────────────────────────────────┘2. Model Setup and Fine-Tuning
┌─────────────────────────────────────────────────────────┐
│ Model: Pretrained T5 │
│ - Framework: Hugging Face Transformers Library │
│ - Fine-tuned on the XSum dataset for the following tasks│
│ - Headline Generation │
│ - Abstractive Summarization │
│ - Configuration: │
│ - Pre-trained weights for T5 │
│ - Loss function: Cross-Entropy │
│ - Optimized for GPU acceleration │
└─────────────────────────────────────────────────────────┘
|
v
┌─────────────────────────────────────────────────────────┐
│ Training Pipeline │
│ - Uses Hugging Face Trainer API for fine-tuning │
│ - Configurations: │
│ - Batch size: Suitable for GPU │
│ - Learning rate scheduler for adaptive optimization │
│ - Gradient clipping to stabilize training │
│ - Logs metrics for validation set after each epoch: │
│ - Rouge Score │
└─────────────────────────────────────────────────────────┘3. Evaluation and Results
┌─────────────────────────────────────────────────────────┐
│ Model Evaluation │
│ - Test the fine-tuned model on unseen test data │
│ - Output: │
│ - Predicted summaries and headlines │
│ - Metrics for Quality Assessment: │
│ - Rouge Score (Precision, Recall, F1) │
│ - Sample Comparison: │
│ - Input Article vs. Generated Summary/Headline │
└─────────────────────────────────────────────────────────┘4. Saving the Model
┌─────────────────────────────────────────────────────────┐
│ Save Trained Model │
│ - Save fine-tuned T5 model weights │
│ - Save tokenizer for consistent input handling │
│ - Format: Compatible with Hugging Face Transformers │
└─────────────────────────────────────────────────────────┘
|
v
5. Generated Results Visualization
┌─────────────────────────────────────────────────────────┐
│ Results Presentation │
│ - Display comparison of: │
│ - Input Articles │
│ - Model Summaries and Headlines │
│ - Examine output quality for various test cases │
└─────────────────────────────────────────────────────────┘```
## Demo#### Input Example:
```plaintext"Apple introduces the latest iPhone with groundbreaking new features for photography, battery life, and user experience."
```#### Output Summary:
```plaintext"Apple launches new iPhone with improved features."
```#### Try the Gradio app to generate summaries instantly!
## Folder Structure
```plaintextHeadline Summarization Project
├── data # Dataset and preprocessing utilities
│ ├── xsum_dataset.py # Script to download and prepare the XSum dataset
└── models # Fine-tuned T5 models
├── t5_model.py # Core model training and evaluation scripts
└── checkpoints # Saved model weights and checkpoints
├── app # Gradio app for deployment
│ ├── app.py # Gradio UI script
└── static # UI-related assets (images, logos, etc.)
├── logs # Training and application logs
├── utils # Helper functions
├── tokenizer.py # Tokenizer utilities for preprocessing
└── metrics.py # Custom evaluation metrics
├── notebooks # Jupyter notebooks for exploratory analysis
└── README.md # Project documentation
```## Setup Instructions
### Prerequisites
```plaintext
Python 3.8 or higherPip or Conda for package management
```### Installation
#### Clone the repository:
```plaintext
git clone https://github.com/your-repo/headline-summarization.git
cd headline-summarization
```
#### Create a virtual environment:
```plaintext
python -m venv env
source env/bin/activate # On Windows: env\Scripts\activate
```
#### Install dependencies:
```plaintext
pip install -r requirements.txt
```
#### Download the XSum dataset and preprocess it:
```plaintext
python data/xsum_dataset.py
```
#### Fine-tune the T5 model (optional):
```plaintext
python models/t5_model.py
```
#### Run the Gradio app:
```plaintext
python app/app.py
```
## Usage#### Fine-Tuning the Model:
- Modify the t5_model.py script to adjust hyperparameters such as learning rate, batch size, and number of epochs. Execute the script to fine-tune the model on the XSum dataset.#### Launching the Gradio App:
- Run app.py to start the Gradio-based summarization tool locally. Access it at http://localhost:7860.#### Customization:
- Modify the tokenizer.py and metrics.py scripts in the utils folder to adjust tokenization or add custom evaluation metrics.## Key Components
##### 1. Data Preprocessing
```plaintext
Script: data/xsum_dataset.py
```
Highlights: Efficiently preprocesses the XSum dataset and prepares it for training using Hugging Face's datasets library.##### 2. Fine-Tuning
```plaintext
Script: models/t5_model.py
```
Highlights: Implements fine-tuning for the T5 model with options for checkpointing and logging.##### 3. Gradio Deployment
```plaintext
Script: app/app.py
```
Highlights: Deploys the model with a simple Gradio interface for real-time summarization.## Model Performance
| **Metric** | **Value** |
|-------------|-----------|
| ROUGE-1 | 45.67 |
| ROUGE-2 | 22.34 |
| ROUGE-L | 38.45 |## Tech Stack
- Programming Language: Python 3.8
- Frameworks/Libraries:
- Transformers (Hugging Face)
- PyTorch
- Gradio
- Dataset: XSum
- Visualization: Matplotlib, Seaborn
- Deployment: Gradio
## Challenges and Solutions
- Challenge: Managing large-scale data preprocessing.
- Solution: Leveraged the Hugging Face datasets library for efficient preprocessing.
- Challenge: Long training times.
- Solution: Used gradient accumulation and checkpointing for efficient training.
- Challenge: Simplifying deployment.
- Solution: Integrated Gradio for a seamless user experience.
## Future Enhancements
- Support for multilingual summarization.
- Integration with other datasets for broader generalization.
- Deployment as a web app using AWS or Google Cloud.
## Contributing
#### Contributions are welcome! Please follow these steps:
```plaintext
Fork the repository.
```
#### Create a new branch:
```plaintext
git checkout -b feature-name
```
#### Commit changes and push:
```plaintext
git add .
git commit -m "Add your message here"
git push origin feature-name
```
#### Open a pull request.