# Headline Generation and Summarization of News Articles with T5 and XSum Dataset

## Overview

This project implements a news headline generation and summarization pipeline built on the T5 transformer model fine-tuned on the XSum dataset. It provides a user-friendly Gradio interface that generates concise headlines and summaries for long articles or paragraphs. The project showcases end-to-end machine learning capabilities, from preprocessing and fine-tuning to deployment with a modern UI.

## Features

- Fine-tuned T5 transformer model for high-quality text summarization.

- Preprocessing pipeline for efficient dataset preparation.

- Integrated Gradio interface for real-time summarization.

- Modular code structure for ease of customization and scalability.

- Detailed logging for debugging and monitoring.

## Project Workflow
```plaintext
1. Dataset and Data Preparation
┌───────────────────────────────────────────────────────────┐
│ Dataset: XSum (Extreme Summarization Dataset)             │
│ - Source: Hugging Face Datasets Library                   │
│ - Purpose: News article summarization and headline gen    │
└───────────────────────────────────────────────────────────┘
                              |
                              v
┌───────────────────────────────────────────────────────────┐
│ Data Loading                                              │
│ - Download the XSum dataset                               │
│ - Automatically splits data into Train, Validation,       │
│   and Test sets using Hugging Face's `load_dataset` API   │
│ - Ensures compatibility for fine-tuning T5                │
└───────────────────────────────────────────────────────────┘
                              |
                              v
┌───────────────────────────────────────────────────────────┐
│ Tokenization and Input Preprocessing                      │
│ - Hugging Face T5 Tokenizer (`T5Tokenizer`)               │
│ - Converts text input (articles) into model-readable      │
│   token sequences                                         │
│ - Includes:                                               │
│   - Truncation to max sequence length                     │
│   - Padding to ensure uniform sequence length             │
│ - Prepares summary text as target labels for training     │
└───────────────────────────────────────────────────────────┘

2. Model Setup and Fine-Tuning
┌───────────────────────────────────────────────────────────┐
│ Model: Pretrained T5                                      │
│ - Framework: Hugging Face Transformers Library            │
│ - Fine-tuned on the XSum dataset for the following tasks: │
│   - Headline Generation                                   │
│   - Abstractive Summarization                             │
│ - Configuration:                                          │
│   - Pre-trained weights for T5                            │
│   - Loss function: Cross-Entropy                          │
│   - Optimized for GPU acceleration                        │
└───────────────────────────────────────────────────────────┘
                              |
                              v
┌───────────────────────────────────────────────────────────┐
│ Training Pipeline                                         │
│ - Uses Hugging Face Trainer API for fine-tuning           │
│ - Configurations:                                         │
│   - Batch size chosen to fit GPU memory                   │
│   - Learning rate scheduler for adaptive optimization     │
│   - Gradient clipping to stabilize training               │
│ - Logs metrics for validation set after each epoch:       │
│   - ROUGE score                                           │
└───────────────────────────────────────────────────────────┘

3. Evaluation and Results
┌───────────────────────────────────────────────────────────┐
│ Model Evaluation                                          │
│ - Test the fine-tuned model on unseen test data           │
│ - Output:                                                 │
│   - Predicted summaries and headlines                     │
│ - Metrics for Quality Assessment:                         │
│   - ROUGE score (Precision, Recall, F1)                   │
│ - Sample Comparison:                                      │
│   - Input Article vs. Generated Summary/Headline          │
└───────────────────────────────────────────────────────────┘

4. Saving the Model
┌───────────────────────────────────────────────────────────┐
│ Save Trained Model                                        │
│ - Save fine-tuned T5 model weights                        │
│ - Save tokenizer for consistent input handling            │
│ - Format: Compatible with Hugging Face Transformers       │
└───────────────────────────────────────────────────────────┘
                              |
                              v
5. Generated Results Visualization
┌───────────────────────────────────────────────────────────┐
│ Results Presentation                                      │
│ - Display comparison of:                                  │
│   - Input Articles                                        │
│   - Model Summaries and Headlines                         │
│ - Examine output quality for various test cases           │
└───────────────────────────────────────────────────────────┘

```
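
The data-preparation stage above maps onto a few standard Hugging Face calls. Below is a minimal sketch, not the project's actual script: the `t5-small` checkpoint, the `summarize:` task prefix, and the 512/64 token limits are illustrative assumptions.

```python
# Minimal sketch of the data-preparation stage; the model size, task prefix,
# and token limits are assumptions, not the project's actual configuration.
from datasets import load_dataset
from transformers import T5Tokenizer

dataset = load_dataset("xsum")  # yields train/validation/test splits
tokenizer = T5Tokenizer.from_pretrained("t5-small")

def preprocess(batch):
    # XSum pairs a full article ("document") with a one-sentence "summary".
    inputs = tokenizer(
        ["summarize: " + doc for doc in batch["document"]],
        max_length=512, truncation=True, padding="max_length",
    )
    labels = tokenizer(
        batch["summary"], max_length=64, truncation=True, padding="max_length",
    )
    inputs["labels"] = labels["input_ids"]  # summaries become target labels
    return inputs

tokenized = dataset.map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)
```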
## Demo

#### Input Example:
```plaintext
"Apple introduces the latest iPhone with groundbreaking new features for photography, battery life, and user experience."
```

#### Output Summary:
```plaintext
"Apple launches new iPhone with improved features."
```
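
Behind the demo, producing a summary like this takes only a few lines. A hedged sketch, assuming the fine-tuned model and tokenizer were saved under `models/checkpoints` (the path and generation settings are illustrative):

```python
# Illustrative inference sketch; the checkpoint path and generation
# settings are assumptions, not the project's exact configuration.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_dir = "models/checkpoints"  # hypothetical path to the fine-tuned weights
tokenizer = T5Tokenizer.from_pretrained(model_dir)
model = T5ForConditionalGeneration.from_pretrained(model_dir)

article = ("Apple introduces the latest iPhone with groundbreaking new features "
           "for photography, battery life, and user experience.")

# T5 is a text-to-text model, so the task is signalled with a prefix.
inputs = tokenizer("summarize: " + article, return_tensors="pt",
                   truncation=True, max_length=512)
summary_ids = model.generate(**inputs, max_length=32, num_beams=4,
                             early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```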

#### Try the Gradio app to generate summaries instantly!

## Folder Structure
```plaintext
Headline Summarization Project
├── data                   # Dataset and preprocessing utilities
│   └── xsum_dataset.py    # Script to download and prepare the XSum dataset
├── models                 # Fine-tuned T5 models
│   ├── t5_model.py        # Core model training and evaluation scripts
│   └── checkpoints        # Saved model weights and checkpoints
├── app                    # Gradio app for deployment
│   ├── app.py             # Gradio UI script
│   └── static             # UI-related assets (images, logos, etc.)
├── logs                   # Training and application logs
├── utils                  # Helper functions
│   ├── tokenizer.py       # Tokenizer utilities for preprocessing
│   └── metrics.py         # Custom evaluation metrics
├── notebooks              # Jupyter notebooks for exploratory analysis
└── README.md              # Project documentation
```

## Setup Instructions

### Prerequisites
```plaintext
Python 3.8 or higher

Pip or Conda for package management
```

### Installation

#### Clone the repository:
```plaintext
git clone https://github.com/bhawnamehbubani/headline_generation_and_summarization_of_news_articles_with_t5_and_xsum_dataset.git
cd headline_generation_and_summarization_of_news_articles_with_t5_and_xsum_dataset
```
#### Create a virtual environment:
```plaintext
python -m venv env
source env/bin/activate # On Windows: env\Scripts\activate
```
#### Install dependencies:
```plaintext
pip install -r requirements.txt
```
#### Download the XSum dataset and preprocess it:
```plaintext
python data/xsum_dataset.py
```
#### Fine-tune the T5 model (optional):
```plaintext
python models/t5_model.py
```
#### Run the Gradio app:
```plaintext
python app/app.py
```
## Usage

#### Fine-Tuning the Model:
- Modify the `t5_model.py` script to adjust hyperparameters such as the learning rate, batch size, and number of epochs, then run it to fine-tune the model on the XSum dataset.

#### Launching the Gradio App:
- Run `app.py` to start the Gradio-based summarization tool locally, then open http://localhost:7860 in your browser.

#### Customization:
- Modify `tokenizer.py` and `metrics.py` in the `utils` folder to adjust tokenization or add custom evaluation metrics.

## Key Components

##### 1. Data Preprocessing
```plaintext
Script: data/xsum_dataset.py
```
Highlights: Efficiently preprocesses the XSum dataset and prepares it for training using Hugging Face's datasets library.

##### 2. Fine-Tuning
```plaintext
Script: models/t5_model.py
```
Highlights: Implements fine-tuning for the T5 model with options for checkpointing and logging.
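
As a rough illustration, a Trainer-based setup along the following lines would cover the fine-tuning, checkpointing, and logging described. Every hyperparameter below is a placeholder, argument names can vary slightly across `transformers` versions, and `tokenized` refers to the splits from the data-preparation sketch above:

```python
# Hedged fine-tuning sketch using the Seq2Seq Trainer API; all values
# are illustrative, not the project's actual configuration.
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          T5ForConditionalGeneration, T5Tokenizer)

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

args = Seq2SeqTrainingArguments(
    output_dir="models/checkpoints",   # periodic checkpointing
    per_device_train_batch_size=8,     # sized to fit a single GPU
    gradient_accumulation_steps=4,     # simulates a larger effective batch
    learning_rate=3e-4,
    num_train_epochs=3,
    eval_strategy="epoch",             # validation metrics after each epoch
    save_strategy="epoch",
    max_grad_norm=1.0,                 # gradient clipping
    logging_dir="logs",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("models/checkpoints")  # weights + config, HF-compatible
```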

##### 3. Gradio Deployment
```plaintext
Script: app/app.py
```
Highlights: Deploys the model with a simple Gradio interface for real-time summarization.
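
For orientation, a minimal `app.py` along these lines would reproduce the behavior described; the checkpoint path, labels, and generation limits are assumptions, not the project's actual code:

```python
# Minimal Gradio sketch; the project's app.py may wire things differently.
import gradio as gr
from transformers import pipeline

# Summarization pipeline wrapping the fine-tuned checkpoint (path assumed).
summarizer = pipeline("summarization", model="models/checkpoints")

def summarize(text: str) -> str:
    # Return the single best summary for the pasted article or paragraph.
    return summarizer(text, max_length=32, min_length=5)[0]["summary_text"]

demo = gr.Interface(
    fn=summarize,
    inputs=gr.Textbox(lines=8, label="Article or paragraph"),
    outputs=gr.Textbox(label="Generated summary"),
    title="Headline Generation and Summarization with T5",
)

if __name__ == "__main__":
    demo.launch()  # serves on http://localhost:7860 by default
```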

## Model Performance

| **Metric** | **Value** |
|-------------|-----------|
| ROUGE-1 | 45.67 |
| ROUGE-2 | 22.34 |
| ROUGE-L | 38.45 |
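
Scores in this range are conventionally ROUGE F1 values scaled to 0–100. For reference, a minimal sketch of computing them with the `rouge-score` package (the project's own `metrics.py` may differ):

```python
# Minimal ROUGE sketch using the rouge-score package
# (pip install rouge-score); the project's metrics.py may differ.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)

reference = "Apple launches new iPhone with improved features."
prediction = "Apple unveils its latest iPhone with upgraded features."

for name, score in scorer.score(reference, prediction).items():
    # Each entry carries precision, recall, and F1 (fmeasure).
    print(f"{name}: P={score.precision:.2f} "
          f"R={score.recall:.2f} F1={score.fmeasure:.2f}")
```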

## Tech Stack

- Programming Language: Python 3.8+

- Frameworks/Libraries:
  - Transformers (Hugging Face)
  - PyTorch
  - Gradio

- Dataset: XSum

- Visualization: Matplotlib, Seaborn

- Deployment: Gradio

## Challenges and Solutions

- Challenge: Managing large-scale data preprocessing.
  - Solution: Leveraged the Hugging Face `datasets` library for efficient preprocessing.

- Challenge: Long training times.
  - Solution: Used gradient accumulation and checkpointing for efficient training.

- Challenge: Simplifying deployment.
  - Solution: Integrated Gradio for a seamless user experience.

## Future Enhancements

- Support for multilingual summarization.

- Integration with other datasets for broader generalization.

- Deployment as a web app using AWS or Google Cloud.

## Contributing

#### Contributions are welcome! Please follow these steps:
```plaintext
Fork the repository.
```
#### Create a new branch:
```plaintext
git checkout -b feature-name
```
#### Commit changes and push:
```plaintext
git add .
git commit -m "Add your message here"
git push origin feature-name
```
#### Open a pull request.