https://github.com/ozcanmiraay/bart-text-summarization
End-to-end text summarization pipeline using BART, MLflow, and FastAPI
https://github.com/ozcanmiraay/bart-text-summarization
bart deep-learning fastapi huggingface mlflow mps nlp text-summarization transformers
Last synced: 28 days ago
JSON representation
End-to-end text summarization pipeline using BART, MLflow, and FastAPI
- Host: GitHub
- URL: https://github.com/ozcanmiraay/bart-text-summarization
- Owner: ozcanmiraay
- Created: 2025-03-30T17:51:33.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-25T19:24:57.000Z (about 1 year ago)
- Last Synced: 2025-04-25T20:19:10.165Z (about 1 year ago)
- Topics: bart, deep-learning, fastapi, huggingface, mlflow, mps, nlp, text-summarization, transformers
- Language: Python
- Homepage:
- Size: 25.4 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# π§ Text Summarization: Model Benchmarking with BART, GPT-2 & MLflow
This project demonstrates a lightweight **summarization benchmarking pipeline** using [Hugging Face Transformers](https://github.com/huggingface/transformers), evaluated on the [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) dataset.
It is built for **reproducible experimentation**, with:
- Short training runs on pre-trained models like `facebook/bart-base` and `gpt2`
- Consistent evaluation using ROUGE metrics and generation length
- **MLflow logging** for tracking hyperparameters, metrics, and sample outputs
- A modular CLI-based script: `src/fine_tune.py` (*used for benchmarking, not fine-tuning*)
> π‘ This project is designed to explore how different model types and configurations affect summarization quality β not to perform full-scale fine-tuning.
---
### π₯ Demo Walkthrough
Want to see the project in action without setting it up?
β
**Watch the full demo here:**
[π Google Drive Folder β Project Demo](https://drive.google.com/drive/folders/13d9DSMaWFTYVKHugG9ifXBiRbJEaF99F?usp=sharing)
Includes:
- MLflow walkthrough and run comparison
- Analysis outputs and visualizations
- Final model usage
- Live drift detection in action via FastAPI
---
### π§± Project Structure
```
text_summarization_project/
βββ checkpoints/ # Model checkpoints from small-scale runs
βββ checkpoints_final/ # Final model checkpoint from best config
βββ deployment/model/ # Saved tokenizer and model for inference
βββ fine_tune_analysis/ # Analysis outputs (CSVs, plots)
βββ mlruns/ # MLflow run logs (auto-created)
βββ src/
β βββ fine_tune.py # CLI-based benchmarking script
β βββ run_sweep.py # Loop over all model configs
β βββ final_train.py # Run best config at larger scale
β βββ analysis_scripts/
β β βββ extract_fine_tune_results_mlflow.py
β β βββ fine_tune_analysis.py
β βββ mlops_demo/
β βββ inference_api.py # FastAPI app to serve model
β βββ demo_client.py # Sends sample requests to server
β βββ drift_monitor.py # Drift detection logic
β βββ drift_analyzer.py # Visualizes drift logs
βββ requirements.txt
βββ README.md
```
---
### βοΈ Step 1: Setup Instructions
1. **Clone the repository**:
```bash
git clone https://github.com/yourusername/text_summarization_project.git
cd text_summarization_project
```
Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On macOS/Linux
# Or: venv\Scripts\activate # On Windows
```
Install dependencies:
```bash
pip install -r requirements.txt
```
(macOS only): Add support for Apple Silicon GPUs:
```python
import os
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"
```
π Run a Benchmarking Trial
Use the CLI-based script at src/fine_tune.py to evaluate a summarization model with a chosen config.
βΆοΈ Example: Run BART with small batch and short input/output lengths
```bash
python src/fine_tune.py \
--model_name_or_path facebook/bart-base \
--epochs 1 \
--batch_size 2 \
--max_input_length 256 \
--max_target_length 64
```
βΆοΈ Example: Run GPT-2 as a causal LM
```bash
python src/fine_tune.py \
--model_name_or_path gpt2 \
--epochs 1 \
--batch_size 2 \
--max_input_length 256 \
--max_target_length 64 \
--use_causal_lm
```
This will:
- Run a short training + evaluation loop over ~5000 train and ~250 val examples
- Log losses, ROUGE scores, and runtime metrics to MLflow
- Save a few sample predictions as .txt files
- Track hyperparameters and outcomes for comparison across runs
π Output Summary
Each run logs:
- π Training & evaluation metrics (loss, ROUGE, speed)
- π Token length stats for predictions and references
- π Sample outputs: article, reference summary, model prediction
- π§Ύ Run-specific artifacts via MLflow (including config and example text files)
- π View MLflow Logs
---
### π§ͺ Step 2: Reproduce All Experiments
To run the exact 10 benchmarking experiments used in this project, use the src/run_sweep.py script. This script loops through 10 combinations of:
- Model type (facebook/bart-base, gpt2)
- Epochs (1 or 2)
- Batch size (2 or 4)
- Input length (256 or 384)
- Target length (64 or 128)
- Model family (Seq2Seq or Causal LM)
Each experiment is run sequentially and logged via MLflow under the same "local-file" experiment name.
βΆοΈ Run All Benchmarking Sweeps
```bash
python src/run_sweep.py
```
This will:
- Run all 10 model configuration experiments defined in the sweep list
- Automatically handle BART vs. GPT-2 configuration logic
- Log all metrics, ROUGE scores, and samples to MLflow
- Sleep for 5 seconds between runs to ensure system stability
β οΈ Make sure you've already set up your environment and installed requirements before running the sweep.
After it's done, view all runs in one place using the MLflow UI:
```bash
mlflow ui
```
Then go to: http://127.0.0.1:5000 and browse the experiment "local-file".
---
### π Step 3: Analyze Results
Once all experiments have been logged via MLflow, you can extract and analyze the benchmarking results using the following two scripts under `src/analysis_scripts/`.
---
#### π `extract_fine_tune_results_mlflow.py`
This script exports all run metadata and final metrics into a single, clean `.csv` file for further analysis or visualization.
It captures:
- Model config parameters (e.g., model type, batch size, input/output lengths)
- Final ROUGE scores, eval loss, and training speed
- Run info such as status and start time
β
**Run it after completing all experiments**:
```bash
python src/analysis_scripts/extract_fine_tune_results_mlflow.py
```
This will save the file to:
```
fine_tune_analysis/mlflow_all_model_runs.csv
```
---
#### π `fine_tune_analysis.py`
This script loads all MLflow runs directly, filters the key metrics and parameters, and produces:
- β
Leaderboards for best/worst runs
- π A runtime vs. ROUGE-1 scatter plot
- π A ROUGE-1/2/L bar chart (best run per model)
- π A parallel coordinates plot showing hyperparameter impact
- π A `summary.csv` file for downstream Excel / pandas analysis
β
**Run it anytime after experiments are logged**:
```bash
python src/analysis_scripts/fine_tune_analysis.py
```
This will output:
- 3 figures saved to: `fine_tune_analysis/`
- CSV summary: `fine_tune_analysis/summary.csv`
> π Make sure MLflow still points to the same `mlruns/` directory.
---
### π Step 4: Run the Best Model at Scale
After completing the sweep and analysis steps, we identified the best-performing configuration (based on ROUGE-1 and eval loss) and ran it on a **larger dataset slice** for a more robust final evaluation.
This was done using the `src/final_train.py` script.
---
#### π `final_train.py`
This script reruns the top configuration from the sweep:
- `facebook/bart-base`
- `1` epoch
- `batch_size=2`
- `max_input_length=384`, `max_target_length=64`
...but scales up the dataset:
- `train`: 25,000 examples
- `validation`: 1,250 examples
- `test`: 1,250 examples
It performs:
- β
Full training on the larger training split
- π ROUGE-based evaluation on both validation and test sets
- π Logging of metrics, lengths, and prediction examples to MLflow
- πΎ Saving of the final model and tokenizer to `deployment/model/` for downstream use
---
#### βΆοΈ Run the final training script
```bash
python src/final_train.py
```
This will:
- Log a new MLflow run named `bart-base_final_benchmark_config`
- Store the final model under: `deployment/model/`
- Log validation and test ROUGE scores to MLflow
- Upload sample predictions to MLflow as artifacts
> π¦ The saved model can now be reused for inference or integrated into a downstream application or API.
---
### βοΈ Step 5: MLOps Demo β Inference & Drift Monitoring
This project includes a complete, lightweight **MLOps simulation** using FastAPI and basic drift monitoring heuristics.
---
#### π°οΈ Start the Inference Server (Terminal 1)
This will load the saved model from `deployment/model/` and expose a `/summarize` endpoint.
```bash
uvicorn src.mlops_demo.inference_api:app --reload --port 8000
```
The server will:
- Tokenize and summarize incoming input
- Log summary token length
- Call drift monitors on:
- Input entropy & readability
- Summary length deviation
- Embedding-based cosine drift
Drift logs are saved to:
```
mlops_demo/drift_logs.log
mlops_demo/alerts.json
```
---
#### π€ Run the Client Simulator (Terminal 2)
This script sends both real CNN/DailyMail articles and synthetic "drift-triggering" inputs to the server.
```bash
python src/mlops_demo/demo_client.py
```
The script simulates:
- Normal requests from the dataset (real-world cases)
- Drift cases including:
- Short or very long inputs
- Low entropy (repeating tokens)
- High entropy (gibberish)
- Embedding-based novelty
Each result includes latency, token count, and a truncated summary.
---
#### π Drift Logging & Monitoring
The following drift types are monitored in `drift_monitor.py`:
- **Output Length Drift**
Triggers when average summary length deviates from baseline (56 tokens Β±10)
- **Input Entropy Drift**
Flags abnormally repetitive or highly chaotic input
- **Embedding Drift**
Computes cosine distance to a reference embedding baseline
Drift alerts are logged to:
- `mlops_demo/drift_logs.log` (for audit/debug)
- `mlops_demo/alerts.json` (for alert storage)
---
#### π Visualize Drift Over Time
Use the following script to generate 4 time-series plots based on the logs:
```bash
python src/mlops_demo/drift_analyzer.py
```
It will show:
- Input length over time
- Input entropy and entropy-based drift
- Summary length and output drift
- Embedding cosine distance vs. threshold
---
> π₯οΈ **Note:** You must run the server (`uvicorn ...`) and the client (`python demo_client.py`) in **separate terminals** at the same time.
---
### π Next Steps
- β
Add Docker support for containerized deployment
- β
Integrate real-time metrics dashboard for live monitoring
- π§ͺ Experiment with larger models (e.g., `bart-large`, `t5-base`)
- π§ Incorporate LoRA or quantization for efficient fine-tuning
- π Deploy API to a cloud platform (e.g., Hugging Face Spaces, Render)
> Contributions and ideas are welcome!
---
### π Contact
Built by **Miray Γzcan**
π§ `miray@uni.minerva.edu`
π [linkedin.com/in/mirayozcan](https://linkedin.com/in/mirayozcan)
> If you found this useful or want to collaborate, feel free to reach out!