https://github.com/ozcanmiraay/bart-text-summarization
End-to-end text summarization pipeline using BART, MLflow, and FastAPI
- Host: GitHub
- URL: https://github.com/ozcanmiraay/bart-text-summarization
- Owner: ozcanmiraay
- Created: 2025-03-30T17:51:33.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-04-25T19:24:57.000Z (5 months ago)
- Last Synced: 2025-04-25T20:19:10.165Z (5 months ago)
- Topics: bart, deep-learning, fastapi, huggingface, mlflow, mps, nlp, text-summarization, transformers
- Language: Python
- Homepage:
- Size: 25.4 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Text Summarization: Model Benchmarking with BART, GPT-2 & MLflow
This project demonstrates a lightweight **summarization benchmarking pipeline** using [Hugging Face Transformers](https://github.com/huggingface/transformers), evaluated on the [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) dataset.
It is built for **reproducible experimentation**, with:
- Short training runs on pre-trained models like `facebook/bart-base` and `gpt2`
- Consistent evaluation using ROUGE metrics and generation length
- **MLflow logging** for tracking hyperparameters, metrics, and sample outputs
- A modular CLI-based script: `src/fine_tune.py` (*used for benchmarking, not fine-tuning*)

> This project is designed to explore how different model types and configurations affect summarization quality, not to perform full-scale fine-tuning.
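For context, here is a minimal sketch of the kind of summarization call the pipeline benchmarks. It is not taken from `src/fine_tune.py`; it only assumes the standard `transformers` summarization pipeline and the `facebook/bart-base` checkpoint (which is not summarization-tuned out of the box, which is exactly why short training runs are benchmarked):

```python
from transformers import pipeline

# Load a pre-trained BART checkpoint as a summarization pipeline.
summarizer = pipeline("summarization", model="facebook/bart-base")

article = (
    "The CNN/DailyMail dataset pairs news articles with human-written "
    "highlight summaries, which makes it a popular benchmark for "
    "abstractive summarization models."
)

# Generate a short abstractive summary; the length cap mirrors the
# --max_target_length 64 setting used in the benchmarking runs below.
summary = summarizer(article, max_length=64, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```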
---
### Demo Walkthrough
Want to see the project in action without setting it up?
**Watch the full demo here:**
[Google Drive Folder: Project Demo](https://drive.google.com/drive/folders/13d9DSMaWFTYVKHugG9ifXBiRbJEaF99F?usp=sharing)

Includes:
- MLflow walkthrough and run comparison
- Analysis outputs and visualizations
- Final model usage
- Live drift detection in action via FastAPI

---
### Project Structure
```
text_summarization_project/
├── checkpoints/             # Model checkpoints from small-scale runs
├── checkpoints_final/       # Final model checkpoint from best config
├── deployment/model/        # Saved tokenizer and model for inference
├── fine_tune_analysis/      # Analysis outputs (CSVs, plots)
├── mlruns/                  # MLflow run logs (auto-created)
├── src/
│   ├── fine_tune.py         # CLI-based benchmarking script
│   ├── run_sweep.py         # Loop over all model configs
│   ├── final_train.py       # Run best config at larger scale
│   ├── analysis_scripts/
│   │   ├── extract_fine_tune_results_mlflow.py
│   │   └── fine_tune_analysis.py
│   └── mlops_demo/
│       ├── inference_api.py   # FastAPI app to serve model
│       ├── demo_client.py     # Sends sample requests to server
│       ├── drift_monitor.py   # Drift detection logic
│       └── drift_analyzer.py  # Visualizes drift logs
├── requirements.txt
└── README.md
```
---

### Step 1: Setup Instructions
1. **Clone the repository**:
```bash
git clone https://github.com/yourusername/text_summarization_project.git
cd text_summarization_project
```
2. **Create and activate a virtual environment**:
```bash
python -m venv venv
source venv/bin/activate   # On macOS/Linux
# Or: venv\Scripts\activate  # On Windows
```

3. **Install dependencies**:
```bash
pip install -r requirements.txt
```
4. **(macOS only)** Add support for Apple Silicon (MPS) GPUs by setting this environment variable before importing `torch`:
```python
import os

# Disable the MPS memory high-watermark limit (allows allocations beyond the default cap).
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"
```

---

### Run a Benchmarking Trial
Use the CLI-based script at `src/fine_tune.py` to evaluate a summarization model with a chosen config.

**Example: Run BART with a small batch size and short input/output lengths**
```bash
python src/fine_tune.py \
--model_name_or_path facebook/bart-base \
--epochs 1 \
--batch_size 2 \
--max_input_length 256 \
--max_target_length 64
```

**Example: Run GPT-2 as a causal LM**
```bash
python src/fine_tune.py \
--model_name_or_path gpt2 \
--epochs 1 \
--batch_size 2 \
--max_input_length 256 \
--max_target_length 64 \
--use_causal_lm
```

This will:
- Run a short training + evaluation loop over ~5000 train and ~250 val examples
- Log losses, ROUGE scores, and runtime metrics to MLflow
- Save a few sample predictions as .txt files
- Track hyperparameters and outcomes for comparison across runs

#### Output Summary
Each run logs:

- Training & evaluation metrics (loss, ROUGE, speed)
- Token length stats for predictions and references
- Sample outputs: article, reference summary, model prediction
- Run-specific artifacts via MLflow (including config and example text files)

You can view these logs in the MLflow UI (see Step 2).
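As a point of reference, here is a hypothetical sketch of the kind of MLflow calls that produce such a run record; the parameter and metric names (and values) are illustrative placeholders, not the exact ones used by `src/fine_tune.py`:

```python
import mlflow

# Placeholder values for illustration only; the real script derives these from
# its CLI arguments and from the training/evaluation loop.
config = {"model_name_or_path": "facebook/bart-base", "epochs": 1, "batch_size": 2}
metrics = {"eval_loss": 2.4, "rouge1": 0.35, "rouge2": 0.15, "rougeL": 0.27}

with open("sample_prediction.txt", "w") as f:
    f.write("ARTICLE ...\nREFERENCE ...\nPREDICTION ...\n")

with mlflow.start_run(run_name="bart-base_benchmark_example"):
    mlflow.log_params(config)                     # hyperparameters for this run
    mlflow.log_metrics(metrics)                   # final evaluation metrics
    mlflow.log_artifact("sample_prediction.txt")  # an example prediction file
```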
---

### Step 2: Reproduce All Experiments
To run the exact 10 benchmarking experiments used in this project, use the `src/run_sweep.py` script. It loops through 10 combinations of:

- Model type (`facebook/bart-base`, `gpt2`)
- Epochs (1 or 2)
- Batch size (2 or 4)
- Input length (256 or 384)
- Target length (64 or 128)
- Model family (Seq2Seq or Causal LM)

Each experiment is run sequentially and logged via MLflow under the same "local-file" experiment name.
**Run All Benchmarking Sweeps**
```bash
python src/run_sweep.py
```

This will:
- Run all 10 model configuration experiments defined in the sweep list
- Automatically handle BART vs. GPT-2 configuration logic
- Log all metrics, ROUGE scores, and samples to MLflow
- Sleep for 5 seconds between runs to ensure system stability

> Make sure you've already set up your environment and installed requirements before running the sweep.
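For orientation, a minimal sketch of what such a sweep loop can look like; the grid below is hypothetical (only two of the swept settings), and `src/run_sweep.py` itself may be structured differently:

```python
import itertools
import subprocess
import time

# Hypothetical sweep grid; the real script defines its own 10 configurations.
models = ["facebook/bart-base", "gpt2"]
input_lengths = [256, 384]

for model, max_input in itertools.product(models, input_lengths):
    cmd = [
        "python", "src/fine_tune.py",
        "--model_name_or_path", model,
        "--epochs", "1",
        "--batch_size", "2",
        "--max_input_length", str(max_input),
        "--max_target_length", "64",
    ]
    if model == "gpt2":
        cmd.append("--use_causal_lm")  # GPT-2 is benchmarked as a causal LM
    subprocess.run(cmd, check=True)
    time.sleep(5)  # brief pause between runs, mirroring the script's stability delay
```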
After it's done, view all runs in one place using the MLflow UI:
```bash
mlflow ui
```
Then go to http://127.0.0.1:5000 and browse the experiment "local-file".

---
### Step 3: Analyze Results
Once all experiments have been logged via MLflow, you can extract and analyze the benchmarking results using the following two scripts under `src/analysis_scripts/`.
---
#### `extract_fine_tune_results_mlflow.py`
This script exports all run metadata and final metrics into a single, clean `.csv` file for further analysis or visualization.
It captures:
- Model config parameters (e.g., model type, batch size, input/output lengths)
- Final ROUGE scores, eval loss, and training speed
- Run info such as status and start time

**Run it after completing all experiments**:
```bash
python src/analysis_scripts/extract_fine_tune_results_mlflow.py
```

This will save the file to:
```
fine_tune_analysis/mlflow_all_model_runs.csv
```
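Under the hood, this kind of export typically relies on `mlflow.search_runs`; a hypothetical sketch (the project's script may filter, rename, or add columns differently):

```python
import mlflow

# Collect every run of the experiment into a flat DataFrame of params and metrics.
runs = mlflow.search_runs(experiment_names=["local-file"])

# Keep a few illustrative column groups; the real export writes whatever it needs.
# Assumes the fine_tune_analysis/ directory already exists.
columns = [c for c in runs.columns if c.startswith(("params.", "metrics.", "status", "start_time"))]
runs[columns].to_csv("fine_tune_analysis/mlflow_all_model_runs.csv", index=False)
```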
---

#### `fine_tune_analysis.py`
This script loads all MLflow runs directly, filters the key metrics and parameters, and produces:
- Leaderboards for best/worst runs
- A runtime vs. ROUGE-1 scatter plot
- A ROUGE-1/2/L bar chart (best run per model)
- A parallel coordinates plot showing hyperparameter impact
- A `summary.csv` file for downstream Excel / pandas analysis

**Run it anytime after experiments are logged**:
```bash
python src/analysis_scripts/fine_tune_analysis.py
```

This will output:
- 3 figures saved to: `fine_tune_analysis/`
- CSV summary: `fine_tune_analysis/summary.csv`

> Make sure MLflow still points to the same `mlruns/` directory.
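If you want to recreate a plot like the runtime vs. ROUGE-1 scatter yourself, a hypothetical sketch is below; the column names are assumptions about `summary.csv`, so check the file for the actual headers:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("fine_tune_analysis/summary.csv")

# Assumed column names; adjust to match the real summary.csv.
plt.scatter(df["train_runtime"], df["rouge1"])
plt.xlabel("Training runtime (s)")
plt.ylabel("ROUGE-1")
plt.title("Runtime vs. ROUGE-1 across benchmark runs")
plt.savefig("fine_tune_analysis/runtime_vs_rouge1_custom.png")
```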
---
### Step 4: Run the Best Model at Scale
After completing the sweep and analysis steps, we identified the best-performing configuration (based on ROUGE-1 and eval loss) and ran it on a **larger dataset slice** for a more robust final evaluation.
This was done using the `src/final_train.py` script.
---
#### `final_train.py`
This script reruns the top configuration from the sweep:
- `facebook/bart-base`
- `1` epoch
- `batch_size=2`
- `max_input_length=384`, `max_target_length=64`

...but scales up the dataset:
- `train`: 25,000 examples
- `validation`: 1,250 examples
- `test`: 1,250 examples

It performs:
- Full training on the larger training split
- ROUGE-based evaluation on both validation and test sets
- Logging of metrics, lengths, and prediction examples to MLflow
- Saving of the final model and tokenizer to `deployment/model/` for downstream use
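A model and tokenizer saved this way are typically written out with the Hugging Face `save_pretrained` API; a minimal, self-contained sketch (illustrative only, and it loads the base checkpoint rather than the fine-tuned weights the real script saves):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# In final_train.py the model would already be trained at this point; here we just
# load the base checkpoint to illustrate the save call.
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

# Persist both pieces so downstream tools (e.g. the FastAPI app) can reload them.
model.save_pretrained("deployment/model")
tokenizer.save_pretrained("deployment/model")
```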
---

#### Run the final training script
```bash
python src/final_train.py
```

This will:
- Log a new MLflow run named `bart-base_final_benchmark_config`
- Store the final model under: `deployment/model/`
- Log validation and test ROUGE scores to MLflow
- Upload sample predictions to MLflow as artifacts

> The saved model can now be reused for inference or integrated into a downstream application or API.
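For example, the checkpoint in `deployment/model/` can be reloaded for quick ad-hoc inference; a sketch assuming the standard `transformers` loading path (Step 5 shows the API-integrated version):

```python
from transformers import pipeline

# Load the locally saved model and tokenizer from the deployment directory.
summarizer = pipeline("summarization", model="deployment/model", tokenizer="deployment/model")
print(summarizer("Some article text to summarize...", max_length=64, do_sample=False)[0]["summary_text"])
```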
---
### Step 5: MLOps Demo (Inference & Drift Monitoring)
This project includes a complete, lightweight **MLOps simulation** using FastAPI and basic drift monitoring heuristics.
---
#### Start the Inference Server (Terminal 1)
This will load the saved model from `deployment/model/` and expose a `/summarize` endpoint.
```bash
uvicorn src.mlops_demo.inference_api:app --reload --port 8000
```

The server will:
- Tokenize and summarize incoming input
- Log summary token length
- Call drift monitors on:
  - Input entropy & readability
  - Summary length deviation
  - Embedding-based cosine drift

Drift logs are saved to:
```
mlops_demo/drift_logs.log
mlops_demo/alerts.json
```
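For orientation, here is a minimal sketch of what such a service can look like; the request schema (a JSON `text` field) and the internals are assumptions, so see `inference_api.py` for the actual implementation:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load the saved model and tokenizer once at startup.
summarizer = pipeline("summarization", model="deployment/model", tokenizer="deployment/model")

class SummarizeRequest(BaseModel):
    text: str  # assumed request schema

@app.post("/summarize")
def summarize(req: SummarizeRequest):
    result = summarizer(req.text, max_length=64, min_length=10, do_sample=False)
    summary = result[0]["summary_text"]
    # The real app also logs summary token lengths and calls the drift monitors here.
    return {"summary": summary}
```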
---

#### Run the Client Simulator (Terminal 2)
This script sends both real CNN/DailyMail articles and synthetic "drift-triggering" inputs to the server.
```bash
python src/mlops_demo/demo_client.py
```

The script simulates:
- Normal requests from the dataset (real-world cases)
- Drift cases including:
  - Short or very long inputs
  - Low entropy (repeating tokens)
  - High entropy (gibberish)
  - Embedding-based novelty

Each result includes latency, token count, and a truncated summary.
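A hypothetical sketch of the kind of request the client sends (the JSON field name and response key are assumptions; see `demo_client.py` for the actual payload):

```python
import time
import requests

article = "Example article text to summarize..."

start = time.time()
resp = requests.post("http://127.0.0.1:8000/summarize", json={"text": article}, timeout=60)
latency = time.time() - start

summary = resp.json().get("summary", "")
print(f"{latency:.2f}s | {len(summary.split())} words | {summary[:80]}...")
```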
---
#### Drift Logging & Monitoring
The following drift types are monitored in `drift_monitor.py`:
- **Output Length Drift**
Triggers when average summary length deviates from baseline (56 tokens ±10)
- **Input Entropy Drift**
Flags abnormally repetitive or highly chaotic input
- **Embedding Drift**
Computes cosine distance to a reference embedding baseline

Drift alerts are logged to:
- `mlops_demo/drift_logs.log` (for audit/debug)
- `mlops_demo/alerts.json` (for alert storage)
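As a rough illustration of the length-drift rule above, a minimal sketch using the stated baseline (the actual heuristics live in `drift_monitor.py` and may differ):

```python
BASELINE_MEAN_TOKENS = 56   # expected average summary length
TOLERANCE_TOKENS = 10       # allowed deviation around the baseline

def output_length_drift(recent_lengths: list[int]) -> bool:
    """Flag drift when the average summary length leaves the 56 ±10 token band."""
    avg = sum(recent_lengths) / len(recent_lengths)
    return abs(avg - BASELINE_MEAN_TOKENS) > TOLERANCE_TOKENS

# Example: summaries that keep coming back much shorter than usual trigger an alert.
print(output_length_drift([30, 28, 35, 31]))  # True (average 31 is well below the band)
```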
---

#### Visualize Drift Over Time
Use the following script to generate 4 time-series plots based on the logs:
```bash
python src/mlops_demo/drift_analyzer.py
```

It will show:
- Input length over time
- Input entropy and entropy-based drift
- Summary length and output drift
- Embedding cosine distance vs. threshold

---
> **Note:** You must run the server (`uvicorn ...`) and the client (`python demo_client.py`) in **separate terminals** at the same time.
---
### Next Steps
- Add Docker support for containerized deployment
- Integrate a real-time metrics dashboard for live monitoring
- Experiment with larger models (e.g., `bart-large`, `t5-base`)
- Incorporate LoRA or quantization for efficient fine-tuning
- Deploy the API to a cloud platform (e.g., Hugging Face Spaces, Render)

> Contributions and ideas are welcome!
---
### Contact

Built by **Miray Özcan**
- Email: `miray@uni.minerva.edu`
- LinkedIn: [linkedin.com/in/mirayozcan](https://linkedin.com/in/mirayozcan)

> If you found this useful or want to collaborate, feel free to reach out!