{"id":24874812,"url":"https://github.com/mohankrishnagr/infosys_springboard_text-summarization","last_synced_at":"2026-02-21T04:01:39.769Z","repository":{"id":246541374,"uuid":"821428712","full_name":"MohanKrishnaGR/Infosys_Springboard_Text-Summarization","owner":"MohanKrishnaGR","description":"GROUP 4. This repository contains the implementation of a Transformer-based model for abstractive text summarization and a rule-based approach for extractive text summarization.","archived":false,"fork":false,"pushed_at":"2024-07-17T14:58:26.000Z","size":9969,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-12T03:06:16.761Z","etag":null,"topics":["automatic-summarization","bart","datasets","deep-learning","kmeans-clustering","nlp","pytorch","tf-idf","trainer","transformer"],"latest_commit_sha":null,"homepage":"http://text-summarizer.bqegenbyedfzhpa3.centralindia.azurecontainer.io:8000/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MohanKrishnaGR.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-28T14:13:44.000Z","updated_at":"2025-03-15T18:10:24.000Z","dependencies_parsed_at":"2024-06-28T15:45:15.916Z","dependency_job_id":"c5e74823-57af-45fc-aa3e-95c4beb6433b","html_url":"https://github.com/MohanKrishnaGR/Infosys_Springboard_Text-Summarization","commit_stats":null,"previous_names":["mohankrishnagr/infosys_springboard_text-summarization"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/MohanKrishnaGR/
Infosys_Springboard_Text-Summarization","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MohanKrishnaGR%2FInfosys_Springboard_Text-Summarization","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MohanKrishnaGR%2FInfosys_Springboard_Text-Summarization/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MohanKrishnaGR%2FInfosys_Springboard_Text-Summarization/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MohanKrishnaGR%2FInfosys_Springboard_Text-Summarization/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MohanKrishnaGR","download_url":"https://codeload.github.com/MohanKrishnaGR/Infosys_Springboard_Text-Summarization/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MohanKrishnaGR%2FInfosys_Springboard_Text-Summarization/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29672754,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-21T03:11:15.450Z","status":"ssl_error","status_checked_at":"2026-02-21T03:10:34.920Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automatic-summarization","bart","datasets","deep-learning","kmeans-clustering","nlp","pytorch","tf-idf","trainer","transformer"],"created_at":"2025-02-01T07:28:47.448Z","updated_at":"2026-02-21T04:01:39.752Z","avatar_url":"https://github.com/MohanKrishnaGR.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align='center'\u003e\n    \u003ca\u003e\u003cimg src=\"https://i.ibb.co/P4B4LrL/springboard-logo-removebg-preview.png\" alt=\"springboard-logo-removebg-preview\" border=\"0\"\u003e\u003c/a\u003e\n\u003c/div\u003e\n\n[![Deploy to Azure Container Instance](https://github.com/MohanKrishnaGR/Infosys_Springboard_Text-Summarization/actions/workflows/azure.yml/badge.svg?event=push)](https://github.com/MohanKrishnaGR/Infosys_Springboard_Text-Summarization/actions/workflows/azure.yml)\n\n![Docker Image Version](https://img.shields.io/docker/v/mohankrishnagr/infosys_text-summarization/group)\n![Docker Image Size](https://img.shields.io/docker/image-size/mohankrishnagr/infosys_text-summarization/group)\n![Docker Pulls](https://img.shields.io/docker/pulls/mohankrishnagr/infosys_text-summarization)\n\u003ci\u003e\n\n# Text Summarization\nA project by AI/ML Interns (Group 4) @ Infosys Springboard, Summer 2024.\n\n## Mentor\nMr. 
Narendra Kumar\n\n## Contents\n- [Problem Statement](#problem-statement)\n- [Project Statement](#project-statement)\n- [Approach to Solution](#approach-to-solution)\n- [Background Research](#background-research)\n- [Solution](#solution)\n- [Workflow](#workflow)\n- [Data Collection](#data-collection)\n- [Abstractive Text Summarization](#abstractive-text-summarization)\n- [Extractive Text Summarization](#extractive-text-summarization)\n- [Testing](#testing)\n- [Deployment](#deployment)\n- [Containerization](#containerization)\n- [CI/CD Pipeline](#cicd-pipeline)\n\n## Problem Statement\n- Developing an automated text summarization system that can accurately and efficiently condense large bodies of text into concise summaries is essential for enhancing business operations.\n- This project aims to deploy NLP techniques to create a robust text summarization tool capable of handling various types of documents across different domains.\n- The system should deliver high-quality summaries that retain the core information and contextual meaning of the original text.\n\n## Project Statement\n- Text Summarization focuses on converting large bodies of text into a few sentences summing up the gist of the larger text.\n- There is a wide variety of applications for text summarization including News Summary, Customer Reviews, Research Papers, etc.\n- This project aims to understand the importance of text summarization and apply different techniques to fulfill the purpose.\n\n## Approach to Solution\n- **Figure:** Intended Plan\n\u003cdiv align=\"center\"\u003e\n    \u003ca\u003e\u003cimg src=\"https://drive.usercontent.google.com/u/0/uc?id=1cn429WFQzvF1eDwEFiLsIk87M8KBELt8\u0026export=download\" border=\"0\"\u003e\u003c/a\u003e\n\u003c/div\u003e\n\n\n## Background Research\n- **Literature Review**\n\u003cdiv align=\"center\"\u003e\n    \u003ca\u003e\u003cimg src=\"https://drive.usercontent.google.com/u/0/uc?id=1201kWfyGURgsA32u6Xe_WPSrO0izo8Fg\u0026export=download\" 
border=\"0\"\u003e\u003c/a\u003e\n\u003c/div\u003e\n\n## Solution\n- **Selected Deep Learning Architecture**\n\n## Workflow\n- Workflow for Abstractive Text Summarizer:\n\u003cdiv align=\"center\"\u003e\n    \u003ca\u003e\u003cimg src=\"https://drive.usercontent.google.com/u/0/uc?id=1-smea28F10cOnmXXUj24QkzEZL-ffhWt\u0026export=download\" border=\"0\"\u003e\u003c/a\u003e\n\u003c/div\u003e\u003cbr\u003e\n\n- Workflow for Extractive Text Summarizer:\n\u003cdiv align=\"center\"\u003e\n    \u003ca\u003e\u003cimg src=\"https://drive.usercontent.google.com/u/0/uc?id=1vS2Gm5ccJvjxH7fsnyOf3ARk2pNTR75p\u0026export=download\" border=\"0\"\u003e\u003c/a\u003e\n\u003c/div\u003e\n\n\n## Data Collection\n- Data pre-processing implemented in `src/data_preprocessing`.\n- Data collection from different sources:\n  - CNN, Daily Mail: News\n  - BillSum: Legal\n  - arXiv: Scientific\n  - DialogSum: Conversations\n- Data integration ensures robust, multi-domain data, including News articles, Legal Documents – Acts, Judgements, Scientific papers, and Conversations.\n- Validated the data through Data Statistics and Exploratory Data Analysis (EDA) using Frequency Plotting for every data source.\n- Data cleansing optimized for NLP tasks: null-record removal, lowercasing, punctuation removal, stop-word removal, and lemmatization.\n- Split the data using scikit-learn into training, testing, and validation sets, saved in CSV format.\n\n## Abstractive Text Summarization\n### Model Training \u0026 Evaluation\n- **Training:**\n  - Selected transformer architecture for ABSTRACTIVE SUMMARIZATION: fine-tuning a pre-trained model.\n  - Chose Facebook’s BART Large model for its performance metrics and efficient trainable parameters.\n      -  406,291,456 trainable parameters.\n    \n\u003cdiv align=\"center\"\u003e\n    \u003ca\u003e\u003cimg src=\"https://drive.usercontent.google.com/u/0/uc?id=1fe7MMx_-kEAN9c0QVJbsMj9dBUNgEZX8\u0026export=download\" 
border=\"0\"\u003e\u003c/a\u003e\n\u003c/div\u003e\u003cbr\u003e\n\n- **Methods:**\n  - Native PyTorch Implementation\n  - Trainer API Implementation\n\n### Method 1 - Native PyTorch\n- Trained the model using a manual training loop and evaluation loop in PyTorch. Implemented in: `src/model.ipynb`\n- **Model Evaluation:** Source code: `src/evaluation.ipynb`\n    - Obtained inconsistent results during inference.\n    - ROUGE1 (F-Measure) = 0.018\n    - A suspected tensor error during training with Method 1 may explain the inconsistency of the model's output.\n    - Rejected for further deployment.\n    - An alternative approach was needed.\n\n### Method 2 – Trainer Class Implementation\n- Utilized the Trainer API from Hugging Face for optimized transformer model training. Implemented in: `src/bart.ipynb`\n    - The model was trained on the whole dataset for 10 epochs, taking 26:24:22 (HH:MM:SS) over 125420 steps.\n     \n- **Evaluation:** Performance metrics using ROUGE scores. Source code: `src/rouge.ipynb`\n    - Method 2 results outperformed those of Method 1.\n    - \u003cstrong\u003eROUGE1 (F-Measure) = 61.32\u003c/strong\u003e -\u003e Benchmark grade\n        - Significantly higher than typical scores reported for state-of-the-art models on common datasets.\n    - GPT4 performance for text summarization - ROUGE1 (F-Measure) is 63.22\n    - Selected for further deployment.\n \n- Comparative analysis showed significant improvement in performance after fine-tuning. Source code: `src/compare.ipynb`\n\u003cdiv align=\"center\"\u003e\n    \u003ca\u003e\u003cimg src=\"https://drive.usercontent.google.com/u/0/uc?id=1V4u8ohFNFcceidx3l43LNjxTbtLZ233g\u0026export=download\" border=\"0\"\u003e\u003c/a\u003e\n\u003c/div\u003e\u003cbr\u003e\n\n## Extractive Text Summarization\n- Rather than choosing computationally intensive deep-learning models, a rule-based approach was chosen as the optimal solution. 
Utilized a novel approach that combines the TF-IDF matrix with KMeans clustering.\n- It extends topic modeling so that it applies to multiple lower-level specialized entities (i.e., groups) embedded in a single document, operating at the individual document and cluster level.\n- The sentence closest to the centroid (based on Euclidean distance) is selected as the representative sentence for that cluster.\n- **Implementation:** Preprocess text, extract features using TF-IDF, and summarize by selecting representative sentences.\n    - Source code for implementation \u0026 evaluation: `src/Extractive_Summarization.ipynb`\n    - ROUGE1 (F-Measure) = 24.71\n\n## Testing\n- Implemented a text summarization application using the Gradio library for a web-based interface to test the model's inference.\n- **Source Code:** `src/interface.ipynb`\n\u003cdiv align=\"center\"\u003e\n    \u003ca\u003e\u003cimg src=\"https://drive.usercontent.google.com/u/0/uc?id=15YsrZBPpEqnrdfzM6Bs8wu_4P1GbN8HZ\u0026export=download\" border=\"0\"\u003e\u003c/a\u003e\n\u003c/div\u003e\u003cbr\u003e\n\n## Deployment\n\u003cdiv align=\"center\"\u003e\n    \u003ca\u003e\u003cimg src=\"https://drive.usercontent.google.com/u/0/uc?id=1mvbC3IZzRxS0Hx0DoO6EvrQyKqgD--Gw\u0026export=download\" border=\"0\"\u003e\u003c/a\u003e\n\u003c/div\u003e\u003cbr\u003e\n\n### Application\n- **File Structure:** `summarize/`\n\u003cdiv align=\"center\"\u003e\n    \u003ca\u003e\u003cimg src=\"https://drive.usercontent.google.com/u/0/uc?id=1OnHuW8YMPQYT88pqPbWAZCpgEFlic0kw\u0026export=download\" width=\"320\" height=\"320\" border=\"0\"\u003e\u003c/a\u003e\n\u003c/div\u003e\u003cbr\u003e\n\n### API Endpoints\n- Developed using the FastAPI framework for handling URLs, files, and direct text input.\n    - **Source Code:** `summarizer/app.py` \n- **Endpoints:**\n  - Root Endpoint\n  - Summarize URL\n  - Summarize File\n  - Summarize Text\n\n### Extractor Modules\n- Extract text 
from various sources (URLs, PDF, DOCX) using BeautifulSoup and fitz.\n- **Source Code:** `summarizer/extractors.py`\n\n### Extractive Summary Script\n- Implemented extractive summarizer module. Same as implemented in: `src/bart.ipynb`\n- **Source Code:** `summarizer/extractive_summary.py`\n\n### User Interface\n- Developed a user-friendly interface using HTML, CSS, and JavaScript.\n- **Source Code:** `summarizer/templates/index.html`\n\u003cdiv align=\"center\"\u003e\n    \u003ca\u003e\u003cimg src=\"https://drive.usercontent.google.com/u/0/uc?id=1fvuJlBsFtTyhHK1emn-XCWE0KvTuGa2c\u0026export=download\" border=\"0\"\u003e\u003c/a\u003e\n\u003c/div\u003e\u003cbr\u003e\n\n## Containerization\n- Developed a Dockerfile to build a Docker image for the FastAPI application.\n- **Source Code:** `summarizer/Dockerfile`\n- **Image:** [Docker Image](https://hub.docker.com/layers/mohankrishnagr/infosys_text-summarization/group/images/sha256-28802ba2a3b30d36b94fbd878c97585c02c813534fc80fdca5e81494b96bfd08?context=explore)\n\n## CI/CD Pipeline\n- Developed a CI/CD pipeline using Docker, Azure, and GitHub Actions.\n- Utilized Azure Container Instance (ACI) for deploying the image; deployment is triggered on every push to the main branch.\n- **Source Code:** `.github/workflows/azure.yml`\n- **IPv4 Address:** [Text Summarizer](http://20.235.235.107:8000/) ( http://20.235.235.107:8000/ )\n- **FQDN:** [Text Summarizer](http://text-summarizer.bqegenbyedfzhpa3.centralindia.azurecontainer.io:8000/) ( http://text-summarizer.bqegenbyedfzhpa3.centralindia.azurecontainer.io:8000/ )\n- **Screenshots:** \n\u003cdiv align=\"center\"\u003e\n    \u003ca\u003e\u003cimg src=\"https://drive.usercontent.google.com/u/0/uc?id=1y0upt5MfiMXMrA5Ar6D0FV91EJ2FIb0p\u0026export=download\" border=\"0\"\u003e\u003c/a\u003e\n\u003c/div\u003e\u003cbr\u003e\n\u003cdiv align=\"center\"\u003e\n    \u003ca\u003e\u003cimg src=\"https://drive.usercontent.google.com/u/0/uc?id=1F1agLXesw1XyLh7xBWAC2zxju3bGTaGc\u0026export=download\" 
border=\"0\"\u003e\u003c/a\u003e\n\u003c/div\u003e\u003cbr\u003e\n\u003cdiv align=\"center\"\u003e\n    \u003ca\u003e\u003cimg src=\"https://drive.usercontent.google.com/u/0/uc?id=1M5aIP7_Q7eakyBQ-dSxSaJqJBsblGmzy\u0026export=download\" border=\"0\"\u003e\u003c/a\u003e\n\u003c/div\u003e\u003cbr\u003e\n\u003cdiv align=\"center\"\u003e\n    \u003ca\u003e\u003cimg src=\"https://drive.usercontent.google.com/u/0/uc?id=1rrHoRdoJEk8VTyrG8Py2RirkXspZadBQ\u0026export=download\" border=\"0\"\u003e\u003c/a\u003e\n\u003c/div\u003e\u003cbr\u003e\n\u003c/i\u003e\n\n----\n\n### End Note\nThank you for your interest in our project! We welcome any feedback. Feel free to reach out to us.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmohankrishnagr%2Finfosys_springboard_text-summarization","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmohankrishnagr%2Finfosys_springboard_text-summarization","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmohankrishnagr%2Finfosys_springboard_text-summarization/lists"}