https://github.com/semanticclimate/pdf_summarization_demo

Last synced: 9 months ago
JSON representation

Host: GitHub
URL: https://github.com/semanticclimate/pdf_summarization_demo
Owner: semanticClimate
License: apache-2.0
Created: 2025-07-28T07:34:19.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-07-28T11:11:56.000Z (11 months ago)
Last Synced: 2025-09-05T15:18:52.528Z (9 months ago)
Language: Jupyter Notebook
Size: 67.4 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff

Awesome Lists containing this project

README

# Demonstration of PDF Summarization

DOI Zenodo badge:

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.16526790.svg)](https://doi.org/10.5281/zenodo.16526790)

Citation:

Barbhuiya, S., S, A., Jawed, M., Kumari, R., Simon, W., Yadav, G., & Murray-Rust, P. (2025). Demonstration of PDF Summarization (0.1a). Zenodo. https://doi.org/10.5281/zenodo.16526790

**Description:**

This Jupyter notebook provides an end-to-end pipeline for summarizing scientific PDFs using Natural Language Processing (NLP) techniques. It extracts text from uploaded PDFs and generates concise summaries using transformer-based models.

#### Features
- Upload and parse PDF documents
- Extract meaningful text content
- Generate summaries using Hugging Face Transformers (e.g., BART, T5)
- Optionally view original and summarized text side-by-side
- Includes visualization support with PyMuPDF and IPython.display
#### Requirements
1. Install the following packages:
2. pip install transformers
3. pip install PyPDF2
4. pip install fitz
5. pip install PyMuPDF
6. pip install nltk
7. pip install torch

#### How to Use
1. Clone this repository or download the notebook.
2. Launch Jupyter Notebook or Google Colab.
3. Upload your scientific or research-based PDF.
4. Run all cells to:
- Extract the full text
- Preprocess and chunk the content
- Generate a summary using a transformer model

#### Structure
- upload_pdf() – Upload and read PDF files
- extract_text() – Extract text from all pages
- summarize_text() – Use pre-trained summarization models
- visualize() – Display original vs. summarized content

#### Applications
- Research paper summarization
- Literature review automation
- Information extraction for large documents

#### Notes
- Pretrained models like facebook/bart-large-cnn or t5-base are used.
- Results depend on PDF formatting quality.

Reviewers & review process: \

---

Software citation information: [CITATION.cff](CITATION.cff)

License: Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ | License information: [LICENSE](LICENSE)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/semanticclimate/pdf_summarization_demo

Awesome Lists containing this project

README