https://github.com/madhavbahl/extractive-text-summarizer
Repository accompanying the blog
- Host: GitHub
- URL: https://github.com/madhavbahl/extractive-text-summarizer
- Owner: MadhavBahl
- Created: 2019-08-14T17:58:23.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2022-07-06T20:23:40.000Z (almost 3 years ago)
- Language: Python
- Size: 644 KB
- Stars: 5
- Watchers: 1
- Forks: 3
- Open Issues: 5
Metadata Files:
- Readme: README.md
# Extractive Text Summarizer

This repository contains a very simple implementation of extractive text summarization. The summarizer partially implements the approach described in this paper (the boost factor is not included): [https://pdfs.semanticscholar.org/2df1/595bcbee37de1147784585a097f3a2819fdf.pdf](https://pdfs.semanticscholar.org/2df1/595bcbee37de1147784585a097f3a2819fdf.pdf)
The code for the summarizer service can be found in the `service` folder. After the summarizer was built, the project was hosted as a Flask API.
## Steps
### From the above-mentioned paper
1. Read a text in and split it into individual tokens.
2. Remove the stop words to filter the text.
3. Assign a weight value to each individual term. The weight is calculated as follows (see the worked sketch after this list):
```
weight = (frequency of that term)/(total number of terms)
```
4. Add a boost factor to bold, italic, or underlined text.
5. Find the weight of each sentence (the sum of its words' individual weights).
6. Rank individual sentences according to weight.
7. Extract the `n` highest-ranked sentences.
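As a quick illustration of the weighting scheme in step 3, here is a minimal sketch on a toy token list (an invented example, not code from this repository):

```python
from collections import Counter

# Toy tokens with stop words already removed (hypothetical example)
tokens = ["cats", "chase", "mice", "cats", "sleep"]

counts = Counter(tokens)   # Counter({'cats': 2, 'chase': 1, 'mice': 1, 'sleep': 1})
total_terms = len(tokens)  # 5

# weight = (frequency of that term) / (total number of terms)
weights = {term: freq / total_terms for term, freq in counts.items()}
print(weights)  # {'cats': 0.4, 'chase': 0.2, 'mice': 0.2, 'sleep': 0.2}
```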
### Things implemented

1. Read the text
2. Pre-process the data
- Convert to lower case
- Remove special characters
- Remove digits
- Replace runs of extra whitespace with a single space
- Return the clean text
3. Tokenize the data into sentences
4. Remove stop words
5. Create a word-count dictionary
6. Normalize the word-frequency dictionary (weighted word count matrix/dictionary)
```
weight = (frequency of that word)/(total number of terms)
```
7. Assign a score to each sentence
8. Rank individual sentences according to weight and extract the `n` highest-ranked sentences.

## Steps involved, illustrated

1. Read the text
Read a text document, or take input from the user. Here we create a function that takes the input text and returns the summarized text; the second argument is the number of top-scoring sentences to extract.
```python
def summarize_text(text, num_sent):
    # ... preprocessing, scoring, and ranking steps go here ...
    return summary
```
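A hypothetical call might look like this (`article` stands for any input string; the file name is an assumption for illustration):

```python
with open("article.txt") as f:  # hypothetical input file
    article = f.read()

summary = summarize_text(article, num_sent=3)  # extract the 3 best sentences
print(summary)
```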
2. Pre-process the data

Steps and code as shown below:
```python
import re

def preprocess(text):
    # Convert to lower case
    clean_text = text.lower()
    # Remove special characters
    clean_text = re.sub(r"\W", " ", clean_text)
    # Remove digits
    clean_text = re.sub(r"\d", " ", clean_text)
    # Replace runs of whitespace with a single space
    clean_text = re.sub(r"\s+", " ", clean_text)
    # Return the clean text
    return clean_text
```
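For example, the cleaning steps behave like this on a short string (an invented input, shown with its expected output):

```python
print(preprocess("Hello, World! 42 is   the answer."))
# 'hello world is the answer '
```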
3. Tokenize the data into sentences

We use `sent_tokenize()` provided by the `nltk` library:
```python
import nltk

# Tokenize the original text (its punctuation is needed for sentence splitting)
sentences = nltk.sent_tokenize(text)
```
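Note that `sent_tokenize()` (and the stop-word list used in the next step) rely on NLTK data packages, which need a one-time download:

```python
import nltk

nltk.download("punkt")      # models for the sentence tokenizer
nltk.download("stopwords")  # stop-word lists
```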
4. Remove stop words

Again, we use `nltk`:
```python
stop_words = nltk.corpus.stopwords.words('english')
```
5. (contd. from 4) Remove stop words and create the word-count dictionary

```python
word_count_dict = {}
for word in nltk.word_tokenize(clean_text):
    if word not in stop_words:
        if word not in word_count_dict:
            word_count_dict[word] = 1
        else:
            word_count_dict[word] += 1
```
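As a side note, the same count can be written more compactly with `collections.Counter`; a sketch assuming the `clean_text` and `stop_words` defined above (`Counter` is a `dict` subclass, so the later steps work unchanged):

```python
from collections import Counter

word_count_dict = Counter(
    word for word in nltk.word_tokenize(clean_text) if word not in stop_words
)
```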
6. Normalize the word-frequency dictionary (weighted word-count matrix/dictionary)

```python
# Total number of terms (not necessarily unique) = sum of the values in word_count_dict
total_terms = sum(word_count_dict.values())

# Normalize the word-frequency dictionary (weighted word-count dictionary)
for key in word_count_dict:
    word_count_dict[key] = word_count_dict[key] / total_terms
```
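Since every count is divided by the same total, the normalized weights sum to 1, which makes a handy sanity check (an optional addition, not in the original code):

```python
assert abs(sum(word_count_dict.values()) - 1.0) < 1e-9
```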
7. Assign scores to each sentence

```python
sentence_score_dict = {}
for sentence in sentences:
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word_count_dict:
            if len(sentence.split(' ')) < 25:  # 25 chosen arbitrarily, to skip very long sentences
                if sentence not in sentence_score_dict:
                    sentence_score_dict[sentence] = word_count_dict[word]
                else:
                    sentence_score_dict[sentence] += word_count_dict[word]
```
8. Rank individual sentences according to weight and extract the `n` highest-ranked sentences

```python
import heapq

best_sentences = heapq.nlargest(num_sent, sentence_score_dict, key=sentence_score_dict.get)
```
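Putting the snippets together, here is a minimal end-to-end sketch of the pipeline described above (the final joining of the selected sentences with `' '.join(...)` is an assumption for illustration; note that `nlargest` returns sentences in score order, not document order):

```python
import heapq
import re

import nltk


def preprocess(text):
    clean_text = text.lower()
    clean_text = re.sub(r"\W", " ", clean_text)
    clean_text = re.sub(r"\d", " ", clean_text)
    clean_text = re.sub(r"\s+", " ", clean_text)
    return clean_text


def summarize_text(text, num_sent):
    clean_text = preprocess(text)
    sentences = nltk.sent_tokenize(text)
    stop_words = nltk.corpus.stopwords.words('english')

    # Word-count dictionary over the cleaned text, stop words excluded
    word_count_dict = {}
    for word in nltk.word_tokenize(clean_text):
        if word not in stop_words:
            word_count_dict[word] = word_count_dict.get(word, 0) + 1

    # Normalize: weight = frequency / total number of terms
    total_terms = sum(word_count_dict.values())
    for key in word_count_dict:
        word_count_dict[key] /= total_terms

    # Score each (reasonably short) sentence by summing its word weights
    sentence_score_dict = {}
    for sentence in sentences:
        for word in nltk.word_tokenize(sentence.lower()):
            if word in word_count_dict and len(sentence.split(' ')) < 25:
                sentence_score_dict[sentence] = (
                    sentence_score_dict.get(sentence, 0) + word_count_dict[word]
                )

    # Keep the num_sent highest-scoring sentences
    best_sentences = heapq.nlargest(num_sent, sentence_score_dict,
                                    key=sentence_score_dict.get)
    return ' '.join(best_sentences)
```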