# Extractive Text Summarizer

![Short Notes](https://user-images.githubusercontent.com/26179770/70233084-e34f4880-1783-11ea-8e33-93093263fa13.png)

This repository contains a very simple implementation of extractive text summarization. The summarizer partially implements the approach from this paper (without the boost factor): [https://pdfs.semanticscholar.org/2df1/595bcbee37de1147784585a097f3a2819fdf.pdf](https://pdfs.semanticscholar.org/2df1/595bcbee37de1147784585a097f3a2819fdf.pdf)

The code for the summarizer service can be found in the `service` folder. After building the summarizer, the project was hosted as a Flask API.
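
A minimal sketch of what such a Flask wrapper might look like is below; the route name, JSON payload shape, and default sentence count are illustrative assumptions, not taken from the actual `service` code. `summarize_text` is the function built up later in this README.

```python
# Hypothetical Flask wrapper around the summarizer; the /summarize route
# and the payload shape are assumptions for illustration only.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/summarize', methods=['POST'])
def summarize():
    data = request.get_json()
    # num_sent defaults to 5 here; the real service may differ
    summary = summarize_text(data['text'], data.get('num_sent', 5))
    return jsonify({'summary': summary})

if __name__ == '__main__':
    app.run()
```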

## Steps

### From the above-mentioned paper

1. Read a text in and split it into individual tokens.
2. Remove the stop words to filter the text.
3. Assign a weight value to each individual term. The weight is calculated as:
```
weight = (frequency of that term)/(total number of terms)
```
4. Add a boost factor to bold, italic, or underlined text (not implemented here; see the sketch after this list).
5. Find the weight of each sentence (the sum of its terms' individual weights).
6. Rank individual sentences according to weight.
7. Extract the `n` highest ranked sentences.
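
Step 4 is the part this repository leaves out. For illustration, here is a minimal sketch of what a boost step might look like, assuming HTML-style markup and an arbitrary boost factor of 1.5 (the paper's actual factor is not reproduced here):

```python
import re

BOOST = 1.5  # arbitrary illustrative value, not the paper's

def apply_boost(weights, html_text):
    """Multiply the weight of terms appearing inside <b>, <i> or <u> tags."""
    emphasized = set()
    for tag, content in re.findall(r"<([biu])>(.*?)</\1>", html_text, re.S):
        emphasized.update(content.lower().split())
    return {term: weight * BOOST if term in emphasized else weight
            for term, weight in weights.items()}
```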

### Things implemented

1. Read the text
2. Pre-process the data
- Convert to lower case
- Remove special characters
- Remove digits
- Replace runs of extra spaces with a single space
- Return the clean text
3. Tokenize the data into sentences
4. Remove stop words
5. Create a word-count dictionary
6. Normalize the word-frequency dictionary (weighted word count matrix/dictionary)
```
weight = (frequency of that word)/(total number of terms)
```
7. Assign score to each sentence
8. Rank individual sentences according to weight and extract the `n` highest ranked sentences

## Steps involved, illustrated

1. Read the text

Read a text document, or ask for input from the user. Here we create a function that takes the input text and returns the summarized text. The second argument is the number of top-scoring sentences to extract.

```python
def summarize_text(text, num_sent):
    ...
    return summary
```

2. Pre-process the data

Steps and code are shown below:

```python
import re

def preprocess(text):
    # Convert to lower case
    clean_text = text.lower()
    # Remove special characters (any non-word character)
    clean_text = re.sub(r"\W", " ", clean_text)
    # Remove digits
    clean_text = re.sub(r"\d", " ", clean_text)
    # Collapse runs of whitespace into a single space
    clean_text = re.sub(r"\s+", " ", clean_text)
    # Return the clean text
    return clean_text
```
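
For instance, running `preprocess` on a small sample (the string below is just an illustration) strips punctuation, digits, and extra whitespace:

```python
sample = "Hello, World! 2024 is   here."
print(preprocess(sample))  # -> 'hello world is here '
```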

3. Tokenize the data into sentences

We use `sent_tokenize()` from the `nltk` library. Note that the original text (not the pre-processed one) is tokenized here, so the extracted sentences keep their punctuation and casing.

```python
import nltk

sentences = nltk.sent_tokenize(text)
```
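
If `nltk` has not been used before, the Punkt sentence tokenizer models must be downloaded once:

```python
import nltk

nltk.download('punkt')  # one-time download of the sentence tokenizer models
```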

4. Remove stop words

Again, we use `nltk`:

```python
stop_words = nltk.corpus.stopwords.words('english')
```
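
As with the tokenizer, the stop word corpus needs a one-time download:

```python
nltk.download('stopwords')  # one-time download of the stop word corpus
```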

5. (contd. from 4) Remove stop words and create a word-count dictionary

```python
word_count_dict = {}

for word in nltk.word_tokenize(clean_text):
    if word not in stop_words:
        if word not in word_count_dict:
            word_count_dict[word] = 1
        else:
            word_count_dict[word] += 1
```
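
As a design note, the same counting can be written more compactly with `collections.Counter` from the standard library, which behaves like a regular dictionary in the steps that follow:

```python
from collections import Counter

# Count every non-stop-word token in one pass
word_count_dict = Counter(
    word for word in nltk.word_tokenize(clean_text) if word not in stop_words
)
```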

6. Normalize the word-frequency dictionary (weighted word count matrix/dictionary)

```python
# Total number of terms (not necessarily unique) = sum of values in word_count_dict
total_terms = sum(word_count_dict.values())

# Normalize the word-frequency dictionary (weighted word count dictionary)
for key in word_count_dict:
    word_count_dict[key] = word_count_dict[key] / total_terms
```
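
For example, if the filtered counts were `{'python': 3, 'code': 1}`, then `total_terms` is 4 and the weights become `{'python': 0.75, 'code': 0.25}`.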

7. Assign a score to each sentence

```python
sentence_score_dict = {}

for sentence in sentences:
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word_count_dict:
            # 25 chosen arbitrarily, to skip very long sentences
            if len(sentence.split(' ')) < 25:
                if sentence not in sentence_score_dict:
                    sentence_score_dict[sentence] = word_count_dict[word]
                else:
                    sentence_score_dict[sentence] += word_count_dict[word]
```

8. Rank individual sentences according to weight and extract the `n` highest ranked sentences

```python
import heapq

# Pick the num_sent sentences with the highest scores
best_sentences = heapq.nlargest(num_sent, sentence_score_dict, key=sentence_score_dict.get)
```
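
Finally, the selected sentences can be joined to form the `summary` returned by `summarize_text`; this is a minimal sketch, and the separator is a choice, not taken from the repo:

```python
# Join the top sentences into the final summary string
summary = ' '.join(best_sentences)
```

With everything in place, the function can then be called like this (`article.txt` is a hypothetical input file):

```python
with open('article.txt') as f:
    print(summarize_text(f.read(), num_sent=5))
```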