https://github.com/abhinavgupta/Extract-News-Summary

Pure python script that takes user query and summarizes news related to it.

Host: GitHub
URL: https://github.com/abhinavgupta/Extract-News-Summary
Owner: abhinavgupta
Created: 2012-07-16T23:47:07.000Z (almost 13 years ago)
Default Branch: master
Last Pushed: 2022-07-06T19:24:26.000Z (about 3 years ago)
Last Synced: 2024-08-01T06:21:28.750Z (12 months ago)
Language: Python
Size: 49.8 KB
Stars: 25
Watchers: 3
Forks: 18
Open Issues: 3
Metadata Files:
- Readme: README.markdown

# Extract - News - Summary

## Description

This is a pure Python package that returns summaries of news articles for a search term provided by the user. The script automates the process of finding appropriate news links, extracting the text, and summarizing it.

The final script extracts URLs from Google News, so it works best when the search query is a current affairs topic.
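To illustrate the link-finding step, here is a minimal sketch that pulls article links out of an RSS 2.0 feed using only the Python 3 standard library. It is not the project's `GoogleRSSReader` code: the function names `build_feed_url` and `extract_links`, the sample feed, and the Google News RSS search URL are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_feed_url(query):
    """Build a Google News RSS search URL.

    Assumption: the endpoint below is not taken from the project.
    """
    return "https://news.google.com/rss/search?" + urlencode({"q": query})


def extract_links(rss_xml, limit):
    """Return up to `limit` article links from an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        if len(links) >= limit:
            break
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


# Hypothetical feed used instead of a live request, so the sketch runs offline.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>sample</title>
<link>https://example.com/</link>
<item><title>A</title><link>https://example.com/a</link></item>
<item><title>B</title><link>https://example.com/b</link></item>
<item><title>C</title><link>https://example.com/c</link></item>
</channel></rss>"""

print(build_feed_url("RBI Governor"))
print(extract_links(SAMPLE_FEED, 2))
```

In a real run, the feed text would come from fetching the URL returned by `build_feed_url`; the channel-level `<link>` is skipped because only `<item>` elements are inspected.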

The ENS folder contains the following modules:
- **bsReadability** - Uses BeautifulSoup to strip boilerplate and extract the article text from a given URL.

- **lxmlReadability** - Uses the faster lxml library to do the same job. However, lxml is less robust to badly formed HTML and encoding problems.

- **GoogleRSSReader** - Takes a search query and returns the URLs scraped from the Google News RSS feed. The RSS feed is relatively lenient about automated requests.

- **TextRankSummarize** - Uses a modified PageRank algorithm to mark the most important sentences/phrases in a text. This graph-based approach is more intuitive than word-frequency scoring. (NOTE: The parameter that sets the number of nodes selected for the final summary is hardcoded in this script; to experiment with it, edit the script directly.)
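The idea behind TextRankSummarize can be sketched as follows. This is a simplified, standard-library-only illustration in Python 3, not the project's implementation (which relies on networkx, sklearn, and nltk). The function names, the regex sentence splitter, the word-overlap similarity, and the `top_n`, `damping`, and `iterations` parameters are all assumptions made for this example.

```python
import math
import re


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace (a stand-in for nltk's punkt)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def similarity(a, b):
    """Word overlap between two sentences, normalised by their log lengths."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids dividing by log(1) == 0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def text_rank(text, top_n=2, damping=0.85, iterations=50):
    """Return the `top_n` highest-ranked sentences in their original order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= top_n:
        return sentences
    # Weighted, undirected sentence graph stored as an adjacency matrix.
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    # Power iteration of the weighted PageRank update.
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]


sample = ("The central bank raised interest rates today. "
          "Markets reacted to the central bank raising interest rates. "
          "Analysts expect the bank to raise rates again. "
          "My cat likes fish.")
for sentence in text_rank(sample, top_n=2):
    print(sentence)
```

Here `top_n` plays the role of the hardcoded "number of nodes" mentioned above; exposing it as an argument is a design choice of this sketch only.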

### Requirements

- lxml
- networkx
- numpy
- sklearn
- BeautifulSoup
- nltk (remember to install the punkt tokenizer separately, e.g. with `nltk.download('punkt')`)

### Install

    git clone https://github.com/abhinavgupta/Extract-News-Summary
    cd Extract-News-Summary/
    sudo sh install.sh

### Use script

There are two versions of the script, one using BeautifulSoup and the other using lxml. Both are provided specifically for benchmarking purposes, given the pros and cons of the two parsers.

To use the BeautifulSoup version via terminal:

    ENS_Soup 5 Narendra Modi

To use the lxml version:

    ENS_lxml 5 Narendra Modi

### Example

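    # Python 2 syntax, matching the project
    import re

    from ENS import Document, fetch_url, textRank, newsSearch

    # Assumption: the original README leaves `regex` undefined;
    # here it strips leftover HTML tags from the extracted article.
    regex = re.compile(r"<[^>]+>")

    links = newsSearch("RBI Governor", 5)
    for link in links:
        # Extract the article body, clean it, then summarize it
        article = Document(url=link).summary()
        article = re.sub(regex, "", article)
        article = article.encode('ascii', 'ignore')
        summary = textRank(article)
        summary = summary.encode('ascii', 'ignore')

        print article
        print "*** SUMMARY ***"
        print summary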

### Authors

- Abhinav Gupta

### TODO

- Remove networkx and sklearn dependencies
- Add own tokenizer and remove nltk dependency
- Solve encoding issue
- Benchmark the summary
- Improve text extraction
