https://github.com/johnbumgarner/newshound

This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around the world in over 50 languages.

Host: GitHub
URL: https://github.com/johnbumgarner/newshound
Owner: johnbumgarner
Created: 2021-10-06T14:28:19.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2023-03-14T03:59:41.000Z (almost 3 years ago)
Last Synced: 2025-08-24T23:08:41.089Z (7 months ago)
Topics: article-extracting, article-extractor, data-extraction, data-mining, data-science, datascience, news, news-aggregator, news-crawler, newspaper-crawler, python-newspaper, python3, text-mining, web-scraping, webscraping
Homepage:
Size: 28.3 KB
Stars: 33
Watchers: 14
Forks: 3
Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Funding: FUNDING.yml
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md

README

# Currently under development. BETA will be released soon.
########### ########### ########### ########### ###########

# NewsHound
---

![PyPI](https://img.shields.io/pypi/v/newshound)

![GitHub issues](https://img.shields.io/github/issues/johnbumgarner/newshound)
![GitHub pull requests](https://img.shields.io/github/issues-pr/johnbumgarner/newshound)
[![newshound](https://snyk.io/advisor/python/newshound/badge.svg)](https://snyk.io/advisor/python/newshound)
[![Downloads](https://static.pepy.tech/personalized-badge/newshound?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=Total%20Downloads)](https://pepy.tech/project/newshound)

## Description

NewsHound is a Python 3 module designed to perform high-quality news and article extraction from sources in multiple languages.

For instance, NewsHound cleanly parses article content from the BBC in English, Dainik Bhaskar in Hindi, the People's Daily in Chinese, Malayala Manorama in Malayalam, and Khaosod in Thai.

The built-in extraction architecture is designed to systematically parse specific data elements from the underlying navigation structure of either an online web page or an offline file containing HTML content.

These data elements are:

- Title/Headline
- Description/Summary
- Keywords
- Name(s) of Author(s)
- Main Text/Content
- ISO Language
- Language Name
- Published Date
- Modified Date
- Canonical HREF
- Top Image HREF
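
To make the idea of pulling these elements out of a page's markup more concrete, here is a minimal sketch that uses only the Python standard library. It is not NewsHound's API or its implementation: the `MetaExtractor` class, the `extract_fields` function, and the field names in the returned dictionary are hypothetical, and the sketch only reads a few common `<meta>`, `<title>`, and `<link>` tags rather than the main article text.

```python
from html.parser import HTMLParser

# Hypothetical mapping from common <meta> names/properties to field names.
# These keys are an assumption for illustration, not NewsHound's schema.
META_FIELDS = {
    "description": "description",
    "og:description": "description",
    "keywords": "keywords",
    "author": "authors",
    "article:published_time": "published_date",
    "article:modified_time": "modified_date",
    "og:image": "top_image_href",
}


class MetaExtractor(HTMLParser):
    """Hypothetical helper that collects a few metadata fields from HTML."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.fields["language"] = attrs["lang"]
        elif tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = (attrs.get("name") or attrs.get("property") or "").lower()
            content = attrs.get("content")
            if key in META_FIELDS and content:
                # Keep the first value seen for each field.
                self.fields.setdefault(META_FIELDS[key], content)
        elif tag == "link" and attrs.get("rel") == "canonical" and attrs.get("href"):
            self.fields["canonical_href"] = attrs["href"]

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.fields.setdefault("title", "".join(self._title_parts).strip())


def extract_fields(html):
    """Parse an HTML string and return the metadata fields that were found."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    fields = parser.fields
    if "keywords" in fields:
        # Split the comma-separated keyword string into a clean list.
        fields["keywords"] = [k.strip() for k in fields["keywords"].split(",") if k.strip()]
    return fields


sample = """<html lang="en"><head>
<title>Example headline</title>
<meta name="description" content="A short summary.">
<meta name="keywords" content="news, example">
<link rel="canonical" href="https://example.com/story">
</head><body></body></html>"""

fields = extract_fields(sample)
print(fields)
```

The same function works for an offline file if its contents are read into a string first. Real pages vary widely in which tags they provide, which is one reason a dedicated extraction package is useful.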

## Installation

NewsHound requires Python >=3.6. This package can be installed using pip3.

```shell
pip3 install newshound
```

## Usage and Documentation

For detailed information on NewsHound please refer to the documentation.

- Package Dependencies

## Predefined Extraction

The maintainers of NewsHound have developed and tested multiple predefined extraction modules for various news sources around the world. These specific extractors were developed to ensure consistent and accurate parsing from the news sources being queried. Additional sources will be added periodically to this predefined extraction list.
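
One common way to organize source-specific extractors is a registry keyed by domain, with a generic fallback for unlisted sources. The sketch below illustrates that pattern only. It does not describe how NewsHound is implemented, and every name in it (`EXTRACTORS`, `register`, `select_extractor`, the example domain entry) is hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical registry mapping a domain to its extractor function.
EXTRACTORS = {}


def register(domain):
    """Decorator that records an extractor function for a domain."""
    def decorator(func):
        EXTRACTORS[domain] = func
        return func
    return decorator


def generic_extractor(html):
    """Fallback used when no source-specific extractor is registered."""
    return {"extractor": "generic", "length": len(html)}


@register("bbc.com")
def bbc_extractor(html):
    """Placeholder for source-specific parsing rules (illustrative only)."""
    return {"extractor": "bbc.com", "length": len(html)}


def select_extractor(url):
    """Return the extractor registered for the URL's domain, or the fallback."""
    host = (urlparse(url).hostname or "").lower()
    for domain, func in EXTRACTORS.items():
        # Match the registered domain itself or any of its subdomains.
        if host == domain or host.endswith("." + domain):
            return func
    return generic_extractor


extractor = select_extractor("https://www.bbc.com/news/example")
print(extractor.__name__)
```

With this layout, supporting a new source means adding one decorated function, and unlisted sources still receive a best-effort result from the fallback.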

## Development

If you would like to contribute to the NewsHound project, please read the contributing guidelines.

Items currently under development:
- TBD after BETA release

## Issues

This repository is actively maintained. Feel free to open any issues related to bugs, coding errors, broken links or enhancements.

You can also contact me at [John Bumgarner](mailto:newshoundproject@gmail.com?subject=[GitHub]%20newshound%20project%20request) with any issues or enhancement requests.

## Sponsorship

If you would like to contribute financially to the development and maintenance of the NewsHound project, please read the sponsorship information.
