Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/oxylabs/news-scraping
A tutorial for scraping news
news news-scraper web-scraping
Last synced: 4 days ago
- Host: GitHub
- URL: https://github.com/oxylabs/news-scraping
- Owner: oxylabs
- Created: 2022-02-28T11:59:29.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2024-04-19T08:28:07.000Z (9 months ago)
- Last Synced: 2024-11-17T02:09:38.863Z (2 months ago)
- Topics: news, news-scraper, web-scraping
- Homepage:
- Size: 5.86 KB
- Stars: 13
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# News Scraping
[![Oxylabs promo code](https://user-images.githubusercontent.com/129506779/250792357-8289e25e-9c36-4dc0-a5e2-2706db797bb5.png)](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=877&url_id=112)
[![](https://dcbadge.vercel.app/api/server/eWsVUJrnG5)](https://discord.gg/GbxmdGhZjq)
- [Fetch HTML Page](#fetch-html-page)
- [Parsing HTML](#parsing-html)
- [Extracting Text](#extracting-text)

This article covers news scraping: its benefits and use cases, and how to use Python to build an article scraper.
For a detailed explanation, see our [blog post](https://oxy.yt/YrD0).
## Fetch HTML Page
```shell
pip3 install requests
```

Create a new Python file and enter the following code:
```python
import requests

response = requests.get('https://quotes.toscrape.com')
print(response.text)  # Prints the entire HTML of the webpage.
```

## Parsing HTML
```shell
pip3 install lxml beautifulsoup4
```

```python
import requests
from bs4 import BeautifulSoup

response = requests.get('https://quotes.toscrape.com')
soup = BeautifulSoup(response.text, 'lxml')
title = soup.find('title')
```

## Extracting Text
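`find` returns a `Tag` object rather than a string, which is why `get_text()` is needed to get at the content. The following is a small sketch of the difference. It assumes a made-up HTML fragment and uses Python's built-in `html.parser` (chosen here only so the example runs without lxml).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only to illustrate Tag vs. text.
html = "<title>  Quotes to Scrape  </title><p>Hello, <b>news</b> reader</p>"
soup = BeautifulSoup(html, "html.parser")

title = soup.find("title")
paragraph = soup.find("p")

print(title)                         # The Tag, markup included.
print(title.get_text(strip=True))    # Text only, surrounding whitespace removed.
print(paragraph.get_text())          # Text of the tag and all nested tags.
```

Passing `strip=True` is optional; it is handy when the source markup pads its text with whitespace.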
```python
print(title.get_text()) # Prints page title.
```

### Fine Tuning
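Fine tuning here means narrowing a `find` call with attribute filters so that it returns the intended element instead of the first tag with a matching name. As a rough sketch, assuming markup shaped like the author element targeted in this section (the `SAMPLE_HTML` string is a hypothetical stand-in, and `html.parser` is used so the example runs without lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two <small> tags, only one of which is the author.
SAMPLE_HTML = """
<div class="quote">
  <small class="note">Posted in sample data</small>
  <small class="author" itemprop="author">Author One</small>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_small = soup.find("small")                     # First <small>, whatever its attributes.
by_itemprop = soup.find("small", itemprop="author")  # Filter on the itemprop attribute.
by_class = soup.find("small", class_="author")       # class_ avoids the reserved word "class".
by_css = soup.select_one("small.author")             # The same filter as a CSS selector.

print(first_small.get_text())
print(by_itemprop.get_text())
print(by_class.get_text())
print(by_css.get_text())
```

Without a filter, `find` stops at the first `<small>` tag; each of the three filtered lookups reaches the author element.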
```python
soup.find('small', itemprop="author")
```

```python
soup.find('small', class_="author")
```

### Extracting Headlines
```python
headlines = soup.find_all(itemprop="text")
for headline in headlines:
    print(headline.get_text())
```

If you wish to find out more about news scraping, see our [blog post](https://oxy.yt/YrD0).
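
The steps in this tutorial (fetch, parse, extract) can be combined into two small functions. This is a minimal sketch, not the tutorial's own code: `fetch_html`, `extract_headlines`, the `timeout` default, and `SAMPLE_HTML` are all hypothetical names and values introduced for illustration, and the sample markup only approximates the structure the selectors above expect.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical sample markup, so the parsing step can run without a network call.
SAMPLE_HTML = """
<html><head><title>Quotes to Scrape</title></head>
<body>
  <div class="quote">
    <span class="text" itemprop="text">First sample quote.</span>
    <small class="author" itemprop="author">Author One</small>
  </div>
  <div class="quote">
    <span class="text" itemprop="text">Second sample quote.</span>
    <small class="author" itemprop="author">Author Two</small>
  </div>
</body></html>
"""


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_headlines(html, parser="lxml"):
    """Return the page title and the text of every itemprop="text" element."""
    soup = BeautifulSoup(html, parser)
    title = soup.find("title")
    headlines = [tag.get_text(strip=True) for tag in soup.find_all(itemprop="text")]
    return {
        "title": title.get_text(strip=True) if title else None,
        "headlines": headlines,
    }


# Parse the inline sample; "html.parser" ships with Python, so lxml is not needed here.
result = extract_headlines(SAMPLE_HTML, parser="html.parser")
print(result["title"])
for headline in result["headlines"]:
    print(headline)
```

To run the same extraction against the live page, pass `fetch_html('https://quotes.toscrape.com')` to `extract_headlines`. Keeping the download separate from the parsing makes the parsing logic easy to check against saved or hand-written HTML.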