https://github.com/uva-bi-sdad/rss_scraper_copy
https://github.com/uva-bi-sdad/rss_scraper_copy
Last synced: about 2 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/uva-bi-sdad/rss_scraper_copy
- Owner: uva-bi-sdad
- Created: 2024-03-11T14:18:49.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-05-21T02:38:23.000Z (about 1 year ago)
- Last Synced: 2025-02-07T10:29:17.619Z (3 months ago)
- Language: Python
- Size: 40 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 11
-
Metadata Files:
- Readme: README.md
- Changelog: news-get.py
Awesome Lists containing this project
README
# dspg22_rss-scraper
Description
---
- A toolbox of scripts to retrieve rss data as well as the news article text via keywords and time ranges
- _rss-get.py_ retrieves rss feeds given keywords and time ranges and saves to an output directory
- _news-get.py_ takes inputs of a csv or a directory of csvs with a url column and appends a new column with the specified textrss-get
===
To use
---
```python
usage: rss-get.py [-h] -k KEYWORDS [KEYWORDS ...] [-s SOURCES [SOURCES ...]] [-o OUTDIR] [-sd START_DATE] [-ed END_DATE]
[-v | --verbose | --no-verbose]DSPG pull from RSS sources
options:
-h, --help show this help message and exit
-k KEYWORDS [KEYWORDS ...], --keywords KEYWORDS [KEYWORDS ...]
Keywords to parse
-s SOURCES [SOURCES ...], --sources SOURCES [SOURCES ...]
Sources (['google', 'bing']) to parse from. If None, parses all possible sources
-o OUTDIR, --outdir OUTDIR
-sd START_DATE, --start_date START_DATE
Start date of the search in Y-m-d format. If none and end date provided, or if provided date further than
now, raises error
-ed END_DATE, --end_date END_DATE
End date of the search in Y-m-d format. If none provided, uses the current time
-v, --verbose, --no-verbose
```Examples
---
Run a search on the keywords ```apples, oranges, bananas```, and export the csvs to the output directory "output":```shell
python rss-get.py -k apples oranges bananas -o output
```Run a search on the keywords ```apple``` for everything after March 3rd, 2022 without saving in verbose mode (note _Bing_ does not have time-based searches)
```shell
python rss-get.py -k apples -sd 2022-03-03 -v
```Run a search on the keywords ```apple, bananas, oranges``` for everything after March 14th, 2021 only using google
```shell
python rss-get.py -k apples bananas oranges -sd 2021-03-14 -s google
```Sources check list
---
- _Completion_ is defined as **C**omplete and **V**isitedCompletion|**Source**|**Type**|**Keyword**|**Snippet**|**Notes**|
|--|--|--|--|--|--|
|C|[Google](https://news.google.com/rss/search?q={0})|Search Engine|Y|Y|
- Successful RSS keyword extraction using: https://news.google.com/rss/search?q={0}
- Column _summary_detail.value_ might contain one sentence description of the news
|C|[Bing](https://www.bing.com/news/search?q={0}&format=rss)|Search Engine|Y|Y|
- Successful RSS keyword extraction using: https://www.bing.com/news/search?q={0}&format=rss
- Column _summary_detail.value_ might contain 2-3 sentences of the news
|V|[Baidu](?)|Search Engine|N|N|
- I found a website https://www.baidu.com/search/rss.html that seems to describe the existance of rss functioning. However, upon clicking into the keyword search field, I kept being returned the same news in non-RSS format
|V|[Yahoo News](https://news.yahoo.com/rss)|News Channel|N|N|
- Does not seem to allow keyword searches. If you do a yahoo search with news it automatically returns search.yahoo.com.
- You can manipulate the [URL](https://news.search.yahoo.com/search;_ylt=A2KIbMuIVJpipzUAhiNXNyoA;_ylu=Y29sbwNiZjEEcG9zAzEEdnRpZAMEc2VjA3Nj?p={KEYWORD}&fr=news) to do news page sources, but it does not seem like they will convert it into an RSS format for us
|V|[Yandex](https://yandex.com/)|Search Engine|N|N|
- robots.txt disallows /company/*.rss, /company/search. Returns results in Russian?
- Upon searching [sitemaps](https://yandex.com/support/sitemap.xml), found an rss source at [https://zen.yandex.ru/](https://zen.yandex.ru/search?query={0}). However, to subscribe to any of the feeds require signing in
|V|[Ask](https://www.ask.com/rss)|Search Engine|N|Y|
- Found https://www.ask.com/rss, but so far haven't found a way to add a keyword.
- Looked through: https://www.ask.com/sitemap_index.xml and https://www.ask.com/robots.txt but did not find anything rss-related
- Column _metadescription_ contains 2 sentences
|C|[ABC News](https://abcnews.go.com)|General Media|Y|N|
- Does not have usable rss feed. Web scraping with keyword search is possible, but most of the content is videos without transcript. I think we do NOT need to dig any deeper.
---
news-get
===
Challenges
---
We have implemented the text extractor with _newspaper_, but the following sites cannot be reached or scrapped.
|**Site**|**URL**|**Description**|
|--|--|--|
|Barron's|(https://www.barrons.com)|Access denied|
|Wall Street Journal|(https://www.wsj.com)|Access denied|
|Forbes|(https://www.forbes.com)|Access denied|
|Bloomberg|(https://www.bloomberg.com)|Not-a-robot test|
|Reuters|(https://www.reuters.com)|Account required to view full text|
Notes
---
- https://www.overleaf.com/4259256516zdhfhfqcshjj
Acknowledgement
---
This project was built as part of the 2022 [Data Science for the Public Good (DSPG) internship program](https://biocomplexity.virginia.edu/data-science-public-good-internship-program)