https://github.com/geminidsystems/googlenewsscraper
A Python package that scrapes Google News article data while remaining undetected by Google. Our scraper can scrape page data up until the last page and never trigger a CAPTCHA (download stats: https://pepy.tech/project/GoogleNewsScraper)
- Host: GitHub
- URL: https://github.com/geminidsystems/googlenewsscraper
- Owner: GeminidSystems
- License: mit
- Created: 2021-07-28T17:46:12.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-02-28T23:33:04.000Z (over 2 years ago)
- Last Synced: 2024-11-16T02:14:58.709Z (4 days ago)
- Topics: crawler, googleautomator, googlenews, googlenewsscraper, googlescraper, python, scraper, scraping, selenium, web-scraping, webcrawler, webdriver, webscraper
- Language: Python
- Homepage: https://pypi.org/project/GoogleNewsScraper/
- Size: 15.3 MB
- Stars: 11
- Watchers: 2
- Forks: 5
- Open Issues: 5
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.txt
- License: LICENSE
# googlenewsscraper
## Getting Started
### Installation
```bash
$ pip install GoogleNewsScraper
```

# Reference
## Importing
```Python
from GoogleNewsScraper import GoogleNewsScraper
```

## Instantiating Scraper
```Python
GoogleNewsScraper(driver)
```

**Constructor Parameters**

| Name | Type | Required |
| ------ | ---------- | -------- |
| driver | web driver | no |

Possible values:
- `'chrome'`: Use this package's own Chrome driver
- A path to some driver (Firefox, for instance) stored on the user's system

## Methods
**This method is publicly accessible, though it is really only intended for internal use by the class**
```Python
locate_html_element(self, driver, element, selector, wait_seconds)
```

| Name | Type | Required | Description |
| ------------ | ------------- | -------- | -------------------------------------------------------------------- |
| driver | web driver | yes | A web driver (Chrome, Firefox, etc.) |
| element | string | yes | Id or class selector of an HTML element |
| selector | Module import | yes | see below |
| wait_seconds | int | no | Waits a certain number of seconds in order to locate an HTML element |

**To configure the 'selector' param**:

First, install selenium:
```bash
$ pip install selenium
```

Then import `By`:
```Python
from selenium.webdriver.common.by import By
```

Possible values:
- `By.ID`
- `By.CLASS_NAME`
- `By.CSS_SELECTOR`
- `By.LINK_TEXT`
- `By.NAME`
- `By.PARTIAL_LINK_TEXT`
- `By.TAG_NAME`
- `By.XPATH`

---

```Python
GoogleNewsScraper(...args).search(search_text, date_range, pages, pagination_pause_per_page, cb) -> list or None
```

| Name | Type | Required | Description |
| ------------------------- | ---------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| search_text | str | yes | One or more words to enter into the Google search engine |
| date_range | str | no | Filters articles by date. Possible values: Past hours, Past 24 hours, Past week, Past month, Past year, Archives |
| pages | str or int | no | Number of pages that should be scraped (defaults to 'max') |
| pagination_pause_per_page | int | no | Number of seconds to wait before a new page is scraped (defaults to 2). This may have to be increased if Google prevents you from scraping all pages. |
| cb | function | no | Called with all article data from a single page, once for every page scraped (defaults to False) |

- **Example using the 'cb' parameter**:
```Python
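from GoogleNewsScraper import GoogleNewsScraper

def handle_page_data(page_data: list):
    # Do something with page_data (called once per scraped page)
    print(len(page_data))

GoogleNewsScraper('chrome').search('search text', cb=handle_page_data)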
```

**NOTE**:
- If no argument is provided for 'cb', the search method will return a two-dimensional list
- Each inner list will contain one object of news article data for every news article on that page

**Example of the data that every article object will contain:**
- `'id'`: A unique id for every article data object
- `'description'`: The preview description of the news article
- `'title'`: The title of the news article
- `'source'`: The source of the news article (New York Times, for instance)
- `'image_url'`: The URL of the preview news article image
- `'url'`: A link to the news article
- `'date_time'`: A datetime string representing when the article was published

---
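
The return shape described above (one inner list per page, one article object per article) usually needs flattening before further processing. The following is a minimal sketch, assuming each article object behaves like a Python dict with the keys listed above. The sample data and the helper names `flatten_pages` and `group_by_source` are hypothetical and are not part of the package.

```python
from collections import defaultdict


def flatten_pages(pages):
    """Turn the two-dimensional list (pages -> articles) into one flat list."""
    return [article for page in pages for article in page]


def group_by_source(articles):
    """Map each source name to the titles of its articles, in scrape order."""
    grouped = defaultdict(list)
    for article in articles:
        grouped[article["source"]].append(article["title"])
    return dict(grouped)


# Hypothetical sample shaped like the documented return value of search();
# real results would also carry 'description', 'image_url', 'url', 'date_time'.
sample_pages = [
    [
        {"id": "a1", "title": "First headline", "source": "Example Times"},
        {"id": "a2", "title": "Second headline", "source": "Example Post"},
    ],
    [
        {"id": "a3", "title": "Third headline", "source": "Example Times"},
    ],
]

articles = flatten_pages(sample_pages)
titles_by_source = group_by_source(articles)
```

With real data, `sample_pages` would be replaced by the list returned from `search(...)` when no `cb` argument is given.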