{"id":21050560,"url":"https://github.com/geminidsystems/googlenewsscraper","last_synced_at":"2025-08-13T17:03:50.148Z","repository":{"id":57435350,"uuid":"390448440","full_name":"GeminidSystems/GoogleNewsScraper","owner":"GeminidSystems","description":"A Python package that scrapes Google News article data while remaining undetected by Google. Our scraper can scrape page data up until the last page and never trigger a CAPTCHA (download stats: https://pepy.tech/project/GoogleNewsScraper)","archived":false,"fork":false,"pushed_at":"2022-02-28T23:33:04.000Z","size":16007,"stargazers_count":12,"open_issues_count":5,"forks_count":5,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-12T23:51:33.723Z","etag":null,"topics":["crawler","googleautomator","googlenews","googlenewsscraper","googlescraper","python","scraper","scraping","selenium","web-scraping","webcrawler","webdriver","webscraper"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/GoogleNewsScraper/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GeminidSystems.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.txt","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-07-28T17:46:12.000Z","updated_at":"2025-04-10T19:50:44.000Z","dependencies_parsed_at":"2022-09-19T08:10:20.690Z","dependency_job_id":null,"html_url":"https://github.com/GeminidSystems/GoogleNewsScraper","commit_stats":null,"previous_names":["geminidsystems/google_news_scraper"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GeminidSystems%2FGoogleNewsScraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/GeminidSystems%2FGoogleNewsScraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GeminidSystems%2FGoogleNewsScraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GeminidSystems%2FGoogleNewsScraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GeminidSystems","download_url":"https://codeload.github.com/GeminidSystems/GoogleNewsScraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254425543,"owners_count":22069195,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","googleautomator","googlenews","googlenewsscraper","googlescraper","python","scraper","scraping","selenium","web-scraping","webcrawler","webdriver","webscraper"],"created_at":"2024-11-19T15:34:20.412Z","updated_at":"2025-05-15T21:33:57.508Z","avatar_url":"https://github.com/GeminidSystems.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# googlenewsscraper\n\n## Getting Started\n\n### Installation\n\n```bash\n$ pip install GoogleNewsScraper\n```\n\n# Reference\n\n## Importing\n\n```Python\nfrom GoogleNewsScraper import GoogleNewsScraper\n```\n\n## Instantiating Scraper\n\n```Python\nGoogleNewsScraper(driver)\n```\n\n**Constructor Parameters**\n\n| Name   | Type       | Required |\n| ------ | ---------- | -------- |\n| driver | web driver | no       |\n\nPossible values:\n\n- `'chrome'`: Uses the Chrome driver bundled with this package (the default)\n- A path to some driver (Firefox, 
for instance) stored on the user's system\n\n## Methods\n\n**This method is publicly accessible, but it is intended for internal use by the class**\n\n```Python\nlocate_html_element(self, driver, element, selector, wait_seconds)\n```\n\n| Name         | Type          | Required | Description                                                          |\n| ------------ | ------------- | -------- | -------------------------------------------------------------------- |\n| driver       | web driver    | yes      | A web driver (Chrome, Firefox, etc.)                                 |\n| element      | string        | yes      | ID or class selector of an HTML element                              |\n| selector     | Module import | yes      | see below                                                            |\n| wait_seconds | int           | no       | Waits a certain number of seconds in order to locate an HTML element |\n\n**To configure the 'selector' param**:\n\nFirst, install Selenium:\n\n```bash\n$ pip install selenium\n```\n\nThen import `By`:\n\n```Python\nfrom selenium.webdriver.common.by import By\n```\n\nPossible values:\n\n- `By.ID`\n- `By.CLASS_NAME`\n- `By.CSS_SELECTOR`\n- `By.LINK_TEXT`\n- `By.NAME`\n- `By.PARTIAL_LINK_TEXT`\n- `By.TAG_NAME`\n- `By.XPATH`\n\n---\n\n```Python\nGoogleNewsScraper(...args).search(search_text, date_range, pages, pagination_pause_per_page, cb) -\u003e list or None\n```\n\n| Name                      | Type       | Required | Description                                                                                                                                                   |\n| ------------------------- | ---------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| search_text               | str        | yes      | A series of word(s) that will be inputted into the Google search engine           
                                                                            |\n| date_range                | str        | no       | Filters articles by date. Possible values: Past hours, Past 24 hours, Past week, Past month, Past year, Archives                                              |\n| pages                     | str or int | no       | Number of pages that should be scraped (defaults to 'max')                                                                                                    |\n| pagination_pause_per_page | int        | no       | Waits a certain number of seconds before a new page is scraped (defaults to 2). Time may have to be increased if Google prevents you from scraping all pages. |\n| cb                        | function   | no       | Called with all article data from a page after each page is scraped (defaults to False)                                                                       |\n\n- **Example using 'cb' parameter**:\n\n```Python\ndef handle_page_data(page_data: list):\n  # Do something with page_data\n  pass\n\nGoogleNewsScraper(...args).search(...args, cb=handle_page_data)\n```\n\n**NOTE**:\n\n- If no argument is provided for 'cb', the search method returns a two-dimensional list\n- Each inner list contains one article-data object for every news article on that page\n\n**Data that every article object contains:**\n\n- `'id'`: A unique id for every article data object\n- `'description'`: The preview description of the news article\n- `'title'`: The title of the news article\n- `'source'`: The source of the news article (New York Times, for instance)\n- `'image_url'`: The URL of the news article's preview image\n- `'url'`: A link to the news article\n- `'date_time'`: A datetime string that represents when the article was 
published\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeminidsystems%2Fgooglenewsscraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgeminidsystems%2Fgooglenewsscraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeminidsystems%2Fgooglenewsscraper/lists"}