Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/lucasfaudman/souper-scraper

A simple web scraper base that combines BeautifulSoup and Selenium to scrape dynamic websites.

Host: GitHub
URL: https://github.com/lucasfaudman/souper-scraper
Owner: LucasFaudman
License: mit
Created: 2024-03-18T22:20:01.000Z (11 months ago)
Default Branch: main
Last Pushed: 2024-06-26T01:07:32.000Z (8 months ago)
Last Synced: 2025-01-30T19:11:19.239Z (20 days ago)
Topics: beautifulsoup4, selenium, web-sc
Language: Python
Homepage:
Size: 57.6 KB
Stars: 2
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


README

# SouperScraper

> A simple web scraper base that combines BeautifulSoup and Selenium to scrape dynamic websites.

## Setup

1. Install with pip

```bash
pip install souperscraper
```

2. Download the appropriate [ChromeDriver](https://sites.google.com/a/chromium.org/chromedriver/downloads) for your Chrome version using [getchromedriver.py](https://github.com/LucasFaudman/souper-scraper/blob/main/src/souperscraper/getchromedriver.py) (command below) or manually from the [ChromeDriver website](https://sites.google.com/a/chromium.org/chromedriver/downloads).

> To find your Chrome version, go to [`chrome://settings/help`](chrome://settings/help) in your browser.

```bash
getchromedriver
```
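ChromeDriver releases are matched to Chrome by version, so it can help to pull the major version out of the string your browser reports. The helper below is a minimal sketch and not part of SouperScraper; the function name `chrome_major_version` and the sample version string are assumptions made for illustration.

```python
import re


def chrome_major_version(version_text: str) -> int:
    """Return the major version from text such as 'Google Chrome 122.0.6261.94'.

    Hypothetical helper: it only parses a string you supply, for example
    the value shown on chrome://settings/help.
    """
    # Chrome versions have four dot-separated numeric parts.
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_text)
    if match is None:
        raise ValueError(f"No Chrome version found in {version_text!r}")
    return int(match.group(1))


print(chrome_major_version("Google Chrome 122.0.6261.94"))  # 122
```

The major number is what you would compare against the ChromeDriver build you download.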

3. Create a new SouperScraper object using the path to your ChromeDriver

```python
from souperscraper import SouperScraper

scraper = SouperScraper('/path/to/your/chromedriver')
```
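If you would rather not hard-code the driver path, you can resolve it before constructing the scraper. The following is a rough sketch using only the standard library; `resolve_chromedriver` and the `CHROMEDRIVER_PATH` environment variable are hypothetical names chosen for this example, not something SouperScraper provides.

```python
import os
import shutil
from pathlib import Path


def resolve_chromedriver(explicit_path=None, env_var="CHROMEDRIVER_PATH"):
    """Return the first existing ChromeDriver path from several candidates.

    Order of preference: an explicit path, the environment variable named
    by env_var (an assumed convention), then a `chromedriver` on PATH.
    """
    candidates = [
        explicit_path,
        os.environ.get(env_var),
        shutil.which("chromedriver"),
    ]
    for candidate in candidates:
        if candidate and Path(candidate).is_file():
            return str(Path(candidate).resolve())
    raise FileNotFoundError(
        "ChromeDriver not found; pass a path, set the environment "
        "variable, or add chromedriver to PATH."
    )


# Intended use (not run here, since it needs souperscraper and a real driver):
# scraper = SouperScraper(resolve_chromedriver())
```

Failing early with `FileNotFoundError` gives a clearer message than a browser start-up error from a wrong path.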

4. Start scraping using BeautifulSoup and/or Selenium methods

```python
scraper.goto('https://github.com/LucasFaudman')

# Use BeautifulSoup to search for and extract content
# by accessing the scraper's 'soup' attribute
# or with the 'soup_find' / 'soup_find_all' methods
repos = scraper.soup.find_all('span', class_='repo')
for repo in repos:
    repo_name = repo.text
    print(repo_name)

# Use Selenium to interact with the page such as clicking buttons
# or filling out forms by accessing the scraper's
# find_element_by_* / find_elements_by_* / wait_for_* methods
repos_tab = scraper.find_element_by_css_selector("a[data-tab-item='repositories']")
repos_tab.click()

search_input = scraper.wait_for_visibility_of_element_located_by_id('your-repos-filter')
search_input.send_keys('souper-scraper')
search_input.submit()
```
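To try the BeautifulSoup half of the example without launching a browser, you can run the same `find_all` query against a static HTML string. This is a small sketch that assumes `scraper.soup` behaves like an ordinary `BeautifulSoup` object, as the example above suggests; the HTML snippet is made up for illustration and does not reflect GitHub's real markup.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a page that has already been loaded.
html = """
<div id="repos">
  <span class="repo">souper-scraper</span>
  <span class="repo">another-project</span>
  <span class="label">not a repo</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Same query as in the example above: every <span> with class "repo".
repo_names = [
    span.get_text(strip=True)
    for span in soup.find_all("span", class_="repo")
]

print(repo_names)  # ['souper-scraper', 'another-project']
```

On a dynamic site the markup only exists after the browser has rendered the page, which is why the page is loaded with `goto` before the soup is searched.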

## BeautifulSoup Reference

- [Quick Start](https://beautiful-soup-4.readthedocs.io/en/latest/#quick-start)

- [Types of Objects](https://beautiful-soup-4.readthedocs.io/en/latest/#kinds-of-objects)

- [The BeautifulSoup object](https://beautiful-soup-4.readthedocs.io/en/latest/#beautifulsoup)

- [Navigating the HTML tree](https://beautiful-soup-4.readthedocs.io/en/latest/#navigating-the-tree)

- [Searching for HTML Elements](https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree)

- [Modifying the tree](https://beautiful-soup-4.readthedocs.io/en/latest/#modifying-the-tree)

## Selenium Reference

- [Quick Start](https://selenium-python.readthedocs.io/getting-started.html)

- [Navigating the Web](https://selenium-python.readthedocs.io/navigating.html)

- [Locating HTML Elements](https://selenium-python.readthedocs.io/locating-elements.html)

- [Interacting with HTML elements on the page](https://selenium-python.readthedocs.io/navigating.html#interacting-with-the-page)

- [Filling in Forms](https://selenium-python.readthedocs.io/navigating.html#filling-in-forms)

- [Waiting (for page to load, element to be visible, etc)](https://selenium-python.readthedocs.io/waits.html)

- [Full Webdriver API Reference](https://selenium-python.readthedocs.io/api.html)