Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/lucasfaudman/souper-scraper
A simple web scraper base that combines BeautifulSoup and Selenium to scrape dynamic websites.
- Host: GitHub
- URL: https://github.com/lucasfaudman/souper-scraper
- Owner: LucasFaudman
- License: mit
- Created: 2024-03-18T22:20:01.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-06-26T01:07:32.000Z (5 months ago)
- Last Synced: 2024-10-09T20:43:15.713Z (about 1 month ago)
- Topics: beautifulsoup4, selenium, web-sc
- Language: Python
- Homepage:
- Size: 57.6 KB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# SouperScraper
> A simple web scraper base that combines BeautifulSoup and Selenium to scrape dynamic websites.
## Setup
1. Install with pip
```bash
pip install souperscraper
```
2. Download the appropriate [ChromeDriver](https://sites.google.com/a/chromium.org/chromedriver/downloads) for your Chrome version, either automatically with the bundled [getchromedriver.py](https://github.com/LucasFaudman/souper-scraper/blob/main/src/souperscraper/getchromedriver.py) script (command below) or manually from the ChromeDriver website.
> To find your Chrome version, go to [`chrome://settings/help`](chrome://settings/help) in your browser.
```bash
getchromedriver
```
3. Create a new SouperScraper object using the path to your ChromeDriver
```python
from souperscraper import SouperScraper

scraper = SouperScraper('/path/to/your/chromedriver')
```
4. Start scraping using BeautifulSoup and/or Selenium methods
```python
scraper.goto('https://github.com/LucasFaudman')

# Use BeautifulSoup to search for and extract content
# by accessing the scraper's 'soup' attribute
# or with the 'soup_find' / 'soup_find_all' methods
repos = scraper.soup.find_all('span', class_='repo')
for repo in repos:
    repo_name = repo.text
    print(repo_name)

# Use Selenium to interact with the page such as clicking buttons
# or filling out forms by accessing the scraper's
# find_element_by_* / find_elements_by_* / wait_for_* methods
repos_tab = scraper.find_element_by_css_selector("a[data-tab-item='repositories']")
repos_tab.click()

search_input = scraper.wait_for_visibility_of_element_located_by_id('your-repos-filter')
search_input.send_keys('souper-scraper')
search_input.submit()
```
## BeautifulSoup Reference
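The scraper's `soup` attribute is a standard bs4 `BeautifulSoup` object, so any method from the links below works on it directly. A minimal standalone illustration of the `find_all` pattern used in the example above (plain bs4 on a static HTML fragment, no scraper or browser needed):

```python
from bs4 import BeautifulSoup

# A small HTML fragment, standing in for the page scraper.soup would hold
html = """
<ul>
  <li><span class="repo">souper-scraper</span></li>
  <li><span class="repo">another-repo</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all with a class filter, mirroring the repo-listing example above
names = [span.text for span in soup.find_all("span", class_="repo")]
print(names)  # ['souper-scraper', 'another-repo']
```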
- [Quick Start](https://beautiful-soup-4.readthedocs.io/en/latest/#quick-start)
- [Types of Objects](https://beautiful-soup-4.readthedocs.io/en/latest/#kinds-of-objects)
- [The BeautifulSoup object](https://beautiful-soup-4.readthedocs.io/en/latest/#beautifulsoup)
- [Navigating the HTML tree](https://beautiful-soup-4.readthedocs.io/en/latest/#navigating-the-tree)
- [Searching for HTML Elements](https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree)
- [Modifying the tree](https://beautiful-soup-4.readthedocs.io/en/latest/#modifying-the-tree)

## Selenium Reference
- [Quick Start](https://selenium-python.readthedocs.io/getting-started.html)
- [Navigating the Web](https://selenium-python.readthedocs.io/getting-started.html)
- [Locating HTML Elements](https://selenium-python.readthedocs.io/locating-elements.html)
- [Interacting with HTML elements on the page](https://selenium-python.readthedocs.io/navigating.html#interacting-with-the-page)
- [Filling in Forms](https://selenium-python.readthedocs.io/navigating.html#filling-in-forms)
- [Waiting (for page to load, element to be visible, etc.)](https://selenium-python.readthedocs.io/waits.html)
- [Full Webdriver API Reference](https://selenium-python.readthedocs.io/api.html)