https://github.com/rasbeetech/web-scraper

Web scraper via Python
https://github.com/rasbeetech/web-scraper

Last synced: about 1 month ago
JSON representation

Web scraper via Python

Host: GitHub
URL: https://github.com/rasbeetech/web-scraper
Owner: RasbeeTech
Created: 2021-02-11T17:47:23.000Z (over 5 years ago)
Default Branch: main
Last Pushed: 2021-03-10T20:59:30.000Z (over 5 years ago)
Last Synced: 2025-03-20T09:42:37.449Z (over 1 year ago)
Language: Python
Size: 7.6 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          # Web-Scraper

 Web scraper via Python and Selenium.

 

 Python version: 3.8.7  

 Selenium version: 3.141.0  

 Tested using: www.webscraper.io/test-sites  

 ### Description

 This webscraper is used to test scraping functionality.  The goal was to scrape all products found in the test urls.  

 ## Modules

 ### main.py

 Use this module to run/test the program.

 ### threader.py

 **Thread**(*self, thread_id, func, args*): initializes thread.  Declare a thread id(thread_id) for identification.  This custom class also passes a function(func) to and and argument(args).  

 **run**(): When Thread().start is called the initiated function(func) will be called with arguement(arg).  

 ### scraper.py

 **Scraper**(*url*): Initializes scraper with url to be scraped.  When Scraper() is declared, a Selenium Chrome webdriver is created, and the content of the url is scraped.  

 **set_chrome_options**(): Defines the chrome options.  They are already set to headless and to disable gpu.  

 **get_content**(): Scrapes all html content from url address.  

 **get_all_tags**(*tag="h1"*): retrieves all elements with a chosen html tag.  Note: Default is "h1" tag.  

 **get_items**(*product_container="div.thumbnail"*): Retrives all specified containers.  Note: Default is "div.thumbnail".  

 **get_all_products**(*content_container="div.thumbnail"*): gets all products found in url.  Takes the arguement content_container which specifies the container holding each product.  Note: Default is "div.thumbnail".  

 **quit**(): terminate the webdriver. Note: Webdrivers should be terminated at the end of each session or they will stay active and occupy device memory.  

 **save_product_csv**(*all_products*): Takes arguement all_products as dictionary object.  Saves all_product into .csv file in the same directory.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/rasbeetech/web-scraper

Awesome Lists containing this project

README