https://github.com/rohanrvpatil/scraping_concepts

This project covers all scraping concepts.
beautifulsoup4 httpx python scrapy selenium


Host: GitHub
URL: https://github.com/rohanrvpatil/scraping_concepts
Owner: rohanrvpatil
Created: 2024-09-01T09:55:43.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2024-09-11T18:07:25.000Z (about 1 year ago)
Last Synced: 2025-01-14T00:28:08.431Z (9 months ago)
Topics: beautifulsoup4, httpx, python, scrapy, selenium
Language: Python
Homepage:
Size: 3.32 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

This GitHub repository contains web scraping projects built with Scrapy, Selenium, BeautifulSoup, and httpx.

## basic_scrapy_rugshop:

* **Website:** https://www.therugshopuk.co.uk/rugs-by-room/bedroom-rugs.html
* **Purpose:** Extracts product data for rugs
* **Fields extracted:** name, price, link
* **Scraping tool:** Scrapy
* **Libraries/Methods used:** selectors
* **Exported data:** [output.csv](https://github.com/rohanrvpatil/scraping_concepts/blob/main/basic_scrapy_rugshop/output.csv)

## beautiful_soup_proxies:

* **Website:** https://free-proxy-list.net/
* **Purpose:** Extracts free proxies and verifies them
* **Fields extracted:** proxy (with port)
* **Scraping tool:** BeautifulSoup
* **Libraries/Methods used:** requests
* **Exported data:** [verified_proxies.csv](https://github.com/rohanrvpatil/scraping_concepts/blob/main/scraping_proxies/verified_proxies.csv)

## dynamic_hidden_api_json:

* **Website:** https://www.petsathome.com/
* **Purpose:** Extracts product data for pet toys, accessories, and food essentials
* **Fields extracted:** 28 columns of product details
* **Scraping tool:** Fetch/XHR filter in the Network tab of the browser's developer tools (JSON extracted from the site's API)
* **Libraries/Methods used:** requests
* **Exported data:** [products_data.xlsx](https://github.com/rohanrvpatil/scraping_concepts/blob/main/extracting_json/files/products_data.xlsx), [response_data.json](https://github.com/rohanrvpatil/scraping_concepts/blob/main/extracting_json/files/response_data.json)
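
A JSON response from a hidden API is usually nested, so it has to be flattened before it can become a table of columns. The following is a minimal sketch of that step using only the standard library. The payload shape, the field names, and the helpers `flatten` and `rows_to_csv` are hypothetical and are not taken from this project; the real `response_data.json` has its own schema, and the project exports to xlsx rather than CSV.

```python
import csv
import io
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def rows_to_csv(rows):
    """Serialize rows to CSV text; columns are the union of all keys, in first-seen order."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical payload standing in for the API response.
SAMPLE_RESPONSE = json.loads("""
{"products": [
  {"name": "Rope Toy", "price": {"amount": 4.5, "currency": "GBP"}},
  {"name": "Dog Bowl", "price": {"amount": 7.0, "currency": "GBP"}, "brand": "Acme"}
]}
""")

rows = [flatten(product) for product in SAMPLE_RESPONSE["products"]]
csv_text = rows_to_csv(rows)
print(csv_text)
```

Taking the union of keys keeps products with missing fields in the output; their empty cells are filled with `restval`.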

## httpx_scraping:

* **Website:** https://www.rei.com/c/downhill-ski-boots
* **Purpose:** Extracts product data for downhill ski boots
* **Fields extracted:** link, name, product_id, price, rating
* **Scraping tool:** python-httpx
* **Libraries/Methods used:** selectors, urljoin, HTMLParser, dataclasses, export functions for csv/xlsx/json
* **Exported data:** [data.csv](https://github.com/rohanrvpatil/scraping_concepts/blob/main/html_scraping/data_exports/data.csv), [data.json](https://github.com/rohanrvpatil/scraping_concepts/blob/main/html_scraping/data_exports/data.json), [data.xlsx](https://github.com/rohanrvpatil/scraping_concepts/blob/main/html_scraping/data_exports/data.xlsx)
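
The combination of `dataclasses`, `urljoin`, and export functions listed above could look roughly like the sketch below. It covers only the modelling and export steps and uses only the standard library; fetching with httpx and selecting nodes from the HTML are left out. The `Product` class, the helper names, and the sample input are assumptions made for illustration, not the project's actual code.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields
from typing import Optional
from urllib.parse import urljoin

BASE_URL = "https://www.rei.com"


@dataclass
class Product:
    link: str
    name: str
    product_id: str
    price: Optional[float]
    rating: Optional[float]


def to_float(text):
    """Parse a scraped number such as '$499.95'; return None if missing or malformed."""
    if not text:
        return None
    try:
        return float(text.replace("$", "").replace(",", "").strip())
    except ValueError:
        return None


def build_product(raw):
    """Turn one dict of scraped strings into a Product with an absolute link."""
    return Product(
        link=urljoin(BASE_URL, raw["href"]),
        name=raw["name"].strip(),
        product_id=raw["product_id"],
        price=to_float(raw.get("price")),
        rating=to_float(raw.get("rating")),
    )


def to_json(products):
    return json.dumps([asdict(p) for p in products], indent=2)


def to_csv(products):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Product)])
    writer.writeheader()
    writer.writerows(asdict(p) for p in products)
    return buffer.getvalue()


# Hypothetical scraped values; real pages supply their own.
raw_items = [
    {"href": "/product/000000/example-boot", "name": " Example Boot ",
     "product_id": "000000", "price": "$499.95", "rating": "4.5"},
    {"href": "/product/000001/other-boot", "name": "Other Boot",
     "product_id": "000001", "price": None, "rating": ""},
]
products = [build_product(item) for item in raw_items]
print(to_csv(products))
```

Keeping the record shape in one dataclass means the CSV header and the JSON keys both come from the same field list, so the export formats stay consistent.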

## dynamic_scrapy_splash_beerwulf (test project):

* **Website:** https://www.beerwulf.com/en-gb/c/mixedbeercases
* **Purpose:** Extracts beer product data
* **Fields extracted:** name, price
* **Scraping tool:** scrapy-splash
* **Libraries/Methods used:** None
* **Exported data:** None

## selenium_amazon_products:

* **Website:** [Amazon searching for "dell i7 laptop"](https://www.amazon.com/s?k=dell+i7+laptop&crid=3OIV4GP9RPUT3&sprefix=dell+i7%2Caps%2C687&ref=nb_sb_ss_ts-doa-p_1_7)
* **Purpose:** Extracts product details for laptops returned by the search "dell i7 laptop"
* **Fields extracted:** link, title, price, brand, model name, screen size, about this item, technical details: summary, rating (out of 5)
* **Scraping tool:** Selenium
* **Libraries/Methods used:** user agent rotation, chrome_options
* **Exported data:** [laptop_details.xlsx](https://github.com/rohanrvpatil/scraping_concepts/blob/main/selenium_amazon_products/search_laptop_details/data/laptop_details.xlsx)
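
User agent rotation can be sketched without launching a browser: choose a different user agent string for each session and turn it into a Chrome command-line argument. The round-robin order, the example user agent strings, and the helper `next_chrome_arguments` are assumptions for illustration; the project may rotate differently (for example, at random). With Selenium, each returned string would typically be passed to `Options.add_argument` before creating the driver.

```python
import itertools

# Example user agent strings (illustrative values, not taken from the project).
USER_AGENTS = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
)

# Endless round-robin iterator over the pool.
_rotation = itertools.cycle(USER_AGENTS)


def next_chrome_arguments():
    """Return Chrome command-line arguments carrying the next user agent in the rotation."""
    user_agent = next(_rotation)
    return [f"--user-agent={user_agent}"]


first = next_chrome_arguments()
second = next_chrome_arguments()
print(first[0])
print(second[0])
```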
