https://github.com/chouaib-629/webscraping

A collection of web scraping projects using Beautiful Soup, Selenium, and mixed approaches. Each project includes Python scripts and CSV files of the scraped data. Perfect for learning and experimenting with static and dynamic web scraping techniques.

automation beautifulsoup beautifulsoup4 browser-automation csv datacollection dataextraction dynamicwebsite html-parser jupyter-notebook python python-script python3 selenium staticwebsite webscraping


Host: GitHub
URL: https://github.com/chouaib-629/webscraping
Owner: chouaib-629
Created: 2024-12-13T20:51:36.000Z (7 months ago)
Default Branch: main
Last Pushed: 2024-12-20T17:05:31.000Z (7 months ago)
Last Synced: 2024-12-29T04:56:06.368Z (6 months ago)
Topics: automation, beautifulsoup, beautifulsoup4, browser-automation, csv, datacollection, dataextraction, dynamicwebsite, html-parser, jupyter-notebook, python, python-script, python3, selenium, staticwebsite, webscraping
Language: Jupyter Notebook
Homepage:
Size: 106 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# Web Scraping Projects

This repository contains the scrapers and scraped data for a collection of websites, built for learning and experimentation. The content is organized into subfolders, one per website. Three techniques are used: Beautiful Soup (bs4) for static content, Selenium for dynamic content, and a mix of both for certain cases.

## Folder Structure

- **Main Folder**: Contains subfolders, each representing a scraped website.
- **Subfolders**: Named after the website they were scraped from. Each subfolder contains:
  - The Python code used to scrape the website in two formats: `.py` and `.ipynb`.
  - The CSV file containing the scraped data.
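
As an illustration of the CSV output described above, a scraper could write its rows with the standard `csv` module. This is only a minimal sketch, not the repository's actual code: the helper name `save_rows`, the field names, and the file path in the usage note are all hypothetical.

```python
import csv
from pathlib import Path


def save_rows(rows, csv_path, fieldnames):
    """Write a list of dicts to a CSV file with a header row.

    Hypothetical helper for illustration; not taken from the repository.
    """
    csv_path = Path(csv_path)
    # Create the website's subfolder if it does not exist yet.
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings itself.
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return csv_path
```

A call such as `save_rows(rows, "example_site/jobs.csv", ["title", "company"])` would then produce one CSV file inside a subfolder for that website.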

## Scraping Techniques

1. **Static Websites**:
   - Scraped using **Beautiful Soup (bs4)**.
   - These websites serve static HTML that can be downloaded and parsed directly.

2. **Dynamic Websites**:
   - Scraped using **Selenium**.
   - These websites load data dynamically through JavaScript, so a real browser driven by Selenium is needed to render the content.

3. **Mixed Approach**:
   - Some websites required a combination of **Selenium** and **bs4**.
   - Selenium was used to render the dynamic content, and Beautiful Soup was used to parse the resulting HTML.

## Classification of Websites

Below is the list of websites classified by the scraping technique used:

### Beautiful Soup (bs4)

- [yallakora](https://www.yallakora.com/match-center/%d9%85%d8%b1%d9%83%d8%b2-%d8%a7%d9%84%d9%85%d8%a8%d8%a7%d8%b1%d9%8a%d8%a7%d8%aa#nav-menu)
- [wuzzuf](https://wuzzuf.net/jobs/egypt)
- [wikipedia](https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue)
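
For static pages like the ones listed above, the usual flow is to download the HTML and hand it to Beautiful Soup. The sketch below shows only the parsing step, turning an HTML table into a list of dicts. It is an assumption about how such a scraper might look, not the repository's code: `SAMPLE_HTML` is made-up markup and `table_to_rows` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded static page.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Example Corp</td></tr>
  <tr><td>2</td><td>Sample Inc</td></tr>
</table>
"""


def table_to_rows(html):
    """Return the first table in `html` as a list of dicts keyed by header."""
    # "html.parser" is Python's built-in parser; "lxml" is a faster alternative.
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(dict(zip(headers, cells)))
    return rows


print(table_to_rows(SAMPLE_HTML))
```

For a live page, the same function could be given `requests.get(url, timeout=10).text` instead of the sample string. Real pages often contain several tables, so a more specific lookup than the first `<table>` may be needed.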

### Selenium

- [cookieClicker](https://orteil.dashnet.org/cookieclicker/)

### Mixed Approach

- [coinMarketCap](https://coinmarketcap.com/)

## Requirements

To replicate or run the scraping scripts used in this project, the following Python libraries are required:

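- **Beautiful Soup**: installed as `beautifulsoup4`, imported as `bs4`
- **Selenium**
- **Requests**
- **lxml**
- **html.parser**: part of the Python standard library, so no installation is needed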

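Ensure you have Python installed, along with the necessary libraries. For Selenium, a browser driver is also needed (e.g., ChromeDriver for Google Chrome). Selenium 4.6 and later can fetch a matching driver automatically through Selenium Manager; with older versions, download the driver manually.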
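
As a quick check before running any script, the sketch below reports which of the third-party libraries are missing from the current environment. The helper name `missing_packages` is hypothetical. The pip package names are `beautifulsoup4`, `selenium`, `requests`, and `lxml`.

```python
from importlib.util import find_spec

# Import names of the third-party packages; html.parser ships with Python.
REQUIRED = ["bs4", "selenium", "requests", "lxml"]


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Anything reported as missing can be installed with `pip install` followed by the corresponding package name.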

## Getting Started

1. Clone this repository to your local machine:

```bash
git clone https://github.com/chouaib-629/WebScraping.git
```

2. Navigate to the desired subfolder to inspect the scraped data or associated scripts.

## Notes

- The data scraped from these websites is for educational purposes only. Review and follow each website's terms and conditions before scraping.
- The scripts and data are provided "as is" without warranty of any kind.

## Author

This project is managed by a data science enthusiast and full-stack developer experimenting with web scraping techniques.

## Contact Information

For questions or support, please contact [Me](mailto:[email protected]).
