https://github.com/hayat01sh1da/web-scrapers

This repository contains Python scripts which collect data from websites.

Host: GitHub
URL: https://github.com/hayat01sh1da/web-scrapers
Owner: hayat01sh1da
Created: 2024-03-23T16:23:26.000Z (10 months ago)
Default Branch: master
Last Pushed: 2024-04-08T17:07:52.000Z (9 months ago)
Last Synced: 2024-04-08T21:19:06.060Z (9 months ago)
Language: Python
Size: 3.19 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md

README

## 1. Environment

- WSL (Ubuntu 24.04.1 LTS)
- Python 3.13.1

## 2. Reference

Web Scraping with Python, Introductory Edition: The First Step Toward Business Efficiency (PythonによるWebスクレイピング 入門編 業務効率化への第一歩) - Udemy

## 3. Sample Websites for Web Scraping

- [Login - Introduction to Web Scraping](https://scraping-for-beginner.herokuapp.com/login_page)
- [Instructor Information - Introduction to Web Scraping](https://scraping-for-beginner.herokuapp.com/mypage)
- [Ranking - Introduction to Web Scraping](https://scraping-for-beginner.herokuapp.com/ranking/)
- [Images - Introduction to Web Scraping](https://scraping-for-beginner.herokuapp.com/image)

## 4. Install Chrome Browser (WSL Users Only)

This step is required so that the WebDriver can find the Chrome binary.

```command
$ sudo sh -c 'echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
$ sudo chmod -R +x /dev/null
$ wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/google.gpg > /dev/null
$ sudo apt update && sudo apt install -y google-chrome-stable
```

## 5. Download Chrome WebDriver

```command
# For Linux (WSL) Users
$ wget https://storage.googleapis.com/chrome-for-testing-public/131.0.6778.264/linux64/chromedriver-linux64.zip -P ./webdrivers/ && \
cd ./webdrivers/ && \
unzip chromedriver-linux64.zip && \
mv chromedriver-linux64/chromedriver chromedriver && \
rm -rf chromedriver-linux64*

# For Mac Users
$ wget https://storage.googleapis.com/chrome-for-testing-public/131.0.6778.264/mac-arm64/chromedriver-mac-arm64.zip -P ./webdrivers/ && \
cd ./webdrivers/ && \
unzip chromedriver-mac-arm64.zip && \
mv chromedriver-mac-arm64/chromedriver chromedriver-for-mac && \
rm -rf chromedriver-mac-arm64*
```
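The two download URLs above share one pattern: a version number, a platform name, and an archive named after the platform. A minimal Python sketch of that pattern follows. The helper name `chromedriver_url` is hypothetical, and the layout is inferred only from the two URLs shown in this section, so other versions or platforms may not follow it.

```python
BASE_URL = "https://storage.googleapis.com/chrome-for-testing-public"


def chromedriver_url(version: str, platform: str) -> str:
    """Build a ChromeDriver archive URL from a version and a platform name.

    Hypothetical helper: the URL layout is inferred from the two URLs
    in this README, not from any official specification.
    """
    return f"{BASE_URL}/{version}/{platform}/chromedriver-{platform}.zip"


if __name__ == "__main__":
    # Print the URLs for the two platforms covered in this README.
    for platform in ("linux64", "mac-arm64"):
        print(chromedriver_url("131.0.6778.264", platform))
```

Keeping the version in one place makes it easier to bump ChromeDriver when the installed Chrome version changes.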

## 6. Set the WebDriver Path as an Environment Variable for Your OS

```bash
# For Linux (WSL) Users
echo 'export PATH_TO_WEBDRIVER="./webdrivers/chromedriver"' >> ~/.bash_profile

# For Mac Users (the download step above names the binary chromedriver-for-mac)
echo 'export PATH_TO_WEBDRIVER="./webdrivers/chromedriver-for-mac"' >> ~/.zprofile
```
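A script can then read the variable at run time. The sketch below shows one way to do that; it is not necessarily how this repository's scripts do it. The function name `resolve_webdriver_path` and the fallback default are assumptions.

```python
import os
from pathlib import Path

# Assumed fallback, matching the Linux path used in this README.
DEFAULT_WEBDRIVER_PATH = "./webdrivers/chromedriver"


def resolve_webdriver_path(environ=None) -> Path:
    """Return the WebDriver path from PATH_TO_WEBDRIVER, or the assumed default.

    `environ` is any mapping; it defaults to the process environment.
    An unset or empty variable falls back to DEFAULT_WEBDRIVER_PATH.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("PATH_TO_WEBDRIVER") or DEFAULT_WEBDRIVER_PATH
    return Path(raw).expanduser()


if __name__ == "__main__":
    path = resolve_webdriver_path()
    print(f"{path} (exists: {path.exists()})")
```

Because the configured path is relative, it resolves against the current working directory, so scripts need to be run from the repository root for it to point at `webdrivers/`.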

## 7. Make WebDriver Ready for Web Scraping

```command
$ sudo apt install libnss3-dev
$ webdrivers/chromedriver
Starting ChromeDriver 131.0.6778.264 (52183f9e99a61056f9b78535f53d256f1516f2a0-refs/branch-heads/6778_155@{#7}) on port 0
Only local connections are allowed.
Please see https://chromedriver.chromium.org/security-considerations for suggestions on keeping ChromeDriver safe.
ChromeDriver was started successfully on port 35997.
```
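Once ChromeDriver starts cleanly, a script can launch Chrome through Selenium. The following is a minimal sketch, assuming the `selenium` package (version 4 or later) is installed. This README does not show the repository's own driver setup, so the function names and the choice of Chrome flags are assumptions. The `--no-sandbox` and `--disable-dev-shm-usage` flags are commonly used in WSL and container environments and may be unnecessary on your machine.

```python
def chrome_arguments(headless: bool = True) -> list[str]:
    """Return Chrome command-line flags (an assumed set, not the repository's)."""
    args = ["--no-sandbox", "--disable-dev-shm-usage"]
    if headless:
        args.append("--headless=new")
    return args


def create_driver(driver_path: str, headless: bool = True):
    """Start Chrome via the ChromeDriver binary at driver_path."""
    # Imported lazily so chrome_arguments() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    service = Service(executable_path=driver_path)
    return webdriver.Chrome(service=service, options=options)


# Example usage (requires Chrome, ChromeDriver, and network access):
#
#   driver = create_driver("./webdrivers/chromedriver")
#   try:
#       driver.get("https://scraping-for-beginner.herokuapp.com/ranking/")
#       print(driver.title)
#   finally:
#       driver.quit()  # always release the browser process
```

Calling `driver.quit()` in a `finally` block keeps stray Chrome processes from accumulating when a scrape fails partway through.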

## 8. Bulk Execution of Unit Tests

```command
$ python -m unittest discover ./test
.............../home/hayat01sh1da/.pyenv/versions/3.13.0/lib/python3.13/unittest/suite.py:107: ResourceWarning: unclosed file <_io.BufferedReader name='/mnt/c/Users/binlh/Documents/web/web-scrapers/imgs/bird.jpg'>
for index, test in enumerate(self):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
..../home/hayat01sh1da/.pyenv/versions/3.13.0/lib/python3.13/unittest/suite.py:84: ResourceWarning: unclosed file <_io.BufferedReader name='/mnt/c/Users/binlh/Documents/web/web-scrapers/imgs/bird.jpg'>
return self.run(*args, **kwds)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.....
----------------------------------------------------------------------
Ran 24 tests in 74.834s

OK
```
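The `ResourceWarning` lines in the output above report a file handle on `imgs/bird.jpg` that was opened but never closed. This README does not show which code opens it, so the following is only a sketch of the usual remedy: open files inside a `with` block so they are closed deterministically. The function names `read_image_bytes` and `save_image_bytes` are hypothetical.

```python
from pathlib import Path


def read_image_bytes(path) -> bytes:
    """Read a file's bytes, closing the handle even if reading fails."""
    with open(path, "rb") as image_file:  # closed automatically on block exit
        return image_file.read()


def save_image_bytes(path, data: bytes) -> None:
    """Write bytes to path, creating parent directories as needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with open(target, "wb") as image_file:
        image_file.write(data)
```

To find where the unclosed handle is allocated, the test run can be repeated with allocation tracing enabled, for example `python -X tracemalloc=5 -m unittest discover ./test`.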