https://github.com/darideveloper/phone-emails-scraper-multithreading
Project for extracting emails and phone numbers from a list of web pages with multithreading, using requests, bs4, and regex, plus Selenium to get more data.
- Host: GitHub
- URL: https://github.com/darideveloper/phone-emails-scraper-multithreading
- Owner: darideveloper
- License: mit
- Created: 2023-01-07T04:39:34.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2023-11-27T05:08:30.000Z (12 months ago)
- Last Synced: 2024-06-28T07:32:47.110Z (5 months ago)
- Topics: python, script, web-automation, web-scraping
- Language: Python
- Homepage:
- Size: 41 KB
- Stars: 6
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Phone Emails Scraper Multithreading
Project for extracting emails and phone numbers from a list of web pages with multithreading, using requests, bs4, and regex, plus Selenium to get more data.
Project type: **client**
# Built with
* Python
* requests
* bs4
* Selenium
# Details
This project extracts emails and phone numbers from a list of web pages. It combines multithreading, requests, bs4, and regex, and can use Selenium to get more data.
The script extracts emails and phone numbers from the web pages listed in the `input.txt` file and saves the output in the `output.csv` file.
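The regex step can be sketched as follows. This is a minimal illustration, not the project's own code: the two patterns and the `extract_contacts` helper are assumptions, and real-world phone formats vary more than this pattern covers.

```python
import re

# Illustrative patterns only; the project's actual regexes are not shown in this README.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_contacts(text: str) -> tuple[list[str], list[str]]:
    """Return (emails, phones) found in text, de-duplicated, in order of appearance."""
    # dict.fromkeys removes duplicates while keeping the first-seen order.
    emails = list(dict.fromkeys(EMAIL_RE.findall(text)))
    phones = list(dict.fromkeys(m.strip() for m in PHONE_RE.findall(text)))
    return emails, phones


sample = "Contact: sales@example.com or +1 (555) 010-9999. Again: sales@example.com"
print(extract_contacts(sample))
```

In practice the input text would be the page text obtained with requests and bs4 (or Selenium) rather than a hard-coded string.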
The script uses multithreading to extract data from the web pages faster.
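One common way to process many pages in parallel is a thread pool. The sketch below uses the standard library's `concurrent.futures`, which may differ from the threading approach the project actually takes; `scrape_all` and `fake_fetch` are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(urls, fetch, threads=4):
    """Apply fetch(url) to every URL on a pool of worker threads.

    fetch is a hypothetical callable supplied by the caller, for example a
    function that downloads a page and extracts its contacts.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so results line up with urls.
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url):
    # Stand-in for a real download so the sketch runs without network access.
    return f"contents of {url}"


results = scrape_all(
    ["https://example.com/a", "https://example.com/b"], fake_fetch, threads=2
)
print(results)
```

Threads suit this workload because most of the time is spent waiting on network responses rather than on computation.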
The script can use Selenium (Google Chrome) to get more data from the web pages, because some pages use JavaScript to display their data. Using Selenium is optional (see the `USE_SELENIUM` variable in the `.env` file).
You can set the number of threads in the `.env` file (see the `THREADS` variable).
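As a rough sketch of how such settings could be read, the snippet below parses `KEY=VALUE` lines by hand. Only the variable names `USE_SELENIUM` and `THREADS` come from this README; the example values, the accepted boolean spellings, and the `parse_env` helper are assumptions, and the project may load its `.env` file differently.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


# Hypothetical file contents; only the variable names come from this README.
example_env = """
USE_SELENIUM=True
THREADS=8
"""

settings = parse_env(example_env)
# Assumed convention: "1", "true", or "yes" (any case) enables Selenium.
use_selenium = settings.get("USE_SELENIUM", "false").lower() in {"1", "true", "yes"}
threads = int(settings.get("THREADS", "1"))
print(use_selenium, threads)
```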
# Install
## Prerequisites
* [Google Chrome](https://www.google.com/intl/es-419/chrome/)
* [Python >=3.10](https://www.python.org/)
* [Git](https://git-scm.com/)
## Installation
1. Clone the repo
```sh
git clone https://github.com/darideveloper/phone-emails-scraper-multithreading
```
2. Install the Python packages (open a terminal in the project folder)
```sh
python -m pip install -r requirements.txt
```
# Settings
1. Set your options in the `.env` file
2. Put the web pages in the `input.csv` file
# Run
1. Run the project folder with Python:
```sh
python .
```
2. Wait until the script finishes, then check the `output.csv` file in the project folder
# Roadmap
- [x] Extract email and phone using requests and bs4
- [x] Extract email and phone using regex
- [x] Extract email and phone using selenium
- [x] Multithreading
- [x] `.env` file for options