https://github.com/ddayto21/lead-scraper
This repository contains a web crawler that searches webpages for email addresses, along with a web-scraping script that collects leads from various webpages, filters the discovered links against some criteria, and adds new links to a queue. The HTML, or specific information extracted from it, is passed to a separate pipeline for processing.
- Host: GitHub
- URL: https://github.com/ddayto21/lead-scraper
- Owner: ddayto21
- Created: 2022-07-18T23:10:19.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2022-07-19T00:22:42.000Z (about 3 years ago)
- Last Synced: 2025-04-30T08:54:05.070Z (6 months ago)
- Topics: beautifulsoup4, python, requests, webcrawler, webscraper, yellow-pages
- Language: Python
- Homepage:
- Size: 50.8 KB
- Stars: 15
- Watchers: 1
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Repository Overview
This repository was built to help business owners save time by collecting thousands of business leads from Yellow Pages, a directory of over 27 million businesses in the United States.
We use `requests`, a Python HTTP library, to collect large amounts of unstructured data from Yellow Pages. We then use BeautifulSoup to parse the relevant information out of the HTML. Finally, we use Pandas to build dataframes and save the leads to `.csv` files that can be used for marketing campaigns.
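The end-to-end flow above (fetch HTML, parse it with BeautifulSoup, load the rows into Pandas, export CSV) can be sketched on a small inline sample. Note that the `result`, `business-name`, and `phones` class names here are hypothetical placeholders; the real class names on Yellow Pages pages may differ.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical listing markup standing in for a fetched Yellow Pages page;
# real class names on the live site may differ.
SAMPLE_HTML = """
<div class="result"><a class="business-name">Acme Plumbing</a><div class="phones">(555) 123-4567</div></div>
<div class="result"><a class="business-name">Best Roofing</a><div class="phones">(555) 987-6543</div></div>
"""

def extract_leads(html: str) -> pd.DataFrame:
    """Parse listing HTML and return one row per business."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for result in soup.find_all("div", class_="result"):
        rows.append({
            "name": result.find(class_="business-name").get_text(strip=True),
            "phone": result.find(class_="phones").get_text(strip=True),
        })
    return pd.DataFrame(rows)

leads = extract_leads(SAMPLE_HTML)
csv_text = leads.to_csv(index=False)  # in the real pipeline: leads.to_csv("leads.csv")
```

In the real script the HTML would come from `requests.get(...)` rather than an inline string, and the CSV would be written to disk.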
## Install the Requests Library
```
$ pip install requests
```

## Import the Requests Library
```python
import requests
```
## Send an HTTP Request to the Server
```python
response = requests.get(url)
```
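Before sending the request, the search URL has to be assembled. One way to do this, letting `requests` handle the query-string encoding, is sketched below; the `search_terms` and `geo_location_terms` parameter names are assumptions about the Yellow Pages search endpoint, not confirmed values.

```python
import requests

# Hypothetical Yellow Pages search endpoint; the real URL scheme may differ.
BASE_URL = "https://www.yellowpages.com/search"

def build_search_url(term: str, location: str) -> str:
    """Build a search URL with properly encoded query parameters.

    Using requests.Request + prepare() encodes the params without
    sending any network traffic.
    """
    req = requests.Request(
        "GET",
        BASE_URL,
        params={"search_terms": term, "geo_location_terms": location},
    )
    return req.prepare().url
```

The resulting URL can then be passed to `requests.get(url)` as shown above.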
## Extract Relevant Data from Response
We use BeautifulSoup, a Python library that makes it easy to parse data from HTML files.

### Install the Beautiful Soup Library
```
$ pip install beautifulsoup4
```

### Import the Beautiful Soup Library
```python
from bs4 import BeautifulSoup
```
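As a concrete parsing example tied to the email crawler described at the top of this page, here is a minimal sketch of extracting email addresses from a page's text with BeautifulSoup and a regular expression. The sample HTML and the regex are illustrative assumptions, not the repository's exact implementation.

```python
import re
from bs4 import BeautifulSoup

# Simple email pattern for illustration; real-world email matching
# is more nuanced than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Inline sample standing in for a fetched response.text.
SAMPLE_PAGE = """
<html><body>
<p>Contact us at <a href="mailto:sales@example.com">sales@example.com</a></p>
<p>Support: support@example.com</p>
</body></html>
"""

def find_emails(html: str) -> set[str]:
    """Return the set of email addresses found in the page's visible text."""
    soup = BeautifulSoup(html, "html.parser")
    return set(EMAIL_RE.findall(soup.get_text()))
```

Running `find_emails` over each crawled page and merging the resulting sets would give a deduplicated list of contact emails for the lead pipeline.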