https://github.com/hjsblogger/web-crawling-with-python

Demonstration of Web Crawling using Python and Beautiful Soup

beautifulsoup beautifulsoup4 lambdatest python python3 web-crawler web-crawling web-crawling-and-scraping

# Web Crawling with Python


In this 'Web Crawling with Python' repo, we have covered the following scenario:

Unique links from [LambdaTest E-commerce Playground](https://ecommerce-playground.lambdatest.io/) are crawled using Beautiful Soup. Content (i.e., product metadata) from the crawled pages is then scraped with Beautiful Soup. I have a detailed blog & repo on **Web Scraping with Python**, details below:

* [Blog - Web Scraping with Python](https://www.lambdatest.com/blog/web-scraping-with-python/)
* [Repo - Web Scraping with Python](https://github.com/hjsblogger/web-scraping-with-python)

## Pre-requisites for test execution

**Step 1**

Create a virtual environment by triggering the *virtualenv venv* command on the terminal

```bash
virtualenv venv
```

**Step 2**

Activate the newly created virtual environment by triggering the *source venv/bin/activate* command on the terminal

```bash
source venv/bin/activate
```

Follow Step 3 to install the dependencies needed for crawling and scraping:

**Step 3**

Run the *make install* command on the terminal to install the desired packages (or dependencies) - Beautiful Soup, urllib3, etc.

```bash
make install
```

With this, all the dependencies and environment variables are set. We are all set for web crawling with Beautiful Soup (bs4).
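
As a quick sanity check after installation, the snippet below is a minimal sketch that reports whether the expected packages can be imported from the active virtual environment. The module names `bs4` and `urllib3` are assumptions based on the packages named above; the repo's actual dependency list may contain more.

```python
from importlib.util import find_spec

# Assumed import names: "bs4" for Beautiful Soup and "urllib3".
# The repo's real requirements may list additional packages.
REQUIRED_MODULES = ("bs4", "urllib3")


def check_modules(names=REQUIRED_MODULES):
    """Return a mapping of top-level module name -> True if it is importable."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, available in check_modules().items():
        print(f"{name}: {'ok' if available else 'MISSING'}")
```

If any entry is reported as missing, re-running *make install* inside the activated virtual environment is the first thing to try.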

## Web Crawling using Beautiful Soup

Follow the steps below to crawl the [LambdaTest E-commerce Playground](https://ecommerce-playground.lambdatest.io/):

**Step 1**

Trigger the ```make clean``` command to remove ```__pycache__``` folder(s) and .pyc files
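
For readers without `make`, the same kind of cleanup can be approximated in Python. This is a sketch of the general idea only; the function name `clean_bytecode` is hypothetical, and the repo's Makefile target may do this differently.

```python
import shutil
from pathlib import Path


def clean_bytecode(root="."):
    """Delete __pycache__ directories and stray .pyc files below root."""
    root = Path(root)
    removed = []
    # Materialise the matches first so deleting does not disturb the walk.
    for cache_dir in list(root.rglob("__pycache__")):
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir)
            removed.append(cache_dir)
    for pyc_file in list(root.rglob("*.pyc")):
        pyc_file.unlink()
        removed.append(pyc_file)
    return removed


if __name__ == "__main__":
    for path in clean_bytecode("."):
        print(f"removed {path}")
```

Note that this walks every folder under `root`, including `venv`. That is harmless, since Python regenerates bytecode on demand, but pointing `root` at the source folder only will be faster.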

**Step 2**

Trigger the ```make crawl-ecommerce-playground``` command on the terminal to crawl the LambdaTest E-Commerce Playground

The content from the LambdaTest E-commerce Playground was crawled successfully! Fifty-five unique product links are now available to be scraped in the exported JSON file (i.e., ecommerce_crawled_urls.json).
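
To illustrate what "unique links" means in practice, here is a minimal, dependency-free sketch of the crawl step. It is not the repo's code: it uses the standard-library `html.parser` as a stand-in for Beautiful Soup so that it runs without extra packages, and the helper names (`LinkCollector`, `unique_links`, `export_links`) as well as the flat-list JSON layout are assumptions.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def unique_links(html, base_url):
    """Return same-host absolute links from html, de-duplicated in page order."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen = {}  # dict keys keep insertion order and drop duplicates
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" so variants compare equal.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).netloc == host:
            seen.setdefault(absolute, None)
    return list(seen)


def export_links(links, path="ecommerce_crawled_urls.json"):
    """Write the crawled links to a JSON file (assumed layout: a flat list)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(links, handle, indent=2)
```

With Beautiful Soup, the link-collection part would typically be a call such as `soup.find_all("a", href=True)`. The normalisation and de-duplication logic stays the same either way.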

**Step 3**

Now that we have the crawled information, trigger the ```make scrap-ecommerce-playground``` command on the terminal to scrape the product information (i.e., product name, product price, product availability, etc.) from the URLs in the exported JSON file.

All 55 links are scraped without any issues!
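
The per-page extraction can be pictured roughly as follows. This is an illustrative sketch, not the repo's implementation: it again uses the standard-library `html.parser` in place of Beautiful Soup, and the CSS class names in `FIELD_CLASSES` are hypothetical placeholders, so the real playground markup will likely need different selectors.

```python
from html.parser import HTMLParser

# Hypothetical mapping of CSS class -> output field; adjust to the real markup.
FIELD_CLASSES = {
    "product-name": "name",
    "product-price": "price",
    "product-availability": "availability",
}
# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ProductParser(HTMLParser):
    """Capture the text of the first element carrying each class in FIELD_CLASSES."""

    def __init__(self):
        super().__init__()
        self.product = {}
        self._field = None  # field currently being captured
        self._depth = 0     # nesting depth inside the captured element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._field is not None:
            self._depth += 1
            return
        for css_class in (dict(attrs).get("class") or "").split():
            field = FIELD_CLASSES.get(css_class)
            if field and field not in self.product:
                self._field, self._depth = field, 1
                self.product[field] = ""
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._field is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self.product[self._field] = self.product[self._field].strip()
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.product[self._field] += data


def scrape_product(html):
    """Return a dict of the product fields found in one page's HTML."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.product
```

With Beautiful Soup, each field would more commonly be read with a selector call such as `soup.select_one(".product-name")`, which keeps the scraping code considerably shorter than this hand-written parser.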

## Have feedback or need assistance?
Feel free to fork the repo and contribute to make it better! Email [himanshu[dot]sheth[at]gmail[dot]com](mailto:himanshu.sheth@gmail.com) with any queries, or ping me on the following social media sites:

LinkedIn: [@hjsblogger](https://linkedin.com/in/hjsblogger)

Twitter: [@hjsblogger](https://www.twitter.com/hjsblogger)