Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
webscraping in python
- Host: GitHub
- URL: https://github.com/rahulmoundekar/webscraping-in-python
- Owner: rahulmoundekar
- Created: 2020-05-05T08:20:41.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T09:46:30.000Z (about 2 years ago)
- Last Synced: 2024-11-07T11:16:24.261Z (2 months ago)
- Topics: beautifulsoup4, bs4, html5lib, python-3, requests-module, webscraper-website
- Language: Python
- Size: 7.41 MB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
README
# Web Scraping With Python
![python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)
#### Project Setup
- Create the project directory:
```
mkdir webscraping
cd webscraping
```
- Set up the environment (run these in a command prompt; the `scripts\activate` path below is for Windows):
```
pip install virtualenv
virtualenv web-scraping
web-scraping\scripts\activate
pip install requests
pip install beautifulsoup4
```
  Note: the package is installed as `beautifulsoup4` but imported in code as `bs4`.
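To sanity-check the installation without touching the network, you can parse an inline HTML string. The snippet below is a made-up example for verification only, not content from any real site:

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document used only to verify the install
sample_html = "<html><head><title>Demo Page</title></head><body><a href='/home'>Home</a></body></html>"

soup = BeautifulSoup(sample_html, 'html.parser')
print(soup.title.string)           # the page title text
print(soup.find('a').get('href'))  # the first anchor's link
```

If both lines print without an `ImportError`, the libraries are installed correctly.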
- Create `WebsiteScrap.py` for development:
```python
import requests
from bs4 import BeautifulSoup
url = "https://www.learnpython.org/"
response = requests.get(url)
htmlContent = response.content
formatted_html_content = BeautifulSoup(htmlContent, 'html.parser')
# print(formatted_html_content)
# 1} Get the title of the HTML page
title = formatted_html_content.title
print(title)
# if you want only tag content
print(title.string)
# 2} find All anchor tag on this website and print count
list_anchors = formatted_html_content.find_all('a')
# print all anchor tags
print(list_anchors)
# print count
print("Number of anchor tags on this website : ", len(list_anchors))
# 3} Get the <head> element of the HTML page
print(formatted_html_content.find('head'))
# 4} Get the classes of an element (use .get to avoid a KeyError when the tag has no class attribute)
print(formatted_html_content.find('a').get('class'))
# 5} find all the elements by class name
print(formatted_html_content.find_all("a", class_="navbar-brand"))
# 6} Get the text from the tags/soup
print(formatted_html_content.find("p").get_text())
# 7} Get all the anchor tags from the page with iteration
list_anchors = formatted_html_content.find_all('a')
all_links = set()
for link in list_anchors:
print(link) # get all anchor tag with links
print(link.get('href')) # get all links
all_links.add(link.get('href')) # a set automatically removes duplicate links
print(all_links)
print(len(all_links))
# 8} count duplicate links
all_web_links_count = len(list_anchors)
after_remove_duplicate_links_count = len(all_links)
print('Number of duplicate links on this website : ', all_web_links_count - after_remove_duplicate_links_count)
```
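The duplicate counting in step 8 can also be done with `collections.Counter` from the standard library, which additionally tells you *which* links repeat. A minimal sketch using a hypothetical list of hrefs (stand-ins for the values returned by `link.get('href')`):

```python
from collections import Counter

# Hypothetical hrefs, as they might come back from link.get('href')
hrefs = ['/welcome', '/about', '/welcome', None, '/about', '/contact']

counts = Counter(hrefs)                      # maps each href to its occurrence count
unique_links = set(hrefs)                    # deduplicated links
duplicates = len(hrefs) - len(unique_links)  # same arithmetic as step 8

print('Number of duplicate links : ', duplicates)
print('Repeated hrefs : ', [h for h, c in counts.items() if c > 1])
```

`Counter` gives the per-link breakdown in one pass, while the set-difference approach only gives the total.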
- To run the app:
```
python WebsiteScrap.py
```
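If `requests` and `bs4` are unavailable, the link extraction in step 7 can be approximated offline with the standard library's `html.parser`. This is only a sketch, using a hypothetical inline HTML string in place of a downloaded page:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from anchor tags as the parser streams through the HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# Hypothetical HTML, standing in for response.content
html_doc = "<p><a href='/a'>A</a><a href='/b'>B</a><a href='/a'>A again</a></p>"
collector = LinkCollector()
collector.feed(html_doc)
print(collector.links)            # all hrefs, duplicates included
print(len(set(collector.links)))  # count of unique links
```

BeautifulSoup remains the more convenient choice for real scraping; this only shows that the core idea needs no third-party dependency.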
- To set up a clone on your system, create and activate a virtualenv (as above), then install the dependencies:
```
pip install -r .\requirements.txt
```