https://github.com/izzhafeez/spiderman
Enhanced web scraping tool for handling embedded links, tables and lists
python web-scraping
- Host: GitHub
- URL: https://github.com/izzhafeez/spiderman
- Owner: izzhafeez
- Created: 2022-09-01T14:33:19.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2022-09-06T01:47:22.000Z (over 2 years ago)
- Last Synced: 2025-02-13T00:37:38.205Z (3 months ago)
- Topics: python, web-scraping
- Language: Python
- Homepage:
- Size: 6.84 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Spiderman Web Scraper
After reading through the BeautifulSoup documentation, I realised that many common operations are not in the module. As such, I filled in as many holes as I possibly could, applying OOP principles to make my web scraper more extensible. Among its features are the following:
- Extract all tables from a particular webpage and merge those that share the same column names
- Extract hrefs and insert them into the text itself using delimiters like brackets (the same can be done for tables and lists)
- Standardise all hrefs into complete (absolute) links, rather than relative ones

I use this module most frequently, so I feel that it is the most impactful of my earlier projects (to me, at least).
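
The three features listed above can be sketched with plain BeautifulSoup and the standard library. This README does not show Spiderman's own classes or method names, so the snippet below is not its API: `absolutise_links`, `inline_links` and `merge_tables` are hypothetical helper names, `https://example.com/...` is a placeholder base URL, and treating the first row of each table as its header is an assumption made for illustration.

```python
from collections import defaultdict
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolutise_links(soup, base_url):
    """Rewrite every href in place so that it is a complete URL."""
    for a in soup.find_all("a", href=True):
        a["href"] = urljoin(base_url, a["href"])


def inline_links(soup):
    """Replace each <a> tag with 'text (href)' so the link survives get_text()."""
    for a in soup.find_all("a", href=True):
        a.replace_with(f"{a.get_text(strip=True)} ({a['href']})")


def merge_tables(soup):
    """Group rows by header tuple, so tables with the same columns are merged.

    Assumes the first <tr> of each table holds the column names and that
    tables are not nested inside one another.
    """
    merged = defaultdict(list)
    for table in soup.find_all("table"):
        rows = table.find_all("tr")
        if not rows:
            continue
        header = tuple(c.get_text(strip=True) for c in rows[0].find_all(["th", "td"]))
        for row in rows[1:]:
            cells = [c.get_text(strip=True) for c in row.find_all(["td", "th"])]
            if len(cells) == len(header):  # skip malformed rows
                merged[header].append(dict(zip(header, cells)))
    return dict(merged)


HTML = """
<p>See the <a href="/docs">docs</a>.</p>
<table><tr><th>Name</th><th>Link</th></tr>
<tr><td>A</td><td><a href="a.html">page</a></td></tr></table>
<table><tr><th>Name</th><th>Link</th></tr>
<tr><td>B</td><td><a href="https://other.org/b">page</a></td></tr></table>
"""

soup = BeautifulSoup(HTML, "html.parser")
absolutise_links(soup, "https://example.com/dir/index.html")  # placeholder base URL
inline_links(soup)
tables = merge_tables(soup)
paragraph = soup.p.get_text()

print(paragraph)
for header, rows in tables.items():
    print(header, rows)
```

Running the link steps before `merge_tables` means the table cells already carry their bracketed, absolute URLs when the rows are collected. The two sample tables share the columns `Name` and `Link`, so they end up under a single key.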