https://github.com/izzhafeez/spiderman

Enhanced web scraping tool for handling embedded links, tables and lists
python web-scraping

Host: GitHub
URL: https://github.com/izzhafeez/spiderman
Owner: izzhafeez
Created: 2022-09-01T14:33:19.000Z (over 2 years ago)
Default Branch: master
Last Pushed: 2022-09-06T01:47:22.000Z (over 2 years ago)
Last Synced: 2025-02-13T00:37:38.205Z (3 months ago)
Topics: python, web-scraping
Language: Python
Homepage:
Size: 6.84 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Spiderman Web Scraper

After reading through the BeautifulSoup documentation, I realised that many common operations are missing from the module. I filled in as many of these gaps as I could, applying OOP principles to make the scraper easier to extend. Its features include:

- Extract all tables from a webpage and merge those that share the same column names
- Extract hrefs and insert them into the text itself, using delimiters such as brackets (the same can be done for tables and lists)
- Standardise all hrefs into complete (absolute) links rather than relative ones
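
The three ideas above can be illustrated with a minimal sketch. This is not the repository's code: it uses the standard library's `html.parser` and `urllib.parse.urljoin` instead of BeautifulSoup so that it runs without dependencies, and the names `LinkInliner`, `inline_links` and `merge_tables` are hypothetical. It also assumes tables have already been extracted as `(header, rows)` pairs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkInliner(HTMLParser):
    """Collect page text, appending each link's absolute URL in brackets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.parts = []
        self._href_stack = []  # hrefs of the <a> tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # urljoin resolves relative hrefs against the page URL
            # and leaves hrefs that are already absolute unchanged.
            self._href_stack.append(urljoin(self.base_url, href) if href else None)

    def handle_endtag(self, tag):
        if tag == "a" and self._href_stack:
            href = self._href_stack.pop()
            if href:
                self.parts.append(f" [{href}]")

    def handle_data(self, data):
        self.parts.append(data)


def inline_links(html, base_url):
    """Return the text of `html` with bracketed absolute links inserted."""
    parser = LinkInliner(base_url)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def merge_tables(tables):
    """Concatenate the rows of all tables that share the same column names.

    `tables` is an iterable of (header, rows) pairs; the result maps each
    distinct header (as a tuple) to the combined rows.
    """
    merged = {}
    for header, rows in tables:
        merged.setdefault(tuple(header), []).extend(rows)
    return merged
```

For example, `inline_links('See <a href="/docs">the docs</a>.', "https://example.org/base/page")` returns `See the docs [https://example.org/docs].`, and two tables with the header `["name", "age"]` end up under a single `("name", "age")` key in the result of `merge_tables`.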

This is the module I use most often, so I consider it the most impactful of my earlier projects (to me, at least).
