https://github.com/carloocchiena/python_url_crawler
A script that, starting from a webpage, iterates through all of its links and appends them to a list. A sort of proxy for getting all the pages in a website.
beautifulsoup crawler python python3
- Host: GitHub
- URL: https://github.com/carloocchiena/python_url_crawler
- Owner: carloocchiena
- Created: 2021-02-22T21:07:01.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2022-11-02T20:19:45.000Z (over 3 years ago)
- Last Synced: 2025-05-30T14:38:18.950Z (about 1 year ago)
- Topics: beautifulsoup, crawler, python, python3
- Language: Python
- Homepage:
- Size: 7.81 KB
- Stars: 2
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# python_url_crawler
A script that, starting from a webpage, iterates through all of its links and appends them to a list. A sort of proxy for getting all the pages in a website.
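
The idea above can be sketched roughly as follows. This is a minimal illustration, not the code in `main.py`: the repository is tagged `beautifulsoup`, while this sketch uses the standard library's `html.parser` so that it runs without extra packages. The names `LinkParser`, `fetch`, `crawl`, `fetch_page` and `max_pages` are assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch(url):
    """Download a page and return its HTML as text (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first walk from start_url; return every link seen, in order."""
    found = [start_url]
    seen = {start_url}
    queue = deque([start_url])
    visited = 0
    while queue and visited < max_pages:
        page = queue.popleft()
        visited += 1
        try:
            html = fetch_page(page)
        except Exception:
            continue  # skip pages that fail to load
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" suffixes.
            link, _ = urldefrag(urljoin(page, href))
            if link not in seen:
                seen.add(link)
                found.append(link)
                queue.append(link)
    return found
```

Passing the page fetcher in as `fetch_page` keeps the crawl logic testable offline, and `max_pages` is there only as a guard so the sketch cannot run forever on a large site.
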
old_main is a raw version I made in one hour from a Stack Overflow question;
main.py is a somewhat better version I wrote from scratch, with less code entropy. It seems to work decently.
Note that the script aims to find only URLs within the starting domain, but this can easily be changed by tweaking the "cleaner" function.
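
The actual "cleaner" function is not reproduced here, so the following is only a guess at what such a filter might look like. The signature and the `same_domain_only` switch are hypothetical, added to show how the domain restriction could be turned off.

```python
from urllib.parse import urlparse


def cleaner(links, base_url, same_domain_only=True):
    """Deduplicate links, optionally keeping only those on base_url's host.

    Hypothetical signature: the repository's own cleaner may differ.
    """
    base_host = urlparse(base_url).netloc.lower()
    kept = []
    seen = set()
    for link in links:
        parsed = urlparse(link)
        if parsed.scheme not in ("http", "https"):
            continue  # drop mailto:, javascript:, and similar
        if same_domain_only and parsed.netloc.lower() != base_host:
            continue  # drop links that leave the starting domain
        if link not in seen:
            seen.add(link)
            kept.append(link)
    return kept
```

Calling `cleaner(links, base_url, same_domain_only=False)` in this sketch would keep external links as well, while still removing duplicates and non-HTTP schemes.
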