https://github.com/carloocchiena/python_url_crawler

A script that, starting from a webpage, iterates through all of its links and appends them to a list. A sort of proxy for getting all the pages of a website.




Host: GitHub
URL: https://github.com/carloocchiena/python_url_crawler
Owner: carloocchiena
Created: 2021-02-22T21:07:01.000Z (over 5 years ago)
Default Branch: main
Last Pushed: 2022-11-02T20:19:45.000Z (over 3 years ago)
Last Synced: 2025-05-30T14:38:18.950Z (about 1 year ago)
Topics: beautifulsoup, crawler, python, python3
Language: Python
Homepage:
Size: 7.81 KB
Stars: 2
Watchers: 2
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# python_url_crawler
A script that, starting from a webpage, iterates through all of its links and appends them to a list. A sort of proxy for getting all the pages of a website.

`old_main` is a raw version I put together in about an hour from a Stack Overflow question.

`main.py` is a better version that I wrote from scratch, with less code entropy. It seems to work decently.

Note that the script aims to find only URLs within the starting domain, but this can easily be changed by tweaking the `cleaner` function.
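
The same-domain crawl described above could look roughly like the following. This is a minimal sketch and not the code in `main.py`: the names `LinkParser`, `fetch`, `crawl`, and the `max_pages` limit are hypothetical, the signature of `cleaner` is an assumption, and links are extracted with the standard library's `html.parser` rather than BeautifulSoup so the example has no dependencies.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def cleaner(href, page_url, domain):
    """Return an absolute same-domain URL, or None if the link is rejected.

    Assumed behaviour: loosen the netloc check to follow external links too.
    """
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):
        return None  # skip mailto:, javascript:, and similar
    if parts.netloc != domain:
        return None  # skip links that leave the starting domain
    return absolute


def fetch(url):
    """Download a page and return its text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=100):
    """Breadth-first crawl; returns the URLs in the order they were found."""
    domain = urlparse(start_url).netloc
    found = [start_url]
    seen = {start_url}
    queue = deque([start_url])
    visited = 0
    while queue and visited < max_pages:
        page_url = queue.popleft()
        visited += 1
        try:
            html = fetch_page(page_url)
        except OSError:
            continue  # unreachable page: keep the URL, skip its links
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            url = cleaner(href, page_url, domain)
            if url and url not in seen:
                seen.add(url)
                found.append(url)
                queue.append(url)
    return found
```

Calling `crawl("https://example.com/")` would return the list of discovered URLs. Passing a different `fetch_page` callable makes it possible to try the crawler against canned HTML without touching the network.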
