Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/denisakp/sangsue
Sangsue is a Python-based web scraping tool for website traversal, link discovery, and data extraction within a domain. Explore websites, map structures, extract valuable data, and classify web pages.
analytics machine-learning python scraping
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/denisakp/sangsue
- Owner: denisakp
- License: MIT
- Created: 2023-09-19T18:29:43.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-09-19T21:46:42.000Z (over 1 year ago)
- Last Synced: 2024-10-17T01:59:24.271Z (3 months ago)
- Topics: analytics, machine-learning, python, scraping
- Homepage: https://denisakp.github.io/sangsue/
- Size: 552 KB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README
# SANGSUE
SangSue is a Python-based web scraping and exploration tool that traverses websites, discovers links, and gathers information about the pages within a given domain. It can be used to inspect a site in depth, map its structure, extract data for analysis, and classify pages by their content.
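To make the idea of domain-limited traversal concrete, here is a minimal sketch using only the Python standard library. It is not SangSue's own code or API: the names `PageParser`, `fetch`, `crawl`, `max_depth`, and `fetch_page` are all hypothetical, chosen for illustration. It assumes a breadth-first strategy and treats "same domain" as "same host as the start URL".

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects the <title> text and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Hypothetical default fetcher: download a page and decode it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth=2, fetch_page=fetch):
    """Breadth-first crawl limited to the start URL's host and to max_depth."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            pages[url] = {"error": str(exc)}
            continue
        parser = PageParser()
        parser.feed(html)
        links = []
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            absolute, _fragment = urldefrag(urljoin(url, href))
            parts = urlparse(absolute)
            # Keep only http(s) links that stay on the starting host.
            if parts.scheme in ("http", "https") and parts.netloc == domain:
                links.append(absolute)
        pages[url] = {"title": parser.title.strip(), "depth": depth, "links": links}
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages
```

Calling `crawl("https://example.com/", max_depth=1)` would return a dictionary keyed by URL, holding each page's title, depth, and in-domain links. Passing a custom `fetch_page` callable makes it possible to try the logic offline, or to add headers or delays without changing the traversal itself.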
# Key Features
- Web Crawling: Automatically explore a specified domain, starting from a given URL.
- Depth Control: Define the maximum depth of exploration to focus on specific areas of a website.
- URL Validation: Ensure that only valid URLs are processed to maintain data accuracy.
- Information Gathering: Collect page titles, meta tags, and discovered URLs during the exploration.
- Interactive Visualization: Visualize the website structure as an interactive graph using Plotly.
- Data Export: Export exploration data in various formats such as JSON or CSV for further analysis.
- Custom Filters: Implement filters based on regular expressions, keywords, or content types.
- Error Handling: Identify and report HTTP errors encountered during the exploration.
- Authentication Support: Handle authentication for protected web pages.
- Scheduled Scans: Plan and schedule explorations at specified intervals.
- User-Friendly Interface: Incorporate a graphical user interface (GUI) for ease of use.
- Web Page Classification: Categorize web pages based on their content.
- User-Agent Configuration: Customize the user-agent used during exploration for more versatile web crawling.
- Proxy Support: Utilize proxy servers to enhance privacy and control IP access during web crawling.
- Exploration Delay: Define a time delay between the exploration of two URLs to manage web traffic and prevent overloading servers.
- Pause and Resume: Pause and resume exploration precisely where you left off.
# Licence
This project is licensed under the terms of the MIT license.