Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/pps-22-scooby/pps-22-scooby
Scala application that allows web crawling and web scraping of web pages given as input with the use of special rules passed to it through the use of a DSL.
crawler crawlers internal-dsl scala scraper scrapers web web-crawler web-crawling web-scraper web-scrapers
- Host: GitHub
- URL: https://github.com/pps-22-scooby/pps-22-scooby
- Owner: PPS-22-Scooby
- License: apache-2.0
- Created: 2024-06-03T14:53:41.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-08-07T14:12:14.000Z (5 months ago)
- Last Synced: 2024-10-14T20:01:14.482Z (3 months ago)
- Topics: crawler, crawlers, internal-dsl, scala, scraper, scrapers, web, web-crawler, web-crawling, web-scraper, web-scrapers
- Language: Scala
- Homepage: https://pps-22-scooby.github.io/
- Size: 4.3 MB
- Stars: 7
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# PPS-22-Scooby 🔍
## Team:
👨‍💻 Giovanni Antonioni - [email protected]
👨‍💻 Valerio Di Zio - [email protected]
👨‍💻 Francesco Magnani - [email protected]
👨‍💻 Luca Rubboli - [email protected]
## Technologies:
🔄 Scrum
🛠 SBT
🔗 Git
🎯 YouTrack
🚀 GitHub Actions
## Overview:
PPS-22-Scooby is a web scraping and crawling application. It enables users to extract data from web pages by crawling through links and scraping specific content according to predefined rules.
## Features:
🕷 **Crawling**: The application navigates web pages, follows links, and retrieves content.
🔍 **Scraping**: Relevant data is extracted from HTML/XML pages using XPath, CSS selectors, or regular expressions.
🛠 **Customization**: Users can define custom scraping and crawling rules to suit their specific needs.
⚙️ **Parallel Processing**: Aspects of parallel programming are integrated for efficient execution.
📤 **Export**: Users can export extracted data in various formats according to their preferences.
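To make the crawling and scraping features above more concrete, here is a minimal sketch using only the Scala standard library. It is not Scooby's DSL or API: the `ScrapeSketch` object, the in-memory `pages` map (standing in for real HTTP requests), and the helper names are all hypothetical, and regular expressions are used only because they need no extra dependency.

```scala
import scala.util.matching.Regex

object ScrapeSketch {
  // Hypothetical in-memory "web": URL -> HTML body, replacing real HTTP fetches.
  val pages: Map[String, String] = Map(
    "https://example.org/" ->
      """<html><body><h1>Home</h1><a href="https://example.org/a">A</a><a href="https://example.org/b">B</a></body></html>""",
    "https://example.org/a" ->
      """<html><body><h1>Page A</h1><a href="https://example.org/">back</a></body></html>""",
    "https://example.org/b" ->
      """<html><body><h1>Page B</h1></body></html>"""
  )

  private val linkPattern: Regex = "href=\"([^\"]+)\"".r
  private val headingPattern: Regex = "<h1>(.*?)</h1>".r

  // Crawling step: collect the link targets found in a page.
  def extractLinks(html: String): List[String] =
    linkPattern.findAllMatchIn(html).map(_.group(1)).toList

  // Scraping rule (assumed for illustration): keep the text of every <h1>.
  def scrapeHeadings(html: String): List[String] =
    headingPattern.findAllMatchIn(html).map(_.group(1)).toList

  // Breadth-first crawl from `start`, following links up to `maxDepth` hops,
  // visiting each URL at most once and scraping every page that is reached.
  def crawl(start: String, maxDepth: Int): Map[String, List[String]] = {
    @annotation.tailrec
    def loop(
        frontier: List[(String, Int)],
        visited: Set[String],
        acc: Map[String, List[String]]
    ): Map[String, List[String]] =
      frontier match {
        case Nil => acc
        case (url, depth) :: rest if visited(url) || depth > maxDepth =>
          loop(rest, visited, acc)
        case (url, depth) :: rest =>
          pages.get(url) match {
            case None => loop(rest, visited + url, acc)
            case Some(html) =>
              val next = extractLinks(html).map(link => (link, depth + 1))
              loop(rest ++ next, visited + url, acc + (url -> scrapeHeadings(html)))
          }
      }
    loop(List((start, 0)), Set.empty, Map.empty)
  }
}
```

Calling `ScrapeSketch.crawl("https://example.org/", 1)` visits the start page and the two pages it links to, returning the scraped headings keyed by URL. A real crawler would also need HTTP fetching, a proper HTML parser, and politeness rules, none of which are shown here.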
## Implementation:
PPS-22-Scooby is built in Scala and uses actor libraries for concurrency management. The project uses Git for version control, YouTrack for project management, and GitHub Actions for continuous integration.
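As a rough illustration of processing several pages concurrently, the sketch below scrapes a batch of documents in parallel. It deliberately swaps in the standard library's `Future` for an actor library so that it stays dependency-free, so it does not reflect how Scooby itself is structured. `ParallelSketch`, `scrapeTitle`, and `scrapeAll` are hypothetical names.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSketch {
  // Hypothetical scraping rule: extract the <title> text, if any.
  def scrapeTitle(html: String): Option[String] =
    "<title>(.*?)</title>".r.findFirstMatchIn(html).map(_.group(1))

  // Scrape every document in its own Future, then wait for all results.
  // Future.sequence keeps the results in the same order as the input list.
  def scrapeAll(docs: List[String]): List[Option[String]] = {
    val tasks = docs.map(html => Future(scrapeTitle(html)))
    Await.result(Future.sequence(tasks), 5.seconds)
  }
}
```

An actor-based design would typically give each crawler or scraper its own message-driven state instead of blocking on a combined result as `Await.result` does here.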
## Get Started:
To use PPS-22-Scooby, see the **Get Started** section at https://pps-22-scooby.github.io/