https://github.com/notfaad/msl-engine
A web scraping engine that uses a custom domain-specific language (DSL) to define scraping pipelines.
- Host: GitHub
- URL: https://github.com/notfaad/msl-engine
- Owner: notFaad
- Created: 2025-08-06T23:00:56.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-08-06T23:07:35.000Z (6 months ago)
- Last Synced: 2025-08-07T01:05:47.955Z (6 months ago)
- Topics: engine, rust, rust-lang, scraper-api, scraper-engine, scraping-websites, webscraping
- Language: Rust
- Size: 19.5 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## (EXPERIMENTAL!!)
# MediaScrapeLang Engine (MSL Engine)
A Rust-based web scraping engine with a custom DSL (Domain Specific Language) for defining scraping pipelines.
## Features
- **Custom DSL**: Write scraping scripts in a minimal, readable language
- **Link Traversal**: Follow links and extract data from multiple pages
- **Variable Extraction**: Extract text and attributes from HTML elements
- **Media Discovery**: Find and download images, videos, and audio files
- **Filtering**: Filter media by source URL patterns and file extensions
- **Async Processing**: Built on async Rust with `tokio`; parallel processing of multiple pages is still on the roadmap
- **CLI Interface**: Easy-to-use command-line tool
## DSL Syntax
The MediaScrapeLang (MSL) DSL is designed to be minimal and readable:
```msl
open "https://example.com/users"
click ".user-card a"
set user = text
click ".post-list a"
set post = attr("href").split("/")[-1]
media
  image
    where src ~ "cdn.example.com"
    extensions jpg, png
  video
    where src ~ "cdn.example.com"
    extensions mp4, webm
save to "./media/{user}/{post}"
```
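The `{user}` and `{post}` placeholders in `save to` refer to variables assigned earlier with `set`. Templating in save paths is still listed as in progress, so the following is only a sketch of how such substitution could work, not the engine's code. The function name `render_save_path` and the choice to leave unknown placeholders untouched are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of msl-engine): replace each `{name}`
/// placeholder in `template` with the matching value from `vars`.
/// Unknown placeholders are left as-is so that mistakes stay visible.
fn render_save_path(template: &str, vars: &HashMap<String, String>) -> String {
    let mut out = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(start) = rest.find('{') {
        // Copy everything before the opening brace.
        out.push_str(&rest[..start]);
        let after = &rest[start + 1..];
        match after.find('}') {
            Some(end) => {
                let name = &after[..end];
                match vars.get(name) {
                    Some(value) => out.push_str(value),
                    None => {
                        out.push('{');
                        out.push_str(name);
                        out.push('}');
                    }
                }
                rest = &after[end + 1..];
            }
            None => {
                // Unclosed brace: copy the remainder verbatim and stop.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

With `user = "alice"` and `post = "42"`, the template `./media/{user}/{post}` renders as `./media/alice/42`. A real implementation would probably also need to sanitize values that contain path separators, which this sketch does not attempt.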
### Commands
- `open "url"` - Navigate to a URL
- `click "selector"` - Click/follow links matching a CSS selector
- `set variable = value` - Extract and store a value
- `media` - Define media extraction blocks
- `save to "path"` - Save extracted media to a path
### Values
- `text` - Extract text content
- `attr("name")` - Extract attribute value
- `attr("name").split("/")[-1]` - Extract an attribute, split it on `/`, and take the last segment
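To make the `split("/")[-1]` form concrete, here is a minimal sketch of the same operation in plain Rust. It assumes that negative indices count from the end of the split result, as the `[-1]` in the example suggests; the helper name `split_index` is hypothetical and is not taken from the engine.

```rust
/// Hypothetical helper (not part of msl-engine): split `value` on `sep`
/// and return the part at `index`. Negative indices count from the end,
/// so -1 selects the last part. Out-of-range indices yield `None`.
fn split_index<'a>(value: &'a str, sep: &str, index: isize) -> Option<&'a str> {
    let parts: Vec<&str> = value.split(sep).collect();
    let len = parts.len() as isize;
    // Translate a negative index into an offset from the start.
    let resolved = if index < 0 { len + index } else { index };
    if resolved < 0 || resolved >= len {
        return None;
    }
    Some(parts[resolved as usize])
}
```

For an `href` of `/users/alice/posts/42`, `split_index(href, "/", -1)` yields `Some("42")`, which is the value the example script stores in `post`.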
### Media Filters
- `where src ~ "pattern"` - Filter by source URL pattern
- `extensions jpg, png` - Filter by file extensions
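The two filters can be read as a predicate over each discovered media URL. The sketch below assumes that `~` means a plain substring match and that extensions are compared case-insensitively after dropping any query string or fragment; the README does not specify either point, and the `MediaFilter` type shown here is hypothetical.

```rust
/// Hypothetical filter type (not part of msl-engine).
struct MediaFilter {
    /// Assumed meaning of `where src ~ "..."`: the URL must contain this text.
    src_contains: Option<String>,
    /// Allowed extensions without the dot, e.g. "jpg". Empty means "any".
    extensions: Vec<String>,
}

impl MediaFilter {
    /// Returns true when `url` passes both the source and extension filters.
    fn matches(&self, url: &str) -> bool {
        if let Some(pattern) = &self.src_contains {
            if !url.contains(pattern.as_str()) {
                return false;
            }
        }
        if self.extensions.is_empty() {
            return true;
        }
        // Drop any query string or fragment before looking at the file name.
        let path = url
            .split(|c: char| c == '?' || c == '#')
            .next()
            .unwrap_or(url);
        let file_name = path.rsplit('/').next().unwrap_or(path);
        match file_name.rsplit_once('.') {
            Some((_, ext)) => self
                .extensions
                .iter()
                .any(|allowed| allowed.eq_ignore_ascii_case(ext)),
            None => false,
        }
    }
}
```

With `src_contains` set to `cdn.example.com` and extensions `jpg, png`, the URL `https://cdn.example.com/a/photo.JPG?w=200` passes, while `https://cdn.example.com/clip.mp4` and `https://other.example.com/a.jpg` are rejected.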
## Installation
```bash
# Clone the repository
git clone https://github.com/notfaad/msl-engine
cd msl-engine
# Build the project
cargo build --release
# Install globally (optional)
cargo install --path .
```
## Usage
### Command Line Interface
```bash
# Run a script
msl run script.msl
# Parse and validate a script without executing
msl parse script.msl
# Enable verbose output
msl run script.msl --verbose
```
### Programmatic Usage
```rust
use msl_engine::{run_script, MslEngine};
#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // The script is passed to the engine as a plain string.
    let script = r#"
        open "https://example.com"
        click ".user-card a"
        set user = text
        media
          image
            where src ~ "cdn.example.com"
            extensions jpg, png
        save to "./media/{user}"
    "#;

    // Parse and execute the script, propagating any error to the caller.
    run_script(script).await?;
    Ok(())
}
```
## Architecture
The MSL Engine is built with a modular architecture:
- **Parser** (`src/parser/`): Parses MSL scripts into a structured AST
- **Scraper** (`src/scraper/`): Handles HTTP requests and HTML parsing
- **Engine** (`src/engine/`): Orchestrates the scraping process
- **CLI** (`src/cli/`): Command-line interface
### Key Components
- **MslScript**: Represents a parsed MSL script
- **MslEngine**: Main execution engine
- **Scraper**: HTTP client and HTML parser
- **MediaItem**: Represents discovered media files
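As a rough illustration of the parser-to-AST step, the sketch below turns a few single-line commands into a list of typed nodes. It is not the engine's parser: the real one is built on `nom`, and the `Command` and `Script` types, their fields, and the `parse_simple` function are all hypothetical stand-ins that only loosely mirror `MslScript`. Only `open`, `click`, and `save to` are handled.

```rust
/// Hypothetical AST node for a subset of MSL commands.
#[derive(Debug, PartialEq)]
enum Command {
    Open(String),
    Click(String),
    SaveTo(String),
}

/// Hypothetical parsed script: an ordered list of commands.
#[derive(Debug, PartialEq)]
struct Script {
    commands: Vec<Command>,
}

/// Extract the contents of a double-quoted argument, if well formed.
fn quoted(rest: &str) -> Option<String> {
    let inner = rest.trim().strip_prefix('"')?.strip_suffix('"')?;
    Some(inner.to_string())
}

/// Parse a script line by line, skipping blank lines.
/// Returns an error naming the first line that is not understood.
fn parse_simple(source: &str) -> Result<Script, String> {
    let mut commands = Vec::new();
    for (i, raw) in source.lines().enumerate() {
        let line = raw.trim();
        if line.is_empty() {
            continue;
        }
        let command = if let Some(rest) = line.strip_prefix("open ") {
            quoted(rest).map(Command::Open)
        } else if let Some(rest) = line.strip_prefix("click ") {
            quoted(rest).map(Command::Click)
        } else if let Some(rest) = line.strip_prefix("save to ") {
            quoted(rest).map(Command::SaveTo)
        } else {
            None
        };
        match command {
            Some(c) => commands.push(c),
            None => return Err(format!("line {}: cannot parse `{}`", i + 1, line)),
        }
    }
    Ok(Script { commands })
}
```

An engine can then walk `commands` in order and dispatch each node to the scraper, which is the division of labor the component list describes.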
## Testing
```bash
# Run all tests
cargo test
# Run with output
cargo test -- --nocapture
```
## Dependencies
- **reqwest**: HTTP client
- **scraper**: HTML parsing
- **nom**: Parser combinator library
- **tokio**: Async runtime
- **clap**: CLI argument parsing
- **anyhow**: Error handling
- **tracing**: Logging
## Development Status
- [x] Parser implementation
- [x] Basic scraper functionality
- [x] Engine orchestration
- [x] CLI interface
- [ ] Variable templating in save paths
- [ ] Advanced media filtering
- [ ] Parallel processing
- [ ] Headless browser support
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Roadmap
- [ ] Variable templating in save paths
- [ ] Advanced media filtering options
- [ ] Parallel processing for multiple pages
- [ ] Headless browser support (JavaScript rendering)
- [ ] Retry logic and error handling
- [ ] Rate limiting and polite scraping
- [ ] Export to different formats (JSON, CSV)
- [ ] Web interface for script editing